Wikipedia’s data dumps

Data engineers often look for a source of data they can play with. Sometimes all that is needed is something big enough to test performance tweaks. I use Wikipedia’s public data dumps for my experiments.

Where

If you follow the official documentation, you will find a link to the latest dump. If you want to fetch all the data, I recommend considering one of the mirrors.

What

The data is exported as XML, so you need to learn how to work with it, or how to transform it into a format your tool can handle. The format is generally easy to understand, so you won’t get lost.
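To give a feel for working with the XML, here is a minimal sketch in Python that streams through pages one at a time instead of loading the whole file. It assumes the usual layout of a pages dump, where each `<page>` element holds a `<title>` and a `<revision>` with a `<text>` child. The helper names (`local_name`, `iter_pages`) and the tiny `SAMPLE` document are made up for illustration, and the namespace URI in the sample may differ from the one in the dump you download, which is why the code ignores it.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for every <page> element, one page at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Free the page's children once the caller has consumed it.
        elem.clear()


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello, dump.</text></revision>
  </page>
  <page>
    <title>Second</title>
    <revision><text>More text.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For a real dump you would pass an open file instead of the sample. If the file is bzip2-compressed, `bz2.open(path, "rb")` from the standard library returns a file object that `iter_pages` can read directly, so there is no need to decompress it first.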

Why

Playing with data you don’t own and cannot influence is a great way to expand your technical skills. Playing with data you don’t know may also improve your analytical skills.
