Wikipedia’s data dumps

Data engineers are always looking for data sources they can play with. Sometimes all you need is something big enough to test performance tweaks. I use Wikipedia’s public data dumps for my experiments.


If you follow the official documentation, you will find a link to the latest dump. If you want to fetch all the data, I recommend using one of the mirrors.
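As a rough illustration of fetching a dump, here is a minimal sketch. It assumes the mirror reproduces the directory layout of the official host (dumps.wikimedia.org); the helper names `build_dump_url` and `download` are made up for this post, and the `enwiki` and `latest` defaults are just one possible choice.

```python
import shutil
import urllib.request

# Official host. Swap in the base URL of a mirror; this sketch assumes
# the mirror keeps the same directory layout, which may not always hold.
OFFICIAL_BASE = "https://dumps.wikimedia.org"


def build_dump_url(base, wiki="enwiki", date="latest"):
    """Return the URL of the pages-articles dump for one wiki and date."""
    name = f"{wiki}-{date}-pages-articles.xml.bz2"
    return f"{base.rstrip('/')}/{wiki}/{date}/{name}"


def download(url, target_path):
    """Copy the response to disk in chunks instead of loading it into memory."""
    with urllib.request.urlopen(url) as response, open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

Calling `download(build_dump_url(OFFICIAL_BASE), "enwiki.xml.bz2")` would start the transfer. The full English dump is tens of gigabytes, so check your disk space first.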


The data is exported as XML, so you need to learn how to work with it, or how to transform it into a format your tool can handle. The format is generally easy to understand, so you won’t get lost.
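To give an idea of what working with the XML can look like, here is a minimal sketch that streams pages with Python’s standard library instead of loading the whole file. The function name `iter_pages` is hypothetical, and the tiny sample document (including its namespace version) is an assumption modelled on the export format, not a real excerpt from a dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for every <page> element in the stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            # Release the page's children so memory does not grow with the file.
            elem.clear()
            yield title, text
            title, text = None, None


# Made-up sample shaped like a dump; the namespace version is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello, dump.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump you could pass the file object returned by `bz2.open(path, "rb")` to `iter_pages`, so the archive is decompressed on the fly and never has to be unpacked on disk.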


Playing with data you don’t own and cannot influence is a great way to expand your technical skills. Playing with data you don’t know may also improve your analytical skills.