Wikipedia’s data dumps

Adrian Macal
Nov 9, 2020


Data engineers are often looking for a source of data they can play with. Sometimes all that matters is that the data is big enough to test performance tweaks. I use Wikipedia’s public data dumps for my experiments.

Where

If you follow the official documentation, you will find a link to the latest dump. I recommend using one of the mirrors to fetch all the data.
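As a rough illustration, fetching a dump could look like the sketch below. It assumes the directory layout used on the main Wikimedia dump site (`<wiki>/latest/<wiki>-latest-pages-articles.xml.bz2`). A mirror may organise its files differently, so treat the `base` parameter and both helper names as hypothetical and check the listing of the mirror you pick.

```python
import shutil
import urllib.request

# Assumption: the main Wikimedia dump site. Pass a mirror's base URL instead
# if it follows the same directory layout.
DEFAULT_BASE = "https://dumps.wikimedia.org"


def dump_url(wiki="enwiki", base=DEFAULT_BASE):
    """Build the URL of the latest articles dump for a wiki (hypothetical helper)."""
    return f"{base.rstrip('/')}/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"


def download(url, target, chunk_size=1 << 20):
    """Copy the remote file to disk in chunks so it never has to fit in memory."""
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
```

Starting with a small wiki, for example `download(dump_url("simplewiki"), "simplewiki.xml.bz2")`, keeps the first experiment quick before moving on to the much larger English dump.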

What

The data is exported as XML, so you need to learn how to work with it, or even how to transform it into a format your tool can deal with. Generally, the format is easy to understand, so you won’t get lost.
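To give a feel for the format, here is a minimal sketch that streams pages out of a dump with Python’s standard library. It assumes the usual export layout, where each `<page>` element holds a `<title>` and a `<revision>` with a `<text>` child, and that the file contains one revision per page. The function names are my own and not part of any official tooling.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for every <page> element without loading the whole file."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # drop pages already processed to keep memory flat


def count_pages(path):
    """Count pages in a .xml.bz2 dump, decompressing on the fly."""
    with bz2.open(path, "rb") as stream:
        return sum(1 for _ in iter_pages(stream))
```

Because the parser works incrementally and the root element is cleared after each page, memory use should stay roughly constant even for very large dumps. The same loop can write each page out as a JSON line or a CSV row if your tool prefers that.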

Why

Playing with data you don’t own and cannot influence is a great way to expand your technical skills. Playing with data you don’t know may improve your analytical skills.


Written by Adrian Macal

Software Developer, Data Engineer with solid knowledge of Business Intelligence. Passionate about programming.
