AWS Glue Table from XML files

A table is a logical collection of rows. AWS Glue can create such a non-materialized entity from many data sources, one of which is old-school XML files. Let's try to import and read 27 files with a total size of 35 GB.


The source of the data is Wikipedia. I will copy all the files to S3. Each file contains multiple rows of the following format:


To create a table, we need to define a new database, a classifier, and a crawler. The classifier defines what the data looks like; the crawler inspects the data, classifies it, and updates the table metadata.
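The same three pieces can also be set up from code instead of the console. The following is only a minimal sketch using boto3, not the exact setup used here: the database, classifier, and crawler names, the IAM role ARN, and the S3 path are all hypothetical placeholders, and the row tag `page` is an assumption based on how Wikipedia dumps usually wrap each article.

```python
from typing import Any, Dict, Optional


def build_glue_requests(
    database: str,
    classifier: str,
    crawler: str,
    role_arn: str,
    s3_path: str,
    row_tag: str,
) -> Dict[str, Dict[str, Any]]:
    """Build the keyword arguments for the three Glue API calls."""
    return {
        # Database that will hold the table metadata.
        "database": {"DatabaseInput": {"Name": database}},
        # XML classifier: RowTag names the element that becomes one row.
        "classifier": {
            "XMLClassifier": {
                "Name": classifier,
                "Classification": "xml",
                "RowTag": row_tag,
            }
        },
        # Crawler that scans the S3 prefix using the custom classifier.
        "crawler": {
            "Name": crawler,
            "Role": role_arn,
            "DatabaseName": database,
            "Classifiers": [classifier],
            "Targets": {"S3Targets": [{"Path": s3_path}]},
        },
    }


def create_and_run(requests: Dict[str, Dict[str, Any]], glue: Optional[Any] = None) -> None:
    """Create the database, classifier, and crawler, then start the crawler."""
    if glue is None:
        import boto3  # imported lazily so the builder works without AWS access

        glue = boto3.client("glue")
    glue.create_database(**requests["database"])
    glue.create_classifier(**requests["classifier"])
    glue.create_crawler(**requests["crawler"])
    glue.start_crawler(Name=requests["crawler"]["Name"])


# Hypothetical names and paths, for illustration only.
requests = build_glue_requests(
    database="wikipedia",
    classifier="wikipedia-xml",
    crawler="wikipedia-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    s3_path="s3://example-bucket/wikipedia/",
    row_tag="page",  # assumption: each row is wrapped in a <page> element
)
# create_and_run(requests)  # uncomment to call AWS with real credentials
```

Keeping the request construction separate from the API calls makes it easy to inspect the settings before anything is created in the account.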


The crawler ran for just 1 minute and correctly recognized the files with 36M rows. Unfortunately, querying the data directly with Athena does not work.


I first tried to crawl a single 35 GB file; unfortunately, it did not work.
