AWS Glue Table from XML files

A table is a logical collection of rows. AWS Glue can create such a non-materialized entity from many data sources. One of them is old-school XML files. Let’s try to import and read 27 files with a total size of 35 GB.

ETL

The source of the data is Wikipedia. I will copy all the files to S3. Each file contains multiple rows of the following format:

Table

To create a table we need to define a new database, a classifier and a crawler. The classifier defines what the data looks like, and the crawler inspects the data, classifies it and updates the table metadata.
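The three pieces can be sketched with boto3 roughly as follows. This is a minimal sketch, not the exact setup used here: the names, the IAM role ARN, the S3 path and the row tag are all hypothetical placeholders, and the row tag `page` assumes the standard Wikipedia dump layout. The request parameters are built separately from the AWS calls so they can be inspected without credentials.

```python
from typing import Any, Dict


def build_glue_requests(database: str, classifier: str, crawler: str,
                        role_arn: str, s3_path: str,
                        row_tag: str) -> Dict[str, Dict[str, Any]]:
    """Build keyword arguments for the Glue calls (no AWS access needed)."""
    return {
        # Logical container for the table the crawler will create.
        "create_database": {"DatabaseInput": {"Name": database}},
        # XML classifier: RowTag names the element that becomes one row.
        "create_classifier": {
            "XMLClassifier": {
                "Name": classifier,
                "Classification": "xml",
                "RowTag": row_tag,
            }
        },
        # Crawler: scans the S3 prefix using the custom classifier.
        "create_crawler": {
            "Name": crawler,
            "Role": role_arn,
            "DatabaseName": database,
            "Classifiers": [classifier],
            "Targets": {"S3Targets": [{"Path": s3_path}]},
        },
    }


def create_and_start(requests: Dict[str, Dict[str, Any]]) -> None:
    """Send the requests to AWS Glue and start the crawler."""
    import boto3  # imported lazily so the builder above works without it

    glue = boto3.client("glue")
    glue.create_database(**requests["create_database"])
    glue.create_classifier(**requests["create_classifier"])
    glue.create_crawler(**requests["create_crawler"])
    glue.start_crawler(Name=requests["create_crawler"]["Name"])


# Hypothetical values, for illustration only:
# create_and_start(build_glue_requests(
#     "wiki", "wiki-xml", "wiki-crawler",
#     "arn:aws:iam::123456789012:role/GlueRole",
#     "s3://example-bucket/wiki/", "page"))
```

The row tag is the important choice: it tells the classifier which XML element to treat as a single row of the table.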

Running

The crawler ran for just 1 minute and correctly recognized the files, with 36M rows in total. Unfortunately, querying the data directly with Athena does not work, because Athena does not support XML as a table format.
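What the crawler recorded can be checked in the Glue Data Catalog rather than through Athena. The sketch below summarizes a `get_tables` response; it assumes the crawler stores `classification` and `recordCount` in the table parameters, which is what crawlers typically write but is not confirmed for this run. The database name in the usage comment is hypothetical.

```python
from typing import Any, Dict, Optional


def summarize_tables(get_tables_response: Dict[str, Any]) -> Dict[str, Dict[str, Optional[Any]]]:
    """Extract classification and row count per table from a get_tables response."""
    summary: Dict[str, Dict[str, Optional[Any]]] = {}
    for table in get_tables_response.get("TableList", []):
        params = table.get("Parameters", {})
        count = params.get("recordCount")  # stored as a string, if present
        summary[table["Name"]] = {
            "classification": params.get("classification"),
            "record_count": int(count) if count is not None else None,
        }
    return summary


def fetch_summary(database: str) -> Dict[str, Dict[str, Optional[Any]]]:
    """Fetch the first page of tables for a database and summarize it."""
    import boto3  # imported lazily; only needed for the real call

    glue = boto3.client("glue")
    # Only the first page is read; NextToken is ignored in this sketch.
    return summarize_tables(glue.get_tables(DatabaseName=database))


# Hypothetical usage:
# print(fetch_summary("wiki"))
```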

Limitations?

I first tried to crawl a single 35 GB file; unfortunately, it did not work.