AWS Glue Table from XML files
A table is a logical collection of rows. AWS Glue can create such a non-materialized entity from many kinds of data sources, and one of them is old-school XML files. Let’s try to import and read 27 files with a total size of 35 GB.
ETL
The source of the data is Wikipedia. I will upload all the files to S3. Each file contains multiple rows of the following format:
Table
To create a table we need to define a new database, a classifier, and a crawler. The classifier defines what the data looks like; the crawler inspects the data, classifies it, and updates the table metadata.
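The three pieces can be sketched with boto3 roughly as follows. This is a minimal sketch, not the exact setup used here: the database, classifier, and crawler names, the IAM role, the S3 path, and the row tag are all placeholders that you would replace with your own values. The row tag in particular must match the XML element that wraps a single row in your files.

```python
def build_glue_requests(database, classifier, crawler, role_arn, s3_path, row_tag):
    """Return keyword arguments for the three Glue API calls.

    Every argument is a caller-supplied placeholder; nothing here is
    taken from the original setup.
    """
    return {
        "create_database": {"DatabaseInput": {"Name": database}},
        "create_classifier": {
            "XMLClassifier": {
                "Name": classifier,
                "Classification": "xml",
                # Element that delimits one row in the XML files.
                "RowTag": row_tag,
            }
        },
        "create_crawler": {
            "Name": crawler,
            "Role": role_arn,
            "DatabaseName": database,
            "Targets": {"S3Targets": [{"Path": s3_path}]},
            # Custom classifiers are tried before the built-in ones.
            "Classifiers": [classifier],
        },
    }


def create_table_resources(glue, calls):
    """Issue the calls in dependency order.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    The crawler refers to the database and the classifier by name,
    so those two are created first.
    """
    glue.create_database(**calls["create_database"])
    glue.create_classifier(**calls["create_classifier"])
    glue.create_crawler(**calls["create_crawler"])
```

Keeping the request payloads in a plain function makes them easy to inspect before anything is sent to AWS; with real credentials, `create_table_resources(boto3.client("glue"), calls)` would then create the resources.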
Running
The crawler ran for just 1 minute and correctly recognized the files, which contain 36M rows. Unfortunately, querying the data directly with Athena does not work.
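For completeness, starting the crawler and waiting for it to finish can also be scripted. The sketch below assumes a boto3 Glue client and a crawler name of your choosing; the polling interval is arbitrary, and the `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import time


def run_crawler(glue, name, poll_seconds=15, sleep=time.sleep):
    """Start a crawler and block until it is back in the READY state.

    `glue` is expected to be a boto3 Glue client. Returns the status of
    the last crawl as reported by Glue, or None if it is not available.
    """
    glue.start_crawler(Name=name)
    while True:
        # A crawler moves through RUNNING and STOPPING before READY.
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            break
        sleep(poll_seconds)
    crawler = glue.get_crawler(Name=name)["Crawler"]
    return crawler.get("LastCrawl", {}).get("Status")
```

This simple loop assumes the crawler has already left the READY state by the time of the first poll; a more careful version would also handle a crawler that never starts.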
Limitations?
I first tried to crawl a single 35 GB file; unfortunately, it did not work.