XML is one of the supported Snowflake formats for data loading. I am going to check how Snowflake handles such files and compare the results with my previous experiments with AWS Glue.
I have already prepared an S3 bucket with data offloaded from Wikipedia's FTP. There are 27 files with a total size of around 35GB, and according to the AWS Glue crawler results, the entire dataset should contain around 36M rows.
Snowflake can access S3 data directly from a stage. It can even be a temporary stage with temporary credentials.
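A minimal sketch of such a temporary stage, assuming a hypothetical bucket name and placeholder credentials (not the actual ones used in this experiment):

```sql
-- Temporary stage over the S3 bucket with the Wikipedia XML dump.
-- Bucket URL and credentials are placeholders.
CREATE OR REPLACE TEMPORARY STAGE wikipedia_xml
  URL = 's3://my-wikipedia-dump/'
  CREDENTIALS = (AWS_KEY_ID = '<key id>' AWS_SECRET_KEY = '<secret key>');

-- File format for the dump; STRIP_OUTER_ELEMENT makes each child
-- element of the document root (e.g. each <page>) a separate row.
CREATE OR REPLACE FILE FORMAT xml_format
  TYPE = XML
  STRIP_OUTER_ELEMENT = TRUE;
```

The temporary stage and the credentials disappear with the session, which is convenient for a one-off experiment like this.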
Let’s query the total number of rows. I expect 36M rows, because that is what the AWS Glue Crawler reported.
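The count query could look like this, assuming the stage and file format names from the sketch above (hypothetical names, not necessarily the ones used):

```sql
-- Count the XML rows directly on the staged files,
-- without loading anything into a table first.
SELECT COUNT(t.$1)
FROM @wikipedia_xml (FILE_FORMAT => 'xml_format') t;
```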
Surprisingly, the query returned 96M rows, all of them unique. Why did the AWS Glue Crawler report something else?
I need to transform the XML data to CSV because Athena does not support XML. I created the following script using AWS Glue Studio.
The execution of the script on two workers took over an hour and produced 27 CSV files with a total size of 800MB (containing only the ids).
I checked the CSV values directly in Snowflake. Exactly as with the XML files, I created a stage, counted the rows, and verified the numbers.
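The CSV verification can be sketched the same way as the XML one, again with a hypothetical bucket name and placeholder credentials:

```sql
-- Temporary stage over the Glue job's CSV output (placeholder URL).
CREATE OR REPLACE TEMPORARY STAGE wikipedia_csv
  URL = 's3://my-wikipedia-csv/'
  CREDENTIALS = (AWS_KEY_ID = '<key id>' AWS_SECRET_KEY = '<secret key>');

CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = CSV;

-- Total row count and distinct count of the single id column.
SELECT COUNT(t.$1), COUNT(DISTINCT t.$1)
FROM @wikipedia_csv (FILE_FORMAT => 'csv_format') t;
```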
- The AWS Glue Crawler may report an incorrect number of rows
- The Snowflake database can query XML files pretty fast
- The AWS Glue Spark job needed 73 minutes to transform 35GB of XML to CSV, which Snowflake could read and probably transform in about 5 minutes
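The Snowflake-side transformation hinted at in the last point could be sketched like this, assuming the stage and file format names introduced earlier and that the id lives in an `<id>` child element (an assumption about the dump's structure):

```sql
-- Extract the id from each staged XML row and write it back
-- to S3 as CSV, all inside Snowflake. Names are hypothetical.
COPY INTO @wikipedia_csv/ids/
FROM (
  SELECT XMLGET(t.$1, 'id'):"$"::string AS id
  FROM @wikipedia_xml (FILE_FORMAT => 'xml_format') t
)
FILE_FORMAT = (TYPE = CSV);
```

This skips the Spark cluster entirely: the same warehouse that answered the count query in minutes would also perform the extraction.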