Having tested tools like AWS Glue + Athena, EMR and Snowflake in context their performance to query XML files, I decided to give a try very, very naive approach of converting XML files to JSON rows in pure Python.
I am still on Wikipedia file which has a lot of following rows:
which should be converted to JSON
XML in Python
What I need is the XML reader, I do not want to read all XML as DOM into memory, it would crash my application. I need to read it node by node and take care of not leaving relationships between nodes. There is lxml package with nice iterparse function to handle it perfectly without any memory leaks. Let’s create another component for my binary pipeline:
And its connection in the pipeline:
I converted 35GB of XML into 19GB of JSON in 19 minutes on c5.2xlarge machine. The bottleneck was CPU. It means it can be either optimized or distributed. The entire code may be found here.