Learning Rust: Parsing big XML files

Adrian Macal
5 min read · May 21, 2023


Cracking the XML Giant: How to Explore Large XML Files in Seconds, Not Hours.

XML is an older format that remains in use. You may not encounter it in modern APIs (though it still appears in heavyweight Java SOAP web services), but some tools still find it useful as an export format. One notable example is Wikipedia, which exports all of its content into immense XML files. Today, I will demonstrate a highly efficient method for reading these files.

XML is so well established that I learned about it more than 20 years ago. Transitioning to it from its predecessors was a significant step, promising numerous advantages: human and machine readability, structure combined with flexibility, and platform and language independence.

I recall older C# tutorials focusing on two primary methods of reading XML documents:

  • XmlDocument, which loads the entire document into memory and returns a nested DOM object from which you can access any node at any time.
  • XmlReader, which parses the document node by node, delegating the task of meaningful document reconstruction to the caller.

Modern developers typically default to the first method. It’s incredibly straightforward to use. When a document is 1kB or 1MB, this approach is perfectly acceptable. For documents exceeding 100MB, it’s still feasible but results in a substantial memory footprint.

Consider dealing with Wikipedia dumps, where each file’s size ranges between 1GB and 100GB. Today, I am going to use an XML reader to efficiently understand what’s inside a 37GB file.

Crate: quick-xml

Quick-xml is an XML reader. It behaves similarly to C#'s XmlReader, but it's considerably smarter. It reminds me of a project I did for fun in .NET six years ago. At that time, I realized that managing encoding and string conversion consumed a lot of time, even for nodes I didn't care about. I observed that any UTF-8 encoded XML can be parsed in binary mode: all markup characters are plain ASCII, and skipping over multi-byte UTF-8 sequences is relatively straightforward.

The crate appears to benefit from a similar tactic: it enables forward-only iteration over all nodes, with each node decoded at the byte level. After acquiring such a node, you can request its conversion into a string. If you ignore the node instead, no string conversion occurs, saving precious CPU cycles. The result is impressive performance.
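
To illustrate the idea, here is a minimal sketch (the file name example.xml is hypothetical) that counts page elements purely by comparing raw name bytes, so nothing is ever converted to a string:

use quick_xml::events::Event;
use quick_xml::reader::Reader;

fn main() -> Result<(), quick_xml::Error> {
    // Hypothetical input file; any well-formed UTF-8 XML works.
    let mut reader = Reader::from_file("example.xml")?;
    let mut buf = Vec::new();
    let mut pages: u64 = 0;

    loop {
        match reader.read_event_into(&mut buf)? {
            // Compare the raw name bytes; no UTF-8 decoding happens
            // for the nodes we are not interested in.
            Event::Start(node) if node.name().as_ref() == b"page" => pages += 1,
            Event::Eof => break,
            _ => (),
        }

        // Reuse the buffer between events to avoid reallocations.
        buf.clear();
    }

    println!("pages: {}", pages);
    Ok(())
}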

Reading a 37GB file

I had already downloaded Wikipedia's file manually and decompressed it locally. My goal was to swiftly understand the schema of the file: I wanted to count the occurrences of every path from the root down to each node name or attribute name. I wrote a short program to do this.

The code streams through a hardcoded large XML file while constantly updating two pieces of information (a sketch follows the list):

  • path, which works like a stack recording the way from the root to the current node
  • seen, a dictionary that counts how many times each path has occurred
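
Here is a minimal sketch of that loop, assuming quick-xml's read_event_into API; the file name and the helper names increment and handle_attributes are mine, while the author's exact code lives in the repository linked at the end:

use std::collections::BTreeMap;

use quick_xml::events::Event;
use quick_xml::reader::Reader;

fn main() -> Result<(), quick_xml::Error> {
    // Hypothetical file name; the original reads a locally
    // decompressed Wikipedia dump.
    let mut reader = Reader::from_file("enwiki-dump.xml")?;

    let mut buf = Vec::new();
    let mut path: Vec<String> = Vec::new();
    let mut seen: BTreeMap<String, u64> = BTreeMap::new();

    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Start(node) => {
                // Pay for the byte-to-string conversion only here.
                path.push(format!("\"{}\"", String::from_utf8_lossy(node.name().as_ref())));
                increment(&mut seen, &path);
                handle_attributes(&mut seen, &path, &node);
            }
            Event::Empty(node) => {
                // Self-closing elements open and close in one event.
                path.push(format!("\"{}\"", String::from_utf8_lossy(node.name().as_ref())));
                increment(&mut seen, &path);
                handle_attributes(&mut seen, &path, &node);
                path.pop();
            }
            Event::End(_) => {
                path.pop();
            }
            Event::Text(_) => {
                // Text content is counted as a pseudo-node named @text.
                path.push("@text".to_string());
                increment(&mut seen, &path);
                path.pop();
            }
            Event::Eof => break,
            _ => (),
        }

        buf.clear();
    }

    // BTreeMap keeps its keys sorted, matching the listing below.
    for (key, count) in &seen {
        println!("{}: {}", key, count);
    }

    Ok(())
}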

Incrementing the counters and handling attributes are pulled out into their own functions, so the same bit of code can be reused in different places.
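
The two helpers from the sketch above might look like this (again, the names are my own, not necessarily the author's):

use quick_xml::events::BytesStart;

// Counts one occurrence of the current root-to-node path.
fn increment(seen: &mut BTreeMap<String, u64>, path: &[String]) {
    *seen.entry(path.join(" / ")).or_insert(0) += 1;
}

// Counts each attribute under an @attribute pseudo-node, producing
// keys like "mediawiki" / @attribute / "version".
fn handle_attributes(seen: &mut BTreeMap<String, u64>, path: &[String], node: &BytesStart) {
    for attribute in node.attributes().flatten() {
        let name = String::from_utf8_lossy(attribute.key.as_ref());
        let key = format!("{} / @attribute / \"{}\"", path.join(" / "), name);
        *seen.entry(key).or_insert(0) += 1;
    }
}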

When the code runs, it produces the listing below in just 12 seconds. It's exactly what I wanted to achieve, and it's really impressive that this was done so quickly using only a single CPU.

"mediawiki": 1
"mediawiki" / "page": 259
"mediawiki" / "page" / "id": 259
"mediawiki" / "page" / "id" / @text: 259
"mediawiki" / "page" / "ns": 259
"mediawiki" / "page" / "ns" / @text: 259
"mediawiki" / "page" / "redirect": 113
"mediawiki" / "page" / "redirect" / @attribute / "title": 113
"mediawiki" / "page" / "revision": 525890
"mediawiki" / "page" / "revision" / "comment": 430730
"mediawiki" / "page" / "revision" / "comment" / @attribute / "deleted": 111
"mediawiki" / "page" / "revision" / "comment" / @text: 430619
"mediawiki" / "page" / "revision" / "contributor": 525890
"mediawiki" / "page" / "revision" / "contributor" / "id": 362808
"mediawiki" / "page" / "revision" / "contributor" / "id" / @text: 362808
"mediawiki" / "page" / "revision" / "contributor" / "ip": 163064
"mediawiki" / "page" / "revision" / "contributor" / "ip" / @text: 163064
"mediawiki" / "page" / "revision" / "contributor" / "username": 362808
"mediawiki" / "page" / "revision" / "contributor" / "username" / @text: 362808
"mediawiki" / "page" / "revision" / "contributor" / @attribute / "deleted": 18
"mediawiki" / "page" / "revision" / "contributor" / @text: 1414552
"mediawiki" / "page" / "revision" / "format": 525890
"mediawiki" / "page" / "revision" / "format" / @text: 525890
"mediawiki" / "page" / "revision" / "id": 525890
"mediawiki" / "page" / "revision" / "id" / @text: 525890
"mediawiki" / "page" / "revision" / "minor": 147557
"mediawiki" / "page" / "revision" / "model": 525890
"mediawiki" / "page" / "revision" / "model" / @text: 525890
"mediawiki" / "page" / "revision" / "parentid": 525649
"mediawiki" / "page" / "revision" / "parentid" / @text: 525649
"mediawiki" / "page" / "revision" / "sha1": 525890
"mediawiki" / "page" / "revision" / "sha1" / @text: 525721
"mediawiki" / "page" / "revision" / "text": 525890
"mediawiki" / "page" / "revision" / "text" / @attribute / "bytes": 525890
"mediawiki" / "page" / "revision" / "text" / @attribute / "deleted": 169
"mediawiki" / "page" / "revision" / "text" / @attribute / "xml:space": 524534
"mediawiki" / "page" / "revision" / "text" / @text: 524534
"mediawiki" / "page" / "revision" / "timestamp": 525890
"mediawiki" / "page" / "revision" / "timestamp" / @text: 525890
"mediawiki" / "page" / "revision" / @text: 5311056
"mediawiki" / "page" / "title": 259
"mediawiki" / "page" / "title" / @text: 259
"mediawiki" / "page" / @text: 527039
"mediawiki" / "siteinfo": 1
"mediawiki" / "siteinfo" / "base": 1
"mediawiki" / "siteinfo" / "base" / @text: 1
"mediawiki" / "siteinfo" / "case": 1
"mediawiki" / "siteinfo" / "case" / @text: 1
"mediawiki" / "siteinfo" / "dbname": 1
"mediawiki" / "siteinfo" / "dbname" / @text: 1
"mediawiki" / "siteinfo" / "generator": 1
"mediawiki" / "siteinfo" / "generator" / @text: 1
"mediawiki" / "siteinfo" / "namespaces": 1
"mediawiki" / "siteinfo" / "namespaces" / "namespace": 30
"mediawiki" / "siteinfo" / "namespaces" / "namespace" / @attribute / "case": 30
"mediawiki" / "siteinfo" / "namespaces" / "namespace" / @attribute / "key": 30
"mediawiki" / "siteinfo" / "namespaces" / "namespace" / @text: 29
"mediawiki" / "siteinfo" / "namespaces" / @text: 31
"mediawiki" / "siteinfo" / "sitename": 1
"mediawiki" / "siteinfo" / "sitename" / @text: 1
"mediawiki" / "siteinfo" / @text: 7
"mediawiki" / @attribute / "version": 1
"mediawiki" / @attribute / "xml:lang": 1
"mediawiki" / @attribute / "xmlns": 1
"mediawiki" / @attribute / "xmlns:xsi": 1
"mediawiki" / @attribute / "xsi:schemaLocation": 1
"mediawiki" / @text: 261

Final thought

Packages like quick-xml remind me that we can keep making software better and smarter. It's all too easy to waste computing power, while finding or applying simple, efficient solutions can be hard. Let's keep looking for them.

You can find the code here: https://github.com/amacal/learning-rust/tree/reading-big-xml
