Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort a 10GB JSON file in pure Python in a non-distributed environment.
The general idea is to split the big JSON file into consistent chunks (512MB each) that can be sorted in memory. Then persist the sorted chunks temporarily back in S3. And finally merge them by streaming into the final big JSON file.
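The split, sort and merge idea can be sketched with nothing but the standard library. This is a minimal illustration, not the pipeline code: it assumes the input is newline-delimited JSON, that every record has a sortable field (here called `id`, a made-up name), and it spills sorted chunks to local temporary files instead of S3. The function name `sort_ndjson` and the `lines_per_chunk` parameter are hypothetical.

```python
import heapq
import json
import os
import tempfile
from itertools import islice


def sort_ndjson(src_path, dst_path, key="id", lines_per_chunk=100_000):
    """External merge sort of a newline-delimited JSON file by one field."""
    chunk_paths = []
    try:
        # Phase 1: read a bounded number of lines, sort them in memory,
        # and spill each sorted chunk to its own temporary file.
        with open(src_path, "r", encoding="utf-8") as src:
            while True:
                raw = list(islice(src, lines_per_chunk))
                if not raw:
                    break
                records = [json.loads(line) for line in raw if line.strip()]
                records.sort(key=lambda r: r[key])
                fd, path = tempfile.mkstemp(suffix=".ndjson")
                chunk_paths.append(path)
                with os.fdopen(fd, "w", encoding="utf-8") as out:
                    for record in records:
                        out.write(json.dumps(record) + "\n")

        # Phase 2: k-way merge. heapq.merge pulls lazily from each sorted
        # chunk, so only one record per chunk is held in memory at a time.
        files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
        try:
            streams = [(json.loads(line) for line in f) for f in files]
            with open(dst_path, "w", encoding="utf-8") as dst:
                for record in heapq.merge(*streams, key=lambda r: r[key]):
                    dst.write(json.dumps(record) + "\n")
        finally:
            for f in files:
                f.close()
    finally:
        # The temporary chunks are only needed until the merge is done.
        for p in chunk_paths:
            os.remove(p)
```

Memory use is bounded by the chunk size in the first phase and by the number of chunks in the second, which is why the chunk size can be tuned to whatever fits the machine.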
To keep things simple, I am going to use my binary pipeline for the implementation. The actual code (not pseudocode) looks like the following snippet:
The binary pipeline exposes the following components needed to sort a 10GB file.
- S3Download: takes an S3Object and outputs a binary stream
- NDJsonChunk: takes a binary stream and outputs it in consistent chunks
- ForEachChunk: takes a binary stream and calls the given steps for each chunk
- NDJsonIndex: takes a binary stream and outputs (key, data) pairs
- QuickSort: takes (key, data) pairs, sorts them, and outputs (key, data) pairs
- NDJsonFlush: takes (key, data) pairs and outputs a binary stream
- S3Upload: takes a binary stream, writes it to S3, and outputs an S3Object
- WaitAll: waits for all dictionary data and outputs it
- MergeSort: takes a stream, calls the given steps for each item, and merges the results
- Singleton: waits for all data, ignores it, and produces a single value item
- S3List: takes an S3Prefix stream and outputs every S3Object matching it
- S3Delete: takes an S3Object stream, deletes each object, and outputs it
- DictDebug: takes a dictionary stream and prints it
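To give a feel for what the chunking, indexing, sorting and flushing steps might do, here is a rough standalone sketch. It is not the pipeline's code and the real components may work differently: `ndjson_chunks` and `sort_chunk` are hypothetical helpers, and the key field `id` is an assumption. The point it illustrates is that a "consistent" chunk has to end on a line boundary, so that no JSON record is cut in half.

```python
import io
import json


def ndjson_chunks(stream, chunk_size):
    """Yield byte chunks of roughly chunk_size that always end on a line boundary."""
    tail = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        block = tail + block
        # Cut after the last newline; cut is 0 when no complete line is buffered yet.
        cut = block.rfind(b"\n") + 1
        if cut:
            yield block[:cut]
        tail = block[cut:]
    if tail:
        # Final line that was not terminated by a newline.
        yield tail


def sort_chunk(chunk, key):
    """Index each line by its key, sort the (key, line) pairs, flush back to bytes."""
    pairs = [
        (json.loads(line)[key], line)
        for line in chunk.splitlines()
        if line.strip()
    ]
    pairs.sort(key=lambda pair: pair[0])
    return b"".join(line + b"\n" for _, line in pairs)


# Usage: chunk an in-memory stream and sort each chunk independently.
data = b'{"id": 3}\n{"id": 1}\n{"id": 2}\n'
sorted_chunks = [sort_chunk(c, "id") for c in ndjson_chunks(io.BytesIO(data), 16)]
```

Keeping the original bytes of each line next to its key means the record is parsed once for indexing and never re-serialized, which matters when the chunk is hundreds of megabytes.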
I ran the code on a 10GB file in an ECS container, and the whole process took around 20 minutes. The entire code is available here.