Sorting S3 10GB with 1GB RAM

Handling big files without loading them into memory requires processing them in chunks. In this post I will show you how to sort a 10 GB JSON file in pure Python, in a single, non-distributed environment.

Algorithm

The general idea is to split the big JSON file into consistently sized chunks (512 MB each) that you can sort in memory, then persist those sorted chunks temporarily back to S3, and finally stream-merge them into the final sorted JSON file.
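
To make this concrete, here is a minimal local-file sketch of the same external merge sort idea, assuming newline-delimited JSON records sorted by a hypothetical "id" field; the real pipeline streams from and to S3 instead of local disk.

import heapq
import json
import tempfile

CHUNK_BYTES = 512 * 1024 * 1024  # sort at most ~512 MB of lines at a time

def sort_key(line: bytes):
    return json.loads(line)["id"]

def split_into_sorted_runs(path):
    # Read the big file in ~512 MB slices, sort each slice in memory,
    # and persist every sorted slice (a "run") to a temporary file.
    runs = []
    with open(path, "rb") as src:
        while True:
            lines = src.readlines(CHUNK_BYTES)  # stops once the byte hint is exceeded
            if not lines:
                break
            lines.sort(key=sort_key)
            run = tempfile.NamedTemporaryFile(delete=False)
            run.writelines(lines)  # assumes every record ends with a newline
            run.close()
            runs.append(run.name)
    return runs

def merge_runs(runs, out_path):
    # Stream-merge the sorted runs; only one line per run is held in memory.
    files = [open(name, "rb") for name in runs]
    with open(out_path, "wb") as dst:
        for line in heapq.merge(*files, key=sort_key):
            dst.write(line)
    for f in files:
        f.close()

# Hypothetical usage with local file names:
runs = split_into_sorted_runs("big.ndjson")
merge_runs(runs, "big_sorted.ndjson")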

Pipeline

To keep things simple, I am going to use my binary pipeline for the implementation. The actual code (not pseudocode) wires together the components listed below.

Components

The binary pipeline exposes the following components needed to sort the 10 GB file (a rough sketch of the streaming part follows the list).

  • S3Download — takes an S3Object and outputs a binary stream
  • NDJsonChunk — takes a binary stream and splits it into consistently sized chunks
  • ForEachChunk — takes a binary stream and calls the nested steps for each chunk
  • NDJsonIndex — takes a binary stream and outputs key, data pairs
  • QuickSort — takes key, data pairs, sorts them and outputs key, data pairs
  • NDJsonFlush — takes key, data pairs and outputs a binary stream
  • S3Upload — takes a binary stream, writes it to S3 and outputs an S3Object
  • WaitAll — waits for all dictionary data and outputs it
  • MergeSort — takes a stream, calls the nested steps for each item and merges the results
  • Singleton — waits for all data, ignores it and produces a single value item
  • S3List — takes an S3Prefix stream and outputs all S3Objects matching it
  • S3Delete — takes an S3Object stream, deletes each object and outputs it
  • DictDebug — takes a dictionary stream and prints it
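
The pipeline composition itself is not reproduced here, but the following boto3-based sketch illustrates the kind of work S3Download and NDJsonChunk do together: stream an S3 object and yield batches of whole NDJSON lines capped at roughly 512 MB. The bucket and key names are hypothetical.

import boto3

CHUNK_BYTES = 512 * 1024 * 1024

def iter_ndjson_chunks(bucket, key):
    # Stream the object body and yield lists of complete lines, each list
    # totalling roughly 512 MB, so a single batch always fits comfortably
    # within the 1 GB memory budget.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    batch, size = [], 0
    for line in body.iter_lines():  # yields complete lines without trailing newlines
        batch.append(line)
        size += len(line) + 1
        if size >= CHUNK_BYTES:
            yield batch
            batch, size = [], 0
    if batch:
        yield batch

# Hypothetical usage: sort each batch in memory, then upload it back to S3 as a run.
# for lines in iter_ndjson_chunks("my-bucket", "input/big.ndjson"):
#     ...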

Results

I ran the code on the 10 GB file in an ECS container, and the whole process took around 20 minutes. The entire code is available here.
