Sorting S3 10GB with 1GB RAM

2 min read · Nov 29, 2020

Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort a 10GB JSON file in pure Python in a non-distributed environment.

Algorithm

The general idea is to split the big JSON file into consistent chunks (512MB each) that can be sorted in memory, then persist the sorted chunks temporarily back in S3, and finally merge them by streaming into the final big JSON file.
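The split, sort and merge idea can be sketched with nothing but the standard library. This is a minimal illustration and not the pipeline code: it assumes newline-delimited JSON, uses local temporary files where the article uses S3, and the function name `sort_ndjson` and the `key` parameter are hypothetical.

```python
import heapq
import json
import os
import tempfile


def sort_ndjson(src_path, dst_path, key, chunk_bytes=512 * 1024 * 1024):
    """Sort an NDJSON file by record[key] while holding about one chunk in memory."""

    def sort_key(line):
        return json.loads(line)[key]

    with tempfile.TemporaryDirectory() as tmp_dir:
        chunk_paths = []

        # Phase 1: read roughly chunk_bytes of whole lines, sort them, persist them.
        with open(src_path, "rb") as src:
            while True:
                lines = src.readlines(chunk_bytes)
                if not lines:
                    break
                # Drop blank lines and make sure every record ends with a newline.
                lines = [l if l.endswith(b"\n") else l + b"\n" for l in lines if l.strip()]
                lines.sort(key=sort_key)
                path = os.path.join(tmp_dir, f"chunk-{len(chunk_paths)}.ndjson")
                with open(path, "wb") as out:
                    out.writelines(lines)
                chunk_paths.append(path)

        # Phase 2: stream a k-way merge of the sorted chunks into the result.
        files = [open(path, "rb") for path in chunk_paths]
        try:
            with open(dst_path, "wb") as dst:
                dst.writelines(heapq.merge(*files, key=sort_key))
        finally:
            for f in files:
                f.close()
```

During the merge only one line per chunk has to be in memory at a time, which is why the final step can stream regardless of the total file size.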

Pipeline

To make things simpler, I am going to use my binary pipeline for the implementation. The actual code (not pseudocode) looks like the following snippet:

Components

The binary pipeline exposes the following components needed to sort a 10GB file.

S3Download — takes S3Object and outputs binary stream
NDJsonChunk — takes binary stream and outputs it in consistent chunks
ForEachChunk — takes binary stream and for each chunk calls steps
NDJsonIndex — takes binary stream and outputs key, data pairs
QuickSort — takes key, data pairs, sorts them and outputs key, data pairs
NDJsonFlush — takes key, data pairs and outputs binary stream
S3Upload — takes binary stream, writes it to S3 and outputs S3Object
WaitAll — waits for all dictionary data and outputs it
MergeSort — takes stream, for each item calls steps and merges them
Singleton — waits for all data, ignores it and produces single value item
S3List — takes S3Prefix stream and outputs all S3Object matching it
S3Delete — takes S3Object stream, deletes it and outputs it
DictDebug — takes dictionary stream and prints it
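To make the chunking, indexing and sorting steps more concrete, here is a rough sketch of what such components might do. It is not the binary pipeline's implementation: the function names `ndjson_chunks`, `index_chunk` and `sort_chunk` are hypothetical, the input is assumed to be newline-delimited JSON, and Python's built-in sort stands in for the quick sort step.

```python
import io
import json


def ndjson_chunks(stream, chunk_size):
    """Yield byte chunks of roughly chunk_size that always end on a line boundary."""
    tail = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        block = tail + block
        cut = block.rfind(b"\n") + 1  # 0 when the block holds no complete line yet
        if cut:
            yield block[:cut]
        tail = block[cut:]  # incomplete last line is carried into the next block
    if tail:
        yield tail + b"\n"


def index_chunk(chunk, key):
    """Return (key, raw line) pairs for every record in one chunk."""
    return [(json.loads(line)[key], line) for line in chunk.splitlines() if line.strip()]


def sort_chunk(chunk, key):
    """Sort one chunk by key and flush it back to NDJSON bytes."""
    pairs = sorted(index_chunk(chunk, key), key=lambda pair: pair[0])
    return b"".join(line + b"\n" for _, line in pairs)
```

Carrying the incomplete tail forward is what keeps the chunks consistent: no record is ever split between two chunks, so each chunk can be parsed and sorted on its own.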

Results

I ran the code on a 10GB file in an ECS container, and the whole process took around 20 minutes. The entire code is available here.

Written by Adrian Macal
