Dealing with a lot of key/value pairs is not easy. First, huge data does not fit memory; secondly, disk offloading slows down. It is already the third iteration of my playground. Let’s sum up what I built so far.

It’s very important part of the system. Because the system relies on huge memory allocation, I decided to abandon garbage collection as much as possible. The key/value binary data resides in one of the following locations:

  • in-memory block, stored in continuous byte array; data is appended or modified in place; may be allocated; extraction returns just reference to it
  • file builder…

Recently I made a draft of hash table implementation storing binary data in the memory able to deal with 200 millions entries in 15 GB. The first version made a baseline for next improvements. I am going to tackle it in this article.

The 15 GB memory footprint is no-go. The goal is to store 1 billion entries, but I cannot afford 5 times more memory. It would be simply too expensive. Let’s think about offloading some data to disk. In order to know what could be stored on much slower disk, I need to understand how memory is utilized.

Data structures are part of software we use every day. The most common structure is key/value map to get or set values by key. Today I am going to play a bit with it on bigger data set.

I am not going to build any replacement to existing core dictionaries, maps or hash tables. As a final outcome I would like to have such data structure with hash table behavior able to handle 1 billion keys. And it is not going to happen in this article.

I am starting with very simple approach: memory only. Let’s build a structure with…

Contributing to project requires setting up development environment. Joining a team requires more installations. Working in multiple teams makes it even more complex. I had a dream that I do not need to do anything. It came true.

VS Code

VS Code already offers really nice experience. You can install extension for any language and it works almost out of the box. Unfortunately most extensions require already configured build tools to support each programming language. The rescue comes as another extension:

After installing it you are required to have only working docker to use this feature. How to use it? It is…

The ECS Task can be started from the other machine and can be waited till completed or failed. The logs are streamed directly to CloudWatch. The waiting process can poll and show them while waiting for the task.

The problem occurs if I run 20 or more tasks in my binary pipeline. I know something is working and even progressing, but I do not want to see just ECS task started and complete messages. I don’t want to check logs manually in the AWS CloudWatch neither. …

The code may behave differently than developer expects. I mean not the correctness, but the performance. The performance bottleneck may be easy to found by running the profiler. cProfile is the way to do it in Python.

I am going to run my binary pipeline code twice using the profiler. First time in my local computer where the main bottleneck is my internet connection. The second time, on AWS ECS node, where the bottleneck should be CPU.

I ran locally the script as I used to do, but wrapped it around the cProfiler:

pipenv run python -m cProfile -s tottime…

Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort 10GB JSON file in pure Python in non-distributed environment.

Generally the idea is to split big JSON file into consistent chunks (512MB) which you can sort in memory. Them persist them temporarily back in S3. And finally merge them by streaming into final big JSON file.

To make things simpler for me I am going to use my binary pipeline for the implementation. The actual code (not pseudo) looks like the following snippet:

The binary pipeline exposes…

Date or time ranges are common attributes in the database. Quite often something starts and ends. The concurrency may these beings be defined as how many of the are happening at the same time. Let’s calculate it.

There is two tables with following columns: id, start_at and end_at. The first table contains 11M rows, the second one almost 400M.

Theoretically the problem can be reduced to discreet values and the grouping may be applied. If we know all the timestamps when the value may change (when it starts or ends) we can build a timeline to use for evaluation using…

File systems tend to expose creation and modification dates of each entry. AWS S3 is not a file system, but exposes “Last Modified” date which is a bit confusing, because S3 object is not modifiable, but can be overwritten.

The goal of the experiment is to figure out how it really behaves, especially in the multipart upload scenario. I have an output file which I am going to write in 128MB parts using my slow internet connection. It is going to take several minutes to upload 1GB.

The process created multipart upload at 14:00, the first part completed at 14:01…

The Python’s old school way towards concurrency is multiprocessing module which spawns new processes. The asyncio module is a step to modernity. I will migrate my small data pipeline and compare the performance.

The classic multiprocessing can be triggered by the Pool. You do not need to take care of anything just to pass list of tasks you want to run concurrently.

The modern asyncio brings the same simplicity what multiprocessing. You need to have a Pool of threads and schedule some tasks. The threads are good enough if you are going to run IO intensive tasks.


Adrian Macal

Software Developer, Data Engineer with solid knowledge of Business Intelligence. Passionate about programming.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store