Dealing with a lot of key/value pairs is not easy. First, huge data does not fit memory; secondly, disk offloading slows down. It is already the third iteration of my playground. Let’s sum up what I built so far.

It’s very important part of the system. Because the system relies…

Recently I made a draft of hash table implementation storing binary data in the memory able to deal with 200 millions entries in 15 GB. The first version made a baseline for next improvements. I am going to tackle it in this article.

The 15 GB memory footprint is no-go…

Data structures are part of software we use every day. The most common structure is key/value map to get or set values by key. Today I am going to play a bit with it on bigger data set.

I am not going to build any replacement to existing core dictionaries…

Contributing to project requires setting up development environment. Joining a team requires more installations. Working in multiple teams makes it even more complex. I had a dream that I do not need to do anything. It came true.

VS Code

VS Code already offers really nice experience. You can install extension for…

The ECS Task can be started from the other machine and can be waited till completed or failed. The logs are streamed directly to CloudWatch. The waiting process can poll and show them while waiting for the task.

The problem occurs if I run 20 or more tasks in my…

The code may behave differently than developer expects. I mean not the correctness, but the performance. The performance bottleneck may be easy to found by running the profiler. cProfile is the way to do it in Python.

I am going to run my binary pipeline code twice using the profiler…

Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort 10GB JSON file in pure Python in non-distributed environment.

Generally the idea is to split big JSON file into consistent chunks (512MB) which you can sort in memory…

Date or time ranges are common attributes in the database. Quite often something starts and ends. The concurrency may these beings be defined as how many of the are happening at the same time. Let’s calculate it.

There is two tables with following columns: id, start_at and end_at. …

File systems tend to expose creation and modification dates of each entry. AWS S3 is not a file system, but exposes “Last Modified” date which is a bit confusing, because S3 object is not modifiable, but can be overwritten.

The goal of the experiment is to figure out how it…

The Python’s old school way towards concurrency is multiprocessing module which spawns new processes. The asyncio module is a step to modernity. I will migrate my small data pipeline and compare the performance.

The classic multiprocessing can be triggered by the Pool. You do not need to take care of…

Adrian Macal

Software Developer, Data Engineer with solid knowledge of Business Intelligence. Passionate about programming.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store