As expected from a data warehousing solution, Snowflake offers many connectors, compatible with industry standards, across programming languages. As a modern data warehousing solution, Snowflake also offers a REST API that is agnostic to any programming language.

When I started using Snowflake, I chose the default option and used the ready-to-use Python connector. It…

Dealing with a lot of key/value pairs is not easy. First, huge data does not fit in memory; second, offloading to disk slows things down. This is already the third iteration of my playground. Let’s sum up what I have built so far.

Memory management

It’s a very important part of the system. Because the system relies…

Recently I made a draft of a hash table implementation that stores binary data in memory and can handle 200 million entries in 15 GB. The first version set a baseline for further improvements. I am going to tackle them in this article.

Bottleneck

The 15 GB memory footprint is a no-go…

Data structures are part of the software we use every day. The most common structure is a key/value map, used to get or set values by key. Today I am going to play with it a bit on a bigger data set.

Goal

I am not going to build a replacement for existing core dictionaries…

Contributing to a project requires setting up a development environment. Joining a team requires more installations. Working in multiple teams makes it even more complex. I had a dream that I would not need to do anything. It came true.

VS Code

VS Code already offers a really nice experience. You can install an extension for…

An ECS task can be started from another machine, and the caller can wait until it completes or fails. The logs are streamed directly to CloudWatch. The waiting process can poll and show them while waiting for the task.

Idea

The problem occurs if I run 20 or more tasks in my…

The code may behave differently than the developer expects. I do not mean correctness, but performance. A performance bottleneck may be easy to find by running a profiler. cProfile is the way to do it in Python.
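To make the idea concrete, here is a minimal sketch of running a function under cProfile and printing the most expensive calls. The functions `slow_sum` and `profile_call` are hypothetical names invented for this illustration; they are not taken from the article's binary pipeline.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical workload: a deliberately naive loop so it shows up in the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    # Run func under cProfile and return its result plus a text report.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Collect the report into a string instead of printing straight to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
    return result, buffer.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

Sorting by cumulative time is only one choice; sorting by `"tottime"` instead highlights functions that are slow on their own rather than through the calls they make.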

Goal

I am going to run my binary pipeline code twice using the profiler…

Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort a 10 GB JSON file in pure Python in a non-distributed environment.

Algorithm

Generally the idea is to split the big JSON file into consistent chunks (512 MB) which you can sort in memory…

Date or time ranges are common attributes in a database. Quite often something starts and ends. The concurrency of these things may be defined as how many of them are happening at the same time. Let’s calculate it.
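The split-sort-merge idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's implementation: it assumes the input holds one JSON object per line, it sizes chunks by line count rather than by 512 MB, and `sort_json_lines` with its parameters is a hypothetical name.

```python
import heapq
import itertools
import json
import os
import tempfile


def sort_json_lines(src_path, dst_path, key, chunk_lines=100_000):
    # Assumption: src_path holds one JSON object per line (JSON Lines).
    chunk_paths = []
    with open(src_path) as src:
        while True:
            # Read at most chunk_lines lines; only this chunk lives in memory.
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            records = [json.loads(line) for line in chunk if line.strip()]
            records.sort(key=lambda r: r[key])
            # Write the sorted chunk to its own temporary file.
            fd, path = tempfile.mkstemp(suffix=".jsonl")
            with os.fdopen(fd, "w") as out:
                for record in records:
                    out.write(json.dumps(record) + "\n")
            chunk_paths.append(path)

    files = [open(path) for path in chunk_paths]
    try:
        # Each stream lazily yields records from one sorted chunk file.
        streams = [(json.loads(line) for line in f) for f in files]
        with open(dst_path, "w") as dst:
            # heapq.merge keeps only one pending record per chunk in memory.
            for record in heapq.merge(*streams, key=lambda r: r[key]):
                dst.write(json.dumps(record) + "\n")
    finally:
        for f in files:
            f.close()
        for path in chunk_paths:
            os.remove(path)
```

A call such as `sort_json_lines("input.jsonl", "sorted.jsonl", key="id")` would sort by a hypothetical `id` field. With very many chunks the number of simultaneously open files can become a limit, so a real run may need a multi-pass merge.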
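As a rough illustration of that definition, the peak concurrency of a set of ranges can be computed with a sweep over start and end events. This is a Python sketch, not necessarily the approach the article takes with its tables; `max_concurrency` is a hypothetical name, and ranges are assumed to be half-open, so a range ending at time t does not overlap one starting at t.

```python
def max_concurrency(ranges):
    # ranges: iterable of (start_at, end_at) pairs with comparable values.
    events = []
    for start_at, end_at in ranges:
        events.append((start_at, 1))   # something starts
        events.append((end_at, -1))    # something ends

    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so an end is processed before a start (half-open assumption).
    events.sort()

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


print(max_concurrency([(1, 5), (2, 6), (5, 7)]))  # prints 2
```

The same event-based idea translates to SQL by unioning start and end rows and taking a running sum ordered by time.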

Setup

There are two tables with the following columns: id, start_at and end_at. …

File systems tend to expose creation and modification dates of each entry. AWS S3 is not a file system, but it exposes a “Last Modified” date, which is a bit confusing because an S3 object is not modifiable; it can only be overwritten.

Goal

The goal of the experiment is to figure out how it…

Adrian Macal

Software Developer, Data Engineer with solid knowledge of Business Intelligence. Passionate about programming.
