Building Hash Table, Part 2

Recently I built a draft hash table implementation that stores binary data in memory and can handle 200 million entries in 15 GB. That first version set a baseline for further improvements, which I am going to tackle in this article.

Bottleneck

The 15 GB memory footprint is a no-go. The goal is to store 1 billion entries, but I cannot afford 5 times more memory (roughly 75 GB at the current rate); it would simply be too expensive. Let’s think about offloading some data to disk. To decide what could live on the much slower disk, I need to understand how the memory is utilized.

Offloading

Let’s start simple and offload only the values, which account for 66% of the memory usage. That should free a lot of memory, and hopefully it won’t hurt performance too much. In the current implementation keys and values are stored as pairs in the same memory blocks, so the pairs have to be split into dedicated key blocks and value blocks, as sketched below.
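Here is a minimal sketch of such a split. I am assuming Go here, and every name (valueBlock, valueRef, put, get) is made up for illustration; this is not the project’s actual API. The point is only that a key row no longer carries the value bytes, just a small reference into a value block.

    package main

    import "fmt"

    // valueRef is what a key row stores instead of the value itself:
    // which value block holds the bytes and where they start.
    type valueRef struct {
        block  int
        offset int
        length int
    }

    // valueBlock packs raw value bytes one after another.
    type valueBlock struct {
        id   int
        data []byte
    }

    // put copies the value into the block and returns a reference to it.
    func (b *valueBlock) put(value []byte) valueRef {
        ref := valueRef{block: b.id, offset: len(b.data), length: len(value)}
        b.data = append(b.data, value...)
        return ref
    }

    // get reads the value back using its reference.
    func (b *valueBlock) get(ref valueRef) []byte {
        return b.data[ref.offset : ref.offset+ref.length]
    }

    func main() {
        block := &valueBlock{id: 0}
        ref := block.put([]byte("some binary value"))
        fmt.Printf("stored at %+v -> %q\n", ref, block.get(ref))
    }

With this layout, offloading values means moving whole value blocks to disk while the key blocks (and the small references) stay in memory.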

Storing data on disk will be handled by a file block, which is responsible for building the data in memory, moving it to a file, and serving the data directly from the file on demand.
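A rough sketch of what such a file block could look like, again assuming Go and using purely illustrative names (fileBlock, append, flush, read); the real API surely differs. It accumulates data in a memory buffer, writes the buffer to a file, and then answers reads straight from that file.

    package main

    import (
        "fmt"
        "os"
    )

    type fileBlock struct {
        buf  []byte   // in-memory staging area while the block is being built
        file *os.File // set once the block has been offloaded to disk
    }

    // append adds a value while the block still lives in memory
    // and returns its offset within the block.
    func (b *fileBlock) append(value []byte) int {
        offset := len(b.buf)
        b.buf = append(b.buf, value...)
        return offset
    }

    // flush moves the in-memory buffer to a file and frees the memory.
    func (b *fileBlock) flush(path string) error {
        if err := os.WriteFile(path, b.buf, 0o644); err != nil {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        b.file = f
        b.buf = nil // the memory is released; reads now go to disk
        return nil
    }

    // read serves a value either from memory or directly from the file.
    func (b *fileBlock) read(offset, length int) ([]byte, error) {
        if b.file == nil {
            return b.buf[offset : offset+length], nil
        }
        out := make([]byte, length)
        _, err := b.file.ReadAt(out, int64(offset))
        return out, err
    }

    func main() {
        b := &fileBlock{}
        off := b.append([]byte("value kept on disk"))
        if err := b.flush("block-0000.bin"); err != nil {
            panic(err)
        }
        v, _ := b.read(off, len("value kept on disk"))
        fmt.Printf("%q\n", v)
    }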

The important question is when the offload should be triggered. For now let’s hard-code it as “the number of value blocks exceeds 25”. Long term the threshold should be expressed as the amount of occupied memory or the number of kept entries.
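A tiny sketch of that hard-coded trigger, with made-up names:

    package main

    import "fmt"

    // Hard-coded for now; long term this should become a memory or entry budget.
    const maxValueBlocksInMemory = 25

    func shouldOffload(valueBlocksInMemory int) bool {
        return valueBlocksInMemory > maxValueBlocksInMemory
    }

    func main() {
        fmt.Println(shouldOffload(20)) // false
        fmt.Println(shouldOffload(26)) // true, time to move some values to disk
    }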

What am I going to offload once the threshold is crossed? Only the rows that were never accessed in read mode. Rows that have already been read are presumably “hot” and likely to be accessed again, so they stay in memory.
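One possible way to track read access and pick the never-read rows, again with illustrative names (row, wasRead, pickForOffload) rather than the real ones:

    package main

    import "fmt"

    type row struct {
        key     []byte
        value   []byte
        wasRead bool // flipped on the first read access
    }

    // markRead is called on every successful lookup.
    func markRead(r *row) {
        r.wasRead = true
    }

    // pickForOffload returns the rows that were never read; rows that were
    // read stay in memory because they are likely to be accessed again.
    func pickForOffload(rows []*row) []*row {
        var cold []*row
        for _, r := range rows {
            if !r.wasRead {
                cold = append(cold, r)
            }
        }
        return cold
    }

    func main() {
        rows := []*row{
            {key: []byte("a"), value: []byte("1")},
            {key: []byte("b"), value: []byte("2")},
        }
        markRead(rows[0]) // "a" was looked up, so it stays in memory
        for _, r := range pickForOffload(rows) {
            fmt.Printf("offload key %q\n", r.key)
        }
    }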

Benchmark

I ran the benchmark as follows. It inserts 500 million random key/value pairs in batches of 10 million. After each batch it reads 10 million random keys. Each batch reports the memory usage and runtime.
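For illustration, here is a scaled-down sketch of that loop in Go; a plain map stands in for the actual hash table and the constants are reduced so it can run on an ordinary machine.

    package main

    import (
        "fmt"
        "math/rand"
        "runtime"
        "time"
    )

    const (
        totalEntries = 5_000_000 // the real benchmark inserts 500 million
        batchSize    = 1_000_000 // the real benchmark uses batches of 10 million
        readsPerStep = 1_000_000 // the real benchmark reads 10 million keys per batch
    )

    func main() {
        table := map[uint64][]byte{} // stand-in for the actual hash table

        for inserted := 0; inserted < totalEntries; inserted += batchSize {
            start := time.Now()

            // Insert a batch of random key/value pairs.
            for i := 0; i < batchSize; i++ {
                table[rand.Uint64()] = []byte("random value")
            }

            // Read random keys after the batch; most lookups miss,
            // which is enough to measure read overhead.
            for i := 0; i < readsPerStep; i++ {
                _ = table[rand.Uint64()]
            }

            // Report memory usage and runtime for the batch.
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            fmt.Printf("batch done: heap=%d MB, took=%s\n",
                m.HeapAlloc/1024/1024, time.Since(start))
        }
    }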

Results

I ran the benchmark on two disk types (SSD and HDD). I also ran a variant that skips the reads entirely, in order to measure their impact on the overall benchmark duration. The code is available here.
