Recently I made a draft hash table implementation that stores binary data in memory and can handle 200 million entries in 15 GB. That first version established a baseline for further improvements, which I am going to tackle in this article.
The 15 GB memory footprint is a no-go. The goal is to store 1 billion entries, but I cannot afford 5 times more memory; it would simply be too expensive. Let’s think about offloading some data to disk. To decide what could live on the much slower disk, I need to understand how the memory is currently used:
- hash overhead (1 GB): the hash table structure, a simple array holding the indices of candidate list heads; currently the array size is proportional to the number of items, at 4 bytes per slot
- linked items (3 GB): the linked list structure, a simple array holding the list nodes of all entries; each node is 16 bytes, including a 4-byte pointer to the next node
- data (11 GB): the actual keys (800 MB) and values (10 GB); the keys are accessed very often, on every read and write, in order to compare them; the values are only read once a key comparison confirms equality
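The breakdown above can be sanity-checked with simple arithmetic from the stated numbers (200 million entries, 4-byte hash slots, 16-byte nodes); the per-entry sizes in the comments are derived from those figures, not taken from the implementation:

```python
# Back-of-the-envelope check of the memory breakdown above, using the
# numbers from the article.
ENTRIES = 200_000_000
GB = 1024 ** 3

hash_bytes = ENTRIES * 4          # one 4-byte list-head index per slot
node_bytes = ENTRIES * 16         # one 16-byte linked-list node per entry
key_bytes = 800 * 1024 ** 2       # 800 MB of keys (~4 bytes per key)
value_bytes = 10 * GB             # 10 GB of values (~50 bytes per value)

total_gb = (hash_bytes + node_bytes + key_bytes + value_bytes) / GB
print(f"{total_gb:.1f} GB")       # ~14.5 GB, close to the reported 15 GB
```

The values alone account for roughly two thirds of the total, which is what makes them the obvious offload candidate.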
Let’s start simple and offload only the values (66% of memory usage). It should free a lot of memory, and hopefully it won’t hurt performance too much. In the current implementation keys and values are stored as pairs in the same memory blocks; the pairs should be split into dedicated-purpose blocks.
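The split can be sketched as two parallel blocks sharing a slot index, so the value side can later move to disk without touching the keys. The class names here are illustrative, not the article’s API:

```python
# A minimal sketch of splitting key/value pairs into dedicated-purpose
# blocks; entry i lives at keys[i] / values[i].
class KeyBlock:
    """Keys stay in memory: they are compared on every read and write."""
    def __init__(self):
        self.keys = []

    def add(self, key: bytes) -> int:
        self.keys.append(key)
        return len(self.keys) - 1   # slot index shared with the value block

class ValueBlock:
    """Values are offload candidates: read only after a key match."""
    def __init__(self):
        self.values = []

    def add(self, value: bytes) -> int:
        self.values.append(value)
        return len(self.values) - 1
```

Because both blocks hand out the same index for a given entry, the value block can be swapped for a file-backed one transparently.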
Storing data on disk will be handled by a file block, which is responsible for building the data, moving it to a file, and serving the data directly from the file on demand.
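A minimal sketch of that file block lifecycle, assuming a build-in-memory phase followed by a one-shot move to disk (the names and layout are my assumptions for illustration):

```python
# Sketch of a "file block": build data in memory, move it to a file,
# then serve reads straight from the file via seek().
class FileBlock:
    def __init__(self):
        self.buffer = bytearray()   # building phase: data lives in memory
        self.offsets = []           # (offset, length) per record
        self.file = None

    def append(self, data: bytes) -> int:
        self.offsets.append((len(self.buffer), len(data)))
        self.buffer += data
        return len(self.offsets) - 1

    def offload(self, path: str) -> None:
        with open(path, "wb") as f:
            f.write(self.buffer)
        self.buffer = None          # memory is released after the move
        self.file = open(path, "rb")

    def read(self, index: int) -> bytes:
        offset, length = self.offsets[index]
        if self.file is None:       # not offloaded yet: serve from memory
            return bytes(self.buffer[offset:offset + length])
        self.file.seek(offset)      # served directly from the file
        return self.file.read(length)
```

Keeping the per-record offsets in memory means a read after offload costs a single seek plus a short sequential read.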
The important question is when to trigger the offload. Let’s hard-code it as “the number of value blocks exceeds 25”. Long term it should be expressed as an amount of occupied memory or a number of kept entries.
What am I going to offload once the threshold is crossed? Only the rows which were never accessed in read mode. The rows that have been read are likely “hot” and welcome to be accessed again, so they stay in memory.
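The trigger and the policy together can be sketched as follows, assuming each row carries a flag that is set on every read hit; the threshold constant and field names are illustrative:

```python
# Sketch of the offload decision: once the number of value blocks
# crosses the hard-coded threshold, pick only rows never read, since
# rows that have been read are treated as hot and stay in memory.
MAX_VALUE_BLOCKS = 25

def pick_rows_to_offload(rows):
    """rows: dicts with a 'was_read' flag, set on each read hit."""
    return [row for row in rows if not row["was_read"]]

def maybe_offload(value_blocks, rows):
    if len(value_blocks) <= MAX_VALUE_BLOCKS:
        return []                   # threshold not crossed yet
    return pick_rows_to_offload(rows)
```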
I ran the benchmark as follows. It inserts 500 million random key/value pairs in batches of 10 million. After each batch it reads 10 million random keys. Each batch reports the memory usage and runtime.
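The benchmark loop has this general shape; the sizes below are scaled down by a factor of a million so the sketch runs instantly, and a plain dict stands in for the actual hash table:

```python
import random
import time

# Scaled-down sketch of the benchmark: insert TOTAL random pairs in
# batches of BATCH, read BATCH random keys after each batch, and record
# per-batch stats (table size stands in for memory usage here).
TOTAL = 500   # stands in for 500 million
BATCH = 10    # stands in for 10 million

def run_benchmark(table):
    report = []
    inserted = []
    for _ in range(TOTAL // BATCH):
        t0 = time.monotonic()
        for _ in range(BATCH):                  # write phase
            key = random.randbytes(4)
            table[key] = random.randbytes(50)
            inserted.append(key)
        for _ in range(BATCH):                  # read phase after each batch
            table.get(random.choice(inserted))
        report.append((len(table), time.monotonic() - t0))
    return report
```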
I ran the benchmark on two disk types (SSD and HDD). I also ran it in a special mode that skips the reads entirely, in order to measure their impact on the overall benchmark duration. The code is available here.