Building a Hash Table, Part 2

Bottleneck

The 15 GB memory footprint is a no-go. The goal is to store 1 billion entries, but I cannot afford 5 times more memory; it would simply be too expensive. Let's think about offloading some data to disk. To figure out what could be moved to the much slower disk, I first need to understand how the memory is utilized.

  • hash overhead (1 GB): the hash table itself, a simple array holding the indices of candidate-list heads; currently the size of the array is proportional to the number of items, and each slot takes 4 bytes
  • linked items (3 GB): the linked-list structure, a simple array holding the list nodes of all entries; each node contains a 4-byte pointer to the next node and is 16 bytes in total
  • data (11 GB): the actual data for keys (800 MB) and values (10 GB); the keys are accessed very often, on every read and write, in order to compare them; the values are only read once a key comparison has confirmed equality (see the layout sketch after this list)
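The structures below are a minimal sketch of that layout. Only the 4-byte bucket slot, the 4-byte next pointer and the 16-byte node size come from the measurements above; the remaining node fields (cached hash, key offset, value offset) and all the names are my own guess at how the other 12 bytes could be used, not the actual implementation.

#include <stdint.h>
#include <stddef.h>

/* Hash overhead: one 4-byte slot per bucket, holding the index of the
   head of its candidate list (-1 marks an empty bucket). */
typedef int32_t bucket_t;

/* Linked item: 16 bytes per node; only the next pointer and the total
   size are given, the other fields are assumed. */
typedef struct {
    int32_t  next;          /* index of the next node, -1 ends the list   */
    uint32_t hash;          /* cached hash, avoids re-hashing on lookup   */
    uint32_t key_offset;    /* position of the key bytes in the key block */
    uint32_t value_offset;  /* position of the value bytes in the value block */
} node_t;

/* Data: raw key and value bytes kept in large contiguous blocks. */
typedef struct {
    bucket_t *buckets;      /* hash overhead, ~1 GB of 4-byte slots */
    node_t   *nodes;        /* linked items, ~3 GB of 16-byte nodes */
    uint8_t  *keys;         /* ~800 MB of key bytes                 */
    uint8_t  *values;       /* ~10 GB of value bytes                */
    size_t    bucket_count;
    size_t    node_count;
} hashtable_t;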

Offloading

Let’s start simple and offload only the values (about 66% of memory usage). That should free a lot of memory, and hopefully it won’t hurt performance too much. In the current implementation keys and values are stored as pairs in the same memory blocks, so the pairs have to be split into dedicated-purpose blocks, as sketched below.
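A minimal sketch of that split, assuming the value block becomes an append-only file on disk while the key block stays in memory; the storage_t type and the store_key/store_value helpers are hypothetical names, not the benchmarked code.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical split storage: keys stay in an in-memory block, values
   are appended to a file on disk and referenced by their offset. */
typedef struct {
    uint8_t *keys;        /* in-memory key block                  */
    size_t   keys_used;   /* bytes already taken in the key block */
    FILE    *values;      /* value block offloaded to disk        */
} storage_t;

/* Append a key to the in-memory block and return its offset; keys are
   touched on every read and write, so they must stay cheap to compare. */
static size_t store_key(storage_t *s, const void *key, size_t length) {
    size_t offset = s->keys_used;
    memcpy(s->keys + offset, key, length);
    s->keys_used += length;
    return offset;
}

/* Append a value to the disk file and return its offset; the value is
   fetched again only after a key comparison confirmed equality. A real
   implementation needs 64-bit offsets, since the value block is ~10 GB. */
static long store_value(storage_t *s, const void *value, size_t length) {
    fseek(s->values, 0, SEEK_END);
    long offset = ftell(s->values);
    fwrite(value, 1, length, s->values);
    return offset;
}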

Benchmark

I ran the benchmark as follows. It inserts 500 million random key/value pairs in batches of 10 million. After each batch it reads 10 million random keys. Each batch reports the memory usage and runtime.
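The loop below is a sketch of that procedure; the table_* functions, the 20-byte value, and the way random keys are generated are assumptions of mine, not the actual benchmark code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical hash table API standing in for the benchmarked code. */
typedef struct hashtable hashtable_t;
hashtable_t *table_create(void);
void table_insert(hashtable_t *t, uint64_t key, const uint8_t *value, size_t length);
int  table_read(hashtable_t *t, uint64_t key, uint8_t *value, size_t length);
size_t table_memory_usage(const hashtable_t *t);

#define TOTAL_PAIRS     500000000ULL  /* 500 million inserts in total  */
#define BATCH_SIZE       10000000ULL  /* batches of 10 million         */
#define READS_PER_BATCH  10000000ULL  /* random reads after each batch */

static uint64_t random_key(void) {
    return ((uint64_t)rand() << 32) ^ (uint64_t)rand();
}

int main(void) {
    hashtable_t *table = table_create();
    uint8_t value[20] = {0};
    uint64_t inserted = 0;

    while (inserted < TOTAL_PAIRS) {
        time_t start = time(NULL);

        /* Insert one batch of random key/value pairs. */
        for (uint64_t i = 0; i < BATCH_SIZE; i++, inserted++)
            table_insert(table, random_key(), value, sizeof value);

        /* Read a batch of random keys back from the table. */
        for (uint64_t i = 0; i < READS_PER_BATCH; i++)
            table_read(table, random_key(), value, sizeof value);

        /* Report memory usage and runtime after each batch. */
        printf("entries: %llu, memory: %zu bytes, batch took %ld s\n",
               (unsigned long long)inserted, table_memory_usage(table),
               (long)(time(NULL) - start));
    }

    return 0;
}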

Results

I ran the benchmark on two disk types (SSD and HDD). I also ran a special case in which the data is not read at all, to measure the impact of reads on the overall benchmark duration. The code is available here.
