Building Hash Table, Part 3

3 min readApr 30, 2021

Dealing with a lot of key/value pairs is not easy. First, huge data does not fit memory; secondly, disk offloading slows down. It is already the third iteration of my playground. Let’s sum up what I built so far.

Memory management

It’s very important part of the system. Because the system relies on huge memory allocation, I decided to abandon garbage collection as much as possible. The key/value binary data resides in one of the following locations:

in-memory block, stored in continuous byte array; data is appended or modified in place; may be allocated; extraction returns just reference to it
file builder block; stored in continuous byte array; data is only archived; it may not be allocated; extraction returns just reference to it
file disk block, stored in pure file; data is dumped using the builder; it cannot be appended or modified; extraction returns transient read data

All three locations are transparently managed by high level component and utilizes memory pool to efficiently reuse already allocated memory.

MemoryPool:
- block(index) -> MemoryBlock
- builder(index) -> FileBuilderMemoryManager:
- allocate(size) -> Allocation
- extract(node) -> Tree | Value
- archive(node)MemoryBlock:
- allocate(size) -> Allocation
- extract(node) -> Tree | ValueFileBuilder:
- allocate(size) -> Allocation
- extract(node) -> Tree | Value
- flush() -> FileBlockFileBlock:
- extract(node) -> Tree | Value

Hashing

The core concept of hash table relies on hashing and buckets. The number of buckets is always a power of 2, therefore it is very important that the hashing equally distributes values. Additionally it brings one benefit that re-hashing may be done in place, because each bucket can only be split into two entries.

# Let's imaging 4 hashes:A: 0b.00011001.11101000.10010010.11101100 ->    434.672.364
B: 0b.10100101.10111011.10010010.11101100 -> -1.514.433.812
C: 0b.10001010.01110110.10010010.11101100 -> -1.971.940.628
D: 0b.11111100.10000110.10010010.11101100 ->    -58.289.428# 16 bits bucket size will put all hashes into one bucket10010010.11101100: A, B, C, D# Extending buckets into 17 bits, splits bucket into 2 buckets0.10010010.11101100: A, C, D
1.10010010.11101100: B# Further extending into 18 bits, splits buckets as following00.10010010.11101100: A
11.10010010.11101100: B
10.10010010.11101100: C, D# 19, 20 bits does not change distribution, but 21 bits does01000.10010010.11101100: A
11011.10010010.11101100: B
10110.10010010.11101100: C
00110.10010010.11101100: D

The buckets are abstracted as continuous array. In order to easy extend it, but not reallocate memory, it is built out of smaller same size arrays. Each bucket entry is just a 4 bytes number pointing at tree node or nothing (zero).

Trees

The hash entry is not implemented as linked list anymore. It became advanced structure containing information about it size, key data and value descriptors. Such structure is called a tree. If any key/value was accessed by get operation, it is marked as accessed. This information is stored per tree and is used in the archiving process.

Tree Node:
- Block       # 2 bytes
- Offset      # 4 bytes
- Accessed    # 2 bytesTree Structure:
- n           # 1 bytes, number of entries
- Lengths     # n bytes, the length of each key
- Descriptors # n*8 bytes, describes each value
  - Block     # 2 bytes, block where value resides
  - Offset    # 4 bytes, offset of the value in the block
  - Length    # 2 bytes, the length of value in the block
- Keys        # binary data of all n keys, one by one

The available memory cannot hold all key/value pairs. It is important that some data is offloaded into disk when memory is exhausted. Such process is called archiving and is triggered when memory in blocks exceed defined threshold. When it happens, the system tries to get rid of 20% of data in the following order: non-used values, used values, non-used trees, used trees.

Once archiving is finished, the memory should be rewritten to reclaim wasted space. It requires scanning all trees and moving affected trees/values from reclaimed block to other blocks.

Benchmark

One hour run generated following results. Each run had assigned expected memory usage. I measured how many rows I was able to insert, how much actual memory was used and how was the speed in thousands operations per second. The benchmark was running batches in which 1 million random rows were inserted/updated and 1 million rows were randomly accessed. The key size was 4 bytes.

memory | rows | used   | speed
---------------------------------
 4096M | 325M |  5057M | 198 kop/s
 8192M | 434M |  9153M | 246 kop/s
16384M | 523M | 17343M | 304 kop/s

Codebase

As always the code is available here.