Data structures are part of the software we use every day. The most common one is the key/value map, used to get or set values by key. Today I am going to play a bit with it on a bigger data set.
I am not going to build a replacement for the existing core dictionaries, maps or hash tables. As a final outcome I would like to have a data structure that behaves like a hash table and can handle 1 billion keys. That is not going to happen in this article.
I am starting with a very simple approach: memory only. Let's build a structure with one basic operation that adds or updates a key/value pair in a very raw binary format:
void Set(byte[] key, byte[] value)
I am very afraid of garbage collection, so I am going to work only with pure arrays and avoid creating a lot of reference objects. The structure will handle three types of memory allocations:
- data store: the key/values will be stored in contiguous 64 MB data blocks to avoid allocating many small arrays
- candidate list: details of the stored pairs, such as the next candidate, the block number and the offset within the block
- hash array: the heart of key distribution, pointing at the first element in the candidate list
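The three allocations above can be sketched as plain fields. This is an illustrative layout only (the names and sizes are my assumptions, not the original implementation), written in Java for the sake of a runnable example; note the candidate list is kept as parallel int arrays rather than per-entry objects, matching the no-reference-objects goal:

```java
// Hypothetical field layout for the structure described above.
// Struct-of-arrays keeps GC pressure low: no per-entry objects.
class BigMapFieldsSketch {
    static final int BLOCK_SIZE = 64 * 1024 * 1024; // 64 MB data blocks

    // data store: large byte blocks, grown one block at a time
    byte[][] blocks = new byte[1][];

    // candidate list: one slot per stored pair, split across int arrays
    int[] next    = new int[1024]; // index of the next candidate, -1 = end
    int[] blockNo = new int[1024]; // which block holds the pair
    int[] offset  = new int[1024]; // offset of the pair within that block
    int[] keyLen  = new int[1024]; // key length in bytes
    int[] valLen  = new int[1024]; // value length in bytes

    // hash array: head index into the candidate list, one slot per bucket
    int[] buckets = new int[17];   // a prime size, rebuilt as the map grows
}
```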
Let's imagine I have already allocated a 64 MB byte array. If my key and value are also byte arrays, I can store them by simply copying them into the big array. I only need to know where to copy them to, and remember that location.
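Appending into the block might look like the following sketch (in Java for illustration; `append`, `block` and `used` are hypothetical names). The key and value bytes are copied back to back, and the starting offset is returned so it can be remembered:

```java
// Minimal sketch: copy a key/value pair into the current data block.
class DataStoreSketch {
    static byte[] block = new byte[64 * 1024 * 1024]; // one 64 MB block
    static int used = 0; // bytes written so far

    // Copies key then value into the block, back to back.
    // Returns the offset where the pair starts (to be remembered elsewhere).
    static int append(byte[] key, byte[] value) {
        int start = used;
        System.arraycopy(key, 0, block, used, key.length);
        used += key.length;
        System.arraycopy(value, 0, block, used, value.length);
        used += value.length;
        return start;
    }
}
```

A real implementation would also have to start a new 64 MB block when the current one cannot fit the pair; that bookkeeping is omitted here.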
Knowing where the key/value was copied, I need to store the block number, the offset and the key/value lengths. Because a hash table may have collisions, I also need to "reference" the next candidate (hence the name candidate list).
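Linking a new entry into its bucket's chain can be sketched like this (again a Java illustration with assumed names and fixed sizes, not the original code). The new entry's `next` points at the previous head of the bucket, and the bucket then points at the new entry:

```java
// Sketch of recording a stored pair in the candidate list and linking it
// into a bucket's chain of candidates.
class CandidateListSketch {
    static final int EMPTY = -1;

    int[] next    = new int[16]; // next candidate in the same bucket
    int[] block   = new int[16]; // block holding the pair
    int[] offset  = new int[16]; // offset within that block
    int[] keyLen  = new int[16];
    int[] valLen  = new int[16];
    int[] buckets = new int[17]; // head index per bucket
    { java.util.Arrays.fill(buckets, EMPTY); }
    int count = 0;

    // Records a pair's location and prepends it to the bucket's chain.
    int add(int bucket, int blockNo, int off, int kLen, int vLen) {
        int i = count++;
        block[i] = blockNo;
        offset[i] = off;
        keyLen[i] = kLen;
        valLen[i] = vLen;
        next[i] = buckets[bucket]; // old head becomes the next candidate
        buckets[bucket] = i;       // new entry becomes the head
        return i;
    }
}
```

A lookup would walk the chain from the bucket's head, comparing the stored key bytes against the requested key at each candidate.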
The final piece is the hash array, which points into the candidate list. The index into the hash array is the hashed key modulo the array length, and the value stored there is the index of the first candidate (the head of the chain). It is implemented as an array of integers, sized to a prime number, and needs to be rebuilt/rehashed when the number of stored items crosses the array size.
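The bucket selection step can be sketched as below (a Java illustration; the hash function here is an arbitrary stand-in, not the one used in the original implementation). Masking off the sign bit keeps the modulo result non-negative:

```java
// Sketch of mapping a key to a bucket: hash the key bytes, then take the
// result modulo the (prime) table length.
class HashArraySketch {
    static int bucket(byte[] key, int tableLength) {
        int h = java.util.Arrays.hashCode(key); // stand-in hash function
        return (h & 0x7fffffff) % tableLength;  // non-negative bucket index
    }
}
```

When the item count crosses the table length, a larger prime-sized table is allocated and every stored key is re-bucketed, which is the rebuild/rehash step mentioned above.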
I ran a benchmark inserting 200 million random key/value pairs. It took around 3 minutes and consumed 15 GB of RAM. The hash table had a very low number of collisions (at most 8, measured by the maximum length of the candidate lists). The data store reached 11 GB, and the overhead of the hash array and candidate list was around 4 GB.
The implementation is available here.