Learning algorithms: Count estimators
Dealing with large data sets can be challenging. I am going to show you how count estimators can help you accurately estimate frequencies even with limited memory and computing power.
The pigeonhole principle is a fundamental concept that has stuck with me since I first heard about it over 20 years ago in a math class. The principle states that if you have more objects than places to put them, at least one place must hold more than one object. This simple idea has broad implications in computer science, particularly in the design of hash functions, Bloom filters, and other probabilistic data structures.
Count estimators are algorithms that provide an approximate count of the number of occurrences of a particular item in a dataset, without actually counting them. We will explore two popular count estimators and how they can be used to analyze network traffic data in real-time.
We’ll start by examining the Hash Map, a commonly used data structure that provides an ideal solution for exact counting. However, as the size of the dataset grows, memory limitations and processing power can make it difficult to use. To address these challenges, we’ll explore two popular count estimators: Count-Min Sketch and Count Sketch. These algorithms use hash functions to generate probabilistic count estimates, making them much more memory-efficient and scalable.
Finally, we will demonstrate how both algorithms can be used to monitor real-time network traffic to detect hosts transmitting a large amount of data. We will explore how to implement these algorithms and demonstrate their effectiveness in handling large and complex network traffic datasets.
Hash Map
The Hash Map is a data structure that lets us insert and retrieve key-value pairs in (amortized) constant time, at the cost of O(n) memory. This makes it an ideal solution for situations where we need to manipulate data quickly and efficiently. For example, let’s say we have five items to store in our Hash Map: Cat (1), Horse (15), Dog (5), Whale (2), and Bird (7).
To insert all five items (Cat, Horse, Dog, Whale, and Bird) into a Hash Map, we can create an array of size m = 3. For each item, we compute the hash of the key and then use the hash value to compute the array index by taking the hash modulo m. This allows us to find the “pigeonhole” for each element. Ideally, good hashing will evenly distribute all items among the table, meaning that some elements may be placed in the same bucket, some buckets will remain empty, and most items will be alone in their respective buckets.
Since some elements may share the same bucket in a Hash Map, we need to store not only the value but also the key to be able to distinguish them. This is known as a collision, and it means that we may need to traverse an entire list to compare the keys and determine if a key matches the one we are looking for. Interestingly, even in buckets that contain only one item, we still need to compare keys, because the hash alone is not enough to determine equality.
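To make this concrete, here is a minimal sketch of such a chained hash table in Python. It is purely illustrative and not the code from the repository:

```python
# a chained hash table with m = 3 buckets; each bucket stores (key, value) pairs
m = 3
buckets = [[] for _ in range(m)]

def put(key, value):
    index = hash(key) % m              # the "pigeonhole" for this key
    for entry in buckets[index]:
        if entry[0] == key:            # collision handling: compare full keys
            entry[1] += value
            return
    buckets[index].append([key, value])

def get(key):
    index = hash(key) % m
    for stored_key, stored_value in buckets[index]:
        if stored_key == key:          # the hash alone is not enough to decide equality
            return stored_value
    return 0

for key, value in [("Cat", 1), ("Horse", 15), ("Dog", 5), ("Whale", 2), ("Bird", 7)]:
    put(key, value)

print(get("Horse"))  # 15 -- exact, because the key is stored next to the value
```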
The Hash Map is highly scalable and can efficiently handle millions of key-value pairs while still maintaining excellent performance. However, memory is consumed in three places: the hashing array, the keys stored in the buckets, and the values themselves. Even though the values are all we actually care about, we end up storing much more information than that.
For example, if we want to store one million pairs of IP addresses and the corresponding total number of sent bytes, we would need around 4MB for the hash array (1M x 4 bytes), 4MB for the values (1M x 4 bytes), and potentially 15MB for the keys (1M x 15 bytes). That adds up to roughly 23MB of our precious memory. Despite this overhead, the Hash Map is still an efficient and highly effective data structure for storing and manipulating key-value pairs.
Count-Min Sketch
What if we removed all keys from the Hash Map structure and stored all values together in each bucket? This would significantly reduce memory usage, but it would also sacrifice exactness: without keys we can no longer tell colliding items apart, although we can still update and query the values. The result is the very first version of the Count-Min Sketch (depth = 1 and width = 3).
In the visualization, we use only 3 buckets, each storing nothing but the summed-up values as a single integer. This means that we no longer know exactly how many of each key we have. When we query the structure for a particular key, the value returned is the sum of all values within that key’s bucket. For example, when we ask for the total count of Horse (15), the structure returns 24, which is more than the actual value of 15. The same incorrect answer would be returned when querying for Whale (2), Bird (7), or any other key, present or not, whose Hash A % 3 == 0.
The solution to improve this approach is relatively simple: create another hash table with the same number of buckets but a different hash function. By using a different hash function, we expect to distribute the items differently among the buckets. Let’s see the second hash table in action.
After implementing the second hash table, we can observe that the distribution of items among the buckets differs from before. The items are still combined randomly, but in a different way than with the previous hash function. By adding a third hash table, we end up with a Count-Min Sketch of depth = 3 and width = 3: depth describes how many hash functions are used, and width describes the number of buckets per hash function.
To obtain the final count for a specific key, we can follow this approach. Let’s say we are looking for the key Bird (7). For each row, we calculate the hash value of Bird (7) using that row’s hash function and take the result modulo the number of buckets (3 in our example). Then we take the minimum of the counts stored in the corresponding buckets. This minimum is our estimated count for the key: Bird (7) = min(24, 15, 27) = 15.
The reason for taking the minimum is that every cell in which Bird (7) lands contains a number that is at least as large as the true count of Bird (7), but possibly much larger. Taking the minimum reduces the effect of overestimation and gives a more accurate estimate.
Here is the complete list of estimates:
- Cat (1): min(1, 15, 3) = 1
- Horse (15): min(17, 24, 27) = 17
- Dog (5): min(5, 15, 27) = 5
- Whale (2): min(24, 17, 3) = 3
- Bird (7): min(24, 15, 27) = 15
It is important to note that the Count-Min Sketch can perform better with higher depth and width, which comes at the cost of increased memory usage. Increasing depth also increases CPU usage because more hashes have to be computed; roughly speaking, a larger width shrinks the expected overestimation, while a larger depth shrinks the probability of returning a badly overestimated value. We can choose the depth and width based on the accuracy required and the available resources.
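As a rule of thumb from the original Count-Min Sketch analysis, a width of about e/ε and a depth of about ln(1/δ) keep the overestimation below ε times the total count with probability at least 1 − δ. A tiny helper (hypothetical, not part of the repository) makes that trade-off explicit:

```python
import math

def sketch_dimensions(epsilon: float, delta: float) -> tuple[int, int]:
    """Depth and width so that the overestimation stays below epsilon * (total count)
    with probability at least 1 - delta."""
    depth = math.ceil(math.log(1.0 / delta))
    width = math.ceil(math.e / epsilon)
    return depth, width

# e.g. at most 0.1% of the total count added on top, failing in 1 of 1000 queries
print(sketch_dimensions(epsilon=0.001, delta=0.001))  # (7, 2719)
```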
Count Sketch
The Count Sketch algorithm is similar to the Count-Min Sketch, but it addresses the issue of multiple colliding items in the same bucket inflating the final estimate. To solve this problem, Count Sketch introduces a form of randomization: each item is added with a randomly chosen sign, so contributions from colliding items tend to cancel out and infrequent items are likely to sum close to zero. In this way, it provides a more accurate estimate of the very frequent items while using a compact data structure with the same memory requirements. Let’s visualize the previous set of animals in the new data structure.
The visualization of data in Count-Min Sketch and Count Sketch is quite similar, with two main differences. First, in the Count Sketch some items are placed at the bottom of a cell to indicate their negative contribution to the cell’s value. Second, the final value of a cell may be negative, positive, or zero, even if some elements are present in it. This helps to balance out the values. To obtain the final result, we find the candidate cell in each row (in the same way as in the Count-Min Sketch), multiply each cell value by the item’s sign in that row, and take the median (not the minimum) of those candidates.
Here is the complete list of estimates:
- Cat (1): median(-11, -1, 1) = -1
- Horse (15): median(3, 6, 13) = 6
- Dog (5): median(-3, 5, 11) = 5
- Whale (2): median(-13, -6, 1) = -6
- Bird (7): median(-6, -3, 11) = -3
Count-Min Sketch vs Count Sketch
Both the Count-Min Sketch and Count Sketch data structures are capable of estimating values. However, in the examples above Count Sketch did not perform well because the dataset was so small; these structures are better suited for larger datasets. There are some key observations worth noting:
- Count-Min Sketch never underestimates. If it returns x, it means that there were no more than x elements in the dataset, and possibly fewer. If it returns 0, it is safe to assume that the element was not seen in the dataset.
- Count Sketch, on the other hand, may overestimate or underestimate. If it returns 0, we cannot determine if the element was seen or not in the dataset.
Implementation
Both implementations rely on multiple hash functions. We can create a single hash family, parameterized by depth and width, that can hash any string key. The code uses SHA1 to generate a digest, which is then reduced modulo the width.
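Here is one way such a hash family could look in Python. I am assuming SHA1 from hashlib, salted with the row index, which matches the description above but is not necessarily identical to the repository code:

```python
import hashlib

class HashFamily:
    """depth hash functions, each mapping a string key to a bucket in [0, width)."""

    def __init__(self, depth: int, width: int):
        self.depth = depth
        self.width = width

    def hash(self, row: int, key: str) -> int:
        # salt the key with the row index so every row behaves like an independent hash
        digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.width
```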
Having a hash function family, we can approach the Count-Min Sketch. The implementation relies on a 2D array of shape (depth, width) and allows us to increment a key and estimate its count.
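A minimal version consistent with that description, building on the HashFamily sketch above, could look like this:

```python
class CountMinSketch:
    def __init__(self, depth: int, width: int):
        self.hashes = HashFamily(depth, width)
        self.table = [[0] * width for _ in range(depth)]  # one row of counters per hash

    def increment(self, key: str, value: int = 1) -> None:
        for row in range(self.hashes.depth):
            self.table[row][self.hashes.hash(row, key)] += value

    def estimate(self, key: str) -> int:
        # every row overestimates, so the minimum is the tightest answer
        return min(self.table[row][self.hashes.hash(row, key)]
                   for row in range(self.hashes.depth))
```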
The implementation of Count Sketch is very similar to the previous one. It introduces an additional sign bit derived from the hash function, which is used as a multiplier.
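A matching sketch, again building on the HashFamily above, with the sign derived from an extra salted hash:

```python
import hashlib

class CountSketch:
    def __init__(self, depth: int, width: int):
        self.hashes = HashFamily(depth, width)
        self.table = [[0] * width for _ in range(depth)]

    def _sign(self, row: int, key: str) -> int:
        # an extra salted hash bit decides whether this key adds or subtracts
        digest = hashlib.sha1(f"sign:{row}:{key}".encode("utf-8")).digest()
        return 1 if digest[0] & 1 else -1

    def increment(self, key: str, value: int = 1) -> None:
        for row in range(self.hashes.depth):
            self.table[row][self.hashes.hash(row, key)] += self._sign(row, key) * value

    def estimate(self, key: str) -> int:
        candidates = sorted(
            self._sign(row, key) * self.table[row][self.hashes.hash(row, key)]
            for row in range(self.hashes.depth))
        return candidates[len(candidates) // 2]  # median (upper median for even depth)
```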
Network Traffic Analyzer
Let’s say we have a hypothetical network in which we aim to identify hosts that generate an unusually large amount of traffic. This information can help optimize network performance, especially when some hosts produce much more traffic than others. We can assume a threshold for what counts as “too much” traffic. To build a realistic simulation, we need to define and implement several components, such as the Network Packet or the Traffic Generator. These components work together to generate the data we will later use to identify the culprits.
The Network Packet data class represents a packet sent between two hosts, including information about the source, destination, and size of the packet. The Network data class holds all the hosts in the network.
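In Python they could be as simple as two dataclasses; the field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class NetworkPacket:
    source: str        # IP address of the sending host
    destination: str   # IP address of the receiving host
    size: int          # packet size in bytes

@dataclass
class Network:
    hosts: list[str]   # all known IP addresses in the simulated network
```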
To generate a list of known hosts, the Network Generator component randomly picks some distinct IP addresses.
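One possible sketch of that generator, building on the data classes above, draws random IPv4 addresses until the requested number of distinct hosts is reached:

```python
import random

def generate_network(count: int) -> Network:
    # keep drawing random IPv4 addresses until we have `count` distinct hosts
    hosts: set[str] = set()
    while len(hosts) < count:
        hosts.add(".".join(str(random.randint(1, 254)) for _ in range(4)))
    return Network(hosts=sorted(hosts))
```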
To make the simulation more realistic, we can use the Web Traffic Generator component to generate packets where a few hosts are designed to serve web traffic to a large number of clients. Similarly, we can use the Uniformed Traffic Generator component to simulate regular traffic among all the known hosts.
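Both generators could be expressed as plain Python generators yielding NetworkPacket objects. The packet sizes and the exact traffic shape below are my assumptions, not taken from the repository:

```python
import random

def web_traffic(network: Network, servers: list[str], packets: int):
    # a handful of servers exchange data with randomly chosen clients
    for _ in range(packets):
        client = random.choice(network.hosts)
        server = random.choice(servers)
        source, destination = random.choice([(client, server), (server, client)])
        yield NetworkPacket(source, destination, size=random.randint(64, 1500))

def uniform_traffic(network: Network, packets: int):
    # regular traffic: any known host may talk to any other known host
    for _ in range(packets):
        source, destination = random.sample(network.hosts, 2)
        yield NetworkPacket(source, destination, size=random.randint(64, 1500))
```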
We can even create a fancier Mixed Traffic simulator that combines Web Traffic with Uniformed Traffic.
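A possible sketch, with a hypothetical web_share parameter controlling the proportion of web traffic:

```python
import random

def mixed_traffic(network: Network, servers: list[str], packets: int, web_share: float = 0.8):
    # interleave web traffic with regular host-to-host traffic
    web = web_traffic(network, servers, packets)
    uniform = uniform_traffic(network, packets)
    for _ in range(packets):
        yield next(web) if random.random() < web_share else next(uniform)
```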
With the stage set for benchmarking, let’s think about what we could actually benchmark. We could execute tests covering the following dimensions (see the sketch after the list):
- Traffic Type: Mixed Traffic vs Web Traffic
- Algorithms: Hash Map, Count Sketch, Count Min Sketch
- Depths: from 1 to 10
- Widths: 100, 200, 500, 1k, 2k, 5k, 10k, 20k, 50k, 100k, 200k, 500k
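The full grid of runs can then be produced mechanically, for example like this; the runner that actually feeds the generated traffic into each structure is omitted:

```python
import itertools

traffic_types = ["web", "mixed"]
algorithms = ["hash-map", "count-min-sketch", "count-sketch"]
depths = range(1, 11)
widths = [100, 200, 500, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000, 200_000, 500_000]

# every combination of the four dimensions becomes one benchmark run
for traffic, algorithm, depth, width in itertools.product(traffic_types, algorithms, depths, widths):
    print(f"run traffic={traffic} algorithm={algorithm} depth={depth} width={width}")
```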
Benchmark
To see both data structures in action, we can benchmark them within the Network Traffic Analyzer. I prepared two quite similar benchmarks:
- web network benchmark — 5 web servers and around 1M clients transferring data to and from the servers only; it generates 10M packets.
- mixed network benchmark — 5 web servers and around 1M clients transferring data to and from the servers while also exchanging regular traffic among themselves; it also generates 10M packets.
The goal of the benchmark is to detect heavy hitters among all IP addresses (based on the sum of all incoming and outgoing bytes). The big assumption is that the 5 web servers will stick out and their actual traffic will be easy to expose, and that we could do this in real time. In the results section we will focus only on the first test.
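Putting the pieces together, the detection itself could look roughly like the sketch below, built on the hypothetical classes from the previous sections; the threshold and the scaled-down sizes are arbitrary choices for illustration:

```python
def detect_heavy_hitters(packets, sketch, threshold_bytes: int) -> set[str]:
    # charge every packet to both endpoints, then flag IPs whose
    # estimated total traffic crosses the threshold
    heavy = set()
    for packet in packets:
        for ip in (packet.source, packet.destination):
            sketch.increment(ip, packet.size)
            if sketch.estimate(ip) >= threshold_bytes:
                heavy.add(ip)
    return heavy

network = generate_network(10_000)               # scaled-down illustration
servers = network.hosts[:5]
sketch = CountMinSketch(depth=5, width=20_000)
print(detect_heavy_hitters(web_traffic(network, servers, 100_000), sketch, threshold_bytes=1_000_000))
```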
Results
The benchmark produces a few metrics:
- web network client metric — the maximum estimated number of bytes sent/received across the 1 million clients that generated regular web traffic
- web network server metric — the minimum estimated number of bytes sent/received across the 5 servers that generated heavy traffic
- web network other metric — the maximum estimated number of bytes sent/received across 100 clients that never sent or received any data
The first metric shows the estimated number of bytes transferred by the heaviest client IP in the network, divided by the exact number (taken from the Hash Map) — so the ideal value of the metric is 1. The following charts display the metric for various combinations of depths and widths. The Count Sketch algorithm appears to provide more accurate estimates, but requires more memory before it starts estimating without false positives. In contrast, the Count-Min Sketch algorithm works well with less memory, but tends to overestimate more often. The false positives and overestimations are caused by the heavy traffic generated by the 5 servers.
The second metric does the same for the smallest server IP in the network: its estimated traffic divided by the exact number (taken from the Hash Map), so the ideal value is again 1. Looking at the chart, we can see that both solutions were quite accurate from the very beginning.
The last metric, related to unseen IPs, demonstrates that both sketches may report traffic even when nothing happened. The value is expressed as a number of bytes, and we need to remember that the expected value is 0. Count Sketch does much better overall, but has a slightly larger region of false positives.
Conclusion
The article highlights that both Count Sketch and Count-Min Sketch structures can be implemented with relative simplicity, and that in the given benchmark, these structures performed well in detecting heavy hitters. However, the choice of the structure and its parameters largely depends on the specific use case and goals of the analysis. The key to using these structures effectively is understanding how to optimize their parameters for a particular scenario.
The entire code from the experiment is available on GitHub: https://github.com/amacal/learning-algorithms/tree/count-estimators