Aggregating Non-Additive Percentiles

Adrian Macal
4 min read · Jun 8, 2023


Have you ever faced the challenge of computing the median of one billion rows? What if you could pre-aggregate it, at the cost of some accuracy?

Every day, millions of transactions are created in various systems, each with multiple numerical attributes, like amounts. To make sense of this vast volume of data, we often rely on summarized statistics exposed in data models. The mean, the maximum, and the minimum are all straightforward to calculate, even for large datasets. But what about percentiles, especially the median?

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group falls. For example, the 20th percentile is the value below which 20% of the observations may be found. The median is the 50th percentile, the middle point where half of the observations are below, and half are above.

The challenge comes in computing the median for very large datasets, like a billion rows. It is hard because the median is not an algebraic aggregate function: we can't divide the dataset into smaller chunks, compute the median of each, and combine those results into the median of the whole dataset. This inability to pre-aggregate the data complicates things and requires all data points to be accessible in memory at once.

Building an exact solution is computationally expensive and, in the case of a billion rows per day, practically impossible. Nonetheless, let’s try to shape it.

Imagine a straightforward scenario where we collected the following values in a day: [164, 215, 273, 331, 387, 440, 496, 550, 609, 667, 722, 773, 823, 831]. If we aim to compute the median, we need to find the value in the middle. With our 14 values, the median will be the average of the 7th and 8th elements: (496 + 550) / 2 = 523. If we had another day of data, we would need to combine both lists, sort them, and execute the same algorithm.

  • Row 1: [164, 215, 273, 331, 387, 440, 496, 550, 609, 667, 722, 773, 823, 831]
  • Row 2: [170, 224, 283, 345, 400, 460, 520, 581, 640, 702, 760]

The medians of the two rows are different (523 and 460), and there is no way to combine just those two numbers into the overall median. If we merge both lists, sort them, and take the 13th element out of 25, we get a median of 496. We could follow this method even for one million rows, although it would require hefty list concatenation, extensive sorting, and pinpointing the correct element.
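A minimal Python sketch of the exact approach (standard library only; the variable names are mine) shows why the per-row medians alone are not enough:

```python
from statistics import median

row_1 = [164, 215, 273, 331, 387, 440, 496, 550, 609, 667, 722, 773, 823, 831]
row_2 = [170, 224, 283, 345, 400, 460, 520, 581, 640, 702, 760]

print(median(row_1))          # 523.0, the average of the 7th and 8th of 14 values
print(median(row_2))          # 460, the middle of 11 values
print(median(row_1 + row_2))  # 496, requires pooling and sorting all raw values again
```

No arithmetic on 523 and 460 alone reproduces 496; the raw values themselves are needed, which is exactly what makes the median non-additive.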

Estimation Instead of Exactness

The alternative approach is to estimate, under the small assumption that the data is similarly distributed in every pre-aggregated row. The algorithm requires precomputing a few percentiles from the entire dataset. With a step size of 20, we get the following values: p0 = 164, p20 = 275, p40 = 420, p60 = 565, p80 = 712, and p100 = 831. Let's visualize both rows in the computed ranges.

[Figure: both rows with their values placed within the intervals]
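These boundary values could be precomputed with any percentile function. Here is a sketch assuming numpy; the exact results depend on the interpolation method the library uses, so they may differ slightly from the rounded values quoted above:

```python
import numpy as np

row_1 = [164, 215, 273, 331, 387, 440, 496, 550, 609, 667, 722, 773, 823, 831]
row_2 = [170, 224, 283, 345, 400, 460, 520, 581, 640, 702, 760]

# Percentiles of the entire dataset, taken with a step of 20.
boundaries = np.percentile(row_1 + row_2, [0, 20, 40, 60, 80, 100])
print(boundaries)  # close to the rounded values used above
```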

What we end up with is called a histogram: a representation that organizes data into intervals (bins) and counts the occurrences in each interval. In our case we get:

  • Row 1: [3, 2, 3, 2, 4]
  • Row 2: [2, 3, 2, 3, 1]

What we now have are fixed-size arrays that we can aggregate element-wise, exactly as we would add vectors: Row 1 + Row 2 = [3+2, 2+3, 3+2, 2+3, 4+1] = [5, 5, 5, 5, 5]. Coincidentally, every bin ends up with 5 items.

[Figure: the aggregated rows]
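A minimal sketch of the binning and the element-wise aggregation (plain Python; the helper name to_histogram is my own, and the boundaries are the rounded ones from above):

```python
from bisect import bisect_right

boundaries = [164, 275, 420, 565, 712, 831]  # precomputed percentile boundaries

def to_histogram(values, boundaries):
    """Count how many values fall into each bin defined by the boundaries."""
    counts = [0] * (len(boundaries) - 1)
    for value in values:
        # Find the bin index; clamp it so p0 and p100 land in the outer bins.
        index = min(max(bisect_right(boundaries, value) - 1, 0), len(counts) - 1)
        counts[index] += 1
    return counts

row_1 = [164, 215, 273, 331, 387, 440, 496, 550, 609, 667, 722, 773, 823, 831]
row_2 = [170, 224, 283, 345, 400, 460, 520, 581, 640, 702, 760]

hist_1 = to_histogram(row_1, boundaries)  # [3, 2, 3, 2, 4]
hist_2 = to_histogram(row_2, boundaries)  # [2, 3, 2, 3, 1]

# The histograms are additive: aggregation is a plain element-wise sum.
print([a + b for a, b in zip(hist_1, hist_2)])  # [5, 5, 5, 5, 5]
```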

Having the rows aggregated, you can say that 5 items out of 25 are lower than 275, or that the 16th through 20th elements fall between 565 and 712. How could we determine the median from such an aggregation? The median is the 13th element, which sits in the 3rd bin (420, 565). That bin starts with the 11th item and ends with the 15th. We can apply interpolation and compute the proportional distance of the 13th item from the lower and upper bounds; in this case it lands exactly in the middle: (420 + 565) / 2 = 492.5. Is it correct? No! The exact value is 496. Is it acceptable? Maybe. It depends on your use case.
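And a sketch of the estimation itself. The function name and the rank convention are mine, chosen so that the median of 25 items corresponds to the fractional rank 12.5; other conventions shift the result slightly:

```python
def estimate_percentile(counts, boundaries, fraction):
    """Estimate a percentile from an aggregated histogram by linear interpolation."""
    target = fraction * sum(counts)  # fractional rank of the requested percentile
    cumulative = 0
    for index, count in enumerate(counts):
        if count > 0 and cumulative + count >= target:
            lower, upper = boundaries[index], boundaries[index + 1]
            # Interpolate proportionally within the bin that contains the target rank.
            return lower + (target - cumulative) / count * (upper - lower)
        cumulative += count
    return boundaries[-1]

boundaries = [164, 275, 420, 565, 712, 831]
aggregated = [5, 5, 5, 5, 5]

print(estimate_percentile(aggregated, boundaries, 0.50))  # 492.5 vs. the exact 496
```

The same call with a different fraction, for example 0.2 or 0.9, estimates any other percentile from the very same aggregated counts.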

Using this approach, you can estimate any percentile of any dataset. What you need to consider:

  • The number of predefined intervals: I used 5 here, and the more intervals you keep, the more accurate the estimate (see the sketch after this list).
  • Data evolves over time: either recompute all rows periodically, or apply another strategy to deal with “distribution drift”.
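To get a feel for the first point, here is a rough experiment on synthetic data (numpy assumed; the lognormal distributions, the seed, and the bin counts are arbitrary choices of mine). It precomputes the boundaries on the first day only, aggregates per-day histograms, and compares the estimated median against the exact one:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
# Two synthetic "days" of skewed transaction amounts with slightly different distributions.
day_1 = rng.lognormal(mean=5.0, sigma=0.75, size=500_000)
day_2 = rng.lognormal(mean=5.1, sigma=0.70, size=500_000)
exact = np.median(np.concatenate([day_1, day_2]))

for bins in (5, 20, 100):
    # Boundaries are precomputed once, from the first day only, then reused for both days.
    boundaries = np.percentile(day_1, np.linspace(0, 100, bins + 1))
    boundaries[0] = min(day_1.min(), day_2.min())   # widen the outer bins so that
    boundaries[-1] = max(day_1.max(), day_2.max())  # no value falls outside the histogram
    counts = np.histogram(day_1, bins=boundaries)[0] + np.histogram(day_2, bins=boundaries)[0]

    # Estimate the median from the aggregated counts by the same linear interpolation as above.
    cumulative = np.concatenate([[0], np.cumsum(counts)])
    estimate = np.interp(0.5 * cumulative[-1], cumulative, boundaries)
    print(f"{bins:4d} bins -> estimation error {abs(estimate - exact):8.2f}")
```

With boundaries like these, the estimation error typically shrinks as the number of bins grows, which is exactly the accuracy trade-off mentioned above.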
