Working in SQL with date ranges
Date or time ranges are common attributes in the database. Quite often something starts and ends. The concurrency may these beings be defined as how many of the are happening at the same time. Let’s calculate it.
There is two tables with following columns: id, start_at and end_at. The first table contains 11M rows, the second one almost 400M.
Solution #1 — Naive
Theoretically the problem can be reduced to discreet values and the grouping may be applied. If we know all the timestamps when the value may change (when it starts or ends) we can build a timeline to use for evaluation using between operator. The performance depends on number of rows and the actual concurrency level.
The example below tries to aggregate 11M rows to figure out the maximum concurrency. I wrote “tries”, because it works for 1k rows, but has serious issues with 11M, which is visible in the Snowflake’s query plan. The Cartesian join produced already 300M rows within 30 minutes. I stopped it.
Solution #2 — Windowing
The next iteration was to create kind of stream of events with a balance (1 if started and -1 if stopped) and apply accumulative sum to aggregate it. It worked quite performant, which is confirmed in the query plan.
The solution unfortunately requires internal sorting to do required window function. And it is the price of 73%. The 400M rows were processed in the Snowflake database in 15 minutes using the smallest warehouse (XS).