Hardcoding in SQL performs better
Can hardcoding in SQL really be the secret to better performance? How to find the right balance between performance and abstraction?
Context
Imagine a big dataset with website visits, and you want to study a specific geographic area. You cannot filter only by area, but you must also think about other unique factors for that location, like the website’s launch date. This is because the data might have visits from before the website was live in that specific area. Finally, you want to count the number of impressions in all areas after applying all rules that are unique to each area.
Solution #1
Suppose we have the following query. A less experienced developer might write it like this. The query simply selects and aggregates data from the table and filters it using a straightforward hardcoded implementation. It does exactly what is required.
select locale, count(1)
from gallery_impression
where (locale = 'US' and date >= '2022-02-25')
or (locale = 'ES' and date >= '2020-04-10')
or (locale = 'DE' and date >= '2022-05-13')
group by locale
The solution is simple but has some maintenance challenges. Whenever a new locale is added, all similar queries have to be modified. However, the query is highly efficient, as confirmed by the execution plan. The query spends most of its time scanning the table (82%).
Solution #2
A more experienced developer might notice the maintenance issue and introduce a configuration table to address it. This table could have just two columns, such as locale and launch date, to efficiently filter out unwanted rows dynamically.
select gi.locale, count(1)
from gallery_impressions gi
join configuration_geography cg
on cg.locale = gi.locale and gi.date >= cg.launch_date
group by gi.locale
The solution remains relatively simple and is now more maintainable. When a new locale is added, just add it to the configuration table, and all queries will consider the change. However, the change has introduced complexity to the execution plan. An additional join is now needed, which consumes a significant amount of time compared to the table scan. This is the cost of the abstraction.
Solution #3
What if we could combine the best of both solutions? What if we could have a configuration table and still use a hardcoded query with excellent performance? Fortunately, this is possible using DBT. With simplicity and ease, we can generate a query without any join operations while still utilizing the configuration table.
{% call statement('get-locales', fetch_result=True) %}
select locale, launch_date
from {{ ref('configuration_geography') }}
order by country_code
{% endcall %}
select locale, count(1)
from gallery_impression
where 1 = 0
{% for row in load_result('get-locales').data %}
or (locale = '{{ row[0] }}' and date >= '{{ row[1] }}')
{% endfor %}
group by locale
Conclusion
There is always a balance between performance and abstraction. From one side we want to make our code performing excellent, but on the other side we love abstraction which offers more flexibility. Finding the right balance is not easy, but it’s important find it to create easy-to-maintain solutions.