Next on my list of top performance killers in Synapse Dedicated SQL Pools is partitioning. Partitioning is too often overused in Synapse. Let's first talk about when you would use partitioning, and then how to use it effectively.
Partitioning should only be applied to a table with a very large number of records, and even then it should be done sparingly. Synapse Dedicated SQL Pool is an MPP platform that shards data into 60 distributions, so your data is already physically partitioned (distributed) 60 ways. If your table has a clustered columnstore index, each of those distributions is further divided into rowgroups that compress optimally at roughly 1 million records per rowgroup. Doing the math (assuming even distribution across all 60 distributions and optimal compression per rowgroup), you need 60 million records to justify a single partition, and therefore 120 million records to justify a second partition created with the partition syntax at table creation. Remember that a clustered columnstore index is built to handle multiple rowgroups: each rowgroup carries segment metadata that facilitates rowgroup elimination at query runtime, so partitioning isn't necessary even at 120 million records. I typically don't recommend even beginning to talk about partitioning until your table has at least 600 million records, and at that size you should have a maximum of only 10 partitions.
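To make the arithmetic concrete, here is a sketch of what appropriately sized partitioning might look like at table creation. The table and column names are hypothetical; the boundary values create 5 partitions, which at ~600 million records works out to ~120 million records per partition, comfortably above the 60 million minimum:

```sql
-- Hypothetical ~600M-record fact table: 4 RANGE RIGHT boundaries = 5 partitions,
-- i.e. ~120M records per partition, well above the 60M-per-partition floor.
CREATE TABLE dbo.FactSales
(
    SaleDateKey   int           NOT NULL,
    CustomerKey   int           NOT NULL,
    Amount        decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( SaleDateKey RANGE RIGHT FOR VALUES
                (20230101, 20230401, 20230701, 20231001) )
);
```

Note that even within each partition, every one of the 60 distributions still needs enough records to fill ~1-million-record rowgroups, which is where the 60-million-records-per-partition floor comes from.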
What happens when there are too many partitions?
When there are too many partitions for the quantity of data in a table, query processing forces context switching through tempdb to complete a query. Essentially, too many threads must be opened to read too little data. For example, a 100-million-record table carved into 100 partitions leaves only about 16,667 records per partition per distribution (100,000,000 ÷ 100 ÷ 60), a tiny fraction of the ~1 million records a rowgroup needs for optimal compression. This is very inefficient, and it culminates in very poor performance.
Some antipatterns and other common usages for partitions in data warehousing:
- I don’t have a clustered columnstore index on my table.
There are reasons why you may not have a clustered columnstore index on a large table. These can include the need for a LOB datatype (such as varchar(max)), or a very write-heavy workload that demands a round-robin heap table (which will definitely perform terribly for query scans, seeks, joins, and aggregations). In this scenario your data is still spread across 60 distributions, so the number of partitions should still be more limited than it would be on a regular SQL Server, but the impact of over-partitioning is not as severe.
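As an illustration of this scenario, a sketch of such a table might look like the following (table, column, and boundary values are hypothetical; the point is the HEAP plus ROUND_ROBIN options with a deliberately small partition count):

```sql
-- Hypothetical write-heavy staging table: varchar(max) rules out a clustered
-- columnstore index, so it is a round-robin heap. Partitioning is still kept
-- sparing because the data is already spread across 60 distributions.
CREATE TABLE dbo.StageDocuments
(
    DocumentKey  bigint       NOT NULL,
    LoadDateKey  int          NOT NULL,
    Payload      varchar(max) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP,
    PARTITION ( LoadDateKey RANGE RIGHT FOR VALUES (20230101, 20240101) )
);
```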
- I want to leverage partitioning as a load strategy.
I am typically not a fan of leveraging partitioning as a load strategy in Synapse. Yes, you can easily reload a particular partition and switch out an existing partition for a new one with minimal effort, but unless you have the data quantity to justify partitioning in the first place, the benefit of managing partitions is lost when you consider the ease of use that a CTAS and rename can provide. Both partition switching and a CTAS/rename require split-second exclusive table locks, so there is little to no benefit of one over the other. Additionally, in either case, statistics must be updated or recreated each time.
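For readers unfamiliar with the CTAS/rename pattern, here is a minimal sketch (table, schema, and statistics names are hypothetical):

```sql
-- Build the replacement table with CTAS.
CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH (CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT SaleDateKey, CustomerKey, Amount
FROM   stage.FactSales;

-- Swap the new table in; each RENAME takes only a brief exclusive lock,
-- comparable to a partition switch.
RENAME OBJECT dbo.FactSales     TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;

-- Statistics must be recreated (or updated) on the new table either way.
CREATE STATISTICS stat_SaleDateKey ON dbo.FactSales (SaleDateKey);
```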
So, wrapping up on the topic of partitioning: unless you have a very large table (measured in record count), or a large table without a clustered columnstore index, don't bother partitioning. Even when you do implement partitioning, set partition boundaries based on no fewer than 60 million records per partition. To help you identify tables that you may have over-partitioned, you can use the query below*.
*One important note about querying node-level DMVs like the one below: you cannot simply join on object_ids. Node-level DMVs must be joined by using [pdw_table_mappings].[physical_name]. This is a common mistake that will return an incorrect result.
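To illustrate the join pattern being described (this is a minimal sketch of the technique, not necessarily the post's exact query), the node-level partition stats are mapped back to user tables through `physical_name` rather than `object_id`:

```sql
-- Sketch: partition count and row count per table, joining node-level DMVs
-- back to user tables via pdw_table_mappings.physical_name (NOT object_id).
SELECT
    s.name AS schema_name,
    t.name AS table_name,
    COUNT(DISTINCT nps.partition_number) AS partition_count,
    SUM(nps.row_count)                   AS total_rows
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id
JOIN sys.pdw_table_mappings AS tm
    ON t.object_id = tm.object_id
JOIN sys.pdw_nodes_tables AS nt
    ON tm.physical_name = nt.name
JOIN sys.dm_pdw_nodes_db_partition_stats AS nps
    ON  nt.object_id       = nps.object_id
    AND nt.pdw_node_id     = nps.pdw_node_id
    AND nt.distribution_id = nps.distribution_id
WHERE nps.index_id <= 1          -- heap or clustered index only
GROUP BY s.name, t.name
HAVING COUNT(DISTINCT nps.partition_number) > 1
   AND SUM(nps.row_count) / COUNT(DISTINCT nps.partition_number) < 60000000
ORDER BY partition_count DESC;
```

The HAVING clause flags partitioned tables averaging fewer than 60 million records per partition, the threshold discussed above.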
If you are reading this and now find yourself needing to clean up tables that have been over-partitioned, check out this older post about using CTAS to change a table's partitioning, distribution, indexes, etc.: Common Data Warehouse Functionality Redefined By CTAS – Microsoft Tech Community
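In the simplest case, removing partitioning via CTAS is just a matter of omitting the PARTITION clause when rebuilding (a sketch with hypothetical names; see the linked post for the full pattern):

```sql
-- Sketch: rebuild an over-partitioned table without partitions.
CREATE TABLE dbo.FactSales_Rebuilt
WITH
(
    DISTRIBUTION = HASH (CustomerKey),
    CLUSTERED COLUMNSTORE INDEX      -- note: no PARTITION clause
)
AS
SELECT * FROM dbo.FactSales;

RENAME OBJECT dbo.FactSales         TO FactSales_OverPartitioned;
RENAME OBJECT dbo.FactSales_Rebuilt TO FactSales;
DROP TABLE dbo.FactSales_OverPartitioned;
```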