Statistics are without question the most important and most frequently overlooked cause of poor query performance in Synapse Dedicated SQL Pools. SQL Server is forgiving in that autocreate and autoupdate statistics are typically enabled, but Synapse only offers autocreate statistics; autoupdate statistics is not an option in Synapse today. This leads to a few critical questions around statistics…
What is the role of statistics in query performance for Synapse Dedicated SQL Pools compared to SQL Server? What’s the difference?
Statistics in SQL Server help the optimizer decide whether certain operations are more or less efficient when reading the data – join types, lookups, degree of parallelism, and so on. In Synapse Dedicated SQL Pools, statistics perform the same function within each distribution, but they also (and more importantly) determine the data movement operations between distributions required to complete a query. These data movement operations include broadcasting or shuffling data to colocate it for joins and aggregations.
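To see which data movement operations the optimizer has chosen for a query, one option is the EXPLAIN command, which returns the DSQL plan as XML without executing the query. This is a minimal sketch; the dbo.FactSales and dbo.DimProduct tables are hypothetical.

```sql
-- Hypothetical tables. EXPLAIN returns the distributed (DSQL) plan as XML,
-- including any broadcast or shuffle move steps the optimizer has chosen.
EXPLAIN
SELECT f.ProductKey, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP BY f.ProductKey;
```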
It is important to note a few things at this point, as they directly impact query performance in Synapse Dedicated SQL Pools:
- Autocreate statistics automatically creates statistics when a query could use that information to build a more efficient query plan. It is triggered by query filter and join predicates only, and it creates single-column statistics only.
- Autocreate statistics runs synchronously, meaning queries may take a performance hit on first run while the needed statistics are created before the query executes. It is best to create and maintain statistics in advance of regular workloads (see the sketch after this list), which does mean developers need to understand and anticipate the workload requests coming from clients.
- While autoupdate statistics is not an option in Synapse Dedicated SQL Pools, maintaining statistics is critical to performance. In fact, the individual distributions each have autoupdate statistics enabled by default, but the DW engine does not – the query optimizer for Synapse lives on the control node, where no data is persisted.
- When you issue the UPDATE STATISTICS command, statistics are aggregated from the individual distributions and stored on the control node, where the optimizer uses them to determine the DSQL plan. Accurate statistics on the control node are therefore critical to getting the most optimal query plan.
- Every table has a default statistic even if statistics have never been created on it. That table-level statistic assumes 1000 records and is used by the optimizer to build a query plan that limits data movement. This means that if you have never updated statistics after table creation, the query optimizer builds plans on the expectation that the table contains only 1000 records, regardless of whether it holds 0 records or 1 billion. As you can imagine, this can cause very bad data movement and very bad performance. This is probably the most frequent cause of poor query performance in Synapse Dedicated SQL Pools.
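Because autocreate only builds single-column statistics and the default table-level statistic assumes 1000 rows, it is often worth creating statistics yourself ahead of the workload. The sketch below uses hypothetical table, column, and statistic names; the multi-column example assumes your workload filters on those columns together.

```sql
-- Hypothetical table and columns. Single-column statistic on a join key:
CREATE STATISTICS stat_FactSales_ProductKey
    ON dbo.FactSales (ProductKey);

-- Multi-column statistic on columns frequently filtered together
-- (autocreate statistics will not build these):
CREATE STATISTICS stat_FactSales_Date_Store
    ON dbo.FactSales (DateKey, StoreKey);

-- Inspect what the control node currently believes about a statistic:
DBCC SHOW_STATISTICS ('dbo.FactSales', stat_FactSales_ProductKey);
```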
How and when should I maintain statistics in Synapse?
The "how" of maintaining statistics is straightforward. Updating statistics works the same as in SQL Server: UPDATE STATISTICS [schema].[table]; This simple syntax covers all statistics on a single table, which is a great place to start when tuning in a blanket fashion.
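As a minimal sketch of the syntax just mentioned (the table and statistic names are hypothetical), you can update every statistic on a table or just a single named statistic:

```sql
-- Update all statistics on a table (the blanket approach described above):
UPDATE STATISTICS dbo.FactSales;

-- Update a single named statistic when only one object needs refreshing:
UPDATE STATISTICS dbo.FactSales (stat_FactSales_ProductKey);
```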
There is one caveat to consider, particularly with large tables: the sampling percentage. If not specified, the UPDATE STATISTICS command uses a sampling percentage that scales non-linearly with the size of the table. Default sampling works well in most scenarios, but issues arise when the data has low cardinality and therefore skew. In that scenario the default sample is not enough, and you have to use a larger sampling size. Increasing the sample size means maintenance takes more time, but it will pay off for user queries against that table.
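Sampling can be controlled explicitly with the standard WITH SAMPLE / FULLSCAN options. Again a hedged sketch with hypothetical names:

```sql
-- Larger-than-default sample for a big table with skewed, low-cardinality data:
UPDATE STATISTICS dbo.FactSales (stat_FactSales_ProductKey) WITH SAMPLE 50 PERCENT;

-- Full scan when accuracy matters more than maintenance time:
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;
```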
The question of when to update statistics in Synapse is not as easy to answer as the how. The obvious answer is to update them when they are no longer accurate. Most documentation points you to checking when statistics were last updated, but that doesn't tell you the ratio of change on the table since that update, and this is where you can run into trouble. Some will say to update statistics when they are more than 20% inaccurate, but even that may not be sufficient. For Synapse, the best practice guidance is to update after every data load, but this may not be possible for very large fact tables with billions of records, where updating statistics can take significant time. In that case, implement a schedule for slowly changing dimensions and large fact tables (perhaps weekly) depending on the ratio of change.
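To see when each statistic was last updated (the signal most documentation points to), a sketch against the catalog views available in dedicated SQL pools:

```sql
-- Last-updated date for every user-created or auto-created statistic.
-- This shows age only, not the ratio of change since the last update.
SELECT
    sm.name AS schema_name,
    tb.name AS table_name,
    st.name AS stats_name,
    STATS_DATE(st.object_id, st.stats_id) AS stats_last_updated
FROM sys.stats AS st
JOIN sys.tables AS tb ON tb.object_id = st.object_id
JOIN sys.schemas AS sm ON sm.schema_id = tb.schema_id
WHERE st.user_created = 1 OR st.auto_created = 1
ORDER BY stats_last_updated;
```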
How to Check Accuracy of Statistics:
If you are a DBA and/or responsible for maintenance, you will likely want a query to check the accuracy of statistics. Remember that what matters for statistics in Synapse Dedicated SQL Pools is whether the record count (as well as the histogram, density, etc.) the control node believes matches the actual sum of the record counts across all distributions. If the percent difference is greater than some threshold X (maybe 20%), you would want to update statistics. The best way to get that difference is via a script like the one below. (Note that this can run a very long time in large environments. As always, this is meant to be an example of how to query DMVs and is not necessarily a production-ready script.)