Synapse Fundamentals for Performance Tuning – Statistics

Posted on August 12, 2022 (updated December 9, 2022) by Daniel Crawford

Statistics in Synapse Dedicated SQL Pools are without question the most important and most frequently overlooked cause of poor query performance.  SQL Server is forgiving in that auto-create and auto-update statistics are typically both enabled, but Synapse offers only auto-create statistics; auto-update statistics is not an option available in Synapse today.  This leads to a few critical questions around statistics…

What is the role of statistics in query performance for Synapse Dedicated SQL Pools, and how does it differ from SQL Server?

Statistics in SQL Server give the optimizer the row count and distribution estimates it uses to decide whether certain operations are more or less efficient when reading the data.  These operations may include join types, lookups, parallelism, etc.  In Synapse Dedicated SQL Pools, statistics perform an identical function within each distribution but also (and more importantly) determine the data movement operations between distributions required to complete a query.  These data movement operations include broadcasting or shuffling data to co-locate it for joins and aggregations.
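You can see which data movement operations the optimizer has chosen for a given query with the EXPLAIN command, which returns the distributed (DSQL) plan as XML.  A minimal sketch using hypothetical table names; look for SHUFFLE_MOVE or BROADCAST_MOVE operations in the output:

    EXPLAIN
    SELECT c.Region, SUM(f.SalesAmount) AS TotalSales
    FROM dbo.FactSales AS f
    JOIN dbo.DimCustomer AS c
      ON c.CustomerKey = f.CustomerKey
    GROUP BY c.Region;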

It is important to note a few things at this point as it directly impacts query performance in Synapse Dedicated SQL Pools:

  1. Autocreate statistics will automatically create statistics when a query could use this information to build a more efficient query plan.  It triggers on query filter and join predicates only, and it creates single-column statistics only. 
  2. Autocreate statistics runs synchronously, meaning queries may take a performance hit on first run while stats are created before the query executes.  It is best to create and maintain statistics in advance of regular workloads.  This does mean that developers need to understand and anticipate the workload requests coming from clients.
  3. While autoupdate statistics is not an option in Synapse Dedicated SQL Pools, it is critical to performance that statistics are maintained.  In fact, the individual distributions each have autoupdate statistics enabled by default, but the DW engine does not; that is because the query optimizer for Synapse lives on the control node, where no data is persisted. 
  4. When you issue the UPDATE STATISTICS command, stats are aggregated from the individual distributions and stored on the control node, where the optimizer uses them to determine the DSQL (distributed SQL) plan.  Thus, accurate statistics on the control node are critical to an optimal query plan.
  5. Every table has a default statistic on it even if stats have never been created on the table.  That table-level statistic is set to 1000 records and is used by the optimizer to determine a query plan that limits data movement.  This means that if you have never updated statistics after table creation, the query optimizer is building query plans based upon the expectation that there are only 1000 records in the table, regardless of whether there are 0 records or 1 billion.  As you can imagine, this can cause some very bad data movement and very bad performance.  This is probably the most frequent cause of poor query performance in Synapse Dedicated SQL Pools.  A quick way to spot stale or missing statistics is shown just after this list.
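To see which statistics exist on a table, whether they were auto-created or user-created, and when they were last updated, you can query the catalog views.  A minimal sketch; the table name is hypothetical:

    SELECT st.[name] AS stats_name,
           c.[name] AS column_name,
           st.auto_created,
           st.user_created,
           STATS_DATE(st.[object_id], st.stats_id) AS stats_last_updated
    FROM sys.stats AS st
    JOIN sys.stats_columns AS sc
      ON sc.[object_id] = st.[object_id] AND sc.stats_id = st.stats_id
    JOIN sys.columns AS c
      ON c.[object_id] = sc.[object_id] AND c.column_id = sc.column_id
    WHERE st.[object_id] = OBJECT_ID('dbo.FactSales');  -- hypothetical table

Statistics showing an old stats_last_updated date on a frequently loaded table are prime suspects.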

How and when should I maintain statistics in Synapse?

The how of maintaining statistics is straightforward.  Updating statistics is very easy, and it is the same as in SQL Server:  UPDATE STATISTICS [schema].[table];  This syntax is simple and covers all statistics on a single table, which is a great place to start when tuning in a blanket fashion.  A couple of examples are sketched below.
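For illustration (the object and statistic names here are hypothetical):

    -- Update every statistics object on a table
    UPDATE STATISTICS dbo.FactSales;

    -- Update a single statistics object by name
    UPDATE STATISTICS dbo.FactSales (stat_CustomerKey);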

There is one caveat to consider, with large tables in particular: your sampling percentage.  If not specified, your UPDATE STATISTICS command will use a default sampling percentage that scales non-linearly with the size of your table.  While default sampling works well in most scenarios, issues arise when the data has low cardinality and therefore skew.  In that scenario the default sampling is not enough, and you have to use a larger sample size.  Increasing the sample size requires more time to perform maintenance, but it will pay off for your user queries against that table.
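The sample size can be specified explicitly; the percentage below is illustrative, not a recommendation:

    -- Sample half of the rows when building the statistics
    UPDATE STATISTICS dbo.FactSales WITH SAMPLE 50 PERCENT;

    -- Scan every row: the most accurate option, and the most expensive to maintain
    UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;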

The question of when to update statistics in Synapse is not as easy to answer as the how.  The obvious answer is to update them when they are no longer accurate.  Most documentation points you to checking when statistics were last updated, but that does not tell you the ratio of change on the table since they were last updated, and this is where you can run into trouble.  Some will say that it is important to update statistics when they are more than 20% inaccurate, but even then, this may not be sufficient.  For Synapse, the best practice guidance is to update after every data load, but this may not be possible for very large fact tables with billions of records, where updating statistics can take significant time.  In that case, a schedule should be implemented for slowly changing dimensions and large fact tables (perhaps weekly) depending on the ratio of change.
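Updating after every load usually just means ending the load job with a statistics refresh on the tables it touched.  A minimal sketch, with hypothetical staging and target tables:

    -- Load the new data...
    INSERT INTO dbo.FactSales
    SELECT * FROM stg.FactSales_Incoming;

    -- ...then refresh statistics on the target before user queries hit it
    UPDATE STATISTICS dbo.FactSales;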

How to Check Accuracy of Statistics:

If you are a DBA and/or responsible for maintenance, you will likely want a query to check the accuracy of statistics.  Remember that the important part of stats in Synapse Dedicated SQL Pools is how accurate the control node's view of the record count (as well as histogram, density, etc.) is compared to the actual sum of the record counts across all distributions.  If the percent difference is greater than some threshold X (maybe 20%), you would want to update statistics.  The best way to get the difference is via a script like the one below.  (Note that this can run for a very long time in large environments.  As always, this is meant to be an example of how to query DMVs and is not necessarily a production-ready script.)

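A minimal sketch of such a script, assuming the row counts in sys.partitions on the control node reflect what the optimizer believes, while sys.dm_pdw_nodes_db_partition_stats returns the actual per-distribution counts:

    WITH control_node AS (
        -- Row counts as the control node (and thus the optimizer) believes them
        SELECT p.[object_id], SUM(p.[rows]) AS believed_rows
        FROM sys.partitions AS p
        WHERE p.index_id IN (0, 1)  -- heap or clustered index only
        GROUP BY p.[object_id]
    ),
    distributions AS (
        -- Actual row counts summed across every distribution
        SELECT tm.[object_id], SUM(nps.[row_count]) AS actual_rows
        FROM sys.pdw_table_mappings AS tm
        JOIN sys.pdw_nodes_tables AS nt
          ON nt.[name] = tm.[physical_name]
        JOIN sys.dm_pdw_nodes_db_partition_stats AS nps
          ON nps.[object_id] = nt.[object_id]
         AND nps.pdw_node_id = nt.pdw_node_id
         AND nps.distribution_id = nt.distribution_id
         AND nps.index_id IN (0, 1)
        GROUP BY tm.[object_id]
    )
    SELECT s.[name] AS [schema_name],
           t.[name] AS [table_name],
           cn.believed_rows,
           d.actual_rows,
           CASE WHEN d.actual_rows = 0 THEN NULL
                ELSE ABS(d.actual_rows - cn.believed_rows) * 100.0 / d.actual_rows
           END AS pct_difference
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN control_node AS cn ON cn.[object_id] = t.[object_id]
    JOIN distributions AS d ON d.[object_id] = t.[object_id]
    ORDER BY pct_difference DESC;

Any table where pct_difference exceeds your threshold (20% in the discussion above), or where believed_rows is still sitting at the default 1000, is a candidate for an immediate UPDATE STATISTICS.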
