Skip to content

Daniel Crawford's Data Blog

Tips for performant database designs

Menu
  • Home
  • Blog
  • Resources
  • About
  • Contact
Menu

Synapse Fundamentals for Tuning – Partitioning

Posted on September 23, 2022September 28, 2022 by Daniel Crawford

Next on my list of top performance killers in Synapse Dedicated SQL Pools, is Partitioning.  Partitioning is too often overused in Synapse.  Let’s first talk about when you would use partitioning and then how to use it effectively.

Partitioning should only be applied on a table with a very large number of records and even then, it should be done very sparingly.  Synapse Dedicated SQL Pool is an MPP platform that shards the data into 60 distributions therefore your data is already physically partitioned or distributed 60 times.  If your table has a clustered columnstore index on it, each one of your distributions is already divided into rowgroups that should be optimally compressed with ~1 million records per rowgroup.  When we do the math on this, (assuming we have good even distribution across all 60 distributions and optimal compression per rowgroup) we need 60 million records to justify a single partition.  So to further partition the table (using the partition syntax at table creation), you would need 120 million records to justify a second partition.  Remember that a clustered columnstore index is built to handle multiple rowgroups. Each rowgroup has a dictionary which helps facilitate rowgroup elimination at query runtime and therefore partitioning isn’t necessary even when there are 120 million records.  Therefore, I typically don’t recommend even beginning to talk about partitioning until your table has at least 600 million records (at this size you should have a maximum of only 10 partitions).

What happens when there are too many partitions?

When there are too many partitions for the quantity of data in a table.  Query processing will force context switching in TempDB to complete a query.  Essentially, too many threads must be opened to read too little data.  This is very inefficient, and it culminates in very poor performance. 

Some antipatterns and other common usages for partitions in data warehousing:

  • I don’t have a clustered columnstore index on my table. 

There are reasons why you may not have a clustered columnstore index on a large table.  These can include the need for a LOB datatype (such as varchar(max)) or you have a very write heavy workload that demands a heap round robin table (this will definitely perform terrible for query scans, seeks, joins, and aggregations).  In this scenario, your data is still distributed amongst 60 distributions and therefore the number of partitions you use should be limited more than you would typically have in a regular SQL Server but the impact of over-partitioning is not as severe.

  • I want to leverage partitioning as a load strategy.

I am typically not a fan of leveraging partitioning as a load strategy in Synapse.  Yes, you can easily reload a particular partition and switch out an existing partition for a new one with minimal effort but unless you have the data quantity to justify partitioning in the first place, the benefit of managing partitions is lost when you consider the ease of use that a CTAS and rename can provide.  Both partition switching and CTAS/rename processes will require split second exclusive table locks and therefore there is little to no benefit of one over the other. Additionally in either case, statistics must be either updated or recreated each time.

So, wrapping up on the topic of partitioning.  Unless you have a very large table (measured in record count) or you have a large table without a clustered columnstore index, don’t bother partitioning.  Even when you do implement partitioning, implement partition boundaries based upon no fewer than 60 million records per partition.  To help you identify tables that you may have over-partition you can use the query below*.

*One important note about querying node level DMVs like the one below: you cannot simply join on object_id’s. Node Level DMV’s must be joined by using [pdw_table_mappings].[physical_name]. This is a common mistake which will return an incorrect result.

If you are reading this and you now find yourself needing to clean up your tables that have been over-partitioned, you can check out this older post about using CTAS to change the table partitioning, distribution, indexes, etc…  Common Data Warehouse Functionality Redefined By CTAS – Microsoft Tech Community

Loading

Categories

  • Architecture Patterns
  • Fabric
  • Performance Tuning
  • Synapse
  • Top 10 Performance Considerations

Archives

  • July 2023
  • January 2023
  • December 2022
  • November 2022
  • September 2022
  • August 2022

Recent Synapse Videos

Microsoft Fabric is a SaaS platform that lets you easily connect, process, and visualize data, all within a single, secure environment. Instead of maintaining your own security infrastructure, Fabric takes care of it for you, leveraging Microsoft’s resources and expertise to keep your data protected, patched, monitored, and compliant. 

By moving your data to the cloud and working with analytics tools like Power BI, Data Factory, and the next-generation Synapse, you get built-in encryption for data at rest and in transit, along with disaster recovery options. This means you can confidently focus on building data solutions without worrying about the underlying security. 

 

🎙 Meet the Speakers: 

👤 Guest from Product Group: Santhosh Kumar Ravindran, Senior Product Manager 

 

Santhosh Ravindran currently leads Spark Compute and Settings for Microsoft Fabric Spark. He focuses on building capabilities that meet the data engineering needs like Spark Pools, Queueing and Scheduling, and Job orchestration for big data workloads in large enterprises using Spark. Prior to this, Santhosh was a Product Manager and Engineering Lead building Metadata scanners, access policy orchestration, lineage and data catalog systems for enterprise data estates as part of the Microsoft Governance Platform (Microsoft Purview). 

 

LinkedIn:  https://www.linkedin.com/in/thisissanthoshkumar/   

Twitter:  https://www.twitter.com/iamsanthoshkr  

 

 

👤 Host: Estera Kot, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

 

👍 Liked this video? Don't forget to hit the 'Like' button and share it with your peers!
Microsoft Fabric Product Group presents: Security in Fabric Data Engineering
Fabric Apache Spark Diagnostic Emitter for Logs and Metrics is now in public preview. This new feature allows Apache Spark users to collect Spark logs, job events, and metrics from their Spark applications and send them to various destinations, including Azure Event Hubs, Azure Storage, and Azure Log Analytics. It provides robust support for monitoring and troubleshooting Spark applications, enhancing your visibility into application performance. 

 What Does the Diagnostic Emitter Do? 

The Fabric Apache Spark Diagnostic Emitter enables Apache Spark applications to emit critical logs and metrics that can be used for real-time monitoring, analysis, and troubleshooting. Whether you’re sending logs to Azure Event Hubs, Azure Storage, or Azure Log Analytics, this emitter simplifies the process, allowing you to collect data seamlessly and store it in your preferred destinations.  

 

Blog post: https://blog.fabric.microsoft.com/en-US/blog/announcing-the-fabric-apache-spark-diagnostic-emitter-collect-logs-and-metrics/  

 

🎙 Meet the Speakers: 

👤 Guest Expert: Jenny Jiang, Principal Product Manager | Fabric Data Engineering 

LinkedIn: https://www.linkedin.com/in/jenny-jiang-8b57036/ 

👤 Host: Estera Kot, PhD, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 

Twitter: https://twitter.com/estera_kot 

👍 Liked this video? Don't forget to hit the 'Like' button and share it with your peers!
Microsoft Fabric Spark Diagnostic Emitter for Logs and Metrics
The T-SQL notebook feature in Microsoft Fabric lets you write and run T-SQL code within a notebook. You can use T-SQL notebooks to manage complex queries and write better markdown documentation. It also allows direct execution of T-SQL on connected warehouse or SQL analytics endpoint. By adding a Data Warehouse or SQL analytics endpoint to a notebook, T-SQL developers can run queries directly on the connected endpoint. BI analysts can also perform cross-database queries to gather insights from multiple warehouses and SQL analytics endpoints. 

🎙 Meet the Speakers: 

👤 Guest from Microsoft Fabric Product Group: Qixiao Wang, Principal Program Manager in Microsoft | Fabric Spark 

Linkedin: https://www.linkedin.com/in/%E5%90%AF%E9%9C%84-%E7%8E%8B-24368233/  

 

👤 Host: Estera Kot, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

 

👍 Like this video? Don't forget to hit the 'Like' button and share it with your network! 

🔔 Stay Updated: For more insights into Microsoft Fabric Data Engineering and Data Science, and all things tech, make sure to subscribe to our channel and hit the notification bell so you never miss an episode!
Microsoft Fabric Product Group presents T-SQL support in Microsoft Fabric Notebooks
AutoML (Automated Machine Learning) is a collection of methods and tools that automate machine learning model training and optimization with little human involvement. The aim of AutoML is to simplify and speed up the process of choosing the best machine learning model and hyperparameters for a given dataset, which usually demands much skill and computing power. 

  

🎙 Meet the Speakers: 

👤 Guest from Microsoft Fabric Product Group: Misha Desai, Senior Program Manager  

 

Misha is a Senior Product Manager based in Seattle, WA, specializing in model tracking, training, and governance within the Fabric Data Science team. 

Linkedin: www.linkedin.com/in/misha-desai-6034a362 

 

👤 Host: Estera Kot, Senior Product Manager at Microsoft and a member of the Fabric Product Group. She holds the role of Product Owner for Apache Spark-based runtimes in Microsoft Fabric and Synapse Analytics. Estera is a Data & AI Architect and is passionate about computer science. 
LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

🔔 Stay Updated: For more insights into Microsoft Fabric Data Engineering and Data Science, and all things tech, make sure to subscribe to our channel and hit the notification bell so you never miss an episode!
Low Code AutoML UI in Microsoft Fabric Data Science
With high concurrency mode, we’re bringing a game-changing way to orchestrate your data ingestion and transformation processes in enterprise workflows. Notebooks in pipelines now leverage shared, high-performance sessions, combining speed with cost-efficiency—all while maintaining top-tier security. 

 Imagine a pipeline with five notebooks, each running 5 minutes. Normally, the 3-minute Spark start time per step would push your pipeline to 40 minutes. With high concurrency, the total runtime drops to 28 minutes—a 30% speed improvement. 

Unlock faster workflows, lower costs, and a seamless data journey with high concurrency mode. Get ready to experience the next level of pipeline orchestration! 🎉 

https://blog.fabric.microsoft.com/en-US/blog/introducing-high-concurrency-mode-for-notebooks-in-pipelines-for-fabric-spark/  

🎙 Meet the Speakers: 

👤 Guest from Product Group: Santhosh Kumar Ravindran, Senior Product Manager 

 

Santhosh Ravindran currently leads Spark Compute and Settings for Microsoft Fabric Spark. He focuses on building capabilities that meet the data engineering needs like Spark Pools, Queueing and Scheduling, and Job orchestration for big data workloads in large enterprises using Spark. Prior to this, Santhosh was a Product Manager and Engineering Lead building Metadata scanners, access policy orchestration, lineage and data catalog systems for enterprise data estates as part of the Microsoft Governance Platform (Microsoft Purview). 

 

LinkedIn:  https://www.linkedin.com/in/thisissanthoshkumar/   

Twitter:  https://www.twitter.com/iamsanthoshkr  

 

 

👤 Host: Estera Kot, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

 

👍 Liked this video? Don't forget to hit the 'Like' button and share it with your peers!
High Concurrency Mode for Notebooks in Pipelines for Fabric Spark
Load More... Subscribe