Skip to content

Daniel Crawford's Data Blog

Tips for performant database designs

Menu
  • Home
  • Blog
  • Resources
  • About
  • Contact
Menu

Synapse Fundamentals for Tuning – Indexes Part2

Posted on September 14, 2022September 28, 2022 by Daniel Crawford

Rowstore Index Health

For SQL Server, rowstore index fragmentation is a critical indicator for performance.  In Synapse Dedicated SQL Pools, the same holds true but high fragmentation is generally less of a performance impact since tables are distributed.  This however doesn’t mean that clustered/non-clustered indexes and heaps don’t have to be rebuilt.  Due to the way logical table indexes are distributed and are physically separate indexes on each of the distributions, there is no good way to aggregate and show fragmentation of rowstore indexes.  Therefore, a maintenance schedule should be implemented to rebuild rowstore indexes on a frequency that reflects the ratio of change in tables.  Although we can’t see rowstore index fragmentation, it is not all bad news because maintenance is much faster in Synapse.  In fact, in an ideal situation index rebuilds will only be 1/60th of the time it would take in SQL Server due to the data being evenly distributed across 60 distributions.

Clustered Columnstore Index Health

For columnstore indexes, the rules around when to rebuild are completely different and much more complicated.  Because these indexes are almost always on our largest tables, CCI health is critical to overall Synapse Dedicated SQL Pool performance.  To get straight to the point, there are 4 main reasons to rebuild your CCI indexes or recreate a table with a rowstore index instead:

  1. Small CCI’s that don’t have enough data to fill rowgroups – These should probably be converted to a clustered index instead.  Typically, these are tables with less than 60 million records but this is not a hard rule.  You can still get “good-enough” performance benefit from a CCI with even 30 million records; however, the CCI will have diminishing marginal returns of performance as the table is smaller in size.
  2. Small rowgroups – If an index’s average rowgroup size is less than 400k records, rebuild the index for better compression.
  3. Large Delete Bitmaps – This is where as much as 20% of your data has been logically deleted and needs to be removed via a rebuild process.  If you do many updates or deletes, data is logically deleted but not physically removed until the index is rebuilt. 
  4. Many open/uncompressed rowgroups – In some cases, rowgroups do not compress on their own and you may want to force them to compress; issue a rebuild statement for these indexes.  At most you should have 60 rowgroups in an open state (this is the delta store).  If you see more than 60 rowgroups in open or closed state (not compressed), you will want to rebuild the index.

It needs to be mentioned that the resources used to compress CCI’s is very important.  CCI compression relies on a memory buffer that is allocated through the workload group or resource class assigned to the rebuild statement.  Typically, it is recommended to run rebuild statements in a medium resource class or equivalent WLM group so that more records can fit into the buffer and provide better compression.  (Note that resource recommendations here are relative to the DWU’s assigned to your Dedicated Pool and the width of the CCI table and likely will need to be adjusted.)  If a rowgroup does not have optimal compression (having less than 1million records), you can check the reason why the rowgroup was trimmed in [sys].[dm_pdw_nodes_db_column_store_row_group_physical_stats].  A list of possible trim reasons are below and more details can be found here:

Description of trim_reason.

0 – UNKNOWN_UPGRADED_FROM_PREVIOUS_VERSION: Occurred when upgrading from the previous version of SQL Server.

1 – NO_TRIM: The row group was not trimmed. The row group was compressed with the maximum of 1,048,576 rows. The number of rows could be less if a subset of rows was deleted after delta rowgroup was closed

2 – BULKLOAD: The bulk-load batch size limited the number of rows.

3 – REORG: Forced compression as part of REORG command.

4 – DICTIONARY_SIZE: Dictionary size grew too large to compress all of the rows together.

5 – MEMORY_LIMITATION: Not enough available memory to compress all the rows together.

6 – RESIDUAL_ROW_GROUP: Closed as part of last row group with rows < 1 million during index build operation. Note: A partition build with multiple cores can result in more than one trim of this type.

7 – STATS_MISMATCH: Only for columnstore on in-memory table. If stats incorrectly indicated >= 1 million qualified rows in the tail but we found fewer, the compressed rowgroup will have < 1 million rows

8 – SPILLOVER: Only for columnstore on in-memory table. If tail has > 1 million qualified rows, the last batch remaining rows are compressed if the count is between 100k and 1 million

9 – AUTO_MERGE: A Tuple Mover merge operation running in the background consolidated one or more rowgroups into this rowgroup.

To check CCI health and determine if you need to rebuild an index the following query can be used as a sample.  Use the above 4 rules/guidelines to target the greatest offenders first.

Loading

Categories

  • Architecture Patterns
  • Fabric
  • Performance Tuning
  • Synapse
  • Top 10 Performance Considerations

Archives

  • July 2023
  • January 2023
  • December 2022
  • November 2022
  • September 2022
  • August 2022

Recent Synapse Videos

Microsoft Fabric is a SaaS platform that lets you easily connect, process, and visualize data, all within a single, secure environment. Instead of maintaining your own security infrastructure, Fabric takes care of it for you, leveraging Microsoft’s resources and expertise to keep your data protected, patched, monitored, and compliant. 

By moving your data to the cloud and working with analytics tools like Power BI, Data Factory, and the next-generation Synapse, you get built-in encryption for data at rest and in transit, along with disaster recovery options. This means you can confidently focus on building data solutions without worrying about the underlying security. 

 

🎙 Meet the Speakers: 

👤 Guest from Product Group: Santhosh Kumar Ravindran, Senior Product Manager 

 

Santhosh Ravindran currently leads Spark Compute and Settings for Microsoft Fabric Spark. He focuses on building capabilities that meet the data engineering needs like Spark Pools, Queueing and Scheduling, and Job orchestration for big data workloads in large enterprises using Spark. Prior to this, Santhosh was a Product Manager and Engineering Lead building Metadata scanners, access policy orchestration, lineage and data catalog systems for enterprise data estates as part of the Microsoft Governance Platform (Microsoft Purview). 

 

LinkedIn:  https://www.linkedin.com/in/thisissanthoshkumar/   

Twitter:  https://www.twitter.com/iamsanthoshkr  

 

 

👤 Host: Estera Kot, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

 

👍 Liked this video? Don't forget to hit the 'Like' button and share it with your peers!
Microsoft Fabric Product Group presents: Security in Fabric Data Engineering
Fabric Apache Spark Diagnostic Emitter for Logs and Metrics is now in public preview. This new feature allows Apache Spark users to collect Spark logs, job events, and metrics from their Spark applications and send them to various destinations, including Azure Event Hubs, Azure Storage, and Azure Log Analytics. It provides robust support for monitoring and troubleshooting Spark applications, enhancing your visibility into application performance. 

 What Does the Diagnostic Emitter Do? 

The Fabric Apache Spark Diagnostic Emitter enables Apache Spark applications to emit critical logs and metrics that can be used for real-time monitoring, analysis, and troubleshooting. Whether you’re sending logs to Azure Event Hubs, Azure Storage, or Azure Log Analytics, this emitter simplifies the process, allowing you to collect data seamlessly and store it in your preferred destinations.  

 

Blog post: https://blog.fabric.microsoft.com/en-US/blog/announcing-the-fabric-apache-spark-diagnostic-emitter-collect-logs-and-metrics/  

 

🎙 Meet the Speakers: 

👤 Guest Expert: Jenny Jiang, Principal Product Manager | Fabric Data Engineering 

LinkedIn: https://www.linkedin.com/in/jenny-jiang-8b57036/ 

👤 Host: Estera Kot, PhD, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 

Twitter: https://twitter.com/estera_kot 

👍 Liked this video? Don't forget to hit the 'Like' button and share it with your peers!
Microsoft Fabric Spark Diagnostic Emitter for Logs and Metrics
The T-SQL notebook feature in Microsoft Fabric lets you write and run T-SQL code within a notebook. You can use T-SQL notebooks to manage complex queries and write better markdown documentation. It also allows direct execution of T-SQL on connected warehouse or SQL analytics endpoint. By adding a Data Warehouse or SQL analytics endpoint to a notebook, T-SQL developers can run queries directly on the connected endpoint. BI analysts can also perform cross-database queries to gather insights from multiple warehouses and SQL analytics endpoints. 

🎙 Meet the Speakers: 

👤 Guest from Microsoft Fabric Product Group: Qixiao Wang, Principal Program Manager in Microsoft | Fabric Spark 

Linkedin: https://www.linkedin.com/in/%E5%90%AF%E9%9C%84-%E7%8E%8B-24368233/  

 

👤 Host: Estera Kot, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

 

👍 Like this video? Don't forget to hit the 'Like' button and share it with your network! 

🔔 Stay Updated: For more insights into Microsoft Fabric Data Engineering and Data Science, and all things tech, make sure to subscribe to our channel and hit the notification bell so you never miss an episode!
Microsoft Fabric Product Group presents T-SQL support in Microsoft Fabric Notebooks
AutoML (Automated Machine Learning) is a collection of methods and tools that automate machine learning model training and optimization with little human involvement. The aim of AutoML is to simplify and speed up the process of choosing the best machine learning model and hyperparameters for a given dataset, which usually demands much skill and computing power. 

  

🎙 Meet the Speakers: 

👤 Guest from Microsoft Fabric Product Group: Misha Desai, Senior Program Manager  

 

Misha is a Senior Product Manager based in Seattle, WA, specializing in model tracking, training, and governance within the Fabric Data Science team. 

Linkedin: www.linkedin.com/in/misha-desai-6034a362 

 

👤 Host: Estera Kot, Senior Product Manager at Microsoft and a member of the Fabric Product Group. She holds the role of Product Owner for Apache Spark-based runtimes in Microsoft Fabric and Synapse Analytics. Estera is a Data & AI Architect and is passionate about computer science. 
LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

🔔 Stay Updated: For more insights into Microsoft Fabric Data Engineering and Data Science, and all things tech, make sure to subscribe to our channel and hit the notification bell so you never miss an episode!
Low Code AutoML UI in Microsoft Fabric Data Science
With high concurrency mode, we’re bringing a game-changing way to orchestrate your data ingestion and transformation processes in enterprise workflows. Notebooks in pipelines now leverage shared, high-performance sessions, combining speed with cost-efficiency—all while maintaining top-tier security. 

 Imagine a pipeline with five notebooks, each running 5 minutes. Normally, the 3-minute Spark start time per step would push your pipeline to 40 minutes. With high concurrency, the total runtime drops to 28 minutes—a 30% speed improvement. 

Unlock faster workflows, lower costs, and a seamless data journey with high concurrency mode. Get ready to experience the next level of pipeline orchestration! 🎉 

https://blog.fabric.microsoft.com/en-US/blog/introducing-high-concurrency-mode-for-notebooks-in-pipelines-for-fabric-spark/  

🎙 Meet the Speakers: 

👤 Guest from Product Group: Santhosh Kumar Ravindran, Senior Product Manager 

 

Santhosh Ravindran currently leads Spark Compute and Settings for Microsoft Fabric Spark. He focuses on building capabilities that meet the data engineering needs like Spark Pools, Queueing and Scheduling, and Job orchestration for big data workloads in large enterprises using Spark. Prior to this, Santhosh was a Product Manager and Engineering Lead building Metadata scanners, access policy orchestration, lineage and data catalog systems for enterprise data estates as part of the Microsoft Governance Platform (Microsoft Purview). 

 

LinkedIn:  https://www.linkedin.com/in/thisissanthoshkumar/   

Twitter:  https://www.twitter.com/iamsanthoshkr  

 

 

👤 Host: Estera Kot, Principal Product Manager at Microsoft. 

LinkedIn: https://www.linkedin.com/in/esterakot/ 
Twitter: https://twitter.com/estera_kot 

 

👍 Liked this video? Don't forget to hit the 'Like' button and share it with your peers!
High Concurrency Mode for Notebooks in Pipelines for Fabric Spark
Load More... Subscribe