Daniel Crawford's Data Blog

Tips for performant database designs


Synapse Fundamentals for Tuning – Nesting

Posted on November 23, 2022 by Daniel Crawford

Calling stored procedures from other stored procedures or referencing a view within another view can introduce performance complexities that are difficult to troubleshoot. I will break down view nesting and stored procedure nesting issues individually, but the same message applies to both: avoid heavily nested objects in Synapse if reliable performance is a priority. Understandably, we like the idea of nesting objects for security purposes, for reusability, and as building blocks for larger processes and queries. There are other ways of accomplishing the same tasks, but they regrettably require more code in order to maintain better control over query plans, tempdb consumption, and memory consumption.

Views

Query engines (both SQL and DSQL) attempt predicate pushdown to arrive at the most efficient query plan. The challenge with optimizing nested views is that there are often no statistics for the intermediate datasets, or there are calculated columns (think of CASE statements, field manipulation, or data type conversions in the SELECT clause of the nested view) that are needed to complete the larger query, which appears to be querying only a single view. As discussed before, statistics are key to getting the best query plan, and nested views work against us in tuning efforts. Trying to untangle query plans from one view to the next is difficult, if not impossible, when nested views involve many tables that are sometimes referenced multiple times. The query plans produced from your estimated plans, or even the actual plan post execution (when reviewing execution steps from DMVs), are confusing and make it nearly impossible to understand why a table is being broadcast or shuffled. Tuning this type of query is a very time-consuming task with a varying return on investment, since execution may not even be consistent from one run to the next.

In these cases, it is recommended to rewrite the query to meet the business need. This means going back to the business to understand what they are trying to accomplish and then dissecting the end-user query and the associated views to understand why the data is being manipulated. Rewriting the query without views also means that we are no longer reusing code… or does it? If we are nesting views, it is likely that there is some data transformation or reusable code that should be pushed further upstream. Nesting views typically means we are applying business logic repeatedly at read time instead of pushing that logic into our transformations during data preparation for the warehouse. Remember that one of the main reasons for a data warehouse is data preparation for performant analytics and reporting. If you have nested views, you may not actually be preparing your data for read performance.
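As a minimal sketch of pushing view logic upstream, the hypothetical nested views below hide a CASE expression from the optimizer at read time, while the CTAS statement materializes the same logic once during data preparation so downstream queries hit a plain table with statistics. All object and column names here are illustrative assumptions, not from a real workload.

-- Nested-view pattern: the CASE logic is re-evaluated on every read
CREATE VIEW dbo.vw_SalesBase AS
SELECT SalesKey,
       CASE WHEN Amount < 0 THEN 0 ELSE Amount END AS CleanAmount
FROM dbo.FactSales;
GO
CREATE VIEW dbo.vw_SalesReport AS
SELECT SalesKey, CleanAmount
FROM dbo.vw_SalesBase;
GO

-- Alternative: materialize the logic once during data preparation (CTAS),
-- then create statistics so the optimizer has what it needs at read time
CREATE TABLE dbo.FactSalesPrepared
WITH (DISTRIBUTION = HASH(SalesKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT SalesKey,
       CASE WHEN Amount < 0 THEN 0 ELSE Amount END AS CleanAmount
FROM dbo.FactSales;

CREATE STATISTICS stat_CleanAmount ON dbo.FactSalesPrepared (CleanAmount);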

What about security?

Some customers leverage views to abstract tables from end users and then apply a layer of security for data access. In essence, they are creating a virtual data mart on top of the warehouse. In some cases, filter criteria are applied in a view based upon the login to further restrict data access at the row level. If you are applying filter criteria in a view based upon a username, I beg you to explore the Row Level Security features (a minimal sketch follows the recommendations below). For anyone else who is simply using views as a way to abstract tables and apply security to them, I have a few comments and recommendations:

1.) Review whether the views add any benefit over applying the same security (SELECT permissions) directly to the base tables. (Remember that Row Level Security and Column Level Encryption are additional options if there is concern about fine-grained data security within a table.)

2.) Review how users are accessing the data. If you are applying filters, manipulating the data, and applying business logic within a view for a particular user group, you may want to consider materializing that data in a separate table instead. (Think of creating a physical data mart rather than a virtual data mart.) This may go against some core design principles that are proliferated today: engineers and architects are looking for reusability, a single source of truth, and a single copy of the data, but I will point you back to my earlier comment that part of data warehousing is preparing data for performant analytics and reporting. Also, if we truly wanted to adopt the "one copy of data" approach, we would just query the transactional source system directly. I don't want to see data duplicated any more than the next engineer/architect, but I would rather the design be performant and non-impactful to the business than a completely sterile environment with strict one-copy rules that may not benefit the business.
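As a minimal sketch of the Row Level Security alternative to username filters inside views, the inline predicate function and security policy below restrict rows by the current database user. The schema, table, and column names (Security, dbo.FactSales, SalesRep, 'SalesManager') are assumptions for illustration only.

CREATE SCHEMA Security;
GO
-- Inline table-valued predicate function evaluated per row
CREATE FUNCTION Security.fn_SalesFilter(@SalesRep AS sysname)
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_result
           WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'SalesManager';
GO
-- Bind the predicate to the table; rows a user cannot see simply disappear
CREATE SECURITY POLICY SalesFilterPolicy
ADD FILTER PREDICATE Security.fn_SalesFilter(SalesRep)
ON dbo.FactSales
WITH (STATE = ON);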

Stored Procedures

In Synapse Dedicated SQL Pools, I recommend taking a train-car approach with small units of work that can be combined to reach your desired outcome (a data transformation or query result set). Trying to do too much in a single stored procedure leaves session-level temp tables and caches intact until the parent session is deallocated. In some extreme cases, I have seen parent stored procedures run not just hours but days. This creates tempdb and memory pressure and is compounded by poor performance of queries within the stored procedures. It is best to break the end-to-end unit of work into bite-size pieces. Manage your own temp tables (and statistics on those temp tables if needed) instead of letting the DW query engine create its own QTables for data movement, as sketched below. Thoughtfully leverage temp tables in a way that minimizes long-term storage, and purposefully deallocate/drop temp tables when you are done with them. If you are doing a multi-step process, consider the re-runnability of the stored procedure. It may make sense to execute the large stored procedure in smaller pieces orchestrated via a Synapse pipeline, which guarantees sequential execution of steps and allows failed steps to be rerun independently of the parent process.
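A minimal sketch of managing your own temp table inside one of those small units of work, assuming illustrative table and column names (dbo.FactOrders, CustomerKey):

-- Stage only the rows needed for the next step, with an explicit distribution
CREATE TABLE #StagedOrders
WITH (DISTRIBUTION = HASH(CustomerKey), HEAP)
AS
SELECT CustomerKey, OrderDate, Amount
FROM dbo.FactOrders
WHERE OrderDate >= '2022-01-01';

-- Give the optimizer statistics on the join/filter column
CREATE STATISTICS stat_CustomerKey ON #StagedOrders (CustomerKey);

-- ... joins/transformations against #StagedOrders go here ...

-- Drop the temp table as soon as it is no longer needed instead of
-- letting it live for the life of a long-running parent session
DROP TABLE #StagedOrders;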

As an example, instead of creating a master stored procedure that sequentially loops through reprocessing partitions in a large fact table by calling child stored procedures, consider having a single stored procedure with the base code to process one partition, then orchestrating the partition processing via a Synapse pipeline so multiple partitions can be processed in parallel. This allows both parallel processing and the re-running of specific partitions if needed. If you are worried about data consistency for end users while multiple stored procedures are called separately, or about the ability to roll back, remember that Synapse has a great RENAME feature that lets you segregate the processing of your data from your production, end-user consumable tables. You can transform/process your data in a set of tables identical to your serving layer and isolate them for processing only. When processing completes, simply rename the tables to expose them as prepared tables with no impact to end users (see the sketch below).
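A minimal sketch of that swap using the dedicated SQL pool RENAME OBJECT statement; the staging and production table names (dbo.FactSales_stage, dbo.FactSales) are assumptions for illustration:

-- Processing happens against an isolated staging copy of the serving table,
-- loaded/transformed by the pipeline steps described above

-- Swap: retire the current production table, then promote the staging table
RENAME OBJECT dbo.FactSales TO FactSales_old;
RENAME OBJECT dbo.FactSales_stage TO FactSales;

-- Clean up the retired copy once you are confident in the new data
DROP TABLE dbo.FactSales_old;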

In Summary

Nesting relinquishes control of your queries and processes back to the DW engine. While the DW engine works well for most scenarios, it doesn't know your exact scenario; therefore, I always recommend maintaining control over your own database tuning. This means denormalizing and making copies of tables for performance or security purposes. It also means creating re-runnable, isolated, and atomic processes that are manageable for developers, easy to maintain, and easy to tune.
