Daniel Crawford's Data Blog

Tips for performant database designs


Synapse Fundamentals for Tuning – Nesting

Posted on November 23, 2022 by Daniel Crawford

Calling stored procedures from other stored procedures or referencing a view within another view can introduce performance complexities that are difficult to troubleshoot. I will break down view nesting and stored procedure nesting issues individually, but the same message applies to both: avoid heavily nested objects in Synapse if reliable performance is a priority. Understandably, we like the idea of nesting objects for security purposes, for reusability, and as building blocks for larger processes and queries. There are other ways of accomplishing the same tasks, but they regrettably require more code in order to maintain better control over query plans, tempdb consumption, and memory consumption.

Views

Query engines (both SQL and DSQL) attempt predicate pushdown to arrive at the most efficient query plan. The challenge with optimizing nested views is that there are often no statistics for the intermediate datasets, or there are calculated columns (think of CASE statements, field manipulation, or data type conversions in the SELECT clause of the nested view) that are needed to complete the larger query, which appears to be querying only a single view. As discussed before, statistics are key to getting the best query plan, and nested views work against us in tuning efforts. Trying to untangle query plans from one view to the next is difficult, if not impossible, when nested views involve many tables that are sometimes referenced multiple times. The query plans produced from your estimated plans, or even the actual plan post execution (when reviewing execution steps from DMVs), are confusing and make it nearly impossible to understand why a table is being broadcast or shuffled. Tuning this type of query is a very time-consuming task with a varying return on investment, since execution may not even be consistent from one run to the next.

In these cases, it is recommended to rewrite the query to meet the business need. This means going back to the business to understand what they are trying to accomplish and then dissecting the end-user query and the associated views to understand why the data is being manipulated. Rewriting the query without views also means that we are no longer reusing code… or does it? If we are nesting views, it is likely that there is some data transformation or reusable code that should be pushed further upstream. Nesting views typically means we are applying business logic repeatedly at read time instead of pushing that logic into our transformations during data preparation for the warehouse. Remember that one of the main reasons for a data warehouse is data preparation for performant analytics and reporting. If you have nested views, you may not actually be preparing your data for read performance.
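As a minimal sketch of pushing view logic upstream, the hypothetical nested views below hide a CASE expression from the optimizer at read time, while the CTAS statement materializes the same logic once during data preparation so downstream queries hit a plain table with statistics. All object and column names here are illustrative assumptions, not from a real workload.

-- Nested-view pattern: the CASE logic is re-evaluated on every read
CREATE VIEW dbo.vw_SalesBase AS
SELECT SalesKey,
       CASE WHEN Amount < 0 THEN 0 ELSE Amount END AS CleanAmount
FROM dbo.FactSales;
GO
CREATE VIEW dbo.vw_SalesReport AS
SELECT SalesKey, CleanAmount
FROM dbo.vw_SalesBase;
GO

-- Alternative: materialize the logic once during data preparation (CTAS),
-- then create statistics so the optimizer has what it needs at read time
CREATE TABLE dbo.FactSalesPrepared
WITH (DISTRIBUTION = HASH(SalesKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT SalesKey,
       CASE WHEN Amount < 0 THEN 0 ELSE Amount END AS CleanAmount
FROM dbo.FactSales;

CREATE STATISTICS stat_CleanAmount ON dbo.FactSalesPrepared (CleanAmount);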

What about security?

Some customers leverage views to abstract tables from end users and then apply a layer of security for data access. In essence, they are creating a virtual data mart on top of the warehouse. In some cases, filter criteria are applied in a view based upon the login to further restrict data access at the row level. If you are applying filter criteria in a view based upon a username, I beg you to explore the Row Level Security features (a minimal sketch follows the recommendations below). For anyone else who is simply using views as a way to abstract tables and apply security to them, I have a few comments and recommendations:

1.) Review whether the views add any benefit over applying the same security (SELECT permissions) directly to the base tables. (Remember that Row Level Security and Column Level Encryption are additional options if there is concern about fine-grained data security within a table.)

2.) Review how users are accessing the data. If you are applying filters, manipulating the data, and applying business logic within a view for a particular user group, you may want to consider materializing that data in a separate table instead. (Think of creating a physical data mart rather than a virtual data mart.) This may go against some core design principles that are proliferated today: engineers and architects are looking for reusability, a single source of truth, and a single copy of the data, but I will point you back to my earlier comment that part of data warehousing is preparing data for performant analytics and reporting. Also, if we truly wanted to adopt the "one copy of data" approach, we would just query the transactional source system directly. I don't want to see data duplicated any more than the next engineer/architect, but I would rather the design be performant and non-impactful to the business than a completely sterile environment with strict one-copy rules that may not benefit the business.
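As a minimal sketch of the Row Level Security alternative to username filters inside views, the inline predicate function and security policy below restrict rows by the current database user. The schema, table, and column names (Security, dbo.FactSales, SalesRep, 'SalesManager') are assumptions for illustration only.

CREATE SCHEMA Security;
GO
-- Inline table-valued predicate function evaluated per row
CREATE FUNCTION Security.fn_SalesFilter(@SalesRep AS sysname)
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_result
           WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'SalesManager';
GO
-- Bind the predicate to the table; rows a user cannot see simply disappear
CREATE SECURITY POLICY SalesFilterPolicy
ADD FILTER PREDICATE Security.fn_SalesFilter(SalesRep)
ON dbo.FactSales
WITH (STATE = ON);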

Stored Procedures

In Synapse Dedicated SQL Pools, I recommend taking a train-car approach with small units of work that can be combined to reach your desired outcome (a data transformation or query result set). Trying to do too much in a single stored procedure leaves session-level temp tables and caches intact until the parent session is deallocated. In some extreme cases, I have seen parent stored procedures run not just hours but days. This creates tempdb and memory pressure and is compounded by poor performance of queries within the stored procedures. It is best to break the end-to-end unit of work into bite-size pieces. Manage your own temp tables (and statistics on those temp tables if needed) instead of letting the DW query engine create its own QTables for data movement, as sketched below. Thoughtfully leverage temp tables in a way that minimizes long-term storage, and purposefully deallocate/drop temp tables when you are done with them. If you are doing a multi-step process, consider the re-runnability of the stored procedure. It may make sense to execute the large stored procedure in smaller pieces orchestrated via a Synapse pipeline, which guarantees sequential execution of steps and allows failed steps to be rerun independently of the parent process.
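A minimal sketch of managing your own temp table inside one of those small units of work, assuming illustrative table and column names (dbo.FactOrders, CustomerKey):

-- Stage only the rows needed for the next step, with an explicit distribution
CREATE TABLE #StagedOrders
WITH (DISTRIBUTION = HASH(CustomerKey), HEAP)
AS
SELECT CustomerKey, OrderDate, Amount
FROM dbo.FactOrders
WHERE OrderDate >= '2022-01-01';

-- Give the optimizer statistics on the join/filter column
CREATE STATISTICS stat_CustomerKey ON #StagedOrders (CustomerKey);

-- ... joins/transformations against #StagedOrders go here ...

-- Drop the temp table as soon as it is no longer needed instead of
-- letting it live for the life of a long-running parent session
DROP TABLE #StagedOrders;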

As an example, instead of creating a master stored procedure that sequentially loops through reprocessing partitions in a large fact table by calling child stored procedures, consider having a single stored procedure with the base code to process one partition, then orchestrating the partition processing via a Synapse pipeline so multiple partitions can be processed in parallel. This allows both parallel processing and the re-running of specific partitions if needed. If you are worried about data consistency for end users while multiple stored procedures are called separately, or about the ability to roll back, remember that Synapse has a great RENAME feature that lets you segregate the processing of your data from your production, end-user consumable tables. You can transform/process your data in a set of tables identical to your serving layer and isolate them for processing only. When processing completes, simply rename the tables to expose them as prepared tables with no impact to end users (see the sketch below).
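A minimal sketch of that swap using the dedicated SQL pool RENAME OBJECT statement; the staging and production table names (dbo.FactSales_stage, dbo.FactSales) are assumptions for illustration:

-- Processing happens against an isolated staging copy of the serving table,
-- loaded/transformed by the pipeline steps described above

-- Swap: retire the current production table, then promote the staging table
RENAME OBJECT dbo.FactSales TO FactSales_old;
RENAME OBJECT dbo.FactSales_stage TO FactSales;

-- Clean up the retired copy once you are confident in the new data
DROP TABLE dbo.FactSales_old;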

In Summary

Nesting relinquishes control of your queries and processes back to the DW engine. While the DW engine works well for most scenarios, it doesn't know your exact scenario; therefore, I always recommend maintaining control over your own database tuning. This means denormalizing and making copies of tables for performance or security purposes. It also means creating re-runnable, isolated, and atomic processes that are manageable for developers, easy to maintain, and easy to tune.
