If you are coming from a SQL Server world, what are the fundamental concepts in Synapse that are essential to building a performant Dedicated SQL Pool? How does Synapse differ from regular SQL Server? Let's start with the most foundational principle: distribution.
You might hear the term "MPP" thrown around when we talk about Synapse. MPP stands for Massively Parallel Processing – which is just a fancy way of saying that there are multiple compute nodes processing data in parallel. This contrasts with SQL Server, which is traditionally an SMP or Symmetric Multi-Processing platform (a single compute node with multiple processors). Under the covers, a Synapse Dedicated SQL Pool always consists of 60 distributions, and depending on the service level those distributions can be spread across as many as 60 compute nodes, each with multiple processors of its own. This allows for scalability at a massive level.
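If you want to see this for yourself, the Dedicated SQL Pool exposes DMVs that describe the nodes behind your pool. A minimal sketch using sys.dm_pdw_nodes – the number of compute nodes you see will depend on the service level you have provisioned:

```sql
-- List the control and compute nodes currently backing the pool.
-- More DWUs means more compute nodes sharing the same 60 distributions.
SELECT pdw_node_id, [type], name
FROM sys.dm_pdw_nodes
ORDER BY [type], pdw_node_id;
```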
So how does this translate into why my queries are slower than expected in my Dedicated SQL Pool?
Say, for example, you have a query running in SQL Server – it reads 100,000 records and returns 1,000 records to the client in 10 seconds. If the same dataset is moved to Synapse, you might expect it to run in a fraction of the time (sub-second, in fact) because there is so much more compute power behind it. But you might be surprised to find that it doesn't necessarily complete any faster and is potentially even slower for small datasets. That is because data must be distributed effectively to utilize all the compute power available.
Because we have 60 distributions, our data is sharded across 60 different SQL instances, with a DW query engine coordinating the work across those instances. If you have ever partitioned data in SQL Server and created a partition scheme that separates partitions onto separate disks, this is an almost identical concept. The difference is that instead of a partition column we use a distribution column, and neither a partition function nor a partition scheme needs to be defined manually in Synapse. A hash algorithm hashes the distribution column and spreads the data across the 60 distributions automatically. This, however, doesn't necessarily mean the data is distributed evenly, and that is where we can run into problems.
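For example, a hash-distributed table is declared with a single distribution column in the table's WITH clause. A minimal sketch – the table and column names here are made up for illustration:

```sql
-- Hash-distribute the fact table on CustomerKey.
-- Every row whose CustomerKey hashes to the same value lands on the same distribution.
CREATE TABLE dbo.FactSales
(
    CustomerKey   INT            NOT NULL,
    OrderDateKey  INT            NOT NULL,
    SalesAmount   DECIMAL(18,2)  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```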
We can run into performance problems when data is skewed or distributed unevenly, putting large quantities of data on a few distributions instead of spreading the workload evenly across all of them. Skew can be introduced in the physical storage of the data as well as in the querying of the data (think query filters or aggregations on heavily skewed fields). A query can only run as fast as its slowest distribution, so we want to avoid skew not only on disk but also in our queries. Additionally, choosing the wrong distribution key frequently causes data movement between distributions. This data movement is typically the largest consumer of resources and contributes the most to the overall wall-clock time of a query. Choosing the right distribution key is critical to good performance in Synapse.
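One way to spot data movement before it hurts you is the EXPLAIN command, which returns the distributed query plan as XML; shuffle and broadcast move operations in that plan are the data movement to watch for. A hedged sketch using the illustrative tables from this post (dbo.DimCustomer is assumed to exist):

```sql
-- Inspect the distributed plan without running the query.
-- SHUFFLE_MOVE or BROADCAST_MOVE operations in the output indicate rows
-- being redistributed between the 60 distributions to satisfy the join.
EXPLAIN
SELECT c.Region, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS c
    ON f.CustomerKey = c.CustomerKey
GROUP BY c.Region;
```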
How to choose the right distribution key
As stated above, there are two components to choosing a good distribution key.
- First, you should choose a field that has high cardinality. Having 1,000 or more unique values is a good place to start. The more unique values the field has, the more even your distribution will be.
- Second, you should choose a field that is frequently used to join to other tables and/or included in aggregations (GROUP BY statements). The goal is to move as little data as possible to complete queries; the sketch after this list shows both criteria at work.
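Putting those two criteria together: if a large fact table and a large dimension (or two fact tables) are routinely joined on the same column, distributing both tables on that column lets the join be satisfied locally on each distribution with no shuffle. A sketch under the same illustrative schema as above:

```sql
-- Distribute both sides of a frequent join on the same column so the join
-- is co-located. dbo.FactSales (created earlier) is also hashed on CustomerKey,
-- so a join on CustomerKey requires no data movement between distributions.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,
    Region       NVARCHAR(50)  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```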
Do surrogate keys make for a good hash distribution key?
In most scenarios, I recommend using natural keys instead of a surrogate key. Natural or business keys are more likely to be used for aggregations. From an end-user standpoint, it also makes more sense to distribute on a natural key when using reporting tools such as Power BI, where data consumers may not be aware of the internals of Synapse or the metadata behind the tables.
While surrogate keys have great cardinality due to their uniqueness, they bring a list of other issues. Probably the biggest challenge is that they have to be generated sequentially at load time, which can be very costly as part of the load process. Synapse doesn't enforce primary key or unique constraints, so you must ensure uniqueness yourself during the load, and that is very costly when not done in batch. In short, I would recommend not using surrogate keys at all (unless absolutely necessary due to overlapping source-system business keys or other reasons) and instead using natural keys whenever possible.
If you have already loaded data into your Dedicated SQL Pool, you can easily check skew using the guidance here: Distributed tables design guidance – Azure Synapse Analytics | Microsoft Docs
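As a quick first pass, DBCC PDW_SHOWSPACEUSED reports row counts and space per distribution for a single table; large differences in row counts across the 60 distributions indicate skew. A minimal sketch (the table name is illustrative; the guidance linked above includes a more detailed DMV-based query):

```sql
-- Show rows and space used per distribution for one table.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");
```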
What are my other options?
Until now, I have been talking about hash distribution in Synapse, but there are two other table distribution options that we haven't talked about: round robin and replicated.
A Round Robin table is a distributed table without a hash key; records are assigned to distributions in round-robin fashion as they are inserted. This distributes records evenly across distributions in most cases, which is great because you are unlikely to have any skew. (My teammate Mark Price-Maher discusses an exception to this rule here.) The problem with round robin distribution is that it always results in data movement unless you are querying the entire table with no aggregations and no joins to other tables. Round Robin tables should typically be used where fast writes are more important than subsequent reads, or where the entire table will be read (think of staging tables). If you don't specify a distribution type when creating a table, round robin is the default. Be careful not to fall into the trap of letting every table be a round robin table – performance across your dedicated SQL pool will be severely impacted by it.
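For a staging table where load speed matters more than subsequent reads, a round robin table is a natural fit. A sketch with illustrative names – the HEAP option is a common companion choice for staging since it avoids columnstore compression work during the load:

```sql
-- Staging table optimized for fast writes rather than reads.
CREATE TABLE stg.SalesLoad
(
    CustomerKey   INT            NOT NULL,
    OrderDateKey  INT            NOT NULL,
    SalesAmount   DECIMAL(18,2)  NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
```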
Replicated tables are just what they say. Data is replicated into a cache on each of the compute nodes, so it is co-located for joins and no additional data movement needs to occur to complete a query. But there are a few gotchas that you want to watch out for:
- Replicated tables are physically stored as round robin tables.
- The cache on each of the compute nodes is invalidated every time you write to the table (insert, update, delete).
- The cache rebuilds on the first read of the data from disk and therefore the first read is slower.
- A maximum of two cache rebuild processes can run at a time, and they do consume resources in your environment.
While replicated tables sound very appealing because they eliminate data movement, they can also hurt you if the table is too dynamic, too large (less than 2 GB is the rule of thumb), or you are joining too many of them together, defeating the purpose of an MPP platform built to distribute the workload across multiple compute nodes.
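A small, mostly static dimension is the classic candidate for replication. A sketch with illustrative names; sys.pdw_replicated_table_cache_state is the catalog view that reports whether each replicated table's cache has been built:

```sql
-- Replicate a small, rarely changing dimension so joins to it need no data movement.
CREATE TABLE dbo.DimDate
(
    DateKey       INT  NOT NULL,
    CalendarYear  INT  NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);

-- Check whether the replicated caches are built ('Ready') or pending ('NotReady').
SELECT t.name, c.[state]
FROM sys.pdw_replicated_table_cache_state AS c
JOIN sys.tables AS t
    ON c.object_id = t.object_id;
```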
This is a very abbreviated overview of distribution, but it provides the context necessary for the next fundamental of Synapse query performance: statistics.