When merging large datasets in Azure SQL Database, it's imperative to optimize your queries. Failure to do so will most likely result in error 40552: The session has been terminated because of excessive transaction log space usage.
Optimizing a MERGE statement can be achieved through various individual optimizations, which depend on the shape and size of the datasets that need to be merged. In my case, the following optimizations did the trick.
- Create an index on the join columns in the source table that is unique and covering.
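For the example tables used throughout this post, such an index might look like the following sketch (the index name is illustrative; INCLUDE lists the non-key columns the MERGE reads so the index covers the query):

```sql
-- Unique, covering index on the source table's join column.
-- INCLUDE makes the index cover [Name] and [LastName] without
-- adding them to the index key.
CREATE UNIQUE INDEX IX_CustomerInsertTable_Email
ON CustomerInsertTable ([Email])
INCLUDE ([Name], [LastName]);
```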
- Create a unique clustered index on the join columns in the target table.
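For the target table in the examples below, that could be sketched as (again, the index name is illustrative):

```sql
-- Unique clustered index on the target table's join column,
-- so target rows are physically ordered by [Email].
CREATE UNIQUE CLUSTERED INDEX IX_Customers_Email
ON Customers ([Email]);
```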
- Parameterize all literal values in the ON clause and in the WHEN clauses.
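One way to do this is to execute the MERGE through sp_executesql and pass the literals as parameters. A minimal sketch (the [Status] column and its filter are hypothetical, purely to show a WHEN-clause literal being parameterized):

```sql
-- Sketch: pass the WHEN-clause literal as a parameter instead of
-- embedding it in the statement text, so the plan can be reused.
DECLARE @status AS int = 1;

EXEC sp_executesql
    N'MERGE Customers AS Target
      USING CustomerInsertTable AS Source
      ON (Target.[Email] = Source.[Email])
      WHEN MATCHED AND Target.[Status] = @status THEN
          UPDATE SET Target.[Name] = Source.[Name];',
    N'@status int',
    @status = @status;
```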
- Merge subsets of data from the source into the target table, either by using OFFSET and ROWS FETCH NEXT, or by defining views on the source or target that return the filtered rows and referencing those views as the source or target table. Note that using the WITH <common table expression> clause or the TOP clause to filter out rows from the source or target tables is not recommended, because they can generate incorrect results.
```sql
DECLARE @offset AS bigint;
DECLARE @offsetIndex AS int = 0;
DECLARE @batchSize AS int = 10000;

SELECT @offset = @offsetIndex * @batchSize;

MERGE Customers AS Target
USING (SELECT [Name], [LastName], [Email]
       FROM CustomerInsertTable
       ORDER BY [Email]
       OFFSET @offset ROWS FETCH NEXT @batchSize ROWS ONLY) AS Source
ON (Target.[Email] = Source.[Email])
WHEN MATCHED THEN
    UPDATE SET Target.[Name] = Source.[Name],
               Target.[LastName] = Source.[LastName]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Name], [LastName], [Email])
    VALUES (Source.[Name], Source.[LastName], Source.[Email]);
```
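The batched MERGE can be driven in a loop until a batch touches no rows. A sketch, assuming every source row either matches (update) or inserts, so @@ROWCOUNT reaches zero exactly when the batches are exhausted:

```sql
-- Sketch: repeat the batched MERGE until a batch affects no rows.
DECLARE @offsetIndex AS int = 0;
DECLARE @batchSize AS int = 10000;
DECLARE @rows AS int = 1;

WHILE @rows > 0
BEGIN
    DECLARE @offset AS bigint = @offsetIndex * @batchSize;

    MERGE Customers AS Target
    USING (SELECT [Name], [LastName], [Email]
           FROM CustomerInsertTable
           ORDER BY [Email]
           OFFSET @offset ROWS FETCH NEXT @batchSize ROWS ONLY) AS Source
    ON (Target.[Email] = Source.[Email])
    WHEN MATCHED THEN
        UPDATE SET Target.[Name] = Source.[Name],
                   Target.[LastName] = Source.[LastName]
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([Name], [LastName], [Email])
        VALUES (Source.[Name], Source.[LastName], Source.[Email]);

    -- Number of rows affected by the last MERGE batch.
    SET @rows = @@ROWCOUNT;
    SET @offsetIndex += 1;
END;
```

Each iteration is its own transaction, which keeps transaction log usage per statement bounded by the batch size.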
- When the source is smaller than the target table, a nested loops operator is preferable.
```sql
MERGE Customers AS Target
USING (SELECT [Name], [LastName], [Email]
       FROM CustomerInsertTable) AS Source
ON (Target.[Email] = Source.[Email])
WHEN MATCHED THEN
    UPDATE SET Target.[Name] = Source.[Name],
               Target.[LastName] = Source.[LastName]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Name], [LastName], [Email])
    VALUES (Source.[Name], Source.[LastName], Source.[Email])
OPTION (LOOP JOIN);
```
- When the source and target tables are of similar size and the indexes described above are applied to both, a merge join operator is the most efficient.
```sql
MERGE Customers AS Target
USING (SELECT [Name], [LastName], [Email]
       FROM CustomerInsertTable) AS Source
ON (Target.[Email] = Source.[Email])
WHEN MATCHED THEN
    UPDATE SET Target.[Name] = Source.[Name],
               Target.[LastName] = Source.[LastName]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Name], [LastName], [Email])
    VALUES (Source.[Name], Source.[LastName], Source.[Email])
OPTION (MERGE JOIN);
```
After putting all these optimizations in place, a MERGE query on a 1M-record dataset took 2 minutes of execution time instead of 20 minutes. To optimize the query further, try different batch sizes.
Possible optimizations for MERGE queries are not limited to the optimizations mentioned in this blog post. If you find other interesting optimizations feel free to mention them in the comments below.
More information
- Optimizing MERGE Statement Performance
- Create Clustered Indexes
- Create Nonclustered Indexes
- Query Hints (Transact-SQL)
- Nested Loops Showplan Operator
- Merge Join Showplan Operator
- Inserting, Updating, and Deleting Data by Using MERGE
- Modify Data Through a View
- Join Fundamentals
- Advanced Query Tuning Concepts
- Specify Query Parameterization Behavior by Using Plan Guides