For years I have been reluctant to use the UNION operator in SQL because it tends to create quite complex queries, until recently having stumbled upon a scenario in which using UNION operator decreased the execution time of a query from five seconds to zero seconds. The volume of data against the query was executed is 5 million records, so this is a very welcome improvement. The following article reveals the details.
When fetching values using a SELECT statement for columns that can take a limited number of values (such as bit), using UNION with multiple queries in which we add all possible combinations of the column values and use TOP(1) for SELECT statement yields better performance versus one query in which we use DISTINCT operator in order to fetch the values.
I have replicated the scenario using a Customer table with the following structure:
The Type column can have one of the following values: ‘Regular’, ‘Silver’, ‘Gold’
The Status column can have one of the following values: 0 or 1.
5 million records were inserted in the table.
A C# API that is using Entity Framework in order to get distinct values for customer type and customer status generated the following SQL:
SELECT DISTINCT [Type], [Status] FROM [dbo].[Customer]
As a side note the SQL generated by Entity Framework includes aliases like ‘Extent1’ for Customer table, the aliases have been removed in the sample queries provided for more clarity.
This query was executed 5 times. The average query execution time was around 5 seconds on a DB server under heavy load (80% CPU, 90% memory used). Average execution time on a DB server with little load (20% CPU 43% memory) was 0.28 seconds.
The execution time is reduced by around 50% if a non-clustered index is added on Type and Status columns.
To get the execution time for the queries we can use the command:
SET STATISTICS TIME ON
The API has been updated so that the SQL search is not performed in all the records of the Customer table. To do that the TOP(1) clause comes to in handy, as it will stop querying the records in the table once a records that satisfies the filtering criteria (if any) is found:
SELECT TOP(1) [Type], [Status] FROM [dbo].[Customer] WHERE [Type] = 'Gold' AND [Status] = 0 UNION SELECT TOP(1) [Type], [Status] FROM [dbo].[Customer] WHERE [Type] = 'Gold' AND [Status] = 1 UNION SELECT TOP(1) [Type], [Status] FROM [dbo].[Customer] WHERE [Type] = 'Silver' AND [Status] = 0 UNION SELECT TOP(1) [Type], [Status] FROM [dbo].[Customer] WHERE [Type] = 'Silver' AND [Status] = 1 UNION SELECT TOP(1) [Type], [Status] FROM [dbo].[Customer] WHERE [Type] = 'Regular' AND [Status] = 0 UNION SELECT TOP(1) [Type], [Status] FROM [dbo].[Customer] WHERE [Type] = 'Regular' AND [Status] = 1
This query was executed 5 times. The average query execution time was: 0 seconds on the DB server with little load, 0 seconds on the DB server with heady load.
In order to check the actual execution plans for the SQL queries we can use the CTRL+M shortcut in SQL Server Management Studio.
For query that used DISTINCT the execution plan yielded the fact that the DB server had to perform a clustered index scan on all 5 million records in the Customer table:
For the query that used UNION the execution plan revealed that a clustered index scan was performed for each query, but the scan was performed on a very small number of records.
In the image below we can check the Number of Rows Read in order to see how many records the DB server scanned in order to find one that matched the WHERE clause.
In other words, the DB server does not need to look in all the 5 million records of the Customer table, if a record is found that matches the WHERE clause, the search ends. Thus, the performance of the SQL query is greatly improved.
After checking what the execution plans look like when the unique values we search for are not randomly spread inside our test data and knowing that a clustered index orders the rows physically inside a table, I took the most disadvantageous cases when the unique values we search for are present in the last records from the table.
Curious what happened? Let’s check it out!
The Customer2 table was created in which the first 4.999.999 records have Status = 0, while the last record had Status = 1. The type was randomly inserted.
Here is the execution plan for the UNION query:
So, the query took 0.77 seconds to execute, and when searching for records where Status = 1, the DB server read all 5 million records in the table.
The Customer3 table was created in which the first 4.999.998 records have Type = ‘Regular’, while the last records have Type = ‘Silver’ and Type = ‘Gold’. The Status was randomly inserted.
Here is the execution plan for the UNION query:
The execution time is 1.2 seconds. Sadly, as there is no record with Status = 0 and Type = ‘Gold’, the DB servers scanned all 5 million records with no result.
The execution plan and execution time for DISTINCT queries are the same for Customer2 and Customer3 tables, as the ones presented in the beginning of this section.
Try it yourself
SQL scripts that allow you to create the Customer table and insert 5 million records in it can be found here: https://github.com/alinruscior/sql-union
You will need access to a database and a SQL server to play around, you can download SQL Express from here: https://www.microsoft.com/en-us/sql-server/sql-server-downloads
For checking how much the execution of a query took you can use SET STATISTICS TIME ON (SQL command) or perform SQL tracing.
Final Words / Lessons Learned
Here are some lessons learned while improving the SQL queries performance are:
- Even if a query looks ugly to you it might look beautiful for SQL server
- Entity Framework prefers generating UNION ALL and DISTINCT, instead of UNION. Still the performance gained is like the one mentioned in this article.
- Always check the execution plans for SQL queries, when the execution plan reveals a scan against a large number of records think of alternatives to improve the query and limit the number of records that the DB server scans.
- The performance improvement described in this article varies depending on how the data is spread in the table that is queried.