Thursday, March 15, 2012

Execution Plans – Merge Join

The Merge Join is a Physical Operation when joining 2 sets of data that are in the same order.

There is an interesting Clustered Index in the Adventure Works database. The SaleOrderDetail table’s primary key is a combination of the SalesOrderID and SalesOrderDetailID, not just the identity field of the table (SalesOrderDetailID). Most developers I work with usually create the primary key (by default is the clustered index) on the Identity fields with a unique constraint on the business key of the table.

The compound primary key might take more space, but it looks to help query plans have more options when optimizing.

The following query will use a Merge Join after doing a clustered index scan on SalesOrderHeader and SalesOrderDetail.

SELECT *
    FROM Sales.SalesOrderHeader soh
        INNER JOIN Sales.SalesOrderDetail sod
            ON sod.SalesOrderID = soh.SalesOrderID

image

Now, of course, no one does a SELECT *, right? Maybe not, but you can see that since the first column in the SalesOrderHeader and SalesOrderDetail clustered indexes is the SalesOrderID, the query optimizer can Merge the 2 data sets together because they are in the same order. The optimizer knows this is the best join.

A benefit to the Merge Join operation is once the first piece of data (SalesOrderID) is merged with Detail data, the data can be passed to the next operator, which can improve performance if further processing is needed.

Say I want a list of SalesOrderIDs by ProductID and retrieve the Product Name from the product tables

SELECT sod.SalesOrderID,
        p.ProductID, p.Name
    FROM Production.Product p
        INNER JOIN Sales.SalesOrderDetail sod
            ON sod.ProductID = p.ProductID
   

image

The SalesOrderDetail Scan is on a non-clustered index on ProductID. This is how the Merge is able to be used with the Clustered Index scan of the Product table

Below is the info on the Scan Operators which shows the Object (index) used in the operator.

imageimage

What happens when we add a column to this query – Order Qty.

SELECT sod.SalesOrderID, sod.OrderQty,
        p.ProductID, p.Name
    FROM Production.Product p
        INNER JOIN Sales.SalesOrderDetail sod
            ON sod.ProductID = p.ProductID
   

image

The optimizer determines the Hash Match physical join is more efficient. The Cost is 1.7226.

An option here is to create a covering index by adding the OrderQty column to a non-cluster index using the INCLUDE statement of the Create Index statement.

CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_ProductIDIncludeOrderQty] ON [Sales].[SalesOrderDetail]
    ([ProductID] ASC )
    INCLUDE (OrderQty)

image

You can force a Merge Join with a query hint to see the difference without the Covering Index.

-- Force Merge join
SELECT sod.SalesOrderID, sod.OrderQty,
        p.ProductID, p.Name
    FROM Production.Product p
        INNER MERGE JOIN Sales.SalesOrderDetail sod
            ON sod.ProductID = p.ProductID
  

The Cost is now  6.3887 and the query runs with parallelism. And, a Sort operator (74% of query cost) is needed if you do not have the covering index above.

image

Query Hints should be left to the experts.

If the data sets are not in the same order based on the Index (clustered or non-clustered), the optimizer can still use a Sort operator to order the data sets in the same logical order to use a Merge Operation.

The AdventureWorks database is available from CodePlex at http://msftdbprodsamples.codeplex.com/

I first learned Execution Plans from a free PDF provided by Red Gate with author Grant Fritchey. You can get this PDF at http://www.simple-talk.com/sql/performance/execution-plan-basics/

Happy Query Tuning from Louisiana!!!

Friday, March 9, 2012

Execution Plans – Nested Loop

The nested loop is a physical operator used to join 2 or more sets of data when the query optimizer believes this is the best plan for the query.

The following query can be run against the AdventureWorks database:

SELECT cust.CustomerID, soh.SalesOrderID
    FROM Sales.Customer cust
        INNER JOIN Sales.SalesOrderHeader soh
            ON soh.CustomerID = cust.CustomerID
    WHERE cust.CustomerID = 11091

The query joins the Customer table with the Sales Order Header by the CustomerID column (ON soh.CustomerID = cust.CustomerID). This is a Transaction table (Sales Order Header) joined to a lookup table (Customer). You get a Query Plan like the following:

image

The Clustered Index Seek on the Customer table retrieves the data for a customer based on the WHERE clause looking for CustomerID 11091. Since the rows retrieved from the customer table is smaller than the Sales Order Header, the customer becomes the outer loop of the Nested Join. The inner part of the loop looks for Sales Order Header rows based on the CustomerID 11091.

The Index Seek happens because there is an Index (Non-Clustered) on CustomerID for the SalesOrderHeader table. The SELECT part of the query only needs the SalesOrderID which is part of the Non-Clustered index. The clustered index on the Sales Order Header table includes the SalesOrderID column. All clustered index columns are included in the leaf level of the non-clustered index (idxSalesOrderHeader_CustomerID).

image

By hovering your mouse over the nested loop, a tool tip gives you more info, including a description of the Nested Loop, Physical and Logical Operation, and much more.

The Actual Number of Rows shows that the Sales Order Header had 28 rows for this customer.

By adding more columns to the SELECT for data from Sales Order Header, like AccountNumber and OrderDate, we get a second Nested Loop.

SELECT cust.CustomerID, soh.SalesOrderID, soh.AccountNumber, soh.OrderDate
    FROM Sales.Customer cust
        INNER JOIN Sales.SalesOrderHeader soh
            ON soh.CustomerID = cust.CustomerID
    WHERE cust.CustomerID = 11091

image

What happens now, is a second Nested Loop is needed to get the AccountNumber and OrderDate from the Clustered Index of the SalesOrderHeader tables. This is because the Non-clustered index on the CustomerID does not “cover” the columns needed for the SELECT but can be used to Seek the SalesOrderID values needed to satisfy the WHERE on CustomerID 11091. The Key Lookup (Clustered) is OK because it is not a scan.

The query can be improved by adding a Covering Index. The following index will use the INCLUDE clause of the CREATE INDEX statement to include the AccountNumber and OrderDate in the leaf level of the index and not the tree level.

    CREATE NONCLUSTERED INDEX idxSalesOrderHeader_CustomerID_IncludeAccountNumberOrderDate
        ON [Sales].[SalesOrderHeader] ([CustomerID])
        INCLUDE (AccountNumber, OrderDate)
GO

The second Nested Loop (Cost 93%) of the query will be removed.

image 

The Index Seek is now covering the SELECT for SalesOrderHeader data.

We have shown here a basic explanation and example of a Nested Loop along with some information on Cluster and Non-Cluster index Seeks as well as a Key Lookup.

Thomas