<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Clay Lenhart's Blog &#187; SQL Server Development</title>
	<atom:link href="http://clay.lenharts.net/blog/category/sql-server-development/feed/" rel="self" type="application/rss+xml" />
	<link>http://clay.lenharts.net/blog</link>
	<description>A blog on .Net and SQL Server</description>
	<lastBuildDate>Tue, 31 Oct 2017 10:34:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.2.2</generator>
	<item>
		<title>Influencing the Execution Plan</title>
		<link>http://clay.lenharts.net/blog/2008/04/14/influencing-the-execution-plan/</link>
		<comments>http://clay.lenharts.net/blog/2008/04/14/influencing-the-execution-plan/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 20:37:56 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>
		<category><![CDATA[execution plan]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[SQL Server Engine]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/?p=36</guid>
		<description><![CDATA[I had a performance problem recently with SQL Server. This post shows an easy relatively hands-off approach to influencing the execution plan. <a href="http://clay.lenharts.net/blog/2008/04/14/influencing-the-execution-plan/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I had a performance problem recently with SQL Server, and I went through the standard <a href="http://www.sql-server-performance.com/articles/per/performance_audit_part6_p1.aspx">performance checklist</a>; however, it didn&#8217;t solve the problem permanently.  Sometimes the query would perform well, but most of the time it performed poorly.  I knew the next step was to mess with the execution plan.  This is something I really don&#8217;t like.<span id="more-36"></span></p>
<p>You do not want to force SQL Server to use a particular execution plan, because SQL Server can pick different execution plans depending on how much data will be processed.  When it processes a few rows, it will choose a plan that is optimized for a few rows (and typically use nested loops).  If the same script processes a lot of rows, it will use a plan that is optimized for a lot of rows (and use merge joins or hash joins).  By forcing SQL Server to use a single execution plan, you prevent it from using the most efficient execution plan for different scenarios.</p>
<p>But what happens if SQL Server estimates the wrong number of rows?  The worst thing it can do is estimate few rows, use an execution plan optimized for a few rows, and actually process a large number of rows.  In this scenario, you will find a very slow query.</p>
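<p>One quick way to see whether the estimate is wrong (my sketch, not part of the original checklist) is to compare the optimizer&#8217;s estimated row counts against the actual ones:</p>

```sql
-- Emits one row per plan operator after each statement runs,
-- including EstimateRows (the optimizer's guess) and Rows (reality).
SET STATISTICS PROFILE ON;

SELECT th.SourceID
FROM Customer cust
INNER JOIN TransactionHeader th ON th.CustomerID = cust.CustomerID
WHERE cust.CustomerSourceID IS NULL;

SET STATISTICS PROFILE OFF;
```

<p>A large gap between the two columns on a join operator is the symptom described above: a plan optimized for a few rows grinding through many.</p>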
<p>I found an easy fix for this situation.  Add the <a href="http://msdn2.microsoft.com/en-us/library/ms181714.aspx">OPTION</a>(HASH JOIN, MERGE JOIN) modifier to any SELECT, INSERT, UPDATE, or DELETE statement.  For instance:</p>
<pre class="brush: sql; title: ; notranslate">UPDATE cust
SET CustomerSourceID = th.SourceID
FROM Customer cust
INNER JOIN TransactionHeader th ON th.CustomerID = cust.CustomerID
WHERE cust.CustomerSourceID IS NULL
OPTION (HASH JOIN, MERGE JOIN)</pre>
<p>The OPTION (HASH JOIN, MERGE JOIN) modifier does not allow SQL Server to use nested loops.  Since nested loops are typically efficient for a small number of rows, this causes SQL Server to optimize your query for a large number of rows.  Even if this query encounters a few rows, the plan will be moderately efficient.</p>
<p>The good things about OPTION (HASH JOIN, MERGE JOIN) are:</p>
<ul>
<li>It does not require a statement to be restructured.</li>
<li>It is unlikely to introduce bugs.</li>
</ul>
<p>The bad thing about it is:</p>
<ul>
<li>You prevent SQL Server from selecting the best execution plan for all scenarios.  The plan will be optimized for a large number of rows.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2008/04/14/influencing-the-execution-plan/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>HierarchyID in SQL Server 2008</title>
		<link>http://clay.lenharts.net/blog/2008/02/23/hierarchyid-in-sql-server-2008/</link>
		<comments>http://clay.lenharts.net/blog/2008/02/23/hierarchyid-in-sql-server-2008/#comments</comments>
		<pubDate>Sat, 23 Feb 2008 20:04:58 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>
		<category><![CDATA[SQL Server 2008]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/2008/02/23/hierarchyid-in-sql-server-2008/</guid>
		<description><![CDATA[SQL Server 2008 includes a new HierarchyID datatype!]]></description>
				<content:encoded><![CDATA[<p>SQL Server 2008 includes a new <a href="http://www.sql-server-performance.com/articles/dev/new_data_types_sql_server-2008_p1.aspx">HierarchyID datatype</a>!</p>
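<p>The linked article has the details.  As a minimal sketch of my own (the table and values are made up, not from the article), the type stores a position in a tree and exposes methods to query it:</p>

```sql
-- hierarchyid encodes a node's position in a tree as a compact binary path.
CREATE TABLE Org (
  NodeID hierarchyid PRIMARY KEY,
  Name   nvarchar(100) NOT NULL
);

INSERT INTO Org (NodeID, Name) VALUES (hierarchyid::GetRoot(), N'CEO');
INSERT INTO Org (NodeID, Name) VALUES (hierarchyid::Parse('/1/'), N'CTO');
INSERT INTO Org (NodeID, Name) VALUES (hierarchyid::Parse('/1/1/'), N'Developer');

-- Built-in methods answer tree questions without recursive CTEs:
SELECT Name, NodeID.ToString() AS Path, NodeID.GetLevel() AS Level
FROM Org
WHERE NodeID.IsDescendantOf(hierarchyid::Parse('/1/')) = 1;  -- CTO and below
```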
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2008/02/23/hierarchyid-in-sql-server-2008/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>SQL Server Hash Indexes</title>
		<link>http://clay.lenharts.net/blog/2008/02/03/sql-server-hash-indexes/</link>
		<comments>http://clay.lenharts.net/blog/2008/02/03/sql-server-hash-indexes/#comments</comments>
		<pubDate>Sun, 03 Feb 2008 13:27:04 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/2008/02/03/sql-server-hash-indexes/</guid>
		<description><![CDATA[There are two problems with indexes on large nvarchar columns: You will likely hit the 900 byte limit in your index Indexing large data isn&#8217;t efficient anyway. A neat feature of SQL Server is the CHECKSUM() function which hashes your &#8230; <a href="http://clay.lenharts.net/blog/2008/02/03/sql-server-hash-indexes/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>There are two problems with indexes on large nvarchar columns:</p>
<ul>
<li>You will likely hit the 900 byte limit in your index</li>
<li>Indexing large data isn&#8217;t efficient anyway.</li>
</ul>
<p>A neat feature of SQL Server is the CHECKSUM() function, which hashes your varchar/nvarchar values into a 4-byte number.  You can then use this value in an index.  For example, if you have a Site table, add a computed column, URLChecksum.</p>
<pre lang="sql">CREATE TABLE Site (
  SiteID int NOT NULL,
  URL nvarchar(2083) NOT NULL,
  URLChecksum AS (checksum([URL])),
 CONSTRAINT [PK_Site] PRIMARY KEY CLUSTERED (SiteID)
);</pre>
<p>Next create an index on the hash and include the URL:</p>
<pre lang="sql">CREATE INDEX IX_Site ON Site (URLChecksum) INCLUDE (URL);</pre>
<p>This index will make the following query faster:</p>
<pre lang="sql">SELECT SiteID
FROM Site
WHERE URLChecksum = CHECKSUM(N'http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&amp;displaylang=en')
AND URL = N'http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&amp;displaylang=en';</pre>
<p>This query will first &#8220;seek&#8221; the hash value in the index very quickly, since the hash values are just ints.  Once it finds one or more matching hash values, it will check that the URLs match.  Since the URLChecksum, URL, and SiteID values are included in the index, this query does not need to touch the Site table.</p>
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2008/02/03/sql-server-hash-indexes/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>&#8220;Including&#8221; Columns in an Index</title>
		<link>http://clay.lenharts.net/blog/2008/02/02/including-columns-in-an-index/</link>
		<comments>http://clay.lenharts.net/blog/2008/02/02/including-columns-in-an-index/#comments</comments>
		<pubDate>Sat, 02 Feb 2008 16:08:25 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/2008/02/02/including-columns-in-an-index/</guid>
		<description><![CDATA[A neat feature in SQL Server 2005 is the ability to &#8220;include&#8221; columns in an index. These included columns are not in the main part of the index, but are additional information in the index. For example, lets say you &#8230; <a href="http://clay.lenharts.net/blog/2008/02/02/including-columns-in-an-index/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A neat feature in SQL Server 2005 is the ability to &#8220;include&#8221; columns in an index.  These included columns are not in the main part of the index, but are additional information in the index.</p>
<p>For example, let&#8217;s say you have the following SELECT statement:</p>
<pre lang="sql">SELECT Url
FROM Site
WHERE Category = 'News';</pre>
<p>You might be tempted to create a covering index with Category as the first column and Url as the second column.  However, since URLs can be 2083 characters, you can&#8217;t put the Url column in the index key, because it could exceed 900 bytes.  The query above would still benefit from the following index, where the URL is &#8220;included&#8221; in the index and therefore isn&#8217;t restricted to 900 bytes.</p>
<pre lang="sql">CREATE INDEX IX_Site ON dbo.Site (Category) INCLUDE (Url);</pre>
<p>The main part of the index only contains the category column.  The index also stores the Url, but the Url can&#8217;t be efficiently used for filtering.  In the SELECT statement above, this is OK, since the Url is only returned, not filtered.</p>
<p>Another benefit is that the main part of the index is smaller, so it is faster to find records in the index.</p>
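<p>For comparison, here is a sketch (mine, not from the original post) of what happens if you try to put Url in the index key instead:</p>

```sql
-- SQL Server creates this index but warns that the maximum key length
-- is 900 bytes; any row whose (Category, Url) key exceeds that limit
-- will then fail to INSERT or UPDATE.
CREATE INDEX IX_Site_Bad ON dbo.Site (Category, Url);
```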
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2008/02/02/including-columns-in-an-index/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The sp_getapplock secret</title>
		<link>http://clay.lenharts.net/blog/2008/01/28/the-sp_getapplock-secret/</link>
		<comments>http://clay.lenharts.net/blog/2008/01/28/the-sp_getapplock-secret/#comments</comments>
		<pubDate>Mon, 28 Jan 2008 21:39:59 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/2008/01/28/the-sp_getapplock-secret/</guid>
		<description><![CDATA[sp_getapplock is not very well advertised in SQL Server 2005, however it is a good way to synchronize code in a stored procedure. Before finding out about sp_getapplock, I would SELECT from a table with an exclusive lock, like so: &#8230; <a href="http://clay.lenharts.net/blog/2008/01/28/the-sp_getapplock-secret/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>sp_getapplock is not very well advertised in SQL Server 2005; however, it is a good way to synchronize code in a stored procedure.</p>
<p>Before finding out about  sp_getapplock, I would SELECT from a table with an exclusive lock, like so:</p>
<pre class="brush: sql; title: ; notranslate">
BEGIN TRAN;
SELECT TOP 1 * FROM dbo.ATable with (tablockx, holdlock);
-- Do something while the lock is held, for instance:
UPDATE dbo.ATable SET FieldA = FieldA + 1 WHERE FieldB = 'something';
COMMIT TRAN;
</pre>
<p>This blocks other users from entering the section of code until the lock is released (when the transaction is committed).  Normally you couldn&#8217;t do FieldA = FieldA + 1, because other users might be updating the table, but with the lock on the table you can.</p>
<p>This approach has some downsides</p>
<ol>
<li>Other users can&#8217;t SELECT from the table.</li>
<li>If you rebuild indexes on the table with the &#8220;online=on&#8221; option, it will want to put a schema lock on the table, to prevent other schema changes.  The exclusive lock prevents the rebuild from starting.</li>
</ol>
<p>sp_getapplock is the built-in way to allow only one user in a section of code at a time, for example:</p>
<pre class="brush: sql; title: ; notranslate">BEGIN TRAN;
DECLARE @res int;
EXEC @res = sp_getapplock @Resource = 'Lock ID', @LockMode = 'Exclusive';
IF @res &gt;= 0
BEGIN
	PRINT 'lock is held.';
END
COMMIT TRAN;  -- the lock is released when the transaction commits
</pre>
<p>Different @Resource names create independent locks that don&#8217;t interfere with each other.</p>
<p>You also have to be careful which database you are &#8220;use&#8221;ing: application locks are scoped to the current database, so SQL Server assumes that different databases have nothing to do with each other, and sessions in different databases will not block each other even when they pass the same @Resource name.</p>
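<p>sp_getapplock can also hold a lock outside a transaction.  A sketch of mine (the resource name is arbitrary) using the @LockOwner and @LockTimeout parameters:</p>

```sql
DECLARE @res int;

-- 'Session'-owned locks need no transaction; they are held until
-- sp_releaseapplock is called or the session disconnects.
EXEC @res = sp_getapplock
    @Resource = 'Lock ID',
    @LockMode = 'Exclusive',
    @LockOwner = 'Session',
    @LockTimeout = 5000;  -- wait up to 5 seconds, then give up

IF @res >= 0
BEGIN
    PRINT 'Lock is held.';
    -- do the synchronized work here
    EXEC sp_releaseapplock @Resource = 'Lock ID', @LockOwner = 'Session';
END
ELSE
    PRINT 'Could not acquire the lock.';  -- negative return codes mean timeout, cancellation, or deadlock
```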
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2008/01/28/the-sp_getapplock-secret/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>SQL Server Security with EXECUTE AS OWNER</title>
		<link>http://clay.lenharts.net/blog/2008/01/24/sql-server-security-with-execute-as-owner/</link>
		<comments>http://clay.lenharts.net/blog/2008/01/24/sql-server-security-with-execute-as-owner/#comments</comments>
		<pubDate>Thu, 24 Jan 2008 21:54:16 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/2008/01/24/sql-server-security-with-execute-as-owner/</guid>
		<description><![CDATA[EXECUTE AS OWNER is a great way to limit the permissions of a SQL Server Login. The general idea is to create your stored procedure with the EXECUTE AS OWNER modifier. Any user who has the permissions to execute the &#8230; <a href="http://clay.lenharts.net/blog/2008/01/24/sql-server-security-with-execute-as-owner/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>EXECUTE AS OWNER is a great way to limit the permissions of a SQL Server Login.  The general idea is to create your stored procedure with the EXECUTE AS OWNER modifier.  Any user who has permission to execute the stored procedure runs it as the database&#8217;s dbo user (which means it can do anything in the database, but nothing at the server level or in other databases).  If you only allow your Logins to execute stored procedures (and not touch the tables directly), then you&#8217;ve effectively limited the Logins to code you&#8217;ve written.  If you don&#8217;t write any DELETE statements, then Logins can&#8217;t delete anything.</p>
<p>This is better than Roles, because Roles are very coarse in comparison.  With Roles, you may have to give a User INSERT permissions on a table.  Instead, with EXECUTE AS OWNER you can write a stored procedure that checks the data exactly the way you want in the body of the stored procedure.  This is a much more fine-grained way of handling permissions.</p>
<p>From beginning to end, this is what you do:</p>
<p>Create a Login:</p>
<pre lang="sql">CREATE LOGIN [MyLogin] WITH PASSWORD=N'Password',
DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=ON;</pre>
<p>Create its User in the database:</p>
<pre lang="sql">CREATE USER [MyUser] FOR LOGIN [MyLogin];</pre>
<p>I prefer to use schemas to identify &#8220;public&#8221; stored procedures.  So create a schema:</p>
<pre lang="sql">CREATE SCHEMA [public] AUTHORIZATION [dbo];</pre>
<p>Give your new user EXECUTE permission on anything in the public schema (we will put the new stored procedure in this schema):</p>
<pre lang="sql">GRANT EXECUTE ON SCHEMA::[public] TO [MyUser];</pre>
<p>Create your stored procedure:</p>
<pre lang="sql">CREATE PROCEDURE [public].[MyStoredProc]
(
@Param1 int
)
WITH EXECUTE AS OWNER   -- This "EXECUTE AS" modifier on the stored procedure is key!
AS
BEGIN
SET NOCOUNT ON;

-- do something

END</pre>
<p>When your stored procedure runs, it can do anything in the database, including calling other stored procedures.  It is an easy way to segregate public stored procedures from private ones.  This gives you encapsulation, which is a good thing (see section 5.3 in <a href="http://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670">Code Complete</a> about the benefits of encapsulation).</p>
<p>The only permissions outside users need is EXECUTE permission on the public schema, so it is easy to add new stored procedures by creating them in the public schema.</p>
<p>Instead of Roles, you can have schemas.  Let&#8217;s say you would have three roles in the database: admin, anon, and general.  The admin role is for Logins that perform administrative activity on a website.  The anon role is for people who view your site anonymously, and the general role is for stored procedures that are for both.  With EXECUTE AS OWNER, you can instead create three schemas for your stored procedures: admin, anon, and general.  If you want only admin Logins to be able to use a stored procedure, create it in the admin schema.  The same goes for the other schemas.</p>
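<p>You can check the sandbox from your own connection by impersonating the Login.  A quick test of mine (dbo.SomeTable is a hypothetical stand-in for any real table in the database):</p>

```sql
EXECUTE AS LOGIN = 'MyLogin';

-- Direct table access fails: MyUser has no table permissions at all.
SELECT * FROM dbo.SomeTable;

-- The public stored procedure succeeds, running as dbo internally.
EXEC [public].[MyStoredProc] @Param1 = 1;

REVERT;
```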
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2008/01/24/sql-server-security-with-execute-as-owner/feed/</wfw:commentRss>
		<slash:comments>56</slash:comments>
		</item>
		<item>
		<title>Natural Keys vs Surrogate Keys</title>
		<link>http://clay.lenharts.net/blog/2007/12/10/natural-keys-vs-surrogate-keys/</link>
		<comments>http://clay.lenharts.net/blog/2007/12/10/natural-keys-vs-surrogate-keys/#comments</comments>
		<pubDate>Mon, 10 Dec 2007 21:36:44 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/?p=10</guid>
		<description><![CDATA[This blog entry has a good description of the pros (and some cons) of surrogate keys: http://rapidapplicationdevelopment.blogspot.com/2007/08/in-case-youre-new-to-series-ive.html]]></description>
				<content:encoded><![CDATA[<p> This blog entry has a good description of the pros (and some cons) of surrogate keys:</p>
<p><a href="http://rapidapplicationdevelopment.blogspot.com/2007/08/in-case-youre-new-to-series-ive.html">http://rapidapplicationdevelopment.blogspot.com/2007/08/in-case-youre-new-to-series-ive.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2007/12/10/natural-keys-vs-surrogate-keys/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Sorting uniqueidentifiers in SQL Server 2005</title>
		<link>http://clay.lenharts.net/blog/2007/11/20/sorting-uniqueidentifiers-in-sql-server-2005/</link>
		<comments>http://clay.lenharts.net/blog/2007/11/20/sorting-uniqueidentifiers-in-sql-server-2005/#comments</comments>
		<pubDate>Tue, 20 Nov 2007 22:15:24 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/?p=5</guid>
		<description><![CDATA[I had an issue recently where I needed to sort on a uniqueidentifier column and read the data in .Net. I found that .Net sorts Guids differently than SQL Server. You can see for yourself. Run the following code. DECLARE &#8230; <a href="http://clay.lenharts.net/blog/2007/11/20/sorting-uniqueidentifiers-in-sql-server-2005/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I had an issue recently where I needed to sort on a uniqueidentifier column and read the data in .Net.  I found that <strong>.Net sorts Guids differently than SQL Server.</strong></p>
<p>You can see for yourself.  Run the following code.</p>
<pre lang="sql">DECLARE @t TABLE (
   g uniqueidentifier
);

INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000000000001' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000000000010' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000000000100' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000000001000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000000010000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000000100000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000001000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000010000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-000100000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-001000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-010000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0000-100000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0001-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0010-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-0100-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0000-1000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0001-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0010-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-0100-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0000-1000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0001-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0010-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-0100-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000000-1000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000001-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000010-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00000100-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00001000-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00010000-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '00100000-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '01000000-0000-0000-0000-000000000000' );
INSERT INTO @t ( g ) VALUES ( '10000000-0000-0000-0000-000000000000' ); 

SELECT * FROM @t ORDER BY g;</pre>
<p>It returns the data in the following bizarre order.  Keep in mind the first row is the &#8220;smallest&#8221; number.</p>
<table border="1">
<tr>
<th>g</th>
</tr>
<tr>
<td>01000000-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>10000000-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00010000-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00100000-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000100-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00001000-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000001-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000010-0000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0100-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000000-1000-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0001-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0010-0000-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0100-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0000-1000-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0001-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0010-0000-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0001-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0010-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0100-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-1000-000000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000000000001</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000000000010</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000000000100</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000000001000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000000010000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000000100000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000001000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000010000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-000100000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-001000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-010000000000</td>
</tr>
<tr>
<td>00000000-0000-0000-0000-100000000000</td>
</tr>
</table>
<p>In the end, I decided to compute two bigint columns that reflect how SQL Server sorts the data.  This is CPU intensive, so it isn&#8217;t ideal; however, it shows SQL Server&#8217;s strange sorting behaviour of the uniqueidentifier column.</p>
<pre lang="sql">
CREATE FUNCTION dbo.GuidHigh
(
	@g uniqueidentifier
)
RETURNS bigint
AS
BEGIN

	DECLARE @s varchar(40);
	SET @s = CONVERT(varchar(40), @g);  -- uniqueidentifier to varchar requires an explicit conversion
	-- @s is in the format 3B3A8D04-5D0C-4E0C-AC69-EFC14EE7D849

	SET @s = REPLACE(@s, '-', '');
	-- @s is in the format 3B3A8D045D0C4E0CAC69EFC14EE7D849

	DECLARE @highA varchar(40);
	DECLARE @highB varchar(40);

	SET @highA = SUBSTRING(@s, 21, 12);
	SET @highB = SUBSTRING(@s, 17, 4);

	DECLARE @high varchar(40);
	SET @high = @highA + @highB;

	DECLARE @MinBigInt numeric(21,0);
	SET @MinBigInt = 9223372036854775808;

	RETURN CAST(dbo.[HexStrToNumeric](@high) - @MinBigInt as bigint);

END
GO

CREATE FUNCTION dbo.[GuidLow]
(
	@g uniqueidentifier
)
RETURNS bigint
AS
BEGIN

	DECLARE @s varchar(40);
	SET @s = CONVERT(varchar(40), @g);  -- uniqueidentifier to varchar requires an explicit conversion
	-- @s is in the format 3B3A8D04-5D0C-4E0C-AC69-EFC14EE7D849

	SET @s = REPLACE(@s, '-', '');
	-- @s is in the format 3B3A8D045D0C4E0CAC69EFC14EE7D849

	DECLARE @lowA varchar(40);
	DECLARE @lowB varchar(40);
	DECLARE @lowC varchar(40);
	DECLARE @lowD varchar(40);
	DECLARE @lowE varchar(40);
	DECLARE @lowF varchar(40);
	DECLARE @lowG varchar(40);
	DECLARE @lowH varchar(40);

	SET @lowA = SUBSTRING(@s, 15, 2);
	SET @lowB = SUBSTRING(@s, 13, 2);
	SET @lowC = SUBSTRING(@s, 11, 2);
	SET @lowD = SUBSTRING(@s, 9, 2);
	SET @lowE = SUBSTRING(@s, 7, 2);
	SET @lowF = SUBSTRING(@s, 5, 2);
	SET @lowG = SUBSTRING(@s, 3, 2);
	SET @lowH = SUBSTRING(@s, 1, 2);

	DECLARE @low varchar(40);
	SET @low = @lowA + @lowB + @lowC + @lowD + @lowE + @lowF + @lowG + @lowH;

	DECLARE @MinBigInt numeric(21,0);
	SET @MinBigInt = 9223372036854775808;

	RETURN CAST(dbo.[HexStrToNumeric](@low) - @MinBigInt as bigint);

END
GO

-- do not include "0x" in the parameter, just a string like "8E75EF35FF75A977"

CREATE FUNCTION dbo.[HexStrToNumeric](@hexstr varchar(16))
RETURNS numeric(21, 0) -- enough for 2^64
AS
BEGIN
    DECLARE @hex char(2), @i int, @count int, @result numeric(21, 0), @power numeric(21, 0);
    SET @result = 0;
    SET @count = LEN(@hexstr)
    SET @i = 1
    SET @power = 1;
    WHILE (@i &lt;= @count)
    BEGIN
	SET @power = @power * 16;
        SET @i = @i + 1
    END;

    SET @i = 1
    WHILE (@i &lt;= @count)
    BEGIN
 	SET @power = @power / 16;
        SET @hex = SUBSTRING(@hexstr, @i, 1)
        SET @result = @result + @power *
                CASE WHEN @hex LIKE '[0-9]'
                    THEN CAST(@hex as int)
                    ELSE CAST(ASCII(UPPER(@hex))-55 as int)
                END
        SET @i = @i + 1
    END
    RETURN @result
END
GO</pre>
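<p>With the three functions in place, usage looks like this (my sketch; dbo.MyGuidTable is a hypothetical stand-in for any table with a uniqueidentifier column g).  The pair of bigints should sort the same way SQL Server sorts the uniqueidentifier itself, so .Net code can compare them directly:</p>

```sql
SELECT g,
       dbo.GuidHigh(g) AS SortHigh,
       dbo.GuidLow(g)  AS SortLow
FROM dbo.MyGuidTable
ORDER BY dbo.GuidHigh(g), dbo.GuidLow(g);  -- same order as ORDER BY g
```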
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2007/11/20/sorting-uniqueidentifiers-in-sql-server-2005/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Are Foreign Keys Bad?</title>
		<link>http://clay.lenharts.net/blog/2007/11/17/are-foreign-keys-bad/</link>
		<comments>http://clay.lenharts.net/blog/2007/11/17/are-foreign-keys-bad/#comments</comments>
		<pubDate>Sat, 17 Nov 2007 21:04:11 +0000</pubDate>
		<dc:creator><![CDATA[Clay Lenhart]]></dc:creator>
				<category><![CDATA[SQL Server Development]]></category>

		<guid isPermaLink="false">http://clay.lenharts.net/blog/?p=3</guid>
		<description><![CDATA[The Problem Mike Simpson&#8217;s post on foreign keys raises some good points: http://www.slipjig.org/Mike/post/2007/11/Are-Foreign-Keys-Bad&#8211;You-Decide!.aspx. The main issue raised is how foreign keys cause deadlocks. In order to avoid deadlocks, you have to acquire locks on records in the same order, always. &#8230; <a href="http://clay.lenharts.net/blog/2007/11/17/are-foreign-keys-bad/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><strong>The Problem </strong></p>
<p>Mike Simpson&#8217;s post on foreign keys raises some good points: <a href="http://www.slipjig.org/Mike/post/2007/11/Are-Foreign-Keys-Bad--You-Decide!.aspx">http://www.slipjig.org/Mike/post/2007/11/Are-Foreign-Keys-Bad&#8211;You-Decide!.aspx</a>.  The main issue raised is how foreign keys cause deadlocks.  In order to avoid deadlocks, you have to acquire locks on records in the same order, always.  When you insert and update records related by foreign keys, you lock records from the parent tables to the child tables.  To delete records, you lock records in the opposite order (child to parent tables) due to foreign key constraints, leading to potential deadlocks.</p>
<p><strong>The Usual Solutions </strong></p>
<p>There are three general approaches to deal with deadlocks:</p>
<ul>
<li>Add retry code to handle deadlocks.  Typically this is a lot of work and error prone &#8212; not to mention difficult to test.  You generally don&#8217;t see many developers doing this due to the effort involved.</li>
<li>DELETE in the opposite order and allow deadlocks to occur.  This isn&#8217;t as bad as it seems.  It is common to have a little validation in the database layer &#8212; for instance, &#8220;The username must be unique&#8221; type validation.  So you treat the deadlock like a validation error, report it to the user, and let the user hit the Save button again.  Keep in mind that you may have backend processes that can deadlock, which isn&#8217;t ideal &#8212; these backend processes don&#8217;t even have to DELETE in order to deadlock.  Even if they only run INSERTs and UPDATEs, they can still deadlock with a user who is deleting in the opposite order (or, more generally, locking in a different order).</li>
<li>&#8220;Logically delete&#8221; in the same order as INSERTs and UPDATEs to avoid deadlocks.  Logical deletes are really updates where you set a field such as &#8220;IsDeleted&#8221; to true.  The downside to this approach is all your SELECT statements have to filter out the &#8220;deleted&#8221; records, which could be error prone.  The difference between this approach and the first approach though, is that this approach is much easier to test.</li>
</ul>
<p><strong>Don&#8217;t Use Foreign Keys!?!? </strong></p>
<p>Mike proposes another idea &#8212; don&#8217;t use foreign keys!  Mike&#8217;s good about coming up with ideas no one else thinks of, but in this case, is this going too far?  Can you justify trading data integrity for avoiding deadlocks?  Personally, I think data integrity is more important.  Despite this, I want to argue Mike&#8217;s side a bit more, because, well, there are never hard and fast rules in software, like &#8220;you must always use foreign keys&#8221;.</p>
<p>Foreign keys cause additional locks that you may not be aware of &#8212; beyond dictating the order in which you modify records.  Let&#8217;s say you have the following two tables: Player and Team.  There is a foreign key from the Player table to the Team table.  The Team table has the following records:</p>
<table border="1">
<tr>
<th>TeamID</th>
<th>TeamName</th>
</tr>
<tr>
<td>1</td>
<td>Manchester United</td>
</tr>
<tr>
<td>2</td>
<td>Liverpool</td>
</tr>
</table>
<p>The Player table has the following record:</p>
<table border="1">
<tr>
<th>PlayerID</th>
<th>TeamID</th>
<th>PlayerName</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>Gerrard</td>
</tr>
</table>
<p>So Gerrard plays for Liverpool.  Two impossible things are about to happen: a) Manchester United is going to be relegated (so we need to delete the team), and b) Gerrard is going to move to Manchester United.</p>
<p>User A executes the following statement:</p>
<blockquote><p> BEGIN TRAN;</p>
<p>DELETE FROM Team WHERE TeamID = 1;</p></blockquote>
<p>Internally in SQL Server, the table looks like:</p>
<table border="1">
<tr>
<th>TeamID</th>
<th>TeamName</th>
<th></th>
</tr>
<tr>
<td>1</td>
<td>Manchester United</td>
<td>marked to be deleted and locked</td>
</tr>
<tr>
<td>2</td>
<td>Liverpool</td>
<td>&nbsp;</td>
</tr>
</table>
<p>When User B executes the following statement, it will block, b/c it is attempting to read Team 1 (Man U), but the record is locked and can&#8217;t be read.</p>
<blockquote><p>BEGIN TRAN;</p>
<p>UPDATE Player SET TeamID = 1 WHERE PlayerID = 1;</p></blockquote>
<p>The statement is blocked, waiting for the first user to commit the transaction.  Foreign keys cause additional locks to be taken.  Not only that, but the locking goes from a child table to a parent table!  This is the opposite of the order in which we modify records, which can lead to deadlocks!  (Even though the example above includes a DELETE on the Team table, an UPDATE would lock exactly the same way, in case you are thinking about doing logical deletes.)</p>
<p>Mike&#8217;s post talks about a potential new feature in SQL Server where constraints are checked when the transaction is committed, not when individual records are modified, but is it really the answer?  It will delay the foreign key checking until the transaction is committed, but while it is checking the constraint, it will still lock the records.  This causes the locking for the whole transaction to happen in random order, which will cause deadlocks.</p>
<p><strong>SELECTs Lock Too! </strong></p>
<p>Another thing about deadlocks: SELECT statements lock records too!  And therefore they can deadlock.  With joins, it&#8217;s anyone&#8217;s guess the order in which records are locked (parents first, or children first).  As it turns out, most of the deadlocks I&#8217;ve seen have come from SELECT statements.  The best way to avoid locking SELECT statements in SQL Server 2005 is to use <a href="http://msdn2.microsoft.com/en-us/library/tcbchxcb(vs.80).aspx">READ_COMMITTED_SNAPSHOT</a>.  To enable it, run the following code:</p>
<pre lang="sql">ALTER DATABASE MyDatabase
SET READ_COMMITTED_SNAPSHOT ON;</pre>
<p>It works very much like READ_COMMITTED, however without locking records.  The downside is that READ_COMMITTED_SNAPSHOT uses more I/O than READ_COMMITTED.</p>
<p><strong>Personal Preference </strong></p>
<p>To avoid deadlocks I have a bias towards the following approach.  After reading the above, you know there is no silver bullet, but this is a good balance of deadlock avoidance, data integrity, and ease of programming:</p>
<ul>
<li>Use foreign keys</li>
<li>Do logical deletes</li>
<li>INSERT, UPDATE, and logically delete records in the same order</li>
<li>Use READ_COMMITTED_SNAPSHOT isolation level.</li>
</ul>
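<p>As a minimal sketch of the logical-delete part of that recipe (my illustration, reusing the Player table from above):</p>

```sql
-- One-time schema change: add the flag.
ALTER TABLE Player ADD IsDeleted bit NOT NULL DEFAULT 0;

-- A "delete" becomes an UPDATE, so locks are taken in the same
-- parent-to-child order as INSERTs and UPDATEs.
UPDATE Player SET IsDeleted = 1 WHERE PlayerID = 1;

-- The price: every SELECT must remember to filter the flag.
SELECT PlayerID, PlayerName
FROM Player
WHERE IsDeleted = 0;
```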
]]></content:encoded>
			<wfw:commentRss>http://clay.lenharts.net/blog/2007/11/17/are-foreign-keys-bad/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
	</channel>
</rss>
