Advantages of Immutable Data

I was reading about two of Google’s internal programs, Bigtable and Google File System, and how they handle large amounts of data so that processing is distributed (Bigtable) and data is replicated (Google File System). One thing they both do is exploit immutability: once something is saved to disk, that block of data doesn’t change. If you have a large volume of data, this assumption can be very useful.

Let’s say we’re talking about a transaction table for bank accounts. Records in the transaction table don’t change. Any corrections are handled by creating a new transaction record at the tail end of the table. If you are processing the transactions, this is great! You can ignore previously processed records since you know they will never change.
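
A minimal sketch of this pattern in T-SQL (the Txn table and its columns are just hypothetical names):

-- A hypothetical append-only transaction table
CREATE TABLE Txn (
    TxnID     int IDENTITY(1,1) PRIMARY KEY,
    AccountID int           NOT NULL,
    Amount    decimal(19,4) NOT NULL,
    Note      nvarchar(200) NULL
);

-- A mistaken $100 deposit that should have been $10
INSERT INTO Txn (AccountID, Amount, Note)
VALUES (42, 100.00, N'Deposit');

-- The correction is a new record at the tail, not an UPDATE
INSERT INTO Txn (AccountID, Amount, Note)
VALUES (42, -90.00, N'Correction to TxnID 1');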


SQL Server Hash Indexes

There are two problems with indexes on large nvarchar columns:

  • You will likely hit the 900-byte index key limit
  • Indexing large data isn’t efficient anyway

A neat feature of SQL Server is the CHECKSUM() function, which hashes your varchar/nvarchar values into a 4-byte int. You can then use this value in an index. For example, if you have a Site table, add a computed column, URLChecksum.

CREATE TABLE Site (
  SiteID int NOT NULL,
  URL nvarchar(2083) NOT NULL,
  URLChecksum AS (CHECKSUM([URL])),
  CONSTRAINT [PK_Site] PRIMARY KEY CLUSTERED (SiteID)
);

Next create an index on the hash and include the URL:

CREATE INDEX IX_Site ON Site (URLChecksum) INCLUDE (URL);

This index will make the following query faster:

SELECT SiteID
FROM Site
WHERE URLChecksum = CHECKSUM(N'http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en')
AND URL = N'http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en';

This query will first “seek” the hash value in the index very quickly, since the hash values are just ints. Once it finds one or more matching hash values, it checks that the URLs match (CHECKSUM can produce collisions, so the extra URL predicate is required for correctness). Since URLChecksum and URL are in the index, and SiteID is the clustering key carried in every nonclustered index row, this query never needs to touch the Site table.


“Including” Columns in an Index

A neat feature in SQL Server 2005 is the ability to “include” columns in an index. Included columns are not part of the index key; they are stored only at the leaf level of the index as additional information.

For example, let’s say you have the following SELECT statement:

SELECT Url
FROM Site
WHERE Category = 'News';

You might be tempted to create a covering index with Category as the first column and Url as the second column. But since URLs can be up to 2083 characters (4166 bytes as nvarchar), you can’t put the Url column in the index key: it would exceed the 900-byte limit. However, the query above benefits from the following index, where the Url is “included” in the index and therefore isn’t subject to the 900-byte restriction.

CREATE INDEX IX_Site ON dbo.Site (Category) INCLUDE (Url);

The index key contains only the Category column. The index also stores the Url at the leaf level, but the Url can’t be efficiently used for filtering. For the SELECT statement above that is OK, since the Url is only returned, not filtered on.

Another benefit is that the index key is smaller, so finding records in the index is faster.


Finding Used Databases

It can be difficult to determine which of the databases you administer are actually being used. Below is a query that reports index usage since SQL Server last started, which helps narrow this down (sys.indexes is per-database, so run it in each database you want to check):

SELECT db.name AS database_name, i.name AS index_name, s.*
FROM sys.dm_db_index_usage_stats s
     INNER JOIN sys.databases db ON db.database_id = s.database_id
     INNER JOIN sys.indexes i
             ON s.index_id = i.index_id
            AND s.object_id = i.object_id
WHERE s.database_id = DB_ID()  -- sys.indexes only covers the current database
  AND s.object_id > 1000       -- skip system objects
ORDER BY db.name;
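
If you want a single “last used” answer per table in the current database, a variation on the same DMV can aggregate the usage dates (a sketch):

SELECT OBJECT_NAME(s.object_id) AS table_name,
       MAX(s.last_user_seek)   AS last_seek,
       MAX(s.last_user_scan)   AS last_scan,
       MAX(s.last_user_lookup) AS last_lookup,
       MAX(s.last_user_update) AS last_update
FROM sys.dm_db_index_usage_stats s
WHERE s.database_id = DB_ID()
  AND s.object_id > 1000  -- skip system objects, as above
GROUP BY s.object_id
ORDER BY table_name;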

SQL Server RAM Usage

SQL Server has difficulty managing memory on systems with a large amount of RAM (say 8+ GB), typically found in 64-bit installations. You might see log messages such as:

A significant part of sql server process memory has been paged out.

Or worse, SSAS may fail to process a cube with the only message being that it was cancelled. To see if you have this problem, open Performance Monitor and view “Private Bytes” under the Process category. The value should be less than the amount of physical RAM.

I found the following article useful in dealing with this issue: http://blogs.technet.com/askperf/archive/2007/05/18/sql-and-the-working-set.aspx

You want to configure SQL Server and SSAS to use a specific amount of RAM, but be sure to leave plenty of RAM for the OS and other processes, such as SSIS packages. Monitor Private Bytes while you run a typical workload to see how much RAM you can allocate to SQL Server and SSAS. Don’t be surprised if you end up allocating two-thirds of the RAM to SQL Server and SSAS, leaving one-third for the OS and other processes.
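
For the relational engine, you set the cap with sp_configure; the 8192 MB below is a hypothetical figure, so substitute whatever your Private Bytes monitoring suggests. SSAS is configured separately through its Memory\TotalMemoryLimit server property.

-- Hypothetical cap: limit the relational engine to 8 GB
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 8192;
RECONFIGURE;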


Defragging Indexes in SQL Server 2005

There are two ways to optimize your indexes in SQL Server 2005:

  • ALTER INDEX … REORGANIZE
  • ALTER INDEX … REBUILD

Which one should you choose? According to BOL, it depends on how fragmented they are. The avg_fragmentation_in_percent column from the sys.dm_db_index_physical_stats() system function tells you how fragmented your indexes are. If the value is below 5.0, there is no need to defragment the index. If the value is between 5.0 and 30.0, then “ALTER INDEX … REORGANIZE” is the best option. If the fragmentation is greater than 30.0, then “ALTER INDEX … REBUILD” is the best option.
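
As a sketch, using the hypothetical dbo.Site table and IX_Site index from the earlier posts, the whole decision looks like this:

-- Check fragmentation for one table's indexes
SELECT index_id, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Site'), NULL, NULL, 'LIMITED');

-- 5-30% fragmented: reorganize
ALTER INDEX IX_Site ON dbo.Site REORGANIZE;

-- Over 30% fragmented: rebuild (optionally WITH (ONLINE = ON) on Enterprise Edition)
ALTER INDEX IX_Site ON dbo.Site REBUILD;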

REORGANIZE has a couple of benefits. REORGANIZE logs only the pages that changed, while REBUILD logs the whole index. REORGANIZE is also always an online operation, while REBUILD takes the index offline (unless you add the ONLINE=ON option on Enterprise Edition).

However, REBUILD updates the index’s statistics, which can lead to better execution plans; REORGANIZE does not.


Listing checked-out files in Team System (TFS)

You can get a complete list of checked out files by using the tf.exe command line tool found in: C:\Program Files\Microsoft Visual Studio 8\Common7\IDE

All checked out files in $/MyProject:

tf.exe status $/MyProject /user:* /s:http://MyTfsServer:8080 /recursive

All files checked out to Mark:

tf.exe status /user:mark /s:http://MyTfsServer:8080

You can write the output to a file:

tf.exe status /user:* /s:http://MyTfsServer:8080 > c:\checkedout.txt

Backing up to a NTFS compressed folder

It seems like a good idea to back up SQL Server 2005 databases to a compressed folder; however, if your database is larger than roughly 30 GB, you will get the following error:

BackupMedium::ReportIoError: write failure on backup
device 'G:\Backups\200501220402.BAK'. Operating system
error 33(The process cannot access the file because
another process has locked a portion of the file.).

One way to get around this is to stripe the backup to multiple files, for instance:

BACKUP DATABASE AdventureWorks TO
DISK='G:\Backups\200501220402.BAK1',
DISK='G:\Backups\200501220402.BAK2',
DISK='G:\Backups\200501220402.BAK3'
WITH FORMAT, CHECKSUM;
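
Keep in mind that a striped backup can only be restored by supplying every stripe, for instance:

RESTORE DATABASE AdventureWorks FROM
DISK='G:\Backups\200501220402.BAK1',
DISK='G:\Backups\200501220402.BAK2',
DISK='G:\Backups\200501220402.BAK3'
WITH CHECKSUM;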

Not ideal. If you want real compressed backups, there are two options:

  1. Buy LiteSpeed Backup, which includes compressed backups
  2. Wait for SQL Server 2008, which includes compressed backups

The sp_getapplock secret

sp_getapplock is not very well advertised in SQL Server 2005, but it is a good way to synchronize sections of code in a stored procedure.

Before finding out about sp_getapplock, I would SELECT from a table with an exclusive lock, like so:

BEGIN TRAN;
SELECT TOP 1 * FROM dbo.ATable WITH (TABLOCKX, HOLDLOCK);
-- Do something while the lock is held, for instance:
UPDATE dbo.ATable SET FieldA = FieldA + 1 WHERE FieldB = 'something';
COMMIT TRAN;

This blocks other users from entering the section of code until the lock is released (when the transaction is committed). Normally you couldn’t safely do FieldA = FieldA + 1, because other users might be updating the table; with the lock on the table, you can.

This approach has some downsides:

  1. Other users can’t SELECT from the table.
  2. If you rebuild indexes on the table with the “online=on” option, it will want to put a schema lock on the table, to prevent other schema changes. The exclusive lock prevents the rebuild from starting.

sp_getapplock is the built-in way to allow only one user in a section of code at a time, for example:

BEGIN TRAN;
DECLARE @res int;
EXEC @res = sp_getapplock @Resource = 'Lock ID', @LockMode = 'Exclusive';
IF @res >= 0
BEGIN
    PRINT 'Lock is held; do the protected work here.';
END
COMMIT TRAN;  -- committing the transaction releases the lock

Different @Resource strings create independent locks, so unrelated pieces of code won’t interfere with each other.
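
A lock can also be owned by the session instead of the transaction; in that case you release it explicitly with sp_releaseapplock. A sketch, reusing the 'Lock ID' resource from above:

DECLARE @res int;
EXEC @res = sp_getapplock @Resource = 'Lock ID', @LockMode = 'Exclusive', @LockOwner = 'Session';
IF @res >= 0
BEGIN
    -- Do the protected work, then release the lock explicitly
    EXEC sp_releaseapplock @Resource = 'Lock ID', @LockOwner = 'Session';
END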

You also have to be careful about which database you are “USE”ing: application locks are scoped to the current database, so the same @Resource string in two different databases produces two independent locks that will not block each other.


SQL Server Security with EXECUTE AS OWNER

EXECUTE AS OWNER is a great way to limit the permissions of a SQL Server Login. The general idea is to create your stored procedure with the EXECUTE AS OWNER modifier. Any user who has permission to execute the stored procedure runs it as the database’s dbo user, which means it can do anything in the database but nothing at the server level or in other databases. If you only allow your Logins to execute stored procedures (and not touch the tables directly), then you’ve effectively limited the Logins to code you’ve written. If you don’t write any DELETE statements, then Logins can’t delete anything.

This is better than Roles, because Roles are very coarse in comparison. With Roles, you may have to give a User INSERT permission on an entire table. With EXECUTE AS OWNER, you can instead write a stored procedure that checks the data exactly the way you want in the body of the stored procedure. This is a much more fine-grained way of handling permissions.

From beginning to end, this is what you do:

Create a Login:

CREATE LOGIN [MyLogin] WITH PASSWORD=N'Password',
DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=ON;

Create its User in the database:

CREATE USER [MyUser] FOR LOGIN [MyLogin];

I prefer to use schemas to identify “public” stored procedures. So create a schema:

CREATE SCHEMA [public] AUTHORIZATION [dbo];

Give your new User EXECUTE permission on everything in the public schema (we will put the new stored procedure in this schema):

GRANT EXECUTE ON SCHEMA::[public] TO [MyUser];

Create your stored procedure:

CREATE PROCEDURE [public].[MyStoredProc]
(
    @Param1 int
)
WITH EXECUTE AS OWNER   -- This "EXECUTE AS" modifier on the stored procedure is key!
AS
BEGIN
    SET NOCOUNT ON;

    -- do something

END

When your stored procedure runs, it can do anything in the database, including calling other stored procedures. It is an easy way to segregate public stored procedures from private ones. This gives you encapsulation, which is a good thing (see section 5.3 in Code Complete about the benefits of encapsulation).

The only permission outside users need is EXECUTE on the public schema, so it is easy to add new stored procedures: just create them in the public schema.

Instead of Roles, you can use schemas. Say you would otherwise have three roles in the database: admin, anon, and general. The admin role is for Logins that perform administrative activity on a website, the anon role is for people who view your site anonymously, and the general role is for stored procedures used by both. With EXECUTE AS OWNER, you can instead create three schemas for your stored procedures: admin, anon, and general. If you want only admin Logins to use a stored procedure, create it in the admin schema; the same goes for the other schemas.
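
A minimal sketch of that layout, assuming hypothetical AdminUser and AnonUser database users already exist:

CREATE SCHEMA [admin] AUTHORIZATION [dbo];
GO
CREATE SCHEMA [anon] AUTHORIZATION [dbo];
GO
CREATE SCHEMA [general] AUTHORIZATION [dbo];
GO
GRANT EXECUTE ON SCHEMA::[admin]   TO [AdminUser];
GRANT EXECUTE ON SCHEMA::[anon]    TO [AnonUser];
GRANT EXECUTE ON SCHEMA::[general] TO [AdminUser], [AnonUser];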
