HierarchyID in SQL Server 2008
SQL Server 2008 includes a new HierarchyID datatype!
SQL Server 2008 includes a new HierarchyID datatype!
I was reading about two of Google's internal programs, Bigtable and Google File System, and how they handle a large amounts of data so that the processing is distributed (Bigtable) and the data is replicated (Google File System). One thing they both do is exploit immutability. So once something is saved to disk, that block of data doesn't change. If you have a large volume of data, this assumption can be very useful.
Let's say we're talking about a transaction table for bank accounts. Records in the transaction table don't change. Any corrections are handled by creating a new transaction record at the tail end of the table. If you are processing the transactions, this is great! You can ignore previously processed records since you know they will never change.
There are two problems with indexes on large nvarchar columns:
A neat feature of SQL Server is the CHECKSUM() function which hashes your varchar/nvarchar values into a 4 byte number. You can then use this value in an index. For example if you have a Site table, add a calculated column, URLChecksum.
CREATE TABLE Site ( SiteID int NOT NULL, URL nvarchar(2083) NOT NULL, URLChecksum AS (checksum([URL])), CONSTRAINT [PK_Site] PRIMARY KEY CLUSTERED (SiteID) );
Next create an index on the hash and include the URL:
CREATE INDEX IX_Site ON Site (URLChecksum) INCLUDE (URL);
This index will make the following query faster:
SELECT SiteID FROM Site WHERE URLChecksum = CHECKSUM(N'http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en') AND URL = N'http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en';
This query will first "seek" the hash value in the index very quickly, since the hash values are just ints. Once it finds one or more matching hash values, it will check that the URLs match. Since the URLChecksum, URL, and SiteID values are included in the index, this query does not need to touch the Site table.
A neat feature in SQL Server 2005 is the ability to "include" columns in an index. These included columns are not in the main part of the index, but are additional information in the index.
For example, lets say you have the following SELECT statement:
SELECT Url FROM Site WHERE Category = 'News';
You might be tempted to create a covering index with Category as the first column, and Url as the second column. Since URLs can be 2083 characters, you can't put the URL column in an index since it would exceed 900 bytes. However, the query above would benefit from the following index where the URL is "included" in the index and therefore isn't restricted to 900 bytes.
CREATE INDEX IX_Site ON dbo.Site (Category) INCLUDE (Url);
The main part of the index only contains the category column. The index also stores the Url, but the Url can't be efficiently used for filtering. In the SELECT statement above, this is OK, since the Url is only returned, not filtered.
Another benefit is that the main part of the index is smaller, so it is faster to find records in the index.