simon
2012-02-23 15:05:26 UTC
In my system, I have many tables with Primary Key, which contains more
than one column(usually from 2 to 5 columns). And this columns are
often of varchar dataType.
SQL by default creates clustered index on PK. Is that good idea?
This clustered index is unique, but is not narrow, not ever increasing
and sometimes even not static.
So, by following general rules this is not good candidate for
clustered index.
What now, should I leave table without clustered index or should I add
one identity column to this table and mark it as clustered.
In general, identity column is suggested.
Lets make simple example.
If I create 3 tables, one with clustered index, one without clustered
index(heap) and one with additional identity column as clustered
index:
CREATE TABLE [dbo].[testTbl1](
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex1] PRIMARY KEY NONCLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY] ) ON [PRIMARY]
CREATE TABLE [dbo].[testTbl2](
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex2] PRIMARY KEY CLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY] ) ON [PRIMARY]
CREATE TABLE [dbo].[testTbl3](
[idNum] [int] IDENTITY(1,1) NOT NULL,
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex3] PRIMARY KEY NONCLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY]) ON [PRIMARY]
CREATE CLUSTERED INDEX [ix_idNum] ON [dbo].[testTbl3] ([idNum] ASC)
Then I insert million records into this tables(I leave col1, col2 and
col3 as NULL). (you can use data from some other tables to insert or
you can use RedGate Data generator)
INSERT INTO dbo.testTbl1( prodID, cntID, color )
SELECT col1, col2, col3 FROM someProductionTable
If I compare indexes now:
SELECT
i.name AS IndexName,
SUM(s.used_page_count) * 8 AS IndexSizeKB
FROM sys.dm_db_partition_stats AS s
JOIN sys.indexes AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE s.[object_id] = object_id('dbo.testIndex1') OR s.[object_id] =
object_id('dbo.testIndex2') OR s.[object_id] =
object_id('dbo.testIndex3')
GROUP BY i.name
ORDER BY i.name
IndexName IndexSizeKB
NULL 139856 (this one is from heap table - testTbl1)
ix_idNum 141680
PK_testIndex1 56496
PK_testIndex2 141216
PK_testIndex3 52416
Now, I create one nonclustered index on column color:
CREATE NONCLUSTERED INDEX [ix_color1] ON [dbo].[testTbl1] ([color]
ASC)
CREATE NONCLUSTERED INDEX [ix_color2] ON [dbo].[testTbl2] ([color]
ASC)
CREATE NONCLUSTERED INDEX [ix_color3] ON [dbo].[testTbl3] ([color]
ASC)
and look for index size. I see the following:
ix_color1 33856
ix_color2 53304
ix_color3 29624
ix_color2 on clustered table has greatest size, because it includes
clustered key, which has greatest size.
ix_color1 on heap table has little greater size as ix_color3 on table
with clustered key on identity column:
Clustered index as part of nonclustered index has :
1 byte (number of variable columns)
+ 4 bytes (page pointer)
+ 4 bytes for identity column
in my case 9356 KB.
If nonclustered index is on heap, than it includes file identifier
(ID), page number, and number of the row on the page as pointer to the
data, which is greater size than identity in clustered table.
ix_color2, which has wide clustered index is almost twice in the size
comparing to ix_color3 with narrow clustered index.
So, the conclusion is that identity is the best solution for clustered
index if considering disk space.
But when I start to query the table, I get interesting results.
Queries like:
SELECT prodID FROM dbo.testTbl1 WHERE color='#005BAB' ORDER BY cntID
SELECT cntID FROM dbo.testTbl1 WHERE color='#005BAB' ORDER BY prodID
SELECT cntID, color FROM dbo.testTbl1 WHERE prodID='00003113'
SELECT prodID, color FROM dbo.testTbl1 WHERE cntID='CZK'
....
....
and all other possible combinations are almost in every case much
faster on table dbo.testTbl2 with clustered index on PK(rather than
identity or heap).
That is because PK is usually included in every query(in where, select
or order by - this one benefits only from first column in PK) and it
has high selectivity and if in clustered index you would get usually
index seek.
If I would update also the values of other columns(col1, col2, col3)
the table would have more pages and table scans(which happens in 1 and
3 case) would be even slower and dbo.testTbl2 would be even better
comparing to other two tables.
So, should I take in most cases PK as clustered index(which is also
sql server default)?
If I look the query :
SELECT prodID FROM dbo.testTbl3 WHERE color='#005BAB' ORDER BY cntID
If I would include prodId and cntID in index ix_color3, it would take
almost the same CPU time and IO as in case with ix_color2.
But index is in this case 59560 KB, which is 12% greater that
ix_color2. So, table dbo.testTbl2 is winner again.
Conclusion: even if all nonclustered indexes are much greater because
of wide clustered index, the most queries are much faster(comparing to
tiny identity cluster or heap), because PK in practice is usually part
of every query.
But many people recommend identity as cluster in 99% of cases.
Any thoughts?
Thanks,
Simon
than one column(usually from 2 to 5 columns). And this columns are
often of varchar dataType.
SQL by default creates clustered index on PK. Is that good idea?
This clustered index is unique, but is not narrow, not ever increasing
and sometimes even not static.
So, by following general rules this is not good candidate for
clustered index.
What now, should I leave table without clustered index or should I add
one identity column to this table and mark it as clustered.
In general, identity column is suggested.
Lets make simple example.
If I create 3 tables, one with clustered index, one without clustered
index(heap) and one with additional identity column as clustered
index:
CREATE TABLE [dbo].[testTbl1](
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex1] PRIMARY KEY NONCLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY] ) ON [PRIMARY]
CREATE TABLE [dbo].[testTbl2](
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex2] PRIMARY KEY CLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY] ) ON [PRIMARY]
CREATE TABLE [dbo].[testTbl3](
[idNum] [int] IDENTITY(1,1) NOT NULL,
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex3] PRIMARY KEY NONCLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY]) ON [PRIMARY]
CREATE CLUSTERED INDEX [ix_idNum] ON [dbo].[testTbl3] ([idNum] ASC)
Then I insert million records into this tables(I leave col1, col2 and
col3 as NULL). (you can use data from some other tables to insert or
you can use RedGate Data generator)
INSERT INTO dbo.testTbl1( prodID, cntID, color )
SELECT col1, col2, col3 FROM someProductionTable
If I compare indexes now:
SELECT
i.name AS IndexName,
SUM(s.used_page_count) * 8 AS IndexSizeKB
FROM sys.dm_db_partition_stats AS s
JOIN sys.indexes AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE s.[object_id] = object_id('dbo.testIndex1') OR s.[object_id] =
object_id('dbo.testIndex2') OR s.[object_id] =
object_id('dbo.testIndex3')
GROUP BY i.name
ORDER BY i.name
IndexName IndexSizeKB
NULL 139856 (this one is from heap table - testTbl1)
ix_idNum 141680
PK_testIndex1 56496
PK_testIndex2 141216
PK_testIndex3 52416
Now, I create one nonclustered index on column color:
CREATE NONCLUSTERED INDEX [ix_color1] ON [dbo].[testTbl1] ([color]
ASC)
CREATE NONCLUSTERED INDEX [ix_color2] ON [dbo].[testTbl2] ([color]
ASC)
CREATE NONCLUSTERED INDEX [ix_color3] ON [dbo].[testTbl3] ([color]
ASC)
and look for index size. I see the following:
ix_color1 33856
ix_color2 53304
ix_color3 29624
ix_color2 on clustered table has greatest size, because it includes
clustered key, which has greatest size.
ix_color1 on heap table has little greater size as ix_color3 on table
with clustered key on identity column:
Clustered index as part of nonclustered index has :
1 byte (number of variable columns)
+ 4 bytes (page pointer)
+ 4 bytes for identity column
in my case 9356 KB.
If nonclustered index is on heap, than it includes file identifier
(ID), page number, and number of the row on the page as pointer to the
data, which is greater size than identity in clustered table.
ix_color2, which has wide clustered index is almost twice in the size
comparing to ix_color3 with narrow clustered index.
So, the conclusion is that identity is the best solution for clustered
index if considering disk space.
But when I start to query the table, I get interesting results.
Queries like:
SELECT prodID FROM dbo.testTbl1 WHERE color='#005BAB' ORDER BY cntID
SELECT cntID FROM dbo.testTbl1 WHERE color='#005BAB' ORDER BY prodID
SELECT cntID, color FROM dbo.testTbl1 WHERE prodID='00003113'
SELECT prodID, color FROM dbo.testTbl1 WHERE cntID='CZK'
....
....
and all other possible combinations are almost in every case much
faster on table dbo.testTbl2 with clustered index on PK(rather than
identity or heap).
That is because PK is usually included in every query(in where, select
or order by - this one benefits only from first column in PK) and it
has high selectivity and if in clustered index you would get usually
index seek.
If I would update also the values of other columns(col1, col2, col3)
the table would have more pages and table scans(which happens in 1 and
3 case) would be even slower and dbo.testTbl2 would be even better
comparing to other two tables.
So, should I take in most cases PK as clustered index(which is also
sql server default)?
If I look the query :
SELECT prodID FROM dbo.testTbl3 WHERE color='#005BAB' ORDER BY cntID
If I would include prodId and cntID in index ix_color3, it would take
almost the same CPU time and IO as in case with ix_color2.
But index is in this case 59560 KB, which is 12% greater that
ix_color2. So, table dbo.testTbl2 is winner again.
Conclusion: even if all nonclustered indexes are much greater because
of wide clustered index, the most queries are much faster(comparing to
tiny identity cluster or heap), because PK in practice is usually part
of every query.
But many people recommend identity as cluster in 99% of cases.
Any thoughts?
Thanks,
Simon