Discussion:
Clustered index example
(too old to reply)
simon
2012-02-23 15:05:26 UTC
Permalink
In my system, I have many tables with Primary Key, which contains more
than one column(usually from 2 to 5 columns). And this columns are
often of varchar dataType.
SQL by default creates clustered index on PK. Is that good idea?

This clustered index is unique, but is not narrow, not ever increasing
and sometimes even not static.
So, by following general rules this is not good candidate for
clustered index.

What now, should I leave table without clustered index or should I add
one identity column to this table and mark it as clustered.
In general, identity column is suggested.

Lets make simple example.
If I create 3 tables, one with clustered index, one without clustered
index(heap) and one with additional identity column as clustered
index:

CREATE TABLE [dbo].[testTbl1](
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex1] PRIMARY KEY NONCLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY] ) ON [PRIMARY]


CREATE TABLE [dbo].[testTbl2](
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex2] PRIMARY KEY CLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY] ) ON [PRIMARY]

CREATE TABLE [dbo].[testTbl3](
[idNum] [int] IDENTITY(1,1) NOT NULL,
[prodID] [nvarchar](50) NOT NULL,
[cntID] [char](10) NOT NULL,
[col1] [nchar](10) NULL,
[col2] [nchar](10) NULL,
[col3] [nchar](10) NULL,
[color] [nvarchar](50) NULL,
CONSTRAINT [PK_testIndex3] PRIMARY KEY NONCLUSTERED (
[prodID] ASC,
[cntID] ASC
)ON [PRIMARY]) ON [PRIMARY]

CREATE CLUSTERED INDEX [ix_idNum] ON [dbo].[testTbl3] ([idNum] ASC)


Then I insert million records into this tables(I leave col1, col2 and
col3 as NULL). (you can use data from some other tables to insert or
you can use RedGate Data generator)

INSERT INTO dbo.testTbl1( prodID, cntID, color )
SELECT col1, col2, col3 FROM someProductionTable

If I compare indexes now:
SELECT
i.name AS IndexName,
SUM(s.used_page_count) * 8 AS IndexSizeKB
FROM sys.dm_db_partition_stats AS s
JOIN sys.indexes AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE s.[object_id] = object_id('dbo.testIndex1') OR s.[object_id] =
object_id('dbo.testIndex2') OR s.[object_id] =
object_id('dbo.testIndex3')
GROUP BY i.name
ORDER BY i.name

IndexName IndexSizeKB
NULL 139856 (this one is from heap table - testTbl1)
ix_idNum 141680
PK_testIndex1 56496
PK_testIndex2 141216
PK_testIndex3 52416


Now, I create one nonclustered index on column color:
CREATE NONCLUSTERED INDEX [ix_color1] ON [dbo].[testTbl1] ([color]
ASC)
CREATE NONCLUSTERED INDEX [ix_color2] ON [dbo].[testTbl2] ([color]
ASC)
CREATE NONCLUSTERED INDEX [ix_color3] ON [dbo].[testTbl3] ([color]
ASC)

and look for index size. I see the following:
ix_color1 33856
ix_color2 53304
ix_color3 29624

ix_color2 on clustered table has greatest size, because it includes
clustered key, which has greatest size.
ix_color1 on heap table has little greater size as ix_color3 on table
with clustered key on identity column:

Clustered index as part of nonclustered index has :
1 byte (number of variable columns)
+ 4 bytes (page pointer)
+ 4 bytes for identity column
in my case 9356 KB.

If nonclustered index is on heap, than it includes file identifier
(ID), page number, and number of the row on the page as pointer to the
data, which is greater size than identity in clustered table.

ix_color2, which has wide clustered index is almost twice in the size
comparing to ix_color3 with narrow clustered index.

So, the conclusion is that identity is the best solution for clustered
index if considering disk space.

But when I start to query the table, I get interesting results.
Queries like:
SELECT prodID FROM dbo.testTbl1 WHERE color='#005BAB' ORDER BY cntID
SELECT cntID FROM dbo.testTbl1 WHERE color='#005BAB' ORDER BY prodID
SELECT cntID, color FROM dbo.testTbl1 WHERE prodID='00003113'
SELECT prodID, color FROM dbo.testTbl1 WHERE cntID='CZK'
....
....
and all other possible combinations are almost in every case much
faster on table dbo.testTbl2 with clustered index on PK(rather than
identity or heap).
That is because PK is usually included in every query(in where, select
or order by - this one benefits only from first column in PK) and it
has high selectivity and if in clustered index you would get usually
index seek.
If I would update also the values of other columns(col1, col2, col3)
the table would have more pages and table scans(which happens in 1 and
3 case) would be even slower and dbo.testTbl2 would be even better
comparing to other two tables.

So, should I take in most cases PK as clustered index(which is also
sql server default)?

If I look the query :
SELECT prodID FROM dbo.testTbl3 WHERE color='#005BAB' ORDER BY cntID

If I would include prodId and cntID in index ix_color3, it would take
almost the same CPU time and IO as in case with ix_color2.
But index is in this case 59560 KB, which is 12% greater that
ix_color2. So, table dbo.testTbl2 is winner again.

Conclusion: even if all nonclustered indexes are much greater because
of wide clustered index, the most queries are much faster(comparing to
tiny identity cluster or heap), because PK in practice is usually part
of every query.
But many people recommend identity as cluster in 99% of cases.

Any thoughts?

Thanks,
Simon
Erland Sommarskog
2012-02-24 19:23:52 UTC
Permalink
Post by simon
In my system, I have many tables with Primary Key, which contains more
than one column(usually from 2 to 5 columns). And this columns are
often of varchar dataType.
SQL by default creates clustered index on PK. Is that good idea?
Sometimes. Sometimes not.

But the way SQL Server works, it usually best to have a clustered index.
SQL Server appears to look at heaps the funny uncle from the countryside.
Post by simon
and all other possible combinations are almost in every case much
faster on table dbo.testTbl2 with clustered index on PK(rather than
identity or heap).
That is because PK is usually included in every query(in where, select
or order by - this one benefits only from first column in PK) and it
has high selectivity and if in clustered index you would get usually
index seek.
But you could have the cake and eat it: use IDENTITY for the clustered
index, and then include the PK columns where it fits.

Overall, this a complex area, and there are no clearcut answer, but for
each table you design, you will need to make the choice whether to
cluster by the PK, some other column or add a surrogate to use only for
the clustered index. A lot depends on what you want to optimize for:
A certain range query? INSERT performance? Something else?
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
simon
2012-02-28 08:31:49 UTC
Permalink
Post by Erland Sommarskog
Post by simon
In my system, I have many tables with Primary Key, which contains more
than one column(usually from 2 to 5 columns). And this columns are
often of varchar dataType.
SQL by default creates clustered index on PK. Is that good idea?
Sometimes. Sometimes not.
But the way SQL Server works, it usually best to have a clustered index.
SQL Server appears to look at heaps the funny uncle from the countryside.
Post by simon
and all other possible combinations are almost in every case much
faster on table dbo.testTbl2 with clustered index on PK(rather than
identity or heap).
That is because PK is usually included in every query(in where, select
or order by - this one benefits only from first column in PK) and it
has high selectivity and if in clustered index you would get usually
index seek.
But you could have the cake and eat it: use IDENTITY for the clustered
index, and then include the PK columns where it fits.
Overall, this a complex area, and there are no clearcut answer, but for
each table you design, you will need to make the choice whether to
cluster by the PK, some other column or add a surrogate to use only for
A certain range query? INSERT performance? Something else?
--
SQL 2008:http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005:http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
Thanks Erland for your thoughts.
Having clustered index on identity and define PK as non-clustered
index takes more space than clustered index on PK and also performance
in queries is not better.
On the other hand, having clustered index on PK, creates page splits
and fragmentation of clustered index when inserting new rows and I'm
not sure how this affects non-clustered indexes?
(I guess no affect, even if PK would be non-unique, the rebuild of it
would rebuilt also all nonclustered indexes - but rebuild anyway
doesn't happen when page splits occur)
Otherwise page splits happens also on PK as non-clustered index(if
clustered is identity) so no big difference at inserts.
Usually there are a lot more reads than inserts, so, page splits
doesn't have great impact on performance in 90% of cases.

I can imagine only two reasons why having clustered on identity and
nonclustered on PK:

1. When other tables have relation (FK) on this table. JOIN by
identity is much faster than JOIN by 5 columns of PK for example.

2. Other nonclustered indexes are greater because of wide clustered
index
(but since many queries have PK included, I should anyway include PK
columns in many non-clustered indexes if I would have identity as
clustered)

Is there any other advantage of having identity as clustered + PK as
nonclustered regards to PK as clustered?

Otherwise I would say in general(98% of all cases):
If table has relations on other table(s), than always use identity as
clustered index and add PK as columns or included columns on other
nonclustered indexes.
If table doesn't have relations than put PK as clustered, since it is
included in almost every select query.

Is that good guideline for start? For special cases, as you said, this
could be changed.

BR,
Simon
Erland Sommarskog
2012-02-29 02:24:32 UTC
Permalink
Post by simon
On the other hand, having clustered index on PK, creates page splits
and fragmentation of clustered index when inserting new rows and I'm
not sure how this affects non-clustered indexes?
The NCI indexes are not affected by what happens in the clustered index.
Post by simon
(I guess no affect, even if PK would be non-unique,
If the PK is non-unique, you have a problem. :-)
Post by simon
Is there any other advantage of having identity as clustered + PK as
nonclustered regards to PK as clustered?
Reduce page splits and improve insert-performance. If your table has more
read than inserts, it may not matter, but if there is a high amount of
concurrent inserts it can certainly matter.
Post by simon
If table has relations on other table(s), than always use identity as
clustered index and add PK as columns or included columns on other
nonclustered indexes.
Actually, I have many cases of multi-column foreign keys. I recall one
situation where I did design two tables with a surrogate key to link
them. When I came back to code against these tables years later, I found
that it was very difficult because of the surrogate.
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
simon
2012-02-29 09:45:07 UTC
Permalink
Post by Erland Sommarskog
Post by simon
On the other hand, having clustered index on PK, creates page splits
and fragmentation of clustered index when inserting new rows and I'm
not sure how this affects non-clustered indexes?
The NCI indexes are not affected by what happens in the clustered index.
Post by simon
(I guess no affect, even if  PK would be non-unique,
If the PK is non-unique, you have a problem. :-)
Post by simon
Is there any other advantage of having identity as clustered + PK as
nonclustered regards to PK as clustered?
Reduce page splits and improve insert-performance. If your table has more
read than inserts, it may not matter, but if there is a high amount of
concurrent inserts it can certainly matter.
Post by simon
If table has relations on other table(s), than always use identity as
clustered index and add PK as columns or included columns on other
nonclustered indexes.
Actually, I have many cases of multi-column foreign keys. I recall one
situation where I did design two tables with a surrogate key to link
them. When I came back to code against these tables years later, I found
that it was very difficult because of the surrogate.
--
SQL 2008:http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005:http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
"Reduce page splits and improve insert-performance."

This is actually not big improvement. If you have identity as
clustered and PK as nonclustered, than page splitting occurs on
nonclustered index(instead of clustered).
So, there is no big difference I guess?

One way to reduce page splitting is to use fill factor 80% for
example. And than rebuild index every night when there is no active
users.
But what is in most cases better in normal oltp system with 10% of
updates, inserts and 90% of reads? Having fill factor less than 100%,
which produces more pages to read every time but reduce page splits,
or heaving fill factor 100%, which produce minimal pages but maximal
page splitting?

For performance reasons still I think that in most cases for FK is
better to have identity.
Can you give me one example where identity is really bad idea for FK
for sake of performance? As you described, your reason was not of
performance nature but logical?

Thank you very much for your explains.
br, Simon
Erland Sommarskog
2012-02-29 14:57:39 UTC
Permalink
Post by simon
"Reduce page splits and improve insert-performance."
This is actually not big improvement. If you have identity as
clustered and PK as nonclustered, than page splitting occurs on
nonclustered index(instead of clustered).
So, there is no big difference I guess?
As they say, your mileage may vary. If the table is wide, the cost
of splitting the NC index is lower than splitting the data page.
If the table is narrow, the difference is less, I guess.
Post by simon
For performance reasons still I think that in most cases for FK is
better to have identity.
Can you give me one example where identity is really bad idea for FK
for sake of performance? As you described, your reason was not of
performance nature but logical?
The updates become cumbersome, and that would of course affect performance.
Again, your mileage may vary. In a data warehouse, they have surrogate
keys all over the place, because they read a lot more than they update.
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
Continue reading on narkive:
Loading...