Discussion:
Xml shredding performance
Johnny Persson
2010-03-08 09:37:12 UTC
Hi,

we are having some performance issues with XML shredding.

At this point we are extracting data from XML files from nearly 60
different companies - and therefore 60 different XML structures. The
total amount of XML is about 350 MB, and we are trying to extract the
data as fast as possible.

Our current system extracts, transforms and loads the data in about five
minutes. We would, however, like to get that down to about one minute.

We use the "nodes/CROSS APPLY" technique to shred the XML into our
internal format.

This is how we shred the data.
------------------------------

1) Load xml into a temporary table (#XmlTable)
2) Set an xml index (see the sketch just below the list)
3) Query (like below)

INSERT INTO #TransformedData
SELECT
    T0.T.value('asasd', 'asdadd'),
    T1.T.value('asasd', 'asdadd')
FROM
    #XmlTable
CROSS APPLY
    data.nodes('asd') AS T0(T)
CROSS APPLY
    T0.T.nodes('level1') AS T1(T)

DROP TABLE #XmlTable

4) Pass the temporary table #TransformedData into the common/shared
transformation procedure

EXEC LookupData

-------------------------------
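For reference, the table and XML index from steps 1 and 2 are created
roughly like this (names simplified):

CREATE TABLE #XmlTable
(
    id   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, -- an XML index requires a clustered PK
    data xml NOT NULL
);

-- The primary XML index persists a shredded form of each document;
-- it can speed up the queries but is itself expensive to build.
CREATE PRIMARY XML INDEX PXML_XmlTable_data ON #XmlTable (data);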

This is very I/O intensive and it makes the system slow. Are there any
other good ways to parse the XML in SQL Server? Should we perhaps
move the shredding outside the SQL environment into, for instance, a C#
method that bulk loads the data?

Regards,
Johnny
Uri Dimant
2010-03-08 10:06:49 UTC
Johnny,

http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx


http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
Johnny Persson
2010-03-08 14:03:26 UTC
Thank you for your answer,

we use XML as the column data type, so the first article does not seem
to affect us. It is, however, interesting how the query optimizer
works - or doesn't :)

The second article had a really interesting part about typed/untyped
XML. We have had no XML schema for any of the files, so I tried that.

I created an XML schema for an XML file (~5 MB) and compared the
performance of a stored procedure which uses the schema against one
which does not.
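
The typed column was set up along these lines - a minimal sketch where
the schema is a trivial stand-in for the real company format:

CREATE XML SCHEMA COLLECTION dbo.CompanySchema AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="A">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="B" type="xs:string" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';

-- A column typed with the collection is validated on every insert.
CREATE TABLE dbo.XmlTableTyped
(
    id   int IDENTITY(1,1) PRIMARY KEY,
    data xml(dbo.CompanySchema)
);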

The result is a bit strange to me. When we use a typed XML column, the
execution time is ~15 s and the subtree cost is 428. When we use the
normal, untyped XML column, the execution time is the same, BUT the
subtree cost is 866!

Do you have any thoughts about the result and why the execution time
isn't affected?

Regards,
Johnny
Uri Dimant
2010-03-08 14:26:05 UTC
Johnny,

Unfortunately I have very limited knowledge about using XML in SQL
Server... perhaps Erland will jump in.
Erland Sommarskog
2010-03-08 23:05:40 UTC
Post by Johnny Persson
The result is a bit strange to me. When we use a typed XML column, the
execution time is ~15 s and the subtree cost is 428. When we use the
normal, untyped XML column, the execution time is the same, BUT the
subtree cost is 866!
The subtree cost is just an estimate, so take it for what it's worth.

Schema-bound XML may have some advantages if you extract a lot from the
same document, but it can also cut the other way, as schema validation
is quite expensive.

As I said in another post, you should look at how you address the nodes.
There is a lot to gain in that area.
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Johnny Persson
2010-03-09 07:03:50 UTC
Hi Erland,

Yes, below is a real example of one of our parsers. As far as I
understand, the node addressing is correct... or am I wrong? It has
happened before :)

SELECT
    -- /A/B
    TG.T.value('@name', 'nvarchar(255)'),
    TG.T.value('(ElementX)[1]', 'nvarchar(255)'),
    TG.T.value('(ElementY)[1]', 'nvarchar(255)'),

    -- /A/B/C
    TOS.T.value('@type', 'nvarchar(255)'),
    TOS.T.value('@name', 'nvarchar(255)'),

    -- /A/B/C/D
    TOO.T.value('@name', 'nvarchar(255)'),
    TOO.T.value('@id', 'nvarchar(255)')
FROM
    #XmlTable
CROSS APPLY Data.nodes('/A/B') AS TG(T)
CROSS APPLY TG.T.nodes('C') AS TOS(T)
CROSS APPLY TOS.T.nodes('D') AS TOO(T)

Regards,
Johnny
Erland Sommarskog
2010-03-09 08:13:51 UTC
Post by Johnny Persson
Yes, below is a real example of one of our parsers. As far as I
understand, the node addressing is correct... or am I wrong? It has
happened before :)
"Correct" and "best" are two different things.
Post by Johnny Persson
CROSS APPLY TG.T.nodes('C') AS TOS(T)
CROSS APPLY TOS.T.nodes('D') AS TOO(T)
Here you should have /C and /D for best performance.

If you only specify C, XPath will return any node named C, not just top
nodes. And finding those takes more time.
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Stefan Hoffmann
2010-03-08 10:08:41 UTC
hi Johnny,
Post by Johnny Persson
This is very I/O intensive and it makes the system slow. Are there any
other good ways to parse the XML in SQL Server? Should we perhaps
move the shredding outside the SQL environment into, for instance, a C#
method that bulk loads the data?
You may take a look at 'Performing Bulk Load of XML Data (SQLXML 4.0)':

http://technet.microsoft.com/en-us/library/ms171993.aspx

Depending on your files, I would at least set up some simple C# samples
to compare performance. You may consider implementing it as a CLR stored
procedure if it performs better.

http://technet.microsoft.com/en-us/library/ms131094%28SQL.90%29.aspx
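
If you go the CLR route, the SQL-side registration looks roughly like
this (path, assembly, class and method names are placeholders):

-- CLR integration must be enabled once per server.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

-- Register the compiled .NET assembly and expose its method as a procedure.
CREATE ASSEMBLY XmlShredder
FROM N'C:\bin\XmlShredder.dll'
WITH PERMISSION_SET = SAFE;

CREATE PROCEDURE dbo.ShredXml
    @xml xml
AS EXTERNAL NAME XmlShredder.[XmlShredder.Procedures].ShredXml;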


mfG
--> stefan <--
Erland Sommarskog
2010-03-08 11:34:54 UTC
Post by Johnny Persson
1) Load xml into a temporary table (#XmlTable)
2) Set an xml index
3) Query (like below)
INSERT INTO #TransformedData
SELECT
    T0.T.value('asasd', 'asdadd'),
    T1.T.value('asasd', 'asdadd')
FROM
    #XmlTable
CROSS APPLY
    data.nodes('asd') AS T0(T)
CROSS APPLY
    T0.T.nodes('level1') AS T1(T)
Assuming these are top-level nodes, change this to

data.nodes('/asd') AS T0(T)
T0.T.nodes('/level1') AS T1(T)

There can be a huge performance gain from correct node addressing.
(Unfortunately, the syntax for the best addressing form can be quite
convoluted in some cases.)
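
As a sketch with the placeholder names from your post: on untyped XML,
the fully specified form with text() and a positional predicate tends
to give the cheapest plans:

-- Fully specified path plus text()[1]; cheap on untyped XML.
SELECT T1.T.value('(asasd/text())[1]', 'nvarchar(255)')
FROM   #XmlTable
CROSS APPLY data.nodes('/asd/level1') AS T1(T)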
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Michael Coles
2010-03-09 02:31:27 UTC
Post by Johnny Persson
INSERT INTO #TransformedData
SELECT
    T0.T.value('asasd', 'asdadd'),
    T1.T.value('asasd', 'asdadd')
FROM
    #XmlTable
CROSS APPLY
    data.nodes('asd') AS T0(T)
CROSS APPLY
    T0.T.nodes('level1') AS T1(T)
Just adding to what Erland said, if your XML data looks like this:

<level1>
<asd>...</asd>
<asd>...</asd>
</level1>

Consider using the '//asd' path. Also consider putting a numeric predicate
on your .value() function calls, like this: 'asasd[1]'. It would help to have
some representative XML data if you can supply some (and some expected
results). The XML schema can give the optimizer some hints that can help
performance, but you have to weigh the benefit against the extra cost of
validating your data against it. Another option is to perform the shredding
outside of SQL Server in your application. If you don't want to bother with
SQLXML Bulk Loading you can use .NET to shred the data and bulk load it.
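
For example, with some made-up sample data the singleton predicate
looks like this:

DECLARE @x xml;
SET @x = N'<level1>
  <asd><name>foo</name></asd>
  <asd><name>bar</name></asd>
</level1>';

-- The [1] tells the optimizer each .value() call returns a singleton.
SELECT T.c.value('(name/text())[1]', 'nvarchar(255)') AS name
FROM @x.nodes('/level1/asd') AS T(c);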
--
Thanks

Michael Coles
SQL Server MVP
Author, "Expert SQL Server 2008 Encryption"
(http://www.apress.com/book/view/1430224649)
----------------
Johnny Persson
2010-03-09 07:07:55 UTC
Hi,

I have now tried to shred the data in a C# console project. The
performance gain is huge.

Which is the best option for passing the data to SQL Server? Bulk loading
into a table variable, or "sending the data" through a table-valued CLR method?

Regards,
Johnny
Erland Sommarskog
2010-03-09 08:15:27 UTC
Post by Johnny Persson
I have now tried to shred the data in a C# console project. The
performance gain is huge.
Which is the best option for passing the data to SQL Server? Bulk loading
into a table variable, or "sending the data" through a table-valued CLR method?
SqlBulkCopy or table-valued parameters if you are on SQL 2008.
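
For the TVP route, the SQL Server side is something like this (type,
procedure and column names are placeholders; on the client you pass a
DataTable or a DbDataReader as a parameter of SqlDbType.Structured):

CREATE TYPE dbo.TransformedDataType AS TABLE
(
    CompanyName nvarchar(255),
    ElementX    nvarchar(255),
    ElementY    nvarchar(255)
);

CREATE PROCEDURE dbo.LoadTransformedData
    @rows dbo.TransformedDataType READONLY -- TVPs must be declared READONLY
AS
    INSERT INTO dbo.TransformedData (CompanyName, ElementX, ElementY)
    SELECT CompanyName, ElementX, ElementY
    FROM @rows;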
--
Erland Sommarskog, SQL Server MVP, ***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Johnny Persson
2010-03-09 08:26:27 UTC
Ok,

thank you for all your answers. I will parse the XML inside a C# project
and pass the data through a table-valued parameter.

Thank you all, once again.

Regards
Johnny