Detect Changes Using Checksum

Discussion:

(too old to reply)

Dave

2006-08-30 14:36:02 UTC

Hi,

I am working on a project which integrates diverse and often unreliable data
sources. I need to check for changes in source data on a daily basis. I
import data into MSSQL 2000 and using checksum functionality I compare
yesterday's snapshot of the data with today's. I run a SProc which allocates
a checksum value to each record before the comparison. I can therefore
determine which records have changed and further process them.

But is checksum sophisticated/robust enough to rely on in a production
environment. Testing has not unearthed any problems so far but I was
wondering is my approach fundamentally sound or have I missed something?

Thanks,

Dave.

2006-08-30 14:59:02 UTC

Permalink

Why use checksums if you can compare the values directly?

Different values could produce the same checksum.

ML

---
http://milambda.blogspot.com/

Steve Kass

2006-08-30 15:03:57 UTC

Permalink

Dave,

Checksum will not detect all changes. It's not a question of
robustness, but mathematical impossibility. There are only
4 billion different possible checksum values, and far more than
4 billion different possible contents of a single row for a typical table.

Without comparing all the data, you are taking a chance - not a big
one perhaps, but a chance nonetheless - that you will get the same
checksum for different data.

See http://groups.google.com/groups/search?q=gabcdefe for an
example of when this happens.

Steve Kass
Drew University
www.stevekass.com

Post by Dave
Hi,
I am working on a project which integrates diverse and often unreliable data
sources. I need to check for changes in source data on a daily basis. I
import data into MSSQL 2000 and using checksum functionality I compare
yesterday's snapshot of the data with today's. I run a SProc which allocates
a checksum value to each record before the comparison. I can therefore
determine which records have changed and further process them.
But is checksum sophisticated/robust enough to rely on in a production
environment. Testing has not unearthed any problems so far but I was
wondering is my approach fundamentally sound or have I missed something?
Thanks,
Dave.

Dave

2006-08-30 15:22:02 UTC

Permalink

The reason I was using checksum (binary_checksum) was for speed and
simplicity (I Imagined). It was hoped to run the process considerably more
frequently than nightly. The risk of error is small but too big. I will
reconfigure my SProcs and perform proper data comparison.

Thanks ML and Steve.

Dave.

Alexander Kuznetsov

2006-08-30 17:46:30 UTC

Permalink

Post by Dave
frequently than nightly. The risk of error is small but too big. I will
reconfigure my SProcs and perform proper data comparison.

Dave,

instead of compatring data column by column, as follows

c1.col1 <> c2.col1 OR (c1.col1 IS NULL AND c2.col1 IS NOT NULL) OR
(c1.col1 IS NOT NULL AND c2.col1 IS NULL)

use

SELECT COUNT(*) FROM(
SELECT col1, col2, ... FROM YesterdaysVersion
UNION
SELECT col1, col2, ... FROM TodaysVersion) t

If it's 1 , versions are identical

Jim Underwood

2006-08-30 19:35:55 UTC

Permalink

Alex,
If both YesterdaysVersion and TodaysVersion contain 100 identical rows,
won't the count(*) return 100, rahter than 1?

Did you mean to do an except?

Post by Alexander Kuznetsov

Post by Dave
frequently than nightly. The risk of error is small but too big. I will
reconfigure my SProcs and perform proper data comparison.

Dave,
instead of compatring data column by column, as follows
c1.col1 <> c2.col1 OR (c1.col1 IS NULL AND c2.col1 IS NOT NULL) OR
(c1.col1 IS NOT NULL AND c2.col1 IS NULL)
use
SELECT COUNT(*) FROM(
SELECT col1, col2, ... FROM YesterdaysVersion
UNION
SELECT col1, col2, ... FROM TodaysVersion) t
If it's 1 , versions are identical

Hugo Kornelis

2006-08-30 21:27:02 UTC

Permalink

Post by Dave
The reason I was using checksum (binary_checksum) was for speed and
simplicity (I Imagined). It was hoped to run the process considerably more
frequently than nightly. The risk of error is small but too big. I will
reconfigure my SProcs and perform proper data comparison.
Thanks ML and Steve.
Dave.

Hi Dave,

In addition to Alexander's suggestion, here's yet another way to compare
two tables:

SELECT col1, col2, ..., colN, MIN(version) AS version
FROM (SELECT col1, col2, ..., colN, 'Yesterday' AS version
FROM YesterdaysVersion
UNION ALL
SELECT col1, col2, ..., colN, 'Today' AS version
FROM YesterdaysVersion)
GROUP BY col1, col2, ..., colN
WHERE COUNT(*) = 1

--
Hugo Kornelis, SQL Server MVP

Uri Dimant

2006-08-31 07:20:36 UTC

Permalink

Hugo
Perhaps I missunderstood the OP's request , but the below example shows that
both tables have an idetrnical data.

SELECT COUNT(*) FROM(
SELECT c FROM #t
UNION
SELECT c FROM #t1) t

create table #t (c int)
insert into #t values (50)
insert into #t values (100)
-----------
create table #t1(c int)
insert into #t1 values (50)
insert into #t1 values (100)

Post by Hugo Kornelis

Hi Dave,
In addition to Alexander's suggestion, here's yet another way to compare
SELECT col1, col2, ..., colN, MIN(version) AS version
FROM (SELECT col1, col2, ..., colN, 'Yesterday' AS version
FROM YesterdaysVersion
UNION ALL
SELECT col1, col2, ..., colN, 'Today' AS version
FROM YesterdaysVersion)
GROUP BY col1, col2, ..., colN
WHERE COUNT(*) = 1
--
Hugo Kornelis, SQL Server MVP

Hugo Kornelis

2006-09-01 21:51:40 UTC

Permalink

Post by Uri Dimant
Hugo
Perhaps I missunderstood the OP's request , but the below example shows that
both tables have an idetrnical data.
SELECT COUNT(*) FROM(
SELECT c FROM #t
UNION
SELECT c FROM #t1) t
create table #t (c int)
insert into #t values (50)
insert into #t values (100)
-----------
create table #t1(c int)
insert into #t1 values (50)
insert into #t1 values (100)

Hi Uri,

If I remove one of the rows from one of the tables, the result of the
query doesn't change even though the tables are no longer identical. If
I then remove the "other" row from the other table, the output is STILL
the same.

Your query can be used to find if the data is identical, but only in
conjunction with two extra queries, for the COUNT(*) of each of the two
tables individually. And once those have run, you know if the tables are
identical or not, but you don't know what the differences are.

My query shows the differences immediately, and is probably faster too
(one table scan for each table vs two scans for your query).

--
Hugo Kornelis, SQL Server MVP

Arnie Rowland

2006-08-30 17:45:31 UTC

Permalink

Building upon the previous remarks, and your concern for speed and
efficiency, you may be able to compare the indexed columns first, and then
only if there is a match, compare the non-indexed columns.
--
Arnie Rowland, Ph.D.
Westwood Consulting, Inc

Most good judgment comes from experience.
Most experience comes from bad judgment.
- Anonymous