KeySum - Using Checksum Keys
Document Sample


In addition to this the result can
Σ KEYSUM incorporate the length of the original
string which improves the uniqueness
© 1997 Data Management & Warehousing of lower order algorithm results
considerably.
INTRODUCTION
Keysum is a new and interesting HOW DOES THIS HELP?
technique (not a product) in the The table key is now an integer, the
generation of keys within a database. It optimal format on which to index. The
has particular application within Data user now calls a function to convert the
Warehouses where keys are often required string into the checksum and
made up of de-normalised uses the index to look up the
alphanumeric data. appropriate row. On very large tables
this is considerably faster than
THE PROBLEMS conventional string look-up.
Data that has been de-normalised often
has a primary key that is made up of a Furthermore the data can be validated,
single string, a series of concatenated as, if the current checksum differs from
strings, or other data types that can be the stored checksum then the data has
converted to strings. The key is changed. This also works when re-
traditionally costly in terms of storage loading data, as any existing data will
requirements and access speed when still be able to reference the old key. It
used in an index. It is, however, vital to should also be noted that when a field
the usability of the data. within the key is altered the key also
needs to be re-generated.
The second issue is that in a data
warehousing environment data may be If this technique is used in contexts
loaded and assigned an arbitrary such as trend analysis within a Data
unique number as a key. If the data Warehouse it is also possible that the
needs to be re-loaded at a later date, occasional mis-match because of a
possibly with additions, then it is duplicate checksum will not be
impossible to guarantee that the same statistically significant and therefore
arbitrary key will be assigned to the the key can be considered unique.
same row.
WHAT ARE THE ISSUES?
THE SOLUTION No checksum is guarantied to be
The solution is simplicity itself. The unique. It is therefore possible that two
generated key of the row should be the different records can return the same
checksum of the string that makes up value. If the length is included in the
the unique key. This will, depending checksum it is still not guarantied but it
on the checksum algorithm chosen, further reduces the risk. When
generate a large integer that will be choosing a checksum algorithm it is
nearly unique within the scope of the important to consider the amount of
data. For example using the industry records for which the checksum will
standard CRC32 algorithm will provide a key. If you have a table with
generate a number in the range 0 to 500,000 rows (such as a table that
4294967296, whilst using the Message contains addresses) then CRC32 will
Digest algorithm MD5 will generate a have an 8500:1 chance of duplicates
number between 0 and 3.4 * 1038. without considering the length of the
original string.
MD5 on the other has the remote An optional parameter was included to
6.8*1032:1 chance of generating a allow the use of different algorithms in
duplicate checksum. This is because it different contexts. For example where
uses 128 bits rather than CRC32 which only a small data set needs a checksum
uses only 32 bits. key then CRC32 may be suitable,
whilst MD5 is used only for the largest
When implementing the algorithm it is data sets.
important to note that checksums
normally return unsigned integers as WHERE DO I GET A CHECKSUM
their result. Your database and routines ALGORITHM?
that access the checksum must all be The inevitably answer to this question
able to handle the size of the result and is ‘From the Internet’. Any site that
ensure that they deal with the issue of distributes the source for FreeBSD
signed versus unsigned variables. includes an implementation of CRC32.
MD5 is also widely available.
IS THIS FEATURE AVAILABLE NOW?
There is no direct implementation of a THE FUTURE DIRECTION
checksum within the SQL Dialects of The author hope that in the future that
the major vendors currently available, Database vendors such as Oracle will
however it can be implemented via an add the checksum function to their
external procedure call. SQL dialects. Once available as a in-
built function the need to implement
The author has implemented this checksums via external procedure calls
technique within an Oracle7™ will disappear and performance will be
database. A daemon was created that improved even more. It will also allow
took as its input the string and returned some standardisation is the choice and
two values, the checksum and the handling of the checksum algorithms.
length. This was connected to the
database via a ‘Database Pipe’. When a Data Management & Warehousing is
checksum was required a PL/SQL the trading name of David M Walker, a
stored procedure was called that placed freelance Data Warehousing consultant.
the string into the database pipe and Address: 138, Finchampstead Road,
Wokingham, Berkshire,
received the two values, the checksum RG41 2NU, United Kingdom.
and the length, back. WWW: http://www.datamgmt.com
Telephone: +44 (0) 7050 028 911
The daemon was also implemented as Fax: +44 (0) 7050 028 912
a shared library so that it could be Copyright © 1997 All rights reserved.
All Copyrights and Trademarks respected
accessed from the command line and MD5 Copyright © 1991-2, RSA Data Security, Inc.
from other utilities that could call a Oracle7™ is a trademark of Oracle Corporation
shared ‘C’ library.
WHAT IS THE MD5 MESSAGE-DIGEST ALGORITHM?
MD5 is a message-digest algorithm. The algorithm takes as input a message of
arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of
the input. It is conjectured that it is computationally infeasible to produce two
messages having the same message digest, or to produce any message having a given
pre-specified target message digest.
The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the
MD5 algorithm does not require any large substitution tables; the algorithm can be
coded quite compactly.
Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All rights reserved.
Related docs
Other docs by DavidWalker18
OpenWorld04 - Information Delivery - The Change In Data Management At Network Rail - Presentation
Views: 21 | Downloads: 0
Get documents about "