KeySum - Using Checksum Keys

Shared by: DavidWalker18
-
Stats
views:
23
posted:
4/24/2012
language:
English
pages:
2
Document Sample
scope of work template
							                                             In addition to this the result can
Σ KEYSUM                                    incorporate the length of the original
                                             string which improves the uniqueness
© 1997 Data Management & Warehousing         of lower order algorithm results
                                             considerably.
INTRODUCTION
Keysum is a new and interesting              HOW DOES THIS HELP?
technique (not a product) in the             The table key is now an integer, the
generation of keys within a database. It     optimal format on which to index. The
has particular application within Data       user now calls a function to convert the
Warehouses where keys are often              required string into the checksum and
made      up     of      de-normalised       uses the index to look up the
alphanumeric data.                           appropriate row. On very large tables
                                             this is considerably faster than
THE PROBLEMS                                 conventional string look-up.
Data that has been de-normalised often
has a primary key that is made up of a       Furthermore the data can be validated,
single string, a series of concatenated      as, if the current checksum differs from
strings, or other data types that can be     the stored checksum then the data has
converted to strings. The key is             changed. This also works when re-
traditionally costly in terms of storage     loading data, as any existing data will
requirements and access speed when           still be able to reference the old key. It
used in an index. It is, however, vital to   should also be noted that when a field
the usability of the data.                   within the key is altered the key also
                                             needs to be re-generated.
The second issue is that in a data
warehousing environment data may be          If this technique is used in contexts
loaded and assigned an arbitrary             such as trend analysis within a Data
unique number as a key. If the data          Warehouse it is also possible that the
needs to be re-loaded at a later date,       occasional mis-match because of a
possibly with additions, then it is          duplicate checksum will not be
impossible to guarantee that the same        statistically significant and therefore
arbitrary key will be assigned to the        the key can be considered unique.
same row.
                                             WHAT ARE THE ISSUES?
THE SOLUTION                                 No checksum is guarantied to be
The solution is simplicity itself. The       unique. It is therefore possible that two
generated key of the row should be the       different records can return the same
checksum of the string that makes up         value. If the length is included in the
the unique key. This will, depending         checksum it is still not guarantied but it
on the checksum algorithm chosen,            further reduces the risk. When
generate a large integer that will be        choosing a checksum algorithm it is
nearly unique within the scope of the        important to consider the amount of
data. For example using the industry         records for which the checksum will
standard CRC32 algorithm will                provide a key. If you have a table with
generate a number in the range 0 to          500,000 rows (such as a table that
4294967296, whilst using the Message         contains addresses) then CRC32 will
Digest algorithm MD5 will generate a         have an 8500:1 chance of duplicates
number between 0 and 3.4 * 1038.             without considering the length of the
                                             original string.
MD5 on the other has the remote                      An optional parameter was included to
6.8*1032:1 chance of generating a                    allow the use of different algorithms in
duplicate checksum. This is because it               different contexts. For example where
uses 128 bits rather than CRC32 which                only a small data set needs a checksum
uses only 32 bits.                                   key then CRC32 may be suitable,
                                                     whilst MD5 is used only for the largest
When implementing the algorithm it is                data sets.
important to note that checksums
normally return unsigned integers as                 WHERE   DO          I   GET      A   CHECKSUM
their result. Your database and routines             ALGORITHM?
that access the checksum must all be                 The inevitably answer to this question
able to handle the size of the result and            is ‘From the Internet’. Any site that
ensure that they deal with the issue of              distributes the source for FreeBSD
signed versus unsigned variables.                    includes an implementation of CRC32.
                                                     MD5 is also widely available.
IS THIS FEATURE AVAILABLE NOW?
There is no direct implementation of a               THE FUTURE DIRECTION
checksum within the SQL Dialects of                  The author hope that in the future that
the major vendors currently available,               Database vendors such as Oracle will
however it can be implemented via an                 add the checksum function to their
external procedure call.                             SQL dialects. Once available as a in-
                                                     built function the need to implement
The author has implemented this                      checksums via external procedure calls
technique within an Oracle7™                         will disappear and performance will be
database. A daemon was created that                  improved even more. It will also allow
took as its input the string and returned            some standardisation is the choice and
two values, the checksum and the                     handling of the checksum algorithms.
length. This was connected to the
database via a ‘Database Pipe’. When a               Data Management & Warehousing is
checksum was required a PL/SQL                       the trading name of David M Walker, a
stored procedure was called that placed              freelance Data Warehousing consultant.
the string into the database pipe and                Address:   138, Finchampstead Road,
                                                                Wokingham, Berkshire,
received the two values, the checksum                           RG41 2NU, United Kingdom.
and the length, back.                                WWW:       http://www.datamgmt.com
                                                     Telephone: +44 (0) 7050 028 911
The daemon was also implemented as                   Fax:       +44 (0) 7050 028 912
a shared library so that it could be                 Copyright © 1997 All rights reserved.
                                                     All Copyrights and Trademarks respected
accessed from the command line and                   MD5 Copyright © 1991-2, RSA Data Security, Inc.
from other utilities that could call a               Oracle7™ is a trademark of Oracle Corporation
shared ‘C’ library.
WHAT IS THE MD5 MESSAGE-DIGEST ALGORITHM?

MD5 is a message-digest algorithm. The algorithm takes as input a message of
arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of
the input. It is conjectured that it is computationally infeasible to produce two
messages having the same message digest, or to produce any message having a given
pre-specified target message digest.

The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the
MD5 algorithm does not require any large substitution tables; the algorithm can be
coded quite compactly.
                         Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All rights reserved.

						
Related docs
Other docs by DavidWalker18