In addition to this the result can Σ KEYSUM incorporate the length of the original string which improves the uniqueness © 1997 Data Management & Warehousing of lower order algorithm results considerably. INTRODUCTION Keysum is a new and interesting HOW DOES THIS HELP? technique (not a product) in the The table key is now an integer, the generation of keys within a database. It optimal format on which to index. The has particular application within Data user now calls a function to convert the Warehouses where keys are often required string into the checksum and made up of de-normalised uses the index to look up the alphanumeric data. appropriate row. On very large tables this is considerably faster than THE PROBLEMS conventional string look-up. Data that has been de-normalised often has a primary key that is made up of a Furthermore the data can be validated, single string, a series of concatenated as, if the current checksum differs from strings, or other data types that can be the stored checksum then the data has converted to strings. The key is changed. This also works when re- traditionally costly in terms of storage loading data, as any existing data will requirements and access speed when still be able to reference the old key. It used in an index. It is, however, vital to should also be noted that when a field the usability of the data. within the key is altered the key also needs to be re-generated. The second issue is that in a data warehousing environment data may be If this technique is used in contexts loaded and assigned an arbitrary such as trend analysis within a Data unique number as a key. If the data Warehouse it is also possible that the needs to be re-loaded at a later date, occasional mis-match because of a possibly with additions, then it is duplicate checksum will not be impossible to guarantee that the same statistically significant and therefore arbitrary key will be assigned to the the key can be considered unique. same row. WHAT ARE THE ISSUES? THE SOLUTION No checksum is guarantied to be The solution is simplicity itself. The unique. It is therefore possible that two generated key of the row should be the different records can return the same checksum of the string that makes up value. If the length is included in the the unique key. This will, depending checksum it is still not guarantied but it on the checksum algorithm chosen, further reduces the risk. When generate a large integer that will be choosing a checksum algorithm it is nearly unique within the scope of the important to consider the amount of data. For example using the industry records for which the checksum will standard CRC32 algorithm will provide a key. If you have a table with generate a number in the range 0 to 500,000 rows (such as a table that 4294967296, whilst using the Message contains addresses) then CRC32 will Digest algorithm MD5 will generate a have an 8500:1 chance of duplicates number between 0 and 3.4 * 1038. without considering the length of the original string. MD5 on the other has the remote An optional parameter was included to 6.8*1032:1 chance of generating a allow the use of different algorithms in duplicate checksum. This is because it different contexts. For example where uses 128 bits rather than CRC32 which only a small data set needs a checksum uses only 32 bits. key then CRC32 may be suitable, whilst MD5 is used only for the largest When implementing the algorithm it is data sets. important to note that checksums normally return unsigned integers as WHERE DO I GET A CHECKSUM their result. Your database and routines ALGORITHM? that access the checksum must all be The inevitably answer to this question able to handle the size of the result and is ‘From the Internet’. Any site that ensure that they deal with the issue of distributes the source for FreeBSD signed versus unsigned variables. includes an implementation of CRC32. MD5 is also widely available. IS THIS FEATURE AVAILABLE NOW? There is no direct implementation of a THE FUTURE DIRECTION checksum within the SQL Dialects of The author hope that in the future that the major vendors currently available, Database vendors such as Oracle will however it can be implemented via an add the checksum function to their external procedure call. SQL dialects. Once available as a in- built function the need to implement The author has implemented this checksums via external procedure calls technique within an Oracle7™ will disappear and performance will be database. A daemon was created that improved even more. It will also allow took as its input the string and returned some standardisation is the choice and two values, the checksum and the handling of the checksum algorithms. length. This was connected to the database via a ‘Database Pipe’. When a Data Management & Warehousing is checksum was required a PL/SQL the trading name of David M Walker, a stored procedure was called that placed freelance Data Warehousing consultant. the string into the database pipe and Address: 138, Finchampstead Road, Wokingham, Berkshire, received the two values, the checksum RG41 2NU, United Kingdom. and the length, back. WWW: http://www.datamgmt.com Telephone: +44 (0) 7050 028 911 The daemon was also implemented as Fax: +44 (0) 7050 028 912 a shared library so that it could be Copyright © 1997 All rights reserved. All Copyrights and Trademarks respected accessed from the command line and MD5 Copyright © 1991-2, RSA Data Security, Inc. from other utilities that could call a Oracle7™ is a trademark of Oracle Corporation shared ‘C’ library. WHAT IS THE MD5 MESSAGE-DIGEST ALGORITHM? MD5 is a message-digest algorithm. The algorithm takes as input a message of arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given pre-specified target message digest. The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the MD5 algorithm does not require any large substitution tables; the algorithm can be coded quite compactly. Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All rights reserved.