CPSC335 01 Hashing

					          CPSC 335

Information Structures II
      Computer Science
       University of Calgary

   Definition of Hashing
   Did you know that?
   Hash functions
   Collision Resolution
   Analysis of searching with Hash tables

          Introduction to Hashing

Approaches to Search

1. Sequential and list methods
   (lists, tables, arrays).

2. Direct access by key value (hashing)

3. Tree indexing methods.


Hashing is the process of mapping a key value to a
  position in a table.

A hash function maps key values to positions.

A hash table is an array that holds the records.
    Searching in a hash table can be done in O(1) time on average,
  regardless of the hash table size, provided collisions are kept rare.


 Example of Usefulness
  10 stock details, 10 table positions

  Stock numbers are spread over a range of about 1000 possible values

  Using the whole stock number as the table index may
  require 1000 storage locations, which is an obvious
  waste of memory for only 10 records.
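
As a rough illustration of the stock example, the sketch below stores ten records in ten positions instead of reserving one position per possible stock number. The stock numbers and the hash function (remainder modulo the table size) are assumptions made for the illustration; the slide does not give them.

```python
TABLE_SIZE = 10  # ten table positions, as in the example


def hash_position(stock_number: int) -> int:
    """Map a stock number to a table position (assumed hash: remainder mod table size)."""
    return stock_number % TABLE_SIZE


# Hypothetical stock numbers, chosen so that no two share a last digit.
stock_numbers = [120, 331, 542, 753, 964, 175, 386, 597, 808, 19]

table = [None] * TABLE_SIZE
for number in stock_numbers:
    table[hash_position(number)] = number

print(table)  # every one of the 10 positions is used, none is wasted
```

With less convenient stock numbers two records would land on the same position, which is the collision problem discussed later.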


Applications of Hashing
    Compilers use hash tables to keep track of declared variables

    A hash table can be used for on-line spelling checkers: if
    misspelling detection (rather than correction) is important, an entire
    dictionary can be hashed and words checked in constant time

    Game playing programs use hash tables to store seen positions,
    thereby saving computation time if the position is encountered again

    Hash functions can be used to quickly check for inequality: if
    two elements hash to different values they must be different

    Storing sparse data

Did you know that?
   Cryptography was once known only to the key people in the
    National Security Agency and a few academics.
   Until 1996, it was illegal to export strong cryptography from the
    United States.
   Fast forward to 2006, and the Payment Card Industry Data
    Security Standard (PCI DSS) requires merchants to encrypt
    cardholder information. Visa and MasterCard can levy fines of up
    to $500,000 for not complying!
   Among methods recommended are:
       Strong one-way hash functions (hashed indexes)
       Truncation
       Index tokens and pads (pads must be securely stored)
       Strong cryptography
    [Hashing for fun and profit: Demystifying encryption for PCI DSS
        Roger Nebel]

Did you know that?
    On Windows systems configured to use only FIPS-compliant algorithms, the
     Transport Layer Security (TLS) provider uses the Rivest, Shamir, and
     Adleman (RSA) public key algorithm for the TLS key exchange and
     authentication, and only the Secure Hashing Algorithm 1 (SHA-1) for the
     TLS hashing requirements.

    [System cryptography: Use FIPS compliant algorithms for
     encryption, hashing, and signing, Microsoft TechNews, 2005]

Did you know that?

   Spatial hashing studies performed at Microsoft Research Redmond
    combine hashing with computer graphics to create a new set of tools for
    rendering, mesh reconstruction, and collision optimization (see public
    poster by Hugues Hoppe on the next slide)

Perfect Spatial Hashing
Sylvain Lefebvre, Hugues Hoppe (Microsoft Research)

• We design a perfect hash function to losslessly pack sparse data while retaining efficient random access.
  [Figure: a 2D example with a 128^2 domain, a 38^2 hash table and an 18^2 offset table; a 3D example with a 128^3 domain, a 35^3 hash table and a 19^3 offset table]

Hash function
• Simply: h(p) = p + Φ[p] (modulo table sizes)
  [Figure: a point p in the domain selects the offset table entry Φ[q], with q = p mod r; the hash table H is then read at position s = h(p)]
• Perfect hash on multidimensional data
  • No collisions → ideal for GPU
  • Single lookup into a small offset table
• Offsets only ~4 bits per defined data
• Access only ~4 instructions on GPU
• Optimized spatial coherence

Applications (2D and 3D)
  Vector images:        1024^2, 500KB, 700fps
  Sprite maps:          +900KB, 200fps
  Alpha compression:    0.9 bits/pixel, 800fps
  3D textures:          1024^3, 46MB, 530fps (nearest: 7.5MB, 370fps)
  3D painting:          2048^3, 56MB, 200fps
  Simulation:           256^3, 100fps
  Collision detection:  1024^3, 12MB, 140fps

Did you know that?

    Combining hashing and encryption
     provides a much stronger tool for
     database and password protection.
    http://msdn.microsoft.com/msdnmag/is
    [Security Briefs, MSDN Magazine]

How can I store passwords in a
custom user database?
    There are several options. The simplest might leave you with
    cleartext passwords. The following example is XML:
           <users>
             <user name='Alice' password='7&y2si(V1dX'/>
             <user name='Bob' password='mary'/>
             <user name='Fred' password='mary'/>
           </users>
    After implementing something like this, you'll likely feel rather
        uncomfortable that all those passwords are sitting there in one
        file, in the clear. If you don't feel uncomfortable, you should!

    The first approach you might take to protect these passwords is to
        encrypt them. That's better than nothing, but it's not the best
        solution. In order to validate a user's password, you need the
        encryption key, which means it needs to be available on the
        machine where the passwords are processed.

A better solution that doesn't require any key at all is a one-way function!
   A cryptographic hash algorithm like SHA-1 or MD5 is a sophisticated one-way
    function that takes some input and produces a hash value as output,
    like a checksum, but more resistant to collisions.
   For a hash function with no known collision attacks, it is incredibly unlikely
    that you'd find two messages that hash to the same value. (Practical collision
    attacks have since been published for both MD5 and SHA-1, so newer functions
    such as SHA-256 are preferred today.) As a one-way function, it can't be
    reversed. There is no key that you need to store. You hash the password
    before storing it in the database:
         <users>
           <user name='Alice' password='D16E9B18FA038...'/>
           <user name='Bob' password='5665331B9B819...'/>
           <user name='Fred' password='5665331B9B819...'/>
         </users>
    Now when you receive the cleartext password and need to verify it, you
       don't decrypt the stored password for comparison. Instead, you
       hash the password provided by the user and compare the result
       with your stored hash.
    If an attacker manages to steal your password database, he can't simply read
          the passwords, as the hashes can't be reversed back into cleartext.

   But look closely at Bob and Fred's hashed passwords. If the attacker happened to be
    Fred, he now knows that Bob uses the same password he does. What luck! Even without
    this sort of luck, a bad guy can perform a dictionary attack against the hashed passwords
    to find matches.
   The usual way a dictionary attack is performed is to get a list of commonly used
    passwords, like the lists you'll find at ftp://coast.cs.purdue.edu/pub/dict/wordlists, and
    calculate the hash for each. Now the attacker can compare the hash values of his
    dictionary with those in the password database. Once he finds a match, he looks up the
    corresponding password.
   To slow down the attack, use salt. Salt is a way to season the passwords before hashing
    them, making the attacker's precomputed dictionary useless. Here's how it's done.
    Whenever you add an entry to the database, you calculate a random string of digits to be
    used as salt. When you want to calculate the hash of Alice's password, you look up the
    salt value for Alice's account, prepend it to the password, and hash them together. The
    resulting database looks like this:
         <users>
           <user name='Alice' salt='Tu72*&' password='6DB80AE7...'/>
           <user name='Bob' salt='N5sb#X' password='096B1085...'/>
           <user name='Fred' salt='q-V3bi' password='9118812E...'/>
         </users>
   Note that now there is no way to tell that Bob and Fred are using the same password.
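
The salting steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cited article's implementation: SHA-256 stands in for the hash function, the salt is 16 random hex characters, and the function names are hypothetical.

```python
import hashlib
import secrets


def hash_password(password: str, salt: str) -> str:
    """Prepend the salt to the password and hash both together (SHA-256 is an assumption)."""
    return hashlib.sha256((salt + password).encode("utf-8")).hexdigest()


def create_entry(password: str) -> dict:
    """Build a database record holding a fresh random salt and the salted hash."""
    salt = secrets.token_hex(8)  # 16 random hex characters
    return {"salt": salt, "hash": hash_password(password, salt)}


def verify(entry: dict, candidate: str) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    expected = hash_password(candidate, entry["salt"])
    return secrets.compare_digest(entry["hash"], expected)


bob = create_entry("mary")
fred = create_entry("mary")
print(bob["hash"] != fred["hash"])                # same password, different stored hashes
print(verify(bob, "mary"), verify(bob, "guess"))  # only the right password verifies
```

A single fast hash keeps the sketch close to the text. Systems that store real passwords usually prefer a deliberately slow key-derivation function, for example `hashlib.pbkdf2_hmac` from the Python standard library, to make dictionary attacks more expensive.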

Salt: example of usage

 Below is a C# example of using a salted-hash helper class
      [Keith Brown, Hashing Passwords, The AllowPartiallyTrustedCallers Attribute]:

 // SaltedHash is assumed to be the helper class defined in the cited article;
 // it is not part of the .NET Framework class library.
 using System;   // for Console

 // hash a new password with a freshly generated salt
 string password = Console.ReadLine();
 SaltedHash sh = SaltedHash.Create(password);
 // imagine storing the salt and hash in a database
 string salt = sh.Salt;
 string hash = sh.Hash;
 Console.WriteLine("Salt: {0}", salt);
 Console.WriteLine("Hash: {0}", hash);
 // after looking up salt and hash, verify a password
 SaltedHash ver = SaltedHash.Create(salt, hash);
 bool isValid = ver.Verify(password);

                   Hash Functions

     Hashing is the process of chopping up the key and
    mixing it up in various ways in order to obtain an index
    which will be uniformly distributed over the range of
    indices -- hence the ‘hashing’.

There are several common ways of doing this:

 Truncation
 Folding
 Modular Arithmetic


Hash Functions – Truncation
     Truncation is a method in which parts of the key are ignored and
    the remaining portion becomes the index.
    - For this, we take the given key and produce a hash location by
     taking portions of the key (truncating the key).

    Example: If a hash table can hold 1000 entries and an 8-digit
             number is used as the key, the 3rd, 5th and 7th digits, counting
             from the left of the key, could be used to produce the index.
    - e.g. the key is 62538194 and the hash location is 589.
       Advantage: Simple and easy to implement.

       Problems: Clustering and repetition, since keys that differ only in the
                 ignored digits map to the same location.
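
The digit selection in the example can be sketched as follows. The function name is hypothetical, and padding the key to 8 digits is an assumption added so that digit positions are always defined.

```python
def truncation_hash(key: int) -> int:
    """Use the 3rd, 5th and 7th digits (from the left) of an 8-digit key as the index."""
    digits = f"{key:08d}"  # pad to 8 digits so the positions are well defined
    return int(digits[2] + digits[4] + digits[6])


print(truncation_hash(62538194))  # 589
print(truncation_hash(11518191))  # 589 again: a different key, same index
```

The second call shows the repetition problem: every key that shares those three digits lands on the same location.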

Hash Functions – Folding
    Folding breaks the key into several parts and combines the parts to
    form an index.
        - The parts may be recombined by addition, subtraction or multiplication,
    and the result may have to be truncated as well.
        - Such a process is usually better than truncation by itself since it
    produces a better distribution: all of the digits in the key are considered.
        - Breaking the key 62538194 into 3 numbers using the first 3, the next 3
    and the last 2 digits produces 625, 381 and 94. These could be added to get
    1100, which could be truncated to 100.
        - They could also be multiplied together and then three digits chosen
    from the middle of the number produced.
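
A minimal sketch of the additive variant of folding, assuming an 8-digit key split into parts of 3, 3 and 2 digits and a table of 1000 positions (the function name and the default table size are illustrative choices):

```python
def folding_hash(key: int, table_size: int = 1000) -> int:
    """Split an 8-digit key into parts of 3, 3 and 2 digits, add them, keep the low digits."""
    digits = f"{key:08d}"
    parts = [int(digits[0:3]), int(digits[3:6]), int(digits[6:8])]
    # Taking the remainder drops the leading digit, which truncates 1100 to 100.
    return sum(parts) % table_size


print(folding_hash(62538194))  # 625 + 381 + 94 = 1100, truncated to 100
```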


Hash Functions – (Modular Arithmetic)
     Modular arithmetic essentially assures that the index
    produced is within a specified range. For this, the key is converted to
    an integer, which is divided by the range of the index; the remainder
    is the value of the hash function.

Uses: biometrics, encryption, compression
       - If the value of the modulus is a prime number, the distribution of
    indices obtained is quite uniform.
       - A table whose size has many factors makes it likely that many keys
    produce the same index, so the size should be a prime number.
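
The effect of the table size can be checked with a small experiment. The keys below are hypothetical: they all share the factor 4, which is the kind of regularity that hurts a table size with many factors.

```python
def mod_hash(key: int, table_size: int) -> int:
    """Modular-arithmetic hash: the remainder of the key divided by the table size."""
    return key % table_size


keys = range(0, 400, 4)  # 100 hypothetical keys, all multiples of 4

used_with_12 = {mod_hash(k, 12) for k in keys}  # 12 shares the factor 4 with the keys
used_with_11 = {mod_hash(k, 11) for k in keys}  # 11 is prime

print(sorted(used_with_12))  # [0, 4, 8]
print(len(used_with_11))     # 11
```

With size 12 only three positions are ever used, while the prime size 11 spreads the same keys over every position.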


Good Hash Functions
    Hash functions which use all of the key are almost always better
    than those which use only some of the key.

    - When only portions are used, information is lost and therefore the
      number of possibilities for the final index is reduced.

    - If we deal with the integer in its binary form, then the number of
      pieces that can be manipulated by the hash function is greatly
      increased.

                     Collision Resolution

    No matter what function is used, the possibility exists that it will
    produce an index which duplicates an index already in use for a
    different key. This is a collision.

    Collision resolution strategies:

    - Open addressing: store the key/entry in a different position

    - Chaining: chain together several keys/entries in each position


Collision - Example
- - Hash table size 11
- - Hash function: key mod hash size

So, the new positions in the hash table are:

Some collisions occur with this hash function.
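
To make the example concrete, the sketch below hashes a handful of keys with key mod 11 and reports the positions claimed by more than one key. The keys are hypothetical; they are not the ones shown on the original slide.

```python
from collections import defaultdict

TABLE_SIZE = 11
keys = [23, 18, 29, 28, 39, 13, 16]  # hypothetical keys

# Group the keys by the position the hash function assigns to them.
positions = defaultdict(list)
for key in keys:
    positions[key % TABLE_SIZE].append(key)

# A collision is any position claimed by more than one key.
collisions = {pos: ks for pos, ks in positions.items() if len(ks) > 1}
print(collisions)  # {7: [18, 29], 6: [28, 39]}
```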


Collision Resolution – Open Addressing
     Open addressing resolves a collision by taking the next open space,
    as determined by rehashing the key according to some algorithm.

     Two main open addressing collision resolution techniques:
    - - Linear probing: increase the position by 1 each time [mod table size!]
    - - Quadratic probing: to the original position, add 1, 4, 9, 16, …
     In some cases a key-dependent increment technique is also used.

    If the table position given by the hashed key is already
    occupied, increase the position by some amount until an empty
    position is found.
Collision Resolution – Open Addressing
Linear Probing
new position = (current position + 1) MOD hash size
Example –
Before linear probing:

After linear probing:

Problem – Clustering occurs, that is, the used spaces tend to appear in groups
which tend to grow and thus increase the search time to reach an open space.
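
A minimal sketch of insertion with linear probing, assuming a table of size 11, the hash function key mod 11, and a hypothetical set of keys (the slide's own before/after tables are not reproduced here):

```python
TABLE_SIZE = 11


def insert_linear(table: list, key: int) -> int:
    """Insert key using linear probing and return the position that was used."""
    position = key % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[position] is None:
            table[position] = key
            return position
        position = (position + 1) % TABLE_SIZE  # step to the next slot, wrapping around
    raise OverflowError("hash table is full")


table = [None] * TABLE_SIZE
for key in [23, 18, 29, 28, 39, 13, 16]:  # hypothetical keys
    insert_linear(table, key)

print(table)  # [None, 23, 13, None, None, 16, 28, 18, 29, 39, None]
```

Positions 5 to 9 form one contiguous run: the keys 29 and 39 were pushed past their home positions and lengthened the cluster, which is the problem described above.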

Collision Resolution – Open Addressing
     In order to try to avoid clustering, a method which does not look for
    the first open space must be used.

 Two common methods are used –

    - - Quadratic Probing
    - - Key-dependent Increments

Collision Resolution – Open Addressing

Quadratic Probing
new position = (collision position + j^2) MOD hash size
                                             { j = 1, 2, 3, 4, … }
Example –
Before quadratic probing:

After quadratic probing:

Problem – Overflow may occur even when there is still space in the hash table,
because the probe sequence does not visit every position.
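
The overflow problem can be seen by listing which positions the quadratic probe sequence is able to reach. The sketch assumes a table of size 11 and a collision at position 0; the function name is illustrative.

```python
TABLE_SIZE = 11


def quadratic_probes(collision_position: int) -> list:
    """Positions examined by (collision position + j*j) MOD table size for j = 1, 2, 3, ..."""
    return [(collision_position + j * j) % TABLE_SIZE for j in range(1, TABLE_SIZE)]


reachable = sorted(set(quadratic_probes(0)))
print(reachable)  # [1, 3, 4, 5, 9]
```

Only five of the ten other positions are ever examined, so an insertion can fail while positions 2, 6, 7, 8 and 10 are still empty.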
Collision Resolution – Open Addressing

Key-dependent Increments
   This technique is used to solve the overflow problem of the
    quadratic probing method.

 The increments vary according to the key being hashed.
      If the original hash function results in a good distribution, then key-dependent
     functions work quite well for rehashing, and all locations in the
     table will eventually be probed for a free position (provided the
     increment and the table size share no common factor).

 Key-dependent increments are determined by using the key to
 calculate a new value and then using this value as the increment between
 successive probes.

Collision Resolution – Open Addressing
Key-dependent Increments
For example, since the original hash function was key MOD 11, we might choose a different
    function of the key, such as key MOD 7 or key DIV 11, to find the increment. Using the
    quotient, the probe function becomes:

new position = (current position + (key DIV 11)) MOD 11
Example –
Before key-dependent increments:

After key-dependent increments:

Collision Resolution – Open Addressing
Key-dependent Increments
     In all of the closed hash functions it is important to ensure that an
    increment of 0 does not arise.

    - - If the increment is 0, or equal to the hash size, the same position will be
      probed all the time, so these values cannot be used.

 If we ensure that the hash size is prime, that the divisors for the open and
 closed hash are prime, and that the rehash function does not produce a 0
 increment, then this method will usually access all positions, as does the linear
 method.

    - - Using a key-dependent method usually reduces clustering, and
       therefore searches for an empty position should not take as long as with the
       linear method.
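
A minimal sketch of key-dependent increments, assuming the key MOD 7 increment mentioned earlier, a table of size 11, and hypothetical keys. Replacing an increment of 0 by 1 is one possible guard chosen for this illustration; the slides do not say which guard was intended.

```python
TABLE_SIZE = 11  # prime, so any increment from 1 to 10 eventually visits every position


def increment_for(key: int) -> int:
    """Key-dependent increment based on key MOD 7; 0 is replaced by 1 (an assumed guard)."""
    step = key % 7
    return step if step != 0 else 1


def insert_key_dependent(table: list, key: int) -> int:
    """Insert key, stepping by its own increment after each collision."""
    position = key % TABLE_SIZE
    step = increment_for(key)
    for _ in range(TABLE_SIZE):
        if table[position] is None:
            table[position] = key
            return position
        position = (position + step) % TABLE_SIZE
    raise OverflowError("hash table is full")


table = [None] * TABLE_SIZE
for key in [23, 18, 29, 28, 39, 13, 16]:  # hypothetical keys
    insert_key_dependent(table, key)

print(table)  # [None, 23, 13, None, None, 16, 28, 18, 29, None, 39]
```

Key 39 collides at position 6 and jumps by 4 to position 10 instead of joining the run of occupied positions next to it, which is how this method reduces clustering.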

Collision Resolution – Chaining
    Each table position is a linked list
    Add the keys and entries anywhere in the
   list (front easiest)
Advantages over open addressing:
       - Simpler insertion and removal
       - Array size is not a limitation (but
         should still minimize collisions: make
         table size roughly equal to expected
         number of keys and entries)
Disadvantage:
       - Memory overhead is large if entries are small
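
A minimal sketch of chaining, assuming a table of size 11 and the hash function key mod 11. Python lists stand in for the linked lists, and the keys and record names are hypothetical.

```python
TABLE_SIZE = 11


def insert_chained(table: list, key: int, entry: str) -> None:
    """Add (key, entry) at the front of the chain for the key's position."""
    table[key % TABLE_SIZE].insert(0, (key, entry))


def find_chained(table: list, key: int):
    """Walk the chain at the key's position and return the entry, or None if absent."""
    for stored_key, entry in table[key % TABLE_SIZE]:
        if stored_key == key:
            return entry
    return None


table = [[] for _ in range(TABLE_SIZE)]  # one (initially empty) chain per position
for key in [18, 29, 28, 39]:             # hypothetical keys; 18 and 29 collide at 7
    insert_chained(table, key, f"record-{key}")

print(table[7])                                          # [(29, 'record-29'), (18, 'record-18')]
print(find_chained(table, 39), find_chained(table, 40))  # record-39 None
```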


Collision Resolution – Chaining
Before chaining:

After chaining:

          Analysis of Searching using Hash Tables
     In analyzing search efficiency, the average case is usually used. Searching with
    hash tables is highly dependent on how full the table is, since as the table
    approaches a full state, more rehashes are necessary. The proportion of the
    table which is full is called the load factor.
    - - When collisions are resolved using open addressing, the maximum load
        factor is 1.
    - - Using chaining, however, the load factor can be greater than 1, which happens
        when the linked lists attached to the hash addresses hold more entries in
        total than the table has positions.

 - Chaining consistently requires fewer probes than open addressing.
 - Traversal of the linked list is slow, and if the records are small, it may be just
    as well to use open addressing.
 - Chaining is best under two conditions: when the number of
   unsuccessful searches is large or when the records are large.
 - Open addressing would likely be a reasonable choice when most searches are
   likely to be successful, the load factor is moderate and the records are
   relatively small.
Average number of probes for different collision resolution methods
[The values are for large hash tables, in this case larger than 430]
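
As an illustration of how the load factor drives the probe count, the sketch below evaluates the standard textbook approximations for large tables (usually attributed to Knuth). These formulas are an assumption brought in from the general literature, not values taken from the slides: for linear probing, about (1 + 1/(1 - a))/2 probes for a successful search and (1 + 1/(1 - a)^2)/2 for an unsuccessful one; for chaining, about 1 + a/2 and a, where a is the load factor.

```python
def linear_probing_probes(load: float) -> tuple:
    """Approximate (successful, unsuccessful) probe counts for linear probing, load < 1."""
    successful = 0.5 * (1 + 1 / (1 - load))
    unsuccessful = 0.5 * (1 + 1 / (1 - load) ** 2)
    return successful, unsuccessful


def chaining_probes(load: float) -> tuple:
    """Approximate (successful, unsuccessful) probe counts for chaining."""
    return 1 + load / 2, load


for load in (0.25, 0.5, 0.75, 0.9):
    lp_hit, lp_miss = linear_probing_probes(load)
    ch_hit, ch_miss = chaining_probes(load)
    print(f"load {load:.2f}: linear probing {lp_hit:.2f}/{lp_miss:.2f}, "
          f"chaining {ch_hit:.2f}/{ch_miss:.2f}")
```

Under these approximations the cost of linear probing climbs steeply as the table fills, while chaining grows only linearly with the load factor.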

When are other representations more suitable than hashing:

 Hash tables are very good if there is a need for many searches in a
 reasonably stable table

 Hash tables are not so good if there are many insertions and
  deletions, or if table traversals are needed — in this case, AVL trees
  are better

 If there are more data than available memory then use a B-tree

 Also, hashing is very slow for any operation which requires the
  entries to be sorted, e.g. finding the minimum key

              Some Links to Hashing Animation
Links for interactive hashing example:

   http://www.engin.umd.umich.edu/CIS/course.des/cis350/hashing/WEB/HashApplet.htm

   http://www.cs.auckland.ac.nz/software/AlgAnim/hash_tables.html

   http://www.cse.yorku.ca/~aaw/Hang/hash/Hash.html

   http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html

