					Hashing

  Dr. Ying Lu
ylu@cse.unl.edu
Giving credit where credit is due:
• Most of the slides are based on the lecture
  notes by Dr. David Matuszek, University of
  Pennsylvania


• I have modified many of the slides and
  added new slides.




             Example Problem
• Assume that you are searching for a book in a
  library catalog
   – You know the book’s ISBN
   – You want to get the book’s record and find out where
     the book is located in the library


• Which searching algorithm will you use?




                       Searching
• Searching an array for a given key
   – If the array is not sorted, the search requires O(n) time
      • If the key isn’t there, we need to search all n elements
      • If the key is there, we search n/2 elements on average
   – If the array is sorted, we can do a binary search
      • A binary search requires O(log n) time
      • About equally fast whether the element is found or not
   – It doesn’t seem like we could do much better
      • How about an O(1) search, that is, a constant-time search?
      • We can do it if the array is organized in a particular way
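
As a rough illustration of the first two cases, the sketch below compares a linear search with a binary search over a sorted Python list. The names `linear_search` and `binary_search` and the sample ISBN values are illustrative, not from the lecture; the binary search relies on the standard-library `bisect` module.

```python
from bisect import bisect_left

def linear_search(items, key):
    """Return the index of key in items, or -1. Works on unsorted data; O(n)."""
    for i, item in enumerate(items):
        if item == key:
            return i
    return -1

def binary_search(sorted_items, key):
    """Return the index of key in sorted_items, or -1. Needs sorted data; O(log n)."""
    i = bisect_left(sorted_items, key)
    if i < len(sorted_items) and sorted_items[i] == key:
        return i
    return -1

isbns = [1111, 2222, 3333, 4444, 5555]   # made-up keys, already sorted
print(linear_search(isbns, 4444))   # 3
print(binary_search(isbns, 4444))   # 3
print(binary_search(isbns, 9999))   # -1
```

Neither function reaches constant time; that is what the particular array organization introduced next is for.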



        Why search by key?
• We put key/value pairs into the table
   – We use a key to find a place in the table
   – The value holds the information we are actually interested in
   – Book records database (ISBN/book-info pairs)
   – Another example: student records database (ID/record pairs)

      index   key       value
      ...
      141
      142     robin     robin info
      143     sparrow   sparrow info
      144     hawk      hawk info
      145     seagull   seagull info
      146
      147     bluejay   bluejay info
      148     owl       owl info

                         Hashing
• Suppose we were to come up with a “magic
  function” that, given a key to search for, would
  tell us exactly where in the array to look
   – If it’s in that location, it’s in the array
   – If it’s not in that location, it’s not in the array


• This function is called a hash function




    Example (ideal) hash function
• Suppose our hash function gave us the following outputs:
     hashCode("apple") = 5
     hashCode("watermelon") = 3
     hashCode("grapes") = 8
     hashCode("cantaloupe") = 7
     hashCode("kiwi") = 0
     hashCode("strawberry") = 9
     hashCode("mango") = 6
     hashCode("banana") = 2
• The resulting table:

      index   key
      0       kiwi
      1
      2       banana
      3       watermelon
      4
      5       apple
      6       mango
      7       cantaloupe
      8       grapes
      9       strawberry
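
A minimal sketch of what an ideal hash function buys us, using the eight hash values listed on this slide. The dictionary `IDEAL_HASH` merely stands in for the magic function (a real hash function would compute the index from the key), and the names `hash_code`, `table`, and `contains` are illustrative.

```python
# Hash values copied from the slide; a real hash function would compute them.
IDEAL_HASH = {
    "apple": 5, "watermelon": 3, "grapes": 8, "cantaloupe": 7,
    "kiwi": 0, "strawberry": 9, "mango": 6, "banana": 2,
}

def hash_code(key):
    """Stand-in for the slide's hashCode; returns None for unknown keys."""
    return IDEAL_HASH.get(key)

table = [None] * 10
for fruit in IDEAL_HASH:
    table[hash_code(fruit)] = fruit      # no two fruits share a slot

def contains(key):
    """A single array access decides membership: O(1)."""
    index = hash_code(key)
    return index is not None and table[index] == key

print(table)
print(contains("mango"), contains("honeydew"))   # True False
```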
       Finding the hash function
• How can we come up with this magic function?
• In general, we cannot: there is no such magic
  function
   – In a few specific cases, where all the possible keys are
     known in advance, it has been possible to compute a
     perfect hash function
• What is the next best thing?
   – A perfect hash function would tell us exactly where to
     look
   – In general, the best we can do is a function that tells us
     where to start looking!


Example imperfect hash function
• Suppose our hash function gave us the following outputs:
  – hash("apple") = 5
    hash("watermelon") = 3
    hash("grapes") = 8
    hash("cantaloupe") = 7
    hash("kiwi") = 0
    hash("strawberry") = 9
    hash("mango") = 6
    hash("banana") = 2
    hash("honeydew") = 6

      index   key
      0       kiwi
      1
      2       banana
      3       watermelon
      4
      5       apple
      6       mango
      7       cantaloupe
      8       grapes
      9       strawberry

• Now what?
                   Collisions
• When two keys hash to the same array location,
  this is called a collision
• Collisions are normally treated as “first come, first
  served”: the first key that hashes to the location
  gets it
• We have to find something to do with the second
  and subsequent keys that hash to this same
  location
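
To make the collision concrete, the sketch below replays the hash values from the imperfect hash function example in the order they were listed and records every key that arrives at an already occupied slot. The names `HASHES`, `table`, and `collisions` are illustrative.

```python
# Hash values copied from the "imperfect hash function" slide, in slide order.
HASHES = [
    ("apple", 5), ("watermelon", 3), ("grapes", 8), ("cantaloupe", 7),
    ("kiwi", 0), ("strawberry", 9), ("mango", 6), ("banana", 2),
    ("honeydew", 6),
]

table = [None] * 10
collisions = []      # (arriving key, key already in the slot, slot index)

for key, slot in HASHES:
    if table[slot] is None:
        table[slot] = key                        # first come, first served
    else:
        collisions.append((key, table[slot], slot))

print(collisions)   # [('honeydew', 'mango', 6)]
```

The sketch only detects the collision; what to do with `honeydew` is the subject of the collision-handling techniques.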



            Handling collisions
• What can we do when two different keys attempt to
  occupy the same place in an array?
   – Solution #1 (closed hashing, also called open addressing): Search
     from there for an empty location
      • Can stop searching when we find the key or an empty location
      • The search must wrap around from the end of the array to the beginning
   – Solution #2 (open hashing, also called separate chaining): Use the
     array location as the header of a linked list of keys that hash to
     this location
• Both solutions work, provided:
   – We use the same technique to add things to the array as
     we use to search for things in the array
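
One way Solution #1 might look in code is sketched below. It assumes linear probing (step to the next slot on each collision), uses Python's built-in `hash` as the hash function, and supports neither deletion nor resizing; the class name `ClosedHashTable` and its methods are illustrative. Insertion and lookup share the same `_probe` helper, which is how the sketch satisfies the "same technique" condition above.

```python
class ClosedHashTable:
    """Closed hashing with linear probing; no deletion, no resizing."""

    def __init__(self, size=11):
        self.slots = [None] * size

    def _probe(self, key):
        """Return the index holding key, or the first empty slot on its path.

        Returns None if the table is full and key is absent.
        """
        size = len(self.slots)
        start = hash(key) % size
        for step in range(size):
            i = (start + step) % size          # wrap around past the end
            if self.slots[i] is None or self.slots[i] == key:
                return i
        return None

    def add(self, key):
        i = self._probe(key)
        if i is None:
            raise RuntimeError("hash table is full")
        self.slots[i] = key                    # no change if key was present

    def __contains__(self, key):
        i = self._probe(key)
        return i is not None and self.slots[i] == key

birds = ClosedHashTable()
for name in ["robin", "sparrow", "hawk", "seagull"]:
    birds.add(name)
print("hawk" in birds, "cow" in birds)   # True False
```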

                   Insertion, I
• Suppose you want to add seagull to this hash table
• Also suppose:
   – hashCode(seagull) = 143
   – table[143] is not empty
   – table[143] != seagull
   – table[144] is not empty
   – table[144] != seagull
   – table[145] is empty
• Therefore, put seagull at location 145

      index   key
      ...
      141
      142     robin
      143     sparrow
      144     hawk
      145     seagull
      146
      147     bluejay
      148     owl
      ...
                  Searching, I
• Suppose you want to look up seagull in this hash table
• Also suppose:
   – hashCode(seagull) = 143
   – table[143] is not empty
   – table[143] != seagull
   – table[144] is not empty
   – table[144] != seagull
   – table[145] is not empty
   – table[145] == seagull!
• We found seagull at location 145

      index   key
      ...
      141
      142     robin
      143     sparrow
      144     hawk
      145     seagull
      146
      147     bluejay
      148     owl
      ...
                 Searching, II
• Suppose you want to look up cow in this hash table
• Also suppose:
   – hashCode(cow) = 144
   – table[144] is not empty
   – table[144] != cow
   – table[145] is not empty
   – table[145] != cow
   – table[146] is empty
• If cow were in the table, we should have found it by now
• Therefore, it isn’t here

      index   key
      ...
      141
      142     robin
      143     sparrow
      144     hawk
      145     seagull
      146
      147     bluejay
      148     owl
      ...
                  Insertion, II
• Suppose you want to add hawk to this hash table
• Also suppose:
   – hashCode(hawk) = 143
   – table[143] is not empty
   – table[143] != hawk
   – table[144] is not empty
   – table[144] == hawk
• hawk is already in the table, so do nothing

      index   key
      ...
      141
      142     robin
      143     sparrow
      144     hawk
      145     seagull
      146
      147     bluejay
      148     owl
      ...

                   Insertion, III
• Suppose:
   – You want to add cardinal to this hash table
   – hashCode(cardinal) = 147
   – The last location is 148
   – 147 and 148 are occupied
• Solution:
   – Treat the table as circular; after 148 comes 0
   – Hence, cardinal goes in location 0 (or 1, or 2, or ...)

      index   key
      ...
      141
      142     robin
      143     sparrow
      144     hawk
      145     seagull
      146
      147     bluejay
      148     owl
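
The circular treatment usually comes down to a modulo operation. The sketch below assumes the table has 149 slots (indices 0 through 148) and that slot 0 is free; the slide elides slots 0 to 140, so that last point is an assumption. The names `probe_sequence` and `occupied` are illustrative.

```python
def probe_sequence(start, size):
    """Yield every table index once, starting at start and wrapping past the end."""
    for step in range(size):
        yield (start + step) % size

TABLE_SIZE = 149                              # indices 0..148, as on the slide
occupied = {142, 143, 144, 145, 147, 148}     # slots 0..141 assumed empty

# cardinal hashes to 147; take the first index on its path that is free
slot = next(i for i in probe_sequence(147, TABLE_SIZE) if i not in occupied)
print(slot)   # 0
```

If slot 0 were taken, the same loop would simply continue to 1, 2, and so on, matching the "(or 1, or 2, or ...)" on the slide.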
                Clustering
• One problem with the closed hashing
  technique is the tendency to form “clusters”
• A cluster is a group of items not containing
  any open slots
• The bigger a cluster gets, the more likely it is
  that new keys will hash into the cluster, and
  make it even bigger
• Clusters cause efficiency to degrade
• One remedy is double hashing: use a second hash
  function to compute the probe increment
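
A small sketch of double hashing follows. The slides do not specify a second hash function, so the one used here, `1 + (hash(key) // size) % (size - 1)`, is an assumption chosen only because it is never zero; the name `double_hash_probes` is likewise illustrative. The table size is taken to be prime so that every increment eventually visits every slot.

```python
def double_hash_probes(key, size):
    """Yield the probe sequence for key under double hashing.

    size should be prime so that any step value reaches all slots.
    """
    start = hash(key) % size
    step = 1 + (hash(key) // size) % (size - 1)   # second hash; never 0
    for n in range(size):
        yield (start + n * step) % size

# Two integer keys that collide at slot 3 of an 11-slot table
print(list(double_hash_probes(3, 11))[:4])    # [3, 4, 5, 6]
print(list(double_hash_probes(14, 11))[:4])   # [3, 5, 7, 9]
```

Both keys start at slot 3, but they then follow different paths, so they do not keep extending the same cluster.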
    Efficiency of Closed Hashing
• Hash tables are actually surprisingly efficient
• Until the table is about 70% full, the number of
  probes (places looked at in the table) is typically
  only 2 or 3
• Sophisticated mathematical analysis is required to
  prove that the expected cost of inserting into a
  hash table, or looking something up in the hash
  table, is O(1)
• Even if the table is nearly full (leading to long
  searches), efficiency is usually still quite high
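
The "2 or 3 probes" figure can be compared with the classical approximations for linear probing (due to Knuth), which assume that hash values are spread uniformly over the table. These formulas are not from the slides: with load factor a (the fraction of slots in use), a successful search costs about (1 + 1/(1 - a))/2 probes, and an unsuccessful search or an insertion about (1 + 1/(1 - a)^2)/2. The function names below are illustrative.

```python
def probes_successful(load):
    """Approximate expected probes to find a key that is present."""
    return 0.5 * (1 + 1 / (1 - load))

def probes_unsuccessful(load):
    """Approximate expected probes for a missing key (or an insertion)."""
    return 0.5 * (1 + 1 / (1 - load) ** 2)

for load in (0.5, 0.7, 0.9):
    print(f"{load:.0%} full: hit ~{probes_successful(load):.1f}, "
          f"miss ~{probes_unsuccessful(load):.1f}")
```

At 70% full the approximation gives roughly 2.2 probes for a hit, in line with the slide. Under the same approximation the cost of a miss rises sharply as the load factor approaches 1, which is one reason to keep a closed hashing table from filling up.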
       Solution #2: Open hashing
           (Bucket hashing)
• The previous solutions used closed hashing: all entries
  went into a “flat” (unstructured) array
• Another solution is to make each array location the header of
  a linked list of keys that hash to that location

      index   linked list of keys
      ...
      141
      142     robin
      143     sparrow -> seagull
      144     hawk
      145
      146
      147     bluejay
      148     owl
      ...

• Efficiency for searching?
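
A minimal sketch of open hashing is shown below. It uses a Python list for each chain in place of a hand-built linked list, and Python's built-in `hash` as the hash function; the class name `ChainedHashTable` and its methods are illustrative.

```python
class ChainedHashTable:
    """Open hashing: each slot holds a chain (here a Python list) of keys."""

    def __init__(self, size=11):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        if key not in bucket:          # scans only this one chain
            bucket.append(key)

    def remove(self, key):
        bucket = self._bucket(key)
        if key in bucket:
            bucket.remove(key)

    def __contains__(self, key):
        return key in self._bucket(key)

t = ChainedHashTable(size=5)
for k in (3, 8, 13, 4):                # 3, 8 and 13 all hash to slot 3
    t.add(k)
print(t.buckets)   # [[], [], [], [3, 8, 13], [4]]
```

As for search efficiency: a lookup scans only the chain at one slot, so its cost is proportional to that chain's length. With n keys spread evenly over m slots, that is about n/m comparisons on average.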
           Other considerations
• The hash table might fill up; we need to be
  prepared for that
   – Not a problem for open hashing, of course
• You cannot simply delete items from a closed hashing
  table by emptying their slots
   – This would create empty slots that stop a search early, so you
     might not find items that hash before the slot but end up
     after it
   – Again, not a problem for open hashing
• Generally speaking, hash tables work best when
  the table size is a prime number
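
The deletion problem can be reproduced in a few lines. The sketch below assumes linear probing in a 7-slot table with integer keys (for which Python's `hash` returns the integer itself, so 3 and 10 both land on slot 3); the helper names `find` and `add` are illustrative.

```python
SIZE = 7
table = [None] * SIZE

def find(key):
    """Linear-probing search; gives up at the first empty slot."""
    for step in range(SIZE):
        i = (hash(key) + step) % SIZE
        if table[i] is None:
            return None
        if table[i] == key:
            return i
    return None

def add(key):
    for step in range(SIZE):
        i = (hash(key) + step) % SIZE
        if table[i] is None or table[i] == key:
            table[i] = key
            return i
    raise RuntimeError("hash table is full")

add(3)               # hashes to slot 3
add(10)              # also hashes to slot 3, so it lands in slot 4
print(find(10))      # 4

table[3] = None      # naive deletion of key 3
print(find(10))      # None: the search stops at the emptied slot 3
```

Key 10 is still stored in slot 4, but the search can no longer reach it.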

              In-class exercises
• In an array of 2k+1 integers, there are k integers that
  appear twice and 1 integer that appears once in the
  array. Design an efficient algorithm to identify the
  unique integer.
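
One possible approach, among several, is sketched below: keep a hash set of values seen an odd number of times so far. It assumes the input really has the stated shape (every value twice except one) and runs in expected O(n) time, relying on the expected O(1) cost of each set operation. The name `find_unique` is illustrative.

```python
def find_unique(numbers):
    """Return the one value that appears once when every other value appears twice."""
    unpaired = set()
    for n in numbers:
        if n in unpaired:
            unpaired.remove(n)     # second occurrence: the pair is complete
        else:
            unpaired.add(n)        # first occurrence
    (unique,) = unpaired           # exactly one value remains, by assumption
    return unique

print(find_unique([4, 9, 4, 7, 9]))   # 7
```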




                   Design and Analysis of Algorithms – Chapter 6

				