Tables and Hashing


• tables and hashing
• amortized analysis

          Dictionary data structure

   Dictionary:
    –   Dynamic-set data structure for storing items indexed by keys.
    –   Supports the operations Insert, Search, and Delete.
    –   Keys can be of any type (string, tuple, …), but they are converted to
        integers (array indices) for storage.
    –   Applications:
          Symbol table of a compiler.

          Memory-management tables in operating systems.

          Access person by name

   Hash Tables:
    –   Effective way of implementing dictionaries.
    –   Generalization of ordinary arrays.

          Direct-address Tables
   Direct-address Tables are ordinary arrays.
   Facilitate direct addressing.
    –   Element whose key is k is obtained by indexing into the kth
        position of the array.
   Applicable when we can afford to allocate an array with
    one position for every possible key.
    –   i.e. when the universe of keys U is small.
   Dictionary operations can be implemented to take O(1) time.
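As a minimal sketch (assuming integer keys drawn from a small universe; the class and method names are ours), a direct-address table is just an array with one slot per possible key:

```python
# A minimal sketch of a direct-address table: one array slot per possible key.
# Assumes keys are integers from a small universe {0, ..., size-1}.

class DirectAddressTable:
    def __init__(self, size):
        self.slots = [None] * size   # one position for every possible key

    def insert(self, key, entry):
        self.slots[key] = entry      # O(1): index directly by the key

    def search(self, key):
        return self.slots[key]       # O(1)

    def delete(self, key):
        self.slots[key] = None       # O(1)

table = DirectAddressTable(100)
table.insert(42, "answer")
print(table.search(42))   # -> answer
```

All three operations are single array accesses, which is why they are O(1); the cost is the array of size |U|, even if only a few keys are ever used.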

              Tables: rows & columns of information
   A table has several fields (types of information)
     – A telephone book may have fields name, address, and phone number
     – A user account table may have fields user id, password, and home directory
   To find an entry in the table, you only need to know the
    contents of one of the fields (not all of them). This field is
    the key
     – In a telephone book, the key is usually name
     – In a user account table, the key is usually user id

   Ideally, a key uniquely identifies an entry
     –   If the key is name and no two entries in the telephone book
         have the same name, the key uniquely identifies the entries
         The Table ADT: operations
 insert: given a key and an entry, inserts the entry into the table
 find: given a key, finds the entry associated with the key

 remove: given a key, finds the entry associated with the

  key, and removes it

 getIterator: returns an iterator, which visits each of the

  entries one by one (the order may or may not be defined)

             How should we implement a table?
             Our choice of representation for the Table ADT
             depends on the answers to the following questions:

 How often are entries inserted and removed?
 How many of the possible key values are likely to be used?

 What is the likely pattern of searching for keys?

    –   e.g. Will most of the accesses be to just one or two key values?
 Is the table small enough to fit into memory?
 How long will the table exist?

            TableNode: a key and its entry
   For searching purposes, it is best to store the key and the
    entry separately (even though the key's value may be
    inside the entry)

                   key                   entry
                 “Smith” “Smith”, “124 Hawkers Lane”, “9675846”
                 “Yeo”   “Yeo”, “1 Apple Crescent”, “0044 1970 622455”

         Implementation 1:
         unsorted sequential array
 An array in which TableNodes are
  stored consecutively in any order
 insert: add to back of array; O(1)
 find: search through the keys one at
  a time, potentially all of the keys; O(n)
 remove: find, then replace the removed
  node with the last node; O(n)

          Implementation 2:
          sorted sequential array
 An array in which TableNodes are
  stored consecutively, sorted by key
 insert: add in sorted order; O(n)
 find: binary chop; O(log n)
 remove: find, then remove the node and
  shuffle down; O(n)

    We can use binary chop because the
    array elements are sorted
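The O(log n) find by binary chop can be sketched as follows (a minimal illustration; the function name and sample data are ours):

```python
# Sketch of "find" on a sorted sequential array using binary chop (binary search).
# Each element is a (key, entry) pair kept in key order; O(log n) comparisons.

def find(table, key):
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if table[mid][0] == key:
            return table[mid][1]      # found: return the entry
        elif table[mid][0] < key:
            lo = mid + 1              # key must be in the upper half
        else:
            hi = mid - 1              # key must be in the lower half
    return None                       # key not present

phone_book = [("Ng", "555-1212"), ("Smith", "9675846"), ("Yeo", "622455")]
print(find(phone_book, "Smith"))   # -> 9675846
```

Each iteration halves the remaining search range, which is exactly why sorted order is required.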

         Implementation 3:
         linked list (unsorted or sorted)
 TableNodes are stored in a linked list,
  unsorted or sorted by key
 insert: add to front; O(1)
  (or O(n) for a sorted list)
 find: search through potentially all
  the keys, one at a time; O(n)
  (still O(n) for a sorted list)
 remove: find, then remove using pointer
  alterations; O(n)

         Implementation 4:
         AVL tree
 An AVL tree, ordered by key
 insert: a standard insert; O(log n)
 find: a standard find (without
  removing, of course); O(log n)
 remove: a standard remove; O(log n)

    O(log n) is very good…
    …but O(1) would be even better!

            Implementation 5:
            hashing
   An array in which TableNodes are
    not stored consecutively - their
    place of storage is calculated
    using the key and a hash function

        key → hash function → array index

 Hashed key: the result of applying
  a hash function to a key
 Keys and entries are scattered
  throughout the array

          Implementation 5:
          hashing
 An array in which TableNodes are
  not stored consecutively - their
  place of storage is calculated
  using the key and a hash function
 insert: calculate place of storage,
  insert TableNode; O(1)
 find: calculate place of storage,
  retrieve entry; O(1)
 remove: calculate place of
  storage, set it to null; O(1)

        All are O(1) !
            Hashing example: a fruit shop

10 stock details, 10 table positions
Stock numbers are between 0 and 1000
Use hash function: stock no. / 100
What if we now insert stock no. 350?
Position 3 is occupied: there is a collision
Collision resolution strategy: insert in the
next free position (linear probing)
Given a stock number, we find stock by
using the hash function again, and use the
collision resolution strategy if necessary

    position   key    entry
    0          85     85, apples
    3          323    323, guava
    4          462    462, pears
    5          350    350, oranges
    9          912    912, papaya
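The fruit-shop example can be sketched in code (a simplified illustration; the function names are ours, and the sketch assumes the table never becomes completely full):

```python
# The fruit-shop example: 10 positions, hash = stock_no // 100,
# linear probing to resolve collisions.

SIZE = 10
table = [None] * SIZE

def insert(stock_no, entry):
    pos = stock_no // 100             # hash function: stock no. / 100
    while table[pos] is not None:     # collision: try the next free position
        pos = (pos + 1) % SIZE
    table[pos] = (stock_no, entry)

def find(stock_no):
    pos = stock_no // 100             # hash again to find the start position
    while table[pos] is not None:
        if table[pos][0] == stock_no:
            return table[pos][1]
        pos = (pos + 1) % SIZE        # keep probing past other keys
    return None                       # an empty slot ends the search

for no, fruit in [(85, "apples"), (323, "guava"), (462, "pears"),
                  (350, "oranges"), (912, "papaya")]:
    insert(no, fruit)

print(find(350))   # -> oranges (stored at position 5, after probing past 3 and 4)
```

Stock 350 hashes to position 3, finds 3 and 4 occupied, and lands in position 5, matching the table above.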

               Three factors affecting the
               performance of hashing
   The hash function
     – Ideally, it should distribute keys and entries evenly throughout the table
     – It should minimise collisions, where the position given by the hash
       function is already occupied
   The collision resolution strategy
     – Separate chaining: chain together several keys/entries in each position
     – Open addressing: store the key/entry in a different position

   The size of the table
     – Too big will waste memory; too small will increase collisions and may
       eventually force rehashing (copying into a larger table)
     – Should be appropriate for the hash function used – and a prime number
       is best

                Choosing a hash function:
                turning a key into a table position
   Truncation
     – Ignore part of the key and use the rest as the array index
       (converting non-numeric parts)
     – A fast technique, but check for an even distribution
   Folding
     – Partition the key into several parts and then combine them in any
       convenient way
     – Unlike truncation, uses information from the whole key

   Modular arithmetic (used by truncation & folding, and on its own)
     –   To keep the calculated table position within the table, divide the
         position by the size of the table, and take the remainder as the
         new position

              Examples of hash functions (1)
   Truncation: If students have a 9-digit identification
    number, take the last 3 digits as the table position
     –   e.g. 925371622 becomes 622
   Folding: Split a 9-digit number into three 3-digit numbers,
    and add them
     –   e.g. 925371622 becomes 925 + 371 + 622 = 1918
   Modular arithmetic: If the table size is 1000, the first
    example always keeps within the table range, but the
    second example does not (it should be mod 1000)
     –   e.g. 1918 mod 1000 = 918     (in Java: 1918 % 1000)
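The three techniques can be sketched as follows (illustrative function names are ours; note that 925 + 371 + 622 = 1918):

```python
# The three example hash functions as code.

def truncate(key):
    return key % 1000                 # last 3 digits of a 9-digit id

def fold(key):
    parts = (key // 1000000,          # first 3 digits
             (key // 1000) % 1000,    # middle 3 digits
             key % 1000)              # last 3 digits
    return sum(parts)                 # combine by adding

def compress(position, table_size):
    return position % table_size      # modular arithmetic keeps us in range

print(truncate(925371622))               # -> 622
print(fold(925371622))                   # -> 1918  (925 + 371 + 622)
print(compress(fold(925371622), 1000))   # -> 918
```

Folding can exceed the table range, which is why it is usually followed by the modular step.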

             Examples of hash functions (2)
   Using a telephone number as a key
     – The area code is not random, so will not spread the keys/entries
       evenly through the table (many collisions)
     – The last 3 digits are more random

   Using a name as a key
     – Use the full name rather than the surname (a surname alone is not
       particularly random)
     – Assign numbers to the characters (e.g. a = 1, b = 2; or use
       Unicode values)
     – Strategy 1: Add the resulting numbers. Bad for a large table size.
     – Strategy 2: Call the number of possible characters c (e.g. c = 54
       for the alphabet in upper and lower case, plus space and hyphen).
       Then multiply each character in the name by increasing powers
       of c, and add the results together.
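The two strategies can be sketched as follows (a sketch; the character numbering a = 1, …, hyphen = 54 follows the text, and Horner's rule is used to accumulate the powers of c):

```python
# Two name-hashing strategies, using c = 54 as in the text.

def char_code(ch):
    # Assumed mapping: a=1..z=26, A=27..Z=52, space=53, hyphen=54.
    if 'a' <= ch <= 'z':
        return ord(ch) - ord('a') + 1
    if 'A' <= ch <= 'Z':
        return ord(ch) - ord('A') + 27
    return {' ': 53, '-': 54}[ch]

def hash_add(name):
    # Strategy 1: just add the character codes.
    return sum(char_code(ch) for ch in name)

def hash_poly(name, c=54):
    # Strategy 2: treat the name as digits base c (Horner's rule).
    h = 0
    for ch in name:
        h = h * c + char_code(ch)
    return h

print(hash_add("Yeo"))    # small even for long names: bad for a large table
print(hash_poly("Yeo"))   # spreads names over a much larger range
```

Strategy 1 produces only small values (at most 54 per character), so most of a large table would never be used; Strategy 2 distinguishes anagrams and spreads keys much more widely.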
              Choosing the table size to
              minimise collisions
 As the number of elements in the table increases, the
  likelihood of a collision increases - so make the table as
  large as practical
 If the table size is 100, and all the hashed keys are
  divisible by 10, there will be many collisions!
     –   Particularly bad if the table size is a power of a small integer such
         as 2 or 10
   More generally, collisions may be more frequent if:
     –   greatest common divisor (hashed keys, table size) > 1
   Therefore, make the table size a prime number (gcd = 1)
                             Collisions may still happen, so we
                             need a collision resolution strategy
             Collision resolution:
             open addressing (1)
         Probing: If the table position given by the hashed
         key is already occupied, increase the position by
         some amount, until an empty position is found

   Linear probing: increase the position by 1 each time [mod table size!]
   Quadratic probing: to the original position, add 1, 4, 9, 16, …

    Use the collision resolution strategy when inserting and when
    finding (ensure that the search key and the found keys match)
 May also use double hashing: the probe step is given by the result of a second hash function
With open addressing, the table size should be double the expected no. of elements
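Double hashing can be sketched as follows (an illustrative scheme, not from the slides: h2 yields a non-zero step, and a prime table size of 11 is assumed):

```python
# Sketch of double hashing: the probe step comes from a second hash function,
# so keys that collide on h1 still follow different probe sequences.

SIZE = 11                                 # prime table size

def h1(key):
    return key % SIZE                     # primary hash: start position

def h2(key):
    return 1 + (key % (SIZE - 1))         # step in 1..SIZE-1, never 0

def probe_sequence(key, n=4):
    """First n positions examined for `key`."""
    return [(h1(key) + j * h2(key)) % SIZE for j in range(n)]

# 22 and 33 both hash to position 0, but probe along different sequences:
print(probe_sequence(22))   # step h2(22) = 3
print(probe_sequence(33))   # step h2(33) = 4
```

Because colliding keys take different steps, double hashing avoids the clustering that linear probing suffers from.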

              Collision resolution:
              open addressing (2)
   Even when the table is fairly empty, with few collisions, linear
    probing tends to cluster (group) keys/entries
     –   This increases the time to insert and to find

    [diagram: table positions 1–8, with positions 1, 3, 5 and 6 already occupied]

For a table of size n, if the table is empty, the probability of the next entry
going to any particular place is 1/n
In the diagram, the probability of position 2 getting filled next is 2/n (either a
hash to 1 or to 2 fills it)
Once 2 is full, the probability of 4 being filled next is 4/n, and then that of 7 is 7/n
(i.e. the probability of getting long runs steadily increases)

         Collision resolution:
         open addressing (3)
 An empty key/entry marks the end of a cluster, and so can
  be used to terminate a find operation
 So, if we remove an entry within a cluster, we should not

  empty it!
 To allow probing to continue, the removed entry must be

  marked as 'removed but cluster continues'

              Collision resolution:
              open addressing (4)
   Quadratic probing is a solution to the clustering problem
     – Linear probing adds 1, 2, 3, etc. to the original hashed key
     – Quadratic probing adds 1², 2², 3², etc. to the original hashed key

   However, whereas linear probing guarantees that all empty
    positions will be examined if necessary, quadratic probing
    does not
     –   e.g. Table size 16 and original hashed key 3 gives the
         sequence: 3, 4, 7, 12, 3, 12, 7, 4…
   More generally, with quadratic probing, insertion may be
    impossible if the table is more than half-full!
     –   Need to rehash (see later)
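The slide's example can be checked directly (a small script reproducing the probe sequence for table size 16 and hashed key 3):

```python
# Quadratic probing with table size 16 and original hashed key 3:
# the probe sequence keeps revisiting the same few slots.

size, key = 16, 3
seen = []
for j in range(9):
    seen.append((key + j * j) % size)   # add 0, 1, 4, 9, 16, ... mod size

print(seen)             # -> [3, 4, 7, 12, 3, 12, 7, 4, 3]
print(len(set(seen)))   # only 4 distinct positions out of 16
```

Only positions 3, 4, 7 and 12 are ever examined, so an insertion can fail even though 12 of the 16 slots are empty. This is the coverage failure the text describes (a prime table size avoids the worst of it).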

             Collision resolution: chaining

   Each table position is a linked list
   Add the keys and entries anywhere in
    the list (the front is easiest): no need
    to move existing entries!
   Advantages over open addressing:
      – Simpler insertion and removal
      – Array size is not a limitation (but
        should still minimise collisions: make
        table size roughly equal to expected
        number of keys and entries)
   Disadvantage
      – Memory overhead is large if entries
        are small
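A chained table can be sketched as follows (a minimal illustration; the class name is ours, and Python's built-in hash is used for brevity):

```python
# Sketch of separate chaining: each table position holds a list of
# (key, entry) pairs, with new pairs added at the front.

class ChainedHashTable:
    def __init__(self, size):
        self.buckets = [[] for _ in range(size)]

    def _hash(self, key):
        # Note: Python randomises string hashing between runs, but the
        # value is stable within one run, so insert and find agree.
        return hash(key) % len(self.buckets)

    def insert(self, key, entry):
        self.buckets[self._hash(key)].insert(0, (key, entry))  # front: O(1)

    def find(self, key):
        for k, e in self.buckets[self._hash(key)]:  # search one chain only
            if k == key:
                return e
        return None

    def remove(self, key):
        bucket = self.buckets[self._hash(key)]
        bucket[:] = [(k, e) for k, e in bucket if k != key]

t = ChainedHashTable(10)
t.insert("Smith", "124 Hawkers Lane")
print(t.find("Smith"))   # -> 124 Hawkers Lane
```

A collision simply lengthens one chain; no probing or tombstone marking is needed, which is why insertion and removal are simpler than with open addressing.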

             Rehashing: enlarging the table
   To rehash:
     – Create a new table of double the size (adjusting until the size is again prime)
     – Transfer the entries in the old table to the new table, by recomputing
       their positions (using the hash function)
   When should we rehash?
     – When the table is completely full
     – With quadratic probing, when the table is half-full or insertion fails
   Why double the size?
     – If n is the number of elements in the table, there must have been n/2
       insertions since the previous rehash (if rehashing is done when the table is full)
     – So by making the table size 2n, only a constant amortized cost is added to each
       insertion

            Applications of Hashing
   Compilers use hash tables to keep track of declared variables
   A hash table can be used for on-line spelling checkers — if
    misspelling detection (rather than correction) is important, an
    entire dictionary can be hashed and words checked in
    constant time
   Game playing programs use hash tables to store seen
    positions, thereby saving computation time if the position is
    encountered again
   Hash functions can be used to quickly check for inequality —
    if two elements hash to different values they must be different
   Storing sparse data

             When are other representations
             more suitable than hashing?
 Hash tables are very good if there is a need for many
  searches in a reasonably stable table
 Hash tables are not so good if there are many insertions

  and deletions, or if table traversals are needed — in this
  case, AVL trees are better
 If there is more data than available memory, then use a
  disk-based structure
 Also, hashing is very slow for any operations which require

  the entries to be sorted
    –   e.g. Find the minimum key


   What do we lose?
    –   Operations that require ordering are inefficient
    –   FindMax:     O(n)        vs O(log n) for a balanced binary tree
    –   FindMin:     O(n)        vs O(log n) for a balanced binary tree
    –   PrintSorted: O(n log n)  vs O(n) for a balanced binary tree
   What do we gain?
    –   Insert:   O(1)           vs O(log n) for a balanced binary tree
    –   Delete:   O(1)           vs O(log n) for a balanced binary tree
    –   Find:     O(1)           vs O(log n) for a balanced binary tree
   How do we handle collisions?
    –   Separate chaining
    –   Open addressing

         Performance of Hashing
 The number of probes depends on the load factor (usually
  denoted by λ), which is the ratio of the number of entries present
  in the table to the number of positions in the array
 We also need to consider successful and unsuccessful
  searches separately
 For a chained hash table, the average number of probes
  for an unsuccessful search is λ, and for a successful search
  it is 1 + λ/2

           Performance of Hashing (2)

   For open addressing, the formulae are more complicated
    but typical values are:
     Load factor           0.1    0.5   0.8    0.9    0.99
     Successful search
     Linear probing        1.05   1.6   3.4    6.2    21.3
     Quadratic probing     1.04   1.5   2.1    2.7    5.2
     Unsuccessful search
     Linear probing        1.13   2.7   15.4   59.8   430
     Quadratic probing     1.13   2.2   5.2    11.9   126
   Note that these do not depend on the size of the array or
    the number of entries present but only on the ratio (the
    load factor)
           Amortized Analysis of complexity

 Used when the complexity of an operation varies greatly
  with the state of the algorithm/data structure
 Three methods for amortized analysis:

    – aggregate analysis
    – accounting method
    – potential method

           Sequence of operations
The problem:
 We have a data structure

 We perform a sequence of operations

    – Operations may be of different types (e.g., insert, delete)
    – Depending on the state of the structure the actual cost of an
      operation may differ (e.g., inserting into a sorted array)
 Just analyzing the worst-case time of a single operation
  may not tell us much
 We want the average running time of an operation (but

  taken over the worst-case sequence of operations!)

              Binary counter example
   Example data structure: a binary counter
     – Operation: Increment
     – Implementation: An array of bits A[0..k–1]

     1 i  0
     2 while i < k and A[i] = 1 do
     3    A[i]  0
     4    i  i + 1
     5 if i < k then A[i]  1

        How many bit assignments do we have to do in
         the worst case to perform Increment(A)?
            But usually we do far fewer bit assignments!
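The pseudocode above can be run directly (a sketch; the assignment counter is ours, added to measure the cost):

```python
# The Increment pseudocode as code, also counting bit assignments.

def increment(A):
    """Increment binary counter A[0..k-1]; return the number of bit assignments."""
    assignments = 0
    i = 0
    while i < len(A) and A[i] == 1:
        A[i] = 0                  # flip the trailing 1s to 0
        assignments += 1
        i += 1
    if i < len(A):
        A[i] = 1                  # set the first 0 bit (if any)
        assignments += 1
    return assignments

A = [0] * 8
total = sum(increment(A) for _ in range(100))
print(total)         # 197: well under the worst case of 100 * 8 = 800
print(total / 100)   # average < 2 assignments per Increment
```

The worst single call costs k = 8 assignments, yet 100 calls cost only 197 in total, which is exactly the 2n bound derived on the next slide.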

            Analysis of binary counter
   How many bit assignments do we do on average?
     – Let's consider a sequence of n Increments
     – Let's compute the total number of bit assignments:
         A[0] is assigned on every operation: n assignments

         A[1] is assigned every two operations: n/2 assignments

         A[2] is assigned every four operations: n/4 assignments

         A[i] is assigned every 2^i operations: ⌊n/2^i⌋ assignments

            Σ_{i=0}^{⌊lg n⌋} ⌊n/2^i⌋ < 2n

        Thus, a single operation takes 2n/n = 2 = O(1)
         amortized time
           Aggregate analysis
   Aggregate analysis – a simple way to do amortized analysis:
     – Treat all operations equally
     – Compute the worst-case running time of a sequence of n operations
     – Divide by n to get an amortized running time per operation

        Another look at binary counter
Another way of looking at it
(proving the O(1) amortized time):
 – To assign a bit, I have to pay one dollar
 – When I assign "1", I pay one dollar, plus I put one dollar into a
   "savings account" associated with that bit
 – When I assign "0", I can pay for it using the dollar in the savings
   account of that bit
 – How much do I have to pay for Increment(A) for this scheme
   to work?
      There is only one assignment of "1" in the algorithm, so two

       dollars will always pay for the operation
 – The amortized complexity of Increment(A) is therefore 2 = O(1)

             Accounting method
   Principles of the accounting method
     1. Associate credit accounts with different parts of the structure
     2. Associate amortized costs with operations and show how they credit
        or debit accounts
          Different costs may be assigned to different operations

          Requirement for all sequences of operations
           (c_i – real cost, c'_i – amortized cost):

               Σ_{i=1}^{n} c'_i ≥ Σ_{i=1}^{n} c_i

             This is equivalent to requiring that the sum of all
              credits in the data structure is non-negative
             (this holds for the binary counter if it starts at 0)
     3. Show that this requirement is satisfied

              Potential method
   We can have one account associated with the whole structure
     – We call it a potential
     – It is a function that maps the state of the data structure after
       operation i to a number: Φ(D_i)

               c'_i = c_i + Φ(D_i) − Φ(D_{i−1})

            The main step of this method is defining the
             potential function
                 Requirement: Φ(D_n) − Φ(D_0) ≥ 0
            Once we have Φ, we can compute the
             amortized costs of operations
            Binary counter example
   How do we define the potential function for the binary counter?
     – Potential of A: Φ(D_i) = b_i = the number of "1" bits
     – What is Φ(D_i) − Φ(D_{i−1}), if the number of bits set to 0 in operation
       i is t_i?
     – What is the amortized cost of Increment(A)?
           We can show that Φ(D_i) − Φ(D_{i−1}) ≤ 1 − t_i
           Real cost: c_i = t_i + 1
           Thus,

            c'_i = c_i + Φ(D_i) − Φ(D_{i−1}) ≤ (t_i + 1) + (1 − t_i) = 2

              Potential method
   Using the potential method, we can analyze the counter even
    if it does not start at 0:
     – Let's say we start with b_0 and end with b_n "1" bits
     – Observe that:

               Σ_{i=1}^{n} c_i = Σ_{i=1}^{n} c'_i − Φ(D_n) + Φ(D_0)

            We have that c'_i ≤ 2
            This means that:

               Σ_{i=1}^{n} c_i ≤ 2n − b_n + b_0

            Note that b_0 ≤ k. This means that, if k = O(n),
             then the total actual cost is O(n).

           Dynamic table

   It is often useful to have a dynamic table:
       – The table that expands and contracts as necessary when new
         elements are added or deleted.
            Expands when insertion is done and the table is already full

            Contracts when deletion is done and there is “too much” free space

       – Contracting or expanding involves relocating
            Allocate new memory space of the new size

            Copy all elements from the table into the new space

            Free the old space

       – Worst-case time for insertions and deletions:
            Without relocation: O(1)

            With relocation: O(m), where m is the number of elements in the table

   Load factor
     – num – the current number of elements in the table
     – size – the total number of elements that can be stored in the
       allocated memory
     – Load factor α = num/size

   It would be nice to have these two properties:
     – Amortized cost of insert and delete is constant
     – The load factor always stays above some constant
         That is, the table is never too empty

              Naive insertions
   Let's look only at insertions: why not expand the table by
    some constant amount when it overflows?
     –   What is the amortized cost of an insertion then?

             Aggregate analysis
   The "right" way to expand: double the size of the table
     – Let's do an aggregate analysis
     – The cost of the i-th insertion is:
         i, if i–1 is an exact power of 2 (the whole table is relocated)

         1, otherwise

     – Let's sum up…
     – The total cost of n insertions is then < 3n
     – The accounting method gives the intuition:
     – Accounting method gives the intuition:
         Pay $1 for inserting the element

         Put $1 into element’s account for reallocating it later

         Put $1 into the account of another element to pay for a later

          relocation of that element
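The aggregate bound can be checked by simulation (a sketch; relocation is counted as one unit per element copied, plus one unit per element written):

```python
# Counting the actual cost of n insertions into a doubling table.

def total_insert_cost(n):
    total, num, cap = 0, 0, 1
    for _ in range(n):
        if num == cap:       # table full: relocate all num elements
            total += num
            cap *= 2
        total += 1           # write the new element
        num += 1
    return total

for n in (10, 100, 1000):
    print(n, total_insert_cost(n), 3 * n)   # the total is always < 3n
```

The relocations cost 1 + 2 + 4 + … < n in total, so n insertions cost less than n + 2n = 3n units, i.e. amortized O(1) per insertion.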

             Potential function
   What potential function do we want to have?

        Φ_i = 2·num_i − size_i

     – It is always non-negative
     – Amortized cost of insertion:
          Case 1: the insertion triggers an expansion

          Case 2: the insertion does not trigger an expansion

     – In both cases the amortized cost is 3

   Deletions: What if we contract whenever the table is about
    to get less than half full?
     – Would the amortized running times of a sequence of insertions
       and deletions be constant?
     – Problem: we want to avoid doing reallocations often without
       having accumulated "the money" to pay for them!

   Idea: delay contraction!
     – Contract only when num = size/4
     – The second requirement is still satisfied: α ≥ 1/4
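The delayed-contraction policy can be sketched as follows (a minimal model tracking only num and size; the class and method names are ours):

```python
# Sketch of the delayed-contraction policy: double when full,
# halve only when the table drops to a quarter full.

class DynamicTable:
    def __init__(self):
        self.num, self.size = 0, 1

    def insert(self):
        if self.num == self.size:
            self.size *= 2           # expand: double the size
        self.num += 1

    def delete(self):
        self.num -= 1
        if self.size > 1 and self.num <= self.size // 4:
            self.size //= 2          # contract only at a quarter full

t = DynamicTable()
for _ in range(16):
    t.insert()
print(t.num, t.size)   # -> 16 16
for _ in range(12):
    t.delete()
print(t.num, t.size)   # -> 4 8: contraction delayed until num = size/4
```

Because contraction waits until num = size/4 (and then only halves), enough deletions have happened since the last resize to "pay" for the relocation, and the load factor never drops below 1/4.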

   How do we define the potential function?

        Φ = 2·num − size    if α ≥ 1/2
        Φ = size/2 − num    if α < 1/2

        It is always non-negative
        Let's compute the amortized running time of a deletion
         when α < ½ (with contraction, without contraction)

