# Tables and Hashing


```
Tables and Hashing

• tables and hashing
• amortized analysis
1/47
Dictionary data structure

Dictionary:
– Dynamic-set data structure for storing items indexed by keys.
– Supports the operations Insert, Search, and Delete.
– Keys can be of any type (string, tuple, …), but they are converted to integers.
– Applications:
  • Symbol table of a compiler.
  • Memory-management tables in operating systems.
  • Looking up a person by name.

Hash Tables:
– An effective way of implementing dictionaries.
– A generalization of ordinary arrays.

2/47
• Direct-address tables are ordinary arrays.
– The element whose key is k is obtained by indexing into the kth position of the array.
• Applicable when we can afford to allocate an array with one position for every possible key,
– i.e. when the universe of keys U is small.
• Dictionary operations can then be implemented to take O(1) time.

3/47
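# A direct-address table can be sketched in a few lines of Python.
# Illustration only: the class and method names below are mine, not
# from the slides.
class DirectAddressTable:
    def __init__(self, universe_size):
        # one slot for every possible key in the universe U
        self.slots = [None] * universe_size

    def insert(self, key, entry):      # O(1)
        self.slots[key] = entry

    def search(self, key):             # O(1)
        return self.slots[key]

    def delete(self, key):             # O(1)
        self.slots[key] = None

# e.g. t = DirectAddressTable(1000); t.insert(42, "Smith"); t.search(42)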
Tables: rows & columns of information

• A table has several fields (types of information).
– A telephone book may have the fields name, address, and phone number.
– A user account table may have the fields user id, password, and home folder.
• To find an entry in the table, you only need to know the contents of one of the fields (not all of them). This field is the key.
– In a telephone book, the key is usually the name.
– In a user account table, the key is usually the user id.
• Ideally, a key uniquely identifies an entry.
– If the key is the name and no two entries in the telephone book have the same name, the key uniquely identifies the entries.
4/47
• insert: given a key and an entry, inserts the entry into the table
• find: given a key, finds the entry associated with the key
• remove: given a key, finds the entry associated with the key, and removes it

Also:
• getIterator: returns an iterator, which visits each of the entries one by one (the order may or may not be defined)
etc.

5/47
How should we implement a table?

Our choice of representation for the Table ADT depends on the answers to the following questions:
• How often are entries inserted and removed?
• How many of the possible key values are likely to be used?
• What is the likely pattern of searching for keys?
– e.g. will most of the accesses be to just one or two key values?
• Is the table small enough to fit into memory?
• How long will the table exist?

6/47
TableNode: a key and its entry

• For searching purposes, it is best to store the key and the entry separately (even though the key's value may be inside the entry).

  key       entry
  "Smith"   "Smith", "124 Hawkers Lane", "9675846"
  "Yeo"     "Yeo", "1 Apple Crescent", "0044 1970 622455"

7/47
Implementation 1: unsorted sequential array

• An array in which TableNodes are stored consecutively, in any order.
• insert: add to back of array; O(1)
• find: search through the keys one at a time, potentially all of the keys; O(n)
• remove: find, then replace the removed node with the last node; O(n)

8/47
Implementation 2: sorted sequential array

• An array in which TableNodes are stored consecutively, sorted by key.
• insert: add in sorted order; O(n)
• find: binary chop; O(log n)
• remove: find, remove the node and shuffle down; O(n)

We can use binary chop because the array elements are sorted.

9/47
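# A sketch of the sorted sequential array, using Python's bisect module
# for the "binary chop". The class and method names are illustrative,
# not from the slides.
import bisect

class SortedArrayTable:
    def __init__(self):
        self.keys = []      # kept sorted
        self.entries = []   # entries[i] belongs to keys[i]

    def insert(self, key, entry):          # O(n): shifting keeps order
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.entries.insert(i, entry)

    def find(self, key):                   # O(log n): binary chop
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.entries[i]
        return None

    def remove(self, key):                 # O(n): shuffle down
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i]
            del self.entries[i]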
Implementation 3: linked list

• TableNodes are chained together in a linked list.
• insert: add to front; O(1) (or O(n) for a sorted list)
• find: search through potentially all the keys, one at a time; O(n) (still O(n) for a sorted list)
• remove: find, then remove using pointer alterations; O(n)

10/47
Implementation 4: AVL tree

• An AVL tree, ordered by key.
• insert: a standard insert; O(log n)
• find: a standard find (without removing, of course); O(log n)
• remove: a standard remove; O(log n)

O(log n) is very good…
…but O(1) would be even better!

11/47
Implementation 5: hashing

• An array in which TableNodes are not stored consecutively: their place of storage is calculated from the key, using a hash function.

  key  →  hash function  →  array index

• Hashed key: the result of applying a hash function to a key.
• Keys and entries are scattered throughout the array.

12/47
Implementation 5: hashing (continued)

• insert: calculate the place of storage, insert the TableNode; O(1)
• find: calculate the place of storage, retrieve the entry; O(1)
• remove: calculate the place of storage, set it to null; O(1)

All are O(1)!
13/47
Hashing example: a fruit shop

• 10 stock details, 10 table positions.
• Stock numbers are between 0 and 1000.
• Use the hash function: stock no. / 100.
• What if we now insert stock no. 350? Position 3 is occupied: there is a collision.
• Collision resolution strategy: insert in the next free position (linear probing).
• Given a stock number, we find the stock by using the hash function again, and use the collision resolution strategy if necessary.

  position  key   entry
  0         85    85, apples
  3         323   323, guava
  4         462   462, pears
  5         350   350, oranges
  9         912   912, papaya

14/47
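# The fruit-shop example replayed in Python: the hash is stock_no // 100
# and collisions are resolved by linear probing. The '% 10' guarding the
# edge case stock no. 1000 is my addition, not on the slide, and the
# sketch assumes the table never becomes completely full.
table = [None] * 10

def insert(stock_no, entry):
    pos = (stock_no // 100) % 10
    while table[pos] is not None:          # collision: probe linearly
        pos = (pos + 1) % 10
    table[pos] = (stock_no, entry)

def find(stock_no):
    pos = (stock_no // 100) % 10
    while table[pos] is not None:
        if table[pos][0] == stock_no:      # keys must match, not just hashes
            return table[pos][1]
        pos = (pos + 1) % 10
    return None                            # an empty slot ends the search

# Inserting 85, 323, 462, 912 and then 350 puts 350 in position 5:
# positions 3 and 4 are taken, and 5 is the next free slot.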
Three factors affecting the performance of hashing

• The hash function
– Ideally, it should distribute keys and entries evenly throughout the table.
– It should minimise collisions, where the position given by the hash function is already occupied.
• The collision resolution strategy
– Separate chaining: chain together several keys/entries in each position.
– Open addressing: store the key/entry in a different position.
• The size of the table
– Too big will waste memory; too small will increase collisions and may eventually force rehashing (copying into a larger table).
– Should be appropriate for the hash function used, and a prime number is best.

15/47
Choosing a hash function: turning a key into a table position

• Truncation
– Ignore part of the key and use the rest as the array index (converting non-numeric parts).
– A fast technique, but check for an even distribution.
• Folding
– Partition the key into several parts and then combine them in any convenient way.
– Unlike truncation, folding uses information from the whole key.
• Modular arithmetic (used by truncation & folding, and on its own)
– To keep the calculated table position within the table, divide the position by the size of the table, and take the remainder as the new position.

16/47
Examples of hash functions (1)

• Truncation: if students have a 9-digit identification number, take the last 3 digits as the table position.
– e.g. 925371622 becomes 622
• Folding: split a 9-digit number into three 3-digit numbers and add them.
– e.g. 925371622 becomes 925 + 371 + 622 = 1918
• Modular arithmetic: if the table size is 1000, the first example always keeps within the table range, but the second example does not (it should be taken mod 1000).
– e.g. 1918 mod 1000 = 918     (in Java: 1918 % 1000)

17/47
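# The three techniques as small Python functions, using the 9-digit
# example; the helper names are mine.
def truncate(key):
    # keep only the last 3 digits, e.g. 925371622 -> 622
    return key % 1000

def fold(key):
    # split into three 3-digit parts and add them,
    # e.g. 925371622 -> 925 + 371 + 622 = 1918
    s = str(key).zfill(9)
    return int(s[0:3]) + int(s[3:6]) + int(s[6:9])

def fold_mod(key, table_size=1000):
    # modular arithmetic keeps the folded value inside the table,
    # e.g. 1918 mod 1000 = 918
    return fold(key) % table_size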
Examples of hash functions (2)

• Using a telephone number as a key
– The area code is not random, so it will not spread the keys/entries evenly through the table (many collisions).
– The last 3 digits are more random.
• Using a name as a key
– Use the full name rather than the surname (the surname is not particularly random).
– Assign numbers to the characters (e.g. a = 1, b = 2; or use Unicode values).
– Strategy 1: add the resulting numbers. Bad for a large table size.
– Strategy 2: call the number of possible characters c (e.g. c = 54 for the alphabet in upper and lower case, plus space and hyphen). Then multiply each character in the name by increasing powers of c and add the results, i.e. treat the name as a base-c number.
18/47
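# Strategy 2 sketched in Python: treat the name as a base-c number,
# evaluated by Horner's rule and reduced mod the table size at each
# step to avoid huge intermediate values. Using c = 256 (so any
# character code fits) is my assumption; the slide's c = 54 assumes a
# restricted alphabet.
def name_hash(name, table_size, c=256):
    h = 0
    for ch in name:
        h = (h * c + ord(ch)) % table_size
    return h

# e.g. name_hash("Smith", 101) and name_hash("Yeo", 101) usually land
# in different positions of a size-101 table.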
Choosing the table size to minimise collisions

• As the number of elements in the table increases, the likelihood of a collision increases, so make the table as large as practical.
• If the table size is 100, and all the hashed keys are divisible by 10, there will be many collisions!
– Particularly bad if the table size is a power of a small integer such as 2 or 10.
• More generally, collisions may be more frequent if:
– gcd(hashed keys, table size) > 1
• Therefore, make the table size a prime number (gcd = 1).

Collisions may still happen, so we need a collision resolution strategy.
19/47
Collision resolution: probing

Probing: if the table position given by the hashed key is already occupied, increase the position by some amount, until an empty position is found.

• Linear probing: increase by 1 each time [mod table size!]
• Quadratic probing: to the original position, add 1, 4, 9, 16, …

Use the collision resolution strategy both when inserting and when finding (and ensure that the search key and the found key match).
We may also double hash: instead of the fixed increments of linear probing, step by the result of a second hash function.
With open addressing, the table size should be double the expected number of elements.

20/47
Collision resolution: clustering

• If the table is fairly empty but has many collisions, linear probing may cluster (group) keys/entries.
– This increases the time to insert and to find.

[diagram: table positions 1–8, some already occupied]

For a table of size n: if the table is empty, the probability of the next entry going to any particular place is 1/n.
In the diagram, the probability of position 2 getting filled next is 2/n (either a hash to 1 or to 2 fills it).
Once 2 is full, the probability of 4 being filled next is 4/n, and then of 7 it is 7/n (i.e. the probability of getting long strings steadily increases).

21/47
Collision resolution: removal within a cluster

• An empty key/entry marks the end of a cluster, and so can be used to terminate a find operation.
• So, if we remove an entry within a cluster, we should not empty it!
• To allow probing to continue, the removed entry must be marked as 'removed but cluster continues'.

22/47
Collision resolution: quadratic probing

• Quadratic probing is a solution to the clustering problem.
– Linear probing adds 1, 2, 3, etc. to the original hashed key.
– Quadratic probing adds 1², 2², 3², etc. to the original hashed key.
• However, whereas linear probing guarantees that all empty positions will be examined if necessary, quadratic probing does not.
– e.g. table size 16 and original hashed key 3 gives the sequence: 3, 4, 7, 12, 3, 12, 7, 4, …
• More generally, with quadratic probing, insertion may be impossible if the table is more than half-full!
– We then need to rehash (see later).

23/47
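# Checking the cycling claim in Python: with table size 16 and hashed
# key 3, quadratic probing revisits the same four cells forever. The
# helper name is mine.
def quadratic_probes(start, table_size, n_probes):
    # probe sequence: start, start + 1^2, start + 2^2, ... (mod table size)
    return [(start + i * i) % table_size for i in range(n_probes)]

# quadratic_probes(3, 16, 8) gives [3, 4, 7, 12, 3, 12, 7, 4]: only 4 of
# the 16 positions are ever examined, so insertion can fail even though
# empty slots remain.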
Collision resolution: chaining

• Each table position is a linked list.
• Add the keys and entries anywhere in the list (the front is easiest). No need to change position!
– Simpler insertion and removal.
– Array size is not a limitation (but we should still minimise collisions: make the table size roughly equal to the expected number of keys and entries).
– Memory overhead is large if entries are small.

24/47
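# Separate chaining sketched in Python: each array slot holds a list of
# (key, entry) pairs. Class and method names are illustrative, and
# Python lists stand in for the linked lists on the slide.
class ChainedHashTable:
    def __init__(self, size=11):
        self.slots = [[] for _ in range(size)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, entry):
        self._chain(key).insert(0, (key, entry))   # front of list is easiest

    def find(self, key):
        for k, e in self._chain(key):
            if k == key:
                return e
        return None

    def remove(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return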
Rehashing: enlarging the table

• To rehash:
– Create a new table of double the size (adjusting until it is again prime).
– Transfer the entries in the old table to the new table, recomputing their positions (using the hash function).
• When should we rehash?
– When the table is completely full.
– With quadratic probing, when the table is half-full or insertion fails.
• Why double the size?
– If n is the number of elements in the table, there must have been n/2 insertions since the previous rehash (if rehashing is done when the table is full).
– So by making the new table size 2n, only a constant cost is added to each insertion.

25/47
Applications of Hashing

• Compilers use hash tables to keep track of declared variables.
• A hash table can be used for on-line spelling checkers: if misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time.
• Game-playing programs use hash tables to store seen positions, thereby saving computation time if the position is encountered again.
• Hash functions can be used to quickly check for inequality: if two elements hash to different values, they must be different.
• Storing sparse data.
26/47
When are other representations more suitable than hashing?

• Hash tables are very good if there is a need for many searches in a reasonably stable table.
• Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed; in this case, AVL trees are better.
• If there is more data than available memory, use a B-tree.
• Hashing is also very slow for any operation which requires the entries to be sorted.
– e.g. finding the minimum key.

27/47
Issues:

• What do we lose? Operations that require ordering are inefficient:

  Operation      Hash table    Balanced binary tree
  FindMax        O(n)          O(log n)
  FindMin        O(n)          O(log n)
  PrintSorted    O(n log n)    O(n)

• What do we gain?

  Operation      Hash table    Balanced binary tree
  Insert         O(1)          O(log n)
  Delete         O(1)          O(log n)
  Find           O(1)          O(log n)

• How to handle collisions?
– Separate chaining
28/47
Performance of Hashing

• The number of probes depends on the load factor (usually denoted by α), which is the ratio of the number of entries present in the table to the number of positions in the array.
• We also need to consider successful and unsuccessful searches separately.
• For a chained hash table, the average number of probes for an unsuccessful search is α, and for a successful search it is 1 + α/2.
29/47
Performance of Hashing (2)

• For open addressing, the formulae are more complicated, but typical values are:

  Load factor α        0.1    0.5   0.8    0.9    0.99
  Successful search:
    linear probing     1.05   1.6   3.4    6.2    21.3
    quadratic probing  1.04   1.5   2.1    2.7    5.2
  Unsuccessful search:
    linear probing     1.13   2.7   15.4   59.8   430
    quadratic probing  1.13   2.2   5.2    11.9   126

• Note that these values do not depend on the size of the array or the number of entries present, but only on the load factor.
30/47
Amortized Analysis of complexity

• Used when the complexity of an operation differs greatly between states of the algorithm/data structure.
• Three methods for amortized analysis:
– aggregate analysis
– accounting method
– potential method
31/47
Sequence of operations

The problem:
• We have a data structure.
• We perform a sequence of operations.
– Operations may be of different types (e.g., insert, delete).
– Depending on the state of the structure, the actual cost of an operation may differ (e.g., inserting into a sorted array).
• Just analyzing the worst-case time of a single operation may not say much.
• We want the average running time of an operation (but over the worst-case sequence of operations!).
32/47
Binary counter example

• Example data structure: a binary counter.
– Operation: Increment.
– Implementation: an array of bits A[0..k–1].

Increment(A)
1  i ← 0
2  while i < k and A[i] = 1 do
3      A[i] ← 0
4      i ← i + 1
5  if i < k then A[i] ← 1

• How many bit assignments do we have to do in the worst case to perform Increment(A)? (All k of them, when every bit is 1.)
• But usually we do far fewer bit assignments!

33/47
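# The Increment pseudocode transcribed to Python, instrumented to count
# bit assignments so the amortized bound can be checked; the counting
# is my addition.
def increment(A):
    """A: list of bits, least significant first. Returns #assignments."""
    assignments = 0
    i = 0
    while i < len(A) and A[i] == 1:
        A[i] = 0                  # clear the trailing 1s
        assignments += 1
        i += 1
    if i < len(A):
        A[i] = 1                  # set the first 0 bit
        assignments += 1
    return assignments

# A single call may assign all k bits (when the counter is all 1s), but
# n increments from zero make fewer than 2n assignments in total.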
Analysis of binary counter

• How many bit assignments do we do on average?
– Let's consider a sequence of n Increments.
– Let's compute the sum of bit assignments:
  • A[0] is assigned on each operation: n assignments
  • A[1] is assigned every two operations: n/2 assignments
  • A[2] is assigned every four ops: n/4 assignments
  • A[i] is assigned every 2^i ops: n/2^i assignments

  Σ_{i=0..⌊lg n⌋} ⌊n/2^i⌋ < 2n

• Thus, a single operation takes 2n/n = 2 = O(1) amortized time.
34/47
Aggregate analysis

• Aggregate analysis: a simple way to do amortized analysis.
– Treat all operations equally.
– Compute the worst-case running time of a sequence of n operations.
– Divide by n to get the amortized running time.

35/47
Another look at binary counter

Another way of looking at it (proving the amortized time):
– To assign a bit, I have to use one dollar.
– When I assign "1", I use one dollar, plus I put one dollar in the "savings account" associated with that bit.
– When I assign "0", I can do it using the dollar from the savings account of that bit.
– How much do I have to pay for Increment(A) for this scheme to work?
  • There is only one assignment of "1" in the algorithm, so two dollars will always pay for the operation.
– The amortized complexity of Increment(A) is therefore 2 = O(1).

36/47
Accounting method

• Principles of the accounting method:
1. Associate credit accounts with different parts of the structure.
2. Associate amortized costs with operations and show how they credit or debit accounts.
– Different costs may be assigned to different operations.
– Requirement for all sequences of operations (ci – real cost, c'i – amortized cost):

  Σ_{i=1..n} ci ≤ Σ_{i=1..n} c'i

– This is equivalent to requiring that the sum of all credits in the data structure is non-negative.
– This holds for the binary counter if it starts at 0.
3. Show that this requirement is satisfied.

37/47
Potential method

• We can instead have one account associated with the whole structure:
– We call it a potential.
– It is a function that maps a state of the data structure after operation i to a number: F(Di).

  c'i = ci + F(Di) − F(Di−1)

• The main step of this method is defining the potential function.
• Requirement: F(Dn) − F(D0) ≥ 0.
• Once we have F, we can compute the amortized costs of operations.
38/47
Binary counter example

• How do we define the potential function for the binary counter?
– Potential of A: bi = the number of "1"s after operation i.
– What is F(Di) − F(Di−1), if the number of bits set to 0 in operation i is ti?
– What is the amortized cost of Increment(A)?
  • We have F(Di) − F(Di−1) ≤ 1 − ti (at most one "1" is added, and ti are cleared).
  • Real cost ci = ti + 1.
  • Thus,

  c'i = ci + F(Di) − F(Di−1) ≤ (ti + 1) + (1 − ti) = 2
39/47
Potential method

• Using the potential method, we can analyze the counter even if it does not start at 0:
– Say we start with b0 "1"s and end with bn "1"s.
– Observe that:

  Σ_{i=1..n} ci = Σ_{i=1..n} c'i − F(Dn) + F(D0)

– We have c'i ≤ 2. This means that:

  Σ_{i=1..n} ci ≤ 2n − bn + b0

• Note that b0 ≤ k. This means that, if k = O(n), then the total actual cost is O(n).

40/47
Dynamic table

• It is often useful to have a dynamic table:
– A table that expands and contracts as necessary as elements are inserted and deleted.
  • It expands when an insertion is done and the table is already full.
  • It contracts when a deletion is done and there is "too much" free space.
– Contracting or expanding involves relocating:
  • Allocate new memory space of the new size.
  • Copy all elements from the table into the new space.
  • Free the old space.
– Worst-case time for insertions and deletions:
  • Without relocation: O(1)
  • With relocation: O(m), where m is the number of elements in the table.

41/47
Requirements

– num – the current number of elements in the table
– size – the total number of elements that can be stored in the allocated memory
– load factor α = num/size

• It would be nice to have these two properties:
– The amortized cost of insert and delete is constant.
– The load factor always stays above some constant.
  • That is, the table is never too empty.
42/47
Naive insertions

• Let's look only at insertions: why not expand the table by some constant c when it overflows?
– What is the amortized cost of an insertion? (With a constant-size expansion, a relocation of Θ(i) elements happens every c insertions, so n insertions cost Θ(n²/c) in total and the amortized cost is Θ(n), not constant.)

43/47
Aggregate analysis

• The "right" way to expand: double the size of the table.
– Let's do an aggregate analysis.
– The cost of the i-th insertion is:
  • i, if i–1 is an exact power of 2 (the insertion triggers a relocation of i–1 elements)
  • 1, otherwise
– Summing up: the total cost of n insertions is then < 3n.
– The accounting method gives the intuition:
  • Pay $1 for inserting the element.
  • Put $1 into the element's account, to pay for relocating it later.
  • Put $1 into the account of another element, to pay for a later relocation of that element.

44/47
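# A doubling dynamic table in Python that counts element copies, so the
# "< 3n" aggregate bound can be checked experimentally. The class and
# attribute names are mine.
class DynamicTable:
    def __init__(self):
        self.size = 1
        self.num = 0
        self.data = [None]
        self.copies = 0                    # total relocation cost so far

    def insert(self, x):
        if self.num == self.size:          # table full: double and relocate
            new_data = [None] * (2 * self.size)
            for i in range(self.num):
                new_data[i] = self.data[i]
                self.copies += 1
            self.data = new_data
            self.size *= 2
        self.data[self.num] = x
        self.num += 1

# After n = 1000 insertions the copies total 1 + 2 + 4 + ... + 512 = 1023,
# so the whole sequence costs 1000 + 1023 = 2023 < 3 * 1000.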
Potential function

• What potential function do we want to have?

  F(Di) = 2·numi − sizei

– It is always non-negative.
– Amortized cost of insertion:
  • when the insertion triggers an expansion;
  • when the insertion does not trigger an expansion.
– In both cases it works out to 3.

45/47
Deletions

• Deletions: what if we contract whenever the table is about to become less than half full?
– Would the amortized running times of a sequence of insertions and deletions be constant?
– Problem: we want to avoid doing reallocations often without having accumulated "the money" to pay for them!

46/47
Deletions

• Idea: delay contraction!
– Contract only when num = size/4.
– The second requirement is still satisfied: α ≥ 1/4.
• How do we define the potential function?

  F = 2·num − size    if α ≥ 1/2
  F = size/2 − num    if α < 1/2

• It is always non-negative.
• Let's compute the amortized running time of deletions:
– in the case α < ½, both with contraction and without contraction.

47/47

```