
Tables and Hashing
Topics: tables and hashing; amortized analysis

Dictionary data structure
Dictionary:
– Dynamic-set data structure for storing items indexed using keys.
– Supports the operations Insert, Search, and Delete.
– Keys can be of any type (string, tuple, ...), but they are converted to integers.
– Applications: the symbol table of a compiler; memory-management tables in operating systems; accessing a person by name.
Hash tables:
– An effective way of implementing dictionaries.
– A generalization of ordinary arrays.

Direct-address tables
Direct-address tables are ordinary arrays; they facilitate direct addressing:
– The element whose key is k is obtained by indexing into the k-th position of the array.
– Applicable when we can afford to allocate an array with one position for every possible key, i.e. when the universe of keys U is small.
– Dictionary operations can be implemented to take O(1) time.

Tables: rows and columns of information
A table has several fields (types of information):
– A telephone book may have the fields name, address, and phone number.
– A user account table may have the fields user id, password, and home folder.
To find an entry in the table, you only need to know the contents of one of the fields (not all of them). This field is the key:
– In a telephone book, the key is usually the name.
– In a user account table, the key is usually the user id.
Ideally, a key uniquely identifies an entry:
– If the key is the name and no two entries in the telephone book have the same name, the key uniquely identifies the entries.

The Table ADT: operations
– insert: given a key and an entry, inserts the entry into the table.
– find: given a key, finds the entry associated with the key.
– remove: given a key, finds the entry associated with the key, and removes it.
Also:
– getIterator: returns an iterator, which visits each of the entries one by one (the order may or may not be defined), etc.

How should we implement a table?
Our choice of representation for the Table ADT depends on the answers to the following questions:
– How often are entries inserted and removed?
– How many of the possible key values are likely to be used?
– What is the likely pattern of searching for keys? e.g. will most of the accesses be to just one or two key values?
– Is the table small enough to fit into memory?
– How long will the table exist?

TableNode: a key and its entry
For searching purposes, it is best to store the key and the entry separately (even though the key's value may be inside the entry). For example:
– key "Smith" → entry ("Smith", "124 Hawkers Lane", "9675846")
– key "Yeo" → entry ("Yeo", "1 Apple Crescent", "0044 1970 622455")

Implementation 1: unsorted sequential array
An array in which TableNodes are stored consecutively, in any order:
– insert: add to the back of the array; O(1)
– find: search through the keys one at a time, potentially all of them; O(n)
– remove: find, then replace the removed node with the last node; O(n)

Implementation 2: sorted sequential array
An array in which TableNodes are stored consecutively, sorted by key:
– insert: add in sorted order; O(n)
– find: binary chop; O(log n)
– remove: find, then remove the node and shuffle the rest down; O(n)
We can use binary chop because the array elements are sorted.

Implementation 3: linked list (unsorted or sorted)
TableNodes are stored in a linked list:
– insert: add to the front; O(1), or O(n) for a sorted list
– find: search through potentially all the keys, one at a time; O(n), still O(n) for a sorted list
– remove: find, then remove using pointer alterations; O(n)

Implementation 4: AVL tree
An AVL tree, ordered by key:
– insert: a standard insert; O(log n)
– find: a standard find (without removing, of course); O(log n)
– remove: a standard remove; O(log n)
O(log n) is very good... but O(1) would be even better!
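Implementation 2 above (sorted sequential array with binary chop) is easy to sketch in Python. This is an illustrative sketch only; the class and method names are mine, not from the slides, and the standard `bisect` module supplies the binary chop.

```python
import bisect

class SortedArrayTable:
    """Table ADT as a sorted sequential array (Implementation 2)."""

    def __init__(self):
        self._keys = []     # kept sorted, so binary chop works
        self._entries = []  # entries stored parallel to keys

    def insert(self, key, entry):
        # O(n): shuffling elements up keeps the array sorted
        i = bisect.bisect_left(self._keys, key)
        self._keys.insert(i, key)
        self._entries.insert(i, entry)

    def find(self, key):
        # O(log n): binary chop on the sorted keys
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._entries[i]
        return None

    def remove(self, key):
        # O(n): find, then shuffle the remaining elements down
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]
            del self._entries[i]
```

Note how the costs match the slide: `find` is logarithmic, but `insert` and `remove` pay O(n) for the shuffling.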
Implementation 5: hashing
An array in which TableNodes are not stored consecutively: their place of storage is calculated from the key using a hash function (key → hash function → array index).
– Hashed key: the result of applying a hash function to a key.
– Keys and entries are scattered throughout the array.
– insert: calculate the place of storage, insert the TableNode; O(1)
– find: calculate the place of storage, retrieve the entry; O(1)
– remove: calculate the place of storage, set it to null; O(1)
All are O(1)!

Hashing example: a fruit shop
10 stock details, 10 table positions; stock numbers are between 0 and 1000. Use the hash function: stock no. / 100.
– Table contents: position 0 → (85, apples); position 3 → (323, guava); position 4 → (462, pears); position 9 → (912, papaya).
– What if we now insert stock no. 350? Position 3 is occupied: there is a collision.
– Collision resolution strategy: insert in the next free position (linear probing), so 350 (oranges) lands in position 5.
– Given a stock number, we find the stock by applying the hash function again, and use the collision resolution strategy if necessary.

Three factors affecting the performance of hashing
The hash function:
– Ideally, it should distribute keys and entries evenly throughout the table.
– It should minimise collisions, where the position given by the hash function is already occupied.
The collision resolution strategy:
– Separate chaining: chain together several keys/entries in each position.
– Open addressing: store the key/entry in a different position.
The size of the table:
– Too big will waste memory; too small will increase collisions and may eventually force rehashing (copying into a larger table).
– It should be appropriate for the hash function used, and a prime number is best.

Choosing a hash function: turning a key into a table position
Truncation:
– Ignore part of the key and use the rest as the array index (converting non-numeric parts).
– A fast technique, but check for an even distribution.
Folding:
– Partition the key into several parts and then combine them in any convenient way.
– Unlike truncation, folding uses information from the whole key.
Modular arithmetic (used by truncation and folding, and on its own):
– To keep the calculated table position within the table, divide the position by the size of the table, and take the remainder as the new position.

Examples of hash functions (1)
Truncation: if students have a 9-digit identification number, take the last 3 digits as the table position.
– e.g. 925371622 becomes 622
Folding: split a 9-digit number into three 3-digit numbers, and add them.
– e.g. 925371622 becomes 925 + 371 + 622 = 1918
Modular arithmetic: if the table size is 1000, the first example always keeps within the table range, but the second example does not (it should be taken mod 1000).
– e.g. 1918 mod 1000 = 918 (in Java: 1918 % 1000)

Examples of hash functions (2)
Using a telephone number as a key:
– The area code is not random, so it will not spread the keys/entries evenly through the table (many collisions).
– The last 3 digits are more random.
Using a name as a key:
– Use the full name rather than the surname (a surname is not particularly random).
– Assign numbers to the characters (e.g. a = 1, b = 2; or use Unicode values).
– Strategy 1: add the resulting numbers. Bad for a large table size.
– Strategy 2: call the number of possible characters c (e.g. c = 54 for the alphabet in upper and lower case, plus space and hyphen). Then multiply each character in the name by increasing powers of c, and add the results together.

Choosing the table size to minimise collisions
As the number of elements in the table increases, the likelihood of a collision increases, so make the table as large as practical.
If the table size is 100 and all the hashed keys are divisible by 10, there will be many collisions!
– Particularly bad if the table size is a power of a small integer such as 2 or 10.
More generally, collisions may be more frequent if gcd(hashed keys, table size) > 1.
Therefore, make the table size a prime number (so that the gcd is 1).
Collisions may still happen, so we need a collision resolution strategy.

Collision resolution: open addressing (1)
Probing: if the table position given by the hashed key is already occupied, increase the position by some amount until an empty position is found.
– Linear probing: increase by 1 each time [mod table size!].
– Quadratic probing: to the original position, add 1, 4, 9, 16, ...
Use the collision resolution strategy both when inserting and when finding (ensure that the search key and the found key match).
We may also double hash: instead of the fixed step of linear probing, advance by the result of another hash function.
With open addressing, the table size should be double the expected number of elements.

Collision resolution: open addressing (2)
Even if the table is fairly empty, collisions under linear probing may cluster (group) keys/entries, and this increases the time to insert and to find.
For a table of size n: if the table is empty, the probability of the next entry going to any particular place is 1/n. In the slide's diagram (positions 1–8, with 1, 3, 5 and 6 occupied), the probability of position 2 getting filled next is 2/n (either a hash to 1 or to 2 fills it). Once 2 is full, the probability of 4 being filled next is 4/n, and then of 7 it is 7/n; i.e. the probability of getting long runs of occupied positions steadily increases.

Collision resolution: open addressing (3)
An empty key/entry marks the end of a cluster, and so can be used to terminate a find operation. So, if we remove an entry within a cluster, we should not empty its position! To allow probing to continue, the removed entry must be marked as 'removed but cluster continues'.

Collision resolution: open addressing (4)
Quadratic probing is a solution to the clustering problem:
– Linear probing adds 1, 2, 3, etc. to the original hashed key.
– Quadratic probing adds 1², 2², 3², etc. to the original hashed key.
However, whereas linear probing guarantees that all empty positions will be examined if necessary, quadratic probing does not:
– e.g. table size 16 and original hashed key 3 give the sequence 3, 4, 7, 12, 3, 12, 7, 4, ...
More generally, with quadratic probing, insertion may be impossible if the table is more than half full!
– We then need to rehash (see later).

Collision resolution: chaining
Each table position is a linked list. Add the keys and entries anywhere in the list (the front is easiest); there is no need to change position.
Advantages over open addressing:
– Simpler insertion and removal.
– The array size is not a limitation (but we should still minimise collisions: make the table size roughly equal to the expected number of keys and entries).
Disadvantage:
– The memory overhead is large if entries are small.

Rehashing: enlarging the table
To rehash:
– Create a new table of double the size (adjusting until it is again prime).
– Transfer the entries in the old table to the new table, by recomputing their positions (using the hash function).
When should we rehash?
– When the table is completely full.
– With quadratic probing, when the table is half full or insertion fails.
Why double the size?
– If n is the number of elements in the table, there must have been n/2 insertions since the previous rehash (if rehashing is done when the table is full).
– So by making the new table size 2n, only a constant amortized cost is added to each insertion.

Applications of hashing
– Compilers use hash tables to keep track of declared variables.
– A hash table can be used for on-line spelling checkers: if misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time.
– Game-playing programs use hash tables to store previously seen positions, thereby saving computation time if a position is encountered again.
– Hash functions can be used to quickly check for inequality: if two elements hash to different values, they must be different.
– Storing sparse data.

When are other representations more suitable than hashing?
– Hash tables are very good if there is a need for many searches in a reasonably stable table.
– Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed; in these cases, AVL trees are better.
– If there are more data than available memory, use a B-tree.
– Also, hashing is very slow for any operation that requires the entries to be sorted, e.g. finding the minimum key.

Issues
What do we lose? Operations that require ordering are inefficient:
– FindMax: O(n) with hashing, O(log n) with a balanced binary tree
– FindMin: O(n) with hashing, O(log n) with a balanced binary tree
– PrintSorted: O(n log n) with hashing, O(n) with a balanced binary tree
What do we gain?
– Insert: O(1) with hashing, O(log n) with a balanced binary tree
– Delete: O(1) with hashing, O(log n) with a balanced binary tree
– Find: O(1) with hashing, O(log n) with a balanced binary tree
How do we handle collisions?
– Separate chaining
– Open addressing

Performance of hashing
The number of probes depends on the load factor (usually denoted by α), which is the ratio of the number of entries present in the table to the number of positions in the array. We also need to consider successful and unsuccessful searches separately.
For a chained hash table, the average number of probes for an unsuccessful search is α, and for a successful search it is 1 + α/2.

Performance of hashing (2)
For open addressing, the formulae are more complicated, but typical values are:

Load factor α:        0.1    0.5    0.8    0.9    0.99
Successful search
  Linear probes       1.05   1.6    3.4    6.2    21.3
  Quadratic probes    1.04   1.5    2.1    2.7    5.2
Unsuccessful search
  Linear probes       1.13   2.7    15.4   59.8   430
  Quadratic probes    1.13   2.2    5.2    11.9   126

Note that these values do not depend on the size of the array or on the number of entries present, but only on their ratio (the load factor).

Amortized analysis of complexity
Used when the complexity of an operation differs greatly depending on the state of the algorithm/data structure. Three methods for amortized analysis:
– aggregate analysis
– the accounting method
– the potential method

Sequence of operations
The problem: we have a data structure, and we perform a sequence of operations on it.
– Operations may be of different types (e.g., insert, delete).
– Depending on the state of the structure, the actual cost of an operation may differ (e.g., inserting into a sorted array).
Just analyzing the worst-case time of a single operation may not say much. We want the average running time of an operation, but taken over the worst-case sequence of operations!

Binary counter example
Example data structure: a binary counter.
– Operation: Increment.
– Implementation: an array of bits A[0..k–1].

Increment(A)
1  i ← 0
2  while i < k and A[i] = 1 do
3      A[i] ← 0
4      i ← i + 1
5  if i < k then A[i] ← 1

How many bit assignments do we have to do in the worst case to perform Increment(A)? k, when all the bits are 1. But usually we do far fewer bit assignments!
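The Increment pseudocode above translates almost line for line into Python. In this sketch (the function and variable names are mine, not from the slides) we also count the bit assignments, which makes the "fewer than 2n assignments for n Increments" bound of the following analysis directly checkable:

```python
def increment(A):
    """Increment a binary counter stored as a bit array A[0..k-1],
    least-significant bit first.  Returns the number of bit
    assignments performed by this call."""
    k = len(A)
    assignments = 0
    i = 0
    while i < k and A[i] == 1:
        A[i] = 0              # flip the trailing 1s to 0
        assignments += 1
        i += 1
    if i < k:
        A[i] = 1              # at most one assignment of "1" per call
        assignments += 1
    return assignments

# n Increments starting from zero cost fewer than 2n assignments in total
A = [0] * 16
total = sum(increment(A) for _ in range(1000))
```

After the loop, `A` encodes the value 1000 and `total` is well below 2 × 1000, illustrating the O(1) amortized cost derived next.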
Analysis of binary counter
How many bit assignments do we do on average?
– Consider a sequence of n Increments and compute the total number of bit assignments:
  A[0] is assigned on every operation: n assignments.
  A[1] is assigned every two operations: n/2 assignments.
  A[2] is assigned every four operations: n/4 assignments.
  In general, A[i] is assigned every 2^i operations: n/2^i assignments.
– Summing over all bits:

  Σ_{i=0}^{⌊lg n⌋} n/2^i < 2n

– Thus a single operation takes 2n/n = 2 = O(1) amortized time.

Aggregate analysis
Aggregate analysis is a simple way to do amortized analysis:
– Treat all operations equally.
– Compute the worst-case running time of a sequence of n operations.
– Divide by n to get an amortized running time.

Another look at the binary counter
Another way of looking at it (proving the amortized time):
– To assign a bit, I have to pay one dollar.
– When I assign "1", I use one dollar, plus I put one dollar into a "savings account" associated with that bit.
– When I assign "0", I can pay for it using the dollar in the savings account of that bit.
– How much do I have to pay for Increment(A) for this scheme to work? There is only one assignment of "1" in the algorithm, so two dollars will always pay for the operation.
– The amortized complexity of Increment(A) is therefore 2 = O(1).

Accounting method
Principles of the accounting method:
1. Associate credit accounts with different parts of the structure.
2. Associate amortized costs with operations and show how they credit or debit the accounts.
   – Different costs may be assigned to different operations.
   – Requirement for all sequences of operations (c_i — real cost, c'_i — amortized cost):

     Σ_{i=1}^{n} c_i ≤ Σ_{i=1}^{n} c'_i

   – This is equivalent to requiring that the sum of all credits in the data structure stays non-negative (this holds for the binary counter if it starts at 0).
3. Show that this requirement is satisfied.

Potential method
We can have one account associated with the whole structure:
– We call it a potential.
– It is a function that maps the state of the data structure after operation i to a number Φ(D_i), and the amortized cost of operation i is

  c'_i = c_i + Φ(D_i) − Φ(D_{i−1})

The main step of this method is defining the potential function.
Requirement: Φ(D_n) − Φ(D_0) ≥ 0. Once we have Φ, we can compute the amortized costs of the operations.

Binary counter example (potential method)
How do we define the potential function for the binary counter?
– Potential of A: b_i = the number of "1"s after operation i.
– What is Φ(D_i) − Φ(D_{i−1}), if the number of bits set to 0 in operation i is t_i? We showed that Φ(D_i) − Φ(D_{i−1}) ≤ 1 − t_i.
– What is the amortized cost of Increment(A)? The real cost is c_i = t_i + 1, thus

  c'_i = c_i + Φ(D_i) − Φ(D_{i−1}) ≤ (t_i + 1) + (1 − t_i) = 2

Potential method (2)
Using the potential method, we can analyze the counter even if it does not start at 0:
– Say we start with b_0 "1"s and end with b_n "1"s.
– Observe that

  Σ_{i=1}^{n} c_i = Σ_{i=1}^{n} c'_i − Φ(D_n) + Φ(D_0)

– We have c'_i ≤ 2, which means that

  Σ_{i=1}^{n} c_i ≤ 2n − b_n + b_0

– Note that b_0 ≤ k. This means that if k = O(n), then the total actual cost is O(n).

Dynamic table
It is often useful to have a dynamic table:
– A table that expands and contracts as necessary as new elements are added or deleted.
  It expands when an insertion is done and the table is already full, and contracts when a deletion is done and there is "too much" free space.
– Contracting or expanding involves relocating:
  Allocate new memory space of the new size.
  Copy all elements from the table into the new space.
  Free the old space.
– Worst-case time for insertions and deletions:
  Without relocation: O(1).
  With relocation: O(m), where m is the number of elements in the table.

Requirements
Load factor:
– num: the current number of elements in the table.
– size: the total number of elements that can be stored in the allocated memory.
– Load factor α = num/size.
It would be nice to have these two properties:
– The amortized cost of insert and delete is constant.
– The load factor always stays above some constant, i.e. the table is never too empty.

Naive insertions
Looking only at insertions: why not expand the table by some constant amount when it overflows?
– What is the amortized cost of an insertion then? Growing by a constant c forces a relocation every c insertions, so n insertions cost Θ(n²/c) in total, i.e. Θ(n) amortized per insertion — not constant.

Aggregate analysis
The "right" way to expand: double the size of the table.
– Aggregate analysis: the cost of the i-th insertion is
  i,  if i − 1 is an exact power of 2;
  1,  otherwise.
– Summing up, the total cost of n insertions is then < 3n.
– The accounting method gives the intuition: pay $1 for inserting the element, put $1 into the element's account to pay for relocating it later, and put $1 into the account of another element to pay for a later relocation of that element.

Potential function
What potential function do we want? Φ_i = 2·num_i − size_i:
– It is always non-negative (the table stays at least half full).
– Amortized cost of insertion: whether or not the insertion triggers an expansion, the amortized cost works out to 3 in both cases.

Deletions
What if we contract whenever the table is about to become less than half full?
– Would the amortized running times of a sequence of insertions and deletions be constant? No: alternating insertions and deletions right at the half-full boundary would force a relocation on almost every operation.
– Problem: we want to avoid doing reallocations often without having accumulated "the money" to pay for them!
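The doubling strategy above can be sketched as a small Python class (the class and counter names are illustrative, not from the slides). Counting every elementary write, including the copies made during relocation, lets us check the aggregate bound of < 3n for n insertions:

```python
class DynamicTable:
    """Dynamic table that doubles its capacity when full."""

    def __init__(self):
        self.size = 1              # allocated capacity
        self.num = 0               # elements currently stored
        self.slots = [None]
        self.elementary_ops = 0    # writes, including relocation copies

    def insert(self, x):
        if self.num == self.size:
            # table full: allocate double the space and copy everything
            new_slots = [None] * (2 * self.size)
            for i in range(self.num):
                new_slots[i] = self.slots[i]
                self.elementary_ops += 1   # one copy per relocated element
            self.slots = new_slots
            self.size *= 2
        self.slots[self.num] = x
        self.num += 1
        self.elementary_ops += 1           # the insertion itself

t = DynamicTable()
for x in range(1000):
    t.insert(x)
```

For n = 1000 insertions, the relocation copies total 1 + 2 + 4 + ... + 512 = 1023, so the overall cost is 2023 elementary operations, comfortably below 3n.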
Deletions (2)
Idea: delay the contraction!
– Contract (to half the size) only when num = size/4.
– The second requirement is still satisfied: α > 1/4.
How do we define the potential function?

  Φ = 2·num − size    if α ≥ 1/2
  Φ = size/2 − num    if α < 1/2

– It is always non-negative.
– Computing the amortized running time of deletions for α < 1/2 (both with and without contraction) again gives a constant.
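The delayed-contraction rule can be sketched as follows (again with illustrative names): the table doubles when full, but shrinks to half its size only when it has fallen to a quarter full, so the load factor never drops below 1/4.

```python
class ShrinkableTable:
    """Dynamic table: double when full, halve when a quarter full."""

    def __init__(self):
        self.size = 1
        self.num = 0
        self.slots = [None]

    def _relocate(self, new_size):
        new_slots = [None] * new_size
        for i in range(self.num):
            new_slots[i] = self.slots[i]
        self.slots = new_slots
        self.size = new_size

    def insert(self, x):
        if self.num == self.size:
            self._relocate(2 * self.size)       # expand: double the size
        self.slots[self.num] = x
        self.num += 1

    def delete(self):
        if self.num == 0:
            raise IndexError("delete from empty table")
        self.num -= 1
        self.slots[self.num] = None
        # delayed contraction: halve only when a quarter full, so the
        # load factor stays above 1/4 between relocations
        if self.size > 1 and self.num == self.size // 4:
            self._relocate(self.size // 2)

t = ShrinkableTable()
for x in range(100):
    t.insert(x)
for _ in range(90):
    t.delete()
```

After 100 insertions the table holds 100 elements in 128 slots; the 90 deletions trigger contractions at num = 32 and num = 16, leaving 10 elements in 32 slots, so α = 10/32 is still above 1/4.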
