POSTED ON: 4/4/2011
DEPARTMENT OF SOFTWARE ENGINEERING
COLLEGE OF INFORMATION TECHNOLOGY
UNIVERSITI TENAGA NASIONAL

CSEB324 DATA STRUCTURES & ALGORITHM

SEARCHING

One of the most common and most time-consuming operations in computer science is searching, the process used to find the location of a target among a list of objects. In this chapter we will study two basic search algorithms, the sequential search and the binary search. The sequential search can be used on any list, ordered or not, and is the usual way to locate data in a linked list, while the binary search requires an ordered list stored in an array and is far more efficient on large lists. We then look at indexed sequential search, tree searching and hashing. Throughout, we consider methods of searching large amounts of data to find one particular piece of information.

Some terms to remember:

    An element in a file is called a record.
    A table or file is a group of elements (records).
    A key is used to differentiate among records. A key can be simple or complex and must be unique.

A searching algorithm accepts an argument a and tries to find a record whose key is a. As the result, it can return the whole record or, more commonly, just a pointer to the record. This searching process is also called data retrieval.

Sequential Searching

This is the simplest searching method. It is used whenever the list is not ordered. Generally, you will use this technique only for small lists or lists that are not searched often. In other cases you should first sort the list and then search it using the binary search discussed later.

Assume k is an array of n keys, k(0) through k(n-1), and r is an array of records such that k(i) is the key of r(i).

The algorithm:

    for (j = 0; j < n; j++)
        if (key == k(j))
            return (j);
    return (-1);

The algorithm examines each key in turn; upon finding one that matches the search argument, it returns that key's index. If no match is found, -1 is returned.
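The sequential search algorithm can be written as a small C function. This is only a sketch: it assumes the keys are plain integers held in an array, and the name seq_search is chosen here for illustration.

```c
/* Sequential search: return the index of the first element of
   k[0..n-1] equal to key, or -1 if no element matches.
   Assumption: keys are plain int values. */
int seq_search(const int k[], int n, int key)
{
    for (int j = 0; j < n; j++) {
        if (k[j] == key)
            return j;       /* match found at index j */
    }
    return -1;              /* every key was examined without a match */
}
```

A caller would typically use the returned index to fetch the matching record, for example r[seq_search(k, n, key)] after checking that the result is not -1.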
Searching an Ordered Table

If the table is stored in ascending or descending order of the record keys, there are several techniques that can be used to improve the efficiency of searching. This is especially true if the table is of fixed size.

One obvious advantage of searching a sorted file over searching an unsorted file shows up when the argument key is absent from the file. In an unsorted file, n comparisons are needed to detect this. In a sorted file, only about n/2 comparisons are needed on average, provided the argument keys are uniformly distributed over the range of keys in the file. This is because, in a file sorted in ascending order, we know that the key is missing as soon as we encounter a key that is greater than the argument.

Because of the simplicity and efficiency of sequential processing on sorted files, it may be worthwhile to sort a file before searching for keys in it.

Indexed Sequential Search

There is another technique that improves search efficiency for sorted data, at the cost of more space. This method is called the indexed sequential search. An auxiliary table, called the index table, is set aside in addition to the sorted file itself. Each element in the index table consists of a key kindex and a pointer pindex to the record in the file that corresponds to kindex. The elements in the index, as well as the elements in the file, must be sorted on the key.

    Index table                  Data file

    kindex   pindex              key    record
       8       0                   8       0
     115       4                  14       1
     500       8                  38       2
                                  72       3
    (indexsize is stated as 4,   115       4
     although three entries      321       5
     are listed)                 400       6
                                 412       7
                                 500       8
                                 512       9
                                 555      10
                                 600      11

The algorithm used for searching an indexed sequential file is straightforward. Let r, k, and key be defined as before, let kindex be an array of the keys in the index table, and let pindex be the array of pointers within the index table to the actual records in the file. We assume that the file is stored as an array, that n is the size of the file, and that indexsize is the size of the index.
The algorithm:

    for (j = 0; j < indexsize && kindex(j) <= key; j++)
        ;
    if (j == 0)
        lowlim = 0;
    else
        lowlim = pindex(j-1);
    if (j == indexsize)
        hilim = n - 1;
    else
        hilim = pindex(j) - 1;
    for (j = lowlim; j <= hilim && k(j) != key; j++)
        ;
    if (j > hilim)
        return -1;
    else
        return j;

The real advantage of the indexed sequential method is that the items in the table can be examined sequentially if all the records in the file must be accessed, yet the search time for a particular item is sharply reduced. A sequential search is performed on the smaller index rather than on the larger table. Once the correct index position is found, a second sequential search is performed on a small portion of the record table itself.

If the table is so large that even the use of an index does not achieve sufficient efficiency, a second index can be used.

Binary Search

The most efficient method of searching a sequential table without the use of auxiliary indices or tables is the binary search. Basically, the argument is compared with the key of the middle element in the table. If they are equal, the search ends successfully; otherwise, either the upper or the lower half of the table must be searched in a similar manner.

The algorithm of binary search:

    low = 0;
    hi = n - 1;
    while (low <= hi) {
        mid = (low + hi) / 2;
        if (key == k(mid))
            return (mid);
        if (key < k(mid))
            hi = mid - 1;
        else
            low = mid + 1;
    }
    return (-1);

Each comparison in the binary search reduces the number of possible candidates by a factor of 2. Thus, the maximum number of key comparisons is approximately log2 n. For example, a table of 1,000,000 keys needs at most about 20 comparisons.

Note that the binary search may be used in conjunction with the indexed sequential table. Instead of searching the index sequentially, a binary search can be used. The binary search can also be used in searching the main table once the two boundary records are identified.

Unfortunately, binary search can only be used if the data is stored in an array. This is because it relies on the fact that the indices of array elements are consecutive integers, so the middle element can be located directly.
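Both algorithms above translate almost line for line into C. The sketch below assumes integer keys in ordinary arrays; the function names and parameter order are choices made here for illustration. The midpoint is written as low + (hi - low) / 2, which gives the same value as (low + hi) / 2 but cannot overflow for very large tables.

```c
/* Binary search over the sorted keys k[0..n-1].
   Returns the index of key, or -1 if it is absent. */
int binary_search(const int k[], int n, int key)
{
    int low = 0;
    int hi = n - 1;

    while (low <= hi) {
        /* Same midpoint as (low + hi) / 2, written to avoid overflow. */
        int mid = low + (hi - low) / 2;

        if (key == k[mid])
            return mid;
        if (key < k[mid])
            hi = mid - 1;       /* continue in the lower half */
        else
            low = mid + 1;      /* continue in the upper half */
    }
    return -1;
}

/* Indexed sequential search. kindex[] and pindex[] form the index
   table (indexsize entries); k[] is the sorted file of n keys.
   Returns the position of key in k[], or -1 if it is absent. */
int indexed_search(const int k[], int n,
                   const int kindex[], const int pindex[], int indexsize,
                   int key)
{
    int j;
    int lowlim, hilim;

    /* Step through the index until an entry exceeds the argument. */
    for (j = 0; j < indexsize && kindex[j] <= key; j++)
        ;

    lowlim = (j == 0) ? 0 : pindex[j - 1];
    hilim = (j == indexsize) ? n - 1 : pindex[j] - 1;

    /* Sequential search of the file section lowlim..hilim. */
    for (j = lowlim; j <= hilim && k[j] != key; j++)
        ;
    return (j > hilim) ? -1 : j;
}
```

With the twelve-record data file shown earlier and the three listed index entries (8, 115, 500 pointing at records 0, 4, 8), a search for key 412 stops in the index at the entry 500, so only records 4 through 7 of the file are scanned.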
To search for an element, perform a binary search on the element array. If the argument key is not found, the element does not exist in the table.

Tree Searching

Continuing from the previous chapter, we derive the tree search algorithm as follows:

    p = tree;
    while (p != NULL && key != key(p))
        if (key < key(p))
            p = left(p);
        else
            p = right(p);
    return (p);

The advantage of using a binary search tree over an array is that a tree enables search, insertion and deletion operations to be performed efficiently. If an array is used, an insertion or deletion requires approximately half of the array elements to be moved. With a binary tree, on the other hand, only a few pointer adjustments are needed for insertion and deletion.

Hashing

In the data retrieval methods discussed so far, we assumed that the record being sought is stored in a table and that it is necessary to pass through some number of keys before finding the desired one. The organization of the file and the order in which the keys are inserted affect the number of keys that must be inspected before obtaining the desired one. Obviously, efficient search techniques are those that minimize the number of comparisons. Optimally, we would like to have a table organization and search technique in which there are no unnecessary comparisons.

If each key is to be retrieved in a single access, the location of the record within the table can depend only on the key. It may not depend on the locations of other keys, as it does in a tree. The most efficient way to organize such a table is as an array. If the record keys are integers, the keys themselves can serve as indices to the array.

Let us assume the following declaration of an array that represents a collection of data:

    partype part[100];

where part[i] represents the record whose part number is i. In this situation, the part numbers are keys that are used as indices to the array.
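A minimal sketch of this direct indexing in C is shown below. The text only names the type partype, so the fields in_use and quantity, the constant MAXPARTS and the two helper functions are assumptions made here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAXPARTS 100

/* Hypothetical record layout; the text only names the type partype. */
typedef struct {
    bool in_use;     /* true if a part with this number exists */
    int  quantity;   /* illustrative payload field */
} partype;

/* part[i] holds the record whose part number is i. */
static partype part[MAXPARTS];

/* Store a record under its part number.
   Returns false if the part number is outside 0..MAXPARTS-1. */
bool store_part(int partno, int quantity)
{
    if (partno < 0 || partno >= MAXPARTS)
        return false;
    part[partno].in_use = true;
    part[partno].quantity = quantity;
    return true;
}

/* Retrieve a record in a single access, with no key comparisons.
   Returns NULL if no part with that number has been stored. */
const partype *find_part(int partno)
{
    if (partno < 0 || partno >= MAXPARTS || !part[partno].in_use)
        return NULL;
    return &part[partno];
}
```

The in_use flag is one simple way to distinguish array positions that correspond to existing parts from positions whose part number is not in use.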
Even if the total number of parts is fewer than 100, the same structure can be used to maintain the data. Although many locations in part may correspond to nonexistent keys, this waste is offset by the advantage of direct access to each of the existing parts.

Unfortunately, such a system is not always practical. For example, suppose that a company has an inventory file of more than 100 items and that the key to each record is a seven-digit part number. To use direct indexing with the entire seven-digit key, an array of 10 million elements is needed. This clearly wastes an unacceptably large amount of space.

What is necessary is some method of converting a key into an integer within a limited range. Ideally, no two keys should be converted into the same integer. Unfortunately, such an ideal method usually does not exist. Let us develop a method to solve this problem.

Let us reconsider the example in which each record is keyed by a seven-digit part number. Suppose that the company has fewer than 1000 parts and that there is only a single record for each part. Then an array of 1000 elements is sufficient to store the entire file. The array is indexed by an integer between 0 and 999 inclusive. The last three digits of the part number are used as the index for the part's record in the array.

A function that transforms a key into a table index is called a hash function. If h is a hash function and key is a key, h(key) is called the hash of key and is the index at which a record with the key key should be placed. If r is a record whose key hashes into hr, hr is called the hash key of r. The hash function in the preceding example is h(key) = key % 1000. The values that h produces should cover the entire set of indices in the table. For example, the function x % 1000 can produce any integer between 0 and 999, depending on the value of x. As we shall see shortly, it is a good idea for the table size to be somewhat larger than the number of records that are to be inserted.
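The hash function of this example can be sketched in C as follows. The constant name TABLESIZE is chosen here for illustration, and the sketch assumes that keys are nonnegative, since the C % operator can give a negative result for a negative operand.

```c
#define TABLESIZE 1000

/* Hash function from the example: the last three digits of the
   part number select one of the indices 0..999.
   Assumption: key is nonnegative. */
int h(long key)
{
    return (int)(key % TABLESIZE);
}
```

For instance, the made-up part number 4618396 hashes to index 396. Another made-up part number, 4957396, also hashes to 396, because only the last three digits take part in the calculation.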
The foregoing method has a flaw. Suppose two keys k1 and k2 are such that h(k1) equals h(k2). When a record with key k1 is entered into the table, it is inserted at position h(k1). But when k2 is hashed, because its hash key is the same as that of k1, an attempt may be made to insert the record into the same position where the record with key k1 is stored. Clearly, two records cannot occupy the same position. Such a situation is called a hash collision or a hash clash.

There are two basic methods of dealing with a hash clash. The first technique, called rehashing, involves using a secondary hash function on the hash key of the item. The rehash function is applied successively until an empty position is found where the item can be inserted. The second technique, called chaining, builds a linked list of all the items whose keys hash to the same location. During a search, this short linked list is traversed sequentially for the desired key.

Choosing a Hash Function

Let us now turn to the question of how to choose a good hash function. Clearly, the function should produce as few clashes as possible; that is, it should spread the keys uniformly over the possible array indices. Of course, unless the keys are known in advance, it cannot be determined whether a particular hash function disperses them properly.

1. Direct Hashing

The key is the address without any algorithmic manipulation. The data structure must therefore contain an element for every possible key. While the situations in which direct hashing can be used are limited, it is very powerful when it applies because it guarantees that there are no synonyms (different keys that hash to the same address).

2. Subtraction method

Sometimes we have keys that are consecutive but do not start from one. For example, a company may have only 100 employees, but the employee numbers start from 1001 and go to 1100. In this case we use a very simple hashing function that subtracts 1000 from the key to determine the address.
The beauty of this example is that it is simple and guarantees no collisions. Its limitation is the same as that of direct hashing: it can be used only for small lists.

3. Division method

This method divides the key by the array size and uses the remainder plus one for the address. This gives us the simple hashing algorithm shown below, where listSize is the number of elements in the array.

    address = key % listSize + 1

4. Digit Extraction method

Selected digits are extracted from the key and used as the address. For example, using a six-digit employee number to hash to a three-digit address (000-999), we could select the first, third, and fourth digits (from the left) and use them as the address.

Example:

    Key       Address
    379452    394
    121267    112
    378845    388
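The subtraction, division and digit extraction methods, together with the chaining technique for resolving clashes, can be sketched in C as follows. This is an illustration under stated assumptions rather than a definitive implementation: keys are nonnegative integers, LISTSIZE is an arbitrary table size, and all function and type names are chosen here. Because the division method yields addresses from 1 to listSize, the chained table leaves slot 0 unused.

```c
#include <stdlib.h>

#define LISTSIZE 100   /* assumed table size, for illustration only */

/* Subtraction method: subtract 1000 from the employee number. */
int hash_subtract(int key)
{
    return key - 1000;
}

/* Division method as given in the text: addresses run 1..listSize. */
int hash_division(int key, int listSize)
{
    return key % listSize + 1;
}

/* Digit extraction: the first, third and fourth digits (from the
   left) of a six-digit key form a three-digit address. */
int hash_digits(int key)
{
    int d1 = key / 100000;        /* first digit  */
    int d3 = (key / 1000) % 10;   /* third digit  */
    int d4 = (key / 100) % 10;    /* fourth digit */
    return d1 * 100 + d3 * 10 + d4;
}

/* Chaining: each table slot heads a linked list of the keys that
   hash to that address. */
typedef struct node {
    int key;
    struct node *next;
} node;

static node *table[LISTSIZE + 1];   /* slot 0 unused: addresses are 1-based */

/* Insert key at the front of its chain.
   Returns 0 on success, -1 if memory could not be allocated. */
int chain_insert(int key)
{
    node *p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    int addr = hash_division(key, LISTSIZE);
    p->key = key;
    p->next = table[addr];
    table[addr] = p;
    return 0;
}

/* Traverse the chain at the key's address sequentially.
   Returns the matching node, or NULL if the key is absent. */
node *chain_search(int key)
{
    node *p = table[hash_division(key, LISTSIZE)];
    while (p != NULL && p->key != key)
        p = p->next;
    return p;
}
```

With LISTSIZE set to 100, the keys 1234 and 5634 both hash to address 35, so they end up on the same chain and a search for either one walks that short list.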