Arrays and Hashing

     Annatala Wolf
     222 Lecture 5
Arrays in Programming
   Most programming languages feature a
    primitive “array” type that lets you store a
    bunch of elements of the same type
    together.
   The reason arrays are useful is that, for
    most languages, you can access any cell in
    an array in constant time (well, almost).
       This means no matter how big the array gets, it’s
        just as fast to access array[n], for any n.
How Fast Access Works
    Primitive arrays are usually modeled in the computer by
     reserving a bunch of memory all in a row, at the same time.
    If the memory is contiguous, then to access array[3], all the machine
     has to do is take the location of the first datum and add (3 *
     size of (type)) to get the location of the fourth datum.

     [Figure: the character array H e l l o ! in memory. If the array’s [0]
      element is located at the first cell, then adding 3 * size of (character)
      to the location of [0] gives us the location of [3].]
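The address arithmetic above can be sketched in C++. This is only an illustration: the helper name address_of_element is made up for this example, and a real compiler does the same computation for you whenever you write array[n].

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: compute the address of element [n] by hand as
// (address of element [0]) + n * sizeof(T), working in units of bytes.
template <typename T>
const T* address_of_element(const T* first, std::size_t n) {
    assert(first != nullptr);
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(first);
    return reinterpret_cast<const T*>(bytes + n * sizeof(T));
}
```

The cost of this computation is one multiplication and one addition, no matter how large n is, which is why access is (almost) constant time.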
Access Limitations
   This works for fast access as long as two
    things hold:
       The memory needs to be contiguous
            We need enough contiguous memory available to
             reserve in the first place
            We probably can’t extend the size later, since the
             memory nearby will be taken up eventually
       We can only access memory where the relative
        location of [n] can fit into an Integer
            This is rarely a limitation, but technically means it’s not
             quite a constant-time operation
Problems with Arrays
   In some languages (like C++), arrays have
    the same safety issues that pointers have.
       In fact, arrays are just like pointers in many ways
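   In Java, arrays and generics (Java’s
    version of templates) should not be mixed:
    arrays check element types at run time, while
    generic type parameters are erased at compile
    time, so type errors the compiler cannot catch
    will result.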
   We’ll be working with array types that are
    fully-fledged classes, not primitive arrays.
Static_Array
   Intuitively, a static array is like a sequence of fixed
    length, where the length is determined at compile time
       Acts like a statically allocated array (i.e., type myArray[constant])
       Each Static_Array type has its own preset bounds
   Mathematically, it’s actually not another string of Item,
    like Queue, Sequence, or Stack. (Surprise!)
   Static_Array is modeled by an indexed table of Item
       This is a finite set of (integer, Item) pairs, where each integer is
        between lower and upper inclusive, and each integer is
        associated with exactly one Item.
       The bounds are part of the template that defines the model.
Array Bounds
   Each Static_Array type has a preset upper and lower
    bound. These can be set to any integer value. You
    could have an array from:
       1 to 10
       -50 to 50
       -5 to -3
       …but lower ≤ upper is required. Otherwise, you could make an
        array type that could never hold information, which is silly.
   Any time a Static_Array type object is created, it will
    create a new object for each element of the array.
Selecting Array Bounds
   [Figure: comic from xkcd.com]
Static_Array Kernel Operations
   “Accessor”: [pos]
       requires: lower ≤ pos ≤ upper
   Lower_Bound( )
       requires: true
   Upper_Bound( )
       requires: true


   The last two operations don’t seem necessary,
    since the bounds are set at compile-time…but
    they help prevent using “magic numbers” in code.
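As a rough C++ analogue of these kernel operations (not the Resolve/C++ component itself), the sketch below fixes the bounds as template parameters. The names StaticArraySketch, lower_bound, upper_bound, and average are assumptions made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical analogue of Static_Array: each instantiation has its own
// compile-time bounds, and lower <= upper is enforced by the compiler.
template <typename Item, int Lower, int Upper>
class StaticArraySketch {
    static_assert(Lower <= Upper, "lower <= upper is required");
public:
    // "Accessor"; requires: Lower <= pos <= Upper
    Item& operator[](int pos) {
        assert(Lower <= pos && pos <= Upper);
        return cells_[static_cast<std::size_t>(pos - Lower)];
    }
    static constexpr int lower_bound() { return Lower; }
    static constexpr int upper_bound() { return Upper; }
private:
    // One cell per index; every cell starts with Item's default value.
    std::array<Item, static_cast<std::size_t>(Upper - Lower + 1)> cells_{};
};

// Client code asks the array for its bounds instead of using magic numbers.
template <typename A>
double average(A& a) {
    double sum = 0.0;
    for (int i = a.lower_bound(); i <= a.upper_bound(); ++i) {
        sum += a[i];
    }
    return sum / (a.upper_bound() - a.lower_bound() + 1);
}
```

Because average only uses the bound operations, it works unchanged for an array indexed 1 to 100 or -5 to -3.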
“Magic Numbers”
   The term “magic numbers” in computer science refers to the
    practice of littering code with numeric literals, such as:
     while (counter < 11) {
         x = x - 5;
     }

   This practice leads to illegible code that is very difficult to
    change later without errors. Use constants instead.
     // at top of the section in code where these are used
     object Integer_constant LINE_WIDTH = 10, BREAK_WIDTH = 5;

     // much later on in code
     while (counter < (LINE_WIDTH + 1)) {
         x = x - (BREAK_WIDTH);
     }

   Some style guidelines suggest the only magic numbers in your
    code should be 0, 1, and -1.
Instantiating Static_Array
#include "CT/Static_Array/Kernel_1_C.h"

concrete_instance
class Real_Vector_100 :
    instantiates
        Static_Array_Kernel_1_C <
            Real,    // type of Item
            1,       // lower
            100      // upper
        >
{};
 Using Static_Array
main() {
    // create 100 new Real objects named terms[1]...terms[100]
    object Real_Vector_100 terms;
    Set_Values(terms);             // pretend this sets all 100 values
    debug(Average(terms));         // tests the function below
}

// Get the average value of all the elements in vector
function_body Real Average(preserves Real_Vector_100& vector) {
    object Real sum;
    object Integer i = vector.Lower_Bound();
    while (i <= vector.Upper_Bound()) {
        sum += vector[i];
        i++;    // advance to the next element
    }
    return sum / To_Real(vector.Upper_Bound() - vector.Lower_Bound() + 1);
}
Array
   This is the dynamic version of Static_Array.
       Just like Static_Array, except you don’t decide the array
        bounds until after you create an object (at runtime).
   The mathematical model for Array is slightly
    different, since the lower and upper bounds can
    change for different objects of the same type.
    This makes the abstract model a 3-tuple:
       (integer (self.lb),
       integer (self.ub),
       indexed table of Item (self.table))
Array Kernel Operations
   “Accessor”: [pos]
   Lower_Bound( )
   Upper_Bound( )
       same as Static_Array, except the bounds now refer to
        abstract model pieces self.lb and self.ub

   Set_Bounds(lower, upper)
       requires: true
       resetting bounds will clear all array elements
   Default value of Array: (1, 0, { })
       In other words, it can’t contain anything yet!
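A minimal C++ sketch of the same idea follows, assuming a hypothetical class named ArraySketch backed by std::vector. It is not the Resolve/C++ Array component; it only mirrors the (lb, ub, table) model and the rule that resetting bounds clears the elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical analogue of Array: the abstract model is (lb, ub, table).
template <typename Item>
class ArraySketch {
public:
    // Default value (1, 0, { }): lb > ub, so no index is legal yet.
    ArraySketch() : lb_(1), ub_(0) {}

    // requires: true. Resetting the bounds clears all array elements.
    void set_bounds(int lower, int upper) {
        lb_ = lower;
        ub_ = upper;
        std::size_t count =
            (upper >= lower) ? static_cast<std::size_t>(upper - lower + 1) : 0;
        table_.assign(count, Item());
    }

    // "Accessor"; requires: lb <= pos <= ub
    Item& operator[](int pos) {
        assert(lb_ <= pos && pos <= ub_);
        return table_[static_cast<std::size_t>(pos - lb_)];
    }

    int lower_bound() const { return lb_; }
    int upper_bound() const { return ub_; }
    std::size_t size() const { return table_.size(); }

private:
    int lb_;
    int ub_;
    std::vector<Item> table_;
};
```

Calling set_bounds(-19, 50) on a fresh object creates 70 default-valued elements, and calling it again throws away whatever was stored.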
Instantiating Array
#include "CT/Array/Kernel_1_C.h"
concrete_instance
class Array_Of_Text :
    instantiates
        Array_Kernel_1_C <Text>
{};
Using Array
main() {
    object Array_Of_Text names;      // no Text objects yet
    // create 70 new Text objects called names[-19]...[50]
    names.Set_Bounds(-19, 50);
    DoStuff(names);                  // do other stuff with names
}
// procedure to print out all Text objects in the array
procedure_body Print ( preserves Array_Of_Text& items,
                       alters Character_OStream& out) {
    object Integer i = items.Lower_Bound();
    while (i <= items.Upper_Bound()) {
        out << items[i] << '\n';
        i++;
    }
}
Array Concepts
   Resolve/C++ Array behaves a lot like native “array”
    objects in C++ and Java. It’s useful when you want to
    create and hold a bunch of objects all at once.
   A common way of holding dynamic data is to reserve
    space for X contiguous elements in advance, and keep
    track of how many are actually “in use”.
       If you can reserve all of it at once, access is very fast
   You could use an Array to represent a Sequence. To
    add an item in the middle of the Sequence, you’d have to
    move all the items in the way one slot to the right. This
    would work when there was a free slot to the right.
Reaching the Limit
   But what do you do if you run out of slots completely?
    You can’t add more without clearing the Array!
   Solution: create a new, larger Array, and transfer all of
    the elements over.
   To do this quickly (over the long run), you must always
    increase the Array by a fixed percentage of its current size,
    not by a fixed number of slots.
       This means that in the long run, access/add will still be
        constant time on average (amortized)
   Java uses this scheme for most of its collection objects
    (to speed access). When you reach a preset maximum,
    it increases the total number of elements it can hold by
    50% or 100%, depending on the implementation.
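To see why the growth rule matters, the sketch below counts how many elements must be transferred while appending n items one at a time. The function and the two growth rules (double_it and add_32) are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Count the element transfers needed to append n items one at a time,
// when a full array of capacity c is replaced by one of capacity grow(c).
template <typename Grow>
std::size_t transfers_for_appends(std::size_t n,
                                  std::size_t initial_capacity,
                                  Grow grow) {
    assert(initial_capacity > 0);
    std::size_t capacity = initial_capacity;
    std::size_t length = 0;
    std::size_t transfers = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (length == capacity) {
            transfers += length;        // move every element to the new array
            capacity = grow(capacity);
        }
        ++length;
    }
    return transfers;
}

// Grow by a fixed percentage (100%) of the current size.
inline std::size_t double_it(std::size_t c) { return c * 2; }

// Grow by a fixed number of slots.
inline std::size_t add_32(std::size_t c) { return c + 32; }
```

Starting from 32 slots and appending 1024 items, doubling transfers 992 elements in total (fewer than one per append), while adding 32 slots at a time transfers 15872.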
Using Array Elements

   [Figure: inserting e into the array H l l o ! \0 gives H e l l o !, with
    the \0 swapped out. This idea may work, if we’re not using the last
    element for anything yet.]



   To get “Sequence-like” behavior, we’d have
    to move over the elements manually (using
    swap with Accessor, starting from right).
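The shifting step might look like the following C++ sketch, assuming the slots live in a std::vector and a separate length counter tracks how many are in use. The name insert_by_swapping is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Insert x at position pos among the first `length` slots in use, moving
// elements one slot to the right by swapping, starting from the right.
// Requires: pos <= length, and at least one free slot (length < slots.size()).
// Afterward, x holds whatever was sitting in the free slot.
template <typename Item>
void insert_by_swapping(std::vector<Item>& slots, std::size_t& length,
                        std::size_t pos, Item& x) {
    assert(pos <= length && length < slots.size());
    for (std::size_t i = length; i > pos; --i) {
        std::swap(slots[i], slots[i - 1]);
    }
    std::swap(slots[pos], x);
    ++length;
}
```

Starting from the right matters: each swap moves the leftover value from the free slot one step left, until it lands at pos and is swapped out for x.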
Extending an Array
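   [Figure: the array R F L is full, so there is no room to add O (“Wait: no
    room!”). Make a new, larger Array and swap in the elements: the old array
    is left holding \0 \0 \0, and the new one holds R F L \0 \0 \0.]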



   If we needed to make an array “bigger”, we’d have
    to swap over anything we wanted to keep.
    (Set_Bounds( ) clears the elements of the Array.)
Sequence_Kernel_25::Remove(pos, x)
   Get into groups of 2 to 5!
rep_field_name(Rep, 0, Array_Of_Item, array);
rep_field_name(Rep, 1, Integer, length);

/*! correspondence
        there exists string of Item: a where (self * a = self.array) and
        |self| = self.length

     convention
         self.array.lb = 0 and
         there exists integer: k where (self.array.ub = 2^k - 1) and
         if |self| = 0 then self.array.ub = 31    !*/



//   In English: the sequence is all of the elements of the array from slots [0]
//   to [self.length-1]. The array is initialized with indexes 0 to 31.
//   Whenever we need more space, we double the size of the array.
//   Note: Remove has a special contract case to fulfill when |self| becomes 0!
Remove(pos, x)
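procedure_body Remove( preserves Integer pos, produces Item& x )
{
    // swap the removed item out of the array and into x
    x &= self[array][pos];
    // pull each later item one slot to the left; stop at the last slot
    // in use, so we never index past the end of a full array
    object Integer pull_left = pos;
    while (pull_left < self[length] - 1) {
        self[array][pull_left] &= self[array][(pull_left + 1)];
        pull_left++;
    }
    self[length]--;
    if (self[length] == 0) {
        // the convention requires bounds 0 to 31 whenever |self| = 0
        self.Clear();
        // (alternately):      self[array].Set_Bounds(0, 31);
    }
}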
Thinking about Remove( )
   It’s possible this could leave crap in the array…but
    that’s okay! This contract doesn’t say anything
    about stuff outside of the part that represents self.

   [Figure: before Remove, the array holds H i ! \0 and x holds #;
    afterward, the array holds H ! # \0 and x holds i.]
Kernel Implementation
   As we’ve seen, implementing a Kernel
    consists of two separate ideas
       Defining the Representation for the data that the
        Kernel will hold (and its correspondence)
       Writing the algorithms for Kernel operations that
        access the data (possibly using a convention)
   Both of these involve design choices
Example: Partial_Map_Kernel_1
   Data:
       Queue_Of_Record (order irrelevant)
   Code:
       Must search for a specific record when that record needs
        to be accessed
   The data structure you choose for Rep will
    constrain the algorithms you can use to implement
    the Kernel operations
   Performance is determined by these choices, as
    well as the performance of Rep components
Asymptotic Bounds
   In computer science, the most important
    performance metric is often the asymptotic
    bounds of an algorithm
       how much longer does it take to perform a task,
        as the size of the task gets larger
       ignores differences of a constant factor—the
        point is which will dominate in the long run
   Computers are fast, so small tasks take
    virtually no time
    Simple Running Times
   Constant time O(1): takes the same amount of time for any size data
   Log time O(lg n): each time you double the data size, it takes an
    additional segment of time (very fast—nearly as good as constant)
   Linear time O(n): when data doubles in size, so does the time it takes
   Log-linear time O(n lg n): this is slower than linear, but fast enough
    for most data sets (fastest comparison-based sorting algorithms here)
   Quadratic time O(n^2): if you double the data, you quadruple the time it
    takes (slower, but still possible; slow sorting algorithms are here)
   Exponential time O(2^n): intractable!
        no matter how good computers get, you’ll never get an answer if n > 200
Traveling Salesman (NP-Complete)
   Some solutions are better than others…
   [Figure: comic from xkcd.com]
Asymptotic Growth Functions
Asymptotic Bounds: All That?
   But are asymptotic bounds all we care
    about?
   Compare n^2 operations with 8000n
       Sure, n^2 is slower, but only for data sets > 8000
   In the real world, constant factors can still
    make a big difference!
       This is mainly true for slower systems, or
        systems that have critical timing (r-t graphics)
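The crossover point in the comparison above can be checked with a few lines of C++. The function name crossover is made up for this example, and it counts abstract "operations" rather than measuring real time.

```cpp
#include <cassert>
#include <cstdint>

// Smallest n at which n * n operations exceed factor * n operations.
inline std::uint64_t crossover(std::uint64_t factor) {
    assert(factor > 0);
    std::uint64_t n = 1;
    while (n * n <= factor * n) {
        ++n;
    }
    return n;
}
```

For a constant factor of 8000, the quadratic algorithm only becomes the slower one once n passes 8000, so for smaller data sets the "worse" asymptotic bound wins.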
Pizza Parlor Example
   Consider a pizza parlor with the following
    business rule for phone orders: the server
    must look you up in the system by your phone
    number before taking your order
       The computer hardware may be suboptimal, but it
        shouldn’t need to be fancy because the software
        is simple
       They are using a Partial_Map to access your
        information from your phone number (this is a natural
        choice)…specifically, Partial_Map_Kernel_1
Pizza Parlor Example
   Now that the pizza parlor has become popular, they
    have more than 1000 customers—and it takes up to
    30 seconds to retrieve a customer’s information.
       This is an unacceptable delay! How can we fix this?
       Ideally, we want to knock the wait time down to less
        than one second.
   A different implementation of Partial_Map may be
    required, for performance reasons.
       Recall that PMK1 uses unordered Queue of Record
Running Time For
Partial_Map_Kernel_1
   Define:         ?
   Undefine:       ?
   Undefine_Any:   ?
   Accessor:       ?
   Is_Defined:     ?
   Size:           ?
   Constructor:    ?
   Destructor:     ?
   Swap:           ?
   Clear:          ?
Running Time For
Partial_Map_Kernel_1
   Define:         constant
   Undefine:       linear
   Undefine_Any:   constant
   Accessor:       linear
   Is_Defined:     linear
   Size:           constant
   Constructor:    constant
   Destructor:     linear (amortized: constant)
   Swap:           constant
   Clear:          linear (amortized: constant)
Linear Search
   Linear search is responsible for the linear
    running times in Partial_Map_Kernel_1.
   If you look at Partial_Map_Kernel_1, you’ll
    get a good idea of how this works (the
    implementation is online, in the RESOLVE
    component catalog).
       Each time you need an item, it cycles through the
        Representation’s queue until the item is at the front
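One way to picture this linear search is the C++ sketch below. It uses std::deque as a stand-in for the queue and integer (key, value) pairs as a stand-in for records; the names Record, move_to_front, and value_at_front are assumptions for this illustration, not the catalog's code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical stand-in for a record: a (key, value) pair.
using Record = std::pair<int, int>;

// Cycle the queue (dequeue from the front, enqueue at the back) until a
// record with the given key is at the front. Returns false if the key is
// absent; in that case the queue ends up in its original order.
inline bool move_to_front(std::deque<Record>& q, int key) {
    for (std::size_t steps = 0; steps < q.size(); ++steps) {
        if (q.front().first == key) {
            return true;
        }
        q.push_back(q.front());
        q.pop_front();
    }
    return false;
}

// Read the value of the record at the front; requires a non-empty queue.
inline int value_at_front(const std::deque<Record>& q) {
    assert(!q.empty());
    return q.front().second;
}
```

In the worst case every record is examined once, which is where the linear running times come from.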
Is Improvement Possible?
   It may not always be possible to improve on the
    asymptotic bounds of an algorithm.
   Any method of holding a wide variety of information
    in a collection will require some additional cost as
    the size of the data increases.
   But sometimes improving by a constant factor is
    sufficient, in particular, when the size of the data
    can be bounded.
One Solution: Hashing
   Hashing is a way of improving the
    performance of a linear search algorithm by
    a constant factor.
   Key idea: we can reduce the time to search
    for an item if we place items into multiple
    buckets.
   For this to work, we have to (quickly!) know
    which bucket an item will be found in.
Hashing For Dummies
   To add an item to a collection:
       run a hash function on the item, to get its hash value
       fit the hash value to an array (value mod array_size)
       store the object in the kth bucket in the array, where k is
        the fitted hash value
   To access an item in the collection:
       run the same hash function on the item
       fit the hash value to the array, in the same way
       look in the kth bucket, where k is the fitted hash value
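The add and access steps above can be sketched in C++ as follows. Everything here is an assumption made for illustration: the class name BucketMapSketch, the integer keys and values, and especially the placeholder hash function, which is not meant to be a good one.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Hypothetical bucketed map from int keys to int values.
class BucketMapSketch {
public:
    explicit BucketMapSketch(std::size_t buckets) : buckets_(buckets) {
        assert(buckets > 0);
    }

    // Add: hash the key, fit it to the array, store in the kth bucket.
    // Assumes the key is not already defined.
    void define(int key, int value) {
        buckets_[index_of(key)].push_back({key, value});
    }

    // Access: hash and fit the same way, then search only the kth bucket.
    bool is_defined(int key) const { return find(key) != nullptr; }

    int value_of(int key) const {
        const std::pair<int, int>* rec = find(key);
        assert(rec != nullptr);
        return rec->second;
    }

    std::size_t bucket_of(int key) const { return index_of(key); }

private:
    // Placeholder hash function; may return negative values.
    static long long hash(int key) {
        return static_cast<long long>(key) * 31 + 7;
    }

    // Fit the hash value to the array: a mod that is never negative.
    std::size_t index_of(int key) const {
        long long n = static_cast<long long>(buckets_.size());
        long long r = hash(key) % n;
        if (r < 0) r += n;
        return static_cast<std::size_t>(r);
    }

    const std::pair<int, int>* find(int key) const {
        for (const auto& rec : buckets_[index_of(key)]) {
            if (rec.first == key) return &rec;
        }
        return nullptr;
    }

    std::vector<std::list<std::pair<int, int>>> buckets_;
};
```

The search inside a bucket is still linear; the speedup comes only from each bucket holding a fraction of the items.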
Why Separate Hashing?
   We separate the hash function from the “fitting it to
    the bounds of the array” problem, so that a client
    can use the same function with many array sizes.
       We’ll use the Hash function as a utility class
   In RESOLVE, you should assume the hash value
    you get back from a hash function can be any
    Integer value (even negative).
   To fit it to a table indexed from 0 to table_size - 1:
       array_index = hash_value mod table_size
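A caution when carrying this over to plain C++: the % operator can return a negative result when hash_value is negative, so it is not the mathematical mod used above. A small sketch of the fitting step, with fit_to_table as a made-up name:

```cpp
#include <cassert>

// Fit any int hash value to a table indexed from 0 to table_size - 1.
inline int fit_to_table(int hash_value, int table_size) {
    assert(table_size > 0);
    int r = hash_value % table_size;   // in C++, negative if hash_value < 0
    return (r < 0) ? r + table_size : r;
}
```

For example, -17 % 10 is -7 in C++, and adding the table size gives index 3, which is what -17 mod 10 means mathematically.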
Again With The Pizza Parlor
   If we implement Partial_Map with:
       an Array_Of_Queue_Of_Record
       using a hash function to index into the array
   At minimum, how large must our array be in order
    to reduce the wait time from 30s to 1s?
       Does this add a lot of data to our representation?
       What if we want to plan for 10x customer growth over the
        next year?
       If our hash table gets TOO large, however, it becomes a
        challenge to write a good hash function for the data…
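One way to approach the sizing question, under two strong assumptions: lookup time is proportional to the number of records searched, and the hash function spreads records perfectly evenly across buckets. The function name min_buckets is hypothetical.

```cpp
#include <cassert>

// Minimum number of equally filled buckets needed to bring a worst-case
// linear search from current_seconds down to at most target_seconds.
inline int min_buckets(int current_seconds, int target_seconds) {
    assert(current_seconds > 0 && target_seconds > 0);
    // ceiling division
    return (current_seconds + target_seconds - 1) / target_seconds;
}
```

Under these assumptions, going from 30s to 1s needs at least 30 buckets, and planning for 10x growth (300s of searching without hashing) needs at least 300. A real hash function will not be perfectly even, so a practical table would need some slack beyond that.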
Writing The Hash Function
   The hash function returns an Integer value
       ideally, values should have a wide range
   A hash function must be deterministic. It must return the same
    thing for a particular item, every single time it’s run.
       Else, we might add x to [2], and try to remove x from [3]
   So a random function is always wrong (unless it’s also
    deterministic…?)
   [Figure: comic from xkcd.com]
No Perfect Hash Function
   But even a legal hash function can be bad. Not
    all hash functions are created equal.
       If a hash function dumps everything into the same bucket,
        it doesn’t save us any time at all!
       An ideal hash function has a flat distribution
       Most data doesn’t lend itself to a flat distribution…
   Any hash function can be broken by bad data. If I
    know your hash function, I can always pass it data
    that will break it (data that only hits one bucket).
Pizza With Hash…
   The pizza parlor must make a hash
    function based on the D_Item in its
    Partial_Map (customer phone numbers)
       What would be an example of a bad hash
        function?
       What would be an example of a good hash
        function?
       How important is knowing the shape of the data
        to implementing a hash function?
Hash Test
   Let’s write a hash function for pizza phone numbers. Assume that
    each phone number is represented as a triple of Integers:

utility_function Integer Phone_Hash (
        preserves Integer areacode,
        preserves Integer prefix,
        preserves Integer suffix
    );

/*!
      ensures
          Phone_Hash = VALID_HASH of
                  (areacode, prefix, suffix)
!*/
Hash Test Answer
   Answers may vary!
       There’s no “right” answer, but some are more
        robust than others—if we make assumptions
        about the shape of the data (often, this is easy)
   What problems might these hashes have?
       return areacode;
       return suffix mod 10;
       return prefix * suffix;
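To explore the question, the C++ sketch below counts how many distinct buckets each candidate hash actually uses. The customer data is invented for illustration (one area code, one prefix, suffixes 0 to 999), and the names Phone, buckets_used, and sample_customers are assumptions, not part of the lab code.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Phone { int areacode, prefix, suffix; };

// The three candidate hashes from the slide.
inline int hash_areacode(const Phone& p) { return p.areacode; }
inline int hash_suffix_mod_10(const Phone& p) { return p.suffix % 10; }
inline int hash_prefix_times_suffix(const Phone& p) { return p.prefix * p.suffix; }

// Number of distinct buckets hit when each hash value is fitted to a
// table indexed from 0 to table_size - 1.
template <typename Hash>
std::size_t buckets_used(const std::vector<Phone>& phones, int table_size,
                         Hash hash) {
    assert(table_size > 0);
    std::set<int> used;
    for (const Phone& p : phones) {
        int r = hash(p) % table_size;
        if (r < 0) r += table_size;
        used.insert(r);
    }
    return used.size();
}

// Made-up data: 1000 local customers sharing one area code and prefix.
inline std::vector<Phone> sample_customers() {
    std::vector<Phone> phones;
    for (int suffix = 0; suffix < 1000; ++suffix) {
        phones.push_back({614, 292, suffix});
    }
    return phones;
}
```

With a table of 100 buckets and this data, the area code hash uses 1 bucket, suffix mod 10 uses 10, and prefix * suffix uses only 25, because 292 and 100 share the factor 4. Different data or a different table size would give different counts.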
Collisions
   Collisions are when things land in the same
    bucket. These are unavoidable, but you want
    to spread out the data as evenly as possible.
       Can happen if hash function doesn’t fit the data well
       Can happen if hash function returns values that are
        too small to take advantage of a larger table
       Multiplication in hash functions may clump common
        factors together, making the hash table more
        “ragged”; a “smoothing” effect may need to be added
            adding a small value, modulus with prime number, etc.
Other Hashing Concerns
   A hash function must be deterministic and
    based only on the value of the item it takes
    as its parameter.
   A hash function must be quick (otherwise it
    saves us nothing overall).
   Hashing won’t work forever. As the array
    size increases, it becomes harder to write a
    hash function which exploits the array size.
Layering for Performance
   Partial_Map_Kernel_2 uses an
    Array_Of_Queue_Of_Record to hold the
    (D_Item, R_Item) pairs of the Partial_Map.
       Seems like a sensible approach for a hash…
   However, Partial_Map_Kernel_6 is
    represented by an Array_Of_Partial_Map!
    Why on Earth would we use Partial_Map to
    implement Partial_Map?
Understanding PMK6
   The reason we can layer Partial_Map on
    top of Partial_Map is we use two different
    versions of Partial_Map.
       The outer version uses hashing to improve
        performance, so Define, Undefine, etc. need be
        run only on smaller Partial_Map components.
       The inner version can use whatever method it
        wants, as long as it doesn’t depend upon the
        outer version.
Code Dependencies
   In Resolve we make a big deal about avoiding
    concrete-to-concrete dependencies, but we still
    have them in places.
       This happens whenever we select a particular implementation
        of one component to work with another in all cases; for example,
        Set_Kernel_1 uses Queue_Kernel_1a, not Queue_Base.
       If we did use Queue_Base, then the dependency would appear
        when the client instantiated Set_Kernel_1. But since this is
        a simple template declaration by the client, we don’t think
        of it as being “all-that-dependent”.
Cyclic Dependencies
   However, if many components are layered, even if
    these dependencies are “loosely structured”, it’s still
    important to prevent cyclic dependencies.
   If PMK6 uses PMK7, and PMK7 uses PMK6…then
    we have a big problem. This will break at compile
    time, though, and we’ll know what we did wrong.
   A dependency that covers a larger cycle may be
    harder to isolate and fix, however.
       Like, if PMK6 uses QK5 uses QK1 uses PMK6
Project Dependencies
   A more common problem that is easy to
    miss appears when code in large systems
    features hidden dependencies.
   It may be the case that you always compile
    projects A and B together, and over time,
    these projects may end up dependent on
    each other—such that you can no longer
    compile either one separately.
Different Approaches
   Partial_Map_Kernel_6 uses Partial_Map_Kernel_1
    as its implementation for the buckets. This yields
    comparable performance to Partial_Map_Kernel_2
    (which implements the same algorithm as PMK1
    directly, plus the hashing).
   However, we could design Partial_Map_Kernel_6 to
    use Partial_Map_Kernel_7 to hold the buckets (by
    changing only one number in PMK6’s code). This
    would give us the benefits of PMK6 (hashing) and
    PMK7 (binary search tree), together.
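The layering idea can be sketched in C++ with the inner map as a template parameter. This is an analogy, not the PMK6 code: HashedMapSketch is a made-up name, keys and values are ints, the fitting step uses the key itself as its hash value, and std::map (an ordered tree-based map) stands in for a tree-based inner implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Outer layer: hashing only. Inner layer: any map type that offers
// operator[], count, at, and erase, such as std::map.
template <typename InnerMap>
class HashedMapSketch {
public:
    explicit HashedMapSketch(std::size_t buckets) : buckets_(buckets) {
        assert(buckets > 0);
    }

    void define(int key, int value) { bucket(key)[key] = value; }
    bool is_defined(int key) const { return bucket(key).count(key) != 0; }
    int value_of(int key) const {
        assert(is_defined(key));
        return bucket(key).at(key);
    }
    void undefine(int key) { bucket(key).erase(key); }

private:
    // Fit the key to the array of inner maps (never negative).
    std::size_t index_of(int key) const {
        long long n = static_cast<long long>(buckets_.size());
        long long r = static_cast<long long>(key) % n;
        if (r < 0) r += n;
        return static_cast<std::size_t>(r);
    }
    InnerMap& bucket(int key) { return buckets_[index_of(key)]; }
    const InnerMap& bucket(int key) const { return buckets_[index_of(key)]; }

    std::vector<InnerMap> buckets_;
};

// Choosing the inner implementation is a one-line decision.
using TreeBackedMap = HashedMapSketch<std::map<int, int>>;
```

Each operation runs only on one smaller inner map, and the outer class never needs to know how the inner map stores its pairs.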
Benefits of Layering
   One of the potential benefits of layering one
    implementation of Partial_Map on another is
    that we can use two strategies for improving
    performance at the same time.
       We could do this without layering, but it would make
        the Kernel very tricky to write, read, and debug.
   It is very common to use multiple strategies in
    a layered fashion for improving performance.
Looking Forward
   Partial_Map_Kernel_7 would be a useful
    choice for the buckets because we’d get
    two powerful performance improvements.
       The use of tree structures + hashing is an
        extremely common combo in database indexing.
   We’ll see more about PMK7 soon, because
    we will be implementing it for Lab 4.

						