Arrays and Hashing
Annatala Wolf
222 Lecture 5
Arrays in Programming
Most programming languages feature a
primitive “array” type that lets you store a
bunch of elements of the same type
together.
The reason arrays are useful is that, for
most languages, you can access any cell in
an array in constant time (well, almost).
This means no matter how big the array gets, it’s
just as fast to access array[n], for any n.
How Fast Access Works
Primitive arrays are usually modeled in the computer by
reserving a bunch of memory all in a row, at the same time.
If the memory is contiguous, then to access array[3], all the machine
has to do is take the location of the first datum and add (3 *
size of (type)) to get the location of the fourth datum.
If a character array’s [0] element is located here, then adding 3 * size of
(character) to [0] gives us the location of [3]:
[0] [1] [2] [3] [4] [5]
 H   e   l   l   o   !
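The address arithmetic above can be sketched in Python (not the course's Resolve/C++). This is only an illustration: the 4-byte integer element type and the helper names `make_block` and `read_element` are assumptions made for the example.

```python
import struct

ELEMENT_FORMAT = "<i"                            # assumed: 4-byte little-endian integers
ELEMENT_SIZE = struct.calcsize(ELEMENT_FORMAT)   # size of (type), in bytes


def make_block(values):
    """Pack the values back to back into one contiguous block of bytes."""
    block = bytearray(len(values) * ELEMENT_SIZE)
    for index, value in enumerate(values):
        struct.pack_into(ELEMENT_FORMAT, block, index * ELEMENT_SIZE, value)
    return block


def read_element(block, index):
    """Locate element `index` by arithmetic alone: offset = index * size of (type)."""
    offset = index * ELEMENT_SIZE
    return struct.unpack_from(ELEMENT_FORMAT, block, offset)[0]


block = make_block([10, 20, 30, 40, 50])
fourth = read_element(block, 3)   # looks 3 * ELEMENT_SIZE bytes past the start
```

No matter how long the block is, `read_element` does one multiplication and one read; it never walks past the earlier elements.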
Access Limitations
This works for fast access as long as two
things hold:
The memory needs to be contiguous
We need enough contiguous memory available to
reserve in the first place
We probably can’t extend the size later, since the
memory nearby will be taken up eventually
We can only access memory where the relative
location of [n] can fit into an Integer
This is rarely a limitation, but technically means it’s not
quite a constant-time operation
Problems with Arrays
In some languages (like C++), arrays have
the same safety issues that pointers have.
In fact, arrays are just like pointers in many ways
In Java, arrays and generics (Java’s
version of templates) should not be mixed,
because arrays check element types at run time while generics are
checked only at compile time, so type errors can result.
We’ll be working with array types that are
fully-fledged classes, not primitive arrays.
Static_Array
Intuitively, a static array is like a sequence of fixed
length, where the length is determined at compile time
Acts like a statically allocated array (i.e., type myArray[constant])
Each Static_Array type has its own preset bounds
Mathematically, it’s actually not another string of Item,
like Queue, Sequence, or Stack. (Surprise!)
Static_Array is modeled by an indexed table of Item
This is a finite set of (integer, Item) pairs, where each integer is
between lower and upper inclusive, and each integer in that range is
associated with exactly one Item.
The bounds are template parameters, so they are part of what defines the model.
Array Bounds
Each Static_Array type has a preset upper and lower
bound. These can be set to any integer value. You
could have an array from:
1 to 10
-50 to 50
-5 to -3
…but lower ≤ upper is required. Otherwise, you could make an
array type that could never hold information, which is silly.
Any time a Static_Array type object is created, it will
create a new object for each element of the array.
Selecting Array Bounds
[Comic from xkcd.com]
Static_Array Kernel Operations
“Accessor”: [pos]
requires: lower ≤ pos ≤ upper
Lower_Bound( )
requires: true
Upper_Bound( )
requires: true
The last two operations don’t seem necessary,
since the bounds are set at compile-time…but
they help prevent using “magic numbers” in code.
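A rough Python sketch of these three operations follows. It is not Resolve/C++: Python cannot fix the bounds at compile time, so this hypothetical `StaticArray` class takes them in its constructor, and it raises an error when the accessor's requires clause is violated (a choice made for the example).

```python
class StaticArray:
    """Hypothetical stand-in for a Static_Array type with preset bounds."""

    def __init__(self, lower, upper, make_item):
        if lower > upper:
            raise ValueError("lower <= upper is required")
        self._lower = lower
        self._upper = upper
        # One fresh item per index, created when the array itself is created.
        self._cells = [make_item() for _ in range(upper - lower + 1)]

    def lower_bound(self):
        return self._lower

    def upper_bound(self):
        return self._upper

    def _slot(self, pos):
        # requires: lower <= pos <= upper
        if not (self._lower <= pos <= self._upper):
            raise IndexError("position outside array bounds")
        return pos - self._lower

    def __getitem__(self, pos):
        return self._cells[self._slot(pos)]

    def __setitem__(self, pos, item):
        self._cells[self._slot(pos)] = item


readings = StaticArray(-5, -3, float)
readings[-4] = 2.5

# Loop over the whole array without writing -5 or -3 a second time.
total = 0.0
i = readings.lower_bound()
while i <= readings.upper_bound():
    total += readings[i]
    i += 1
```

The loop asks the array for its own bounds, which is the point of keeping Lower_Bound and Upper_Bound in the kernel.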
“Magic Numbers”
The term “magic numbers” in computer science refers to the
practice of littering code with numeric literals, such as:
while (counter < 11) {
    x = x - 5;
}
This practice leads to illegible code that is very difficult to
change later without errors. Use constants instead.
// at top of the section in code where these are used
object Integer_constant LINE_WIDTH = 10, BREAK_WIDTH = 5;
// much later on in code
while (counter < (LINE_WIDTH + 1)) {
    x = x - (BREAK_WIDTH);
}
Some style guidelines suggest the only magic numbers in your code should be 0, 1, and -1.
Instantiating Static_Array
#include "CT/Static_Array/Kernel_1_C.h"
concrete_instance
class Real_Vector_100 :
    instantiates
        Static_Array_Kernel_1_C <
            Real,   // type of Item
            1,      // lower
            100     // upper
        >
{};
Using Static_Array
main() {
    // create 100 new Real objects named terms[1]...terms[100]
    object Real_Vector_100 terms;
    Set_Values(terms); // pretend this sets all 100 values
    debug(Average(terms)); // tests the function below
}
// Get the average value of all the elements in vector
function_body Real Average(preserves Real_Vector_100& vector) {
    object Real sum;
    object Integer i = vector.Lower_Bound();
    while (i <= vector.Upper_Bound()) {
        sum += vector[i];
        i++; // advance to the next element
    }
    return sum / To_Real(vector.Upper_Bound() - vector.Lower_Bound() + 1);
}
Array
This is the dynamic version of Static_Array.
Just like Static_Array, except you don’t decide the array
bounds until after you create an object (at runtime).
The mathematical model for Array is slightly
different, since the lower and upper bounds can
change for different objects of the same type.
This makes the abstract model a 3-tuple:
(integer (self.lb),
integer (self.ub),
indexed table of Item (self.table))
Array Kernel Operations
“Accessor”: [pos]
Lower_Bound( )
Upper_Bound( )
Same as Static_Array, except the bounds now refer to abstract model
pieces self.lb and self.ub
Set_Bounds(lower, upper)
requires: true
resetting bounds will clear all array elements
Default value of Array: (1, 0, { })
In other words, it can’t contain anything yet!
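The 3-tuple model and the effect of Set_Bounds can be sketched in Python (not Resolve/C++). The class name `BoundedArray` and its use of a dictionary for the indexed table are assumptions made for illustration only.

```python
class BoundedArray:
    """Hypothetical model of Array as the 3-tuple (lb, ub, table)."""

    def __init__(self, make_item):
        self._make_item = make_item
        # Default value (1, 0, { }): no position satisfies 1 <= pos <= 0.
        self.lb, self.ub, self.table = 1, 0, {}

    def set_bounds(self, lower, upper):
        # Resetting the bounds throws away every old element and
        # creates a fresh item for each position in the new range.
        self.lb, self.ub = lower, upper
        self.table = {pos: self._make_item() for pos in range(lower, upper + 1)}

    def __getitem__(self, pos):
        return self.table[pos]      # KeyError when pos is out of bounds

    def __setitem__(self, pos, item):
        if pos not in self.table:
            raise KeyError(pos)
        self.table[pos] = item


names = BoundedArray(str)
size_before = len(names.table)      # the default array holds nothing

names.set_bounds(-19, 50)
names[-19] = "Ada"
size_after = len(names.table)       # one entry for each of -19 .. 50

names.set_bounds(0, 3)              # "Ada" is gone after this call
```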
Instantiating Array
#include "CT/Array/Kernel_1_C.h"
concrete_instance
class Array_Of_Text :
instantiates
Array_Kernel_1_C <Text>
{};
Using Array
main() {
    object Array_Of_Text names; // no Text objects yet
    // create 70 new Text objects called names[-19]...[50]
    names.Set_Bounds(-19, 50);
    DoStuff(names); // do other stuff with names
}
// procedure to print out all Text objects in the array
procedure_body Print ( preserves Array_Of_Text& items,
                       alters Character_OStream& out) {
    object Integer i = items.Lower_Bound();
    while (i <= items.Upper_Bound()) {
        out << items[i] << '\n';
        i++;
    }
}
Array Concepts
Resolve/C++ Array behaves a lot like native “array”
objects in C++ and Java. It’s useful when you want to
create and hold a bunch of objects all at once.
A common way of holding dynamic data is to reserve
space for X contiguous elements in advance, and keep
track of how many are actually “in use”.
If you can reserve all of it at once, access is very fast
You could use an Array to represent a Sequence. To
add an item in the middle of the Sequence, you’d have to
move all the items in the way one slot to the right. This
would work when there was a free slot to the right.
Reaching the Limit
But what do you do if you run out of slots completely?
You can’t add more without clearing the Array!
Solution: create a new, larger Array, and transfer all of
the elements over.
To do this quickly (over the long run), you must always
increase the Array by a fixed percentage of its current size.
This means that, averaged over the long run, access/add will still take constant time
Java uses this scheme for most of its array-backed collection objects
(to speed access). When a collection reaches its current capacity,
it increases the total number of elements it can hold by either
50% or 100%, depending on the implementation.
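To see why growing by a fixed percentage keeps adds cheap on average, here is a small Python sketch (not Resolve/C++ and not Java's actual code). The class `GrowableList`, its starting capacity of 4, and the `copies` counter are all made up for the demonstration.

```python
class GrowableList:
    """Hypothetical array-backed list that doubles its capacity when full."""

    def __init__(self, initial_capacity=4):
        self._slots = [None] * initial_capacity
        self._length = 0
        self.copies = 0     # total element transfers caused by growing

    def append(self, item):
        if self._length == len(self._slots):
            self._grow()
        self._slots[self._length] = item
        self._length += 1

    def _grow(self):
        # Create a new, larger array (100% bigger) and transfer everything.
        bigger = [None] * (2 * len(self._slots))
        for index in range(self._length):
            bigger[index] = self._slots[index]
            self.copies += 1
        self._slots = bigger

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._slots[index]


items = GrowableList()
for n in range(1000):
    items.append(n)

copies_per_append = items.copies / len(items)
```

With doubling, the total number of transfers stays below twice the number of appends, so the average cost per append is bounded by a constant. Growing by a fixed number of slots instead would make the total transfers grow quadratically.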
Using Array Elements
Before: H l l o ! \0 (we want to insert e after H)
After: H e l l o ! (the last slot is now used up)
This idea may work, if we’re not using the last
element for anything yet.
To get “Sequence-like” behavior, we’d have
to move over the elements manually (using
swap with Accessor, starting from right).
Extending an Array
Wait: no room! The Array R F L is full, and O still needs a slot.
Make a new, larger Array…
…and swap in the elements: R F L \0 \0 \0
If we needed to make an array “bigger”, we’d have
to swap over anything we wanted to keep.
(Set_Bounds( ) clears the elements of the Array.)
Sequence_Kernel_25::Remove(pos, x)
Get into groups of 2 to 5!
rep_field_name(Rep, 0, Array_Of_Item, array);
rep_field_name(Rep, 1, Integer, length);
/*! correspondence
there exists string of Item: a where (self * a = self.array) and
|self| = self.length
convention
self.array.lb = 0 and
there exists integer: k where (self.array.ub = 2^k - 1) and
if |self| = 0 then self.array.ub = 31 !*/
// In English: the sequence is all of the elements of the array from slots [0]
// to [self.length-1]. The array is initialized with indexes 0 to 31.
// Whenever we need more space, we double the size of the array.
// Note: Remove has a special contract case to fulfill when |self| becomes 0!
Remove(pos, x)
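procedure_body Remove( preserves Integer pos, produces Item& x )
{
    // swap the removed item out; whatever x held now sits in slot [pos]
    x &= self[array][pos];
    // pull each later item one slot to the left; the leftover value ends
    // up in slot [length - 1], so we never touch a slot past the bounds
    object Integer pull_left = pos;
    while (pull_left < self[length] - 1) {
        self[array][pull_left] &= self[array][(pull_left + 1)];
        pull_left++;
    }
    self[length]--;
    if (self[length] == 0) {
        self.Clear();
        // (alternately): self[array].Set_Bounds(0, 31);
    }
}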
Thinking about Remove( )
It’s possible this could leave crap in the array…but
that’s okay! This contract doesn’t say anything
about stuff outside of the part that represents self.
Before Remove(1, x): array = H i ! \0, x = #
After Remove(1, x): array = H ! # \0, x = i
Kernel Implementation
As we’ve seen, implementing a Kernel
consists of two separate ideas
Defining the Representation for the data that the
Kernel will hold (and its correspondence)
Writing the algorithms for Kernel operations that
access the data (possibly using a convention)
Both of these involve design choices
Example: Partial_Map_Kernel_1
Data:
Queue_Of_Record (order irrelevant)
Code:
Must search for a specific record when that record needs
to be accessed
The data structure you choose for Rep will
constrain the algorithms you can use to implement
the Kernel operations
Performance is determined by these choices, as
well as the performance of Rep components
Asymptotic Bounds
In computer science, the most important
performance metric is often the asymptotic
bounds of an algorithm
how much longer does it take to perform a task,
as the size of the task gets larger
ignores differences of a constant factor—the
point is which will dominate in the long run
Computers are fast, so small tasks take
virtually no time
Simple Running Times
Constant time O(1): takes the same amount of time for any size data
Log time O(lg n): each time you double the data size, it takes an
additional segment of time (very fast—nearly as good as constant)
Linear time O(n): when data doubles in size, so does the time it takes
Log-linear time O(n lg n): this is slower than linear, but fast enough
for most data sets (fastest comparison-based sorting algorithms here)
Quadratic time O(n^2): if you double the data, you quadruple the time it
takes (slower, but still possible; slow sorting algorithms are here)
Exponential time O(2^n): intractable!
no matter how good computers get, you’ll never get an answer if n > 200
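The running-time classes above can be compared numerically. This Python sketch (not Resolve/C++) prints rough step counts for two input sizes; the function name `step_counts` and the sizes 16 and 32 are arbitrary choices for the example.

```python
import math


def step_counts(n):
    """Rough step counts for each running-time class at input size n."""
    return {
        "constant": 1,
        "log": math.log2(n),
        "linear": n,
        "log-linear": n * math.log2(n),
        "quadratic": n ** 2,
        "exponential": 2 ** n,
    }


small = step_counts(16)
large = step_counts(32)   # the data doubled in size

for name in small:
    print(f"{name:12} {small[name]:>12.0f} {large[name]:>14.0f}")
```

Doubling n adds one step to the log row, doubles the linear row, quadruples the quadratic row, and squares the exponential row.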
Traveling Salesman (NP-Complete)
Some solutions are better than others…
[Comic from xkcd.com]
Asymptotic Growth Functions
[Figures: growth-function plots, not reproduced here]
Asymptotic Bounds: All That?
But are asymptotic bounds all we care
about?
Compare n^2 operations with 8000n
Sure, n^2 is slower, but only for data sets > 8000
In the real world, constant factors can still
make a big difference!
This is mainly true for slower systems, or
systems that have critical timing (real-time graphics)
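The crossover between a quadratic cost and a linear cost with a large constant factor can be checked directly. A minimal Python sketch (not Resolve/C++), with made-up function names and sample sizes:

```python
def quadratic_cost(n):
    return n * n


def linear_cost(n, factor=8000):
    return factor * n


sizes = [100, 1000, 10000, 100000]
cheaper = {
    n: "quadratic" if quadratic_cost(n) < linear_cost(n) else "linear"
    for n in sizes
}
```

Below 8000 elements the "slower" quadratic algorithm does less work; above 8000 the linear one wins, and its lead keeps widening.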
Pizza Parlor Example
Consider a pizza parlor with the following
business rule for phone orders: the server
must look you up in the system by your phone
number before taking your order
The computer hardware may be suboptimal, but it
shouldn’t need to be fancy because the software is
simple
They are using a Partial_Map to access your
information from your phone number (this is a natural
choice)…specifically, Partial_Map_Kernel_1
Pizza Parlor Example
Now that the pizza parlor has become popular, they
have more than 1000 customers—and it takes up to
30 seconds to retrieve a customer’s information.
This is an unacceptable delay! How can we fix this?
Ideally, we want to knock the wait time down to less
than one second.
A different implementation of Partial_Map may be
required, for performance reasons.
Recall that PMK1 uses unordered Queue of Record
Running Time For
Partial_Map_Kernel_1
Define: ?
Undefine: ?
Undefine_Any: ?
Accessor: ?
Is_Defined: ?
Size: ?
Constructor: ?
Destructor: ?
Swap: ?
Clear: ?
Running Time For
Partial_Map_Kernel_1
Define: constant
Undefine: linear
Undefine_Any: constant
Accessor: linear
Is_Defined: linear
Size: constant
Constructor: constant
Destructor: linear (amortized: constant)
Swap: constant
Clear: linear (amortized: constant)
Linear Search
Linear search is responsible for the linear
running times in Partial_Map_Kernel_1.
If you look at Partial_Map_Kernel_1, you’ll
get a good idea of how this works (the
implementation is online, in the RESOLVE
component catalog).
Each time you need an item, it moves the
Representation’s queue to put the item at front
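The "move the item to the front" search can be sketched in Python (not Resolve/C++). This is not the catalog's Partial_Map_Kernel_1 code, only an illustration of the idea with a `deque` standing in for the queue of records; `move_to_front` and the sample phone numbers are invented.

```python
from collections import deque


def move_to_front(records, key):
    """Rotate the queue until the record with `key` is at the front.

    Returns True if the key was found. If it was not, the queue has gone
    through one full rotation and is back in its original order.
    In the worst case every record is inspected: linear time.
    """
    for _ in range(len(records)):
        if records[0][0] == key:
            return True
        records.append(records.popleft())   # dequeue, then enqueue at the back
    return False


customers = deque([
    ("555-0101", "Ann"),
    ("555-0102", "Bo"),
    ("555-0103", "Cy"),
])

found = move_to_front(customers, "555-0103")
front_name = customers[0][1]
missing = move_to_front(customers, "555-0199")
```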
Is Improvement Possible?
It may not always be possible to improve on the
asymptotic bounds of an algorithm.
Any method of holding a wide variety of information
in a collection will require some additional cost as
the size of the data increases.
But sometimes improving by a constant factor is
sufficient, in particular, when the size of the data
can be bounded.
One Solution: Hashing
Hashing is a way of improving the
performance of a linear search algorithm by
a constant factor.
Key idea: we can reduce the time to search
for an item if we place items into multiple
buckets.
For this to work, we have to (quickly!) know
which bucket an item will be found in.
Hashing For Dummies
To add an item to a collection:
run a hash function on the item, to get its hash value
fit the hash value to an array (value mod array_size)
store the object in the kth bucket in the array, where k is
the fitted hash value
To access an item in the collection:
run the same hash function on the item
fit the hash value to the array, in the same way
look in the kth bucket, where k is the fitted hash value
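Both procedures above can be sketched in a few lines of Python (not Resolve/C++). The bucket table is a list of lists, and `digit_sum` is a deliberately simple, hypothetical hash function chosen only to make the example concrete.

```python
def make_table(size):
    """An array of `size` empty buckets."""
    return [[] for _ in range(size)]


def bucket_index(item, table, hash_function):
    hash_value = hash_function(item)    # run the hash function on the item
    return hash_value % len(table)      # fit the hash value to the array


def add(item, table, hash_function):
    k = bucket_index(item, table, hash_function)
    table[k].append(item)               # store the item in the kth bucket


def contains(item, table, hash_function):
    k = bucket_index(item, table, hash_function)
    return item in table[k]             # search only the kth bucket


def digit_sum(phone):
    """Hypothetical hash function: the sum of the digits in a phone string."""
    return sum(int(ch) for ch in phone if ch.isdigit())


table = make_table(8)
for phone in ["614-555-0101", "614-555-0102", "740-555-0199"]:
    add(phone, table, digit_sum)
```

The search inside one bucket is still linear, but each bucket holds only a fraction of the items.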
Why Separate Hashing?
We separate the hash function from the “fitting it to
the bounds of the array” problem, so that a client
can use the same function with many array sizes.
We’ll use the Hash function as a utility class
In RESOLVE, you should assume the hash value
you get back from a hash function can be any
Integer value (even negative).
To fit it to a table indexed from 0 to table_size - 1:
array_index = hash_value mod table_size
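Negative hash values need care when the language's remainder operator keeps the sign of its left operand, as `%` does in C++ and Java. The Python sketch below (not Resolve/C++) imitates that behavior with a hypothetical `truncated_remainder` helper and then corrects it; whether the course's `mod` needs this correction is not stated here, so treat it as an assumption.

```python
def truncated_remainder(a, b):
    """Remainder with the sign of `a`, imitating `%` in C++ and Java (b > 0)."""
    magnitude = abs(a) % b
    return -magnitude if a < 0 else magnitude


def fit_to_table(hash_value, table_size):
    """Map any integer hash value to an index in 0 .. table_size - 1."""
    remainder = truncated_remainder(hash_value, table_size)
    if remainder < 0:
        remainder += table_size     # shift negative results into range
    return remainder


indexes = [fit_to_table(h, 5) for h in (-7, -5, -1, 0, 7)]
```

Python's own `%` already returns a non-negative result for a positive table size, so in Python itself `hash_value % table_size` would be enough.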
Again With The Pizza Parlor
If we implement Partial_Map with:
an Array_Of_Queue_Of_Record
using a hash function to index into the array
At minimum, how large must our array be in order
to reduce the wait time from 30s to 1s?
Does this add a lot of data to our representation?
What if we want to plan for 10x customer growth over the
next year?
If our hash table gets TOO large, however, it becomes a
challenge to write a good hash function for the data…
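One way to reason about the first question, sketched in Python (not Resolve/C++). It rests on two stated assumptions: lookup time is proportional to the number of records searched, and the hash function spreads records perfectly evenly over the buckets. The function name `buckets_needed` is made up.

```python
import math


def buckets_needed(current_wait, target_wait):
    """Smallest bucket count that cuts a linear search to the target time.

    Assumes the wait is proportional to the number of records searched and
    that records are spread evenly, so k buckets divide the wait by k.
    """
    return math.ceil(current_wait / target_wait)


today = buckets_needed(30, 1)               # worst case is 30 seconds now
after_growth = buckets_needed(30 * 10, 1)   # 10x the customers, 10x the wait
```

A real hash function will not spread the data perfectly, so in practice the table would need some headroom beyond these minimums.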
Writing The Hash Function
The hash function returns an Integer value
ideally, values should have a wide range
A hash function must be deterministic. It must return the same
thing for a particular item, every single time it’s run.
Else, we might add x to [2], and try to remove x from [3]
So a random function is always wrong (unless it’s also
deterministic…?)
[Comic from xkcd.com]
No Perfect Hash Function
But even a legal hash function can be bad. Not
all hash functions are created equal.
If a hash function dumps everything into the same bucket,
it doesn’t save us any time at all!
An ideal hash function has a flat distribution
Most data doesn’t lend itself to a flat distribution…
Any hash function can be broken by bad data. If I
know your hash function, I can always pass it data
that will break it (data that only hits one bucket).
Pizza With Hash…
The pizza parlor must make a hash
function based on the D_Item in its
Partial_Map (customer phone numbers)
What would be an example of a bad hash
function?
What would be an example of a good hash
function?
How important is knowing the shape of the data
to implementing a hash function?
Hash Test
Let’s write a hash function for pizza phone numbers. Assume that
each phone number is represented as a triple of Integers:
utility_function Integer Phone_Hash (
preserves Integer areacode,
preserves Integer prefix,
preserves Integer suffix
);
/*!
ensures
Phone_Hash = VALID_HASH of
(areacode, prefix, suffix)
!*/
Hash Test Answer
Answers may vary!
There’s no “right” answer, but some are more
robust than others—if we make assumptions
about the shape of the data (often, this is easy)
What problems might these hashes have?
return areacode;
return suffix mod 10;
return prefix * suffix;
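One way to probe these three candidates is to count how many buckets each one actually uses. The Python sketch below (not Resolve/C++) uses invented customer data: one area code, two even prefixes, and suffixes 1000 through 1199. The table size of 100 is also an assumption.

```python
from collections import Counter


def bucket_counts(hash_function, phones, table_size):
    """Count how many phone numbers land in each bucket."""
    return Counter(hash_function(*phone) % table_size for phone in phones)


def by_areacode(areacode, prefix, suffix):
    return areacode


def by_last_digit(areacode, prefix, suffix):
    return suffix % 10


def by_product(areacode, prefix, suffix):
    return prefix * suffix


# Made-up local customers: same area code, prefix 292 or 294.
phones = [(614, 292 + 2 * (s % 2), s) for s in range(1000, 1200)]

TABLE_SIZE = 100
used_by_areacode = len(bucket_counts(by_areacode, phones, TABLE_SIZE))
used_by_last_digit = len(bucket_counts(by_last_digit, phones, TABLE_SIZE))
product_buckets = bucket_counts(by_product, phones, TABLE_SIZE)
```

On this data the area code sends every customer to one bucket, the last digit can never reach more than 10 buckets however large the table is, and the product only ever lands in even-numbered buckets because both prefixes are even.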
Collisions
Collisions are when things land in the same
bucket. These are unavoidable, but you want
to spread out the data as evenly as possible.
Can happen if hash function doesn’t fit the data well
Can happen if hash function returns values that are
too small to take advantage of a larger table
Multiplication in hash functions may clump common
factors together, making the hash table more
“ragged”; a “smoothing” effect may need to be added
adding a small value, modulus with prime number, etc.
Other Hashing Concerns
A hash function must be deterministic and
based only on the value of the item it takes
as its parameter.
A hash function must be quick (otherwise it
saves us nothing overall).
Hashing won’t work forever. As the array
size increases, it becomes harder to write a
hash function which exploits the array size.
Layering for Performance
Partial_Map_Kernel_2 uses an
Array_Of_Queue_Of_Record to hold the
(D_Item, R_Item) pairs of the Partial_Map.
Seems like a sensible approach for a hash…
However, Partial_Map_Kernel_6 is
represented by an Array_Of_Partial_Map!
Why on Earth would we use Partial_Map to
implement Partial_Map?
Understanding PMK6
The reason we can layer Partial_Map on
top of Partial_Map is we use two different
versions of Partial_Map.
The outer version uses hashing to improve
performance, so Define, Undefine, etc. need be
run only on smaller Partial_Map components.
The inner version can use whatever method it
wants, as long as it doesn’t depend upon the
outer version.
Code Dependencies
In Resolve we make a big deal about avoiding
concrete-to-concrete dependencies, but we still
have them in places.
They appear when we select a particular implementation of something to
work with something else in all cases; for example,
Set_Kernel_1 uses Queue_Kernel_1a, not Queue_Base.
If we did use Queue_Base, then when the client
instantiated Set_Kernel_1, the dependency would appear.
But since this is a simple template declaration by the
client we don’t think of it as being “all-that-dependent”.
Cyclic Dependencies
However, if many components are layered, even if
these dependencies are “loosely structured”, it’s still
important to prevent cyclic dependencies.
If PMK6 uses PMK7, and PMK7 uses PMK6…then
we have a big problem. This will break at compile
time, though, and we’ll know what we did wrong.
A dependency that covers a larger cycle may be
harder to isolate and fix, however.
For example, PMK6 uses QK5, which uses QK1, which uses PMK6.
Project Dependencies
A more common problem that is easy to
miss appears when code in large systems
features hidden dependencies.
It may be the case that you always compile
projects A and B together, and over time,
these projects may end up dependent on
each other—such that you can no longer
compile either one separately.
Different Approaches
Partial_Map_Kernel_6 uses Partial_Map_Kernel_1
as its implementation for the buckets. This yields
comparable performance to Partial_Map_Kernel_2
(which implements the same algorithm as PMK1
directly, plus the hashing).
However, we could design Partial_Map_Kernel_6 to
use Partial_Map_Kernel_7 to hold the buckets (by
changing only one number in PMK6’s code). This
would give us the benefits of PMK6 (hashing) and
PMK7 (binary search tree), together.
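The layering idea can be sketched in Python (not Resolve/C++). Both classes are hypothetical: `ListMap` plays the role of a simple linear-search map, and `HashedMap` plays the role of the outer hashing layer. Neither is the actual PMK1 or PMK6 code.

```python
class ListMap:
    """Hypothetical inner map: unordered pairs, searched linearly."""

    def __init__(self):
        self._pairs = []

    def define(self, key, value):
        # requires: key is not already defined
        self._pairs.append((key, value))

    def is_defined(self, key):
        return any(k == key for k, _ in self._pairs)

    def lookup(self, key):
        for k, value in self._pairs:
            if k == key:
                return value
        raise KeyError(key)


class HashedMap:
    """Hypothetical outer map: picks a bucket by hashing, then delegates."""

    def __init__(self, table_size, hash_function, make_bucket=ListMap):
        self._hash = hash_function
        # Each bucket is itself a complete (smaller) map.
        self._buckets = [make_bucket() for _ in range(table_size)]

    def _bucket(self, key):
        return self._buckets[self._hash(key) % len(self._buckets)]

    def define(self, key, value):
        self._bucket(key).define(key, value)

    def is_defined(self, key):
        return self._bucket(key).is_defined(key)

    def lookup(self, key):
        return self._bucket(key).lookup(key)


customers = HashedMap(16, hash_function=lambda phone: phone[2])
customers.define((614, 555, 101), "Ann")
customers.define((614, 555, 117), "Bo")    # 117 % 16 == 101 % 16: same bucket
```

Passing a different `make_bucket` (for example, a tree-based map with the same three methods) changes the inner strategy without touching the hashing layer, which is the Python analogue of changing one number in PMK6's code.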
Benefits of Layering
One of the potential benefits of layering one
implementation of Partial_Map on another is
that we can use two strategies for improving
performance at the same time.
We could do this without layering, but it would make
the Kernel very tricky to write, read, and debug.
It is very common to use multiple strategies in
a layered fashion for improving performance.
Looking Forward
Partial_Map_Kernel_7 would be a useful
choice for the buckets because we’d get
two powerful performance improvements.
The use of tree structures + hashing is an
extremely common combo in database indexing.
We’ll see more about PMK7 soon, because
we will be implementing it for Lab 4.