                   Searching
Mechanical:
   Linear
   Sorting
   Hashing
Logical:
   Boolean
   Best match
Intellectual:
   Strings
   Fields
   Concepts
      From keys to documents
The mechanical problem: given a set of words (or other
things to look for) appearing in a set of documents, how
to find the documents in which those words appear.
Goal: to do so quickly, but also flexibly.
Questions to ask:
  Is a lot of preprocessing involved?
  Can you search for complex expressions?
  Can you add and delete items?
  Is extra storage needed?
  Must everything fit in main memory: i.e. not use disk?
            Searching text files
Linear scan (grep): not for very big collections, no update
problem

Inverted files: tries, or just divide by blocks
May wish to compress occurrence lists, index by both ends,
allow fielded searching, and keep frequency information

Signature files: electronic edge-notched cards, trading space
for false drops

Bitmaps: best for very common words; add to inverted files

Clustering: for complex searching, summarizing results

Case folding, suffixing, stop lists.
                      Zipf’s law
Frequency of occurrence is inversely proportional to rank.
For example, if the most frequent word in English (“the”) is about
1/10 of the language, then the next most frequent word (“and”) will
be about 1/20, and the next (usually “of”) will be about 1/30.
Also described as “anything plotted on log paper is linear”.

George Kingsley Zipf (1902-1950), a linguist; see his book
“Human Behavior and the Principle of Least Effort”
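A quick way to see this for yourself: the sketch below (Perl, assuming plain
text files named on the command line; the word regex is a simplification)
counts words and prints rank, frequency, and rank * frequency, which Zipf's
law predicts should stay roughly constant near the top of the list.

  use strict;
  use warnings;

  # Count word frequencies across whatever files are named on the command line.
  my %count;
  while (<>) {
      $count{lc $_}++ for /([a-z']+)/gi;
  }

  # Sort by descending frequency; under Zipf's law, rank * frequency
  # should be roughly constant for the high-ranking words.
  my $rank = 0;
  for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
      $rank++;
      printf "%6d %8d %10d  %s\n", $rank, $count{$word}, $rank * $count{$word}, $word;
      last if $rank >= 50;    # only the most frequent words are interesting here
  }
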
           Simplest: linear scan
Just go through a list one item at a time looking.
Good: no extra space needed, query can be complex,
and updating is easy.
Bad: slow. On big files, impossibly slow.


In Perl:
  while (<>) {
      print if /searchterm/;    # print each matching line
  }
                Binary search
Sort the items into order. When searching, poke in the
middle. Then see which half the thing you’re looking for
is in; and then repeat until you’re down to one item.


Takes log2N steps for N items, instead of N/2.
But: all you can search for is a plain string, not a complex
expression. With linear scan, if somebody wants to
search for *earch* there’s no problem.
Updating requires inserting the new item, which may
mean moving half the old ones.
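A minimal sketch of the idea in Perl (the word list is made up; a real system
would search a sorted file or index rather than a small in-memory array):

  use strict;
  use warnings;

  # Binary search over a sorted array of strings; returns the index of
  # the item, or -1 if it is not there.
  sub binary_search {
      my ($sorted, $target) = @_;
      my ($lo, $hi) = (0, $#$sorted);
      while ($lo <= $hi) {
          my $mid = int(($lo + $hi) / 2);         # poke in the middle
          my $cmp = $sorted->[$mid] cmp $target;
          if    ($cmp < 0) { $lo = $mid + 1 }     # target is in the upper half
          elsif ($cmp > 0) { $hi = $mid - 1 }     # target is in the lower half
          else             { return $mid }
      }
      return -1;
  }

  my @words = sort qw(bee beer bet peer pet);
  print binary_search(\@words, 'pet'), "\n";      # prints 4
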
                            Tries
Make a set of linked items, letter by letter. Follow along
as you are looking for the word.
Fast: number of letters is likely to be logarithmic in the
number of words; update is easy.
But: uses extra storage, and everything should fit in
random access memory.


[Figure: trie for the words bee, beer, bet, pet, and peer.]
Tries were invented by Ed Sussenguth in 1963.
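A sketch of a trie as nested Perl hashes, built from the slide's example
words; the end-of-word marker '$' is an arbitrary choice here, not anything
standard.

  use strict;
  use warnings;

  # Build the trie: one hash level per letter; '$' marks a complete word.
  my %trie;
  for my $word (qw(bee beer bet pet peer)) {
      my $node = \%trie;
      $node = ($node->{$_} //= {}) for split //, $word;
      $node->{'$'} = 1;
  }

  # Follow the letters of the query word down the trie.
  sub in_trie {
      my ($trie, $word) = @_;
      my $node = $trie;
      for my $letter (split //, $word) {
          return 0 unless $node = $node->{$letter};
      }
      return $node->{'$'} ? 1 : 0;
  }

  print in_trie(\%trie, 'beer') ? "found\n" : "not found\n";   # found
  print in_trie(\%trie, 'be')   ? "found\n" : "not found\n";   # not found: only a prefix
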
                Hash storage.
Compute a random number from each search key. Then
go to that address in memory.
Suppose the hashing function is: count the letters, take the
remainder when divided by three (a terrible function).
            Bucket     Content
            0          Cat, dog, rabbit
            1          Wolf, giraffe, buffalo
            2          Sheep, elephant

Very fast. But uses extra storage, and you have to
resolve the collisions, and deletions are a pain.
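The slide's (deliberately terrible) hash function in Perl: bucket = word
length mod 3, with collisions simply kept as lists in each bucket. This
reproduces the table above.

  use strict;
  use warnings;

  # The deliberately bad hash function from the slide: word length mod 3.
  sub bucket { return length($_[0]) % 3 }

  # Insert each animal; collisions pile up as a list inside the bucket.
  my @table;
  push @{ $table[ bucket($_) ] }, $_
      for qw(Cat dog rabbit Wolf giraffe buffalo Sheep elephant);

  # Lookup: hash the key, then check only that one bucket.
  my $key   = 'giraffe';
  my $found = grep { $_ eq $key } @{ $table[ bucket($key) ] // [] };
  print $found ? "found $key in bucket " . bucket($key) . "\n" : "no $key\n";
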
Grab – an example compromise

Grab was an attempt to balance the speed of inversion against the
compactness of linear search.

Bitmap vectors on hashed words, compressed from 10 bits to 4 bits.
Go back later and cast out false drops.

For 5% extra space, get 90% speedup on linear.

Never caught on. Space is too cheap today, and files are too
big. Might as well use full inversion.
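Grab's actual format isn't described in detail here; the sketch below only
illustrates the general signature idea it builds on: hash each word of a
document to one bit in a small per-document bitmap, use the bitmaps to reject
most documents cheaply, then rescan the few survivors to cast out false
drops. The 256-bit width and the MD5-based hashing are illustrative choices,
not Grab's.

  use strict;
  use warnings;
  use Digest::MD5 qw(md5);

  my $BITS = 256;    # illustrative signature width

  # Hash a word to one bit position in the signature.
  sub bitpos { return unpack('N', md5(lc $_[0])) % $BITS }

  # One signature bitmap per document: set a bit for every word in it.
  sub signature {
      my ($text) = @_;
      my $sig = "\0" x ($BITS / 8);
      vec($sig, bitpos($_), 1) = 1 for $text =~ /(\w+)/g;
      return $sig;
  }

  my %docs = (
      doc1 => 'the cat sat on the mat',
      doc2 => 'dogs and wolves howl at night',
  );
  my %sig;
  $sig{$_} = signature($docs{$_}) for keys %docs;

  # A candidate must have every query-word bit set in its signature ...
  my @query      = qw(cat mat);
  my $qsig       = signature("@query");
  my @candidates = grep { ($sig{$_} & $qsig) eq $qsig } keys %docs;

  # ... then rescan the candidates to cast out false drops.
  my @hits = grep {
      my $text = $docs{$_};
      !grep { $text !~ /\b\Q$_\E\b/i } @query;    # every word really present
  } @candidates;

  print "hits: @hits\n";    # doc1
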
              Why not a DBMS?

Why don't text retrieval systems use a DBMS underneath?

Few numerical entries, and vast numbers of items
Special needs, such as index browsing and truncation
searching

Input is not neatly structured into records, and items of variable
length may have to be retrieved.

Not much updating.

Parallel searching: just coming into vogue.
                  Cheap trick
Suppose you have ten thousand items. Sort them into
order. Take every 100th. There will be 100 such; scan the
list of 100 by linear scan, then that 1% of the full list.
The algorithmic types will complain that this takes (on
average) 100 operations (two 50-item sequential scans),
while binary search would take about 14.
But if a random disk fetch (a seek) takes 30 times as long as a
sequential disk read (realistic), then 2 disk fetches plus 50
sequential reads are faster than 14 disk fetches.
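A sketch of the two-level trick in Perl, with ten thousand made-up items
standing in for the file; only the 100-entry top index would need to stay in
memory.

  use strict;
  use warnings;

  my $BLOCK = 100;

  # Ten thousand sorted items standing in for the file.
  my @items = sort map { sprintf("item%05d", $_) } 1 .. 10_000;

  # The small top-level index: every 100th item.
  my @top = map { $items[$_ * $BLOCK] } 0 .. int($#items / $BLOCK);

  sub lookup {
      my ($target) = @_;
      # Linear scan of the ~100-entry top index to pick the right block ...
      my $block = 0;
      $block++ while $block < $#top && $top[$block + 1] le $target;
      # ... then a linear scan of that one 100-item block.
      for my $i ($block * $BLOCK .. ($block + 1) * $BLOCK - 1) {
          last if $i > $#items;
          return $i if $items[$i] eq $target;
      }
      return -1;
  }

  print lookup('item04217'), "\n";    # 4216
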
    What do real systems do?
They use “inverted files” to cope with the very large
amounts of material to index. Go through each
document, find the words, and make another file which for
each word tells you which documents it was found in.
Use various compression tricks on the list of documents,
e.g. store only the differences between the locations of
successive occurrences of a word.
For very frequent words, just store a vector of yes/no.
Store the list with the words forwards and backwards to
allow initial-truncation.
Keep separate files of updates.
Rely heavily on caching (storing recent results).
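A toy inverted file in Perl, including the gap trick mentioned above: each
word's posting list is stored as differences between successive document
numbers (the three documents are made up).

  use strict;
  use warnings;

  my @docs = (
      'the cat sat on the mat',
      'the dog chased the cat',
      'elephants never forget',
  );

  # Build the postings: for each word, the sorted list of document numbers.
  my %postings;
  for my $docno (0 .. $#docs) {
      my %seen;
      $seen{lc $_} = 1 for $docs[$docno] =~ /(\w+)/g;
      push @{ $postings{$_} }, $docno for keys %seen;
  }

  # Delta-encode each list: keep only the gap from the previous entry.
  my %gaps;
  for my $word (keys %postings) {
      my @list = sort { $a <=> $b } @{ $postings{$word} };
      my $prev = 0;
      $gaps{$word} = [ map { my $g = $_ - $prev; $prev = $_; $g } @list ];
  }

  # Answering a query means decoding the gaps back into document numbers.
  sub docs_for {
      my ($word) = @_;
      my ($docno, @result) = (0);
      push @result, $docno += $_ for @{ $gaps{lc $word} // [] };
      return @result;
  }

  print join(',', docs_for('cat')), "\n";    # 0,1
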
  What do the search engines do?

Very large inverted files and parallel search engines on a great many
machines (thousands).

Big caches. They may search only in the cache and avoid all disk
delays.

They are willing to give different results depending on what data is
in the cache.
              Logic of search
Boolean search: logical “and” and “or” – contrast with
Coordination level: as many of the words as you can find.
Boolean search often confuses users. Transcript of a
user OPAC session (20 years ago):
Navajos and magic
    No hits.
Navajos and magic and Arizona
Term weights: does it matter if a word appears more than
once? Do words have inherent importance?
Phrases: should words in sequence matter?
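A sketch of the contrast (documents and query are made up): Boolean AND keeps
only documents containing every term, which here is none of them;
coordination-level ranking just orders documents by how many of the terms
they contain.

  use strict;
  use warnings;

  my %docs = (
      d1 => 'navajo magic and ritual',
      d2 => 'gardening in arizona',
      d3 => 'weather patterns of the southwest',
  );
  my @query = qw(navajo magic arizona);

  # How many of the query words appear in this document?
  sub coordination {
      my ($text) = @_;
      return scalar grep { $text =~ /\b\Q$_\E\b/i } @query;
  }

  # Boolean AND: only documents containing every term qualify; here, none do.
  my @boolean = grep { coordination($docs{$_}) == @query } sort keys %docs;

  # Coordination level: rank documents by how many terms match, best first.
  my @ranked = sort { coordination($docs{$b}) <=> coordination($docs{$a}) }
               keys %docs;

  print "boolean and:  @boolean\n";    # nothing qualifies
  print "coordination: @ranked\n";     # d1 (2 terms), then d2 (1), then d3 (0)
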
        Collaborative ranking and filtering
Google is the best known search engine; it derives from “backrub”
at the Stanford digital library project.
See: http://www-db.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page.


Simply: pages pointed to by a lot of other people are probably
better.
Other work from Jon Kleinberg at Cornell has looked at links in
both directions, and this is all related to “collaborative filtering”.
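Not Brin and Page's implementation, just a toy version of the underlying idea
over a made-up four-page link graph: each page repeatedly passes its score
along its outgoing links, so pages that many others point to end up with high
scores. The damping factor 0.85 is the value mentioned in their paper.

  use strict;
  use warnings;

  # A made-up link graph: page => pages it links to.
  my %links = (
      A => [ 'B', 'C' ],
      B => [ 'C' ],
      C => [ 'A' ],
      D => [ 'C' ],
  );
  my @pages = sort keys %links;
  my $d     = 0.85;                       # damping factor

  my %rank;
  $rank{$_} = 1 / @pages for @pages;      # start everyone out equal

  for (1 .. 50) {                         # iterate until the scores settle
      my %next;
      $next{$_} = (1 - $d) / @pages for @pages;
      for my $page (@pages) {
          my @out = @{ $links{$page} };
          $next{$_} += $d * $rank{$page} / @out for @out;
      }
      %rank = %next;
  }

  printf "%s %.3f\n", $_, $rank{$_} for sort { $rank{$b} <=> $rank{$a} } @pages;
  # C comes out on top: everything points to it, directly or indirectly.
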
     How represent meaning?
Words. What about affixes? Phrases?
Fields: words in some relationship.
Concepts: either manually made thesauri or some
statistical equivalent.
“bush”, or
  title:  Roughing it in the bush / Moodie, S.
  author: Bush, Vannevar, 1890-1974
             Probabilistic indexing
How important is a word? Ideally, we would know whether it appears in
  a relevant document. We’re not likely to know that, but we
  could learn which words are useful and which aren’t.
a) TF-IDF: term frequency weighted by inverse document frequency.
   A word is more important if it appears lots of times, but not if
   it appears in every document (see the sketch after this list).
b) Weightings based on whether it appears in useful
   documents, or other people’s searches, or something.
c) We can also weight documents: the Google idea.
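A sketch of (a) in Perl, over three made-up documents; the weight is term
frequency times log(number of documents / number of documents containing the
term), one common form of TF-IDF.

  use strict;
  use warnings;

  my @docs = (
      'the cat sat on the mat',
      'the dog sat on the log',
      'the cats and the dogs',
  );

  # Document frequency: in how many documents does each word appear?
  my %df;
  for my $doc (@docs) {
      my %seen;
      $seen{lc $_} = 1 for $doc =~ /(\w+)/g;
      $df{$_}++ for keys %seen;
  }

  # TF-IDF weight of a word within one document.
  sub tfidf {
      my ($word, $doc) = @_;
      my $tf = () = $doc =~ /\b\Q$word\E\b/gi;    # occurrences in this document
      return 0 unless $tf && $df{lc $word};
      return $tf * log(@docs / $df{lc $word});
  }

  printf "the: %.3f   cat: %.3f\n", tfidf('the', $docs[0]), tfidf('cat', $docs[0]);
  # 'the' scores 0 (it is in every document); 'cat' gets a positive weight.
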

Keith van Rijsbergen (left) and Karen Spärck Jones (right): early
work on probabilistic indexing.
                  What’s in a word?
[Chart: relative frequency of the words "English" and "their" across
Dewey category ranges (000-900), for a collection of 39,000 titles;
each word has about 300 occurrences.]
            Defining concepts
You can translate each word to a thesaurus category, if
  you have a thesaurus.
Is this worth doing? Not really, based on old experiments.
Yet there are lots of vocabulary confusions.
     Latent Semantic Analysis
One problem with words is that there are too many of
  them, and thus the typical word doesn’t appear in a
  random document.
It would be more convenient if we had a smaller number
   of words but each appeared in every document; we’d
   have better measures of similarity.
We can do this by replacing words by linear combinations
 of them; to do this, analyze the term-document matrix.
Sue Dumais: lead investigator on latent semantic analysis, which does
this by singular value decomposition of the matrix.
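In symbols (standard truncated-SVD notation; the slides themselves don't give
the formula): the term-document matrix A is approximated by keeping only the
k largest singular values,

  A ≈ A_k = U_k Σ_k V_k^T

so each document is represented by its k-dimensional row of V_k Σ_k (its
"concept" coordinates) instead of by raw word counts, and similarity is
measured in that reduced space.
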
Experiments on LSA
                     Clustering
Often there are too many documents, and it would be nice to
see them grouped: if one searches on “rock” it would be
good to see answers from music and geology listed
separately.

Most search engines don’t do this, because it is too expensive.
There are many clustering algorithms; little time to discuss them.
          Searching technology
If you can get descriptors, you can do searching.
Ordinary words work pretty well for text.
Nothing else helps much.
There is so much scatter in the results that most people
can’t tell whether the search engine is doing well or not, so
why bother?
What actually matters in the real world is speed. Search
engines want to reply in under a second. Quality is
secondary.
Old saying: “fast, cheap, good, pick two”. Search engines
pick the first two.

				