Docstoc

PowerPoint Presentation - How Search Engines Work

Document Sample
PowerPoint Presentation - How Search Engines Work Powered By Docstoc
					How Search Engines Work:
 A Technology Overview


       Avi Rappoport
  Search Tools Consulting
   www.searchtools.com
 consult1@searchtools.com
UC Berkeley SIMS class 202
    September 16, 2004
               Purpose of Search Engines
   Helping people find what they’re
    looking for
        • Starts with an “information need”
        • Convert to a query
        • Gets results
   In the materials available
        • Web pages
        • Other formats
        • Deep Web
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                              2
                  Search is Not a Panacea
   Search can’t find what’s not there
        • The content is hugely important
   Information Architecture is vital
   Usable sites have good navigation and
    structure




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                            3
                     Search Looks Simple




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           4
                               But It's Not
   Index ahead of time
        • Find files or records
        • Open each one and read it
        • Store each word in a searchable index
   Provide search forms
        • Match the query terms with words in the
          index
        • Sort documents by relevance
   Display results
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                    5
                        Search Processing




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                            6
                Search is Mostly Invisible

       Like an iceberg,
       2/3 below water

                                            user
                                         interface




                                                search
                            content          functionality


UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                             7
         Text Search vs. Database Query
   Text search works for structured
    content
   Keyword search vs. SQL queries
   Approximate vs. exact match
   Multiple sources of content
   Response time and database resources
   Relevance ranking, very important
   Works in the real world (e.g. EBay)
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           8
   Search is Only as Good as the Content
   Users blame the search engine
        • Even when the content is unavailable
   Understand the scope of site or intranet
        •   Kinds of information
        •   Divided sites: products / corporate info
        •   Dates
        •   Languages
        •   Sources and data silos: CMSs, databases...
        •   Update processes
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                         9
               Making a Searchable Index
   Store text to search it later
   Many ways to gather text
        •   Crawl (spider) via HTTP
        •   Read files on file servers
        •   Access databases (HTTP or API)
        •   Data silos via local APIs
        •   Applications, CMSs, via Web Services
   Security and Access Control

UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                   10
                  Robot Indexing Diagram




                                           Sour




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           11
                    What the Index Needs
   Basic information for document or record
        • File name / URL / record ID
        • Title or equivalent
        • Size, date, MIME type
   Full text of item
   More metadata
        • Product name, picture ID
        • Category, topic, or subject
        • Other attributes, for relevance ranking and display


UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                                12
                    Simple Index Diagram




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           13
         More Complex Index Processing




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                         14
                              Index Issues
   Stopwords
   Stemming
   Metadata
        • Explicit (tags)
        • Implicit (context)
   Semantics
        • CMS and Database fields
        • XML tags and attributes

UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                             15
                 Search Query Processing
   What happens after you click the
    search button, and before retrieval
    starts.
   Usually in this order
        •   Handle character set, maybe language
        •   Look for operators and organize the query
        •   Look for field names or metadata
        •   Extract words (just like the indexer)
        •   Deal with letter casing
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                        16
                     Search and Retrieval
   Retrieval: find files with query terms
   Not the same as relevance ranking
   Recall: find all
    relevant items
   Precision: find
    only relevant items
   Increasing one
    decreases the
    other
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                             17
                     Retrieval = Matching
   Single-word queries
        • Find items containing that word
   Multi-word queries: combine lists
        • Any: every item with any query word
        • All: only items with every word
        • Phrases: find only items with all words in
          order
   Boolean and complex queries
        • Use algorithm to combine lists
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                       18
                        Why Searches Fail
   Empty search
   Nothing on the site on that topic
    (scope)
   Misspelling or typing mistakes
   Vocabulary differences
   Restrictive search defaults
   Restrictive search choices
   Software failure
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                            19
                LII.org No-Matches Page




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                          20
                       Relevance Ranking
   Theory: sort the matching items, so the
    most relevant ones appear first
   Can't really know what the user wants
   Relevance is hard to define and
    situational
   Short queries tend to be deeply
    ambiguous
        • What do people mean when they type
          “bank”?
   First 10 results are the most important
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                               21
                    Relevance Processing
   Sorting documents on various criteria
   Start with words matching query terms
   Citation and link analysis
        • Like old library Citation Indexes
        • Ted Nelson - not only hypertext, but the
          links
        • Google PageRank
            • Incoming links
            • Authority of linkers
   Taxonomies and external metadata
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                     22
               TF-IDF Ranking Algorithm
   Term frequency in the item
   Inverse document frequency of term
        • Rare words are likely to be more important
      wij = weight of Term Tj in Document Di
      tfij = frequency of Term Tj in
      Document Dj
      N = number of Documents in collection
      n = number of Documents where term
      Tj occurs at least once

      From Salton 1989



UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                       23
                         Other Algorithms
   Vector space
   Probabilistic (binary interdependence)
   Fuzzy set theory
   Bayesian statistical analysis
   Latent semantic indexing
   Neural networks
   Machine learning
   All require sophisticated queries
   See MIR, chapter 2
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                             24
                     Relevance Heuristics
   Heuristics are rules of thumb
        • Not algorithms, not math
   Search Relevance Ranking Heuristics
        •   Documents containing all search words
        •   Search words as a phrase
        •   Matches in title tag
        •   Matches in other metadata
   Based on real-word user behavior


UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                    25
                 Search Results Interface
   What users see after they click the
    Search button
   The most visible part of search
   Elements of the results page
        •   Page layout and navigation
        •   Results header
        •   List of results items
        •   Results footer

UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                            26
           Many Experiments in Interface




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           27
                        Back to Simplicity




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                             28
      Search Suggestions (aka Best Bets)
   Human judgment beats algorithms
   Great for frequent, ambiguous searches
        • Use search log to identify best candidates
   Recommend good starting pages
            • Product information, FAQs, etc.
   Requires human resources
        • That means money and time
   More static than algorithmic search

UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                       29
                           MSU Keywords




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                          30
                          Siemens Results




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                            31
                       Cooks.com Results




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           32
                        Salon.com Results




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                            33
     Faceted Metadata Search & Browse
   Leverage content structure
        • database fields (i.e. cruise amenities)
        • document metadata (news article bylines)
   Provide both search and browse
        •   Support information foraging
        •   Integrate navigation with results
        •   Not just subject taxonomies
        •   Display only fruitful paths, no dead ends
   Supported by academic research
        • Marti Hearst, UCB SIMS, flamenco.berkeley.edu
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                          34
             Faceted Search: Information




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           35
          Faceted Search: Online Catalog




UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                           36
             Search Metrics and Analytics
   Metrics
        •   Number of searches
        •   Number of no-matches searches
        •   Traffic from search to high-value pages
        •   Relate search changes to other metrics
   Search Log Analysis
        • Top 5% searches: phrases and words
        • Top no-matches searches
            • Use as market research
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                      37
            Search Will Never Be Perfect
   Search engines can’t read minds
        • User queries are short and ambiguous
   Some things will help
        •   Design a usable interface
        •   Show match words in context
        •   Keep index current and complete
        •   Adjust heuristic weighting
        •   Maintain suggestions and synonyms
        •   Consider faceted metadata search
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                 38
   Search Engines, sorta Rocket Science

   Questions and discussion
   Contact me
        • consult1@searchtools.com
        • www.searchtools.com
   This presentation:
        • www.searchtools.com/slides/sims/202-04/



UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
                                                    39

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:8
posted:10/4/2011
language:English
pages:39