02-IR_WebSearch by twittersubzero


									Web Search


                Pre-Web History
• Ted Nelson developed idea of hypertext in
• Doug Engelbart invented the mouse and built
  the first implementation of hypertext in the late
  1960s at SRI.
• The basic technology was in place in the
  1970s, but it took the PC revolution and
  widespread networking to inspire the web and
  make it practical.
• By the late 1980s, many files were available by
  anonymous FTP.
           The World Wide Web

• Developed by Tim Berners-Lee in 1990 at
  CERN to organize research documents
  available on the Internet.
• Combined idea of documents available by
  FTP with the idea of hypertext to link
• Developed initial HTTP network protocol,
  URLs, HTML, and first “web server.”

             Web Browser History
• Early browsers were developed in 1992 (Erwise,
• In 1993, Marc Andreessen and Eric Bina at UIUC
  NCSA developed the Mosaic browser and
  distributed it widely.
• Andreessen joined with James Clark (Stanford Prof.
  and Silicon Graphics founder) to form Mosaic
  Communications Inc. in 1994 (which became
  Netscape to avoid conflict with UIUC).
• Microsoft licensed the original Mosaic from UIUC
  and used it to build Internet Explorer in 1995.

Search Engine Early History

 • In 1990, Alan Emtage of McGill Univ. developed
   Archie (short for “archives”)
    – Assembled lists of files available on many FTP servers.
    – Allowed regex search of these file names.

 • In 1993, Veronica and Jughead were developed to
   search names of text files available through Gopher

           Web Search History
• In 1993, early web robots (spiders) were
  built to collect URLs:
  – Wanderer
  – ALIWEB (Archie-Like Index of the WEB)
  – WWW Worm (indexed URLs and titles for
    regex search)
• In 1994, Stanford grad students David Filo
  and Jerry Yang started manually collecting
  popular web sites into a topical hierarchy
  called Yahoo.
           Web Search History (cont)
• In early 1994, Brian Pinkerton developed
  WebCrawler as a class project at U Wash. (eventually
  became part of Excite and AOL).
• A few months later, Fuzzy Mauldin, a grad student at
  CMU, developed Lycos. It was the first to use a standard IR
  system, as developed for the DARPA Tipster project, and the
  first to index a large set of pages.
• In late 1995, DEC developed AltaVista. It used a large
  farm of Alpha machines to quickly process large
  numbers of queries and supported Boolean operators,
  phrases, and “reverse pointer” queries.
       Web Search Recent History

• In 1998, Larry Page and Sergey Brin, Ph.D.
  students at Stanford, started Google. Main
  advance is use of link analysis to rank
  results partially based on authority.
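
The slide names only "link analysis" and does not give an algorithm. As a rough illustration, the sketch below uses a PageRank-style power iteration, in which a page gains authority from the pages that link to it. The toy link graph, the damping factor of 0.85, and the function name `pagerank` are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """PageRank-style power iteration over {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with uniform scores
    for _ in range(iterations):
        # Every page receives a small baseline share of the total score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
best = max(ranks, key=ranks.get)
print(best)
```

In this toy graph, C collects links from three pages and ends up with the highest score, while A ranks second only because the highly ranked C links to it.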

            Web Challenges for IR
• Distributed Data: Documents spread over millions of
  different web servers.
• Volatile Data: Many documents change or
  disappear rapidly (e.g. dead links).
• Large Volume: Billions of separate documents.
• Unstructured and Redundant Data: No uniform
  structure, HTML errors, up to 30% (near) duplicate
  documents.
• Quality of Data: No editorial control, false
  information, poor quality writing, typos, etc.
• Heterogeneous Data: Multiple media types (images,
  video, VRML), languages, character sets, etc.
Number of Web Servers

Number of Web Pages

     Number of Web Pages Indexed

           SearchEngineWatch, Aug. 15, 2001
Assuming about 20 KB per page,
1 billion pages is about 20 terabytes of data.
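
The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 1000 bytes, 1 TB = 1000^4 bytes); the variable names are illustrative.

```python
PAGE_SIZE_BYTES = 20 * 1000        # about 20 KB per page (decimal units assumed)
NUM_PAGES = 1_000_000_000          # 1 billion pages

total_bytes = PAGE_SIZE_BYTES * NUM_PAGES
total_terabytes = total_bytes / 1000**4

print(total_terabytes)  # 20.0
```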
 Growth of Web Pages Indexed

      SearchEngineWatch, Jan. 28, 2005

Google lists current number of pages searched.
          Some Recent Web Statistics
• As of January 2006, there are an estimated 440
  million hosts on the Internet (http://www.isc.org/ds/).
• As of August 2006, there are an estimated 96 million
  Web servers on the Internet.
• As of September 2005, reported index sizes: yahoo.com,
  20 billion items; google.com, 8.1 billion web pages;
  search.msn.com, 5 billion web pages; alltheweb.com,
  over 3 billion web pages (August 2003).
Total Sites Across All Domains
 August 1995 - February 2012

Market Share for Top Servers Across All
Domains August 1995 - February 2012

Totals for Active Sites Across All Domains
         June 2000 - February 2012

               The Web graph
• We can view the static Web, consisting of static HTML pages
  together with the hyperlinks between them, as a directed
  graph in which each web page is a node and each hyperlink
  a directed edge.
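
One possible way to hold such a graph in memory is a pair of adjacency maps, one for out-links and one for in-links. The sketch below is only an illustration: the page names and the function `build_web_graph` are hypothetical.

```python
def build_web_graph(hyperlinks):
    """Build out-link and in-link adjacency sets from (source, target) pairs."""
    out_links, in_links = {}, {}
    for source, target in hyperlinks:
        # Make sure both endpoints exist as nodes, even with no links yet.
        for page in (source, target):
            out_links.setdefault(page, set())
            in_links.setdefault(page, set())
        out_links[source].add(target)   # directed edge source -> target
        in_links[target].add(source)    # same edge, indexed by its target
    return out_links, in_links


# Hypothetical pages and hyperlinks.
hyperlinks = [
    ("a.html", "b.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("a.html", "c.html"),
]
out_links, in_links = build_web_graph(hyperlinks)
print(sorted(out_links["a.html"]))
print(sorted(in_links["c.html"]))
```

Keeping the in-link map alongside the out-link map makes it cheap to ask which pages point at a given page.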

   Graph Structure in the Web

http://www9.org/w9cdrom/160/160.html (Report!!!)
                  Structure in the Web
This connected web breaks naturally into four pieces:
• The first piece is a central core, all of whose pages can
  reach one another along directed links -- this "giant
  strongly connected component" (SCC) is at the heart of
  the web.
• The second and third pieces are called IN and OUT.
   – IN consists of pages that can reach the SCC, but cannot be
     reached from it - possibly new sites that people have not yet
     discovered and linked to.
   – OUT consists of pages that are accessible from the SCC, but
     do not link back to it, such as corporate websites that contain
     only internal links.
             Structure in the Web

• Finally, the TENDRILS contain pages that
  cannot reach the SCC, and cannot be reached
  from the SCC.
• Perhaps the most surprising fact is that the
  size of the SCC is relatively small -- it
  comprises about 56 million pages. Each of the
  other three sets contains about 44 million
  pages -- thus, all four sets have roughly the
  same size.
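
The four pieces can be computed on a small graph with two reachability searches. The sketch below assumes that one page known to lie in the core is given (`core_page`); the edge list and all names are hypothetical, and TENDRILS follows the definition used above (pages that neither reach the SCC nor are reached from it).

```python
from collections import deque


def reachable(start, neighbors):
    """Return every page reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in neighbors.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges, core_page):
    """Split pages into SCC, IN, OUT, and TENDRILS relative to core_page."""
    out_links, in_links, pages = {}, {}, set()
    for source, target in edges:
        out_links.setdefault(source, set()).add(target)
        in_links.setdefault(target, set()).add(source)
        pages.update((source, target))
    forward = reachable(core_page, out_links)    # pages the core can reach
    backward = reachable(core_page, in_links)    # pages that can reach the core
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "TENDRILS": pages - forward - backward,
    }


edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),   # core cycle
    ("new", "a"),                         # a new site linking into the core
    ("c", "corp"),                        # a site that never links back
    ("new", "t"),                         # a page hanging off IN
]
parts = bow_tie(edges, "a")
print(parts)
```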

                 Zipf's Law
• The probability of occurrence of words or
  other items starts high and tapers off. Thus,
  a few occur very often while many others
  occur rarely.
• Formal Definition: $P_n \propto 1/n^a$, where $P_n$ is
  the frequency of occurrence of the nth
  ranked item and $a$ is close to 1.
• Note: In the English language, words like
  "and," "the," "to," and "of" occur often,
  while words like "undeniable" are rare.
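
To make the definition concrete, the sketch below turns the weights 1/n^a into probabilities by dividing by their sum. The normalization step, the choice of 1000 items, and the function name are assumptions for illustration.

```python
def zipf_probabilities(num_items, a=1.0):
    """Return P_n for n = 1..num_items, with P_n proportional to 1 / n**a."""
    weights = [1.0 / rank**a for rank in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]   # normalize so the values sum to 1


probs = zipf_probabilities(1000)
ratio = probs[0] / probs[9]   # rank 1 compared with rank 10
print(ratio)
```

With a = 1, the top-ranked item is about ten times as frequent as the tenth-ranked one, which is the "few occur very often, many occur rarely" pattern.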
          Zipf’s Law on the Web

• Zipf's Law is often used to predict the
  frequency of words within a text.
• Number of in-links/out-links to/from a page
  has a Zipfian distribution.
• Length of web pages has a Zipfian distribution.
• Number of hits to a web page has a Zipfian
  distribution.

      Distribution of incoming page requests to

• Each data point represents one page, with the x-axis showing
  pages sorted according to popularity: the first page is the most
  popular one.
           Zipf’s Law: Conclusion

• An empirical rule that describes the relation
  between an item's rank and its frequency of appearance.
• Example -- text words: the n-th most frequent
  word appears as many times as the most
  frequent one divided by $n^a$, for some $a \approx 1$.
• The same can be applied to in-link/out-link of a
  web page, length of a web page, and number of
  hits to a web page, among others.
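
One way to check whether a set of counts (word frequencies, in-link counts, hits) looks Zipfian is to fit a straight line to log(frequency) against log(rank); the negated slope estimates a. The sketch below uses an ordinary least-squares fit on synthetic counts. The fitting method, the synthetic data, and the function name are assumptions, not part of the original slides.

```python
import math


def estimate_zipf_exponent(counts):
    """Estimate a by least squares on log(frequency) versus log(rank)."""
    freqs = sorted(counts, reverse=True)          # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope                                 # frequency ~ rank ** (-a)


# Synthetic counts built to follow frequency = 10000 / rank (so a is near 1).
synthetic_counts = [round(10000 / n) for n in range(1, 201)]
a_hat = estimate_zipf_exponent(synthetic_counts)
print(a_hat)
```

For real text, the counts could come from something like `collections.Counter(words).values()`; all counts must be positive for the logarithm to be defined.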

 Manual Hierarchical Web Taxonomies
• Yahoo approach of using human editors to
  assemble a large hierarchically structured directory
  of web pages.
   – http://www.yahoo.com/
• Open Directory Project is a similar approach based
  on the distributed labor of volunteer editors (“net-
  citizens provide the collective brain”). Used by
  most other search engines. Started by Netscape.
   – http://www.dmoz.org/
   – 4,998,026 sites - 94,164 editors - over 1,010,070
     Automatic Document Classification

• Manual classification into a given hierarchy is
  labor intensive, subjective, and error-prone.

• Text categorization methods provide a way to
  automatically classify documents.

• Best methods based on training a machine
  learning (pattern recognition) system on a
  labeled set of examples (supervised learning).
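
As a rough sketch of supervised text categorization, the example below trains a small multinomial Naive Bayes classifier on labeled examples and uses it to label new text. Naive Bayes is only one possible method and is chosen here for brevity; the category names, training sentences, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)          # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue                                   # skip unseen words
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled examples for two directory categories.
training = [
    ("football match score goal", "Sports"),
    ("tennis tournament final match", "Sports"),
    ("stock market shares price", "Business"),
    ("bank interest rate market", "Business"),
]
model = train_naive_bayes(training)
label = classify("goal in the final match", model)
print(label)
```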

    Guidelines for Evaluating Web Resources

•   Authority
•   Accuracy
•   Objectivity
•   Currency
•   Coverage
•   Design

    Authority
•   Is there an author or sponsoring organization?
•   Is the page signed?
•   What are their qualifications? Are they well-known and
•   Is there a link to information about the author or sponsor?
•   Is there a way of verifying the legitimacy of the page's sponsor?
    That is, is there a phone number or address to contact for more
•   What is the website’s extension?
    –   .com
    –   .edu
    –   .org
    –   .gov
    –   .html (look for personal name, ~ or %)

    Accuracy
• Is the information reliable and free of
  grammatical, spelling, and typographical
  errors?
• Is there an editor or someone who verifies the
• Is there a bibliography or other resource links
  clearly listed so the information can be verified
  in another source?
• Does the information contradict information
  already gathered?
    Objectivity
•   Purpose
    –   What is the purpose/objective of the website?
        •   Public service
        •   Educational
        •   Sway opinion
•   Advertising
    –   Is the information free of advertising?
    –   If there is any advertising on the page, is it clearly
        differentiated from the informational content?
•   Bias
    –   Is the point of view balanced?
    –   Is the information based on fact? Opinion? Prejudice?
    –   Are both sides of an issue presented?

    Currency
• Is the page dated?
• If so, when was the last update?
• Have the links been kept current?

    Coverage
•   Relevant
    –   Is the content relevant to my topic?
    –   Is the information useful?
•   Comprehensive
    –   Does the source tell the whole story, or is it too specific
        about one part?
•   Complete
    –   Is there an indication that the page has been completed, and
        is not still under construction?
•   Compare
    –   How does the website compare in content to similar
    –   What does this page offer that is not found elsewhere?
    Design
• Navigation
  – Are the links clearly labeled?
  – Can you move from page to page easily?
  – Can you find information easily?
• Interactivity
  – Does the user engage with the site?
  – How long does it take to load?
• Appearance
  – Is there good use of graphics and color?
  – Can the page be read without excessive scrolling?
Evaluate the following Web Resources
– http://www.ci.mankato.mn.us/
– http://descy.50megs.com/mankato/mankato.html
– http://www.genderandaids.org/
– http://www.martinlutherking.org/
– www.samford.edu

