       The Anatomy of a Large-Scale Hypertextual Web Search Engine

       Computer Science Department
       Sergey Brin and Lawrence Page

           Speaker :: Jin O, Kim
             March 14, 2006
                             Table of Contents
        Abstract
        Introduction
        - Web Search Engines – Scaling Up : 1994 – 2000
        - Google : Scaling with the Web
        - Design Goals
        System Features
        - PageRank
        - Anchor Text
        - Other Features
        Related Work
        - Information Retrieval
        - Differences Between the Web and Well Controlled Collections
        System Anatomy
        - Google Architecture Overview
        - Major Data Structures
CREST(Center for Real-Time Embedded System Technology), Soongsil Univ, Korea   http://realtime.ssu.ac.kr
                                         Abstract
        Presented as a prototype of a large-scale search engine
        Google is designed to crawl and index the Web
        efficiently and produce much more satisfying search
        results than existing systems.
        Engineering a search engine is a challenging task
             Index tens to hundreds of millions of web pages
             Answer tens of millions of queries every day

        How to build a practical large-scale system which can
        exploit the additional information present in hypertext



                            Introduction (1/4)
        The amount of information on the web is growing rapidly,
        as well as the number of new users inexperienced in the
        art of web research.
        Automated search engines that rely on keyword
        matching usually return too many low quality matches
             - some advertisers try to mislead automated search engines to gain attention

        Google : from "googol", 10^100




                            Introduction (2/4)
        Web Search Engines -- Scaling Up : 1994 - 2000
             1994 : World Wide Web Worm (WWWW), index of 110,000 pages, about 1,500 queries per day
             1997 (now) : top search engines index 2 to 100 million pages
                               AltaVista, about 20 million queries per day
             2000 (predicted) : index of over 1 billion pages, hundreds of millions of queries per day




                            Introduction (3/4)
     Google : Scaling with the Web
          gather the web documents and keep them up to date
           - fast crawling technology
           - storage space must be used efficiently
           - indexing system must process hundreds of gigabytes of data
            efficiently
           - queries must be handled quickly
          These tasks are becoming increasingly difficult as the Web grows.
          Hardware performance and cost have improved dramatically to
          partially offset the difficulty.
           - notable exceptions : disk seek time, operating system robustness
          Designing Google : growth of the Web and technological changes
           - data structures : optimized for fast and efficient access

                            Introduction (4/4)
      Google : Scaling with the Web
      Design Goals
           Improved Search Quality
           - very high precision (the number of relevant documents returned, say in the
           top tens of results)
           - even at the expense of recall (the total number of relevant documents the
           system is able to return)
           Academic Search Engine Research
           - 1993 : 1.5% of web servers on .com domains, 1997 : over 60%
           - One of our main goals in designing Google was to set up an
           environment where other researchers can come in quickly, process large
           chunks of the web, and produce interesting results that would have been
           very difficult to produce otherwise.
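
The precision and recall trade-off named in the design goals can be illustrated with a small worked example. This is a minimal sketch, not part of the original slides: the function names, the result list, and the set of relevant documents are all hypothetical, and precision is computed as a fraction of the top k results rather than a raw count.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall(ranked, relevant):
    """Fraction of all relevant documents that were returned at all."""
    if not relevant:
        return 0.0
    return sum(1 for doc in set(ranked) if doc in relevant) / len(relevant)


# Hypothetical data: a ranked result list (best first) and the relevant set.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d2", "d7", "d8", "d9"}

top2_precision = precision_at_k(ranked, relevant, 2)  # both top results are relevant
overall_recall = recall(ranked, relevant)             # only 2 of 5 relevant docs returned
print(top2_precision, overall_recall)
```

Here the first results are all relevant (high precision) even though most relevant documents are never returned (low recall), which is the trade-off the slide accepts.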



                       System Features (1/3)
        PageRank : Bringing Order to the Web
             Citation importance that corresponds well with people’s subjective idea
             of importance
             Simple text matching search
            - performs admirably when PageRank prioritizes the results
             Links from all pages are not counted equally, and each page's vote is
             normalized by the number of links on that page.

             PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

             - We assume page A has pages T1 ... Tn which point to it.
             - d (damping factor) : between 0 and 1, usually set to 0.85
             - C(A) : defined as the number of links going out of page A

                      System Features (2/3)
        PageRank




             High PageRank : many pages point to it, or some pages that point to it
             have a high PageRank themselves
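
The PageRank definition, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), can be computed by simple iteration. The following is a minimal sketch under stated assumptions: the function name, the dictionary-based graph, the fixed iteration count, and the three-page example are all hypothetical, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of links going out of each page.
    out_count = {p: len(links.get(p, [])) for p in pages}

    # inbound[p]: the pages T1..Tn that point to p.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example C is pointed to by both A and B, so it ends up ranked above B, which only receives half of A's vote.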


                      System Features (3/3)
        Anchor Text
             The text of a link is associated with the page the link points to
             1. Anchors often provide more accurate descriptions of web pages than the
             pages themselves
             2. Anchors may exist for documents which cannot be indexed by a text-based
             search engine, such as images, programs, and databases
             Anchor text is propagated mostly because it can help provide better quality
             results
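
A minimal sketch of the anchor text idea: the link text is credited to the target page rather than to the page containing the link. The function name, the tuple layout, and the example URLs are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Group anchor text by the page each link points to.

    links is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    anchors_for = defaultdict(list)
    for _source, target, text in links:
        anchors_for[target].append(text)
    return anchors_for


# Hypothetical links; the second target is an image with no indexable text.
links = [
    ("http://example.com/", "http://example.com/about.html", "about our company"),
    ("http://example.com/", "http://example.com/logo.png", "company logo"),
    ("http://example.org/", "http://example.com/about.html", "who they are"),
]
anchors = index_anchor_text(links)
print(anchors["http://example.com/logo.png"])
```

Even though the image itself contains no text, it becomes searchable through the words other pages use to link to it.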
        Other Features
             1. Google has location information for all hits, so it makes extensive
             use of proximity in search
             2. Google keeps track of some visual presentation details such as font size of words
                - larger or bolder font : weighted higher than other words
             3. full raw HTML of pages is available in a repository

                           Related Work (1/2)
        Information Retrieval
             TREC -> well controlled, homogeneous collections
             Google : 147GB from our crawl of 24 million web pages
             Vector Space Model not enough
             Query : “Bill Clinton” -> a page containing only “Bill Clinton Sucks”
            - even though high quality information is available on this topic
        Differences Between the Web and Well Controlled Collections
             Web
              - extreme variation internal to the documents
              - no control over what people can put on the web
             metadata efforts have largely failed
              - companies which specialize in manipulating search engines for profit


                       System Anatomy (1/3)
        Google Architecture Overview




                      System Anatomy (2/3)
        Google Architecture Overview
             URL Server : sends lists of URLs to crawlers
             Crawler : downloads web pages
             Store Server : compresses & stores web pages into the repository
             Indexer
              - reads the repository & uncompresses the documents
              - parses the documents
              - creates forward index
              - parses out the links
             URL Resolver
              - converts relative URLs to absolute URLs and then to docIDs
              - generates a database of links
              - puts the anchor text into the barrels
             Sorter : generates the inverted index
             Searcher : answers queries
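
The URL Resolver step (relative URL to absolute URL to docID, plus a database of links) can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and assigning docIDs sequentially from an in-memory dictionary is an assumption of this sketch, not the mechanism described in the slides.

```python
from urllib.parse import urljoin


class UrlResolver:
    """Toy resolver: relative URL -> absolute URL -> docID, plus a link database."""

    def __init__(self):
        self.doc_ids = {}   # absolute URL -> docID (sequential IDs are an assumption)
        self.links = []     # (source docID, target docID) pairs

    def doc_id(self, url):
        # Assign the next free docID the first time a URL is seen.
        return self.doc_ids.setdefault(url, len(self.doc_ids))

    def add_link(self, base_url, href):
        absolute = urljoin(base_url, href)  # resolve href against the linking page
        pair = (self.doc_id(base_url), self.doc_id(absolute))
        self.links.append(pair)
        return pair


resolver = UrlResolver()
first = resolver.add_link("http://example.com/a/b.html", "../c.html")
second = resolver.add_link("http://example.com/c.html", "a/b.html")
print(resolver.doc_ids)
print(resolver.links)
```

The resulting list of docID pairs is the kind of link database that a PageRank computation would take as input.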
                      System Anatomy (3/3)
        Major Data Structures
             Bigfiles
             Repository
             Document Index
             Lexicon
             Hit Lists
             Forward Index
             Inverted Index


                      Major Data Structures
        Bigfiles
             The operating systems do not provide enough for our needs
             Virtual files spanning multiple file systems : addressable by 64 bit
             integers
        Repository
             Contains the full HTML of every web page : compressed using zlib
             Documents are stored one after the other and are prefixed by docID,
             length, and URL
             All the other data structures can be rebuilt from only the repository
             and a file which lists crawler errors
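
The repository layout (documents stored back to back, each prefixed by docID, length, and URL, with the HTML compressed using zlib) can be sketched as below. The exact header layout here, three little-endian 32-bit integers followed by the URL bytes, is an assumption made for illustration; the slides do not give field widths.

```python
import struct
import zlib

# Assumed header: docID, URL length, compressed body length (32-bit each).
HEADER = struct.Struct("<III")


def pack_document(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_documents(blob):
    """Scan a repository blob sequentially and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_document(1, "http://example.com/", "<html>first page</html>")
    + pack_document(2, "http://example.com/b", "<html>second page</html>")
)
documents = list(unpack_documents(repository))
print(documents)
```

Because every record carries its own lengths, the whole repository can be rescanned front to back, which is what makes rebuilding the other data structures from it possible.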




                      Major Data Structures
        Document Index
             Fixed width ISAM index, ordered by docID
             Stores document status, a pointer into the repository, and a
             document checksum
             If the document has been crawled, also stores a pointer into a
             variable length docinfo file

        Lexicon
             Fits in memory on a machine at a reasonable price
             - 256MB : 14 million words
             List of the words, Hash table of pointers



                      Major Data Structures
        Hit Lists
             plain hit, anchor hit, fancy hit
             Encoding uses 2 bytes for each hit
             Length of the hit list is stored before the hits themselves
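
A 2-byte hit can be sketched as a packed bit field. The field widths used here (1 capitalization bit, 3 bits of relative font size, 12 bits of word position) follow the plain hit description in the original Brin and Page paper, not these slides; the order of the fields within the 16 bits, the function names, and clamping large positions to 4095 are assumptions of this sketch.

```python
import struct


def encode_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size <= 6:
        # Value 7 (binary 111) is left free to flag a fancy hit.
        raise ValueError("font_size must be between 0 and 6")
    position = min(position, 4095)  # positions beyond the 12-bit range are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Unpack a 16-bit plain hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = encode_plain_hit(True, 3, 120)
encoded = struct.pack("<H", hit)  # exactly two bytes on disk
print(len(encoded), decode_plain_hit(hit))
```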


        Forward Index
             Stored in 64 barrels
             If a document contains words in a barrel,
             then the docID is recorded into the barrel,
             with the list of wordID’s and hit lists.
             Each wordID is stored as a relative difference from the minimum wordID in
             a barrel. (24 bits for the wordID, 8 for hit list length)
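
The forward index entry described above (24 bits for the relative wordID, 8 bits for the hit list length) fits in a single 32-bit word. The following is a minimal sketch; the function names, the little-endian byte order, and placing the wordID delta in the high bits are assumptions, and hits are treated as opaque 16-bit values.

```python
import struct


def pack_entry(word_id, barrel_min_word_id, hits):
    """Pack one forward index entry: 24-bit wordID delta, 8-bit hit count, then hits."""
    delta = word_id - barrel_min_word_id
    if not 0 <= delta < (1 << 24):
        raise ValueError("wordID does not belong to this barrel")
    if len(hits) >= (1 << 8):
        raise ValueError("hit list too long for an 8-bit length")
    header = (delta << 8) | len(hits)
    return struct.pack(f"<I{len(hits)}H", header, *hits)


def unpack_entry(data, barrel_min_word_id):
    """Recover (wordID, hits) from a packed entry."""
    (header,) = struct.unpack_from("<I", data)
    count = header & 0xFF
    hits = list(struct.unpack_from(f"<{count}H", data, 4))
    return barrel_min_word_id + (header >> 8), hits


entry = pack_entry(1_000_005, 1_000_000, [10, 20, 30])
print(len(entry), unpack_entry(entry, 1_000_000))
```

Storing the difference from the barrel's minimum wordID is what lets a large wordID fit in 24 bits and leaves 8 bits for the length.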


                      Major Data Structures
        Inverted Index
             The same barrels as forward index, except that they have been
             processed by the sorter.
             For every wordID, a doclist of docIDs is generated, together with
             the corresponding hit lists
             Two sets of inverted barrels, one set for hit lists which include
             title or anchor hits and another set for all hit lists.
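
The sorter's job, turning a forward barrel (organized by docID) into an inverted barrel (organized by wordID), can be sketched with in-memory structures. The function name, the nested list layout, and ordering each doclist by docID are assumptions of this sketch; the real barrels are on-disk files.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Regroup (docID, [(wordID, hits), ...]) records by wordID.

    Returns a dict mapping wordID -> doclist of (docID, hits), with wordIDs
    in ascending order and each doclist sorted by docID.
    """
    inverted = defaultdict(list)
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            inverted[word_id].append((doc_id, hits))
    return {
        word_id: sorted(doclist, key=lambda item: item[0])
        for word_id, doclist in sorted(inverted.items())
    }


# Hypothetical forward barrel: two documents, hits given as word positions.
forward_barrel = [
    (7, [(42, [1, 9]), (99, [3])]),
    (3, [(42, [5])]),
]
inverted_barrel = invert(forward_barrel)
print(inverted_barrel)
```

A query for wordID 42 can then read one doclist instead of scanning every document's forward entries.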





				