Standard Web Search Engine Architecture

[Figure: standard web search engine architecture. Crawl the web; check for duplicates and store the documents (DocIds); create an inverted index. The user query goes to the search engine servers, which use the inverted index and show results to the user.]
[Figure: more detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.]
 Indexes for Web Search Engines
• Inverted indexes are still used, even though the
  web is so huge
• Most current web search systems partition the
  indexes across different machines
  – Each machine handles different parts of the data
    (Google uses thousands of PC-class processors and
    keeps most things in main memory)
• Other systems duplicate the data across many
  machines
  – Queries are distributed among the machines
• Most do a combination of these
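The partitioning idea above can be sketched in a few lines. This is a minimal illustration, not how any particular engine does it: the toy documents, the modulo assignment of documents to partitions, and the function names are all assumptions made for the example. Each partition builds its own small inverted index, and a query is sent to every partition and the hits are merged.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def partition_docs(docs, num_partitions):
    """Assign each document to one partition (assumed scheme: doc id modulo)."""
    parts = [dict() for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        parts[doc_id % num_partitions][doc_id] = text
    return parts

def search(partition_indexes, term):
    """Send the query term to every partition and merge the hits."""
    hits = []
    for index in partition_indexes:
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)

# Hypothetical toy collection.
docs = {1: "web search engine", 2: "inverted index",
        3: "web crawler", 4: "search index"}
partition_indexes = [build_index(part) for part in partition_docs(docs, 2)]
web_hits = search(partition_indexes, "web")
```

Duplicating the data instead would mean building the same index on several machines and sending each query to only one of them.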
            Search Engine Querying
In this example, the data for the pages is partitioned across machines.
Additionally, each partition is allocated multiple machines to handle the queries.

• Each row can handle 120 queries per second
• Each column can handle 7M pages
• To handle more queries, add another row.
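The row and column figures above lend themselves to a quick sizing calculation. The sketch below uses only the two numbers from the slide (120 queries per second per row, 7M pages per column); the example workloads are made up for illustration.

```python
import math

QPS_PER_ROW = 120             # from the slide: each row handles 120 queries/second
PAGES_PER_COLUMN = 7_000_000  # from the slide: each column handles 7M pages

def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) needed for a given corpus and query load."""
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    rows = math.ceil(peak_qps / QPS_PER_ROW)
    return rows, columns, rows * columns

# Hypothetical workload: 70M pages at a peak of 600 queries per second.
rows, columns, machines = grid_size(70_000_000, 600)
```

Under these assumptions the example workload needs 5 rows of 10 columns; raising the query load adds rows, while growing the corpus adds columns.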

                                From description of the FAST search engine, by Knut Risvik
                          http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
  Querying: Cascading Allocation of CPUs

• A variation on this that produces a cost-
  savings:
  – Put high-quality/common pages on many
    machines
  – Put lower quality/less common pages on
    fewer machines
  – Query goes to high quality machines first
  – If no hits found there, go to other machines
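A minimal sketch of the cascade, assuming each tier is just a dictionary from a term to a list of page ids (the tier contents and names here are hypothetical): the query tries the high-quality tier first and falls through to the next tier only when there are no hits.

```python
def cascaded_search(tiers, term):
    """Query tiers in order (best first); stop at the first tier with hits.

    Returns (tier_number, hits), or (None, []) if no tier has a hit.
    """
    for level, index in enumerate(tiers):
        hits = index.get(term, [])
        if hits:
            return level, hits
    return None, []

# Hypothetical two-tier setup: tier 0 holds high-quality/common pages.
tiers = [
    {"toyota": [101, 102]},
    {"toyota": [900], "obscure": [901]},
]
common = cascaded_search(tiers, "toyota")
rare = cascaded_search(tiers, "obscure")
```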
                   Google
• Google maintains (probably) the world's
  largest Linux cluster (over 15,000 servers)
• These are partitioned between index
  servers and page servers
  – Index servers resolve the queries (massively
    parallel processing)
  – Page servers deliver the results of the queries
• Over 8 Billion web pages are indexed and
  served by Google
      Search Engine Indexes
• Starting Points for Users include
• Manually compiled lists
  – Directories
• Page “popularity”
  – Frequently visited pages (in general)
  – Frequently visited pages as a result of a query
• Link “co-citation”
  – Which sites are linked to by other sites?
   Starting Points: What is Really
            Being Used?
• Today's search engines combine these
  methods in various ways
  – Integration of Directories
     • Today most web search engines integrate
       categories into the results listings
     • Lycos, MSN, Google
  – Link analysis
     • Google uses it; others are also using it
     • Words on the links seem to be especially useful
  – Page popularity
     • Many use DirectHit’s popularity rankings
            Web Page Ranking
• Varies by search engine
  – Pretty messy in many cases
  – Details usually proprietary and fluctuating
• Combining subsets of:
  –   Term frequencies
  –   Term proximities
  –   Term position (title, top of page, etc)
  –   Term characteristics (boldface, capitalized, etc)
  –   Link analysis information
  –   Category information
  –   Popularity information
        Ranking: Hearst ‘96
• Proximity search can help get high-
  precision results if >1 term
  – Combine Boolean and passage-level
    proximity
  – Provides significant improvements when
    retrieving top 5, 10, 20, 30 documents
  – Results reproduced by Mitra et al. 98
  – Google uses something similar
      Ranking: Link Analysis
• Assumptions:
  – If the pages pointing to this page are good,
    then this is also a good page
  – The words on the links pointing to this page
    are useful indicators of what this page is
    about
  – References: Page et al. 98, Kleinberg 98
       Ranking: Link Analysis
• Why does this work?
  – The official Toyota site will be linked to by lots
    of other official (or high-quality) sites
  – The best Toyota fan-club site probably also
    has many links pointing to it
  – Less high-quality sites do not have as many
    high-quality sites linking to them
          Ranking: PageRank
• Google uses PageRank
• We assume page A has pages T1...Tn which
  point to it (i.e., are citations). The parameter d is
  a damping factor which can be set between 0
  and 1. d is usually set to 0.85. C(A) is defined as
  the number of links going out of page A. The
  PageRank of a page A is given as follows:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
  PR(Tn)/C(Tn))
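• Note: Brin & Page describe the PageRanks as forming a
  probability distribution over web pages that sums to one.
  With the formula exactly as written, the values instead sum
  to roughly the number of pages N; dividing the (1-d) term by
  N gives the normalized form whose sum is one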
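The formula can be evaluated by simple iteration. The sketch below is a minimal illustration under stated assumptions: the three-page graph is made up, every page starts at 1.0, and a fixed number of passes stands in for a proper convergence test. It applies PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) exactly as written, so the values it produces are not normalized to sum to one.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[source] / len(targets)   # PR(T) / C(T)
                for source, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A and B both point to C, and C points back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C, which has two in-links, ends up with the highest value, and B, which has none, stays at 1-d.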
                              PageRank
[Figure: example link graph with nodes X1, X2, A (Pr=4.2544375), T1 (Pr=.725), T2 through T7 (Pr=1 each) and T8 (Pr=2.46625). Note: these are not real PageRanks, since they include values >= 1.]
                  PageRank
• Similar to calculations used in scientific citation
  analysis (e.g., Garfield et al.) and social network
  analysis (e.g., Wasserman et al.)
• Similar to other work on ranking (e.g., the hubs
  and authorities of Kleinberg et al.)
• How is Amazon similar to Google in terms of the
  basic insights and techniques of PageRank?
• How could PageRank be applied to other
  problems and domains?
                                Today
• Review
   – Web Crawling and Search Issues
   – Web Search Engines and Algorithms
• Web Search Processing
   – Parallel Architectures (Inktomi – Eric Brewer)
   – Cheshire III Design



Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer
         Presentation from DLF Forum April 2005



  Digital Library Grid Initiatives:
     Cheshire3 and the Grid
                          Ray R. Larson
                         University of California, Berkeley

                  School of Information Management and Systems

                         Rob Sanderson
                             University of Liverpool
                           Dept. of Computer Science
Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation
                 Overview
• The Grid, Text Mining and Digital Libraries
  – Grid Architecture
  – Grid IR Issues
• Cheshire3: Bringing Search to Grid-Based
  Digital Libraries
  – Overview
  – Grid Experiments
  – Cheshire3 Architecture
  – Distributed Workflows
 Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)
[Figure: layered Grid architecture, listed here from top to bottom, with a side label "Grid middleware".]
• Applications: High energy physics, Cosmology, Astrophysics, Combustion, ...
• Application Toolkits: Data Grid, Remote Computing, Remote Visualization,
  Collaboratories, Portals, Remote sensors, ...
• Grid Services: Protocols, authentication, policy, instrumentation,
  resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc.
  and their associated local services
 Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Figure: the same layered Grid architecture, extended with digital library applications and toolkits; side label "Grid middleware".]
• Applications: High energy physics, Cosmology, Astrophysics, Combustion,
  Bio-Medical, Humanities computing, Digital Libraries, ...
• Application Toolkits: Data Grid, Remote Computing, Remote Visualization,
  Collaboratories, Portals, Remote sensors, Text Mining, Metadata management,
  Search & Retrieval, ...
• Grid Services: Protocols, authentication, policy, instrumentation,
  resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc.
  and their associated local services
   Grid-Based Digital Libraries
• Large-scale distributed storage
  requirements and technologies
• Organizing distributed digital collections
• Shared Metadata – standards and
  requirements
• Managing distributed digital collections
• Security and access control
• Collection Replication and backup
• Distributed Information Retrieval issues
  and algorithms
              Grid IR Issues
• Want to preserve the same retrieval
  performance (precision/recall) while hopefully
  increasing efficiency (i.e., speed)
• Very large-scale distribution of resources is a
  challenge for sub-second retrieval
• Different from most other typical Grid processes,
  IR is potentially less computing intensive and
  more data intensive
• In many ways Grid IR replicates the process
  (and problems) of metasearch or distributed
  search
         Cheshire3 Overview
• XML Information Retrieval Engine
  – 3rd Generation of the UC Berkeley Cheshire system,
    as co-developed at the University of Liverpool.
  – Uses Python for flexibility and extensibility, but
    imports C/C++ based libraries for processing speed
  – Standards based: XML, XSLT, CQL, SRW/U, Z39.50,
    OAI to name a few.
  – Grid capable. Uses distributed configuration files,
    workflow definitions and PVM (currently) to scale from
    one machine to thousands of parallel nodes.
  – Free and Open Source Software. (GPL Licence)
  – http://www.cheshire3.org/ (under development!)
Cheshire3 SERVER Overview
[Figure: Cheshire3 server block diagram. Legible labels: SERVER CONTROL, API, USER INFO, Normalization, Native calls, a protocol handler column (Z39.50, SOAP, OAI, SRW, Fetch ID, Put ID, OpenURL, UDDI, WSRP, OGIS, JDBC), STAFF UI, CONFIG, User/Clients, REMOTE SYSTEMS (any protocol), DB API, and a LOCAL DB holding XML CONFIG & Metadata INFO, INDEXES, RESULT SETS and ACCESS INFO.]
       Cheshire3 Grid Tests
• Running on a 30-processor cluster in
  Liverpool using PVM (Parallel Virtual
  Machine)
• Using 16 processors with one “master”
  and 22 “slave” processes we were able to
  parse and index MARC data at about
  13000 records per second
• On a similar setup 610 Mb of TEI data can
  be parsed and indexed in seconds
  SRB and SDSC Experiments
• We are working with SDSC to include SRB
  support
• We are planning to continue working with SDSC
  and to run further evaluations using the TeraGrid
  server(s) through a “small” grant for 30000 CPU
  hours
   –    SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes,
       each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak
       performance of 3.1 teraflops. The nodes are equipped with four
       gigabytes (GBs) of physical memory per node. The cluster is running
       SuSE Linux and is using Myricom's Myrinet cluster interconnect
       network.
• Planned large-scale test collections include
  NSDL, the NARA repository, CiteSeer and the
  “million books” collections of the Internet Archive
           Cheshire3 Object Model
[Figure: Cheshire3 object model diagram. Objects shown: Protocol Handler, Ingest Process, Server, Object, ConfigStore, User, UserStore, Database, Query, ResultSet, Index, IndexStore, Extracter, Normaliser, Terms, Record, Records, RecordStore, Parser, Transformer, Documents, Document Group, Document, PreParser (three in sequence), DocumentStore.]
        Cheshire3 Data Objects
• DocumentGroup:
   – A collection of Document objects (e.g. from a file, directory, or
     external search)
• Document:
   – A single item, in any format (e.g. PDF file, raw XML string,
     relational table)
• Record:
   – A single item, represented as parsed XML
• Query:
   – A search query, in the form of CQL (an abstract query language
     for Information Retrieval)
• ResultSet:
   – An ordered list of pointers to records
• Index:
   – An ordered list of terms extracted from Records
    Cheshire3 Process Objects
• PreParser:
   – Given a Document, transform it into another Document (e.g. PDF
     to Text, Text to XML)
• Parser:
   – Given a Document as a raw XML string, return a parsed Record
     for the item.
• Transformer:
   – Given a Record, transform it into a Document (e.g. via XSLT,
     from XML to PDF, or XML to relational table)
• Extracter:
   – Extract terms of a given type from an XML sub-tree (e.g. extract
     Dates, Keywords, Exact string value)
• Normaliser:
   – Given the results of an extracter, transform the terms,
     maintaining the data structure (e.g. CaseNormaliser)
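The chain of process objects can be pictured as a sequence of small functions, each consuming the previous one's output. The sketch below is only an illustration of that data flow using the Python standard library; the function names and the toy document are hypothetical, and none of this is the Cheshire3 API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def pre_parse(document):
    """PreParser role: Document -> Document (here: plain text to an XML string)."""
    return "<doc><title>" + escape(document) + "</title></doc>"

def parse(document):
    """Parser role: raw XML string -> parsed Record."""
    return ET.fromstring(document)

def extract_keywords(record, path):
    """Extracter role: pull keyword terms from an XML sub-tree."""
    terms = []
    for node in record.findall(path):
        terms.extend((node.text or "").split())
    return terms

def normalise_case(terms):
    """Normaliser role: transform the terms, keeping the list structure."""
    return [term.lower() for term in terms]

def transform(record):
    """Transformer role: Record -> Document (here: serialize back to a string)."""
    return ET.tostring(record, encoding="unicode")

record = parse(pre_parse("Grid Based Digital Libraries"))
terms = normalise_case(extract_keywords(record, "title"))
```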
   Cheshire3 Abstract Objects
• Server:
  – A logical collection of databases
• Database:
  – A logical collection of Documents, their
    Record representations and Indexes of
    extracted terms.
• Workflow:
  – A 'meta-process' object that takes a workflow
    definition in XML and converts it into
    executable code.
          Workflow Objects
• Workflows are first class objects in
  Cheshire3 (though not represented in the
  model diagram)
• All Process and Abstract objects have
  individual XML configurations with a
  common base schema with extensions
• We can treat configurations as Records
  and store in regular RecordStores,
  allowing access via regular IR protocols.
       Workflow References
• Workflows contain a series of instructions
  to perform, with reference to other
  Cheshire3 objects
• Reference is via pseudo-unique identifiers
  … Pseudo because they are unique within
  the current context (Server vs Database)
• Workflows are objects, so this enables
  server level workflows to call database
  specific workflows with the same identifier
        Distributed Processing
• Each node in the cluster instantiates the
  configured architecture, potentially through a
  single ConfigStore.
• Master nodes then run a high level workflow to
  distribute the processing amongst Slave nodes
  by reference to a subsidiary workflow
• As object interaction is well defined in the model,
  the result of a workflow is equally well defined.
  This allows for the easy chaining of workflows,
  either locally or spread throughout the cluster.
             Workflow Example1
<subConfig id="buildWorkflow">
<objectType>workflow.SimpleWorkflow</objectType>
<workflow>
 <log>Starting Load</log>
 <object type="recordStore" function="begin_storing"/>
 <object type="database" function="begin_indexing"/>
 <for-each>
   <object type="workflow" ref="buildSingleWorkflow"/>
 </for-each>
 <object type="recordStore" function="commit_storing"/>
 <object type="database" function="commit_indexing"/>
 <object type="database" function="commit_metadata"/>
</workflow>
</subConfig>
               Workflow Example2
<subConfig id="buildSingleWorkflow">
<objectType>workflow.SimpleWorkflow</objectType>
<workflow>
 <object type="workflow" ref="PreParserWorkflow"/>
 <try>
   <object type="parser" ref="NsSaxParser"/>
 </try>
 <except>
   <log>Unparsable Record</log>
   <raise/>
 </except>
 <object type="recordStore" function="create_record"/>
 <object type="database" function="add_record"/>
 <object type="database" function="index_record"/>
 <log>Loaded Record</log>
</workflow>
</subConfig>
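To make the structure of such a definition concrete, the sketch below walks the second example with Python's standard XML parser and lists its steps in order. It only describes the steps; it does not execute them, and it is not how Cheshire3 itself turns a workflow into code. The `describe` function and its output format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<subConfig id="buildSingleWorkflow">
<objectType>workflow.SimpleWorkflow</objectType>
<workflow>
 <object type="workflow" ref="PreParserWorkflow"/>
 <try>
   <object type="parser" ref="NsSaxParser"/>
 </try>
 <except>
   <log>Unparsable Record</log>
   <raise/>
 </except>
 <object type="recordStore" function="create_record"/>
 <object type="database" function="add_record"/>
 <object type="database" function="index_record"/>
 <log>Loaded Record</log>
</workflow>
</subConfig>
"""

def describe(node, depth=0):
    """Return one indented line per workflow step, in document order."""
    lines = []
    pad = "  " * depth
    for child in node:
        if child.tag == "object":
            # A step names either another configured object or a function.
            target = child.get("ref") or child.get("function")
            lines.append(f"{pad}call {child.get('type')}: {target}")
        elif child.tag == "log":
            lines.append(f"{pad}log: {child.text}")
        elif child.tag == "raise":
            lines.append(f"{pad}re-raise")
        else:
            # Container elements such as try, except or for-each.
            lines.append(f"{pad}{child.tag}:")
            lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(WORKFLOW_XML)
steps = describe(root.find("workflow"))
```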
         Workflow Standards
• Cheshire3 workflows do not conform to any
  standard schema
• Intentional:
  – Workflows are specific to and dependent on the
    Cheshire3 architecture
  – Replaces the distribution of lines of code for
    distributed processing
  – Replaces many lines of code in general
• Needs to be easy to understand and create
• GUI workflow builder coming (web and
  standalone)
         External Integration
• Looking at integration with existing cross-
  service workflow systems, in particular
  Kepler/Ptolemy
• Possible integration at two levels:
  – Cheshire3 as a service (black box) ... Identify
    a workflow to call.
  – Cheshire3 object as a service (duplicate
    existing workflow function) … But recall the
    access speed issue.
              Conclusions
• Scalable Grid-Based digital library
  services can be created and provide
  support for very large collections with
  improved efficiency
• The Cheshire3 IR and DL architecture can
  provide Grid (or single processor) services
  for next-generation DLs
• Available as open source via:
http://cheshire3.sourceforge.net or
http://www.cheshire3.org/
         Plan for today

• Wrap up spam
• Crawling
• Connectivity servers
         Link-based ranking
• Most search engines use hyperlink
  information for ranking
• Basic idea: Peer endorsement
  – Web page authors endorse their peers by
    linking to them
• Prototypical link-based ranking algorithm:
  PageRank
  – Page is important if linked to (endorsed) by
    many other pages
  – More so if other pages are themselves
    important
                    Link spam
• Link spam: Inflating the rank of a page by creating
  nepotistic links to it
   – From own sites: Link farms
   – From partner sites: Link exchanges
   – From unaffiliated sites (e.g. blogs, web forums, etc.)
• The more links, the better
   – Generate links automatically
   – Use scripts to post to blogs
   – Synthesize entire web sites (often infinite number of
     pages)
   – Synthesize many web sites (DNS spam; e.g.
     *.thrillingpage.info)
• The more important the linking page, the better
Link farms and link exchanges
       More spam techniques
• Cloaking
  – Serve fake content to search engine spider
  – DNS cloaking: Switch IP address. Impersonate
[Figure: cloaking decision. Is this a search engine spider? Y: serve SPAM; N: serve the real doc. Caption: Tutorial on Cloaking & Stealth Technology.]
     More spam techniques
• Doorway pages
 – Pages optimized for a single keyword that re-
   direct to the real target page
• Robots
 – Fake query stream – rank checking programs
   • “Curve-fit” ranking programs of search engines
 – Millions of submissions via Add-Url
                 Acid test
• Which SEOs rank highly on the query
  seo?
• Web search engines have policies on SEO
  practices they tolerate/block
  – See pointers in Resources
• Adversarial IR: the unending (technical)
  battle between SEOs and web search
  engines
• See for instance
  http://airweb.cse.lehigh.edu/
Crawling
              Crawling Issues
• How to crawl?
  – Quality: “Best” pages first
  – Efficiency: Avoid duplication (or near duplication)
  – Etiquette: Robots.txt, Server load concerns


• How much to crawl? How much to index?
  – Coverage: How big is the Web? How much do we
    cover?
  – Relative Coverage: How much do competitors have?


• How often to crawl?
  – Freshness: How much has changed?
     Basic crawler operation
• Begin with known “seed” pages
• Fetch and parse them
  – Extract URLs they point to
  – Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
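The loop above can be sketched as follows. To keep the example runnable without network access, fetching and parsing are stood in for by a hypothetical `fetch_links` callback over a toy in-memory web; the `seen` set and the `max_pages` cap are additions that the slide does not mention but that any real crawl loop needs in some form.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: fetch each queued URL, enqueue the URLs it points to."""
    queue = deque(seeds)
    seen = set(seeds)      # assumption: never enqueue the same URL twice
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):   # stands in for fetch + parse + extract
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

# Hypothetical toy web: page -> pages it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```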
 Simple picture – complications
• Web crawling isn’t feasible with one
  machine
  – All of the above steps distributed
• Even non-malicious pages pose
  challenges
  – Latency/bandwidth to remote servers vary
  – Robots.txt stipulations
     • How “deep” should you crawl a site’s URL
       hierarchy?
  – Site mirrors and duplicate pages
• Malicious pages
                 Robots.txt
• Protocol for giving spiders (“robots”)
  limited access to a website, originally from
  1994
  – www.robotstxt.org/wc/norobots.html
• Website announces its request on what
  can(not) be crawled
  – For a URL, create a file URL/robots.txt
  – This file specifies access restrictions
         Robots.txt example
• No robot should visit any URL starting with
  "/yoursite/temp/", except the robot called
  "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
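
A crawler can check these rules with the robots.txt parser in Python's standard library. The sketch below feeds it the example above as text rather than fetching it from a site; the robot name "otherbot" and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """True if the named robot may fetch the given path under the rules above."""
    return parser.can_fetch(agent, path)

searchengine_ok = allowed("searchengine", "/yoursite/temp/page.html")
otherbot_ok = allowed("otherbot", "/yoursite/temp/page.html")
```

An empty `Disallow:` line means nothing is disallowed, which is why the robot called "searchengine" is let through.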
    Crawling and Corpus Construction
•   Crawl order
•   Distributed crawling
•   Filtering duplicates
•   Mirror detection
       Where do we spider next?
[Figure: the Web, showing the URLs crawled and parsed and the URLs in queue.]
              Crawl Order
• Want best pages first
• Potential quality measures:
    • Final In-degree
    • Final Pagerank (a measure of page quality we’ll define
      later in the course)
• Crawl heuristic:
    • Breadth First Search (BFS)
    • Partial Indegree
    • Partial Pagerank
    • Random walk
BFS & Spam (Worst case scenario)
• Assume normal avg outdegree = 10, and that the spammer is able to
  generate dynamic pages with 1000 outlinks
• BFS depth = 2
  – 100 URLs on the queue, including a spam page
• BFS depth = 3
  – 2000 URLs on the queue
  – 50% belong to the spammer
• BFS depth = 4
  – 1.01 million URLs on the queue
  – 99% belong to the spammer
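The numbers in this scenario can be reproduced with a short calculation. The sketch assumes that normal pages link only to new normal pages, that every spam page links to 1000 new spam pages, and that no URL is discovered twice; these simplifications are assumptions of the example, chosen to match the worst case described.

```python
def bfs_queue_sizes(max_depth, normal_outdegree=10, spam_outdegree=1000):
    """Return (depth, urls_on_queue, spam_fraction) for depths 2..max_depth."""
    normal = normal_outdegree ** 2 - 1   # depth 2: 100 URLs, one of them spam
    spam = 1
    rows = []
    for depth in range(2, max_depth + 1):
        total = normal + spam
        rows.append((depth, total, spam / total))
        normal *= normal_outdegree       # each normal page adds 10 URLs
        spam *= spam_outdegree           # each spam page adds 1000 URLs
    return rows

rows = bfs_queue_sizes(4)
```

Under these assumptions the queue holds 100, 1,990 and 1,009,900 URLs at depths 2, 3 and 4, which rounds to the 2000 and 1.01 million quoted above, with the spammer's share going from 1% to about 50% to about 99%.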
      Where do we spider next?
• Keep all spiders busy
• Keep spiders from treading on each
  others’ toes
    – Avoid fetching duplicates repeatedly
•   Respect politeness/robots.txt
•   Avoid getting stuck in traps
•   Detect/minimize spam
•   Get the “best” pages
    – What’s best?
    Where do we spider next?
• Complex scheduling optimization problem,
  subject to all the constraints listed
  – Plus operational constraints (e.g., keeping all
    machines load-balanced)
• Scientific study – limited to specific
  aspects
  – Which ones?
  – What do we measure?
• What are the compromises in distributed
  crawling?
             Parallel Crawlers
• We follow the treatment of Cho and
  Garcia-Molina:
  – http://www2002.org/CDROM/refereed/108/index.html

• Raises a number of questions in a clean
  setting, for further study
• Setting: we have a number of c-proc’s
  – c-proc = crawling process
• Goal: we wish to spider the best pages
  with minimum overhead
  – What do these mean?
           Distributed model
• Crawlers may be running in diverse
  geographies – Europe, Asia, etc.
  – Periodically update a master index
  – Incremental update so this is “cheap”
    • Compression, differential update etc.
  – Focus on communication overhead during the
    crawl
• Also results in dispersed WAN load
   c-proc’s crawling the web
[Figure: c-procs crawling the web, showing URLs crawled and URLs in queues. Which c-proc gets this URL? Communication: by URLs passed between c-procs.]
            Measurements
• Overlap = (N-I)/I where
  – N = number of pages fetched
  – I = number of distinct pages fetched
• Coverage = I/U where
  – U = Total number of web pages
• Quality = sum over downloaded pages of
  their importance
  – Importance of a page = its in-degree
• Communication overhead =
  – Number of URLs c-proc’s exchange
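The first three measures can be computed directly from a fetch log. The sketch below follows the definitions above; the fetch log, the in-degree table and the web size U are made-up values, and quality is summed over distinct downloaded pages, which is an interpretation of "downloaded pages".

```python
def crawl_metrics(fetched, total_web_pages, in_degree):
    """Return (overlap, coverage, quality) for a list of fetched page ids.

    fetched may contain repeats; in_degree maps page id -> importance.
    """
    n = len(fetched)                 # N = number of pages fetched
    distinct = set(fetched)
    i = len(distinct)                # I = number of distinct pages fetched
    if i == 0:
        raise ValueError("no pages fetched")
    overlap = (n - i) / i
    coverage = i / total_web_pages   # U = total number of web pages
    quality = sum(in_degree.get(page, 0) for page in distinct)
    return overlap, coverage, quality

# Hypothetical fetch log from several c-procs, with two duplicate fetches.
fetched = ["a", "b", "a", "c", "b", "d"]
in_degree = {"a": 3, "b": 1, "c": 0, "d": 2}
overlap, coverage, quality = crawl_metrics(fetched, 10, in_degree)
```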
          Crawler variations
• c-procs are independent
  – Fetch pages oblivious to each other.
• Static assignment
  – Web pages partitioned statically a priori, e.g.,
    by URL hash … more to follow
• Dynamic assignment
  – Central co-ordinator splits URLs among c-
    procs
           Static assignment
• Firewall mode: each c-proc only fetches
  URLs within its partition – typically a
  domain
  – inter-partition links not followed
• Crossover mode: c-proc may follow
  inter-partition links into another partition
  – possibility of duplicate fetching
• Exchange mode: c-procs periodically
  exchange URLs they discover in another
  partition
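A static assignment by URL hash, and what each mode does with a link that belongs to another partition, might look like the sketch below. Hashing the host name with SHA-256 is an assumption made for the example, and crossover mode is simplified to "always fetch", whereas the slide only says the c-proc may follow such links.

```python
import hashlib
from urllib.parse import urlparse

def owner(url, num_procs):
    """Statically assign a URL to a c-proc by hashing its host name."""
    host = urlparse(url).netloc
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_procs

def handle_link(url, my_id, num_procs, mode):
    """Decide what c-proc my_id does with a discovered URL."""
    if owner(url, num_procs) == my_id:
        return "fetch"
    if mode == "firewall":
        return "drop"     # inter-partition links not followed
    if mode == "crossover":
        return "fetch"    # simplification: risks duplicate fetching
    if mode == "exchange":
        return "send"     # pass the URL to the c-proc that owns it
    raise ValueError(f"unknown mode: {mode}")

url = "http://www.stanford.edu/biology"
home = owner(url, 4)
other = (home + 1) % 4
```

Hashing the host rather than the full URL keeps all pages of one site in the same partition, which matches the slide's remark that a partition is typically a domain.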
             Experiments
• 40M URL graph – Stanford Webbase
  – Open Directory (dmoz.org) URLs as seeds
• Should be considered a small Web
        Summary of findings
• Cho/Garcia-Molina detail many findings
  – We will review some here, both qualitatively
    and quantitatively
  – You are expected to understand the reason
    behind each qualitative finding in the paper
  – You are not expected to remember quantities
    in their plots/studies
     Firewall mode coverage
• The price of crawling in firewall mode
     Crossover mode overlap
• Demanding coverage drives up overlap
  Exchange mode communication
  • Communication overhead sublinear
[Figure: communication overhead per downloaded URL.]
Connectivity servers
          Connectivity Server
     [CS1: Bhar98b, CS2 & 3: Rand01]
• Support for fast queries on the web graph
  – Which URLs point to a given URL?
  – Which URLs does a given URL point to?
• Stores mappings in memory from
  – URL to outlinks, URL to inlinks
• Applications
  – Crawl control
  – Web graph analysis
     • Connectivity, crawl optimization
  – Link analysis
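The two mappings can be sketched as a pair of in-memory dictionaries built from a list of edges. This is only an uncompressed toy; the class name, method names and example edges are hypothetical, and the cited systems rely on compact encodings rather than Python sets.

```python
from collections import defaultdict

class ConnectivityServer:
    """Toy in-memory store of URL -> outlinks and URL -> inlinks."""

    def __init__(self, edges):
        self.outlinks = defaultdict(set)
        self.inlinks = defaultdict(set)
        for source, target in edges:
            self.outlinks[source].add(target)
            self.inlinks[target].add(source)

    def points_to(self, url):
        """Which URLs does the given URL point to?"""
        return sorted(self.outlinks.get(url, ()))

    def pointed_to_by(self, url):
        """Which URLs point to the given URL?"""
        return sorted(self.inlinks.get(url, ()))

# Hypothetical edges (source URL, target URL).
server = ConnectivityServer([
    ("a.example/1", "b.example/1"),
    ("a.example/2", "b.example/1"),
    ("b.example/1", "a.example/1"),
])
inlinks = server.pointed_to_by("b.example/1")
```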
   Most recent published work
• Boldi and Vigna
  – http://www2004.org/proceedings/docs/1p595.pdf
• Webgraph – set of algorithms and a java
  implementation
• Fundamental goal – maintain node
  adjacency lists in memory
  – For this, compressing the adjacency lists is
    the critical component
            Adjacency lists
• The set of neighbors of a node
• Assume each URL represented by an
  integer
• Properties exploited in compression:
  – Similarity (between lists)
  – Locality (many links from a page go to
    “nearby” pages)
  – Use gap encodings in sorted lists
  – Distribution of gap values
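Gap encoding of a sorted adjacency list can be sketched as below. Because of locality, consecutive neighbor ids tend to be close together, so the gaps are small numbers that a variable-length code can store in few bits. The example list is made up, and the sketch stores the first id as is, which is a simplification of what a real scheme such as Boldi and Vigna's does.

```python
def gap_encode(sorted_ids):
    """Replace each id in a sorted list by its difference from the previous id."""
    gaps, previous = [], 0
    for node in sorted_ids:
        gaps.append(node - previous)
        previous = node
    return gaps

def gap_decode(gaps):
    """Invert gap_encode by taking running sums."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

# Hypothetical adjacency list whose neighbors are "nearby" pages.
neighbors = [1000, 1003, 1004, 1010]
gaps = gap_encode(neighbors)
```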
              Storage
• Boldi/Vigna get down to an average
  of ~3 bits/link
  – (URL to URL edge)
  – For a 118M node web graph
  – Why is this remarkable?
• How?
     Main ideas of Boldi/Vigna
• Consider lexicographically ordered list of
  all URLs, e.g.,
  – www.stanford.edu/alchemy
  – www.stanford.edu/biology
  – www.stanford.edu/biology/plant
  – www.stanford.edu/biology/plant/copyright
  – www.stanford.edu/biology/plant/people
  – www.stanford.edu/chemistry
                Boldi/Vigna
• Each of these URLs has an adjacency list
• Main thesis: because of templates, the
  adjacency list of a node is similar to one of
  the 7 preceding URLs in the lexicographic
  ordering
  – Why 7?
• Express adjacency list in terms of one of
  these
• E.g., consider these adjacency lists
  – 1, 2, 4, 8, 16, 32, 64
  – Encode as (-2), remove 9, add 8
    1, 4, 9, 16, 25, 36, 49, 64
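Expressing one adjacency list in terms of another can be sketched as storing only the differences. The example assumes the list being encoded is 1, 4, 8, 16, 25, 36, 49, 64, which is what "remove 9, add 8" produces from the list 1, 4, 9, 16, 25, 36, 49, 64; that target list is an assumption, and the set-difference encoding below is a simplification rather than the actual Boldi/Vigna format.

```python
def encode_against(reference, target):
    """Describe target as: copy reference, remove some ids, add some ids."""
    ref, tgt = set(reference), set(target)
    return sorted(ref - tgt), sorted(tgt - ref)

def decode_against(reference, removed, added):
    """Rebuild the target list from the reference list and the differences."""
    return sorted((set(reference) - set(removed)) | set(added))

reference = [1, 4, 9, 16, 25, 36, 49, 64]
target = [1, 4, 8, 16, 25, 36, 49, 64]   # assumed list being encoded
removed, added = encode_against(reference, target)
```

Together with a small offset such as (-2) naming which preceding list is the reference, the two short difference lists are all that needs to be stored.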
              Resources
• www.robotstxt.org/wc/norobots.html
• www2002.org/CDROM/refereed/108/index.html
• www2004.org/proceedings/docs/1p595.pdf

						