Standard Web Search Engine Architecture
[Figure: the crawler crawls the web; documents are checked for duplicates and stored under DocIds; an inverted index is created from them; search engine servers answer the user query from the inverted index and show results to the user.]
[Figure: more detailed architecture, from Brin & Page 98. It covers only the preprocessing in detail, not the query serving.]
Indexes for Web Search Engines
• Inverted indexes are still used, even though the
web is so huge
• Most current web search systems partition the
indexes across different machines
– Each machine handles different parts of the data
(Google uses thousands of PC-class processors and
keeps most things in main memory)
• Other systems duplicate the data across many
machines
– Queries are distributed among the machines
• Most do a combination of these
Search Engine Querying
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
• Each row can handle 120 queries per second
• Each column can handle 7M pages
• To handle more queries, add another row.
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
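The row/column arithmetic on this slide can be sketched in a few lines of Python. The per-row and per-column figures are taken from the slide; the function names are made up for illustration, and the model assumes capacity scales linearly, which a real deployment may not.

```python
import math

QUERIES_PER_ROW = 120          # queries per second one row can serve (from the slide)
PAGES_PER_COLUMN = 7_000_000   # pages one column can hold (from the slide)


def cluster_capacity(rows, columns):
    """Capacity of a rows x columns grid, assuming linear scaling."""
    return {
        "machines": rows * columns,
        "queries_per_second": rows * QUERIES_PER_ROW,
        "pages_indexed": columns * PAGES_PER_COLUMN,
    }


def rows_needed(target_qps):
    """Smallest number of rows that reaches the target query rate."""
    return math.ceil(target_qps / QUERIES_PER_ROW)
```

For example, `cluster_capacity(3, 4)` describes 12 machines serving 360 queries per second over 28M pages, and `rows_needed(500)` is 5.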
Querying: Cascading Allocation of CPUs
• A variation on this that produces a cost saving:
– Put high-quality/common pages on many
machines
– Put lower quality/less common pages on
fewer machines
– Query goes to high quality machines first
– If no hits found there, go to other machines
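A minimal sketch of the cascade, assuming each tier is simply a function that returns a list of hits. The tier contents and names below are invented for illustration; a real system would route the query to groups of machines.

```python
def cascaded_search(query, tiers):
    """Try tiers in order (high-quality first); stop at the first tier with hits."""
    for search_tier in tiers:
        hits = search_tier(query)
        if hits:
            return hits
    return []


def make_tier(index):
    """Build a toy tier from a dict mapping a query term to matching pages."""
    return lambda query: index.get(query, [])


# Hypothetical data: a small high-quality tier and a larger low-quality tier.
high_quality = make_tier({"toyota": ["toyota.com"]})
low_quality = make_tier({"toyota": ["fan-page.example"], "obscure": ["rare.example"]})
```

With these toy tiers, `cascaded_search("toyota", [high_quality, low_quality])` is answered by the first tier alone, while `"obscure"` falls through to the second.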
Google
• Google maintains (probably) the world's
largest Linux cluster (over 15,000 servers)
• These are partitioned between index
servers and page servers
– Index servers resolve the queries (massively
parallel processing)
– Page servers deliver the results of the queries
• Over 8 Billion web pages are indexed and
served by Google
Search Engine Indexes
• Starting Points for Users include
• Manually compiled lists
– Directories
• Page “popularity”
– Frequently visited pages (in general)
– Frequently visited pages as a result of a query
• Link “co-citation”
– Which sites are linked to by other sites?
Starting Points: What is Really
Being Used?
• Today's search engines combine these
methods in various ways
– Integration of Directories
• Today most web search engines integrate
categories into the results listings
• Lycos, MSN, Google
– Link analysis
• Google uses it; others are also using it
• Words on the links seem to be especially useful
– Page popularity
• Many use DirectHit’s popularity rankings
Web Page Ranking
• Varies by search engine
– Pretty messy in many cases
– Details usually proprietary and fluctuating
• Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc)
– Term characteristics (boldface, capitalized, etc)
– Link analysis information
– Category information
– Popularity information
Ranking: Hearst ‘96
• Proximity search can help get high-
precision results if >1 term
– Combine Boolean and passage-level
proximity
– Shows significant improvements when
retrieving top 5, 10, 20, 30 documents
– Results reproduced by Mitra et al. 98
– Google uses something similar
Ranking: Link Analysis
• Assumptions:
– If the pages pointing to this page are good,
then this is also a good page
– The words on the links pointing to this page
are useful indicators of what this page is
about
– References: Page et al. 98, Kleinberg 98
Ranking: Link Analysis
• Why does this work?
– The official Toyota site will be linked to by lots
of other official (or high-quality) sites
– The best Toyota fan-club site probably also
has many links pointing to it
– Less high-quality sites do not have as many
high-quality sites linking to them
Ranking: PageRank
• Google uses PageRank
• We assume page A has pages T1...Tn which
point to it (i.e., are citations). The parameter d is
a damping factor which can be set between 0
and 1. d is usually set to 0.85. C(A) is defined as
the number of links going out of page A. The
PageRank of a page A is given as follows:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn))
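• Note: Brin & Page describe the PageRanks as forming a probability
distribution over web pages, so that the sum of all web pages'
PageRanks is one. Strictly, that holds for the normalized variant that
uses (1-d)/N, with N the number of pages, in place of (1-d); with the
formula as written, individual values of 1 or more can occur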
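The formula can be evaluated by simple iteration. The sketch below applies it as written on the slide (with the constant term (1-d)) to a small, made-up link graph. The function name, the starting value of 1.0 and the fixed iteration count are illustrative choices, not part of the original description.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[src] / len(links[src])        # PR(T) / C(T)
                for src in pages
                if page in links.get(src, ())    # src links to page
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph: A links to B and C, B links to C, C links back to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
toy_ranks = pagerank(toy_links)
```

In this toy graph C ends up with the highest value, since it is pointed to by both A and B. Because every page here has outgoing links, the values produced by this form of the formula sum to the number of pages (3), not to 1.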
PageRank
Note: these are not real PageRanks, since they include values >= 1
[Figure: example link graph with pages X1, X2, T1 to T8 and A, each labeled with a computed Pr value. Most nodes show Pr=1; the other values shown are Pr=.725, Pr=2.46625 and Pr=4.2544375.]
PageRank
• Similar to calculations used in scientific citation
analysis (e.g., Garfield et al.) and social network
analysis (e.g., Wasserman et al.)
• Similar to other work on ranking (e.g., the hubs
and authorities of Kleinberg et al.)
• How is Amazon similar to Google in terms of the
basic insights and techniques of PageRank?
• How could PageRank be applied to other
problems and domains?
Today
• Review
– Web Crawling and Search Issues
– Web Search Engines and Algorithms
• Web Search Processing
– Parallel Architectures (Inktomi – Eric Brewer)
– Cheshire III Design
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer
Presentation from DLF Forum April 2005
Digital Library Grid Initiatives:
Cheshire3 and the Grid
Ray R. Larson
University of California, Berkeley
School of Information Management and Systems
Rob Sanderson
University of Liverpool
Dept. of Computer Science
Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation
Overview
• The Grid, Text Mining and Digital Libraries
– Grid Architecture
– Grid IR Issues
• Cheshire3: Bringing Search to Grid-Based
Digital Libraries
– Overview
– Grid Experiments
– Cheshire3 Architecture
– Distributed Workflows
Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)
[Figure: layered Grid architecture, listed top to bottom]
• Applications: Astrophysics, High energy physics, Combustion, Cosmology, …
• Application Toolkits: Collaboratories, Remote Visualization, Remote Computing, Data Grid, Remote sensors, Portals, …
• Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services
Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Figure: layered Grid architecture extended for digital libraries, listed top to bottom]
• Applications: Astrophysics, High energy physics, Combustion, Cosmology, Bio-Medical, Humanities computing, Digital Libraries, …
• Application Toolkits: Collaboratories, Metadata management, Remote Visualization, Text Mining, Remote Computing, Data Grid, Search & Retrieval, Remote sensors, Portals, …
• Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services
Grid-Based Digital Libraries
• Large-scale distributed storage
requirements and technologies
• Organizing distributed digital collections
• Shared Metadata – standards and
requirements
• Managing distributed digital collections
• Security and access control
• Collection Replication and backup
• Distributed Information Retrieval issues
and algorithms
Grid IR Issues
• Want to preserve the same retrieval
performance (precision/recall) while hopefully
increasing efficiency (i.e., speed)
• Very large-scale distribution of resources is a
challenge for sub-second retrieval
• Different from most other typical Grid processes,
IR is potentially less computing intensive and
more data intensive
• In many ways Grid IR replicates the process
(and problems) of metasearch or distributed
search
Cheshire3 Overview
• XML Information Retrieval Engine
– 3rd Generation of the UC Berkeley Cheshire system,
as co-developed at the University of Liverpool.
– Uses Python for flexibility and extensibility, but
imports C/C++ based libraries for processing speed
– Standards based: XML, XSLT, CQL, SRW/U, Z39.50,
OAI to name a few.
– Grid capable. Uses distributed configuration files,
workflow definitions and PVM (currently) to scale from
one machine to thousands of parallel nodes.
– Free and Open Source Software. (GPL Licence)
– http://www.cheshire3.org/ (under development!)
Cheshire3 SERVER Overview
[Figure: Cheshire3 server overview. Labels recoverable from the diagram: API and native calls; protocol handlers for Z39.50, SOAP, OAI, SRW, OpenURL, UDDI, WSRP, OGIS and JDBC; normalization; server config and control; staff UI; users/clients; Fetch ID and Put ID; DB API; local DB; remote systems (any protocol); user info, config info, access info, result sets, and XML indexes & metadata.]
Cheshire3 Grid Tests
• Running on a 30-processor cluster in
Liverpool using PVM (parallel virtual
machine)
• Using 16 processors with one “master”
and 22 “slave” processes we were able to
parse and index MARC data at about
13000 records per second
• On a similar setup 610 Mb of TEI data can
be parsed and indexed in seconds
SRB and SDSC Experiments
• We are working with SDSC to include SRB
support
• We are planning to continue working with SDSC
and to run further evaluations using the TeraGrid
server(s) through a “small” grant for 30000 CPU
hours
– SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes,
each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak
performance of 3.1 teraflops. The nodes are equipped with four
gigabytes (GBs) of physical memory per node. The cluster is running
SuSE Linux and is using Myricom's Myrinet cluster interconnect
network.
• Planned large-scale test collections include
NSDL, the NARA repository, CiteSeer and the
“million books” collections of the Internet Archive
Cheshire3 Object Model
[Figure: Cheshire3 object model relating Server, Database, ConfigStore, UserStore, User, Protocol Handler, Query, ResultSet, Index, IndexStore, Terms, Normaliser, Extracter, Record, RecordStore, Parser, PreParser, Document, DocumentGroup, DocumentStore, Transformer and the ingest process.]
Cheshire3 Data Objects
• DocumentGroup:
– A collection of Document objects (e.g. from a file, directory, or
external search)
• Document:
– A single item, in any format (e.g. PDF file, raw XML string,
relational table)
• Record:
– A single item, represented as parsed XML
• Query:
– A search query, in the form of CQL (an abstract query language
for Information Retrieval)
• ResultSet:
– An ordered list of pointers to records
• Index:
– An ordered list of terms extracted from Records
Cheshire3 Process Objects
• PreParser:
– Given a Document, transform it into another Document (e.g. PDF
to Text, Text to XML)
• Parser:
– Given a Document as a raw XML string, return a parsed Record
for the item.
• Transformer:
– Given a Record, transform it into a Document (e.g. via XSLT,
from XML to PDF, or XML to relational table)
• Extracter:
– Extract terms of a given type from an XML sub-tree (e.g. extract
Dates, Keywords, Exact string value)
• Normaliser:
– Given the results of an extracter, transform the terms,
maintaining the data structure (e.g. CaseNormaliser)
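The chain of process objects (PreParser, Parser, Extracter, Normaliser) can be illustrated with plain Python functions. This is only a sketch of the data flow described above. The function names and the `<doc><title>` layout are invented for illustration and are not the Cheshire3 API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_to_xml_preparser(document):
    """PreParser: Document -> Document (here plain text -> raw XML string)."""
    return "<doc><title>%s</title></doc>" % escape(document)


def parser(document):
    """Parser: raw XML string -> parsed Record (an ElementTree element)."""
    return ET.fromstring(document)


def keyword_extracter(record, path="title"):
    """Extracter: pull whitespace-separated keywords from one XML sub-tree."""
    node = record.find(path)
    return node.text.split() if node is not None and node.text else []


def case_normaliser(terms):
    """Normaliser: transform the terms, keeping the list structure."""
    return [term.lower() for term in terms]


def index_terms(raw_text):
    """Run the whole chain on one Document."""
    record = parser(text_to_xml_preparser(raw_text))
    return case_normaliser(keyword_extracter(record))
```

Each step takes the output type of the previous one, which is what lets such steps be chained in configurable workflows.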
Cheshire3 Abstract Objects
• Server:
– A logical collection of databases
• Database:
– A logical collection of Documents, their
Record representations and Indexes of
extracted terms.
• Workflow:
– A 'meta-process' object that takes a workflow
definition in XML and converts it into
executable code.
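To make the idea of a workflow object concrete, here is a toy interpreter for a small XML workflow. It is a hedged sketch only: the element names (`log`, `object`, `try`, `except`) are loosely modeled on Cheshire3 workflow definitions, but the interpreter, the `objects` registry and the demo workflow are assumptions for illustration, and Cheshire3 itself converts workflows to executable code rather than interpreting them this way.

```python
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<subConfig id="demoWorkflow">
  <workflow>
    <log>Starting</log>
    <try>
      <object type="parser" ref="demoParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
    </except>
    <log>Done</log>
  </workflow>
</subConfig>
"""


def run_workflow(xml_text, objects, data, log):
    """objects maps (type, ref) to a callable taking and returning the data."""
    steps = list(ET.fromstring(xml_text.strip()).find("workflow"))
    return _run_steps(steps, objects, data, log)


def _run_steps(steps, objects, data, log):
    i = 0
    while i < len(steps):
        step = steps[i]
        if step.tag == "log":
            log.append(step.text)
        elif step.tag == "object":
            data = objects[(step.get("type"), step.get("ref"))](data)
        elif step.tag == "try":
            # An <except> element directly after <try> acts as its handler.
            has_handler = i + 1 < len(steps) and steps[i + 1].tag == "except"
            try:
                data = _run_steps(list(step), objects, data, log)
            except Exception:
                if not has_handler:
                    raise
                data = _run_steps(list(steps[i + 1]), objects, data, log)
            if has_handler:
                i += 1  # skip the <except> element itself
        i += 1
    return data
```

Calling `run_workflow(WORKFLOW_XML, {("parser", "demoParser"): str.upper}, "rec", log)` with an empty list `log` returns `"REC"` and leaves the two log messages "Starting" and "Done" in `log`.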
Workflow Objects
• Workflows are first class objects in
Cheshire3 (though not represented in the
model diagram)
• All Process and Abstract objects have
individual XML configurations with a
common base schema with extensions
• We can treat configurations as Records
and store in regular RecordStores,
allowing access via regular IR protocols.
Workflow References
• Workflows contain a series of instructions
to perform, with reference to other
Cheshire3 objects
• Reference is via pseudo-unique identifiers
… Pseudo because they are unique within
the current context (Server vs Database)
• Workflows are objects, so this enables
server level workflows to call database
specific workflows with the same identifier
Distributed Processing
• Each node in the cluster instantiates the
configured architecture, potentially through a
single ConfigStore.
• Master nodes then run a high level workflow to
distribute the processing amongst Slave nodes
by reference to a subsidiary workflow
• As object interaction is well defined in the model,
the result of a workflow is equally well defined.
This allows for the easy chaining of workflows,
either locally or spread throughout the cluster.
Workflow Example1
<subConfig id="buildWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <log>Starting Load</log>
    <object type="recordStore" function="begin_storing"/>
    <object type="database" function="begin_indexing"/>
    <for-each>
      <object type="workflow" ref="buildSingleWorkflow"/>
    </for-each>
    <object type="recordStore" function="commit_storing"/>
    <object type="database" function="commit_indexing"/>
    <object type="database" function="commit_metadata"/>
  </workflow>
</subConfig>
Workflow Example2
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>Loaded Record</log>
  </workflow>
</subConfig>
Workflow Standards
• Cheshire3 workflows do not conform to any
standard schema
• Intentional:
– Workflows are specific to and dependent on the
Cheshire3 architecture
– Replaces the distribution of lines of code for
distributed processing
– Replaces many lines of code in general
• Needs to be easy to understand and create
• GUI workflow builder coming (web and
standalone)
External Integration
• Looking at integration with existing cross-
service workflow systems, in particular
Kepler/Ptolemy
• Possible integration at two levels:
– Cheshire3 as a service (black box) ... Identify
a workflow to call.
– Cheshire3 object as a service (duplicate
existing workflow function) … But recall the
access speed issue.
Conclusions
• Scalable Grid-Based digital library
services can be created and provide
support for very large collections with
improved efficiency
• The Cheshire3 IR and DL architecture can
provide Grid (or single processor) services
for next-generation DLs
• Available as open source via:
http://cheshire3.sourceforge.net or
http://www.cheshire3.org/
Plan for today
• Wrap up spam
• Crawling
• Connectivity servers
Link-based ranking
• Most search engines use hyperlink
information for ranking
• Basic idea: Peer endorsement
– Web page authors endorse their peers by
linking to them
• Prototypical link-based ranking algorithm:
PageRank
– Page is important if linked to (endorsed) by
many other pages
– More so if other pages are themselves
important
Link spam
• Link spam: Inflating the rank of a page by creating
nepotistic links to it
– From own sites: Link farms
– From partner sites: Link exchanges
– From unaffiliated sites (e.g. blogs, web forums, etc.)
• The more links, the better
– Generate links automatically
– Use scripts to post to blogs
– Synthesize entire web sites (often infinite number of
pages)
– Synthesize many web sites (DNS spam; e.g.
*.thrillingpage.info)
• The more important the linking page, the better
Link farms and link exchanges
More spam techniques
• Cloaking
– Serve fake content to search engine spider
– DNS cloaking: Switch IP address. Impersonate
[Figure: cloaking. Is this a search engine spider? Y: serve SPAM. N: serve the real doc.]
[Figure: “Tutorial on Cloaking & Stealth Technology”]
More spam techniques
• Doorway pages
– Pages optimized for a single keyword that re-
direct to the real target page
• Robots
– Fake query stream – rank checking programs
• “Curve-fit” ranking programs of search engines
– Millions of submissions via Add-Url
Acid test
• Which SEOs rank highly on the query
seo?
• Web search engines have policies on SEO
practices they tolerate/block
– See pointers in Resources
• Adversarial IR: the unending (technical)
battle between SEOs and web search
engines
• See for instance
http://airweb.cse.lehigh.edu/
Crawling
Crawling Issues
• How to crawl?
– Quality: “Best” pages first
– Efficiency: Avoid duplication (or near duplication)
– Etiquette: Robots.txt, Server load concerns
• How much to crawl? How much to index?
– Coverage: How big is the Web? How much do we
cover?
– Relative Coverage: How much do competitors have?
• How often to crawl?
– Freshness: How much has changed?
Basic crawler operation
• Begin with known “seed” pages
• Fetch and parse them
– Extract URLs they point to
– Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
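The loop above can be sketched as follows. To keep the example self-contained it crawls a made-up in-memory "web" (a dict from URL to outlinks) instead of fetching over the network; `fetch_links`, `TOY_WEB` and `max_pages` are illustrative assumptions.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Basic crawler loop: fetch a URL, extract its links, queue the new ones."""
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # URLs already queued, to avoid duplicates
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):   # "fetch and parse" stand-in
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


# Hypothetical toy web: page -> pages it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
crawl_order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

Because the queue is first-in first-out, this visits pages in breadth-first order from the seeds.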
Simple picture – complications
• Web crawling isn’t feasible with one
machine
– All of the above steps distributed
• Even non-malicious pages pose
challenges
– Latency/bandwidth to remote servers vary
– Robots.txt stipulations
• How “deep” should you crawl a site’s URL
hierarchy?
– Site mirrors and duplicate pages
• Malicious pages
Robots.txt
• Protocol for giving spiders (“robots”)
limited access to a website, originally from
1994
– www.robotstxt.org/wc/norobots.html
• Website announces its request on what
can(not) be crawled
– For a URL, create a file URL/robots.txt
– This file specifies access restrictions
Robots.txt example
• No robot should visit any URL starting with
"/yoursite/temp/", except the robot called
"searchengine":
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
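The effect of this file can be checked with Python's standard `urllib.robotparser` module. The sketch below feeds the example rules to the parser directly rather than downloading them; the host `www.example.com` and the robot name `otherbot` are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt rules from a string instead of fetching them."""
    robot_parser = RobotFileParser()
    robot_parser.parse(robots_txt.splitlines())
    return robot_parser


robot_rules = build_parser(ROBOTS_TXT)
TEMP_URL = "http://www.example.com/yoursite/temp/page.html"
HOME_URL = "http://www.example.com/yoursite/index.html"
```

Here `robot_rules.can_fetch("searchengine", TEMP_URL)` is allowed, while `robot_rules.can_fetch("otherbot", TEMP_URL)` is refused; both robots may fetch `HOME_URL`.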
Crawling and Corpus Construction
• Crawl order
• Distributed crawling
• Filtering duplicates
• Mirror detection
Where do we spider next?
[Figure: the Web, divided into URLs crawled and parsed and URLs in queue.]
Crawl Order
• Want best pages first
• Potential quality measures:
– Final In-degree
– Final Pagerank (a measure of page quality we’ll define later in the course)
• Crawl heuristic:
– Breadth First Search (BFS)
– Partial Indegree
– Partial Pagerank
– Random walk
BFS & Spam (Worst case scenario)
[Figure: BFS expanding outward from a start page]
• Normal avg outdegree = 10
• Assume the spammer is able to generate dynamic pages with 1000 outlinks
• BFS depth = 2: 100 URLs on the queue, including a spam page
• BFS depth = 3: 2000 URLs on the queue, 50% belong to the spammer
• BFS depth = 4: 1.01 million URLs on the queue, 99% belong to the spammer
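The numbers on this slide follow from simple multiplication, reproduced in the sketch below. It assumes, as a simplification, that normal pages link only to new normal pages (10 each), spam pages link only to new spam pages (1000 each), and no URL is seen twice. The function name is made up.

```python
def bfs_queue_sizes(normal=99, spam=1, depth=2, max_depth=4,
                    normal_outdegree=10, spam_outdegree=1000):
    """Return (depth, URLs on queue, spam fraction) for each BFS level."""
    rows = []
    while True:
        total = normal + spam
        rows.append((depth, total, spam / total))
        if depth == max_depth:
            return rows
        # Every queued page contributes its outlinks to the next level.
        normal *= normal_outdegree
        spam *= spam_outdegree
        depth += 1


levels = bfs_queue_sizes()
```

Starting from 100 URLs with one spam page at depth 2, this gives 1,990 URLs at depth 3 (about half spam) and 1,009,900 URLs at depth 4 (about 99% spam), matching the rounded figures on the slide.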
Where do we spider next?
• Keep all spiders busy
• Keep spiders from treading on each
others’ toes
– Avoid fetching duplicates repeatedly
• Respect politeness/robots.txt
• Avoid getting stuck in traps
• Detect/minimize spam
• Get the “best” pages
– What’s best?
Where do we spider next?
• Complex scheduling optimization problem,
subject to all the constraints listed
– Plus operational constraints (e.g., keeping all
machines load-balanced)
• Scientific study – limited to specific
aspects
– Which ones?
– What do we measure?
• What are the compromises in distributed
crawling?
Parallel Crawlers
• We follow the treatment of Cho and
Garcia-Molina:
– http://www2002.org/CDROM/refereed/108/index.html
• Raises a number of questions in a clean
setting, for further study
• Setting: we have a number of c-proc’s
– c-proc = crawling process
• Goal: we wish to spider the best pages
with minimum overhead
– What do these mean?
Distributed model
• Crawlers may be running in diverse
geographies – Europe, Asia, etc.
– Periodically update a master index
– Incremental update so this is “cheap”
• Compression, differential update etc.
– Focus on communication overhead during the
crawl
• Also results in dispersed WAN load
c-proc’s crawling the web
Which c-proc
gets this URL?
URLs crawled
URLs in
queues
Communication: by URLs
passed between c-procs.
Measurements
• Overlap = (N-I)/I where
– N = number of pages fetched
– I = number of distinct pages fetched
• Coverage = I/U where
– U = Total number of web pages
• Quality = sum over downloaded pages of
their importance
– Importance of a page = its in-degree
• Communication overhead =
– Number of URLs c-proc’s exchange
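These definitions translate directly into code. The sketch below computes them for a list of fetched URLs; the function and argument names are illustrative, and it assumes quality is summed over distinct downloaded pages.

```python
def crawl_metrics(fetched, total_web_pages, in_degree, exchanged_urls=0):
    """Compute the slide's measures for one crawl.

    fetched         -- list of fetched URLs, possibly with repeats
    total_web_pages -- U, total number of web pages
    in_degree       -- dict mapping URL to its in-degree (its importance)
    exchanged_urls  -- number of URLs the c-procs exchanged
    """
    n = len(fetched)                 # N: pages fetched
    distinct = set(fetched)
    i = len(distinct)                # I: distinct pages fetched
    return {
        "overlap": (n - i) / i,
        "coverage": i / total_web_pages,
        "quality": sum(in_degree.get(url, 0) for url in distinct),
        "communication_overhead": exchanged_urls,
    }


example_metrics = crawl_metrics(["a", "b", "a", "c"], 10, {"a": 5, "b": 1})
```

In the example, four fetches of three distinct pages give an overlap of 1/3 and a coverage of 0.3.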
Crawler variations
• c-procs are independent
– Fetch pages oblivious to each other.
• Static assignment
– Web pages partitioned statically a priori, e.g.,
by URL hash … more to follow
• Dynamic assignment
– Central co-ordinator splits URLs among c-
procs
Static assignment
• Firewall mode: each c-proc only fetches
URLs within its partition – typically a
domain
– inter-partition links not followed
• Crossover mode: c-proc may follow
inter-partition links into another partition
– possibility of duplicate fetching
• Exchange mode: c-procs periodically
exchange URLs they discover in another
partition
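A minimal sketch of static assignment by hashing the host name, together with the three modes. The choice of SHA-256 and the return values are assumptions for illustration, and crossover mode is simplified here to always fetching a foreign URL, whereas a c-proc in that mode only may do so.

```python
import hashlib
from urllib.parse import urlparse


def assign_cproc(url, num_cprocs):
    """Statically map a URL to a c-proc by hashing its host name."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cprocs


def handle_link(url, my_id, num_cprocs, mode):
    """Decide what c-proc my_id does with a discovered URL.

    Returns (action, owner) where action is "fetch", "drop" or "send".
    """
    owner = assign_cproc(url, num_cprocs)
    if owner == my_id:
        return ("fetch", owner)
    if mode == "firewall":
        return ("drop", owner)    # inter-partition links not followed
    if mode == "crossover":
        return ("fetch", owner)   # simplification: risks duplicate fetching
    if mode == "exchange":
        return ("send", owner)    # hand the URL to the owning c-proc
    raise ValueError("unknown mode: %r" % mode)
```

Hashing the host rather than the full URL keeps all pages of one site in the same partition, which matches the slide's "typically a domain".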
Experiments
• 40M URL graph – Stanford Webbase
– Open Directory (dmoz.org) URLs as seeds
• Should be considered a small Web
Summary of findings
• Cho/Garcia-Molina detail many findings
– We will review some here, both qualitatively
and quantitatively
– You are expected to understand the reason
behind each qualitative finding in the paper
– You are not expected to remember quantities
in their plots/studies
Firewall mode coverage
• The price of crawling in firewall mode
Crossover mode overlap
• Demanding coverage drives up overlap
Exchange mode communication
• Communication overhead sublinear
[Figure: communication overhead per downloaded URL]
Connectivity servers
Connectivity Server
[CS1: Bhar98b, CS2 & 3: Rand01]
• Support for fast queries on the web graph
– Which URLs point to a given URL?
– Which URLs does a given URL point to?
• Stores mappings in memory from
– URL to outlinks, URL to inlinks
• Applications
– Crawl control
– Web graph analysis
• Connectivity, crawl optimization
– Link analysis
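The two queries can be served from a pair of in-memory maps. The class below is a toy sketch with invented names. It does none of the compression a real connectivity server needs.

```python
from collections import defaultdict


class ConnectivityServer:
    """Toy in-memory web graph: URL -> outlinks and URL -> inlinks."""

    def __init__(self, edges):
        self.outlinks = defaultdict(set)
        self.inlinks = defaultdict(set)
        for source, target in edges:
            self.outlinks[source].add(target)
            self.inlinks[target].add(source)

    def points_to(self, url):
        """Which URLs does the given URL point to?"""
        return sorted(self.outlinks.get(url, ()))

    def pointed_to_by(self, url):
        """Which URLs point to the given URL?"""
        return sorted(self.inlinks.get(url, ()))


# Hypothetical edges (source, target).
toy_graph = ConnectivityServer([("a", "b"), ("a", "c"), ("b", "c")])
```

With these edges, `toy_graph.pointed_to_by("c")` lists both `a` and `b`, and `toy_graph.points_to("a")` lists `b` and `c`.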
Most recent published work
• Boldi and Vigna
– http://www2004.org/proceedings/docs/1p595.pdf
• Webgraph – set of algorithms and a Java
implementation
• Fundamental goal – maintain node
adjacency lists in memory
– For this, compressing the adjacency lists is
the critical component
Adjacency lists
• The set of neighbors of a node
• Assume each URL represented by an
integer
• Properties exploited in compression:
– Similarity (between lists)
– Locality (many links from a page go to
“nearby” pages)
– Use gap encodings in sorted lists
– Distribution of gap values
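Gap encoding of a sorted adjacency list can be sketched as below. This shows only the basic idea (store differences between consecutive node ids, which locality tends to keep small); the bit-level codes used for the gaps are not shown, and the function names and example ids are made up.

```python
from itertools import accumulate


def to_gaps(adjacency):
    """Replace a list of node ids by its first id followed by successive gaps."""
    ordered = sorted(adjacency)
    if not ordered:
        return []
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    return list(accumulate(gaps))


example_gaps = to_gaps([1000, 1003, 1004, 1010])
```

Here the list 1000, 1003, 1004, 1010 becomes 1000, 3, 1, 6. The small gaps need far fewer bits than the original ids.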
Storage
• Boldi/Vigna get down to an average
of ~3 bits/link
– (URL to URL edge)
– For a 118M node web graph
• Why is this remarkable?
• How?
Main ideas of Boldi/Vigna
• Consider lexicographically ordered list of
all URLs, e.g.,
– www.stanford.edu/alchemy
– www.stanford.edu/biology
– www.stanford.edu/biology/plant
– www.stanford.edu/biology/plant/copyright
– www.stanford.edu/biology/plant/people
– www.stanford.edu/chemistry
Boldi/Vigna
• Each of these URLs has an adjacency
list
• Main thesis: because of templates, the
adjacency list of a node is similar to one of
the 7 preceding URLs in the lexicographic
ordering (Why 7?)
• Express adjacency list in terms of one of
these
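• E.g., consider these adjacency lists
– 1, 2, 4, 8, 16, 32, 64
– 1, 4, 9, 16, 25, 36, 49, 64
• A later list that is nearly identical to an earlier one is encoded by
reference, e.g., as (-2), remove 9, add 8: copy the list two positions
back, then delete 9 and insert 8
– Applied to 1, 4, 9, 16, 25, 36, 49, 64 this gives 1, 4, 8, 16, 25, 36, 49, 64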
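The "remove 9, add 8" idea can be sketched as a patch against a reference list. This is a simplified illustration with invented function names. It shows the copy-and-patch principle only, not the actual bit-level format of the Boldi/Vigna implementation.

```python
def encode_by_reference(target, reference):
    """Describe target as a patch against a similar reference list."""
    target_set, reference_set = set(target), set(reference)
    return {
        "remove": sorted(reference_set - target_set),
        "add": sorted(target_set - reference_set),
    }


def decode_by_reference(reference, patch):
    """Rebuild the target list from the reference list and the patch."""
    kept = set(reference) - set(patch["remove"])
    return sorted(kept | set(patch["add"]))


reference_list = [1, 4, 9, 16, 25, 36, 49, 64]
target_list = [1, 4, 8, 16, 25, 36, 49, 64]
patch = encode_by_reference(target_list, reference_list)
```

For these two lists the patch is just "remove 9, add 8", so only two numbers (plus the offset of the reference list) need to be stored instead of eight.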
Resources
• www.robotstxt.org/wc/norobots.html
• www2002.org/CDROM/refereed/108/index.html
• www2004.org/proceedings/docs/1p595.pdf