Web Crawling

   Notes by Aisha Walcott

 Modeling the Internet and the Web:
Probabilistic Methods and Algorithms
   Authors: Baldi, Frasconi, Smyth
   •   Basic Crawling
   •   Selective crawling
   •   Focused crawling
   •   Distributed crawling
   •   Web dynamics - age/lifetime of documents

- Anchors are very useful to search engines: they are the text "on
  top of" a link on a webpage
                      Eg: <a href="URL"> anchor text </a>
- Many topics presented here have pointers to a number of references
                 Basic Crawling
• A simple crawler uses a graph algorithm such as BFS
   – Maintains a queue, Q, that stores URLs
   – Two repositories: D- stores documents, E- stores URLs
• Given S0 (seeds): initial collection of URLs
• Each iteration
   – Dequeue, fetch, and parse document for new URLs
   – Enqueue new URLs that have not been visited (the web graph has
     cycles, so visited URLs must be tracked)

• Termination conditions
   – Time allotted to crawling expired
   – Storage resources are full
   – When crawling stops, Q and D both contain data; anchor text
     pointing to the still-unfetched URLs in Q can be used to answer
     queries about those pages (many search engines do this); a
     minimal crawler sketch follows this list
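A minimal sketch, in Python, of the BFS crawler described above. The
requests library and the regex-based link extractor are illustrative
assumptions; the notes only specify the queue Q and the repositories
D and E.

from collections import deque
import re
import requests

def bfs_crawl(seeds, max_pages=100):
    Q = deque(seeds)   # queue of URLs still to fetch
    E = set(seeds)     # repository of URLs seen so far
    D = {}             # repository of fetched documents

    while Q and len(D) < max_pages:          # termination: page budget exhausted
        url = Q.popleft()                    # dequeue
        try:
            html = requests.get(url, timeout=5).text   # fetch
        except requests.RequestException:
            continue
        D[url] = html                        # store the document
        # parse for new URLs (naive: absolute http(s) links only)
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in E:                # the web graph has cycles: skip URLs already seen
                E.add(link)
                Q.append(link)
    return D, Q                              # D holds pages; Q holds known-but-unfetched URLs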
 Practical Modifications & Issues
• Time to download a doc is unknown
   – DNS lookup may be slow
   – Network congestion, connection delays
   – Exploit bandwidth- run concurrent fetching threads
• Crawlers should be respectful of servers and not abuse resources
  at target site (robots exclusion protocol)
• Multiple threads should not fetch from same server
  simultaneously or too often
• Broaden the crawling fringe (more servers) and increase the time
  between requests to the same server (politeness sketch after this list)
• Storing Q and D on disk requires careful external-memory management
• Crawlers must avoid aliases and "traps", where the same doc is
  addressed by many different URLs
• Web is dynamic and changes in topology and content
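A hedged sketch of two of the politeness issues above: checking the
robots exclusion protocol and spacing out requests to the same
server. Only the standard library is used; the 2-second minimum delay
is an assumed value, not something stated in the notes.

import time
import urllib.robotparser
from urllib.parse import urlparse

MIN_DELAY = 2.0        # seconds between requests to one host (assumed value)
last_hit = {}          # host -> time of the last request sent to it
robots = {}            # host -> parsed robots.txt

def allowed(url, agent="*"):
    # Honour the robots exclusion protocol for the URL's host.
    host = urlparse(url).netloc
    if host not in robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass       # robots.txt unreachable; the unread parser refuses fetches (conservative)
        robots[host] = rp
    return robots[host].can_fetch(agent, url)

def polite_wait(url):
    # Sleep if this host was contacted less than MIN_DELAY seconds ago.
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_hit[host] = time.time()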
              Selective Crawling

• Recognize the relevance or importance of
  sites; limit fetching to the most important subset
• Define a scoring function for relevance
      s ) (u ) : where u is a URL,

                 is the relevance criterion,
                 is the set of parameters.

• Eg. Best first search using score to enqueue
• Measure efficiency as r_t / t, where t = # pages fetched and r_t =
  # fetched pages with score above a threshold (ideally r_t = t); a
  best-first sketch follows
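A minimal sketch of best-first selective crawling, under the
assumption that a relevance score can be computed for a URL before it
is fetched. score_fn plays the role of the s(u) function above; fetch
and extract_links are placeholder callables, not part of the notes.

import heapq

def best_first_crawl(seeds, score_fn, fetch, extract_links, max_pages=100):
    frontier = [(-score_fn(u), u) for u in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = {}

    while frontier and len(fetched) < max_pages:
        neg_score, url = heapq.heappop(frontier)    # most promising URL first
        doc = fetch(url)
        if doc is None:
            continue
        fetched[url] = doc
        for link in extract_links(doc):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_fn(link), link))
    return fetched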
            Ex: Scoring Functions
                           (Selective Crawling)

• Depth: limit the # of docs downloaded from a single site by (a)
  setting a threshold δ, (b) bounding depth in the directory tree, or
  (c) limiting path length; maximizes breadth
      s_δ^(depth)(u) = 1, if |root(u) ~> u| < δ   (root(u) is the root of the site containing u)
                       0, otherwise

• Popularity: assign importance to the most popular pages, e.g. a
  relevance function based on backlinks
      s_τ^(backlinks)(u) = 1, if indegree(u) > τ
                           0, otherwise

• PageRank: a measure of popularity that recursively assigns each
  link a weight proportional to the popularity of its source doc
  (simple scoring sketches follow)
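Sketches of the two binary scoring functions above. The thresholds
delta and tau are the free parameters; estimating depth from the URL
path and keeping an indegree dictionary are illustrative assumptions.

from urllib.parse import urlparse

def s_depth(u, delta=3):
    # 1 if the path from the site root to u is shorter than delta components
    path_len = len([p for p in urlparse(u).path.split("/") if p])
    return 1 if path_len < delta else 0

def s_backlinks(u, indegree, tau=10):
    # 1 if u has more than tau known incoming links (indegree is a dict
    # built from the portion of the web graph crawled so far)
    return 1 if indegree.get(u, 0) > tau else 0

print(s_depth("http://example.edu/a/b.html"))                              # shallow page -> 1
print(s_backlinks("http://example.edu/", {"http://example.edu/": 42}))     # popular page -> 1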
           Focused Crawling
• Searches for info related to a certain topic, not
  driven by generic quality measures
• Relevance prediction
• Context graphs
• Reinforcement learning
• Examples: CiteSeer, the Fish algorithm (agents
  accumulate energy for relevant docs, consume
  energy for network resources)
            Relevance Prediction
                            (Focused Crawling)

• Define the score as the cond. prob. that a doc is relevant,
  given the text in the doc (c is the topic of interest)
      s_θ^(topic)(u) = P(c | d(u), θ)
      θ are the adjustable params of the classifier
      d(u) is the contents of the doc at vertex u
• Strategies for approx topic score
   – Parent-based: score a fetched doc and extend score
     to all URLs in that doc, “topic locality”
            stopic ) (u )  P(c | d (v), ) v is parent of u

   – Anchor-based: just use the text d(v, u) in the anchor(s) where
     the link to u appears, "semantic linkage"
• Eg. naïve Bayes classifier trained on relevant docs.
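A hedged sketch of the parent-based strategy with a naive Bayes
classifier. scikit-learn and the tiny training set are assumptions
made for illustration; the notes only say that a naive Bayes
classifier is trained on relevant docs.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: documents labelled 1 (on-topic) or 0 (off-topic).
train_docs = ["web crawler frontier queue", "focused crawling topic relevance",
              "cooking pasta recipe", "football match results"]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

def s_topic(parent_text):
    # Parent-based strategy: score P(c | d(v)) for the parent document v
    # and extend that score to every URL found in that document.
    return clf.predict_proba([parent_text])[0][1]

print(s_topic("a survey of focused web crawling"))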
              Context Graphs
                    (Focused Crawling)

• Take advantage of knowledge of internet topology
• Train a machine learning system to predict "how far" away
  relevant info can be expected to be found
  – Eg. a 2-layer context graph: a layered graph built around node u
    [Diagram: node u at the center; Layer 1 = pages that link directly
    to u; Layer 2 = pages that link to Layer 1 pages]

  – After training, predict which layer a new doc belongs to,
    indicating the # of links to follow before relevant info is
    reached (sketch below)
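A minimal sketch of building context-graph layers around a known
relevant page u. The backlinks(p) helper (e.g. a search-engine
backlink query) is an assumption; the notes only describe the layered
structure.

def build_context_graph(u, backlinks, num_layers=2):
    layers = {0: {u}}
    seen = {u}
    for i in range(1, num_layers + 1):
        layers[i] = set()
        for page in layers[i - 1]:
            for parent in backlinks(page):      # pages that link into layer i-1
                if parent not in seen:
                    seen.add(parent)
                    layers[i].add(parent)
    # Training data: (document text, layer index) pairs; at crawl time a
    # classifier predicts the layer of a new page, i.e. how many links
    # remain to follow before relevant content is reached.
    return layers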
      Reinforcement Learning
                (Focused Crawling)

• Immediate rewards when crawler downloads a
  relevant doc
• Policy learned by RL can guide agent toward high
  long-term cumulative rewards
• Internal state of crawler- sets of fetched and
  discovered URLs
• Actions- fetching a URL in the queue of URLs
• State space is too large to enumerate, so value estimates must
  generalize across states (rough sketch below)
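A heavily simplified sketch of the reinforcement-learning framing:
fetching a relevant doc yields an immediate reward, and a linear
value estimate over anchor-text features is used to generalize. The
feature map, vocabulary, and learning rate are all assumptions made
for illustration.

import numpy as np

vocab = ["crawl", "web", "recipe", "sports"]   # toy vocabulary (assumed)
w = np.zeros(len(vocab))                       # weights of the linear value estimate
alpha = 0.1                                    # learning rate (assumed)

def features(anchor_text):
    # Tiny bag-of-words feature vector for the text around a link.
    return np.array([1.0 if word in anchor_text else 0.0 for word in vocab])

def choose(frontier):
    # Greedy policy: fetch the URL whose anchor text scores highest.
    # frontier is a list of (url, anchor_text) pairs.
    return max(frontier, key=lambda item: w @ features(item[1]))

def update(anchor_text, reward):
    # Move the value estimate toward the observed immediate reward
    # (reward = 1 if the fetched doc was relevant, 0 otherwise).
    global w
    x = features(anchor_text)
    w += alpha * (reward - w @ x) * x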
        Distributed Crawling
• Scalable system by “divide and conquer”
• Want to avoid significant overlap (the same pages fetched by
  several crawlers)
• Characterize interaction between crawlers
  – Coordination
  – Confinement
  – Partitioning
                  Coordination
                     (Distributed Crawling)

• The way different crawlers agree about the subset of
  pages each of them is responsible for
• If 2 crawlers are completely independent, then overlap is
  controlled only by choosing different seeds (URLs)
• Hard to compute the partition that minimizes overlap
• Partition web into subgraphs-crawler is responsible for
  fetching docs from their subgraphs
• Static or dynamic partition, based on whether or not it
  changes during crawling (static is more autonomous; dynamic is
  subject to reassignment from an external coordinator)
                  Confinement
                     (Distributed Crawling)

• Assumes static coordination; defines how strictly each
  crawler should operate within its own partition
• What happens when a crawler pops "foreign" URLs from its
  queue (URLs belonging to another partition)?
• 3 suggested modes
  – Firewall: never follow interpartition links
     • Poor coverage
  – Crossover: follow interpartition links when Q has no more local URLs
     • Good coverage, potentially high overlap
  – Exchange: never follow interpartition links, but periodically
    communicate foreign URLs to the crawler responsible for the
    corresponding partition
     • No overlap, potentially perfect coverage, but extra bandwidth
       (mode-handling sketch below)
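A sketch of the three confinement modes above, deciding what to do
with a URL popped from the queue. partition_of and my_id are assumed
to come from the partitioning scheme of the next slide.

def handle_url(url, my_id, mode, local_queue, outbox, partition_of):
    if partition_of(url) == my_id:
        local_queue.append(url)        # local URL: always keep it
    elif mode == "firewall":
        pass                           # never follow interpartition links
    elif mode == "crossover":
        if not local_queue:            # only when no local work is left
            local_queue.append(url)
    elif mode == "exchange":
        outbox.append(url)             # ship it to the crawler responsible for it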
                  Partitioning
                (Distributed Crawling)

• Strategy used to split URLs into non-overlapping subsets, each
  assigned to one crawler
• Eg. a hash function of host names / IP addresses assigning them
  to a crawler (hash sketch below)
• Can take geographical location into account
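A minimal sketch of hash-based partitioning: each host name is hashed
to one of N crawlers, so all pages of a site stay in one partition.
Hashing the host with MD5 is an illustrative choice, not something
the notes prescribe.

import hashlib
from urllib.parse import urlparse

def partition_of(url, num_crawlers=4):
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers        # stable crawler id in [0, num_crawlers)

print(partition_of("http://example.edu/a/b.html"))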
              Web Dynamics
• How info on web changes over time
• A SE with a collection of docs is (α, β)-current if the
  probability that a doc is β-current is at least α (β is the
  "grace period")
   – Eg. how many docs per day must be re-fetched to remain
     (0.9, 1 week)-current?
• Assume changes in the web occur at random and independently
   – Model changes to a page as a Poisson process (sketch below)
   – "Dot coms" are much more dynamic than "dot edu" sites
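A sketch of the Poisson change model: if a page changes at rate lam
(changes per day) and its stored copy is age days old, the copy is
β-current when no change happened earlier than β days ago, so
P = exp(-lam * max(0, age - β)). The numeric values below are
illustrative assumptions, not figures from the notes.

import math

def prob_beta_current(age_days, lam, beta_days):
    # Probability that no change occurred outside the grace period,
    # under a Poisson change process with rate lam.
    exposed = max(0.0, age_days - beta_days)
    return math.exp(-lam * exposed)

# Eg. a page that changes on average once every 30 days, last fetched
# 10 days ago, judged with a 1-week grace period:
print(prob_beta_current(age_days=10, lam=1 / 30, beta_days=7))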
 Lifetime and Aging of Documents
• Model based on reliability theory (as used in industrial settings)
• Lifetime and age of a doc are characterized by their cdfs and pdfs
