
IST 441: Crawling and Indexing with Nutch
Presented by: Sujatha Das
Slides courtesy: Saurabh Kataria
Instructor: C. Lee Giles
Outline:
 Brief Overview
 Nutch as a complete web search engine
 Installation/Usage (with Demo)
Search Engine: Basic Workflow
Today’s class
   Build a complete web search engine with Nutch
   What is Nutch?
     Open source search engine
     Written in Java
     Built on top of Apache Lucene
     Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
             + Plugins
             + MapReduce & Distributed FS (Hadoop)
   Attractive Features:
     Customizable
     Extensible (e.g. extend to Solr for enhanced portability)
Why Nutch?
   Scalable
     Index local host or entire Internet
   Portable
     Runs anywhere with Java
   Flexible
     Plugin system + API
     Java based, open source, many customizable scripts
      (http://lucene.apache.org/nutch/)
   Code pretty easy to read & work with
   Better than implementing it yourself!
Data Structures used by Nutch
   Web Database or WebDB
     Mirrors the properties/structure of the web graph being crawled
   Segment
     Intermediate index
     Contains pages fetched in a single run
   Index
     Final inverted index obtained by “merging” segments (Lucene)
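Concretely, these structures end up on disk after a crawl. A sketch of the layout produced by the Nutch 0.7-era crawl command, assuming an illustrative output directory named crawl-ist (names vary by version):
    crawl-ist/db/        # the WebDB: page and link databases
    crawl-ist/segments/  # one timestamped segment per fetch run
    crawl-ist/index/     # the final merged Lucene index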
WebDB
 Customized graph database
 Used by Crawler only
 Persistent storage for “pages” & “links”
     Page DB: indexed by URL and hash; contains content, outlinks, fetch information & score
     Link DB: contains “source to target” links, anchor text
Segment
 Collection of pages fetched in a single run
 Contains:
     Output of the fetcher
     List of the links to be fetched in the next run, called the “fetchlist”
   Limited life span (default 30 days)
Index
   To be discussed later
A generic Website Structure
 [Diagram: an example site rooted at http://ist.psu.edu/, containing the pages A.html, B.html, A_dup.html, C.html and C_dup.html, with an external link to Wikipedia.org]
Crawling
   Cyclic process (see the step-by-step sketch below):
     the crawler generates a set of fetchlists from the WebDB
     fetchers download the content from the Web
     the crawler updates the WebDB with new links that were found
     and then the crawler generates a new set of fetchlists
     repeat until you reach the required “depth”
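A minimal sketch of one such cycle using the step-by-step Nutch commands (Nutch 0.8-era syntax; the directory names urls, crawl/crawldb and crawl/segments are illustrative, and exact arguments vary by version):
    bin/nutch inject crawl/crawldb urls              # seed the DB with the initial URLs
    bin/nutch generate crawl/crawldb crawl/segments  # write a fetchlist into a new segment
    s=`ls -d crawl/segments/2* | tail -1`            # pick the segment just generated
    bin/nutch fetch $s                               # download the pages in the fetchlist
    bin/nutch updatedb crawl/crawldb $s              # add newly discovered links back to the DB
Repeating the generate/fetch/updatedb block once per level of depth reproduces the cycle above; the single bin/nutch crawl command shown later wraps this same loop.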
Nutch as a crawler
 [Diagram: the Injector seeds the CrawlDB with the initial URLs; the Generator reads the CrawlDB and generates a Segment (the fetchlist); the Fetcher gets webpages/files from the Web and writes them to the Segment, which the Parser reads and writes; the CrawlDBTool then updates the CrawlDB with the results]
Nutch as a complete web search engine
 [Diagram: content and links from the partially crawled data feed the Indexer (Lucene), which builds the Index; the Searcher (Lucene) runs queries against the Index, and results are served through the GUI (Tomcat)]
Crawling: 10 stage process
    bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
    1. admin db -create: Create a new WebDB.
    2. inject: Inject root URLs into the WebDB.
    3. generate: Generate a fetchlist from the WebDB in a new segment.
    4. fetch: Fetch content from URLs in the fetchlist.
    5. updatedb: Update the WebDB with links from fetched pages.
    6. Repeat steps 3-5 until the required depth is reached.
    7. updatesegs: Update segments with scores and links from the WebDB.
    8. index: Index the fetched pages.
    9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
    10. merge: Merge the indexes into a single index for searching.
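For example, a single command that runs all of these stages end to end (file and directory names are illustrative only):
    bin/nutch crawl urls.txt -dir crawl-ist -depth 3 >& crawl.log
Here urls.txt holds the seed URLs, crawl-ist is the output directory that will hold the WebDB, segments and index, and -depth 3 repeats the generate/fetch/update cycle three times.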
Demo: Configuration
   Configuration files (XML)
     Required user parameters (see the nutch-site.xml sketch below)
       http.agent.name
       http.agent.description
       http.agent.url
       http.agent.email
     Adjustable parameters for every component
       E.g. for the fetcher:
         Threads-per-host
         Threads-per-ip
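A minimal sketch of conf/nutch-site.xml setting the four required agent properties (the property names are Nutch's; every value shown is a placeholder):
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>ist441-crawler</value>            <!-- placeholder crawler name -->
      </property>
      <property>
        <name>http.agent.description</name>
        <value>IST 441 class crawler</value>     <!-- placeholder description -->
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://ist.psu.edu/</value>       <!-- placeholder contact URL -->
      </property>
      <property>
        <name>http.agent.email</name>
        <value>student@example.edu</value>       <!-- placeholder contact e-mail -->
      </property>
    </configuration>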
Configuration
  URL Filters (text file: conf/crawl-urlfilter.txt)
    Regular expressions to filter URLs during crawling
    E.g.
      To ignore files with certain suffixes:
        -\.(gif|exe|zip|ico)$
      To accept hosts in a certain domain:
        +^http://([a-z0-9]*\.)*apache.org/
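As an illustration, a conf/crawl-urlfilter.txt that restricts the crawl to the ist.psu.edu domain used in the demo might look like this (a sketch only; rules are tried top to bottom, + accepts and - rejects, and the skip patterns shown are common defaults):
    # skip URLs containing characters that usually mark queries or sessions
    -[?*!@=]
    # skip image and archive files
    -\.(gif|GIF|jpg|JPG|exe|zip|ico)$
    # accept hosts in the ist.psu.edu domain
    +^http://([a-z0-9]*\.)*ist.psu.edu/
    # reject everything else
    -.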
Installation & Usage
   Installation
     Software needed
       Nutch release
       Java
       Apache Tomcat (for GUI)
       Cygwin (for Windows)
Installation & Usage
   Usage
     Crawling
       Initial URLs (text file or DMOZ file)
       Required parameters (conf/nutch-site.xml)
       URL filters (conf/crawl-urlfilter.txt)
     Indexing
       Automatic
     Searching (see the deployment sketch below)
       Location of files (WAR file, index)
       The Tomcat server
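A minimal sketch of wiring the search GUI to a finished crawl (the WAR name, Tomcat location and crawl path are illustrative; searcher.dir is the Nutch property that tells the web app where the index lives):
    # deploy the Nutch web application into Tomcat
    cp nutch-0.9.war $CATALINA_HOME/webapps/ROOT.war
    # after Tomcat unpacks it, set searcher.dir in
    # webapps/ROOT/WEB-INF/classes/nutch-site.xml to the crawl directory, e.g.
    #   <property>
    #     <name>searcher.dir</name>
    #     <value>/path/to/crawl-ist</value>
    #   </property>
    # then restart Tomcat
    $CATALINA_HOME/bin/catalina.sh stop
    $CATALINA_HOME/bin/catalina.sh start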
Demo:
   Site we will crawl: http://ist.psu.edu
     bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
   Analyze the database:
     bin/nutch readdb <db dir> -stats
     bin/nutch readdb <db dir> -dumppageurl
     bin/nutch readdb <db dir> -dumplinks
     bin/nutch readdb <db dir> -linkurl <linkurl>
     s=`ls -d <segment dir>/* | head -1`
     bin/nutch segread -dump $s
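For instance, with the illustrative output directory crawl-ist used earlier (the directory names are assumptions, not part of the original demo):
    bin/nutch readdb crawl-ist/db -stats          # page and link counts in the WebDB
    bin/nutch readdb crawl-ist/db -dumppageurl    # dump every page record by URL
    bin/nutch readdb crawl-ist/db -dumplinks      # dump the link graph
    s=`ls -d crawl-ist/segments/* | head -1`      # pick the first segment
    bin/nutch segread -dump $s                    # dump that segment's fetched content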
References
   http://lucene.apache.org/nutch/ -- Official website
   http://wiki.apache.org/nutch/ -- Nutch wiki
   http://lucene.apache.org/nutch/release/ -- Nutch source code
   www.nutchinstall.blogspot.com -- Installation guide
   http://www.robotstxt.org/wc/robots.html -- The web robot pages

								