

IST 441: Crawling and Indexing with Nutch
Presented by: Sujatha Das
Slides courtesy: Saurabh Kataria
Instructor: C. Lee Giles
 Brief Overview
 Nutch as a complete web search engine
 Installation/Usage (with Demo)
Search Engine: Basic Workflow
Today’s class
   Build a complete web search engine with Nutch
   What is Nutch?
      Open source search engine
     Written in Java
     Built on top of Apache Lucene
     Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
   Attractive Features:
      Customizable
      Extensible (e.g., extend to Solr for enhanced indexing and search)
      Plugins
      MapReduce & distributed file system (Hadoop)
Why Nutch?
   Scalable
      Index a local host or the entire Internet
   Portable
      Runs anywhere with Java
   Flexible
      Plugin system + API
      Java based, open source, many customizable scripts
   Code pretty easy to read & work with
   Better than implementing it yourself!
Data Structures used by Nutch
   Web Database or WebDB
      Mirrors the properties/structure of the web graph being crawled
   Segment
      Intermediate index
      Contains pages fetched in a single run
   Index
      Final inverted index obtained by “merging” segments (Lucene)
  Customized graph database
  Used by the Crawler only
  Persistent storage for “pages” & “links”
      Page DB: indexed by URL and hash; contains content, outlinks, fetch information & score
      Link DB: contains “source to target” links and anchor text
  Collection of pages fetched in a single run
  Contains:
      Output of the fetcher
      List of the links to be fetched in the next run, called the “fetchlist”
   Limited life span (default 30 days)
   To be discussed later
A generic Website Structure
   [Diagram: a sample site graph with pages A.html and B.html at the top, linking to A_dup.html, C.html, and C_dup.html, illustrating duplicate pages]

   Cyclic process:
      the crawler generates a set of fetchlists from the WebDB
      the fetchers download the content from the Web
      the crawler updates the WebDB with new links that were found
      the crawler then generates a new set of fetchlists
      repeat until you reach the specified “depth”
Nutch as a crawler
   [Diagram: the Injector seeds the CrawlDB with the initial URLs; the Generator generates fetchlists into a Segment; the Fetcher gets webpages/files from the Web and, with the Parser, reads/writes the Segment; the CrawlDBTool updates the CrawlDB with the results]
Nutch as a complete web search engine
   [Diagram: crawled data (content and links) feeds the Indexer (Lucene); the Searcher (Lucene) queries the resulting index through a GUI (Tomcat)]
Crawling: a 10-stage process
     bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
     1. admin db -create: Create a new WebDB.
     2. inject: Inject root URLs into the WebDB.
     3. generate: Generate a fetchlist from the WebDB in a new segment.
     4. fetch: Fetch content from URLs in the fetchlist.
     5. updatedb: Update the WebDB with links from fetched pages.
     6. Repeat steps 3-5 until the required depth is reached.
     7. updatesegs: Update segments with scores and links from the WebDB.
     8. index: Index the fetched pages.
     9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
     10. merge: Merge the indexes into a single index for searching.
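The single crawl command bundles all of these stages. On older (0.x) Nutch releases the same pipeline can be sketched step by step with the individual tools; the directory names below (db, segments, index, urls.txt) are illustrative, and exact arguments vary between releases, so run bin/nutch with no arguments to list the tools in your version:

```
# create the WebDB and inject the seed URLs
bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt

# one round of the crawl loop (repeat once per depth level)
bin/nutch generate db segments      # write a fetchlist into a new segment
s=`ls -d segments/2* | tail -1`     # pick the newest segment
bin/nutch fetch $s                  # fetch the pages in the fetchlist
bin/nutch updatedb db $s            # fold newly found links back into the WebDB

# index, de-duplicate, and merge
bin/nutch index $s
bin/nutch dedup segments dedup.tmp
bin/nutch merge index segments/*
```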
Demo: Configuration
   Configuration files (XML)
      Required user parameters
         http.agent.description
         http.agent.url
      Adjustable parameters for every component
         E.g., for the fetcher:
            threads-per-host
            threads-per-ip
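A minimal conf/nutch-site.xml setting these agent properties might look like the sketch below; any property set here overrides its default in conf/nutch-default.xml. The values (and the fetcher.threads.per.host override) are illustrative:

```
<?xml version="1.0"?>
<configuration>
  <!-- identify the crawler to the sites it visits; values are illustrative -->
  <property>
    <name>http.agent.name</name>
    <value>ist441-crawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>IST 441 class crawling exercise</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://example.edu/ist441</value>
  </property>
  <!-- example of an adjustable fetcher parameter -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
</configuration>
```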
   URL Filters (text file) (conf/crawl-urlfilter.txt)
      Regular expressions to filter URLs during crawling
      E.g.:
         To ignore files with a certain suffix
         To accept hosts in a certain domain
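In conf/crawl-urlfilter.txt each line is a regular expression prefixed with + (accept) or - (reject), and the first matching rule decides a URL’s fate. The two examples above might be written as follows, with MY.DOMAIN.NAME standing in for the domain to be crawled:

```
# skip file suffixes we don't want to fetch
-\.(gif|jpg|png|css|zip|ppt|mpg|xls|gz|exe)$

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# reject everything else
-.
```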

Installation & Usage
   Installation
      Software needed:
         Nutch release
         Java
         Apache Tomcat (for GUI)
         Cygwin (for Windows)
Installation & Usage
   Usage
      Crawling
         Initial URLs (text file or DMOZ file)
         Required parameters (conf/nutch-site.xml)
         URL filters (conf/crawl-urlfilter.txt)
      Indexing
         Automatic
      Searching
         Location of files (WAR file, index)
         The Tomcat server
   Site we would crawl:
       bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
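For example, with seed URLs in a file urls.txt and output going to a crawl directory (both names illustrative), a depth-3 crawl could be launched as:

```
bin/nutch crawl urls.txt -dir crawl -depth 3 >& crawl.log
```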
   Analyze the database:
       bin/nutch readdb <db dir> -stats
       bin/nutch readdb <db dir> -dumppageurl
       bin/nutch readdb <db dir> -dumplinks
       bin/nutch readdb <db dir> -linkurl <linkurl>
       s=`ls -d <segment dir>/* | head -1`
       bin/nutch segread -dump $s
Resources
   Official Nutch wiki
   Nutch source code
   Installation guide
   The web robot pages
