Distributed Computing with Hadoop


   Stefan Theußl, Albert Weichselbraun

   WU Vienna University of Economics and Business
            Augasse 2–6, 1090 Vienna

                 15. May 2009

  Problem & Motivation

  The MapReduce Paradigm

  Distributed Text Mining in R

  Distributed Text Mining in Java

  Distributed Text Mining in Python

  Implementation Details
     Hadoop API
     Hadoop Streaming


   Main motivation: large scale data processing
       Many tasks, i.e. we produce output data via processing lots of
       input data
       Want to make use of many CPUs
       Typically this is not easy (parallelization, synchronization,
       I/O, debugging, etc.)
       Need for an integrated framework
The MapReduce Paradigm

      Programming model inspired by functional languages
      Automatic parallelization and distribution
      Fault tolerance
      I/O scheduling
      Examples: document clustering, web access log analysis,
      search index construction, . . .

     Jeffrey Dean and Sanjay Ghemawat.
     MapReduce: Simplified data processing on large clusters.
     In OSDI’04, 6th Symposium on Operating Systems Design and
     Implementation, pages 137–150, 2004.
  Hadoop (http://hadoop.apache.org/core/), developed by
  the Apache project, is an open source implementation of
  MapReduce.
The MapReduce Paradigm

                      Distributed Data

               Local Data          Local Data           Local Data

                  Map                   Map                Map

                    Intermediate Data

              Partial Result      Partial Result       Partial Result

                               Reduce         Reduce

                                Aggregated Result

                     Figure: Conceptual Flow
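
The conceptual flow can be mimicked sequentially on a single machine. The following minimal Python sketch is not Hadoop code: the names run_mapreduce, word_count_map and word_count_reduce are hypothetical and chosen only for illustration, and word counting is an assumed example task.

```python
from collections import defaultdict


def word_count_map(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reduce: aggregate all partial counts that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> group by key -> reduce in a single process."""
    intermediate = defaultdict(list)
    for key, value in records:                       # one record = local data
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)  # intermediate data
    # one reduce call per distinct intermediate key
    return dict(reduce_fn(k, v) for k, v in sorted(intermediate.items()))


docs = [("doc1", "the wizard of oz"), ("doc2", "the land of oz")]
result = run_mapreduce(docs, word_count_map, word_count_reduce)
```

In a real framework the map calls run in parallel on the nodes that hold the data, and the grouping step is performed by the framework between the two phases.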
The MapReduce Paradigm

  A MapReduce implementation like Hadoop typically
  provides a distributed file system (DFS):
      Master/worker architecture (Namenode/Datanodes)
      Data locality
      Map tasks are applied to partitioned data
      Map tasks scheduled so that input blocks are on the same node
      Datanodes read input at local disk speed
      Data replication leads to fault tolerance
      Application does not care whether nodes are OK or not
Hadoop Streaming

       Utility that allows creating and running MapReduce jobs with any
       executable or script as the mapper and/or the reducer

  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
       -input inputdir \
       -output outputdir \
       -mapper ./mapper \
       -reducer ./reducer
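
A streaming mapper or reducer is simply a program that reads lines from standard input and writes tab-separated key/value lines to standard output. The sketch below shows what such a script might look like in Python for an assumed word count task. The function names and the "map"/"reduce" command-line convention are hypothetical and not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: turn each input line into tab-separated "word<TAB>1" lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, as it is when the
    framework hands mapper output to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


# Hypothetical convention: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Such a script can be tried without a cluster by piping a text file through the map step, a sort, and the reduce step on the command line.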
Application: Text Mining in R
Why Distributed Text Mining?

      Highly interdisciplinary research field utilizing techniques from
      computer science, linguistics, and statistics
      Vast amount of textual data available in machine readable format:
          scientific articles, abstracts, books, . . .
          memos, letters, . . .
          online forums, mailing lists, blogs, . . .
      Data volumes (corpora) become bigger and bigger
      Steady increase of text mining methods (both in academia and
      in industry) within the last decade
      Text mining methods are becoming more complex and hence
      computationally intensive
      Thus, demand for computing power steadily increases
Why Distributed Text Mining?

      High Performance Computing (HPC) servers available for a
      reasonable price
      Integrated frameworks for parallel/distributed computing
      available (e.g., Hadoop)
      Thus, parallel/distributed computing is now easier than ever
      Standard software for data processing already offers extensions
      to use these frameworks
Text Mining in R

      tm Package
      Tailored for
          Plain texts, articles and papers
          Web documents (XML, SGML, . . . )
      Methods for
Text Mining in R

      I. Feinerer
      tm: Text Mining Package, 2009
      URL http://CRAN.R-project.org/package=tm
      R package version 0.3-3
      I. Feinerer, K. Hornik, and D. Meyer
      Text mining infrastructure in R
      Journal of Statistical Software, 25(5):1–54, March 2008
      ISSN 1548-7660
      URL http://www.jstatsoft.org/v25/i05
Distributed Text Mining in R

       Large data sets
       Corpus typically loaded into memory
       Operations on all elements of the corpus (so-called transformations)
   Available transformations: stemDoc(), stripWhitespace(),
   tmTolower(), . . .
Distributed Text Mining Strategies in R

           Text mining using tm and MapReduce (via Hadoop)
           Text mining using tm and MPI/snow [1]

          [1] Luke Tierney (version 0.3-3 on CRAN)
Distributed Text Mining in R

   Solution (Hadoop):
       Data set copied to DFS (DistributedCorpus)
       Only meta information about the corpus in memory
       Computational operations (Map) on all elements in parallel
       Work horse tmMap()
       Processed documents (revisions) can be retrieved on demand
Distributed Text Mining in R - Listing
   Mapper (called by tmMap):
    .hadoop_generate_tm_mapper <- function(script, FUN, ...) {
      ## FUN is inserted via %s, so it must be a character string
      ## holding the source code of the function to apply
      writeLines(sprintf('#!/usr/bin/env Rscript
        ## load tm package
        require("tm")
        fun <- %s
        ## read from stdin
        input <- readLines(file("stdin"))
        ## create object of class "PlainTextDocument"
        doc <- new("PlainTextDocument", .Data = input[-1L],
                   DateTimeStamp = Sys.time())
        ## apply function on document
        result <- fun(doc)
        ## write key
        writeLines(input[1L])
        ## write value
        writeLines(Content(result))
        ', FUN), script)
    }
Distributed Text Mining in R

   Example: Stemming
          Erasing word suffixes to retrieve their radicals
          Reduces complexity
          Stemmers provided in packages Rstem [1] and Snowball [2]
          Wizard of Oz book series (http://www.gutenberg.org)
          20 books, each containing 1529 – 9452 lines of text

         [1] Duncan Temple Lang (version 0.3-0 on Omegahat)
         [2] Kurt Hornik (version 0.0-3 on CRAN)
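
To make the idea of erasing suffixes concrete, here is a deliberately naive Python sketch. It is not the algorithm implemented in Rstem or Snowball. The suffix list, the minimum stem length and the function names are assumptions made purely for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative only, far from complete


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_text(text):
    """Lower-case a text and stem every whitespace-separated token."""
    return " ".join(naive_stem(w) for w in text.lower().split())


stemmed = stem_text("The Witches walked")
```

Several inflected forms collapse onto one stem, which is why stemming reduces the number of distinct terms in a corpus.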
Distributed Text Mining in R

      Start Hadoop framework
      Put files into DFS
      Apply map functions (e.g., stemming of document)
      Retrieve output data
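
The last three steps can be scripted. The sketch below only assembles the command lines and prints them by default. It assumes the hadoop executable is on the PATH and reuses the streaming options shown earlier. Directory names, the jar path and the function names are hypothetical, and starting the framework is left out because the start-up scripts depend on the installation.

```python
import subprocess


def build_workflow(local_dir, dfs_in, dfs_out, streaming_jar, mapper, reducer):
    """Return the commands for: put files, run the job, retrieve the output."""
    return [
        ["hadoop", "fs", "-put", local_dir, dfs_in],
        ["hadoop", "jar", streaming_jar,
         "-input", dfs_in, "-output", dfs_out,
         "-mapper", mapper, "-reducer", reducer],
        ["hadoop", "fs", "-get", dfs_out, local_dir + "_out"],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)


commands = build_workflow("books", "in", "out",
                          "hadoop-streaming.jar", "./mapper", "./reducer")
run_workflow(commands)  # dry run: nothing is executed
```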
Distributed Text Mining in R

          Computers of PC Lab used as worker nodes
          8 PCs with an Intel Pentium 4 CPU @ 3.2 GHz
          and 1 GB of RAM
          Each PC has > 20 GB reserved for DFS
          Development platform: 8-core Power 6 shared memory system
          No cluster installation (yet)
      MapReduce framework
          Implements MapReduce + DFS
          Local R and tm installation
          Code for R/Hadoop integration (upcoming package hive)

   Figure: Runtime [s] and speedup versus number of CPUs (1 to 8)
Lessons Learned

      Problem size has to be sufficiently large
      Location of texts in DFS (currently: ID = file path)
      Thus, serialization difficult (how to update text IDs?)
      Remote file operation on DFS around 2.5 sec.
Application: Text Mining in Java
Text Mining in Java

      GATE - General Architecture for Text Engineering
      (Cunningham, 2002)
      UIMA - Unstructured Information Management Architecture
      WEKA (Witten & Frank, 2005)
Text Mining in Java

      webLyzard Framework
      Tailored for
          Web documents (HTML)
          Common Text Formats (PDF, ODT, DOC, . . . )
      Annotation Components:
          geographic tagging
          Keywords and co-occurring terms
          Named entity tagging
          Sentiment tagging
      Search and Indexing:
Text Mining in Python

      Natural Language Toolkit (NLTK)
      easy Web Retrieval Toolkit (eWRT)
Text Mining in Python

      Annotation Components:
          Keywords and co-occurring terms
          Language detection (short words, trigrams, Bayes)
          Part of speech (POS) tagging
      Data Sources
          Alexa web services (Web page rating)
          Amazon product reviews (sentiment)
          DBpedia (ontology data)
          Del.icio.us (social bookmarking)
          Facebook (in development)
          OpenCalais (named entity recognition)
          Scarlet (relation discovery)
          TripAdvisor (sentiment +)
          Wikipedia (disambiguation, synonyms)
          WordNet (disambiguation)
Implementation Details
Distributed Processing

      Hadoop API
          built-in mechanism for processing arbitrary objects
            write specialized code
      Hadoop Streaming
          programming language independent
          easy to use
             text only → objects?
      eWRT transparent caching
          transparent object caching
          low overhead
             currently Python specific
Hadoop API - Mapper and Reducer

      Mapper and Reducer
          Implement the Mapper, Reducer Interface
          Inherited from MapReduceBase
          Implement the Writable interface
          based on DataInput and DataOutput (comparable to Serializable)
          readFields() → initializes object
          write() → serializes object
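
The write()/readFields() contract can be illustrated outside Java. The following Python sketch is only an analogue, not Hadoop code: the class TextPair, its field layout (4-byte length prefix plus UTF-8 bytes) and the method name read_fields are assumptions for illustration.

```python
import io
import struct


class TextPair(object):
    """Hypothetical value type with a Writable-like contract."""

    def __init__(self, first="", second=""):
        self.first = first
        self.second = second

    def write(self, out):
        """Serialize each field as a 4-byte big-endian length plus UTF-8 bytes."""
        for field in (self.first, self.second):
            data = field.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)

    def read_fields(self, inp):
        """Initialize this object from bytes produced by write()."""
        fields = []
        for _ in range(2):
            (length,) = struct.unpack(">i", inp.read(4))
            fields.append(inp.read(length).decode("utf-8"))
        self.first, self.second = fields


buf = io.BytesIO()
TextPair("Tom", "Susan").write(buf)

copy = TextPair()                       # empty object, filled from the stream
copy.read_fields(io.BytesIO(buf.getvalue()))
```

The pattern of creating an empty object and then filling it from a stream mirrors the role of readFields() described above.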
Hadoop API - Mapper and Reducer
   Mapper (implemented by MyMapper)
       +map(K1:key, V1:value, OutputCollector<K2,V2>:output, Reporter:reporter)

   Reducer (implemented by MyReducer)
       +reduce(K2:key, Iterator<V2>:values, OutputCollector<K3,V3>:output, Reporter:reporter)

   {K1 key: V1 value} -> {K2 key: V2 values} -> {K3 key: V3 value}
Hadoop API - Distributed File System - Listing

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Obtain the file system named in the Hadoop configuration
    Configuration conf = new Configuration();
    FileSystem hadoopFs = FileSystem.get(conf);

    // Input and output files
    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    // Check whether a file exists
    if (!hadoopFs.exists(inFile)) {
      printAndExit("Input file not found");
    }
Hadoop API - Distributed File System - Access

    // Buffer declaration added for completeness; its size is an arbitrary choice
    byte[] buffer = new byte[4096];
    int bytesRead;
    FSDataInputStream in = hadoopFs.open(inFile);
    while ((bytesRead = in.read(buffer)) > 0) {
      process(buffer, bytesRead);
    }
    in.close();
Hadoop Streaming - How to save your Objects? (1/4)

      Hadoop streaming:
          no fixed order
          on a per text line basis
          do not use objects
          enclose meta data into the input
          straightforward solution
          → use your programming language's serialization facilities
               Java: Serializable Interface
               Python: cPickle
               R: serialize
Hadoop Streaming - How to save your Objects? (2/4)

           Binary serialization formats
           Serializations spanning more than one line (Python)
           Transparent compression
           (lightweight: .gz; compare: Google’s Bigtable implementation)
           Encode the data stream using for instance Base64
           (Compare: XMLRPC, SOAP, ...)
Hadoop Streaming - How to save your Objects? (3/4)

   from cPickle import dumps

   class TestClass(object):
       def __init__(self, a, b):
           self.a = a
           self.b = b

   ti = TestClass("Tom", "Susan")
   print dumps(ti)

Hadoop Streaming - How to save your Objects? (4/4)

   from base64 import encodestring
   from cPickle import dumps
   from zlib import compress

   # pickle the object, compress it, then Base64-encode the result (Python 2)
   serialize = lambda obj: encodestring(compress(dumps(obj)))

            Object            Payload       Serialized Size
            String            8 Bytes       33 Bytes
            Tuple/List/Set    8 Bytes       41/49/86 Bytes
            TestClass         8 Bytes       150 Bytes
            Reuters Article   3474 Bytes    1699 Bytes
                      Table: Serialization Overhead
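
A round trip shows how the pieces fit together for line-based streaming. This is a minimal sketch in current Python 3 (pickle instead of cPickle, b64encode instead of encodestring). The function names and the sample record are assumptions, and the sizes it produces will not match the table above.

```python
import base64
import pickle
import zlib


def serialize(obj):
    """Pickle, compress and Base64-encode obj into one newline-free text line."""
    return base64.b64encode(zlib.compress(pickle.dumps(obj))).decode("ascii")


def deserialize(line):
    """Invert serialize(): Base64-decode, decompress and unpickle."""
    return pickle.loads(zlib.decompress(base64.b64decode(line)))


# The payload contains a newline, yet the encoded form stays on a single line.
record = {"id": 42, "text": "first line\nsecond line"}
line = serialize(record)
restored = deserialize(line)
```

Because the encoded form contains no line breaks, each object occupies exactly one input line. Unpickling should only be applied to data from trusted sources.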
eWRT - Transparent Object Caching

          minimal code modifications
          adaptor pattern
      disk or in memory cache
      Current use:
          complex computations
          access to remote resources
          complex database queries
eWRT - Transparent Object Caching

      original code:
      def getCalaisTags(self, document_id):
          return self.calais.fetch_tags(document_id)

      refined code:
      def __init__(self):
          self.cache = \
              Cache("./cache", cache_nesting_level=2)

      def getCalaisTags(self, document_id):
          return self.cache.fetch(
              document_id, self.calais.fetch_tags)
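
How such a cache might work internally can be sketched in a few lines. The class below is a hypothetical stand-in, not eWRT's implementation: the name DiskCache, the hashing of keys into file names and the use of pickle are all assumptions. Only the call shape fetch(key, function) follows the listing above.

```python
import hashlib
import os
import pickle
import tempfile


class DiskCache(object):
    """Hypothetical disk-backed object cache (not eWRT code)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # Derive a file name from the key's textual representation
        digest = hashlib.sha1(repr(key).encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def fetch(self, key, compute):
        """Return the cached result for key, calling compute(key) on a miss."""
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as handle:
                return pickle.load(handle)
        result = compute(key)
        with open(path, "wb") as handle:
            pickle.dump(result, handle)
        return result


calls = []


def slow_lookup(document_id):
    calls.append(document_id)            # record every real computation
    return ["tag-%s" % document_id]


cache = DiskCache(tempfile.mkdtemp())
first = cache.fetch(7, slow_lookup)
second = cache.fetch(7, slow_lookup)     # served from disk, no second call
```

The calling code changes in only one place, which matches the "minimal code modifications" goal stated above.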

      MapReduce has proven to be a useful abstraction
      Greatly simplifies distributed computing
      Developers can focus on the problem
      Implementations like Hadoop deal with the messy details
             different approaches to make use of Hadoop's infrastructure
             language- and use-case-dependent
Thank You for Your Attention!

   Stefan Theußl
   Department of Statistics and Mathematics
   email: Stefan.Theussl@wu.ac.at
   URL: http://statmath.wu.ac.at/~theussl
   Albert Weichselbraun
   Department of Information Systems and Operations
   email: Albert.Weichselbraun@wu.ac.at
   URL: http://www.ai.wu.ac.at/~aweichse
   WU Vienna University of Economics and Business
   Augasse 2–6, A-1090 Wien
