Distributed Computing with Hadoop

   Stefan Theußl, Albert Weichselbraun

   WU Vienna University of Economics and Business
            Augasse 2–6, 1090 Vienna
            Stefan.Theussl@wu.ac.at
         Albert.Weichselbraun@wu.ac.at


                 May 15, 2009
Agenda

  Problem & Motivation

  The MapReduce Paradigm

  Distributed Text Mining in R

  Distributed Text Mining in Java

  Distributed Text Mining in Python

  Implementation Details
     Hadoop API
     Hadoop Streaming
     eWRT

  Discussion
Motivation



   Main motivation: large-scale data processing
       Many tasks produce output data by processing large amounts
       of input data
       We want to make use of many CPUs
       Typically this is not easy (parallelization, synchronization,
       I/O, debugging, etc.)
       Hence the need for an integrated framework
The MapReduce Paradigm
The MapReduce Paradigm

      Programming model inspired by functional language
      primitives (map and fold/reduce)
      Automatic parallelization and distribution
      Fault tolerance
      I/O scheduling
      Examples: document clustering, web access log analysis,
      search index construction, . . .

     Jeffrey Dean and Sanjay Ghemawat.
     MapReduce: Simplified data processing on large clusters.
     In OSDI’04, 6th Symposium on Operating Systems Design and
     Implementation, pages 137–150, 2004.
  Hadoop (http://hadoop.apache.org/core/), developed by
  the Apache project, is an open-source implementation of
  MapReduce (a toy sketch of the paradigm follows below).
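
  To illustrate the paradigm, here is a toy word-count sketch in
  plain Python (a hypothetical illustration of the programming
  model, not Hadoop code): the map step emits (word, 1) pairs and
  the reduce step sums the counts per word, with a sort in between
  standing in for the framework's shuffle phase.

  from itertools import groupby
  from operator import itemgetter

  def map_fn(key, value):
      # map: emit a (word, 1) pair for every word in the input line
      for word in value.split():
          yield word, 1

  def reduce_fn(key, values):
      # reduce: sum all counts emitted for one word
      yield key, sum(values)

  # simulate the MapReduce data flow on a toy input
  lines = ["to be or not to be"]
  intermediate = [kv for line in lines for kv in map_fn(None, line)]
  intermediate.sort(key=itemgetter(0))   # the shuffle/sort phase
  for word, group in groupby(intermediate, key=itemgetter(0)):
      for key, total in reduce_fn(word, (v for _, v in group)):
          print key, total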
The MapReduce Paradigm

   [Distributed data is split into local chunks; a Map task
   processes each chunk into intermediate data (partial results);
   Reduce tasks aggregate the partial results into the final result.]

                     Figure: Conceptual Flow
The MapReduce Paradigm


  A MapReduce implementation like Hadoop typically
  provides a distributed file system (DFS):
      Master/worker architecture (namenode/datanodes)
      Data locality: map tasks are applied to partitioned data
      and scheduled so that their input blocks reside on the
      same machine
      Datanodes read input at local disk speed
      Data replication provides fault tolerance
      Applications do not need to care whether individual
      nodes fail
Hadoop Streaming




       A utility for creating and running MapReduce jobs with any
       executable or script as the mapper and/or the reducer
       (a sketch of such scripts follows below):

  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
       -input inputdir \
       -output outputdir \
       -mapper ./mapper \
       -reducer ./reducer
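
  To make this concrete, here is a hypothetical word-count pair for
  Hadoop Streaming, written in Python (a minimal sketch; the file
  names and the word-count task are illustrative assumptions, not
  part of the original setup):

  # mapper.py: emit one "word<TAB>1" line per word read from stdin
  import sys

  for line in sys.stdin:
      for word in line.split():
          print "%s\t%d" % (word, 1)

  # reducer.py: Hadoop sorts mapper output by key, so all counts
  # for one word arrive on adjacent lines and are summed in one pass
  import sys

  current, count = None, 0
  for line in sys.stdin:
      word, value = line.rstrip("\n").split("\t", 1)
      if word == current:
          count += int(value)
      else:
          if current is not None:
              print "%s\t%d" % (current, count)
          current, count = word, int(value)
  if current is not None:
      print "%s\t%d" % (current, count)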
Application: Text Mining in R
Why Distributed Text Mining?

      Highly interdisciplinary research field utilizing techniques from
      computer science, linguistics, and statistics
      Vast amount of textual data available in machine readable
      format:
          scientific articles, abstracts, books, . . .
          memos, letters, . . .
          online forums, mailing lists, blogs, . . .
      Data volumes (corpora) are becoming bigger and bigger
      Steady increase of text mining methods (both in academia
      and in industry) within the last decade
      Text mining methods are becoming more complex and hence
      more compute-intensive
      Thus, the demand for computing power steadily increases
Why Distributed Text Mining?




      High Performance Computing (HPC) servers are available at a
      reasonable price
      Integrated frameworks for parallel/distributed computing are
      available (e.g., Hadoop)
      Thus, parallel/distributed computing is now easier than ever
      Standard data processing software already offers extensions
      that make use of these frameworks
Text Mining in R



      tm Package
      Tailored for
          Plain texts, articles and papers
          Web documents (XML, SGML, . . . )
          Surveys
      Methods for
          Clustering
          Classification
          Visualization
Text Mining in R



      I. Feinerer
      tm: Text Mining Package, 2009
      URL http://CRAN.R-project.org/package=tm
      R package version 0.3-3
      I. Feinerer, K. Hornik, and D. Meyer
      Text mining infrastructure in R
      Journal of Statistical Software, 25(5):1–54, March 2008
      ISSN 1548-7660
      URL http://www.jstatsoft.org/v25/i05
Distributed Text Mining in R



   Motivation:
       Large data sets
       Corpus typically loaded into memory
       Operations on all elements of the corpus (so-called
       transformations)
   Available transformations: stemDoc(), stripWhitespace(),
   tmTolower(), . . .
Distributed Text Mining Strategies in R




   Possibilities:
           Text mining using tm and MapReduce (via Hadoop
           framework)
           Text mining using tm and MPI/snow1




      1
          Luke Tierney (version 0.3-3 on CRAN)
Distributed Text Mining in R



   Solution (Hadoop):
       Data set copied to the DFS (DistributedCorpus)
       Only meta information about the corpus is held in memory
       Computational operations (Map) on all elements in parallel
       Workhorse: tmMap()
       Processed documents (revisions) can be retrieved on demand
Distributed Text Mining in R
Distributed Text Mining in R - Listing
   Mapper (called by tmMap):

   .hadoop_generate_tm_mapper <- function(script, FUN, ...){
     writeLines(sprintf('#!/usr/bin/env Rscript
       ## load tm package
       require("tm")
       fun <- %s
       ## read from stdin
       input <- readLines(file("stdin"))
       ## create object of class "PlainTextDocument"
       doc <- new("PlainTextDocument", .Data = input[-1L],
                  DateTimeStamp = Sys.time())
       ## apply function on document
       result <- fun(doc)
       ## write key
       writeLines(input[1L])
       ## write value
       writeLines(Content(result))
       ', FUN), script)
   }
Distributed Text Mining in R


   Example: Stemming
          Erasing word suffixes to retrieve their radicals
          Reduces complexity
          Stemmers provided in packages Rstem1 and Snowball2
   Data:
          Wizard of Oz book series (http://www.gutenberg.org)
          20 books, each containing 1529 – 9452 lines of text




     1
         Duncan Temple Lang (version 0.3-0 on Omegahat)
     2
         Kurt Hornik (version 0.0-3 on CRAN)
Distributed Text Mining in R




   Workflow (a driver sketch follows below):
       Start the Hadoop framework
       Put the files into the DFS
       Apply map functions (e.g., stemming of documents)
       Retrieve the output data
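
   A hypothetical driver for this workflow in Python, using the
   standard hadoop command-line tools (the paths, the input data
   directory, and the location of the streaming jar are assumptions):

   import subprocess

   def run(cmd):
       # fail fast if any step of the workflow breaks
       if subprocess.call(cmd) != 0:
           raise RuntimeError("command failed: %s" % " ".join(cmd))

   # put the input files into the DFS
   run(["hadoop", "fs", "-put", "oz_books/", "input"])
   # apply the map function (e.g., a generated stemming mapper)
   run(["hadoop", "jar", "hadoop-streaming.jar",
        "-input", "input", "-output", "output",
        "-mapper", "./mapper"])
   # retrieve the output data from the DFS
   run(["hadoop", "fs", "-get", "output", "local_output"])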
Distributed Text Mining in R


      Infrastructure
          Computers of a PC lab used as worker nodes:
          8 PCs with an Intel Pentium 4 CPU @ 3.2 GHz
          and 1 GB of RAM
          Each PC has > 20 GB reserved for the DFS
          Development platform: 8-core Power 6 shared memory system
          No cluster installation (yet)
      MapReduce framework
          Hadoop
          Implements MapReduce + DFS
          Local R and tm installation
          Code for R/Hadoop integration (upcoming package hive)
Benchmark

   [Figure: Runtime [s] (left panel) and speedup (right panel) as a
   function of the number of CPUs (1-8).]
Lessons Learned




      The problem size has to be sufficiently large
      Location of texts in the DFS (currently: ID = file path)
      Thus, serialization is difficult (how to update text IDs?)
      A remote file operation on the DFS takes around 2.5 sec.
Application: Text Mining in Java
Text Mining in Java



   Frameworks
      GATE - General Architecture for Text Engineering
      (Cunningham, 2002)
      UIMA - Unstructured Information Management Architecture
      (IBM)
      OpenNLP
      WEKA (Witten & Frank, 2005)
      webLyzard
Text Mining in Java


      webLyzard Framework
      Tailored for
          Web documents (HTML)
          Common Text Formats (PDF, ODT, DOC, . . . )
      Annotation Components:
          Geographic tagging
          Keywords and co-occurring terms
          Named entity tagging
          Sentiment tagging
      Search and Indexing:
          Lucene
Text Mining in Python




   Frameworks
      Natural Language Toolkit (NLTK)
      easy Web Retrieval Toolkit (eWRT)
      webLyzard
Text Mining in Python

      Annotation Components:
          Keywords and co-occurring terms
          Language detection (short words, trigrams, Bayes)
          Part of speech (POS) tagging
          Sentiment
      Data Sources
          Alexa web services (Web page rating)
          Amazon product reviews (sentiment)
          DBpedia (ontology data)
          Del.icio.us (social bookmarking)
          Facebook (in development)
          OpenCalais (named entity recognition)
          Scarlet (relation discovery)
          TripAdvisor (sentiment +)
          Wikipedia (disambiguation, synonyms)
          WordNet (disambiguation)
Implementation Details
Distributed Processing

   Approaches
      Hadoop API
          efficient
          built-in mechanism for processing arbitrary objects
          drawback: requires writing specialized code
      Hadoop Streaming
          programming language independent
          easy to use
          drawback: text only → how to handle objects?
      eWRT transparent caching
          transparent object caching
          low overhead
          drawback: currently Python specific
Hadoop API - Mapper and Reducer



      Mapper and Reducer
          Implement the Mapper and Reducer interfaces
          Inherit from MapReduceBase
      Objects
          Implement the Writable interface
          Based on DataInput/DataOutput (Hadoop's counterpart
          to Serializable)
          readFields() → initializes the object
          write() → serializes the object
Hadoop API - Mapper and Reducer
   [Class diagram: MyMapper and MyReducer extend MapReduceBase and
   implement the Mapper and Reducer interfaces:

     map(K1 key, V1 value, OutputCollector<K2,V2> output,
         Reporter reporter)
     reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output,
            Reporter reporter)

   Both are configured via JobConf and use objects implementing the
   Writable interface:

     readFields(DataInput in)
     write(DataOutput out)

   Data flow: {K1 key: V1 value} -> {K2 key: V2 values} -> {K3 key: V3 value}]
Hadoop API - Distributed File System - Listing


    Configuration conf = new Configuration();
    FileSystem hadoopFs = FileSystem.get(conf);

    // Input and output files
    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    // Check whether a file exists
    if (!hadoopFs.exists(inFile)) {
      printAndExit("Input file not found");
    }
Hadoop API - Distributed File System - Access




    FSDataInputStream in = hadoopFs.open(inFile);
    while ((bytesRead = in.read(buffer)) > 0) {
      process(buffer, bytesRead);
    }
    in.close();
Hadoop Streaming - How to save your Objects? (1/4)



      Hadoop Streaming:
          no fixed order
          operates on a per-text-line basis
      Approaches
          do not use objects
          enclose meta data in the input
          straightforward solution:
          → use your programming language's serialization facilities
               Java: Serializable interface
               Python: cPickle
               R: serialize
Hadoop Streaming - How to save your Objects? (2/4)



      Pitfalls:
           Binary serialization formats
           Serializations spanning more than one line (Python)
           Performance
      Suggestions:
           Transparent compression
           (lightweight: .gz; compare Google's Bigtable implementation)
           Encode the data stream using, for instance, Base64
           (compare XML-RPC, SOAP, ...)
Hadoop Streaming - How to save your Objects? (3/4)

   from cPickle import dumps

   class TestClass(object):
       def __init__(self, a, b):
           self.a = a
           self.b = b

   ti = TestClass("Tom", "Susan")
   print dumps(ti)

   ccopy_reg\n_reconstructor\np1\n(c__main__\nTestClass\np2\n
   c__builtin__\nobject\np3\nNtRp4\n
   (dp5\nS'a'\nS'Tom'\np6\nsS'b'\nS'Susan'\np7\nsb.
Hadoop Streaming - How to save your Objects? (4/4)


   from base64 import encodestring
   from zlib import compress

   # obj is the pickled string produced by cPickle.dumps();
   # a matching deserializer is sketched after the table below
   serialize = lambda obj: encodestring(compress(obj))


            Object            Payload       Serialized Size
            String            8 Bytes       33 Bytes
            Tuple/List/Set    8 Bytes       41/49/86 Bytes
            TestClass         8 Bytes       150 Bytes
            Reuters Article   3474 Bytes    1699 Bytes
                      Table: Serialization Overhead
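
   A matching deserializer simply inverts the calls above (a sketch):

   from base64 import decodestring
   from zlib import decompress
   from cPickle import loads

   # Base64-decode, decompress, then unpickle
   deserialize = lambda s: loads(decompress(decodestring(s)))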
eWRT - Transparent Object Caching



      Transparent
          minimal code modifications
          adaptor pattern
      Disk or in-memory cache
      Current uses:
          complex computations
          access to remote resources
          complex database queries
eWRT - Transparent Object Caching


      Original code:

      def getCalaisTags(self, document_id):
          return self.calais.fetch_tags(document_id)

      Refined code (the fetch logic is sketched below):

      def __init__(self):
          self.cache = Cache("./cache", cache_nesting_level=2)

      def getCalaisTags(self, document_id):
          return self.cache.fetch(document_id,
                                  self.calais.fetch_tags)
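
      The idea behind Cache.fetch can be sketched in a few lines (a
      hypothetical simplification, not eWRT's actual implementation):
      the key is hashed to a file path, and the wrapped function is
      only called on a cache miss.

      import os
      from hashlib import md5
      from cPickle import dump, load

      class DiskCache(object):
          # hypothetical stand-in for eWRT's Cache class
          def __init__(self, cache_dir):
              self.cache_dir = cache_dir
              if not os.path.exists(cache_dir):
                  os.makedirs(cache_dir)

          def fetch(self, key, fetch_function):
              # one file per key, named by the key's MD5 digest
              path = os.path.join(self.cache_dir,
                                  md5(str(key)).hexdigest())
              if os.path.exists(path):        # cache hit: read from disk
                  return load(open(path, "rb"))
              result = fetch_function(key)    # cache miss: compute ...
              dump(result, open(path, "wb"))  # ... and store for next time
              return result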
Conclusion




      MapReduce has proven to be a useful abstraction
      It greatly simplifies distributed computing
      Developers can focus on the problem
      Implementations like Hadoop deal with the messy details
             There are different approaches to leveraging Hadoop's
             infrastructure; which fits best is language and use
             case dependent
Thank You for Your Attention!


   Stefan Theußl
   Department of Statistics and Mathematics
   email: Stefan.Theussl@wu.ac.at
   URL: http://statmath.wu.ac.at/~theussl
   Albert Weichselbraun
   Department of Information Systems and Operations
   email: Albert.Weichselbraun@wu.ac.at
   URL: http://www.ai.wu.ac.at/~aweichse
   WU Vienna University of Economics and Business
   Augasse 2–6, A-1090 Wien

				