Hadoop in the Wild -- Our current understanding

        Hadoop (MapReduce) in the Wild
   Our current understanding and uses of Hadoop
    Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan
                  Presenter: Le Zhao
                      2008-05-08
                 The REAP Project
 REAP is an intelligent tutor for English language learning
 Intelligent tutors often use student models to generate individualized
  instruction for each student
 REAP cannot generate texts, but it can recognize texts
   – POS tagging, NE tagging, search, categorization, …
   – #filreq(
         #band( #greater( textquality 85 ) #greater( readinglevel 2 )
                #less( readinglevel 9 )    #less( doclength 1001 ) )
         #combine( advocate evidence destroy propose … acceptance ) )
 So, REAP needs a very large database of annotated texts
 Previous approach to collecting texts didn’t scale well
   – Gathered ~1 M high quality docs in ~1 year
   – Typical yield rate <1%, for docs fitting tutoring criteria
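
 The metadata constraints inside #band( … ) in the query above can be read as a plain
  boolean predicate. The sketch below is only an illustration of that reading: the
  DocMeta class and passes_filter function are hypothetical names, and the field names
  are simply copied from the query.

```python
from dataclasses import dataclass


@dataclass
class DocMeta:
    """Hypothetical per-document metadata; field names copied from the query."""
    textquality: int
    readinglevel: int
    doclength: int


def passes_filter(doc: DocMeta) -> bool:
    """Boolean AND of the four numeric constraints listed in #band( ... )."""
    return (
        doc.textquality > 85          # #greater( textquality 85 )
        and doc.readinglevel > 2      # #greater( readinglevel 2 )
        and doc.readinglevel < 9      # #less( readinglevel 9 )
        and doc.doclength < 1001      # #less( doclength 1001 )
    )
```

 Documents that pass the predicate would then be ranked by the #combine( … ) terms;
  that ranking step is not sketched here.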
          Tasks Done on Hadoop
 Web crawling of 200 million Web documents
   – Two web crawls of 100 million web pages each
 Text annotation and categorization of Web pages
   – Part-of-speech, named-entity, sentence breaking
   – Reading level, text quality, topic classification
   – Output: {6 TB} + {42 GB} offset annotation
 Filtering documents according to text quality
   – Output: 6 million high quality HTML documents (114 GB)
 Generating graph structure & PageRank
   – Class project
Getting Started With Hadoop Quickly
Hadoop Streaming has been our most important tool for porting legacy tasks
  and tools to Hadoop
 Runs any program that reads STDIN and writes STDOUT
 No need to recompile or relink with Hadoop libraries
 With one file per record, streaming is not the most efficient implementation
   – Poor data locality
 But very efficient in human time
   – A day or two to get something running on 100 nodes
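
 To make the STDIN -> STDOUT contract concrete, here is a minimal sketch of a streaming
  mapper. It is an illustrative word-count style example, not one of the REAP tasks;
  the function names map_line and run are made up for this sketch. It assumes the usual
  streaming convention of one record per input line and tab-separated key/value output.

```python
import sys


def map_line(line):
    """Yield (token, 1) pairs for one input line (illustrative task only)."""
    for token in line.split():
        yield token.lower(), 1


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write tab-separated key/value lines to stdout."""
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")


# Only consume stdin when data is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    run()
```

 Because the script only touches standard streams, it can be tried locally with a
  shell pipe before it is submitted to the cluster.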
                 Hadoop in the Wild:
                     Trick #1
 Q: My annotator takes file input, not STDIN
 Solution: still Hadoop Streaming
   – Prepare a list of filenames
   – Distribute the filenames instead of file contents
   – map.pl
         takes one filename
         download the file from HDFS (Hadoop Distributed File System)
         apply the annotator
         upload resulting files to HDFS
   – No reducer needed
 Can port any data-parallel program (one whose input files can be processed
  independently) onto Hadoop in a day
 Efficient enough for computation-intensive tasks
   – Even though data locality is low
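
 The map.pl steps above can be sketched as follows. The slides describe a Perl script;
  this sketch uses Python instead. The annotator path, its "input output" command line,
  the ".ann" suffix, and the HDFS output directory are all assumptions made for
  illustration. The only external commands used are `hadoop fs -get` and
  `hadoop fs -put`.

```python
import os
import subprocess
import sys
import tempfile

ANNOTATOR = "./annotate"               # hypothetical annotator executable
OUTPUT_DIR = "/user/reap/annotated"    # hypothetical HDFS output directory


def plan_commands(hdfs_path, workdir, annotator=ANNOTATOR, output_dir=OUTPUT_DIR):
    """Return the download, annotate, and upload commands for one HDFS filename."""
    name = os.path.basename(hdfs_path)
    local_in = os.path.join(workdir, name)
    local_out = local_in + ".ann"
    return [
        ["hadoop", "fs", "-get", hdfs_path, local_in],    # download from HDFS
        [annotator, local_in, local_out],                 # assumed annotator CLI
        ["hadoop", "fs", "-put", local_out, output_dir + "/" + name + ".ann"],
    ]


def process(hdfs_path, runner=subprocess.check_call):
    """Run the three steps in a scratch directory that is removed afterwards."""
    with tempfile.TemporaryDirectory() as workdir:
        for cmd in plan_commands(hdfs_path, workdir):
            runner(cmd)


def main(stdin=sys.stdin):
    """Each input record is one HDFS filename, not file contents."""
    for line in stdin:
        path = line.strip()
        if path:
            print(f"annotating {path}", file=sys.stderr)
            process(path)


# Only consume stdin when data is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

 Since no reducer is needed, the job would be submitted as map-only; in Hadoop
  Streaming versions that support it, that is what `-numReduceTasks 0` requests.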
                           Trick #2
 Q: My annotator is a directory of programs, but Hadoop Streaming only
  accepts files.
 Solution: still Hadoop Streaming
   – Make a tarball of your directory of programs
   – map.pl needs to extract & launch the program
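
 The extract-and-launch step can be sketched like this, again in Python rather than
  the Perl of map.pl. The function names, the ".unpacked" marker file, and the idea of
  unpacking only once per working directory are assumptions of this sketch, not
  details from the slides.

```python
import os
import subprocess
import tarfile


def unpack_tools(tarball, dest):
    """Extract the tarball of programs into dest, skipping the work if already done."""
    marker = os.path.join(dest, ".unpacked")   # hypothetical "already extracted" flag
    if not os.path.exists(marker):
        os.makedirs(dest, exist_ok=True)
        with tarfile.open(tarball) as tar:
            tar.extractall(dest)
        open(marker, "w").close()
    return dest


def launch(tool_dir, entry, args):
    """Run one program from the extracted directory and return its exit code."""
    cmd = [os.path.join(tool_dir, entry)] + list(args)
    return subprocess.call(cmd)
```

 The tarball itself is a single file, so it can be shipped with the job the same way
  the mapper script is (for example with the streaming `-file` option).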
                           Trick #3
 Q: Hadoop programs are running on backend nodes, and are difficult to
  debug
 Use STDERR for debugging
   – Also, if using HOD (Hadoop On Demand) for managing the cluster
        View STDERR through the Web monitoring interface
        See the time spent on each Map/Reduce task
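
 A minimal sketch of STDERR-based debugging inside a streaming task. STDOUT carries
  the task's records, so diagnostics go only to STDERR. The helper names log and timed
  and the "[map.debug]" prefix are made up for this sketch.

```python
import sys
import time


def log(message, stream=sys.stderr):
    """Write one diagnostic line to stderr; stdout stays reserved for records."""
    stream.write(f"[map.debug] {message}\n")
    stream.flush()


def timed(label, func, *args, stream=sys.stderr):
    """Call func(*args), log how long it took, and return its result."""
    start = time.time()
    result = func(*args)
    log(f"{label} took {time.time() - start:.3f}s", stream)
    return result
```

 Wrapping the expensive step of a mapper in timed(...) gives a per-record timing trace
  in the task's STDERR log, which complements the per-task times shown in the Web
  interface.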
                      Pitfall #1:
           It’s All a Matter of Balance
 For higher performance:
 It is important to have the right balance between Map & Reduce tasks
 The default number of Map/Reduce processes per node is 2
   – But some multicore / multiprocessor nodes can easily handle more (e.g., 6
       on M45)
 There is no good way to determine the right balance, except by parameter
  sweeps
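
 A parameter sweep of this kind can be sketched as a small timing loop. Everything
  here is an assumption for illustration: run_job stands for whatever callable
  reconfigures the cluster or job and runs a representative workload, and the two
  swept values stand for the Map and Reduce settings being balanced.

```python
import itertools
import time


def sweep(run_job, map_settings, reduce_settings, clock=time.time):
    """Time run_job for every (map, reduce) combination; fastest result first."""
    results = []
    for maps, reduces in itertools.product(map_settings, reduce_settings):
        start = clock()
        run_job(maps, reduces)     # hypothetical: launches the job and waits for it
        results.append((clock() - start, maps, reduces))
    return sorted(results)
```

 The first entry of the returned list is the fastest combination observed. A single
  run per setting is noisy on a shared cluster, so repeating each setting would give a
  more trustworthy picture.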
                   Pitfall #2:
            Things Die, No Idea Why
 Fault Tolerance and Diagnosis
 If a Reduce task becomes unresponsive, it is killed
   – E.g., if it is overwhelmed with work
   – E.g., if its sort task is overwhelmed with work
   – Diagnosing the cause of an unresponsive Reduce process is not always
      easy
   – Sometimes solved by increasing number of reducers
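
 One cause that can be ruled out from inside a streaming task is the task going silent
  while it is still working. The sketch below is an assumption-laden illustration: it
  relies on the `reporter:status:<message>` STDERR convention described in the Hadoop
  Streaming documentation of later releases, which may not exist in the version used
  here, and it does nothing for a shuffle or sort phase that is itself overwhelmed. The
  Heartbeat class is a made-up name.

```python
import sys
import time


class Heartbeat:
    """Emit a status line on stderr at most once per interval (in seconds)."""

    def __init__(self, interval=60.0, stream=sys.stderr, clock=time.time):
        self.interval = interval
        self.stream = stream
        self.clock = clock
        self.last = clock()

    def tick(self, message):
        """Call often from the work loop; returns True when a status line was sent."""
        now = self.clock()
        if now - self.last >= self.interval:
            # Assumed convention: streaming treats this stderr line as a status update.
            self.stream.write(f"reporter:status:{message}\n")
            self.stream.flush()
            self.last = now
            return True
        return False
```

 Calling tick(...) once per input record keeps the cost negligible, since a line is
  written only when the interval has elapsed.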
               Unsolved Problems
 Monitoring cluster for diagnostics
   – CPU, Network, Disk I/O, Swap, etc.
   – There is a Simon web interface, but it is not working
 HOD (or Torque?) does not allow scheduling and prioritizing jobs
 Reduce runs on only a few nodes, so the other nodes can sit idle for a
  long time
 Shuffle & sort is opaque
   – yet another black box
      Thanks!
Comments?   Ideas?