Hadoop in the Wild -- Our current understanding


Hadoop (MapReduce) in the Wild
Our current understanding and uses of Hadoop
Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan
Presenter: Le Zhao
2008-05-08
The REAP Project
REAP is an intelligent tutor for English language learning
Intelligent tutors often use student models to generate individualized
instruction for each student
REAP cannot generate texts, but it can recognize texts
– POS tagging, NE tagging, search, categorization, …
– #filreq(
#band( #greater( textquality 85 ) #greater( readinglevel 2 )
#less( readinglevel 9 ) #less( doclength 1001 ) )
#combine( advocate evidence destroy propose … acceptance ) )
So, REAP needs a very large database of annotated texts
Previous approach to collecting texts didn’t scale well
– Gathered ~1 M high quality docs in ~1 year
– Typical yield rate <1%, for docs fitting tutoring criteria
Tasks Done on Hadoop
Web crawling of 200 million Web documents
– Two web crawls of 100 million web pages each
Text annotation and categorization of Web pages
– Part-of-speech, named-entity, sentence breaking
– Reading level, text quality, topic classification
– Output: {6 TB} + {42 GB} offset annotation
Filtering documents according to text quality
– Output: 6 million high quality HTML documents (114 GB)
Generating graph structure & PageRank
– Class project
Getting Started With Hadoop Quickly
Hadoop Streaming has been our most important tool for porting legacy tasks
and tools to Hadoop
Runs any program with STDIN -> STDOUT
No need to recompile or relink with Hadoop libraries
With one file per record, streaming is not the most efficient implementation
– Poor data locality
But very efficient in human time
– A day or two to get something running on 100 nodes
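As a minimal sketch of the STDIN -> STDOUT contract, the mapper below reads records line by line and writes tab-separated key/value pairs. Python is used here only for illustration (the deck's own scripts are Perl), and the token-counting logic and the names `map_line` and `run` are assumptions, not the REAP code.

```python
import sys


def map_line(line):
    """Turn one input record into zero or more (key, value) pairs."""
    for token in line.split():
        yield token.lower(), 1


def run(stdin, stdout):
    """Read records from stdin and write 'key<TAB>value' lines to stdout."""
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")


# When deployed as a streaming mapper, end the script with:
#     run(sys.stdin, sys.stdout)
```

Such a script would typically be passed to the streaming jar with the `-input`, `-output`, `-mapper`, and `-file` options; the jar location and HDFS paths depend on the installation.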
Hadoop in the Wild:
Trick #1
Q: My annotator takes file input, not STDIN
Solution: still Hadoop Streaming
– Prepare a list of filenames
– Distribute the filenames instead of file contents
– map.pl
takes one filename
download the file from HDFS (Hadoop Distributed File System)
apply the annotator
upload resulting files to HDFS
– No reducer needed
Can port any data-distributive program onto Hadoop in a day
Efficient enough for computation-intensive tasks
– Even with low data locality
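The filename-distribution steps above can be sketched as follows. This is a Python stand-in for map.pl under stated assumptions: the annotator is a hypothetical executable `./annotate` that takes an input path and an output path, the HDFS output directory is a placeholder, and `hadoop fs -get` / `hadoop fs -put` are used for the download and upload.

```python
import os
import subprocess
import tempfile

ANNOTATOR = "./annotate"             # hypothetical annotator executable
OUTPUT_DIR = "/user/reap/annotated"  # hypothetical HDFS output directory


def plan_commands(hdfs_path, workdir, annotator=ANNOTATOR, output_dir=OUTPUT_DIR):
    """Return the download, annotate, and upload commands for one file."""
    name = os.path.basename(hdfs_path)
    local_in = os.path.join(workdir, name)
    local_out = local_in + ".ann"
    return [
        ["hadoop", "fs", "-get", hdfs_path, local_in],
        [annotator, local_in, local_out],
        ["hadoop", "fs", "-put", local_out, output_dir + "/" + name + ".ann"],
    ]


def process(hdfs_path, runner=subprocess.check_call):
    """Run the three steps for one file inside a scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        for command in plan_commands(hdfs_path, workdir):
            runner(command)


def run(stdin, runner=subprocess.check_call):
    """Each input record is one HDFS filename; blank lines are skipped."""
    for line in stdin:
        path = line.strip()
        if path:
            process(path, runner)
```

Because the mapper writes its results straight to HDFS, the job can run map-only; in Hadoop releases of this era that is usually requested by setting the reducer count to zero, which should be checked against the installed version.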
Trick #2
Q: My annotator is a directory of programs, but Hadoop Streaming only
accepts files.
Solution: still Hadoop Streaming
– Make a tarball of your directory of programs
– map.pl needs to extract & launch the program
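A minimal sketch of the extract-and-launch step, again in Python rather than Perl. The tarball name, the entry-point path, and the `.unpacked` marker file are assumptions for illustration; the tarball itself would be shipped with the job (for example through the streaming `-file` option).

```python
import os
import subprocess
import tarfile


def unpack_tools(tarball, dest):
    """Extract the tarball into dest once and return dest."""
    marker = os.path.join(dest, ".unpacked")
    if not os.path.exists(marker):
        os.makedirs(dest, exist_ok=True)
        with tarfile.open(tarball) as archive:
            archive.extractall(dest)
        # Leave a marker so later records in the same task skip extraction.
        open(marker, "w").close()
    return dest


def launch(tool_dir, entry_point, args):
    """Run one program from the extracted directory; return its exit code."""
    program = os.path.join(tool_dir, entry_point)
    return subprocess.call([program] + list(args))
```

A mapper would call `unpack_tools("tools.tar.gz", "tools")` before its first record and then `launch(...)` per record; both names are placeholders.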
Trick #3
Q: Hadoop programs are running on backend nodes, and are difficult to
debug
Use STDERR for debugging
– Also, if using HOD (Hadoop On Demand) to manage the cluster:
View STDERR through the Web monitoring interface
See time spent on each Map/Reduce task
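One way to use STDERR for debugging is to write timestamped messages around each expensive step, so the per-task logs show where the time went. The sketch below is an assumption about how a mapper might do this; the helper names `log` and `timed` are hypothetical.

```python
import sys
import time


def log(message, stream=sys.stderr):
    """Write one timestamped debug line; STDOUT stays reserved for records."""
    stream.write("[%s] %s\n" % (time.strftime("%H:%M:%S"), message))
    stream.flush()


def timed(label, func, *args, stream=sys.stderr):
    """Call func(*args), log how long it took, and return its result."""
    start = time.time()
    try:
        return func(*args)
    finally:
        log("%s took %.2fs" % (label, time.time() - start), stream)
```

For example, `timed("annotate", annotate_file, path)` would log the annotation time for each file, where `annotate_file` stands for whatever per-record work the mapper does.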
Pitfall #1:
It’s All a Matter of Balance
For higher performance:
It is important to have the right balance between Map & Reduce tasks
The default number of Map/Reduce processes per node is 2
– But some multicore / multiprocessor nodes can easily handle more (e.g., 6
on M45)
There is no good way to determine the right balance, except by parameter
sweeps
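A parameter sweep of this kind can be sketched as a small grid search. The `measure` callback is hypothetical: it is assumed to reconfigure the per-node map and reduce slot counts, run a representative job, and return the elapsed seconds. In Hadoop releases of this era the slot counts are, to our knowledge, set through tasktracker properties such as `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`; check the names against the installed version.

```python
import itertools


def sweep(measure, map_slots, reduce_slots):
    """Try every (map, reduce) combination; return results fastest first.

    measure(m, r) is supplied by the caller and returns elapsed seconds.
    """
    results = []
    for m, r in itertools.product(map_slots, reduce_slots):
        seconds = measure(m, r)
        results.append((seconds, m, r))
    results.sort()
    return results
```

The first entry of the returned list is the fastest configuration found; the grid values themselves are a judgment call for each cluster.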
Pitfall #2:
Things Die, No Idea Why
Fault Tolerance and Diagnosis
If a Reduce task becomes unresponsive, it is killed
– E.g., if it is overwhelmed with work
– E.g., if its sort task is overwhelmed with work
– Diagnosing the cause of an unresponsive Reduce process is not always
easy
– Sometimes solved by increasing number of reducers
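Why more reducers only sometimes helps can be illustrated with a toy partitioning model: keys are assigned to reducers by a hash modulo the reducer count, so more reducers spread distinct keys more thinly, but a single hot key still lands on one reducer. The sketch uses CRC32 as a stand-in hash; it is not Hadoop's actual partitioner.

```python
import zlib
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count how many records each reducer would receive."""
    loads = Counter()
    for key in keys:
        # Stand-in for hash partitioning: hash(key) mod number of reducers.
        loads[zlib.crc32(key.encode("utf-8")) % num_reducers] += 1
    return [loads[i] for i in range(num_reducers)]
```

Comparing `max(reducer_loads(keys, 2))` with `max(reducer_loads(keys, 8))` on a sample of real keys gives a rough idea of whether raising the reducer count (a job setting, `mapred.reduce.tasks` in releases of this era as far as we know) is likely to relieve an overloaded reducer.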
Unsolved Problems
Monitoring cluster for diagnostics
– CPU, Network, Disk I/O, Swap, etc.
– Simon web interface exists, but it is not working
HOD (or Torque?) does not allow scheduling and prioritizing jobs
Reduce happens on only a few nodes, so the other nodes can sit idle for a
long time while they wait.
Shuffle & sort is opaque
– yet another black box
Thanks!
Comments? Ideas?