Introduction to Hadoop

Driven by Python

Jon Miller
jonEbird@gmail.com
http://jonebird.com/
09/27/09

What is Hadoop?

● Named after Doug Cutting's son's stuffed toy elephant
● Distributed MapReduce system
● Apache project with multiple sub-projects: Core and HDFS, then HBase, Hive, Pig, ZooKeeper

Where is the Python?

● Hadoop Streaming
● Automatically copies your Python script to the nodes
● Uses STDIN / STDOUT to communicate

Hadoop Architecture

● Expect hardware failures
● Take the computing to the data; do NOT pull the data to the compute
● Datanodes, Tasktrackers & Jobtracker

Web Analytics Example

Mapper

    #!/usr/bin/env python
    import sys

    IGNORE_SITES = [
        'http://jonebird.com/',
        'http://www.jonebird.com/'
    ]

    for line in sys.stdin:
        # only handle lines with exactly three quoted fields
        # (request, referer, user agent), i.e. six double quotes
        if line.count('"') == 6:
            # some entries I do not care about:
            #   1. Discard if referer is myself
            #   2. Discard if there is _no_ referer. i.e. "-"
            referer = line.split('"')[3]
            can_ignore = any(referer.startswith(site) for site in IGNORE_SITES)
            if referer != '-' and not can_ignore:
                # parenthesized so it runs under both Python 2 and Python 3
                print('%s\t%d' % (referer, 1))

Reducer

    #!/usr/bin/env python
    import sys

    referer_count = {}

    # parse input from the mapping process
    for line in sys.stdin:
        try:
            referer, count = line.strip().split('\t', 1)
            count = int(count)
            referer_count[referer] = referer_count.get(referer, 0) + count
        except ValueError:
            # ignoring odd failures
            pass

    # Report our results
    # items() (rather than iteritems()) runs under both Python 2 and Python 3
    for referer, count in referer_count.items():
        print('%s\t%s' % (referer, count))

Invocation

    # With $HADOOP_HOME (and $HADOOP_VERSION) set
    PATH=$PATH:${HADOOP_HOME}/bin

    hadoop dfs -copyFromLocal /var/log/httpd/ apache_logs

    export HSTREAM="${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/contrib/streaming/hadoop-${HADOOP_VERSION}-streaming.jar"

    # Now run the following command to get a quick
    # usage statement about using the streamer
    $HSTREAM -info

    $HSTREAM -D mapred.job.name='Apache Referer' \
        -input apache_logs/access_log* \
        -output apache_referer \
        -mapper $(pwd)/mapper.py \
        -reducer $(pwd)/reducer.py

Results

    # With $HADOOP_HOME set
    PATH=$PATH:${HADOOP_HOME}/bin

    # View the resultant data sets in the HDFS
    hadoop dfs -ls apache_referer
    hadoop dfs -cat apache_referer/part*

Why Should I Care?

Questions?

Interwebs

● http://hadoop.apache.org/
● http://cloudera.com/
● http://developer.yahoo.com/hadoop/tutorial/

Books

● Hadoop: The Definitive Guide by Tom White
● Pro Hadoop by Jason Venner

Videos

● Google MapReduce Lectures: http://www.youtube.com/watch?v=yjPBkvYh-ss

Creative Commons License v3.0
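Trying the example without a cluster

The mapper and reducer from the Web Analytics Example only talk over STDIN / STDOUT, so their logic can be checked locally before a job is submitted. The following is a minimal sketch in Python 3 that imitates the map, sort and reduce steps inside one process. The function names (map_line, reduce_pairs, run_local) and the sample log lines are hypothetical and are not part of the original slides; only IGNORE_SITES and the filtering rules are taken from the mapper.

```python
# Local, single-process imitation of the streaming job (illustrative sketch).
# Assumption: Python 3; the helper names and sample data below are made up.

IGNORE_SITES = ('http://jonebird.com/', 'http://www.jonebird.com/')


def map_line(line):
    """Return (referer, 1) for a log line worth counting, else None."""
    # Same filter as the mapper: exactly six double quotes on the line.
    if line.count('"') != 6:
        return None
    referer = line.split('"')[3]
    # Skip empty referers ("-") and self-referrals.
    # str.startswith accepts a tuple of prefixes.
    if referer == '-' or referer.startswith(IGNORE_SITES):
        return None
    return referer, 1


def reduce_pairs(pairs):
    """Sum the counts per referer, like the reducer's dictionary."""
    counts = {}
    for referer, count in pairs:
        counts[referer] = counts.get(referer, 0) + count
    return counts


def run_local(lines):
    """Map every line, sort the pairs (stand-in for the shuffle), reduce."""
    mapped = [pair for pair in map(map_line, lines) if pair is not None]
    mapped.sort()
    return reduce_pairs(mapped)


# Hypothetical access-log lines in a combined-log-like layout.
SAMPLE_LOG = [
    '1.2.3.4 - - [27/Sep/2009:10:00:00 -0400] "GET / HTTP/1.1" 200 512 '
    '"http://example.com/a" "Mozilla/5.0"',
    '1.2.3.5 - - [27/Sep/2009:10:00:01 -0400] "GET /x HTTP/1.1" 200 128 '
    '"http://example.com/a" "Mozilla/5.0"',
    '1.2.3.6 - - [27/Sep/2009:10:00:02 -0400] "GET /y HTTP/1.1" 200 256 '
    '"http://example.org/b" "Mozilla/5.0"',
    '1.2.3.7 - - [27/Sep/2009:10:00:03 -0400] "GET /z HTTP/1.1" 200 64 '
    '"-" "Mozilla/5.0"',
    '1.2.3.8 - - [27/Sep/2009:10:00:04 -0400] "GET /w HTTP/1.1" 200 64 '
    '"http://jonebird.com/about" "Mozilla/5.0"',
    'not a log line at all',
]

if __name__ == '__main__':
    for referer, count in sorted(run_local(SAMPLE_LOG).items()):
        print('%s\t%d' % (referer, count))
```

The same check can usually be done from a shell by piping a log file through the two scripts with a sort in between (cat access_log | ./mapper.py | sort | ./reducer.py), assuming both scripts are executable.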
