Introduction to Hadoop

Driven by Python
Jon Miller
jonEbird@gmail.com
http://jonebird.com/
09/27/09
Creative Commons License v3.0

What is Hadoop?

● Named after Doug Cutting's son's stuffed toy elephant
● Distributed MapReduce system
● Apache project with multiple sub-projects: Core, HDFS, then HBase, Hive, Pig, ZooKeeper

Where is the Python?

● Hadoop Streaming
● Automatically copies your Python script to the nodes
● Uses STDIN / STDOUT to communicate

Hadoop Architecture

● Expect hardware failures
● Take the computing to the data; do NOT pull the data to the compute
● Datanodes, Tasktrackers & Jobtracker

Web Analytics Example

Mapper

    #!/usr/bin/env python
    import sys

    IGNORE_SITES = ['http://jonebird.com/', 'http://www.jonebird.com/']

    for line in sys.stdin:
        # Six double quotes means three quoted fields; the referer is the second.
        if line.count('"') == 6:
            # Some entries I do not care about:
            #  1. Discard if referer is myself
            #  2. Discard if there is _no_ referer, i.e. "-"
            referer = line.split('"')[3]
            can_ignore = any(referer.startswith(site) for site in IGNORE_SITES)
            if referer != '-' and not can_ignore:
                print('%s\t%d' % (referer, 1))

Reducer

    #!/usr/bin/env python
    import sys

    referer_count = {}

    # Parse input from the mapping process
    for line in sys.stdin:
        try:
            referer, count = line.strip().split('\t', 1)
            count = int(count)
            referer_count[referer] = referer_count.get(referer, 0) + count
        except ValueError:
            # Ignoring odd failures
            pass

    # Report our results
    for referer, count in referer_count.items():
        print('%s\t%s' % (referer, count))

Invocation

    # With $HADOOP_HOME
    PATH=$PATH:${HADOOP_HOME}/bin
    hadoop dfs -copyFromLocal /var/log/httpd/ apache_logs

    export HSTREAM="${HADOOP_HOME}/bin/hadoop jar \
      ${HADOOP_HOME}/contrib/streaming/hadoop-${HADOOP_VERSION}-streaming.jar"

    # Now run the following command to get a quick
    # usage statement about using the streamer
    $HSTREAM -info

    $HSTREAM -D mapred.job.name='Apache Referer' \
      -input apache_logs/access_log* \
      -output apache_referer \
      -mapper $(pwd)/mapper.py \
      -reducer $(pwd)/reducer.py

Results

    # With $HADOOP_HOME
    PATH=$PATH:${HADOOP_HOME}/bin

    # View the resultant data sets in the HDFS
    hadoop dfs -ls apache_referer
    hadoop dfs -cat apache_referer/part*

Why Should I Care?

Questions?

Interwebs
http://hadoop.apache.org/
http://cloudera.com/
http://developer.yahoo.com/hadoop/tutorial/

Books
Hadoop: The Definitive Guide by Tom White
Pro Hadoop by Jason Venner

Videos
Google MapReduce Lectures
http://www.youtube.com/watch?v=yjPBkvYh-ss
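
Local dry run

Because Hadoop Streaming only talks to the scripts over STDIN / STDOUT, the mapper and reducer logic from the web analytics example can be tried without a cluster. The following is a minimal sketch, not part of the original slides: the function names (map_lines, reduce_lines, run_local) and the sample log lines are hypothetical, made up for illustration. The sorted() call stands in for the sort Hadoop performs between the map and reduce phases.

```python
# Same sites the mapper in the slides ignores.
IGNORE_SITES = ['http://jonebird.com/', 'http://www.jonebird.com/']


def map_lines(lines):
    """Yield 'referer<TAB>1' records using the mapper's filtering rules."""
    for line in lines:
        if line.count('"') != 6:
            continue  # not the three quoted fields we expect
        referer = line.split('"')[3]
        if referer == '-' or any(referer.startswith(s) for s in IGNORE_SITES):
            continue  # no referer, or a self-referral
        yield '%s\t%d' % (referer, 1)


def reduce_lines(records):
    """Sum the counts per referer, as the reducer does with its dict."""
    totals = {}
    for record in records:
        try:
            referer, count = record.strip().split('\t', 1)
            totals[referer] = totals.get(referer, 0) + int(count)
        except ValueError:
            continue  # skip malformed records
    # Sorted only to make the output order deterministic.
    return ['%s\t%d' % (referer, total) for referer, total in sorted(totals.items())]


def run_local(lines):
    """Hypothetical stand-in for the streaming job: map, sort, reduce."""
    return reduce_lines(sorted(map_lines(lines)))


# Invented log lines for illustration only.
PREFIX = '1.2.3.4 - - [27/Sep/2009:10:00:00 -0400] "GET / HTTP/1.1" 200 512 '
SAMPLE_LOG = [
    PREFIX + '"http://example.com/a" "Mozilla/5.0"',
    PREFIX + '"-" "Mozilla/5.0"',
    PREFIX + '"http://jonebird.com/about" "Mozilla/5.0"',
    PREFIX + '"http://example.com/a" "Mozilla/5.0"',
    PREFIX + '"http://example.org/b" "Mozilla/5.0"',
    'garbage line without any quoted fields',
]

for record in run_local(SAMPLE_LOG):
    print(record)
```

The same check can be done with the real scripts by piping a log file through the mapper, a sort, and the reducer on a single machine before submitting the job.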
