          Introduction to Apache Hadoop

 CSCI 572: Information Retrieval and Search Engines
            Summer 2010
                      Outline
•   What is Hadoop?
•   Where did it come from?
•   What are the current versions of Hadoop?
•   What can it do?




    May-20-10          CS572-Summer2010        CAM-2
               Apache Hadoop
• The brainchild of Doug Cutting
• Built out by brilliant engineers and contributors
  from Yahoo, Facebook, Cloudera, and other companies
• Started in 2006, when the code was spun out of
  Nutch; became a top-level Apache project in 2008
• Has grown into a really large project at Apache with
  a significant ecosystem
                    How to get started
• Hadoop (0.20.0/0.20.2)
   – Put your Java hat on
   – Go here:
         • http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
         • If you want to do this on Windows, get Cygwin, VMware, or
           something else that lets you run Linux
         • Run the Map Reduce examples in local mode
         • Check the data generated in your HDFS
   – Scaling it out
         • Amazon Elastic MapReduce
         • Setting it up on your own cluster: NameNode and DataNodes,
           JobTracker and TaskTrackers
                  Basic Operations
• Listing files
   – ./bin/hadoop fs -ls
• Writing files
   – ./bin/hadoop fs -put <localsrc> <dst>
• Running Map Reduce Jobs
   – mkdir input
   – cp conf/*.xml input
   – ./bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
   – cat output/*
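
To make the grep example job above less of a black box, here is a minimal plain-Python sketch of what it computes: every substring matching the regular expression is extracted from the input and the occurrences of each distinct match are counted. This is not the Hadoop API and not the example's actual source. The function name `grep_count`, the tie-breaking rule, and the sample lines are assumptions made up for illustration.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count each distinct regex match across all input lines."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        # "Map" step: emit every substring of the line that matches.
        for match in regex.findall(line):
            # "Reduce" step, folded in here: sum occurrences per match.
            counts[match] += 1
    # Most frequent matches first; ties broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical stand-ins for lines from the copied conf/*.xml files.
sample = [
    "<name>dfs.replication</name>",
    "<name>dfs.name.dir</name>",
    "<description>dfs.replication sets the copy count</description>",
]
result = grep_count(sample, r"dfs[a-z.]+")
for match, count in result:
    print(count, match)
```

On a real cluster the same counting is spread across many map and reduce tasks; the sketch only shows the logic on a single machine.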
                Advanced Topics
• Writing your Mappers and Reducers
   – Check out the Map Reduce Tutorial here:
   – http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
   – It includes code for several examples, including Word Count
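
Before reading the tutorial's Java code, it may help to see the shape of a Word Count job in miniature. The sketch below is plain Python, not Hadoop's Mapper/Reducer API: it only illustrates the map, shuffle, and reduce phases in a single process. All function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and lowercasing the input is an added assumption.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word on the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, values):
    """Reducer: sum the counts collected for one word."""
    return word, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, values)
                for word, values in sorted(grouped.items()))


counts = word_count(["Hello Hadoop", "Goodbye Hadoop hello"])
print(counts)
```

In Hadoop itself the user writes only the mapper and the reducer; the grouping done here by `shuffle` is handled by the framework between the two phases.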




       Other Hadoop ecosystem projects
• HBase
   – Distributed data store modeled on Google's Bigtable
• Hive
   – Built at Facebook; provides a SQL-like interface to data in HDFS
• Chukwa
   – Log collection and processing
• Pig
   – Data analysis language on top of M/R and HDFS
• ZooKeeper
   – Coordination service for distributed systems

              No releases in a while
• Stick with 0.20.x




                      Wrapup
• Lots more information at
   – http://hadoop.apache.org
   – http://hadoop.apache.org/mapreduce/
   – http://hadoop.apache.org/hdfs/
• Project ideas
   – Implement a GIS or geometric algorithm in Map Reduce
   – Write a REST interface to control HDFS and M/R
   – Add new Writable input data formats
   – Integrate Solr and Hadoop
              Acknowledgements
• Material inspired by discussions and talks on the
  Apache mailing lists for Hadoop and by
  discussions with the rest of the Hadoop community





				