					  Azure MapReduce

      Thilina Gunarathne
Salsa group, Indiana University
                    Agenda
•   Recap of Azure Cloud Services
•   Recap of MapReduce
•   Azure MapReduce Architecture
•   Pairwise distance alignment implementation
•   Next steps
              Cloud Computing
• On-demand computational services over the web
   – Backed by massive commercial infrastructures giving
     economies of scale
   – A good fit for the spiky compute needs of scientists
• Horizontal scaling with no additional cost
   – Increased throughput
• Cloud infrastructure services
   – Storage, messaging, tabular storage
   – Cloud-oriented service guarantees
   – Virtually unlimited scalability
• Future seems to be CLOUDY!!!
               Azure Platform
• Windows Azure Compute
  – .NET platform as a service
  – Worker roles & web roles
• Azure Storage
  – Blobs
  – Queues
  – Table
• Development SDK, fabric and storage
                    MapReduce
•   Automatic parallelization & distribution
•   Fault-tolerant
•   Provides status and monitoring tools
•   Clean abstraction for programmers
    – map (in_key, in_value) ->
       (out_key, intermediate_value) list

    – reduce (out_key, intermediate_value list) ->
       out_value list
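
The map and reduce signatures above can be illustrated with a small in-memory word count. This is only a sketch of the programming model in Python, not Azure MapReduce code: the names map_words, reduce_counts and run_mapreduce are hypothetical, and the sequential driver stands in for the parallel, fault-tolerant runtime.

```python
from collections import defaultdict


def map_words(in_key, in_value):
    """map (in_key, in_value) -> (out_key, intermediate_value) list."""
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_counts(out_key, intermediate_values):
    """reduce (out_key, intermediate_value list) -> out_value list."""
    return [sum(intermediate_values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    docs = [("doc1", "the cloud is the future"), ("doc2", "cloud services")]
    print(run_mapreduce(docs, map_words, reduce_counts))
```

The programmer writes only the two small functions; grouping by key and distributing the work are left to the framework.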
                Motivation
• Currently no parallel programming framework
  on Azure
  – No MPI, No Dryad
• Well known, easy to use programming model
• Cloud nodes are not as reliable as
  conventional cluster nodes
      Azure MapReduce Concepts
• Take advantage of the cloud services
    – Distributed services, Unlimited scalability
    – Backed by industrial strength data centers and
      technologies
•   Decentralized control
•   Dynamically scale up/down
•   Eventual consistency
•   Large latencies
    – Coarser grained map tasks
• Global queue based scheduling
1. The client driver loads the map and reduce tasks into the queues.
2. Map workers retrieve map tasks from the queue.
3. Map workers download data from Blob storage and start processing.
4. Reduce workers pick tasks from the queue and start monitoring the
   reduce task tables.
5. Finished map tasks upload their results to Blob storage and add
   entries to the respective reduce task tables.
6. Reduce tasks download the intermediate data products.
7. Reducing starts when all the map tasks are finished and the reduce
   task has finished downloading the intermediate data products.
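
The seven steps can be traced in a single-process simulation. The sketch below is an assumption-laden illustration, not the actual implementation: Python dictionaries and deques stand in for Blob storage, the reduce task tables, and the queues, and the names run_job and partition, the CRC-based partitioning, and the blob naming scheme are all hypothetical.

```python
import zlib
from collections import defaultdict, deque


def partition(key, num_reduce_tasks):
    """Stable hash so that a given key always goes to the same reduce task."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reduce_tasks


def run_job(input_blobs, map_fn, reduce_fn, num_reduce_tasks):
    blobs = dict(input_blobs)          # stand-in for Blob storage
    reduce_tables = defaultdict(list)  # stand-in for the reduce task tables

    # Step 1: the client driver loads map and reduce tasks into the queues.
    map_queue = deque(input_blobs)     # one map task per input blob name
    reduce_queue = deque(range(num_reduce_tasks))
    maps_finished = 0

    # Steps 2, 3, 5: take a map task, download its blob, process it, upload
    # the partitioned output and register it in the reduce task tables.
    while map_queue:
        blob_name = map_queue.popleft()
        partitions = defaultdict(list)
        for key, value in map_fn(blob_name, blobs[blob_name]):
            partitions[partition(key, num_reduce_tasks)].append((key, value))
        for reduce_id, pairs in partitions.items():
            out_name = f"{blob_name}.part{reduce_id}"
            blobs[out_name] = pairs
            reduce_tables[reduce_id].append(out_name)
        maps_finished += 1

    # Steps 4, 6, 7: take a reduce task, download the intermediate data
    # listed in its table, and reduce once every map task has finished.
    results = {}
    while reduce_queue:
        reduce_id = reduce_queue.popleft()
        if maps_finished < len(input_blobs):
            raise RuntimeError("reduce started before all map tasks finished")
        grouped = defaultdict(list)
        for out_name in reduce_tables[reduce_id]:
            for key, value in blobs[out_name]:
                grouped[key].append(value)
        for key, values in grouped.items():
            results[key] = reduce_fn(key, values)
    return results
```

In the real system the map and reduce loops run concurrently on separate worker roles, which is why reduce workers monitor their tables instead of simply running after the map loop.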
    Azure MapReduce Architecture
•   Client API and driver
•   Map tasks
•   Reduce tasks
•   Intermediate data transfer
•   Monitoring
•   Configurations
                Fault tolerance
• Use the visibility timeout of the queues
  – Currently maximum is 3 hours
  – Delete the message from the queue only after
    everything is successful
  – Execution, upload, update status
• Tasks will rerun when timeout happens
  – Ensures eventual completion
  – Intermediate data are persisted in blob storage
  – Retry up to 3 times
• Many retries in service invocations
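
The visibility-timeout mechanism can be sketched with an in-memory queue and a logical clock. Everything here is hypothetical (VisibilityQueue, work_once, MAX_RETRIES) and only mimics the behaviour described above; it does not use the Azure queue API, and the exact retry semantics are an assumption.

```python
MAX_RETRIES = 3  # mirrors "retry up to 3 times"; exact semantics are assumed


class VisibilityQueue:
    """In-memory stand-in for a cloud queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, visible_at, dequeue_count]
        self._next_id = 0

    def put(self, body):
        self._messages[self._next_id] = [body, 0.0, 0]
        self._next_id += 1

    def get(self, now):
        """Hide and return (id, body, dequeue_count) of a visible message."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                entry[2] += 1
                return msg_id, entry[0], entry[2]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)


def work_once(queue, now, task_fn):
    """Run one task; delete its message only after everything succeeded."""
    message = queue.get(now)
    if message is None:
        return "idle"
    msg_id, body, attempts = message
    if attempts > MAX_RETRIES + 1:
        queue.delete(msg_id)  # stop rerunning a task that keeps failing
        return "abandoned"
    try:
        task_fn(body)         # execution, upload, status update
    except Exception:
        return "failed"       # message becomes visible again after the timeout
    queue.delete(msg_id)
    return "done"
```

If a worker fails at time 0 with a timeout of 10, another worker sees nothing at time 5, picks the same task up at time 10, and deletes the message only once it completes.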
                  Apache Hadoop [24]      Microsoft Dryad [25]      Twister [19]          Azure MapReduce/Twister
                  / (Google MR)

Programming model MapReduce               DAG execution,            Iterative MapReduce   MapReduce; will extend
                                          extensible to MapReduce                         to Iterative MapReduce
                                          and other patterns

Data handling     HDFS (Hadoop            Shared directories and    Local disks and data  Azure Blob Storage
                  Distributed File        local disks               management tools
                  System)

Scheduling        Data locality; rack     Data locality; network    Data locality;        Dynamic task
                  aware; dynamic task     topology based run-time   static task           scheduling through
                  scheduling through      graph optimizations;      partitions            global queue
                  global queue            static task partitions

Failure handling  Re-execution of failed  Re-execution of failed    Re-execution of       Re-execution of failed
                  tasks; duplicate        tasks; duplicate          iterations            tasks; duplicate
                  execution of slow       execution of slow                               execution of slow
                  tasks                   tasks                                           tasks

Environment       Linux clusters, Amazon  Windows HPCS cluster      Linux cluster, EC2    Windows Azure Compute,
                  Elastic MapReduce on                                                    Windows Azure Local
                  EC2                                                                     Development Fabric

Intermediate data File, HTTP              File, TCP pipes,          Publish/subscribe     Files, TCP
transfer                                  shared-memory FIFOs       messaging
           Why Azure Services
• Virtually unlimited, scalable distributed
  services
• No need to install software stacks
  – In fact you can’t 
  – E.g., NaradaBrokering, HDFS, databases
• Zero maintenance
  – Let the platform take care of you
• Availability guarantees
                      API
• ProcessMapRed(jobid, container, params,
     numReduceTasks, storageAccount,
     mapQName, reduceQName, List mapTasks)
• Map(key, value, programArgs, Dictionary
     outputCollector)
• Reduce(key, List values, programArgs, Dictionary
     outputCollector)
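
A Python analogue of the Map and Reduce callbacks may help show how user code plugs into this API. The slide does not show the type parameters of List or Dictionary, so the sketch assumes the output collector is a mapping from output key to output value; the function names and the "city,temperature" input format are hypothetical, and the real API is .NET.

```python
def map_max_temp(key, value, program_args, output_collector):
    """key: input name; value: text with one 'city,temperature' per line.

    Writes the highest temperature seen per city into output_collector.
    """
    for line in value.splitlines():
        city, temp_text = line.split(",")
        temp = float(temp_text)
        if city not in output_collector or temp > output_collector[city]:
            output_collector[city] = temp


def reduce_max_temp(key, values, program_args, output_collector):
    """key: city; values: per-map maxima for that city."""
    output_collector[key] = max(values)


if __name__ == "__main__":
    collected = {}
    map_max_temp("part-0", "Bloomington,31.5\nSeattle,24.0\nBloomington,29.0",
                 None, collected)
    print(collected)
```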
   Develop applications using Azure
            MapReduce
• Local debugging using Azure development
  fabric
• DistributedCache
  – Bundle with Azure Package
• Compile in release mode before creating the
  package.
• Deploy using Azure web interface
• Errors are logged to an Azure Table
 SWG Pairwise Distance Alignment
• Smith-Waterman-Gotoh (SWG)
• Pairwise sequence alignment
  – Align each sequence with all the other sequences
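
For readers unfamiliar with the kernel each map task runs, here is a minimal sketch of Smith-Waterman local alignment scoring with Gotoh's affine gap penalties. It is not the implementation used in these experiments: the function name swg_score and the scoring parameters (match, mismatch, gap_open, gap_extend) are illustrative assumptions, and it returns only the best local alignment score.

```python
def swg_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of strings a and b with affine gaps.

    gap_open is the cost of the first gap position; gap_extend is the
    cost of each additional position in the same gap.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # best score ending at (i-1, j)
    f_prev = [neg_inf] * cols  # best score ending in a vertical gap at (i-1, j)
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # best score ending in a horizontal gap at (i, j)
        for j in range(1, cols):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


if __name__ == "__main__":
    print(swg_score("AAAACCCC", "AAAAGGCCCC"))
```

Aligning every sequence with every other sequence means calling a kernel like this once per pair, which is what makes the problem a natural fit for coarse-grained map tasks.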
               Application architecture
                Block decomposition

                  1           2           3           4
               (1-100)    (101-200)   (201-300)   (301-400)

 1 (1-100)       M1          M2        from M6       M3       Reduce 1
 2 (101-200)   from M2       M4          M5        from M9    Reduce 2
 3 (201-300)     M6        from M5       M7          M8       Reduce 3
 4 (301-400)   from M3       M9        from M8       M10      Reduce 4
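
The block decomposition diagram exploits the symmetry of pairwise distances: each off-diagonal block is computed by one map task and its mirror block is derived from it. The sketch below reproduces the M1 to M10 assignment shown in the diagram. The parity rule it uses was inferred from the diagram rather than stated in the slides, and the names plan_blocks and source_task are hypothetical.

```python
def plan_blocks(num_blocks):
    """Return {(row, col): map_task_id} for the blocks that need a map task.

    Diagonal blocks are always computed. For an off-diagonal pair, the
    upper-triangle block is computed when row + col is odd; otherwise the
    lower-triangle mirror is computed. This spreads the work across rows.
    """
    tasks = {}
    next_id = 1
    for row in range(1, num_blocks + 1):
        for col in range(1, num_blocks + 1):
            if row == col:
                compute = True
            elif row < col:
                compute = (row + col) % 2 == 1
            else:
                compute = (row + col) % 2 == 0
            if compute:
                tasks[(row, col)] = next_id
                next_id += 1
    return tasks


def source_task(tasks, row, col):
    """Return (map_task_id, is_mirrored) for the block at (row, col)."""
    if (row, col) in tasks:
        return tasks[(row, col)], False
    return tasks[(col, row)], True


if __name__ == "__main__":
    plan = plan_blocks(4)
    for row in range(1, 5):
        cells = []
        for col in range(1, 5):
            task_id, mirrored = source_task(plan, row, col)
            cells.append(("from M" if mirrored else "M") + str(task_id))
        print(f"Reduce {row}: " + ", ".join(cells))
```

With n row blocks this needs n(n+1)/2 map tasks instead of n squared, which is 10 instead of 16 for the 4 by 4 example.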
                            AzureMR SWG Performance
                                 10k Sequences
[Figure: execution time (s) vs. number of Azure Small instances (0 to 160)]
                          AzureMR SWG Performance
                               10k Sequences
[Figure: time per alignment per instance (ms) vs. number of Azure Small instances (0 to 160)]
                       AzureMR SWG Performance on
                          Different Instance Types
[Figure: execution time (s) by instance type (Small, Medium, Large, ExtraLarge)]
                       AzureMR SWG Performance on
                          Different Data Sizes
[Figure: time per alignment per core (ms) vs. number of sequences (4000 to 10000)]
                   Next Steps
• In the works
  – Monitoring web interface
  – Alternative intermediate data communication
    mechanisms
  – Public release
• Future plans
  – AzureTwister
     • Iterative MapReduce
                 Thanks!!
• Questions? 
                      References
• J. Dean and S. Ghemawat, "MapReduce: simplified data
  processing on large clusters," Commun. ACM, vol. 51, no. 1,
  pp. 107-113, 2008.
• J. Ekanayake, H. Li, B. Zhang et al., "Twister: A Runtime for
  Iterative MapReduce," in Proceedings of the First International
  Workshop on MapReduce and its Applications, ACM HPDC
  2010, June 20-25, 2010, Chicago, Illinois.
• Cloudmapreduce,
  http://sites.google.com/site/huanliu/cloudmapreduce.pdf
• "Apache Hadoop," http://hadoop.apache.org/
• M. Isard, M. Budiu, Y. Yu et al., "Dryad: Distributed data-parallel
  programs from sequential building blocks," pp. 59-72.
            Acknowledgments
• Prof. Geoffrey Fox, Dr. Judy Qiu and the Salsa
  group
• Dr. Ying Chen and Alex De Luca from IBM
  Almaden Research Center
• Virtual School Organizers

				