Programming In Hadoop

					                               Cloud Computing
                                  @Yahoo!

Dekel Tankel
Director, Product Management
Yahoo! Cloud Computing

dekel@yahoo-inc.com

                               IGT, June 2009
           What we’ll cover today…

• Why Cloud?
   – Scale and Abstraction; Quality and Agility
   – Yahoo!’s unique footprint
• Yahoo!’s Cloud Strategy
   – Overview of the Yahoo! Cloud vision and portfolio
   – Deep dive on Horizontal & Functional Cloud Services
• The Yahoo! Open Strategy
   – Marrying Yahoo!’s “Open Strategy”, its platforms and ethic
     with external Cloud services
              Why Cloud? Benefits for Yahoo!

Higher Agility & Stability while maintaining Scale
• Abstraction
   – Enable developers to focus on
     their applications, not infrastructure
• Accelerating innovation
   – Adding new features and products
     at an ever faster rate
• Increasing Scale & Availability
   – More robustly, more globally,
     more completely, for a given budget

[Figure: "Cloud is pushing up the Operation Excellence Curve"; axes: Agility & Innovation, Quality & Stability]
           Yahoo!’s Unique Cloud:
           Unprecedented Scale
• Massive user base and engagement
  – 500M+ unique users per month
  – Hundreds of petabytes of storage
  – Hundreds of billions of objects
  – Hundreds of thousands of requests/sec

• Global
   – Tens of globally distributed data centers
   – Serving each region at low latencies

• Challenging Users
   – Rapidly extracting value from voluminous data
   – Downtime is not an option (outages cost $millions)
   – Variable usage patterns
              Yahoo! Cloud Services

[Figure: layered stack, top to bottom; side label: "ROI & Innovation"]
• Users
• Applications
• Functional Cloud Services: Y!OS, BOSS, YQL, APT, Analytics, …
• Horizontal Cloud Services: Storage, Batch, Edge Serving, …
• Physical Layer
  Yahoo! Cloud Services:
  Focus on PaaS offerings

[Figure: layered stack annotated with service models; side label: "ROI & Innovation"]

            Users
SaaS        Applications
PaaS        Functional Cloud Services
            Horizontal Cloud Services
IaaS        Physical Layer
           From Infrastructure to
           Shareholder Benefit
• Horizontal Cloud
   – Focus on open source and collaborative R&D with industry,
     academia and government
• Functional Cloud
   – Focus on developing "open strategy" frameworks, tools
     and services for developers (at Yahoo! and beyond)
• Combined Together
   – Leverage our unique scale, assets and data to drive
     disruptive innovations in the market and expand Yahoo!’s
     competitive differentiation
           Yahoo! Cloud Strategy in Action:
           The Front Page Case Study

• Horizontal Cloud – Storage & Hadoop
   – Analyze extremely large content data sets

• Functional Cloud – Content Optimization
   – Rate content items based on various parameters

• Applications – Yahoo!’s Front Page
   – Display “high rating” items to the right users
   – Benefit consumers and advertisers and grow
     Yahoo!’s revenue
           Yahoo! Cloud Strategy in Action:
           The Inquisitor Case Study
• Horizontal Cloud – Hadoop
   – Analyze large search-index data sets

• Functional Cloud – BOSS
   – Expose the data in a structured, open, flexible
     and “cloud like” way

• Applications – iPhone™ Inquisitor
   – Leverage BOSS to provide innovative consumer experience
   – Benefit consumers and grow Yahoo!’s revenue
Horizontal Cloud Services
         Horizontal Cloud Services

• Optimized for Yahoo!-scale
   – Yahoo!-internal focus
   – Data processing and serving environments
• Drive faster innovation and agility
   – Shorter product development cycles
   – Reduce labor and costs for infrastructure
• Multi-year effort
   – Strategic investment across the company
      Horizontal Cloud Services:
      Conceptual View

[Figure: layered stack, top to bottom]
• Simple APIs
• Operational Storage: structured, unstructured
• Batch Storage & Processing: Hadoop, Pig
• Edge Content Services: caching, proxies
• Online Serving: Web, data
• ID & Account Management; Security and Authentication; Metering, Billing; Monitoring & QoS
• Provisioning & Virtualization (Xen)
• Shared Infrastructure
• Common approaches to QA, Production Engineering, Performance Engineering,
  Datacenter Management, and Optimization
Horizontal Cloud Services: Use Cases

• Content Optimization
• Search Index
• Machine Learning (e.g. spam filters)
• Ads Optimization
• Attachment Storage
• Image/Video Storage & Delivery
          Yahoo! Distribution of Hadoop

• Hadoop in a nutshell
   – Open source distributed file system & parallel execution
     environment to process massive amounts of data
   – Started in 2005, became top-level Apache project in 2008
   – Simple Design for Horizontal Scaling on commodity HW

• Yahoo! Distribution of Hadoop
   – Source distribution of Yahoo!’s implementation of Hadoop
     (based entirely on code found in Apache Hadoop)
   – Tested and deployed at Yahoo!’s massive scale
   – Benefits the larger ecosystem, increases the pace of innovation
   – http://developer.yahoo.com/hadoop
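
To make the "parallel execution environment" idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases that Hadoop distributes across many nodes. It is an illustration only, not Hadoop's API: the names `map_words`, `reduce_counts`, `run_job`, and `SAMPLE_LINES` are hypothetical, and word counting is simply the customary introductory example.

```python
from collections import defaultdict

# Hypothetical sample input; a real job would read blocks from the distributed file system.
SAMPLE_LINES = ["the cloud scales", "the cloud serves the web"]


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Run mapper and reducer in-process, grouping by key in between (the shuffle)."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Keys are processed in sorted order, mirroring the sorted input a reducer receives.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_job(SAMPLE_LINES, map_words, reduce_counts)
print(word_counts)
```

In a real cluster the mapper and reducer run on different machines and the grouping step moves data over the network; only the two small functions are application code, which is the "simple design" point above.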
              Yahoo! runs the largest
            Hadoop Clusters in the World
• 25,000+ nodes
   – Clusters of up to 4,000 nodes
• 4 Tiers of clusters
   – Development & Testing, POCs, Science & Research, Production
• Terasort Benchmarks
   – 62 seconds to sort One Terabyte (run on 1,500 nodes)
   – 16.25 hours to sort One Petabyte (run on 3,700 nodes)
• Webmap application
   – ~490 TB shuffling
   – ~280 TB output
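
The Terasort figures above can be turned into rough throughput numbers. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and an even spread of work across nodes; both are assumptions for illustration, and the function name `sort_throughput` is hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def sort_throughput(bytes_sorted, seconds, nodes):
    """Return (aggregate bytes/sec, bytes/sec per node) for one benchmark run."""
    aggregate = bytes_sorted / seconds
    return aggregate, aggregate / nodes


# Figures taken from the slide: 62 s on 1,500 nodes, 16.25 h on 3,700 nodes.
tb_aggregate, tb_per_node = sort_throughput(TB, 62, 1500)
pb_aggregate, pb_per_node = sort_throughput(PB, 16.25 * 3600, 3700)

for label, aggregate, per_node in [
    ("1 TB sort", tb_aggregate, tb_per_node),
    ("1 PB sort", pb_aggregate, pb_per_node),
]:
    print(f"{label}: {aggregate / 1e9:.1f} GB/s aggregate, {per_node / 1e6:.2f} MB/s per node")
```

Under these assumptions both runs sustain an aggregate rate in the same range (roughly 16 to 17 GB/s), while the per-node rate is lower for the petabyte run.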
            Case Study – Search Assist™

• Database for Search Assist™ is built using Hadoop
• 3 years of log data, 20 steps of MapReduce
• Leverage Hadoop’s scalability, load balancing and resiliency
• Simplified access, flexibility for rapid innovation (from C++ to Python)

                            Before Hadoop            After Hadoop
    Time                    26 days                  20 minutes
    Development Time        2-3 weeks                2-3 days
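
As a quick worked check of the "Time" row in the table above, the following sketch converts both figures to minutes and computes the speedup factor. It assumes 26 full 24-hour days; the function name `speedup` is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def speedup(before_minutes, after_minutes):
    """Ratio of the old duration to the new one."""
    return before_minutes / after_minutes


before = 26 * MINUTES_PER_DAY  # 26 days before Hadoop (assumes full 24-hour days)
after = 20                     # 20 minutes after Hadoop
factor = speedup(before, after)
print(f"Run time reduced by a factor of about {factor:.0f}")
```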

Functional Cloud Services
            Functional Cloud Services

• Provide functional capabilities for applications
   – Help developers build integrated web experiences
     faster and more easily
   – Provide a common set of functional “building blocks”

• “Powered by” the horizontal cloud services
   – Abstract infrastructure services from the application
        • E.g. storage, compute, serving, robustness and scalability
   – Self-served, global, managed, elastic and metered
                  Functional Cloud Services:
                  YQL & BOSS

• Yahoo! Query Language (YQL)
   – A single endpoint service that enables developers to query, filter
     and combine data across Yahoo! and beyond
   – http://developer.yahoo.com/yql/console/

• Build your Own Search Service (BOSS)
   – Providing Yahoo! Search infrastructure and technology to developers
     and companies to help them build their own search experiences
   – http://developer.yahoo.com/search/boss/
          Build your Own Search Service (BOSS)

• Yahoo!'s open search web services platform
• Serving hundreds of millions of users across the Web.
• Goal: foster innovation in the search industry
   – Build and launch web-scale search products that utilize the
     entire Yahoo! Search index.
   – Access to Yahoo!'s investments in crawling and indexing,
     ranking and relevancy algorithms
           Yahoo! Query Language (YQL)

• Single endpoint service to query, filter and combine data
  across Yahoo! and beyond
   – The “Internet API”
• SQL-like SELECT syntax for getting the right data
   – Quickly discover available data sources and structure
   – Combine data from a single web browser
• Easy-to-use Console
   – http://developer.yahoo.com/yql/console/
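
To illustrate the "SQL-like SELECT syntax" and the single-endpoint idea, here is a small sketch that builds a YQL request URL without sending it. The endpoint URL, the `q` and `format` parameters, and the `geo.places` table are assumptions about the public YQL service as it was commonly used, not details taken from these slides; the helper name `build_yql_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: the public YQL endpoint as commonly used at the time.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="json"):
    """Encode one SQL-like YQL statement into a GET URL for the single endpoint."""
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


# Assumption: geo.places is used here only as an example table name.
query = 'select * from geo.places where text="sunnyvale"'
url = build_yql_url(query)

# Round-trip check: decoding the URL recovers the original statement.
decoded = parse_qs(urlparse(url).query)
print(url)
print(decoded["q"][0])
```

The point of the sketch is that every data source is addressed through one URL and one query string, rather than through a separate API per service.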
Y!OS and Cloud
Yahoo! Open Strategy (Y!OS): Goals




Y!OS and Cloud Strategy




[Figure: diagram; only the label "CLOUD SERVICES" survives]
                 Open Collaborations
                  around the globe
• M45 - Yahoo!’s supercomputing cluster
  – 4,000 cores, 3 TB RAM, 1.5 PB disks, 27 teraflops!
  – Operational since November 2007, 4 major Universities
  – Focus on highly parallel computing

• Open Cirrus™ with HP & Intel
   – A global, multi-data center, open source test bed
   – Target to advance cloud computing research & education
   – Simulates a real-life, Internet-scale environment
   – 9 Global sites, more than 50 research projects
                               Questions?
