					                                                                     5/10/2012




                           A Berkeley View of Big Data

                                     Anthony D. Joseph
                                        UC Berkeley


                                   EDUSERV Symposium
                                      10 May 2012




                                    Who Am I?
                    • Research:
                       – Internet-scale systems (RAD Lab, AMP Lab)
                       – Security (DETERlab Testbed)
                       – Adversarial machine learning (SecML)

                    • Teaching (undergrad/grad): operating
                      systems and systems, security, networking

                    Disclaimer: I don’t speak for UC or our
                     research sponsors




AMPLab Overview -
franklin@cs.berkeley.edu                                                    1




                                 Big Data is Massive…
                  • Facebook:
                     – 130TB/day: user logs
                     – 200-400TB/day: 83 million pictures
                     – >40 Billion photos

                  • Google: > 25 PB/day processed data

                  • Data generated by LHC: 1 PB/sec

                  • Total data created in 2010: 1 ZettaByte
                    (1,000,000 PB)/year
                     – ~60% increase every year




                                      …and Diverse…
                     • Walmart
                           – >1 million customer
                             transactions/hr
                           – >2.5 PByte customer DB

                     • Human genome sequencing
                           – Analyzing 3 billion base pairs
                           – Ten years for first one (2003)
                           – Today, less than one week







                                     …and Novel…
                   • Analyzing data from user behavior vs user input

                   • USGS TED
                     – Twitter-based Earthquake Detector




                   • Google Trends: “nowcasting”
                         – http://www.google.org/flutrends/
                         – US 2009 “Cash for Clunkers” program success
                         – US State unemployment rates




                               …and Grows Bigger…
                  • More and more devices



                  • More and more people




                  • Cheaper and cheaper storage
                     – ~50% increase in GB/$ every year




                                     …and Bigger!
                  • Log everything!
                     – Don’t always know what question you’ll need
                       to answer


                  • Stored data growing faster than both available
                    storage and GB/$
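A rough worked example of this growth race, using only the rates quoted in this deck (~60% more data per year, ~50% more GB/$ per year). It assumes both rates stay constant and that everything created is kept; the function name is illustrative only.

```python
def relative_storage_spend(years, data_growth=0.60, gb_per_dollar_growth=0.50):
    """Spend needed to store one year's new data, relative to year 0.

    Data volume grows by (1 + data_growth) per year while each dollar buys
    (1 + gb_per_dollar_growth) times more capacity, so spend grows by their ratio.
    """
    yearly_ratio = (1 + data_growth) / (1 + gb_per_dollar_growth)
    return yearly_ratio ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(years, round(relative_storage_spend(years), 2))
```

Under these assumptions the ratio is above 1 every year, so spend keeps rising even though storage gets cheaper.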




                            Which Big Data to Keep?

                    • Hard to decide what to delete




                         – Thankless decision: people know only when you
                           are wrong!
                         – “Climate Research Unit (CRU) scientists admit
                           they threw away key data used in global warming
                           calculations”




                          Data Retention Requirements

                     • New NSF data retention requirements
                           – Proposals submitted after 18 January 2011
                             must include a “Data Management Plan”
                           – Have to keep all data (including metadata) for
                             3 years after research award conclusion
                           – Institutional/org considerations:
                             • Opportunity to invest in pooled storage: campus,
                               systemwide, regional, …
                             • Typical cost: 8TB chunks at $1.44/GB/year for
                               collaborative space and $0.17/GB/year for archive
                               space
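A small sketch of what these rates mean per 8TB chunk. It assumes decimal units (1 TB = 1,000 GB) and a flat rate with no discounts; the constant and function names are made up for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes

COLLABORATIVE_RATE = 1.44  # $/GB/year, from the slide
ARCHIVE_RATE = 0.17        # $/GB/year, from the slide


def yearly_cost(terabytes, rate_per_gb_year):
    """Cost in dollars of holding `terabytes` of data for one year."""
    return terabytes * GB_PER_TB * rate_per_gb_year


if __name__ == "__main__":
    chunk_tb = 8
    print("collaborative, 1 year:", round(yearly_cost(chunk_tb, COLLABORATIVE_RATE), 2))
    print("archive, 1 year:", round(yearly_cost(chunk_tb, ARCHIVE_RATE), 2))
    # Three years of archive space, matching the NSF retention period above.
    print("archive, 3 years:", round(3 * yearly_cost(chunk_tb, ARCHIVE_RATE), 2))
```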




                              Big Data Isn’t Always Big


                           Data that is expensive to manage,
                            and hard to extract value from


                  • You don’t need to be big to have a big data problem!
                     – Inadequate tools to analyze data
                     – Data management may dominate infrastructure cost






                              Big Data is not Cheap!
                  • Storing and managing 1PB data: $500K-$1M/year
                     – Facebook: 200 PB/year
                  • “Typical” cloud-based service startup (e.g., Conviva)
                     – Log storage dominates infrastructure cost
                  [Chart: share of infrastructure cost (0%-100%), storage cluster vs. other, 2007-2010; ~1PB storage capacity]




                     Hard to Extract Value from Data!
                  • Data is
                     – Diverse, variety of sources
                     – Uncurated, no schema, inconsistent semantics, syntax
                     – Integration a huge challenge

                  • No easy way to get answers that are
                     – High-quality
                     – Timely

                  • Challenge: maximize value from data by getting
                    best possible answers




                     Requires Multifaceted Approach
                     • Three dimensions to improve data
                       analysis
                           – Improving scale, efficiency, and quality of
                             algorithms (Algorithms)
                           – Scaling up datacenters (Machines)
                           – Leveraging human activity and intelligence
                             (People)


                     • Need to adaptively and flexibly combine all
                       three dimensions




                                  The State of the Art
                     • Today’s apps: fixed point in solution space
                     [Figure: solution space with three axes (Algorithms, Machines, People); Watson/IBM and search shown as fixed points]
                   Need techniques to dynamically pick best operating point







                      What Is the Big Data Problem?
                     • For two main reasons:
                           – the more data the greater chance to find any
                             pattern you’d like to find
                              • the more rows in a table, the more columns
                              • the more columns, the more hypotheses that can
                                be considered
                              • indeed, the number of hypotheses grows
                                exponentially in the number of columns
                           – the more data the less likely a sophisticated
                             ML algorithm will run in an acceptable time
                             frame
                              • and then we have to back off to cheaper
                                algorithms that may be more error-prone
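To make the exponential growth concrete, here is a minimal sketch. It assumes, purely for illustration, that a "hypothesis" is any non-empty subset of columns, and that every hypothesis is tested at a fixed significance level while no real pattern exists. Both function names are hypothetical.

```python
def count_hypotheses(num_columns):
    """Number of non-empty column subsets: 2^d - 1."""
    return 2 ** num_columns - 1


def expected_false_positives(num_columns, alpha=0.05):
    """Expected number of spurious 'patterns' if every subset is tested
    at level `alpha` and all null hypotheses are true."""
    return alpha * count_hypotheses(num_columns)


if __name__ == "__main__":
    for d in (3, 10, 20, 30):
        print(d, count_hypotheses(d), expected_false_positives(d))
```

Even at 10 columns this already gives over a thousand candidate hypotheses, which is why finding "any pattern you'd like" becomes easy.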




                       A Formulation of the Problem

                     • Given an inferential goal and a fixed
                       computational budget, provide a guarantee
                       (supported by an algorithm and an analysis) that
                       the quality of inference will increase
                       monotonically as data accrue (without bound)
                           – This is far from being achieved in the current state of
                             the literature!
                     • It can be achieved by building a scalable system
                       that blends statistical and computational design
                       principles








                                    Big Data in the US
                     • Many Fortune 1000+ companies with huge write-once,
                       read-none big data collections
                           – For all the reasons I’ve already outlined…

                     • US Government agencies in same situation
                           – New R&D funding

                     • Many companies developing proprietary solutions

                     • Very active open source big data tools community
                           – Broad international participation
                           – Data Without Borders helping non-profits through pro
                             bono data collection, analysis, and visualization




                            Significant USG Investment
                     • 29 March 2012
                           – US federal agencies announced more than
                             $200 million in new commitments
                           – Dept of Defense, Dept of Homeland Security,
                             Dept of Energy, Veterans Administration, Office
                             of Scientific and Technical Information, Health
                             and Human Services, Food and Drug Admin,
                             National Archives & Records Admin, National
                             Aeronautics & Space Admin, National Institutes
                             of Health, National Science Foundation,
                             National Security Agency, US Geological
                             Survey








                     Active Open Source Community

                     • On-going development of several elements
                       of Big Data analysis pipeline
                           •   Apache Hadoop (MapReduce)
                           •   Hive
                           •   Apache Pig
                           •   R / Octave
                     • Much more is needed!
                           • E.g., new analysis environments





                                        The AMP Lab
                        Make sense of data at scale by tightly
                    integrating algorithms, machines, and people
                     [Figure: solution space with three axes (Algorithms, Machines, People); Watson/IBM and search shown as points]








                               AMP Faculty and Sponsors
                     • Faculty
                           –   Alex Bayen (mobile sensing platforms)
                           –   Armando Fox (systems)
                           –   Michael Franklin (databases): Director
                           –   Michael Jordan (machine learning): Co-director
                           –   Anthony Joseph (security & privacy)
                           –   Randy Katz (systems)
                           –   David Patterson (systems)
                           –   Ion Stoica (systems): Co-director
                           –   Scott Shenker (networking)
                     • Sponsors:







                                             Algorithms
                     • State-of-the-art Machine Learning (ML)
                       algorithms do not scale
                           – Prohibitive to process all data points
                     [Figure: estimate vs. # of data points, approaching the true answer. How do you know when to stop?]








                                        Algorithms
                     • Given any problem, data and a budget
                           – Immediate results with continuous improvement
                           – Calibrate answer: provide error bars
                     [Figure: estimate vs. # of data points, approaching the true answer. Error bars on every answer!]




                                        Algorithms
                     • Given any problem, data and a time budget
                           – Immediate results with continuous improvement
                           – Calibrate answer: provide error bars
                     [Figure: estimate vs. # of data points (time), approaching the true answer. Stop when error smaller than a given threshold]
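One way to picture "immediate results, error bars, stop at a threshold" is a running mean with a confidence interval. This is a generic sequential-sampling sketch, not the AMP Lab's algorithm: it assumes the quantity of interest is a mean, uses a normal-approximation interval (z = 1.96), and the function name, `min_n`, and the synthetic data are all illustrative.

```python
import math
import random


def estimate_until_tight(samples, threshold, z=1.96, min_n=30):
    """Consume samples one at a time and stop as soon as the
    confidence-interval half-width drops below `threshold`.

    Returns (estimate, half_width, samples_used).
    """
    n, mean, m2 = 0, 0.0, 0.0
    half_width = float("inf")
    for x in samples:
        # Welford's online update of the mean and sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_n:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break
    return mean, half_width, n


rng = random.Random(0)
stream = (rng.gauss(10.0, 2.0) for _ in range(1_000_000))
estimate, error_bar, used = estimate_until_tight(stream, threshold=0.1)

if __name__ == "__main__":
    print(f"estimate={estimate:.3f} +/- {error_bar:.3f} after {used} of 1,000,000 points")
```

The point of the sketch is that only a small fraction of the stream is read before the error bar is tight enough, which is the behavior the slide asks for.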








                                              Algorithms
                     • Given any problem, data and a time budget
                               – Automatically pick the best algorithm
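                     [Figure: estimate vs. time for a simple and a sophisticated algorithm, both approaching the true answer; regions along the time axis labeled "error too high", "pick sophisticated", "pick simple"]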
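A toy sketch of picking an algorithm under a time budget. It assumes each candidate comes with a model that predicts its error for a given budget; the two error curves below are invented solely so that the three regions from the figure appear in the same order, and nothing here reflects how the AMP Lab actually makes this choice.

```python
from typing import Callable, Dict, Optional


def pick_algorithm(error_models: Dict[str, Callable[[float], float]],
                   time_budget: float,
                   max_error: float) -> Optional[str]:
    """Return the candidate with the lowest predicted error at `time_budget`,
    or None if every candidate's predicted error exceeds `max_error`."""
    best_name, best_error = None, max_error
    for name, predict_error in error_models.items():
        err = predict_error(time_budget)
        if err <= best_error:
            best_name, best_error = name, err
    return best_name


# Hypothetical error-vs-time curves (illustrative assumptions only).
models = {
    "simple": lambda t: 8.0 / t,
    "sophisticated": lambda t: 0.2 + 1.0 / t,
}

if __name__ == "__main__":
    for budget in (1.0, 2.0, 100.0):
        print(budget, pick_algorithm(models, budget, max_error=1.0))
```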




                                               Machines
                     • “The datacenter as a computer” still in its
                       infancy
                               – Special purpose clusters, e.g., Hadoop cluster
                               – Highly variable performance
                               – Hard to program
                               – Hard to debug



                     [Figure: datacenter =? computer]








                                                                   Machines
                      • Make datacenter a real computer!


                  • Share datacenter between multiple cluster computing apps
                  • Provide new abstractions and services
                  [Figure: AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]




                                                                   Machines
                      • Make datacenter a real computer!


                      • Support existing cluster computing apps
                  [Figure: Hadoop, MPI, Hive, Hypertable, Cassandra, … on top of the AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]








                                                                   Machines
                      • Make datacenter a real computer!
                      • Spark: support interactive and iterative data analysis (e.g., ML algorithms)…
                      • PIQL: predictive & insightful query language
                      • SCADS: consistency adjustable data store
                  [Figure: Hadoop, MPI, Hive, Hypertable, Cassandra, Spark, PIQL, SCADS, … on top of the AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]




                                                                   Machines
                      • Make datacenter a real computer!

                      • Applications, tools
                         – Advanced ML algorithms
                         – Interactive data mining
                         – Collaborative visualization
                  [Figure: applications and tools above Hadoop, MPI, Hive, Hypertable, Cassandra, Spark, PIQL, SCADS, … on top of the AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]








                                               People
                     • Humans can make sense of messy data!








                                               People
                  • Make people an integrated part of the system!
                     – Leverage human activity
                     – Leverage human intelligence (crowdsourcing):
                           • Curate and clean dirty data
                           • Answer imprecise questions
                           • Test and improve algorithms
                  [Figure: loop between people and Machines + Algorithms, labeled "data, activity", "Questions", "Answers"]
                  • Challenge
                     – Inconsistent answer quality in all dimensions
                       (e.g., type of question, time, cost)
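One common way to cope with inconsistent crowd answers is to ask several workers the same question and keep the answer only when enough of them agree. The sketch below shows plain majority voting as an illustration; it is a swapped-in textbook technique, not something the slides attribute to the AMP Lab, and the function name and `min_agreement` parameter are assumptions.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.5):
    """Majority vote over redundant crowd answers.

    Returns (answer, agreement). `answer` is None when no answer is given by
    more than `min_agreement` of the workers.
    """
    if not answers:
        return None, 0.0
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement <= min_agreement:
        return None, agreement
    return top_answer, agreement


if __name__ == "__main__":
    print(aggregate_answers(["cat", "cat", "dog"]))
    print(aggregate_answers(["cat", "dog"]))
```

Asking more workers raises cost and latency, which is the quality/time/cost trade-off the challenge bullet points at.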








                                  Real Applications
                  • Mobile Millennium Project
                     – Alex Bayen, Civil and Environmental
                       Engineering, UC Berkeley
                  • Microsimulation of urban
                    development
                     – Paul Waddell, College of
                       Environmental Design, UC Berkeley
                  • Crowd-based opinion formation
                     – Ken Goldberg, Industrial
                       Engineering and Operations
                       Research, UC Berkeley
                  • Personalized Sequencing
                     – Taylor Sittler, UCSF




                           Personalized Sequencing








                                       The AMP Lab
                        Make sense of data at scale by tightly
                    integrating algorithms, machines, and people
                     [Figure: solution space with three axes (Algorithms, Machines, People); Mobile Millennium, Microsimulation, and Sequencing shown as points]




                                    Big Data in 2020
                                       Are you prepared?
                   • To create a new generation of big data scientists
                   • For ML to become an engineering discipline
                   • For people to be deeply integrated in big data
                     analysis pipeline
                   • Will your institution
                       – offer a big data curriculum touching all fields?
                       – have hired cross-disciplinary faculty?
                       – have invested in (pooled) storage infrastructure?
                       – have invested in public/private clouds?
                       – have built inter/intra campus networks?








                                          Summary
                     • Goal: Tame Big Data Problem
                           – Get results with right quality at the right time
                     • Approach: Holistic integration of
                       Algorithms, Machines, and People
                     • Huge research issues across many
                       domains



