

Computing in the Clouds
   Aaron Weiss

Cloud Computing
   Brian Hayes
   Communications of the ACM
   July 2008

              Cloud Computing
• The next big thing
• But what is it?
  – Web-based applications (thin client)
  – Utility computing – grid that charges rates for
    processing time
  – Distributed or parallel computing designed to
    scale for efficiency
• Also called:
  – On-demand computing, software as a service,
    Internet as platform
                   Data Centers
• Decades ago – computing power in mainframes in
  computer rooms
• Personal computers changed that
• Now, network data centers with centralized
  computing are back in vogue
   – But: no longer a hub-and-spoke model
• Although Google is famous for innovating web
  search, Google’s data-center architecture is as
  much an innovation
   – Instead of a few expensive servers, use many cheap
     servers (~500K servers in ~12 locations)
• With thin, wide network
  – Derive more from scale of the whole than any one
    part – no hub
• Cloud – robust and self-healing
  – Uses too much power
  – Cheaper power solutions we’ve talked about
    earlier in class
  – Heavy utilization of virtualization
     • Single server runs multiple OS instances, minimizing CPU idle time
• CloudOS (VMWare and Cisco)
  – Instead of each server running its own copy of an OS
    (current Google model)
  – Should have a single OS that treats everything in the
    data center as another resource
     • Network channels to coordinate events
     • Cloud more cohesive entity
• Entire user interface resides in single window
  – Provide all facilities of OS inside a browser
• Program must continue running even as
  number of users grows
• Communication model is many-to-many
• To move applications to the cloud, must master
  multiple languages and operating environments
  – In many cloud applications, the back-end process
    relies on a relational DB, so part of the code is in SQL
     • Client side in JavaScript or embedded within HTML
     • Server application in between written in a scripting
       language
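The three-language split above can be sketched with a toy middle tier: Python stands in for the scripting layer, SQLite for the relational back end, and JSON for what the client-side JavaScript would consume. All names below are illustrative, not from the article.

```python
import sqlite3
import json

# Hypothetical middle-tier handler: SQL on the back end (SQLite stands
# in for the relational DB), JSON out to a JavaScript client.
def list_products(conn, max_price):
    cur = conn.execute(
        "SELECT name, price FROM products WHERE price <= ? ORDER BY price",
        (max_price,),
    )
    rows = [{"name": n, "price": p} for n, p in cur.fetchall()]
    return json.dumps(rows)  # the client-side JavaScript would parse this

# Demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("widget", 2.5), ("gadget", 9.0), ("gizmo", 20.0)])
print(list_products(conn, 10.0))
```

Each tier speaks a different language, which is exactly the multi-language burden the slide describes.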
         Distributed Computing
• Speed of cloud depends on delegation
  – Break up into subtasks
     • Retrieving results of search
     • DB query – parse results, construct result sets, format
       results, etc.
     • If tasks small enough, simultaneous
     • Dependencies? Complex
  – Distributed computing not new
     • SETI, Folding
     • Hadoop – Apache Foundation
        – No need for creating specialized custom software
        – Distributes petabyte-scale data projects across 1000s of nodes
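Hadoop itself is Java, but the delegation model it implements – split work into small independent subtasks, run them simultaneously, then combine – can be sketched in a few lines of Python (a toy word count in the map/reduce style, not Hadoop's actual API):

```python
from collections import defaultdict
from itertools import chain

# Toy map/reduce word count: each "map" task handles one chunk
# independently (so chunks could run simultaneously on many nodes),
# then a "reduce" step merges the partial results.
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

def reduce_pairs(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["the cloud the grid", "the cloud"]
mapped = [map_chunk(c) for c in chunks]        # independent subtasks
totals = reduce_pairs(chain.from_iterable(mapped))
print(totals)  # {'the': 3, 'cloud': 2, 'grid': 1}
```

The subtasks have no dependencies on one another, which is what makes the simultaneous execution on the previous slide possible.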
                    A Utility Grid
• In past, pay for cost of cycles used
• Today most organizations create own data centers
   – But cost to run
   – Use 99% of capacity only 10% of time
• In Web service, lots of hosting providers
   – Typically do not replicate distributed computing
• Amazon, Google, etc. should scale up data centers, create
  business models to support third party use
   – Amazon EC2 opened for fee-based public use 10/07
       •   Customers create virtual image of SW environment
       •   Create instance of machine in Amazon’s cloud
       •   Appears to user as dedicated server
       •   Customers choose configuration
       •   Customers can create/destroy at will
             – If surge in visitors, additional instances on demand
             – If slows down, terminate extra instances
       • Charges $0.10 per instance-hour based on compute units, regardless
         of underlying hardware
       • Data costs $0.10 to $0.18 per GB
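At the quoted $0.10 per instance-hour, the create/destroy-at-will model is easy to cost out. A minimal sketch of the arithmetic (real EC2 billing has more dimensions, such as data transfer):

```python
# Sketch of on-demand cost at the article's quoted rate:
# $0.10 per instance-hour, scaling instance count with load.
RATE_PER_INSTANCE_HOUR = 0.10

def daily_cost(instances_by_hour):
    """instances_by_hour: list of 24 instance counts, one per hour."""
    return sum(instances_by_hour) * RATE_PER_INSTANCE_HOUR

# 2 baseline instances, surging to 10 for a 4-hour visitor peak
hours = [2] * 20 + [10] * 4
print(f"${daily_cost(hours):.2f} per day")  # $8.00
```

Terminating the extra instances after the surge is what keeps the bill at 80 instance-hours instead of 240.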
• Google and IBM apply a similar cloud utility
  model to CS education
• Provide CS students access to distributed
  computing environment

• In the future, businesses will not need to invest in
  a data center
            Software as Service
• Move all processing power to the cloud and
  carry an ultralight input device
  – Already happening?
     • E-mail on Internet, then Web
     • Google Docs
     • Implications for Microsoft, whose model is software as
       a purchasable local application
        – Windows Live (Microsoft’s cloud)
        – Adobe’s Web-based Photoshop
• Cloud
   – Paradigm shift and disruptive force
      • Google and Apple will pair
           – Lightweight mobile device by Apple tapping into
             Google’s cloud
• But
   – Failed thin clients of the past
      • Larry Ellison in the 90s had trouble creating a
        cost-effective thin client
   – Difficult to produce a powerful thin client at a low
     enough price
   – Yet non-thin clients can fail too, and SW needs care
• Networks will need to be robust
  – In the U.S., broadband quality is poor
     • Broadband advances slow, bottleneck for clouds
  – Privacy ???
     • What if a 3rd party has your data and the government
       subpoenas them? Do you even know?
     • Can you lose access to your info if you don’t pay bill?
  – Vendor lock-in – need a certain client to access a
    cloud operator
     • Not open like the Internet today
                Partly Cloudy
• New name, same familiar computing models?
• New because integrates models of centralized
  computing, utility computing, distributed computing
  and software as service
• Power shifts from processing unit to network
• Processors commodities
• Network connects all
 Cloud computing leaving
relational databases behind
       Joab Jackson, 9/08
   Government Computer News
“One thing you won’t find underlying a cloud initiative
  is a relational database. And this is no accident:
  Relational databases are ill-suited for use within
  cloud computing environments”

Geir Magnusson, VP 10Gen, on-demand platform
  service provider
• DBs specifically designed to work in cloud
  – Google – BigTable
  – Amazon – SimpleDB
  – 10Gen – Mongo
  – AppJet – AppJetDB
  – Oracle open-source - Berkeley DB
  – MySQL for Web - Drizzle
     Characteristics of Cloud DBs
• Run in distributed environments
• None are transactional in nature
• Sacrifice advanced querying capability for
  faster performance
  – Queried using object calls instead of SQL
• Very large relational DBs like Oracle can be implemented
  in data centers, but:
   – DB material spread across different locations
   – Executing complex queries over vast locations can
     slow response time
   – Difficult to design and maintain an architecture to
     replicate data
• Instead: Data targeted in a clustered fashion
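The "object calls instead of SQL" style above can be illustrated with a hypothetical minimal store: put/get by key plus simple attribute filters, with no SQL, joins, or transactions. The API below is invented for illustration; BigTable, SimpleDB, and the others each define their own interfaces.

```python
# Hypothetical object-call interface in the spirit of cloud data
# services: put/get by key, no SQL, no joins, no transactions.
class SimpleStore:
    def __init__(self):
        self._items = {}

    def put(self, key, attributes):
        self._items[key] = dict(attributes)

    def get(self, key):
        return self._items.get(key)

    def query(self, **where):
        # Only attribute-equality filters -- the "sacrifice advanced
        # querying for faster performance" trade-off in action.
        return [k for k, v in self._items.items()
                if all(v.get(a) == val for a, val in where.items())]

store = SimpleStore()
store.put("user:1", {"name": "Ada", "plan": "free"})
store.put("user:2", {"name": "Alan", "plan": "paid"})
print(store.query(plan="paid"))  # ['user:2']
```

Every operation touches one key or scans one cluster of items, which is what makes this style easy to distribute and fast at scale.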
The Claremont Report on
   Database Research
      SIGMOD 2008
                 What is it?
• In May 2008, prominent DB researchers,
  architects, users, and pundits met in Berkeley,
  CA at the Claremont Resort
• Seventh meeting in 20 years
• Report based on discussion of new directions
  in DBs
    Turning point in DB Research
• New opportunities for technical advances,
  impact on society, etc.
1. Big Data
  – not only traditional enterprises, but also
    e-science, digital entertainment, natural language
    processing, social network analysis
  – Design new custom data management solutions
    from simpler components
2. Data analysis as profit center
  – Barriers between IT dept. and business units
  – Data is the business
  – Data capture, integration, etc. keys to efficiency
    and profit
  – BI vendors - $10B (only front-end)
  – Also need better analytics, sophisticated analysis
  – non-technical decision makers want data
3. Ubiquity of structured and unstructured data
  – Structured data – extracted from text, SW logs,
    sensors and deep web crawl
  – Semi-structured – blogs, Web 2.0 communities,
    instant messaging
  – Publish and curate structured data
  – Develop techniques to extract useful data, enable
    deeper explorations, connect datasets
4. Expanded developer demands
  – Adoption of relational DBMS and query languages
    has grown
    • MySQL, PostgreSQL, Ruby on Rails
    • Less interest in SQL, view DBMS as too much to learn
      relative to other open source components
  – Need new programming models for data
5. Architectural Shifts in computing
  – Computing substrates for DM are shifting
  – Macro: Rise of cloud computing
     • Democratizes access to parallel clusters
  – Micro: shift from increasing chip clock speed to
    increase number of cores, threads
     • Changes in memory hierarchy
     • Power consumption
  – New DM technologies
          Research Opportunities
• Impact of DB research has not evolved beyond
  traditional DBs
• Reformation
   – Reform data-centric ideas for new applications
• Synthesis
   – Data integration, information extraction, data privacy
• Some topics not mentioned, because still part of
  significant effort
   – Must continue with these efforts
   – Also must continue with
      • Uncertain data, data privacy and security, e-science, human-
        centric interactions, social networks, etc.
                  DB Engines
• Big market relational DBs well known
• Peak performance:
  – OLTP with lots of small, concurrent transactions
    (debit/credit workloads)
  – OLAP with few, read-mostly queries (large joins,
    aggregation)
• Bad for:
  – Text indexing, serving web pages, media delivery
• DB engine technology could be useful in
  sciences and Web 2.0 applications, but not in
  current bundled DB systems
• Petabytes of storage and 1000s of processors,
  but current DBs cannot scale
• Need schema evolution, versioning, etc
• Currently, many DB engine startup companies
1. Broaden range for multi-purpose DBs
2. Design special purpose DBs
• Topics in DB engine area:
  – Systems for clusters of many processors
  – Exploit remote RAM and Flash as persistent media
  – Query optimization and data layout as continuous,
    adaptive tasks
  – Compression and encryption of data integrated with
    data layout and optimization
  – Embrace non-relational DB models
  – Trade off consistency/availability for performance
  – Design power-aware DBMSs
• Declarative programming for emerging platforms
• Programmer productivity is important
  – Non-expert must be able to write robust code
  – Data Centric programming techniques
     • MapReduce – language and data parallelism
     • Declarative languages – Datalog
     • Enterprise application programming – Ruby on Rails, LINQ
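The Datalog idea in the list above – state rules and let the system iterate to a fixed point, rather than coding the traversal imperatively – can be sketched in Python with transitive closure over an edge relation (a toy fixed-point evaluation, not a real Datalog engine):

```python
# Declarative-style sketch: compute the transitive closure of an
# edge relation by iterating the Datalog rules
#   path(X, Y) :- edge(X, Y).
#   path(X, Z) :- path(X, Y), edge(Y, Z).
# until no new facts are derived (a fixed point).
def transitive_closure(edges):
    path = set(edges)
    while True:
        new = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if new <= path:
            return path
        path |= new

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
```

The programmer states *what* paths are, not *how* to find them – the data-parallel evaluation strategy is left to the system, which is the productivity argument the slide makes.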
• New challenges – programming across multiple machines
• Data independence valuable, no assumptions about where
  data stored
• XQuery for declarative programming?
• Also need language design, efficient compilers, optimize code
  across parallel processors and vertical distribution of tiers
• Need more expressive languages
• Attractive syntax, development tools, etc
• Data management – not only storage service, but
  programming paradigm
        Interplay of Structured and
            Unstructured Data
• Data behind forms – Deep Web
• Data items in HTML
• Data in Web 2.0 services (photo, video sites)

• Transition from traditional DBs to managing
  structured, semi-structured and unstructured data in
  enterprises and on the web
• Challenge of managing dataspaces
• On the web
  – Vertical search engines
  – Domain independent technology for crawling
• Within the enterprise
  – Discover relationships between structured and
    unstructured data
• Extract structure and meaning from un- and semi-
  structured data
• Information extraction technology – pull entities and
  relationships from unstructured text
• Need: apply and manage predictions from
  independent extractors
   – Algorithms to determine correctness of extraction
   – Join with IR and ML communities
• Better DB technology needed to manage such data
   – Discover implicit relationships, maintain context
     through storage and computation
• Query and derive insight from heterogeneous data
   – Answer keyword queries over heterogeneous data
   – Analysis to extract semantics
   – Cannot assume have semantic mappings or
     domain is known
• Develop algorithms to provide best-effort
  services on loosely integrated data
  – Pay as you go as semantic relationships discovered
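A best-effort keyword query over loosely integrated data can be sketched as matching the keyword anywhere in each record, with no shared schema or semantic mappings assumed (illustrative only, not a real system):

```python
# Best-effort keyword search over heterogeneous records: each record
# has its own fields; we match the keyword in any value, assuming no
# shared schema and no known semantic mappings between sources.
def keyword_query(records, keyword):
    kw = keyword.lower()
    return [r for r in records
            if any(kw in str(v).lower() for v in r.values())]

records = [
    {"title": "Cloud Computing", "venue": "CACM"},   # paper-like
    {"name": "cloud9.jpg", "tags": "sky, clouds"},   # photo-like
    {"post": "Databases in the enterprise"},          # blog-like
]
hits = keyword_query(records, "cloud")
print(len(hits))  # 2
```

As semantic relationships between fields are discovered, such a system could refine its answers incrementally – the "pay as you go" idea above.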
• Develop index structures to support querying
  hybrid data
• New notions of correctness and consistency
• Innovate on creating data collections
• Ad-hoc communities to collaborate
  – Schema will be dynamic
  – Consensus to guide users
  – Need visualization tools that make created data
    easy to use
     • Data created with such tools may be easier to
       extract info from
          Cloud Data Services
• Infrastructures providing software and
  computing facilities as a service
• Efficient for applications
  – Limit up-front capital expenses
  – Reduce cost of ownership over time
• Services hosted in a data center
  – Shared commodity hardware for computation and
    storage
   Cloud services available today
• Application services
• Storage services (Amazon S3)
• Compute services (Google App Engine,
  Amazon EC2)
• Data services (Amazon SimpleDB, SQL Server
  Data Services, Google’s Datastore)
• Cloud data services offer API more restricted
  than traditional DBs
  – Minimalist query languages, limited consistency
  – More predictable services
     • Would be difficult to provide a full-function SQL data service
  – Manageability important in cloud environments
     • Limited human intervention
     • High workloads
     • Variety of shared infrastructures
• No DBA or system admin
  – Tuning handled automatically by the platform
• Large variations in workloads
  – Economical to use more resources for short periods
  – Service tuning depends upon virtualization
     • HW virtual machines as programming interface (EC2)
     • Multi-tenant hosting: many independent schemas in a
       single managed DBMS
• Need for manageability
• Adaptive online techniques
• New architectures and APIs
  – Depart from SQL and transaction semantics when needed
• SQL DBs cannot scale to thousands of nodes
  – Different transactional implementation techniques
    or different storage semantics?
• Query processing and optimization
  – Cannot exhaustively search the plan space across 1000s of sites
• More work needed to understand scaling
• Data security and privacy
  – No longer the physical boundaries of machines
• New scenarios
  – Specialized services with pre-loaded data sets
    (stock prices, weather)
• Combine data from private and public clouds
• Reaching across clouds (scientific grids)
  – Federated cloud architectures
 Mobile applications and virtual worlds
• Manage massive amounts of diverse user-created
  data, synthesize it intelligently, and provide
  real-time responses
• Mobile space
   – Large user bases
   – Emergence of mobile search and social networks
      • Timely information to users depending on location,
        preferences, social circles, extraneous factors, and
        the context in which they operate
      • Synthesize user input and behavior to determine
        location and intent
• Virtual worlds – Second Life
  – Began as simulations for multiple users
     • Blur distinction with real-world
     • Co-space, for both virtual and physical worlds
        – Events in physical captured by sensors, materialized in virtual
        – Events in virtual can affect physical
     • Need to process heterogeneous data streams
     • Balance privacy against sharing personal real-time info
     • Virtual actors require large-scale parallel programs
        – Efficient storage, data processing, power sensitive
                  Moving Forward
• DB research community doubled in size in the last decade
• Increasing technical scope makes it difficult to keep
  track of the field
• Review load for papers growing
   – Quality of reviews decreasing over time
• Need more technical books, blogs, wikis
• Open source software development in DB
   – Competition: system components for cloud computing
   – Large-scale information extraction
