The Claremont Report on Database Research

					Computing in the Clouds

       Aaron Weiss
  Cloud Computing

      Brian Hayes
Communications of the ACM
       July 2008
             Cloud Computing
• The next big thing
• But what is it?
  – Web-based applications (thin client)
  – Utility computing – grid that charges rates for
    processing time
  – Distributed or parallel computing designed to
    scale for efficiency
• Also called:
  – On-demand computing, software as a service,
    Internet as platform
                   Data Centers
• Decades ago – computing power in mainframes in
  computer rooms
• Personal computers changed that
• Now, network data centers with centralized
  computing are back in vogue
   – But: no longer a hub-and-spoke
• Although Google is famous for innovating web
  search, its architecture is as much a
  revolution
   – Instead of a few expensive servers, use many cheap
     servers (about 500,000 servers in ~12 locations)
• With thin, wide network
  – Derive more from scale of the whole than any one
    part – no hub
• Cloud – robust and self-healing
  – Uses too much power
  – Cheaper power solutions we’ve talked about
    earlier in class
  – Heavy utilization of virtualization
     • A single server runs multiple OS instances to minimize
       CPU idle time
• CloudOS (VMware and Cisco)
  – Instead of each server running its own copy of the OS
    (current Google model)
  – Should have a single OS that treats everything in the
    data center as another resource
     • Network channels to coordinate events
     • Cloud more cohesive entity
• Entire user interface resides in single window
  – Provide all facilities of OS inside a browser
• Program must continue running even as
  number of users grows
• Communication model is many-to-many
• To move applications to cloud must master
  multiple languages and operating
  environments
  – In many cloud applications, back-end process
    relies on relational DB so part of code in SQL
     • Client side in JavaScript or embedded within HTML
       documents
     • Server application in between written in scripting
       language
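A minimal sketch of the three layers listed above, assuming an in-memory SQLite database stands in for the relational back end. The `documents` table and the `list_documents` handler are hypothetical names chosen for illustration; they do not come from the articles. The JSON string returned by the handler is what client-side JavaScript would consume.

```python
import json
import sqlite3


def make_database() -> sqlite3.Connection:
    """Back end: a relational store queried with SQL (in-memory for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, owner TEXT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (owner, title) VALUES (?, ?)",
        [("alice", "Budget"), ("bob", "Notes"), ("alice", "Roadmap")],
    )
    conn.commit()
    return conn


def list_documents(conn: sqlite3.Connection, owner: str) -> str:
    """Middle tier: run a parameterized SQL query and return JSON for the browser."""
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE owner = ? ORDER BY id", (owner,)
    ).fetchall()
    return json.dumps([{"id": row[0], "title": row[1]} for row in rows])


if __name__ == "__main__":
    connection = make_database()
    # The browser-side code would receive this JSON and render it.
    print(list_documents(connection, "alice"))
```

Even in this small sketch, two languages (SQL and the scripting language) meet in one function, which is the point the slide makes about mastering multiple environments.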
         Distributed Computing
• Speed of cloud depends on delegation
  – Break up into subtasks
     • Retrieving results of search
     • DB query – parse results, construct result sets, format
       results, etc.
     • If tasks are small and independent enough, run them simultaneously
     • Dependencies between tasks? Complex
  – Distributed computing not new
     • SETI, Folding
     • Hadoop – Apache Foundation
        – No need for creating specialized custom software
        – Distributes petabyte-scale data projects across 1000s of nodes
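The idea of breaking a job into independent subtasks can be sketched as follows. This is only an illustration under stated assumptions: threads in one process stand in for separate nodes, and the helper names (`split_into_chunks`, `count_matches`, `parallel_count`) are hypothetical. It is not how Hadoop or any search engine is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items: list, chunk_size: int) -> list:
    """Break the input into independent subtasks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def count_matches(chunk: list, term: str) -> int:
    """One subtask: count the lines in this chunk that contain the term."""
    return sum(1 for line in chunk if term in line)


def parallel_count(lines: list, term: str, chunk_size: int = 2, workers: int = 4) -> int:
    """Run the subtasks concurrently, then combine the partial results."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda chunk: count_matches(chunk, term), chunks)
        return sum(partial_counts)


if __name__ == "__main__":
    sample = ["cloud a", "b", "cloud cloud", "grid", "the cloud"]
    print(parallel_count(sample, "cloud"))
```

Because no chunk depends on another, the subtasks can run in any order; once subtasks depend on each other, scheduling becomes the hard part.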
                    A Utility Grid
• In past, pay for cost of cycles used
• Today most organizations create own data centers
   – But cost to run
   – Use 99% of capacity only 10% of time
• In Web service, lots of hosting providers
   – Typically do not replicate distributed computing
• Amazon, Google, etc. should scale up data centers, create
  business models to support third party use
   – Amazon EC2: fee-based public use since 10/07
       •   Customers create virtual image of SW environment
       •   Create instance of machine in Amazon’s cloud
       •   Appears to user as dedicated server
       •   Customers choose configuration
       •   Customers can create/destroy at will
             – If surge in visitors, additional instances on demand
             – If slows down, terminate extra instances
       • Charges $0.10 per instance hour based on compute units regardless of
         underlying hardware
       • Data cost $0.10 to $0.18 per Gig
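The pricing and the create/destroy-at-will model above can be worked through with a little arithmetic. The rates are the ones on the slide; the per-instance capacity and the traffic figures are hypothetical assumptions for illustration only. Amounts are kept in whole cents to avoid floating-point rounding.

```python
import math

INSTANCE_CENTS_PER_HOUR = 10   # $0.10 per instance hour (rate from the slide)
DATA_CENTS_PER_GB_LOW = 10     # $0.10 per GB (low end from the slide)
DATA_CENTS_PER_GB_HIGH = 18    # $0.18 per GB (high end from the slide)


def instances_needed(requests_per_second: int, capacity_per_instance: int) -> int:
    """Instances required for a load; capacity_per_instance is an assumed figure."""
    return max(1, math.ceil(requests_per_second / capacity_per_instance))


def estimate_cost_cents(instance_hours: int, data_gb: int) -> tuple:
    """Return the (low, high) bill in cents for compute plus data."""
    compute = instance_hours * INSTANCE_CENTS_PER_HOUR
    return (
        compute + data_gb * DATA_CENTS_PER_GB_LOW,
        compute + data_gb * DATA_CENTS_PER_GB_HIGH,
    )


if __name__ == "__main__":
    baseline = instances_needed(150, 100)    # normal traffic
    surge = instances_needed(450, 100)       # surge in visitors
    # Baseline runs all day; the extra instances run only for a 2-hour surge.
    hours = baseline * 24 + (surge - baseline) * 2
    low, high = estimate_cost_cents(hours, data_gb=50)
    print(f"{hours} instance hours: ${low / 100:.2f} to ${high / 100:.2f}")
```

Under these assumed numbers the extra instances are paid for only during the surge, which is the economic argument for renting capacity rather than owning a data center sized for peak load.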
• Google and IBM bring a similar cloud utility model to
  CS education
• Provide CS students access to distributed
  computing environment

• In future businesses will not need to invest in
  a data center
            Software as Service
• Move all processing power to the cloud and
  carry ultralight input device
  – Already happening?
     • E-mail on Internet, then Web
     • Google Docs
     • Implications for Microsoft, software as purchasable
       local application
        – Windows Live (Microsoft’s cloud)
        – Adobe: web-based Photoshop
• Cloud
   – Paradigm shift and disruptive force
      • Google and Apple will pair
           – Lightweight mobile device by Apple tapping into
             Google’s cloud
• But
   – Failed thin clients of past
      • Larry Ellison in the 90s had trouble creating cost-effective thin
         clients
   – Difficult to produce low-power thin client at low enough
     cost
   – Yet non-thin clients can fail too, SW needs care
• Networks will need to be robust
  – In U.S. broadband quality poor
     • Broadband advances slow, bottleneck for clouds
  – Privacy ???
     • What if a 3rd party has your data and the government
       subpoenas them? Do you even know?
     • Can you lose access to your info if you don’t pay bill?
  – Vendor lock-in – need certain client to access
    cloud operator
     • Not open like the Internet today
                Partly Cloudy
• New name, same familiar computing models?
• New because integrates models of centralized
  computing, utility computing, distributed computing
  and software as service
• Power shifts from processing unit to network
• Processors commodities
• Network connects all
 Cloud computing leaving
relational databases behind
       Joab Jackson, 9/08
   Government Computer News
“One thing you won’t find underlying a cloud initiative
  is a relational database.
And this is no accident: Relational databases are ill-suited
  for use within cloud computing environments”

Geir Magnusson, VP 10Gen, on-demand platform
  service provider
• DBs specifically designed to work in cloud
  computing
  – Google – BigTable
  – Amazon – SimpleDB
  – 10Gen – Mongo
  – AppJet – AppJetDB
  – Oracle open-source – Berkeley DB
  – MySQL for Web – Drizzle
     Characteristics of Cloud DBs
• Run in distributed environments
• None are transactional in nature
• Sacrifice advanced querying capability for
  faster performance
  – Queried using object calls instead of SQL
• Very large relational DBs like Oracle are implemented in data
  centers
   – DB material spread across different locations
   – Executing complex queries over vast locations can
     slow response time
   – Difficult to design and maintain an architecture to
     replicate data
• Instead: Data targeted in a clustered fashion
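The contrast between object calls and SQL can be illustrated with a toy store. This is a sketch under stated assumptions: `ShardedStore` and its `put`/`get` methods are hypothetical and are not the API of BigTable, SimpleDB, Mongo, or any product named above. Records are placed on a shard by hashing the key, one simple way to keep related lookups on a single node.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads records over several in-memory shards."""

    def __init__(self, shard_count: int = 3) -> None:
        # Each dict stands in for one node in the cluster.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> dict:
        # A stable hash sends the same key to the same shard every time.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key: str, record: dict) -> None:
        """Store a record with an object call instead of an SQL INSERT."""
        self._shard_for(key)[key] = record

    def get(self, key: str):
        """Fetch by key only; returns None when the key is absent."""
        return self._shard_for(key).get(key)


if __name__ == "__main__":
    store = ShardedStore()
    store.put("user:1", {"name": "Ada", "plan": "free"})
    print(store.get("user:1"))
```

A lookup touches exactly one shard, which keeps response time predictable. The price is the one the slide names: there are no joins, no ad hoc queries, and no multi-record transactions.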
The Claremont Report on
   Database Research
      SIGMOD 2008
                 What is it?
• In May 2008, prominent DB researchers,
  architects, users, and pundits met at the
  Claremont Resort in Berkeley, CA
• Seventh meeting in 20 years
• Report based on discussion of new directions
  in DBs
    Turning point in DB Research
• New opportunities for technical advances,
  impact on society, etc.
1. Big Data
  – not only traditional enterprises, but also e-
    science, digital entertainment, natural language
    processing, social network analysis
  – Design new custom data management
    solutions from simpler components
2. Data analysis as profit center
  – Barriers between IT dept. and business units
    dropping
  – Data is the business
  – Data capture, integration, etc. keys to efficiency
    and profit
  – BI vendors - $10B (only front-end)
  – Also need better analytics, sophisticated analysis
  – non-technical decision makers want data
3. Ubiquity of structured and unstructured data
  – Structured data – extracted from text, SW logs,
    sensors and deep web crawl
  – Semi-structured – blogs, Web 2.0 communities,
    instant messaging
  – Publish and curate structured data
  – Develop techniques to extract useful data, enable
    deeper explorations, connect datasets
4. Expanded developer demands
  – Adoption of relational DBMS and query languages
    has grown
    • MySQL, PostgreSQL, Ruby on Rails
    • Less interest in SQL, view DBMS as too much to learn
      relative to other open source components
  – Need new programming models for Data
    management
5. Architectural Shifts in computing
  – Computing substrates for DM are shifting
  – Macro: Rise of cloud computing
     • Democratizes access to parallel clusters
  – Micro: shift from increasing chip clock speed to
    increase number of cores, threads
     • Changes in memory hierarchy
     • Power consumption
  – New DM technologies
          Research Opportunities
• Impact of DB research has not evolved beyond
  traditional DBs
• Reformation
   – Reform data centric ideas for new applications and
     architectures
• Synthesis
   – Data integration, information extraction, data privacy
• Some topics not mentioned, because still part of
  significant effort
   – Must continue with these efforts
   – Also must continue with
      • Uncertain data, data privacy and security, e-science, human-
        centric interactions, social networks, etc.
                  DB Engines
• Big-market relational DBs have well-known
  limitations
• Peak performance:
  – OLTP with lots of small, concurrent transactions
    debit/credit workloads
  – OLAP with few read-mostly, large join and aggregation queries
• Bad for:
  – Text indexing, serving web pages, media delivery
• DB engine technology could be useful in
  sciences and Web 2.0 applications, but not in
  current bundled DB systems
• Petabytes of storage and 1000s processors,
  but current DB cannot scale
• Need schema evolution, versioning, etc
• Currently, many DB engine startup companies
1. Broaden range for multi-purpose DBs
2. Design special purpose DBs
• Topics in DB engine area:
  – Systems for clusters of many processors
  – Exploit remote RAM and Flash as persistent media
  – Query optimization and data layout as a continuous task
  – Compress and encrypt data integrated with data
    layout and optimization
  – Embrace non-relational DB models
  – Trade off consistency/availability for performance
  – Design power-aware DBMS
• Declarative programming for emerging
  platforms
• Programmer productivity is important
  – Non-expert must be able to write robust code
  – Data Centric programming techniques
     • MapReduce – language and data parallelism
     • Declarative languages – Datalog
     • Enterprise application programming – Ruby on Rails, LINQ
• New challenges – programming across multiple machines
• Data independence valuable, no assumptions about where
  data stored
• XQuery for declarative programming?
• Also need language design, efficient compilers, optimize code
  across parallel processors and vertical distribution of tiers
• Need more expressive languages
• Attractive syntax, development tools, etc
• Data management – not only storage service, but
  programming paradigm
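The MapReduce style mentioned above can be sketched in a few lines. This is a single-process illustration with hypothetical function names (`map_phase`, `shuffle`, `reduce_phase`), not the Hadoop or Google API. In a real system the map and reduce calls would run on many machines, and the framework would handle the shuffle.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs) -> dict:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: list) -> dict:
    """Wire the three phases together for a word-count job."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the cloud", "The grid the"]))
```

The programmer writes only `map_phase` and `reduce_phase` and makes no assumption about where the data lives, which is the data independence the slide calls valuable.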
        Interplay of Structured and
            Unstructured Data
• Data behind forms – Deep Web
• Data items in HTML
• Data in Web 2.0 services (photo, video sites)

• Transition from traditional DBs to managing
  structured, semi-structured and unstructured data in
  enterprises and on the web
• Challenge of managing dataspaces
• On the web
  – Vertical search engines
  – Domain independent technology for crawling
• Within the enterprise
  – Discover relationships between structured and
    unstructured data
• Extract structure and meaning from un- and semi-
  structured data
• Information extraction technology – pull entities and
  relationships from unstructured text
• Need: apply and manage predictions from
  independent extractors
   – Algorithms to determine correctness of extraction
   – Join with IR and ML communities
• Better DB technology needed to manage data in
  context
   – Discover implicit relationships, maintain context
     through storage and computation
• Query and derive insight from heterogeneous data
   – Answer keyword queries over heterogeneous data
     sources
   – Analysis to extract semantics
   – Cannot assume have semantic mappings or
     domain is known
• Develop algorithms to provide best-effort
  services on loosely integrated data
  – Pay as you go as semantic relationships discovered
• Develop index structures to support querying
  hybrid data
• New notions of correctness and consistency
• Innovate on creating data collections
• Ad-hoc communities to collaborate
  – Schema will be dynamic
  – Consensus to guide users
  – Need visualization tools to create data that are
    easy to use
     • Result of tools may be easier to extract info
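Information extraction, as described in this section, pulls entities and relationships out of unstructured text. A minimal sketch follows, assuming a single hand-written pattern for sentences of the form "<First Last> works at <Organization>". The pattern, the `works_at` label, and the sample sentences are illustrative assumptions; real extractors rely on far richer models and produce uncertain predictions that must be checked.

```python
import re

# Hypothetical pattern: a two-word capitalized name, the phrase "works at",
# then a capitalized organization name.
WORKS_AT = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works at (?P<org>[A-Z][A-Za-z]+)"
)


def extract_employment(text: str) -> list:
    """Return (person, relation, organization) triples found in the text."""
    return [
        (match.group("person"), "works_at", match.group("org"))
        for match in WORKS_AT.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "Ada Lovelace works at Acme. Nothing here. "
        "Alan Turing works at Bletchley today."
    )
    for triple in extract_employment(sample):
        print(triple)
```

The triples are structured rows that could be loaded into a table and joined with existing enterprise data.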
          Cloud Data Services
• Infrastructures providing software and
  computing facilities as a service
• Efficient for applications
  – Limit up-front capital expenses
  – reduce cost of ownership over time
• Services hosted in a data center
  – Shared commodity hardware for computation and
    storage
   Cloud services available today
• Application services (salesforce.com)
• Storage services (Amazon S3)
• Compute services (Google App Engine,
  Amazon EC2)
• Data services (Amazon SimpleDB, SQL Server
  Data Services, Google’s Datastore)
• Cloud data services offer API more restricted
  than traditional DBs
  – Minimalist query languages, limited consistency
  – More predictable services
     • Difficult if had to provide full-function SQL data service
  – Manageability important in cloud environments
     • Limited human intervention
     • High workloads
     • Variety of shared infrastructures
• No DBA or system admin
• Managed automatically by platform
• Large variations in workloads
  – Economical to use more resources for short
    bursts
  – Service tuning depends upon virtualization
     • HW virtual machines as programming interface (EC2)
     • Multi-tenant hosting many independent schemas in
       single managed DBMS (salesforce.com)
• Need for manageability
• Adaptive online techniques
• New architectures and APIs
  – Depart from SQL and transaction semantics where
    possible
• SQL DBs cannot scale to thousands of nodes
  – Different transactional implementation techniques
    or different storage semantics?
• Query processing and optimization
  – Cannot exhaustively search the plan space with 1000s of sites
• More work needed to understand scaling
  realities
• Data security and privacy
  – No longer physical boundaries of machines or
    networks
• New scenarios
  – Specialized services with pre-loaded data sets
    (stock prices, weather)
• Combine data from private and public
  domains
• Reaching across clouds (scientific grids)
  – Federated cloud architectures
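The point about query optimization over thousands of sites can be made concrete with a rough count. The sketch below rests on a deliberately simplified, assumed model: only left-deep join orders are counted (n! for n tables), and each of the n - 1 joins may be placed on any site. Real optimizers consider many more alternatives, so this understates the search space.

```python
import math


def left_deep_join_orders(table_count: int) -> int:
    """Number of left-deep join orders for table_count tables (n!)."""
    return math.factorial(table_count)


def plan_count(table_count: int, site_count: int) -> int:
    """Join orders times site placements, under the simplified model above."""
    joins = max(table_count - 1, 0)
    return left_deep_join_orders(table_count) * site_count ** joins


if __name__ == "__main__":
    for tables in (3, 5, 8):
        print(tables, "tables, 1000 sites:", plan_count(tables, 1000), "plans")
```

Even a five-table query has 120 x 1000^4 candidate plans in this model, so an optimizer has to prune or sample the space rather than enumerate it.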
 Mobile applications and virtual worlds
• Manage massive amounts of diverse user-created
  data, synthesize intelligently and provide real-time
  services
• Mobile space
   – Large user bases
   – Emergence of mobile search and social networks
      • Timely information to users depending on location,
        preferences, social circles, extraneous factors and the context
        in which they operate
      • Synthesize user input and behavior to determine
        location and intent
• Virtual worlds – Second Life
  – Began as simulations for multiple users
     • Blur distinction with real-world
     • Co-space, for both virtual and physical worlds
        – Events in physical captured by sensors, materialized in virtual
        – Events in virtual can affect physical
     • Need to process heterogeneous data streams
     • Balance privacy against sharing personal real-time info
     • Virtual actors require large-scale parallel programs
        – Efficient storage, data processing, power sensitive
                  Moving Forward
• DB research community doubled in size over the last decade
• Increasing technical scope makes it difficult to keep track of
  the field
• Review load for papers growing
   – Quality of reviews decreasing over time
• Need more technical books, blogs, wikis
• Open source software development in DB
   – Competition: system components for cloud computing
   – Large-scale information extraction