The Claremont Report on Database Research

      SIGMOD 2008
                 What is it?
• In May 2008, prominent DB researchers,
  architects, users, and pundits met at the
  Claremont Resort in Berkeley, CA
• Seventh meeting in 20 years
• Report based on discussion of new directions
  in DBs
    Turning point in DB Research
• New opportunities for technical advances,
  impact on society, etc.
1. Big Data
  – not only traditional enterprises, but also e-
    science, digital entertainment, natural language
    processing, social network analysis
  – Design new custom data management
    solutions from simpler components
2. Data analysis as profit center
  – Barriers between IT dept. and business units
    dropping
  – Data is the business
  – Data capture, integration, etc. keys to efficiency
    and profit
  – BI vendors: about $10B in annual revenue (front end only)
  – Also need better analytics, sophisticated analysis
  – non-technical decision makers want data
3. Ubiquity of structured and unstructured data
  – Structured data – extracted from text, SW logs,
    sensors and deep web crawl
  – Semi-structured – blogs, Web 2.0 communities,
    instant messaging
  – Publish and curate structured data
  – Develop techniques to extract useful data, enable
    deeper explorations, connect datasets
4. Expanded developer demands
  – Adoption of relational DBMS and query languages
    has grown
    • MySQL, PostgreSQL, Ruby on Rails
    • Some developers are reluctant to use SQL and view DBMSs as too much to learn
      relative to other open source components
  – Need new programming models for Data
    management
5. Architectural Shifts in computing
  – Computing substrates for DM are shifting
  – Macro: Rise of cloud computing
     • Democratizes access to parallel clusters
  – Micro: shift from increasing chip clock speed to
    increasing the number of cores and threads
     • Changes in memory hierarchy
     • Power consumption
  – New DM technologies
          Research Opportunities
• Impact of DB research has not evolved beyond
  traditional DBs
• Reformation
   – Reform data centric ideas for new applications and
     architectures
• Synthesis
   – Data integration, information extraction, data privacy
• Some topics not mentioned, because still part of
  significant effort
   – Must continue with these efforts
   – Also must continue with
      • Uncertain data, data privacy and security, e-science, human-
        centric interactions, social networks, etc.
                  DB Engines
• Big-market relational DBs have well-known
  limitations
• Peak performance:
  – OLTP with lots of small, concurrent transactions
    debit/credit workloads
  – OLAP with few read-mostly, large join and aggregation workloads
• Bad for:
  – Text indexing, serving web pages, media delivery
• DB engine technology could be useful in
  sciences and Web 2.0 applications, but not in
  current bundled DB systems
• Petabytes of storage and 1000s of processors,
  but current DBs cannot scale
• Need schema evolution, versioning, etc
• Currently, many DB engine startup companies
1. Broaden range for multi-purpose DBs
2. Design special purpose DBs
• Topics in DB engine area:
  – Systems for clusters of many processors
  – Exploit remote RAM and Flash as persistent media
  – Treat query optimization and data layout as one
    continuous, adaptive task
  – Compress and encrypt data integrated with data
    layout and optimization
  – Embrace non-relational DB models
  – Trade off consistency/availability for performance
  – Design power-aware DBMSs
     Declarative Programming for Emerging Platforms
• Programmer productivity is important
  – Non-expert must be able to write robust code
  – Data Centric programming techniques
     • MapReduce – language and data parallelism
     • Declarative languages – Datalog
     • Enterprise application programming – Ruby on Rails, LINQ
• New challenges – programming across multiple machines
• Data independence valuable, no assumptions about where
  data stored
• XQuery for declarative programming?
• Also need language design, efficient compilers, optimize code
  across parallel processors and vertical distribution of tiers
• Need more expressive languages
• Attractive syntax, development tools, etc
• Data management – not only storage service, but
  programming paradigm
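
The MapReduce bullet above names a data-centric programming model. The following is a minimal single-process sketch of that model, not the API of any real framework; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit (word, 1) pairs for one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map call is independent of the others, which is what lets a
    # real system spread them over many machines. Here they run in sequence.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big clusters", "data is the business"])
print(counts)
```

The programmer writes only the map and reduce functions; grouping and distribution are left to the platform, which is the sense in which the model hides where data is stored.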
        Interplay of Structured and
            Unstructured Data
• Data behind forms – Deep Web
• Data items in HTML
• Data in Web 2.0 services (photo, video sites)

• Transition from traditional DBs to managing
  structured, semi-structured and unstructured data in
  enterprises and on the web
• Challenge of managing dataspaces
• On the web
  – Vertical search engines
  – Domain independent technology for crawling
• Within the enterprise
  – Discover relationships between structured and
    unstructured data
• Extract structure and meaning from un- and semi-
  structured data
• Information extraction technology – pull entities and
  relationships from unstructured text
• Need: apply and manage predictions from
  independent extractors
   – Algorithms to determine correctness of extraction
   – Join with IR and ML communities
• Better DB technology needed to manage data in
  context
   – Discover implicit relationships, maintain context
     through storage and computation
• Query and derive insight from heterogeneous data
   – Answer keyword queries over heterogeneous data
     sources
   – Analysis to extract semantics
   – Cannot assume semantic mappings exist or that the
     domain is known
• Develop algorithms to provide best-effort
  services on loosely integrated data
  – Pay as you go as semantic relationships discovered
• Develop index structures to support querying
  hybrid data
• New notions of correctness and consistency
• Innovate on creating data collections
• Ad-hoc communities to collaborate
  – Schema will be dynamic
  – Consensus to guide users
  – Need easy-to-use visualization tools for
    creating data
     • Data created with such tools may be easier to extract info from
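
The bullets on keyword queries over heterogeneous sources and on index structures for hybrid data can be illustrated with a small sketch. It assumes a single inverted index over both structured records (dicts) and free text, with a best-effort ranking by number of matched keywords. All names and sample items are hypothetical, and no semantic mapping between sources is assumed.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase alphanumeric tokens from any string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def flatten(item):
    """Turn a structured record (dict) or free text (str) into one string."""
    if isinstance(item, dict):
        return " ".join(f"{k} {v}" for k, v in item.items())
    return str(item)


def build_index(items):
    """Map each token to the set of item ids that contain it."""
    index = defaultdict(set)
    for item_id, item in items.items():
        for tok in tokens(flatten(item)):
            index[tok].add(item_id)
    return index


def keyword_search(index, query):
    """Best-effort ranking: items matching more query keywords come first."""
    scores = defaultdict(int)
    for tok in set(tokens(query)):
        for item_id in index.get(tok, ()):
            scores[item_id] += 1
    return sorted(scores, key=lambda i: (-scores[i], i))


items = {
    "row:1": {"name": "Claremont Resort", "city": "Berkeley"},
    "doc:7": "Researchers met in Berkeley to discuss database research.",
    "doc:9": "Cloud services host data in shared data centers.",
}
index = build_index(items)
results = keyword_search(index, "Berkeley database")
print(results)
```

In a pay-as-you-go setting, discovered semantic relationships could later be used to refine the ranking without changing how items are first ingested.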
          Cloud Data Services
• Infrastructures providing software and
  computing facilities as a service
• Efficient for applications
  – Limit up-front capital expenses
  – reduce cost of ownership over time
• Services hosted in a data center
  – Shared commodity hardware for computation and
    storage
   Cloud services available today
• Application services (salesforce.com)
• Storage services (Amazon S3)
• Compute services (Google App Engine,
  Amazon EC2)
• Data services (Amazon SimpleDB, SQL Server
  Data Services, Google’s Datastore)
• Cloud data services offer API more restricted
  than traditional DBs
  – Minimalist query languages, limited consistency
  – More predictable services
     • Difficult to achieve with a full-function SQL data service
  – Manageability important in cloud environments
     • Limited human intervention
     • High workloads
     • Variety of shared infrastructures
• No DBA or system admin
• Management must be done automatically by the platform
• Large variations in workloads
  – Economical to use more resources for short
    bursts
  – Service tuning depends upon virtualization
     • HW virtual machines as programming interface (EC2)
     • Multi-tenant hosting many independent schemas in
       single managed DBMS (salesforce.com)
• Need for manageability
• Adaptive online techniques
• New architectures and APIs
  – Depart from SQL and transaction semantics where
    possible
• SQL DBs cannot scale to thousands of nodes
  – Different transactional implementation techniques
    or different storage semantics?
• Query processing and optimization
  – Cannot exhaustively search the plan space across 1000s of sites
• More work needed to understand scaling
  realities
• Data security and privacy
  – No longer physical boundaries of machines or
    networks
• New scenarios
  – Specialized services with pre-loaded data sets
    (stock prices, weather)
• Combine data from private and public
  domains
• Reaching across clouds (scientific grids)
  – Federated cloud architectures
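
The query optimization point in this section says that plans cannot be enumerated exhaustively at large scale. A rough way to see why is to count join orders, used here as an assumed stand-in for the size of the plan space; the slides do not give a formula. The sketch uses the standard counts of n! left-deep orders and (2n-2)!/(n-1)! ordered bushy trees for n relations.

```python
import math


def left_deep_orders(n):
    """Number of left-deep join orders for n relations: n!."""
    return math.factorial(n)


def bushy_trees(n):
    """Number of ordered bushy join trees for n relations: (2n-2)! / (n-1)!."""
    return math.factorial(2 * n - 2) // math.factorial(n - 1)


# Print how quickly both counts grow with the number of relations.
for n in (3, 5, 10, 20):
    print(n, left_deep_orders(n), bushy_trees(n))
```

Choosing among many candidate sites for each operator multiplies these counts further, which is why heuristic or adaptive search is attractive in this setting.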
 Mobile applications and virtual worlds
• Manage massive amounts of diverse user-created
  data, synthesize intelligently and provide real-time
  services
• Mobile space
   – Large user bases
   – Emergence of mobile search and social networks
      • Timely information to users depending on location,
        preferences, social circles, extraneous factors and the
        context in which they operate
      • Synthesize user input and behavior to determine
        location and intent
• Virtual worlds – Second Life
  – Began as simulations for multiple users
     • Blur distinction with real-world
     • Co-space, for both virtual and physical worlds
        – Events in physical captured by sensors, materialized in virtual
        – Events in virtual can affect physical
     • Need to process heterogeneous data streams
     • Balance privacy against sharing personal real-time info
     • Virtual actors require large-scale parallel programs
        – Efficient storage, data processing, power sensitive
                  Moving Forward
• DB research community has doubled in size over the last decade
• Increasing technical scope makes it difficult to keep track of the
  field
• Review load for papers growing
   – Quality of reviews decreasing over time
• Need more technical books, blogs, wikis
• Open source software development in DB
   – Competition: system components for cloud computing
   – Large-scale information extraction

						