The Claremont Report on Database Research
SIGMOD 2008
What is it?
• In May 2008, prominent DB researchers,
architects, users, and pundits met at the
Claremont Resort in Berkeley, CA
• Seventh meeting in 20 years
• Report based on discussion of new directions
in DBs
Turning point in DB Research
• New opportunities for technical advances,
impact on society, etc.
1. Big Data
– not only traditional enterprises, but also e-
science, digital entertainment, natural language
processing, social network analysis
– Design new custom data management
solutions from simpler components
2. Data analysis as profit center
– Barriers between IT dept. and business units
dropping
– Data is the business
– Data capture, integration, etc. keys to efficiency
and profit
– BI vendors - $10B (only front-end)
– Also need better analytics, sophisticated analysis
– non-technical decision makers want data
3. Ubiquity of structured and unstructured data
– Structured data – extracted from text, SW logs,
sensors and deep web crawl
– Semi-structured – blogs, Web 2.0 communities,
instant messaging
– Publish and curate structured data
– Develop techniques to extract useful data, enable
deeper explorations, connect datasets
4. Expanded developer demands
– Adoption of relational DBMS and query languages
has grown
• MySQL, PostgreSQL, Ruby on Rails
• Less interest in SQL, view DBMS as too much to learn
relative to other open source components
– Need new programming models for Data
management
5. Architectural Shifts in computing
– Computing substrates for DM are shifting
– Macro: Rise of cloud computing
• Democratizes access to parallel clusters
– Micro: shift from increasing chip clock speed to
increasing the number of cores and threads
• Changes in memory hierarchy
• Power consumption
– New DM technologies
Research Opportunities
• Impact of DB research has not evolved beyond
traditional DBs
• Reformation
– Reform data centric ideas for new applications and
architectures
• Synthesis
– Data integration, information extraction, data privacy
• Some topics not mentioned, because still part of
significant effort
– Must continue with these efforts
– Also must continue with
• Uncertain data, data privacy and security, e-science, human-
centric interactions, social networks, etc.
DB Engines
• Big-market relational DBs have well-known
limitations
• Peak performance:
– OLTP with lots of small, concurrent transactions
(debit/credit workloads)
– OLAP with a few read-mostly, large join and aggregation queries
• Bad for:
– Text indexing, serving web pages, media delivery
• DB engine technology could be useful in
sciences and Web 2.0 applications, but not in
current bundled DB systems
• Petabytes of storage and 1000s of processors,
but current DBs cannot scale to them
• Need schema evolution, versioning, etc
• Currently, many DB engine startup companies
1. Broaden range for multi-purpose DBs
2. Design special purpose DBs
• Topics in DB engine area:
– Systems for clusters of many processors
– Exploit remote RAM and Flash as persistent media
– Treat query optimization and data layout as a continuous task
– Compress and encrypt data integrated with data
layout and optimization
– Embrace non-relational DB models
– Trade off consistency/availability for performance
– Design power-aware DBMSs
• Declarative programming for emerging
platforms
• Programmer productivity is important
– Non-expert must be able to write robust code
– Data Centric programming techniques
• MapReduce – language and data parallelism
• Declarative languages – Datalog
• Enterprise application programming – Ruby on Rails, LINQ
• New challenges – programming across multiple machines
• Data independence valuable, no assumptions about where
data stored
• XQuery for declarative programming?
• Also need language design, efficient compilers, optimize code
across parallel processors and vertical distribution of tiers
• Need more expressive languages
• Attractive syntax, development tools, etc
• Data management – not only storage service, but
programming paradigm
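As an illustration of the data-centric style mentioned above, here is a minimal single-process sketch of the MapReduce pattern (word count). It is not taken from the report: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in doc.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share one key."""
    return key, sum(values)


def word_count(docs):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(d) for d in docs)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big clusters", "data is the business"])
```

The programmer writes only `map_phase` and `reduce_phase`; grouping and distribution are left to the platform, which is the sense in which the model hides where the data is stored.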
Interplay of Structured and
Unstructured Data
• Data behind forms – Deep Web
• Data items in HTML
• Data in Web 2.0 services (photo, video sites)
• Transition from traditional DBs to managing
structured, semi-structured and unstructured data in
enterprises and on the web
• Challenge of managing dataspaces
• On the web
– Vertical search engines
– Domain independent technology for crawling
• Within the enterprise
– Discover relationships between structured and
unstructured data
• Extract structure and meaning from un- and semi-
structured data
• Information extraction technology – pull entities and
relationships from unstructured text
• Need: apply and manage predictions from
independent extractors
– Algorithms to determine correctness of extraction
– Join with IR and ML communities
• Better DB technology needed to manage data in
context
– Discover implicit relationships, maintain context
through storage and computation
• Query and derive insight from heterogeneous data
– Answer keyword queries over heterogeneous data
sources
– Analysis to extract semantics
– Cannot assume that semantic mappings exist or that
the domain is known
• Develop algorithms to provide best-effort
services on loosely integrated data
– Pay as you go as semantic relationships discovered
• Develop index structures to support querying
hybrid data
• New notions of correctness and consistency
• Innovate on creating data collections
• Ad-hoc communities to collaborate
– Schema will be dynamic
– Consensus to guide users
– Need visualization tools to create data that are
easy to use
• Data created with such tools may be easier to extract information from
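The points above about keyword queries over heterogeneous sources and index structures for hybrid data can be sketched with one inverted index that covers both relational-style rows and free text. This is only an illustrative assumption of how such an index might look; the names `build_index` and `keyword_query` and the sample data are hypothetical, not from the report.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase a value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def build_index(rows, docs):
    """Index structured rows (dicts of column values) and text documents together."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for column, value in row.items():
            for tok in tokens(value):
                index[tok].add(("row", row_id, column))
    for doc_id, text in docs.items():
        for tok in tokens(text):
            index[tok].add(("doc", doc_id, None))
    return index


def keyword_query(index, query):
    """Return (kind, id) pairs for items that contain every query keyword."""
    hits = None
    for tok in tokens(query):
        items = {(kind, item_id) for kind, item_id, _ in index.get(tok, set())}
        hits = items if hits is None else hits & items
    return hits or set()


rows = {1: {"name": "Claremont Resort", "city": "Berkeley"}}
docs = {"d1": "The meeting was held in Berkeley in May 2008"}
index = build_index(rows, docs)
both = keyword_query(index, "berkeley")
only_doc = keyword_query(index, "berkeley meeting")
```

A single keyword matches the row and the document alike, with no schema mapping between them, which is the best-effort, pay-as-you-go behavior the slide describes.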
Cloud Data Services
• Infrastructures providing software and
computing facilities as a service
• Efficient for applications
– Limit up-front capital expenses
– reduce cost of ownership over time
• Services hosted in a data center
– Shared commodity hardware for computation and
storage
Cloud services available today
• Application services (salesforce.com)
• Storage services (Amazon S3)
• Compute services (Google App Engine,
Amazon EC2)
• Data services (Amazon SimpleDB, SQL Server
Data Services, Google’s Datastore)
• Cloud data services offer API more restricted
than traditional DBs
– Minimalist query languages, limited consistency
– More predictable services
• Predictability would be difficult with a full-function SQL data service
– Manageability important in cloud environments
• Limited human intervention
• High workloads
• Variety of shared infrastructures
• No DBA or system admin
• Managed automatically by the platform
• Large variations in workloads
– Economical to use more resources for short
bursts
– Service tuning depends upon virtualization
• HW virtual machines as programming interface (EC2)
• Multi-tenant hosting many independent schemas in
single managed DBMS (salesforce.com)
• Need for manageability
• Adaptive online techniques
• New architectures and APIs
– Depart from SQL and transactional semantics where
possible
• SQL DBs cannot scale to thousands of nodes
– Different transactional implementation techniques
or different storage semantics?
• Query processing and optimization
– Cannot exhaustively search the plan space with 1000s of sites
• More work needed to understand scaling
realities
• Data security and privacy
– No longer physical boundaries of machines or
networks
• New scenarios
– Specialized services with pre-loaded data sets
(stock prices, weather)
• Combine data from private and public
domains
• Reaching across clouds (scientific grids)
– Federated cloud architectures
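To make the earlier query-optimization point concrete, the back-of-the-envelope sketch below counts how fast a plan space grows. It rests on simplifying assumptions that are not from the report: each operator can be placed on any site independently, and only left-deep join orders are counted. The function names are hypothetical.

```python
import math


def placement_count(num_operators, num_sites):
    """Ways to assign each operator to one site, assuming independent placement."""
    return num_sites ** num_operators


def left_deep_join_orders(num_tables):
    """Number of left-deep join orders for a query over num_tables tables."""
    return math.factorial(num_tables)


# A 5-operator plan over 1000 sites already has 10**15 possible placements.
placements = placement_count(5, 1000)
# Join ordering multiplies that further: 10 tables give 3,628,800 orders.
orders = left_deep_join_orders(10)
```

Numbers of this size are why an optimizer for thousands of sites would need heuristics or adaptive techniques rather than exhaustive enumeration.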
Mobile applications and virtual worlds
• Manage massive amounts of diverse user-created
data, synthesize intelligently and provide real-time
services
• Mobile space
– Large user bases
– Emergence of mobile search and social networks
• Timely information to users depending on location,
preferences, social circles, extraneous factors, and the context
in which they operate
• Synthesize user input and behavior to determine
location and intent
• Virtual worlds – Second Life
– Began as simulations for multiple users
• Blur distinction with real-world
• Co-space, for both virtual and physical worlds
– Events in physical captured by sensors, materialized in virtual
– Events in virtual can affect physical
• Need to process heterogeneous data streams
• Balance privacy against sharing personal real-time information
• Virtual actors require large-scale parallel programs
– Efficient storage, data processing, power sensitive
Moving Forward
• DB research community doubled in size over the last decade
• Increasing technical scope makes it difficult to keep track of the
field
• Review load for papers growing
– Quality of reviews decreasing over time
• Need more technical books, blogs, wikis
• Open source software development in DB
– Competition: system components for cloud computing
– Large-scale information extraction