Google and Scalable Query Services
Shared by: wuzhenguang
-
Stats
- views:
- 0
- posted:
- 2/8/2012
- language:
- pages:
- 56
Document Sample


Additional DI Paradigms
Peer to Peer Databases
Database Approaches to
Semantic Integration
Data warehouse Warehouse
Design a single schema DBMS
Do physical DB design ETL Tools
Map data into Query (offline)
XML Sources
warehouse schema
Periodically update warehouse Warehoused tables Data in
common
schema Rel. Sources
Data Virtual data integration (EII)
Integration
System Wrappers Design mediated schema
(demand-
XML Sources Map sources to mediated schema
Queries are rewritten and
driven)
Query
Mediated
Data in
answered on demand from
schema
common sources
format Rel. Sources
3
A Single Centralized Schema
is a Bottleneck!
Challenging to form a single schema for all domain data
People don’t agree on how concepts should be represented
Data warehouse: physical design is a strong consideration
Mediated schema very different from original users’ schemas
Difficult to evolve this single schema as needs change
May “break” existing queries
Must build consensus for any schema changes
4
Peer Data Management: Decentralized
Mediation for Ad Hoc Extensibility
DB
Projects
UPenn UW Stanford IIT Mumbai
Data integration: 1 mediated schema, m mappings to sources
Peer data management system (PDMS):
n mediated “peer schemas,” as few as (n - 1) mappings between
them – evaluated transitively
m mappings to sources
5
Answering Queries in a PDMS:
Transitively Evaluating Mappings
Start with schema being queried
Look up mappings to neighbors; expand
Continue iteratively until queries only over sources
Mappings in a PDMS may be a combination of LAV, GAV
techniques:
General form p1a(X, Y), p1b(Y,Z), … = p2a(Y, X), p2b(X, Z), …
(see paper for examination of what is actually tractable)
Requires unfolding and AQUV
We use a rule-goal “tree” to expand the mappings
Extend some of the ideas of MiniCon to avoid unnecessary
expansions
Challenges to avoid redundancy – see paper for optimizations
6
Example of Query Answering
Query: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
r0 r1
SameProject ProjMember CoAuthor Author
(a1,a2,p) (a1,p) (a1,a2) (a,w)
Sched r3
(f,s,e)
r2
S1 S2
Mappings between peers’ schemas:
r0: SameProject(a1,a2,p) :- ProjMember(a1,p), ProjMember(a2,p)
r1: CoAuthor(a1,a2) Author(a1,w), Author(a2,w)
Mappings to data sources:
r2: S1(a,p,s) ProjMember(a,p), Sched(f,s,end)
r3: CoAuthor(f1,f2) :- S2(f1,f2)
7
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
8
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
9
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
Mappings between peers’ schemas:
r0: SameProject(a1,a2,p) :- ProjMember(a1,p), ProjMember(a2,p)
r1: CoAuthor(a1,a2) Author(a1,w), Author(a2,w)
10
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
r0 r1 r1
Mappings between peers’ schemas:
r0: SameProject(a1,a2,p) :- ProjMember(a1,p), ProjMember(a2,p)
r1: CoAuthor(a1,a2) Author(a1,w), Author(a2,w)
11
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
r0 r1 r1
ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1)
Mappings to data sources:
r2: S1(a,p,s) ProjMember(a,p), Sched(a,s,end)
r3: CoAuthor(f1,f2) = S2(f1,f2)
12
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
r0 r1 r1
ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1)
r2 r2 r3 r3
Mappings to data sources:
r2: S1(a,p,s) ProjMember(a,p), Sched(a,s,end)
r3: CoAuthor(f1,f2) = S2(f1,f2)
13
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
r0 r1 r1
ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1)
r2 r2 r3 r3
S1(a1,p,_) S1(a2,p,_) S2(a1,a2) S2(a2,a1)
14
Example Rule-Goal Tree Expansion
q: Q(a1, a2) :- SameProject(a1,a2,p), Author(a1,w), Author(a2,w)
q
SameProject(a1,a2,p) Author(a1,w) Author(a2,w)
r0 r1 r1
ProjMember(a1,p) ProjMember(a2,p) CoAuthor(a1,a2) CoAuthor(a2,a1)
r2 r2 r3 r3
S1(a1,p,_) S1(a2,p,_) S2(a1,a2) S2(a2,a1)
Q’(a1,a2) :- S1(a1,p,_), S1(a2,p,_), S2(a1,a2)
S1(a1,p,_), S1(a2,p,_), S2(a2,a1)
15
ORCHESTRA: Rapid, Collaborative
Sharing of Dynamic Data
Zachary Ives, Nitin Khandelwal,
Aneesh Kapur, University of Pennsylvania
Murat Cakir, Drexel University
Data Exchange among Bioinformatics
Warehouses & Biologists
GO
RAD MAGE-ML Schema
... Schema Schema
Data
providers
RAD DB RAD DB
@ Penn @ Sanger systemsbiology.org ArrayExpress
Different bioinformatics institutes, research groups store their data in
separate warehouses with related, “overlapping” data
Each source is independently updated, curated locally
Updates are published periodically in some “standard” schema
Each site wants to import these changes, maintain a copy of all data
Individual scientists also import the data and changes, and would like to
share their derived results
Caveat: not all sites agree on the facts!
Often, no consensus on the “right” answer!
17
A Clear Need for a General
Infrastructure for Data Exchange
Bioinformatics exchange is done with ad hoc, custom tools – or
manually – or not at all!
(NOT an instance of file sync, e.g., Intellisync, Harmony; or groupware)
It’s only one instance of managing the exchange of independently
modified data, e.g.:
Sharing subsets of contact lists (colleagues with different apps)
Integrating and merging multiple authors’ bibTeX, EndNote files
Distributed maintenance of sites like DBLP, SIGMOD Anthology
This problem has many similarities to traditional DBs/data integration:
Structured or semi-structured data
Schema heterogeneity, different data formats, autonomous sources
Concurrent updates
Transactional semantics
18
Challenges in Developing Collaborative
Data Sharing “Middleware”
1. How do we coordinate updates between
conflicting collaborators?
2. How do we support rapid & transient
participation, as in the Web or P2P systems?
3. How do we handle the issues of exchanging
updates across different schemas?
These issues are the focus of our work on the
ORCHESTRA Collaborative Data Sharing
System
19
Our Data Sharing Model
GO
RAD MAGE-ML Schema
... Schema Schema
Data
providers
RAD DB RAD DB
@ Penn @ Sanger systemsbiology.org ArrayExpress
1. Participants create & independently update local replicas of an instance of
a particular schema
Typically stored in a conventional DBMS
2. Periodically reconcile changes with those of other participants
Updates are accepted based on trust/authority – coordinated disagreement
3. Changes may need to be translated across mappings between schemas
Sometimes only part of the information is mapped
20
Conclusions and Future Work
ORCHESTRA focuses on trying to coordinate disagreement,
rather than enforcing agreement
Significantly different from prior data sharing and
synchronization efforts
Allows full autonomy of participants – offers scalability, flexibility
Central ideas:
A new data model that supports “coordinated disagreement”
Global reconciliation and support for transient membership via
P2P distributed hash substrate
Update translation using extensions to peer data management
and view update/maintence
Currently working on integrated system, performance
optimization
21
Efforts at C^3
The MOBS Project
Learn from multitude of users to improve tool accuracy,
thus significantly reducing builder workload
Questions
Answers
MOBS = Mass Collaboration to Build Systems 23
Mass Collaboration
Build software artifacts
– Linux, Apache server, other open-source software
Knowledge bases, encyclopedia
– wikipedia.com
Review & technical support websites
– amazon.com, epinions.com, quiq.com,
Detect software bugs
– PLDI 03 & 05
Label images on the Web
Improve search engines, recommender systems
Why not data integration systems?
24
Example: Duplicate Data Matching
Serious problem in many settings (e.g., epinions.com)
Dell laptop X200 with mouse ...
Mouse for Dell laptop 200 series ...
Dell X200. Mouse at reduced price ...
Hard for machine, but easy for human
25
Key Challenges
How to modify tools to learn from users?
– gather more training data, learn domain constraints, verify
intermediate and final predictions
– ask questions that are hard for machine, easy for human
How to combine noisy user answers
– build user models, learn them via interaction with users
Novel form of active learning
– with multiple noisy oracles
How to obtain user participation?
– data experts, often willing to help (e.g., Illinois Fire Service)
– may be asked to help (e.g., epinions)
– volunteer (e.g., online communities), "payment" schemes
26
Key Challenges (Cont.)
How to convince the database community?
– many love the idea, some skeptical
– need solid papers, real systems, more evangelizing
Current status
– built prototype, experimented with 3-132 users,
for source discovery and schema matching
– improve accuracy by 9-60%, reduced workload by 29-88%
– built two simple DI systems on Web almost exclusively with users
– building a real system on top of DBworld
See [McCann et al., WebDB-03, ICDE-05,
AAAI Spring Symposium-05, Tech Report-06]
27
Simplify Mediated Schema
Keyword Search over Multiple Databases
Can do joins
across data
sources
Novel problem
Extremely useful for urgent / one-time DI needs
– also when users are SQL-illiterate (e.g., Electronic Medical Records)
Solution [Tech Report 05, forthcoming]
– combines IR, schema matching, data matching, and AI planning
28
Simplify Wrappers
Structured Queries over Text/Web Data
SELECT ... FROM ... WHERE ...
E-mails, text, Web data, news, etc.
Novel problem
– attracts attention from database / AI / Web researchers at Columbia,
IBM TJ Watson/Almaden, UCLA, IIT-Bombay
Handles urgent / one-time DI need [Tech Report 05b]
– combines database, IR, and machine learning techniques
Handles long-term DI needs [Tech Report 05c]
29
Beyond Data Integration:
Community Information Management
Web pages
Researcher Homepages
Conference Pages ** *
DB Group Pages
*
Text documents
DBworld mailing list
etc. * * * *
*
Jim Gray Jim Gray
Keyword search
*
* giving-talk
* SQL querying
SIGMOD-04 SIGMOD-04 Question answering
*
* Browse, mining
*
DBLP
30
The Big Picture [Speculative Mode]
Structured data Unstructured data
(relational, XML) (text, Web, email)
Database: SQL
IR/Web/AI/Mining: keyword, QA
Multitude of users Many apps will
- increasingly involve all three
- want keyword search + SQL + etc.
Semantic Web
Industry/Real World
31
(Almost) Hands-Off
Information Integration for the
Life Sciences
Ulf Leser, Felix Naumann
Humboldt-Universität zu Berlin
Aladin
• Basic idea
- Urgent need for data integration in the life sciences
- Life science databases have certain characteristics
- Life science database users have certain intentions
- These can be exploited to automate integration
• ALmost Automatic Data INtegration
for the Life Sciences
- Minimize manual effort
- Keep quality of integrated data as high as possible
- Use domain-specific heuristics
Leser. Naumann, Hands-Off Information Integration, CIDR 05 33
Integration?
Export schema Export schema
• Database integration Federated schema Federated schema
• Schema level Export schema
Component schema
Export schema
Component schema
Export schema
Component schema
Local schema Local schema Local schema
Data Data Data
Source Source Source
• Data integration
• Data level
Leser. Naumann, Hands-Off Information Integration, CIDR 05 34
Two Cultures of Integration
• Schema-driven (computer scientists)
- Much smaller than data, (hopefully) well-defined elements
- Resolve redundancy and heterogeneity at the schema level
- High degree of automation once system is set-up
- Focus on methods - you rarely publish a “data paper”
• Data-driven (biologists)
- Value is in the data, abstraction is a result of analysis
- Don‘t bother with schemas
• Abstraction is volatile and depends on experimental technique
- Manual integration at data level, constant high effort
- You rarely publish a (database) “method paper”
Leser. Naumann, Hands-Off Information Integration, CIDR 05 35
Two Cultures: TAMBIS & SWISS-PROT
• Semantic middleware
• 6 sources, 1200 concepts
• Ever adopted in any other project?
- Integrated schema difficult to understand
- No agreement on “global” concepts
- Data provenance
• Database of protein sequences
• Papers, pers. comm., ext. databases, …
• Large effort: 30+ data curators
- Gold standard database
• Mostly perceived and used as a book
Leser. Naumann, Hands-Off Information Integration, CIDR 05 36
Linking Associated Objects
• Schema-driven
- Too abstract; tends to blur data provenance
• Data-driven
- Costly and time-consuming; inadequate use of DB technology
• Alternative: Concentrate on object links
• Example: SRS
- Maps a flat-file into a semi-structured,
“one class” representation
- Never mixes data from diff. sources
- Use cross-references for navigation and
joins
Leser. Naumann, Hands-Off Information Integration, CIDR 05 37
Cross-References
Leser. Naumann, Hands-Off Information Integration, CIDR 05 38
Aladin’s Scenario
• Assumptions
- Integration of many, many biological databases
- As little manual interventions as possible
- Do not merge data from different databases
• Challenges
- Push automation as far as possible without lowering quality
of integrated data too much
- Systematically evaluate quality of automatic integration
• Why will it work?
- Integrate by generating / finding links between objects
- Exploit characteristics of life science databases
Leser. Naumann, Hands-Off Information Integration, CIDR 05 39
Properties – and how to use them
• Data sources have only one “type” of object
• Objects have nested, semi-structured annotations
Detect hierarchical structure
• Objects have stable, unique accession numbers
• Databases heavily cross-reference each other
Detect objects
Detect existing cross-references
• Objects have rich annotations (often free text,
sequences)
Detect further associations based on “similarity”
Leser. Naumann, Hands-Off Information Integration, CIDR 05 40
A Biological Database
Leser. Naumann, Hands-Off Information Integration, CIDR 05 41
Columba: Multidimensional Integration
• Interdisciplinary project
• Integrates 15 sources annotating protein structures
• Sources are dimensions for PDB entries
• Neither data nor schema integration - links
SCOP SwissProt
Class
Fold
Description
Domains • Advantages
Superfamily Feature
• Users recognize their sources
CATH • Intuitive query concept
PDB GeneOntology
Class
Architecture
PDB_ID
Compounds Chains
Terms
TermRelations
• “Relatively” easy to maintain/extend
Topology
Ligands Ontologies
Homolog. sf
DSSP KEGG
Secondary Pathway
structure Enzyme
elements EC Number
Leser. Naumann, Hands-Off Information Integration, CIDR 05 42
Columba Experiences
• = Aladin’s assumptions
Relational approach feasible: Sources are downloadable, parsers exist
Databases are collections of each one type
Hierarchical structure, only 1:n relationships
Objects have unique accession numbers
Importance of and lack of cross references
• Lessons learned
- Schema reengineering is extremely time-consuming
• Although we will only use a small part at the end
- There is more demand than resources
• Why not be less specific about which data to integrate,
but much faster?
Leser. Naumann, Hands-Off Information Integration, CIDR 05 43
Materialized Integration
Data
Warehouse
BIND
Brenda
PDB
OMIM Genbank
SWISSPROT KEGG
PubMed
Leser. Naumann, Hands-Off Information Integration, CIDR 05 44
Materialized Integration
BIND
Brenda
Aladin PDB
OMIM Genbank
SWISSPROT KEGG
PubMed
Leser. Naumann, Hands-Off Information Integration, CIDR 05 45
Five Steps to Integration
Source-specific
1. Download source, parse, import into RDBMS
2. Guess primary objects
3. Guess (hierarchically structured) annotation
Across data sources
4. Guess cross-references
• Objects sharing some piece of information
5. Guess duplicates
• Highly similar objects
Leser. Naumann, Hands-Off Information Integration, CIDR 05 46
Overview – Steps 1-3
Steps 2 and 3
• Guess primary objects
• Guess accession number
• Guess / find FK constraints
Step 1
• Parse and import
• Arbitrary target schema
• With or without FK constraints
Leser. Naumann, Hands-Off Information Integration, CIDR 05 47
Overview – Steps 4+5
Step 5
• Guess duplicates
• Different degrees of
“duplicateness”
Step 4
• Guess existing cross-refs
• Compute new cross-refs
Leser. Naumann, Hands-Off Information Integration, CIDR 05 48
1. Download, parse, import
• Q: Is that possible in an automatic way?
• Q: What is the target schema?
• Answers
- Here, some manual work is involved, but …
- Parsers are almost always available (BioXXX)
- Aladin doesn‘t mind the target schema
- Target schemas are completely source-specific
- … may or may not contain FK constraints (MySql is …!)
- But: Universal relation won’t work
Leser. Naumann, Hands-Off Information Integration, CIDR 05 49
2. Guess Primary Objects
• Q: What’s a primary object?
• Q: How do you find them?
• Answers
- A database is a collection of objects of one type
• Many biological databases started as books
- These primary objects have stable accession numbers
- Accession numbers look very much the same
• P0496, DXS231, 1DXX, …
• Analyze length, composition, variation, uniqueness, NOT NULL
- But: Databases may have more than one primary type
Leser. Naumann, Hands-Off Information Integration, CIDR 05 50
3. Guess Dependent Annotation
• Q: Can we detect dependency from data?
• Q: What about complex relationships?
• Answers
- Hierarchical annotation means 1:1 or 1:n relationships
• Annotations don’t reference each other
• No m:n - especially flat-file parsers don’t generate m:n
- Guess or use primary keys and foreign key constraints
• Unique and not null; subset relationship; surrogate keys; …
- Lot of previous work, e.g. [KL92], [MLP02], …
Leser. Naumann, Hands-Off Information Integration, CIDR 05 51
4. Guess Associations between Objects
• Q: How can we find existing cross-refs?
• Q: How can we generate new cross-refs?
• Answers
- An existing cross-reference is essentially a pair of identical
accession numbers in two different data sources
• Same characteristics as accession number (minus uniqueness)
- Guess new cross-refs based on similarity of attribute values
• Similarity of text fields (text mining), sequences, …
- Note: cross-refs are on the object level – need to be stored
- Lot of previous work, e.g. [NHT+02], [HBP+05], [AMS+97]
Leser. Naumann, Hands-Off Information Integration, CIDR 05 52
5. Guess Duplicates
• Q: If we don’t even know classes – what’s a
duplicate?
• Answer
- Most difficult part, but there are many kind-of duplicates
• Are sequence-identical genes in different species the same?
- Need for varying degrees of “duplicateness”
• Data level (overlap in attribute values)
• Schema-level (schema matching)
- Note: No removal or merging of duplicates
- Lot of previous work, e.g. [MGR+02], [BN05], [MLF04], …
Leser. Naumann, Hands-Off Information Integration, CIDR 05 53
Caveats
• Not meant for high-throughput data
- Proteomics profiling, gene expression databases
- Targets “knowledge-rich” databases
• Resulting warehouse will contain errors
- Wrong cross-refs, misinterpreted structure, missing links
- Requirement: Measure quality of Aladin’s methods
• Use existing integrated databases as gold standard
• Precision/recall measures can be derived for all steps
• Intended for human usage, not for automatic
further processing
Leser. Naumann, Hands-Off Information Integration, CIDR 05 54
Summary
• Five step (almost) automatic integration procedure
- Depends on domain characteristics
- Guesses primary objects, annotations, cross-references, duplicates
- Neither schema integration nor data fusion – links
• Which quality does Aladin achieve?
- We don’t know yet – needs to be evaluated
• Issue: Scalability
- Needs many, many comparisons of tables, tuples, values
- But: Incremental integration, sampling, pruning
• Issue: Searching and result presentation
- Full text search, browsing
- But: Queries across sources possible for advanced users
Leser. Naumann, Hands-Off Information Integration, CIDR 05 55
Acknowledgements Columba
• Humboldt University • Conrad-Zuse Center
Silke Trissl Rene Heek
Heiko Müller Thomas Steinke
Raphael Bauer • Technische
• Charite Fachhochschule
Kristian Rother Patrick May
Stefan Günther Ina Koch
Robert Preissner
Cornelius Frömmel • Funding: BMBF
Leser. Naumann, Hands-Off Information Integration, CIDR 05 56
Get documents about "