VIEWS: 11 PAGES: 48 POSTED ON: 8/9/2012 Public Domain
Topics covered Databases QL Query containment An evaluation of QL CS848: Topics in Databases: Information Integration A simple case of information integration SQL Server Global Schema Subsystem1 Subsystem2 SQL Open, Scan Subsystem3 A single table: T A “local as view” (LAV) integration schema: T ´ Q2. User submits Q1. Query optimizer must determine if a scan of T suffices. True iff Q1 is equivalent to Q2. CS848: Topics in Databases: Information Integration In the beginning … Infinite countable sets of each of the following kinds of symbols: C = {C1, C2, … } (primitive concepts) A = {A1, A2, …} [ {B1, B2, …} (attributes) R = {R1, R2, … } (roles) Conventions: Attributes (resp. primitive concepts and roles) correspond to words in lower case or to positive integers (resp. words in upper case and words in mixed case). CS848: Topics in Databases: Information Integration For a particular database I hD, (¢)Ii where D is a countable possibly infinite domain, and where for each symbol (C)I µ D (A)I : D ! D (R)I µ (D £ D) CS848: Topics in Databases: Information Integration Partial databases e e2D e : {C1, … , Cn} e 2 (Ci)I A e1 e2 (A)I(e1) = e2 R e1 e2 (e1, e2) 2 (R)I e1 e2 e1 = e2 e1 e2 e1 e2 CS848: Topics in Databases: Information Integration Relational databases EMP name age name “John” “John” 33 “Mary” 33 e1 : {EMP} age 33 e2 : {EMP} name “Mary” CS848: Topics in Databases: Information Integration Relational databases (cont’d) name “John” {e1, e2, “John”, 33, “Mary”} µ D e1 : {EMP} {e1, e2} µ (EMP)I age (name)I(e1) = “John” 33 (age)I(e1) = (age)I(e2) = 33 e2 : {EMP} e1 e2 name “John” ? 33 “Mary” CS848: Topics in Databases: Information Integration Dialects of QL Conjunctive QL Positive QL First order QL with with with (semantics) bag semantics† bag semantics‡ bag semantics Conjunctive QL Positive QL First order QL (expressiveness) †[Khizder et al., 1999], ‡[Lui et al., 2002] CS848: Topics in Databases: Information Integration Conjunctive QL Q ::= D as A (quantification) | A1 = A2.R (unnest) | A1.Pf1 = A2.Pf2 (selection) | elim A1, … , An Q (projection) | true (null tuple) | from Q1, Q2 (natural join) | (Q) D ::= THING | C (basic description) Pf ::= id | A.Pf (path function) CS848: Topics in Databases: Information Integration Well formed queries: a(Q) a(D as A) ´ {A} a(A1 = A2.R) ´ {A1, A2} a(A1.Pf1 = A2.Pf2) ´ {A1, A2} a(elim A1, … , An Q) ´ {A1, … , An} a(true) ´ ; a(from Q1, Q2) ´ a(Q1) [ a(Q2) Require {A1, … , An} µ a(Q) for projection operators. CS848: Topics in Databases: Information Integration Tuples and bags A (duplicate) tuple t with attribute bindings for attributes {A1, … , An} over a database I = hD, (¢)Ii has the general form hA1 : e1, … , An : en , cnt : ii, where {e1, … , en} µ D, “cnt” is a distinct symbol not used as an attribute, and i a positive integer. A set of duplicate tuples that contain the same attribute bindings is called a bag. CS848: Topics in Databases: Information Integration Operations on tuples a(t) ´ set of attributes occurring in t t@cnt ´ integer i such that “cnt : i” occurs in t t@A ´ element e 2 D such that “A : e” occurs in t; defined only when A 2 a(t) t[{A1, … , An}] ´ {A1 : t@A1, … , An : t@An}; defined only when {A1, … , An} µ a(t) [t] ´ t[a(t)] CS848: Topics in Databases: Information Integration Semantics The meaning of a query Q, denoted «Q¬, is a function that maps databases to bags. The behavior of this function on a particular database I = hD, (¢)Ii is defined as follows. «THING as A¬(I) ´ {hA : e, cnt : 1i : e 2 D } «C as A¬(I) ´ {hA : e, cnt : 1i : e 2 (C)I} «A1 = A2.R¬(I) ´ {hA1 : e1, A2 : e2, cnt : 1i : (e2, e1) 2 RI} «A1.Pf1 = A2.Pf2¬(I) ´ {hA1 : e1, A2 : e2, cnt : 1i : (Pf1)I(e1) = (Pf2)I(e2)} where (id)I ´ {(e, e) : e 2 D} (A.Pf )I ´ {(e1, e2) : (Pf )I((A)I(e1)) = e2} CS848: Topics in Databases: Information Integration Semantics (cont’d) «elim A1, … , An Q ¬(I) ´ ; , if not well formed; otherwise {hA1 : t@A1, … , An : t@An , cnt : 1i : t 2 «Q¬(I)} «true¬(I) ´ {hcnt : 1i} «from Q1, Q2¬(I) ´ {t : a(t) = a(Q1) [ a(Q2) Æ 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : t@cnt = t1@cnt £ t2@cnt Æ t[a(t1)] = [t1] Æ t[a(t2)] = [t2]} CS848: Topics in Databases: Information Integration Syntactic sugar A1. L .An.id ´ A1. L .An select distinct A1, … , An Q ´ elim A1, … , An Q select * Q ´ Q Q1 where Q2 ´ from Q1, Q2 Q1 and Q2 ´ from Q1, Q2 from ´ true from Q1, Q2, … , Qn ´ from (from Q1, Q2, …) Qn CS848: Topics in Databases: Information Integration Examples The names of employees who have the same age as another employee with a given name. select distinct :p, name from EMP as e, ( select distinct :p, e1 from EMP as e1, EMP as e2 where e1.age = e2.age and e2.name = :p ) where e.name = name and e.id = e1.id CS848: Topics in Databases: Information Integration Method calls (more syntactic sugar) A1.Pf1.C(A2.Pf2, … , An-1.Pfn-1) = An.Pfn select distinct A1, … , An ´ from C as A where A.1 = A1.Pf1 and … and A.n = An.Pfn A1.Pf1.C(A2.Pf2, … , An-1.Pfn-1) as An ´ A1.Pf1.C(A2.Pf2, … , An-1.Pfn-1) = An.id CS848: Topics in Databases: Information Integration Examples (cont’d) The names of employees who have an age double that of another employee. select distinct name from EMP as e, ( select distinct e1 from EMP as e1, EMP as e2 where e2.age.+(e2.age) = e1.age ) where e.name = name and e.id = e1.id CS848: Topics in Databases: Information Integration Conjunctive datalog (more syntactic sugar) C(A1, … , An) select distinct A1, … , An ´ from C as A where A.1 = A1.id and … and A.n = An.id (A1, … , Am) :- Q1, … , Qn. ´ select distinct A1, … , Am from Q1, … , Qn CS848: Topics in Databases: Information Integration Positive QL Q ::= empty A1, … , An (empty set) | Q1 union Q2 (union) a(empty A1, …, An) ´ {A1, …, An} a(Q1 union all Q2) ´ a(Q1) Require a(Q1) = a(Q2) in union operations. CS848: Topics in Databases: Information Integration Semantics «empty A1, … , An¬(I) ´ ; «Q1 union Q2¬(I) ´ {t : t@cnt = 1 Æ a(t) = a(Q1) Æ a(t) = a(Q2) Æ( ( 9 t1 2 «Q1¬(I) : [t] = [t1] Æ :9 t2 2 «Q2¬(I) : [t] = [t2] ) Ç ( 9 t2 2 «Q2¬(I) : [t] = [t2] Æ :9 t1 2 «Q1¬(I) : [t] = [t1] ) Ç ( 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : [t] = [t1] Æ [t] = [t2] ) )} CS848: Topics in Databases: Information Integration First order QL Q ::= Q1 minus Q2 (difference) a(Q1 minus Q2) ´ a(Q1) Require a(Q1) = a(Q2) in difference operations. CS848: Topics in Databases: Information Integration Semantics «Q1 minus Q2¬(I) ´ {t : t@cnt = 1 Æ a(t) = a(Q1) Æ a(t) = a(Q2) Æ ( 9t1 2 «Q1¬(I) : [t] = [t1] ) Æ ( :9t2 2 «Q2¬(I) : [t] = [t2] )} CS848: Topics in Databases: Information Integration QL with duplicates Q ::= select A1, … , An Q (duplicate preserving projection) | Q1 union all Q2 (bag union) | Q1 minus all Q2 (bag difference) CS848: Topics in Databases: Information Integration Well formed queries (cont’d) a(select A1, … , An Q) ´ {A1, … , An} a(Q1 union all Q2 ) ´ a(Q1) a(Q1 minus all Q2) ´ a(Q1) Require a(Q1) = a(Q2) in bag union and bag difference operations, and that {A1, … , An} µ a(Q) in (duplicate preserving) projection operations. CS848: Topics in Databases: Information Integration Semantics «select A1, … , An Q ¬(I) ´ ; , if not well formed and representable†; otherwise {hA1 : t1@A1, … , An : t1@An , cnt : ni : t1 2 «Q¬(I) Æ n = (t2@cnt) } t2 2 «Q¬(I) : t2[{A1, … , An}] = t1 [{A1, … , An}] †The selection operation is representable on database I iff, for every t1 2 «Q¬(I), |{t2 2 «Q¬(I) : t2[{A1, … , An}] = t1 [{A1, … , An}]}| is finite. CS848: Topics in Databases: Information Integration Example A duplicate preserving projection operation that is not representable in any database with an infinite domain. select e1 from THING as e1, THING as e2 Observation: All well-formed duplicate preserving projection operations on databases with finite domains are representable. CS848: Topics in Databases: Information Integration Semantics (cont’d) «Q1 union all Q2¬(I) ´ ; , if not well formed; otherwise {t 2 «Q1¬(I) : :9 t2 2 «Q2¬(I) : [t] = [t2]} [ {t 2 «Q2¬(I) : :9 t1 2 «Q1¬(I) : [t] = [t1]} [ {t : 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : [t] = [t1]} Æ [t] = [t2] Æ t@cnt = t1@cnt + t2@cnt} «Q1 minus all Q2¬(I) ´ ; , if not well formed; otherwise {t 2 «Q1¬(I) : :9 t2 2 «Q2¬(I) : [t] = [t2]} [ {t : 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : [t] = [t1]} Æ [t] = [t2] Æ t@cnt = t1@cnt - t2@cnt Æ t1@cnt > t2@cnt } CS848: Topics in Databases: Information Integration Summary at, =, elim, at, =, elim, at, =, elim, true, from, (bag semantics) true, from, true, from, select, select, select empty, union all, empty, union all minus all at, =, elim, at, =, elim, (set semantics) at, =, elim, true, from, true, from, true, from empty, union, empty, union minus (conjunctive) (positive) (first order) CS848: Topics in Databases: Information Integration Query contexts An expression Q[] in the language QL enriched by an additional terminal symbol [] is called a query context. For a query Q1 2 QL, the expression Q1[Q2] denotes the syntactical substitution of Q2 for []. Q2 is compatible with Q1 if Q1[Q2] 2 QL. For example, Q2 is compatible with Q1 in the following. Q2: EMP as e where e.name = :p Q1: select distinct :p, d from DEPT as d, [] where d = e.dept CS848: Topics in Databases: Information Integration The query equivalence problem Q1 is equivalent to Q2 for database I, written I ² (Q1 ´ Q2 ), if «Q1¬(I) = «Q2¬(I). A query equivalence dependency E has the form (Q1 ´ Q2). E = (Q1 ´ Q2) is an axiom if, for any database I, I ² (Q1 ´ Q2 ). A query equivalence problem for a given set of query equivalence dependencies is to determine if a given member of the set is an axiom. CS848: Topics in Databases: Information Integration Some axioms Question: Is it true that any E with the following form is an axiom? (elim A1, … , Am Q1)[elim B1, … , Bn Q2] ´ elim A1, … , Am Q1 [Q2] Answer: No. However, any such E is an axiom if any attribute in (a(Q2) – {B1, … , Bn}) does not occur in query context (elim A1, … , Am Q1 []). CS848: Topics in Databases: Information Integration Excluding variable reuse in QL Q has an occurrence of variable reuse if there is a query context Q1[] and a query of the form elim A1, … , An Q2 or of the form select A1, … , An Q2 such that Q = Q1[Q2] and there exists A in (a(Q2) – {A1, … , An}) that also occurs in Q1[]. Observation: For any Q1, there exists an equivalent class of query Q2 that has no occurrence of variable reuse. CS848: Topics in Databases: Information Integration The query containment problem Q1 is contained in Q2 for database I, written I ² (Q1 v Q2), if, for any tuple t1 in «Q1¬(I), there exists t2 in «Q2¬(I) such that [t1] = [t2] and t1@cnt t2@cnt. A query containment dependency C has the form (Q1 v Q2). C = (Q1 v Q2) is an axiom if, for any database I, I ² (Q1 v Q2). A query containment problem for a given set of query containment dependencies is to determine if a given member of the set is an axiom. CS848: Topics in Databases: Information Integration Equivalence and containment Observation: Equivalence reduces to containment. Q1 ´ Q2 iff Q1 v Q2 and Q2 v Q1 Observation: Containment reduces to equivalence in first order QL. Q1 v Q2 iff (Q1 minus all Q2) ´ empty a(Q1) CS848: Topics in Databases: Information Integration Some complexity results Theorem: The query equivalence and containment problems for conjunctive QL is NP-complete.† †Chandra, A. K. and P. M. Merlin. Optimal implementation of conjunctive queries in relational databases. Proc. Ninth Annual ACM Symposium on the Theory of Computing, pp. 77–90, 1977. CS848: Topics in Databases: Information Integration A decision procedure Theorem: The following procedure decides if C = (Q1 v Q2) is an axiom for conjunctive QL.† 1. Freeze the body of Q1 by creating a partial database consisting of individuals that include its variables. 2. If the tuple hA1 : A1 , … , An : An , cnt : 1i occurs in «Q2¬(I), where a(Q1) = {A1, … , An}, then return true; otherwise return false.‡ †Derived from [Ullman, 1999]. ‡Use forced semantics for selection operations. CS848: Topics in Databases: Information Integration Obtaining a partial database from Q THING as A A A1 = A2.R R A2 A1 C as A A : {C} A1.A2. L .Am = B1.B2. L .Bn A2 … Am A1 B2 … Bn B1 CS848: Topics in Databases: Information Integration Derived partial databases (cont’d) A A u : L1 v : L3 u : L1 v : L3 A A w : L2 x : L4 w : L2 x : L4 A A w:L u : L1 w:L u : L1 A A v : L2 v : L2 CS848: Topics in Databases: Information Integration Deriving partial databases (cont’d) n 1 : L1 n 2 : L2 n 1 : L1 [ L2 n 2 : L1 [ L2 n 1 : L1 n 2 : L2 n 3 : L3 n 1 : L1 n 2 : L2 n 3 : L3 CS848: Topics in Databases: Information Integration Evaluating selections on partial databases Note that selection conditions can navigate missing attribute values. In such cases, assume a forced semantics. In particular, two nodes n1 and n2 satisfy a selection condition iff the condition has the form n1.Pf1.Pf = n2.Pf2.Pf where (Pf1)I(n1) and (Pf2)I(n2) are defined and lead to nodes connected by an equality arc. CS848: Topics in Databases: Information Integration Some complexity results (cont’d) Theorem: The query equivalence problem for conjunctive QL with bag semantics is NP-complete. Observation: The complexity of the query containment problem for conjunctive QL with bag semantics remains open at this time. Example:† In conjunctive QL with bag semantics, the query containment dependency Q1 v Q2 is an axiom, where Q1 and Q2 have the respective definitions select x, z select x, z from P as x, R as z from P as x, R as z where x = u.Q and z = v.Q where y = u.Q and y = v.Q † [Chaudhuri and Vardi, 1993] CS848: Topics in Databases: Information Integration The query membership problem A database schema, denoted T, consists of a finite set {C1, … , Cn} of query containment dependencies. C is an axiom relative to database schema T = {C1, … , Cn}, written T ² C, if, for any database I, I ² C if I ² Ci for each i. A query membership problem for a given set of query containment dependencies is to determine if a given member of the set is an axiom relative to a given database schema also consisting of members of the set. CS848: Topics in Databases: Information Integration More complexity results Theorem: The query membership problem for conjunctive QL is undecidable. Theorem: The query membership problem for first order QL is equivalent to the query containment problem for first order QL. Proof: Assignment. CS848: Topics in Databases: Information Integration Evaluating QL Defining database schema Expressing access plans Fine grained APIs: record addresses Protocols Safety Binding patterns Adequacy: SQL, OQL, XQuery CS848: Topics in Databases: Information Integration Modeling generalization taxonomies Consider a simple object-oriented schema language consisting of sentences of the following form.† class C {A1 : ref C1, … , Am : ref Cm} [isa C1 , … , Cn]; Assignment: Encode a fixed collection of such sentences as a database schema in conjunctive QL. Your encoding should be as compact as possible and should enable the following questions to be expressed as query containment dependencies over your schema. 1. Is C a defined class? 2. Is attribute A defined on class C? 3. Can an object reside in both class C1 and class C2? †Assume that any object in a database was created with respect to a single class. CS848: Topics in Databases: Information Integration Modeling pipelined query access plans (syntax) (defn of b(¢)) (parameter) Q ::= (PARAM as A) {A} (index scan) | (from C as A, A.1 = B1, … , A.n = Bn) {A} (nested loops) | (from Q1, Q2) b(Q1) [ b(Q2) (noop) | (select A1, … , An Q) b(Q) Å {A1, … , An} (record field access) | (A1 = A2.B) {A1} (comparison) | (A1 = A2) ; (catenation) | (Q1 union all Q2) b(Q1) Å b(Q2) (cut) | (elim A1, … , An Q) ; | … Require: 1. (a(Q2) – b(Q2)) µ a(Q1) for nested loops, and 2. a(Q) = b(Q) for top-level queries. CS848: Topics in Databases: Information Integration Alternative semantics Require richer models theories for 1. sort operations, and 2. named cuts. CS848: Topics in Databases: Information Integration