					            Indexing Dataspaces


  Xin (Luna) Dong          Alon Halevy
University of Washington      Google Inc.
              @ SIGMOD 2007
Many Data Management Applications Need to Manage Heterogeneous Data Sources
   [Figure: five data sources D1, D2, D3, D4, D5]
   Traditional Data Integration Systems
   Mediated Schema
     Publication(title, year, author)
   Semantic mapping
     SELECT P.title AS title, P.year AS year, A.name AS author
     FROM Author AS A, Paper AS P, AuthoredBy AS B
     WHERE A.aid=B.aid AND P.pid=B.pid
   Example source schema
     Author(aid, name)
     Paper(pid, title, year)
     AuthoredBy(aid, pid)
   Data sources: D1, D2, D3, D4, D5
Querying on Traditional Data Integration Systems
   [Figure: query Q over the Mediated Schema; queries Q1 to Q5 over data sources D1 to D5]
In Many Applications it is Hard to Obtain Precise Semantic Mappings
   [Figure: data sources D1 to D5 and a “?”]
Scenario 1. Different Websites About Movies
Scenario 2. Personal Information Space




                                Intranet
                                Internet
Querying Dataspaces
   Dataspaces
     Collections of heterogeneous data sources
     Don’t necessarily include semantic mappings
     Scenarios: personal information, enterprises,
      government agencies, smart homes, digital
      libraries, and the Web
   Pay-as-you-go data management
     Provide some services from the outset
     Improve the mappings on an as-needed basis
   How to effectively query and search a
    dataspace?
Example Dataspace
<publication>
 <title>Semex: Personal information management
  and integration</title>
 <author>Xin Dong</author>
 <author>Alon Halevy</author>
 <conference>IIWeb Workshop</conference>
</publication>

<thesis-proposal>
 <title>Semex: Personal …</title>
 <student>
  <name>Xin (Luna) Dong</name>
  <entryYear>2001</entryYear>
 </student>
</thesis-proposal>


     stuID         lastName         firstName    entryYear
    5001438            Xin             Dong        2001
       …                …                …          …
Searching and Querying a Dataspace
   Structured query?
     Requires detailed knowledge of schemas
     Requires precise attribute values
   Keyword search?
     Forgiving, but…
     Does not allow specifying structure
   We consider queries that are
     keyword-based
     structure-aware
I. Predicate Query
   Conjunction of predicates
   Predicate: (v, {K1, …, Kn})
     v - an attribute or association label
     {K1, …, Kn} - a keyword set
   Example I:
     (title, ‘Semex’)
     (author, ‘Luna Dong’)
   Example II:
     (name, ‘Dong’)
II. Neighborhood Keyword Query
   Form: {K1, …, Kn}
   Example: ‘Semex’
     Relevant items
     Associated items
Indexing of the Heterogeneous Data
   Challenges
     Index data from heterogeneous data sources
     Capture both text values and structural information

   Traditional Indexes
     Build a separate index for each attribute to support
      structured queries
     Build an inverted list to support keyword search
     XML indexes assume tree models and build multiple
      indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)

Our approach: Extend inverted lists to capture
both text values and structure of the data
Contributions
   Design an index that
     indexes data from heterogeneous data sources
     captures both structure and text of the data
     incorporates various types of heterogeneity,
      including synonyms and hierarchies of attributes
      and associations
Outline
 Motivation
 Overview of our approach
 Our algorithm
  Indexing structure
  Indexing hierarchies
 Experimental Results
 Conclusions
View Data Sources as Triple Base

 <publication>                           Alon Halevy
  <title>Semex: Toward …</title>                                        Semex: …
  <authors>                                            author
   <author><name>
    Xin Dong</name></author>                            Luna Dong   author
   <author><name>
    Alon Halevy</name></author>
  </authors>…                      Attribute
 </publication>

                                      Object                           Association
View Data Sources as Triple Base

                       Alon Halevy

                                                      Semex: …

                                     author


                                      Luna Dong   author
View Data Sources as Triple Base
 Alon Halevy
                                           Departmental Database
                                Semex: …
                                             StuID    lastName   firstName   …
               author
                                            1000001     Xin        Dong      …
                                              …          …          …        …
                Luna Dong   author




Goal: Index triples to efficiently answer queries
that combine text and structure
Indexing a Triple Base Using an Inverted List

  Alon Halevy
                                            Departmental Database
                                 Semex: …
                                              StuID    lastName   firstName   …
                author
                                             1000001     Xin        Dong      …
                                               …          …          …        …
                 Luna Dong   author




 Inverted List

                    Alon
                    Dong
                   Halevy
                    Luna
                   Semex
                     Xin
Indexing a Triple Base Using an Inverted List
                Query: Dong

  Alon Halevy
                                                Departmental Database
                                 Semex: …
                                                     StuID   lastName   firstName    …
                author
                                                 1000001       Xin        Dong       …
                                                      …         …          …         …
                 Luna Dong   author




 Inverted List

                    Alon                    1
                    Dong                         1                               1
                   Halevy                   1
                    Luna                         1
                   Semex                                         1
                     Xin                                                         1
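
A plain inverted list like the one above can be sketched in Python as follows. This is only a minimal illustration, not the authors' implementation: the instance ids (`a1`, `a2`, `p1`, `s1`) and the helper `build_inverted_list` are hypothetical names, and text is split on whitespace only.

```python
from collections import defaultdict

# Hypothetical instance ids with their text values, following the slide:
# two persons, one paper, and one row of the departmental database.
instances = {
    "a1": ["Alon Halevy"],
    "a2": ["Luna Dong"],
    "p1": ["Semex"],
    "s1": ["Xin", "Dong"],
}

def build_inverted_list(instances):
    """Map each keyword to {instance id: number of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for inst_id, values in instances.items():
        for value in values:
            for keyword in value.split():
                index[keyword][inst_id] += 1
    return index

index = build_inverted_list(instances)
hits = sorted(index["Dong"])
print(hits)  # ['a2', 's1']
```

The query "Dong" returns both the person and the database row, because the plain list records no structure.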
Incorporate Attribute Labels in the Inverted List
                Query: firstName “Dong”

  Alon Halevy
                                                 Departmental Database
                                  Semex: …
                                                      StuID   lastName   firstName    …
                 author
                                                  1000001       Xin        Dong       …
                                                       …         …          …         …
                  Luna Dong   author




 Inverted List

                     Alon                    1
                     Dong                         1                               1
                    Halevy                   1
                     Luna                         1
                    Semex                                         1
                      Xin                                                         1
Incorporate Attribute Labels in the Inverted List
                Query: firstName “Dong” → “Dong/firstName/”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName    …
                 author
                                                       1000001       Xin        Dong       …
                                                            …         …          …         …
                   Luna Dong       author




 Inverted List

                   Alon/name/                     1
                  Dong/name/                           1
                 Dong/firstName/                                                       1
                  Halevy/name/                    1
                   Luna/name/                          1
                   Semex/title/                                        1
                  Xin/lastName/                                                        1
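
The idea of appending the attribute label to each indexed keyword can be sketched as below. The triples, the instance ids, and the function names `build_attribute_index` and `predicate_lookup` are assumptions made for illustration; only the key format "keyword/attribute/" comes from the slide.

```python
from collections import defaultdict

# (instance id, attribute, value) triples; the ids are hypothetical.
attribute_triples = [
    ("a1", "name", "Alon Halevy"),
    ("a2", "name", "Luna Dong"),
    ("p1", "title", "Semex"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def build_attribute_index(triples):
    """Index every keyword under the key 'keyword/attribute/'."""
    index = defaultdict(lambda: defaultdict(int))
    for inst_id, attribute, value in triples:
        for keyword in value.split():
            index[f"{keyword}/{attribute}/"][inst_id] += 1
    return index

def predicate_lookup(index, attribute, keyword):
    """Answer the predicate (attribute, keyword) with one key lookup."""
    return sorted(index.get(f"{keyword}/{attribute}/", {}))

index = build_attribute_index(attribute_triples)
print(predicate_lookup(index, "firstName", "Dong"))  # ['s1']
print(predicate_lookup(index, "name", "Dong"))       # ['a2']
```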
Incorporate Association Labels in the Inverted List
                Query: author “Dong”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName    …
                  author
                                                       1000001       Xin        Dong       …
                                                            …         …          …         …
                    Luna Dong      author




 Inverted List

                    Alon/name/                    1
                    Dong/name/                         1
                Dong/name/firstName/                                                   1
                   Halevy/name/                   1
                    Luna/name/                         1
                    Semex/title/                                       1
                 Xin/name/lastName/                                                    1
Incorporate Association Labels in the Inverted List
  Alon Halevy   Query: author “Dong” → “Dong/author/”
                              Semex: …       Departmental Database
                 author                             StuID    LastName   FirstName   …
                                                   1000001     Xin        Dong      …
                   Luna Dong      author             …          …          …        …




 Inverted List
                Alon/author/                                   1
                Alon/name/                 1
                Dong/author/                                   1
                Dong/name/                     1
           Dong/name/firstName/                                                1
                Halevy/name/               1
                Luna/name/                     1
                Luna/author/                                   1
                Semex/title/                                   1
           Xin/name/LastName/                                                  1
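
Association labels can be indexed in the same spirit: an instance is listed under the keywords of the instances it is associated with. The sketch below assumes that reading of the slide; the ids, the `associations` list, and `build_association_index` are hypothetical.

```python
from collections import defaultdict

# Text values of each instance (hypothetical ids).
attribute_values = {
    "a1": ["Alon Halevy"],
    "a2": ["Luna Dong"],
    "p1": ["Semex"],
}

# (subject, association label, object): paper p1 has authors a1 and a2.
associations = [("p1", "author", "a1"), ("p1", "author", "a2")]

def build_association_index(attribute_values, associations):
    """Index the subject under 'keyword/label/' for each keyword of the object."""
    index = defaultdict(lambda: defaultdict(int))
    for subject, label, obj in associations:
        for value in attribute_values.get(obj, []):
            for keyword in value.split():
                index[f"{keyword}/{label}/"][subject] += 1
    return index

index = build_association_index(attribute_values, associations)
print(sorted(index["Dong/author/"]))  # ['p1']
```

The predicate (author, ‘Dong’) then becomes a single lookup of "Dong/author/", which returns the paper rather than the person.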
Hierarchies of Attributes and Associations
 <publication>
  <title>Semex: Toward on-the-fly personal
  information integration</title>
   <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
                                                 Example II:
 </publication>                                   (name, ‘Dong’)
 <thesis-proposal>
  <title>Semex: Personal …</title>            Attribute Hierarchy:
  <student>
   <name>Xin (Luna) Dong</name>                           name
   <entryYear>2001</entryYear>
  </student>
 </thesis-proposal>                           firstName          lastName


     stuID        lastName       firstName   entryYear
     5001438          Xin            Dong      2001
       …              …               …           …
Incorporate Attribute Hierarchy in the Inverted List
                Query: name “Dong”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName      …
                 author
                                                       1000001       Xin          Dong       …
                                                            …         …            …         …
                   Luna Dong       author

                                                                           name


                                                             firstName            lastName
 Inverted List

                   Alon/name/                     1
                  Dong/name/                           1
                 Dong/firstName/                                                         1
                  Halevy/name/                    1
                   Luna/name/                          1
                   Semex/title/                                        1
                  Xin/lastName/                                                          1
Naïve Approach: Expand Queries with Sub-Attributes
                Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName      …
                 author
                                                       1000001       Xin          Dong       …
                                                            …         …            …         …
                   Luna Dong       author

                                                                           name


                                                             firstName            lastName
 Inverted List

                   Alon/name/                     1
                  Dong/name/                           1
                 Dong/firstName/                                                         1
                  Halevy/name/                    1
                   Luna/name/                          1
                   Semex/title/                                        1
                  Xin/lastName/                                                          1
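
The naïve approach leaves the index unchanged and expands the query instead. A minimal sketch, assuming the hierarchy is given as a hypothetical `children` map and the index as a plain dictionary of instance-id sets:

```python
# Parent attribute -> direct sub-attributes, as in the slide's hierarchy.
children = {"name": ["firstName", "lastName"]}

# Inverted list with attribute labels; instance ids are hypothetical.
index = {
    "Alon/name/": {"a1"},
    "Dong/name/": {"a2"},
    "Dong/firstName/": {"s1"},
    "Halevy/name/": {"a1"},
    "Luna/name/": {"a2"},
    "Semex/title/": {"p1"},
    "Xin/lastName/": {"s1"},
}

def expand_attribute(attribute):
    """Return the attribute followed by all of its descendants."""
    expanded = [attribute]
    for child in children.get(attribute, []):
        expanded.extend(expand_attribute(child))
    return expanded

def naive_predicate_query(attribute, keyword):
    """Look up one key per (sub-)attribute and union the results."""
    keys = [f"{keyword}/{a}/" for a in expand_attribute(attribute)]
    result = set()
    for key in keys:
        result |= index.get(key, set())
    return keys, result

keys, result = naive_predicate_query("name", "Dong")
print(keys)  # ['Dong/name/', 'Dong/firstName/', 'Dong/lastName/']
```

The number of lookups grows with the number of sub-attributes, which is the cost the following approaches try to avoid.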
Approach I: Duplicate Entries for Parent Attributes
                Query: name “Dong” → “Dong/name/”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName      …
                 author
                                                       1000001       Xin          Dong       …
                                                            …         …            …         …
                   Luna Dong       author

                                                                           name


                                                             firstName            lastName
 Inverted List

                   Alon/name/                     1
                  Dong/name/                           1                                 1
                 Dong/firstName/                                                         1
                  Halevy/name/                    1
                   Luna/name/                          1
                   Semex/title/                                        1
                  Xin/lastName/                                                          1
                   Xin/name/                                                             1
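
Approach I can be sketched by indexing every value under its own attribute and under each ancestor attribute. The `parent` map and function names below are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict

# Child attribute -> parent attribute, as in the slide's hierarchy.
parent = {"firstName": "name", "lastName": "name"}

# (instance id, attribute, value) triples; the ids are hypothetical.
attribute_triples = [
    ("a1", "name", "Alon Halevy"),
    ("a2", "name", "Luna Dong"),
    ("p1", "title", "Semex"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def with_ancestors(attribute):
    """Return the attribute followed by its ancestors, nearest first."""
    chain = [attribute]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def build_index_with_duplicates(triples):
    """Duplicate each entry for every ancestor of its attribute."""
    index = defaultdict(set)
    for inst_id, attribute, value in triples:
        for keyword in value.split():
            for label in with_ancestors(attribute):
                index[f"{keyword}/{label}/"].add(inst_id)
    return index

index = build_index_with_duplicates(attribute_triples)
print(sorted(index["Dong/name/"]))  # ['a2', 's1']
```

A query on `name` is now a single lookup, at the price of a larger index (eight rows instead of seven in this example).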
Approach II: Concatenate a Keyword with a Hierarchy Path
                Query: name “Dong” → “Dong/name/*”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName      …
                  author
                                                       1000001       Xin          Dong       …
                                                            …         …            …         …
                    Luna Dong      author

                                                                           name


                                                             firstName            lastName
 Inverted List

                    Alon/name/                    1
                    Dong/name/                         1
                Dong/name/firstName/                                                     1
                   Halevy/name/                   1
                    Luna/name/                         1
                    Semex/title/                                       1
                 Xin/name/lastName/                                                      1
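
Approach II turns a query on a parent attribute into a prefix search over sorted keys. The sketch below assumes keys are kept in a sorted Python list and uses `bisect` to find the prefix range; the sentinel character and all names are illustrative choices, not the authors' implementation.

```python
import bisect

# Child attribute -> parent attribute, as in the slide's hierarchy.
parent = {"firstName": "name", "lastName": "name"}

attribute_triples = [
    ("a1", "name", "Alon Halevy"),
    ("a2", "name", "Luna Dong"),
    ("p1", "title", "Semex"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def hierarchy_path(attribute):
    """Return the path from the root attribute, e.g. 'name/firstName/'."""
    path = [attribute]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return "/".join(reversed(path)) + "/"

rows = {}
for inst_id, attribute, value in attribute_triples:
    for keyword in value.split():
        rows.setdefault(keyword + "/" + hierarchy_path(attribute), set()).add(inst_id)

sorted_keys = sorted(rows)

def prefix_search(prefix):
    """Return the keys starting with `prefix` and the union of their instances."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Assumes no key contains U+FFFF, so this bounds the prefix range.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    matched = sorted_keys[lo:hi]
    instances = set().union(*(rows[k] for k in matched))
    return matched, instances

matched, instances = prefix_search("Dong/name/")
print(matched)  # ['Dong/name/', 'Dong/name/firstName/']
```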
Approach III: Hierarchy Path + Summary Rows
                Query: name “Dong” → “Dong/name/*”

  Alon Halevy
                                                      Departmental Database
                                       Semex: …
                                                           StuID   lastName   firstName      …
                  author
                                                       1000001       Xin          Dong       …
                                                            …         …            …         …
                    Luna Dong      author

                                                                           name


                                                             firstName            lastName
 Inverted List

                    Alon/name/                    1
                    Dong/name//                        1                                 1
                Dong/name/firstName/                                                     1
                   Halevy/name/                   1
                    Luna/name/                         1
                    Semex/title/                                       1
                 Xin/name/lastName/                                                      1
Summary Rows
   Goal: Given a threshold t, answer any prefix
    search by reading no more than t rows.
   Definition:
     The indexed keyword: p//
      E.g. “Dong/name//”
     Rows starting with p/ are shadowed by the
      summary row p//
      E.g. “Dong/name/lastName/” is shadowed by
      “Dong/name//”
Answering Prefix Search with Summary Rows
   Once a summary row is read, ignore the rows
    shadowed by it
   Example (t=1)
          Query: name “Dong” → “Dong/name/*”

Inverted List

                Alon/name/      1
             Dong/name//               1           1
         Dong/name/firstName/                      1
            Halevy/name/        1
                Luna/name/             1
                Semex/title/                   1
          Xin/name/lastName/                       1
Answering Prefix Search with Summary Rows
   Once a summary row is read, ignore the rows
    shadowed by it
   Example (t=1)
          Query: name “Xin” → “Xin/name/*”

Inverted List

                Alon/name/      1
             Dong/name//                 1       1
         Dong/name/firstName/                    1
            Halevy/name/        1
                Luna/name/               1
                Semex/title/                 1
          Xin/name/lastName/                     1
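
The rule "once a summary row is read, ignore the rows shadowed by it" can be sketched as follows. The sketch assumes that a summary row "p//" sorts before the rows it shadows (true when labels start with letters, since "/" sorts before them) and scans all keys for simplicity; the ids and the function name are hypothetical.

```python
# Inverted list after summary rows were added (t = 1); ids are hypothetical.
index = {
    "Alon/name/": {"a1"},
    "Dong/name//": {"a2", "s1"},          # summary row
    "Dong/name/firstName/": {"s1"},
    "Halevy/name/": {"a1"},
    "Luna/name/": {"a2"},
    "Semex/title/": {"p1"},
    "Xin/name/lastName/": {"s1"},
}

def prefix_search(index, prefix):
    """Return (matching instances, number of rows read) for a prefix search."""
    instances, rows_read, shadow = set(), 0, None
    for key in sorted(index):             # keys are visited in sorted order
        if not key.startswith(prefix):
            continue
        if shadow is not None and key.startswith(shadow):
            continue                      # shadowed by a summary row already read
        instances |= index[key]
        rows_read += 1
        if key.endswith("//"):
            shadow = key[:-1]             # "p//" shadows rows starting with "p/"
    return instances, rows_read

print(prefix_search(index, "Dong/name/")[1])  # 1
print(prefix_search(index, "Xin/name/")[1])   # 1
```

Both example queries read a single row, which meets the threshold t=1.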
Adding Summary Rows
    Step 1. Create a summary row for a prefix p if
         Searching prefix p needs to read more than t rows
         There is no p’ with p as prefix such that searching prefix p’ needs to
          read more than t rows
    Step 2. Remove row p if summary row p/ exists
    Example (t=1)
    Inverted List

                    Alon/name/      1
                 Dong/name/                    1
             Dong/name/firstName/                                     1
                Halevy/name/        1
                    Luna/name/                 1
                    Semex/title/                          1
              Xin/name/lastName/                                      1
Adding Summary Rows
    Step 1. Create a summary row for a prefix p if
         Searching prefix p needs to read more than t rows
         There is no p’ with p as prefix such that searching prefix p’ needs to
          read more than t rows
    Step 2. Remove row p if summary row p/ exists
    Example (t=1)
    Inverted List

                    Alon/name/      1
                 Dong/name//                   1                      1
                 Dong/name/                    1
             Dong/name/firstName/                                     1
                Halevy/name/        1
                    Luna/name/                 1
                    Semex/title/                          1
              Xin/name/lastName/                                      1
Adding Summary Rows
    Step 1. Create a summary row for a prefix p if
         Searching prefix p needs to read more than t rows
         There is no p’ with p as prefix such that searching prefix p’ needs to
          read more than t rows
    Step 2. Remove row p if summary row p/ exists
    Example (t=1)
    Inverted List

                    Alon/name/      1
                 Dong/name//                   1                      1
             Dong/name/firstName/                                     1
                Halevy/name/        1
                    Luna/name/                 1
                    Semex/title/                          1
              Xin/name/lastName/                                      1
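
One pass of the two steps above can be sketched as below. This is a simplified reading of the slide, not the authors' algorithm: it assumes the input contains no summary rows yet, treats every "/"-terminated prefix of a key as a candidate, and does not repeat the pass for shorter prefixes. The name `add_summary_rows` is hypothetical.

```python
def add_summary_rows(index, t):
    """Return a copy of `index` with summary rows added for threshold t."""
    keys = list(index)
    # Candidate prefixes: every "/"-terminated prefix of an indexed key.
    prefixes = set()
    for key in keys:
        parts = key.rstrip("/").split("/")
        for i in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:i]) + "/")
    rows_for = {p: [k for k in keys if k.startswith(p)] for p in prefixes}
    heavy = {p for p, rows in rows_for.items() if len(rows) > t}

    result = dict(index)
    for p in heavy:
        # Step 1: skip p if a longer prefix also needs more than t rows.
        if any(q != p and q.startswith(p) for q in heavy):
            continue
        summary = set()
        for k in rows_for[p]:
            summary |= index[k]
        result[p + "/"] = summary
        # Step 2: the row with key p is now covered by its summary row.
        result.pop(p, None)
    return result

index = {
    "Alon/name/": {"a1"},
    "Dong/name/": {"a2"},
    "Dong/name/firstName/": {"s1"},
    "Halevy/name/": {"a1"},
    "Luna/name/": {"a2"},
    "Semex/title/": {"p1"},
    "Xin/name/lastName/": {"s1"},
}
result = add_summary_rows(index, t=1)
print(sorted(result["Dong/name//"]))  # ['a2', 's1']
```

With t=1, only the prefix "Dong/name/" qualifies, so the result matches the final inverted list of the example.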
Answering Neighborhood Keyword Queries
  Alon Halevy    Query: Semex → “Semex/*”
                   ~author            Semex: …    Departmental Database
                  author                               StuID    LastName   FirstName   …
                                                      1000001     Xin        Dong      …
                    Luna Dong     author                …          …          …        …

                                    ~author




 Inverted List
                 Alon/author/                                     1
                 Alon/name/                   1
                 Dong/author/                                     1
                 Dong/name/                       1
           Dong/name/firstName/                                                   1
                 Halevy/name/                 1
                 Luna/name/                       1
                Semex/~author/                1   1
                 Semex/title/                                     1
           Xin/name/LastName/                                                     1
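
A neighborhood keyword query then reduces to a prefix search on the keyword alone. The sketch below groups the matching rows by label path so that rows reached through an attribute (such as `title`) can be told apart from rows reached through an association (such as `~author`); the ids and the function name are assumptions for illustration.

```python
# Inverted list from the slide; instance ids are hypothetical.
index = {
    "Alon/author/": {"p1"},
    "Alon/name/": {"a1"},
    "Dong/author/": {"p1"},
    "Dong/name/": {"a2"},
    "Dong/name/firstName/": {"s1"},
    "Halevy/name/": {"a1"},
    "Luna/name/": {"a2"},
    "Semex/~author/": {"a1", "a2"},
    "Semex/title/": {"p1"},
    "Xin/name/LastName/": {"s1"},
}

def neighborhood_keyword_query(index, keyword):
    """Prefix search on 'keyword/'; return {label path: instances}."""
    prefix = keyword + "/"
    by_label = {}
    for key, instances in index.items():
        if key.startswith(prefix):
            label = key[len(prefix):].rstrip("/")
            by_label.setdefault(label, set()).update(instances)
    return by_label

answer = neighborhood_keyword_query(index, "Semex")
print(sorted(answer))  # ['title', '~author']
```

Here the `title` row yields the paper itself and the `~author` row yields the two associated persons.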
Implementation Details
   Our index extends the Lucene indexing tool
     Lucene stores an inverted list as a sorted array
   Implemented in Java
   Run on a machine with four 3.2GHz CPUs with
    1024KB cache and 1GB of memory
Experimental Setting
   Data sets
    A 50MB personal data set
     Two 10GB XML data sets: Wikipedia, XMark Benchmark

   Queries: with one predicate or keyword
     Predicate Query with leaf attributes
     Predicate Query with branch attributes
     Predicate Query with associations
     Neighborhood Keyword Query
   Measure: in milliseconds
     Index-lookup time
     Query-answering time
Our Indexing Method Significantly Improves Query Answering

                                    Plain Inverted List (10.6MB)    Extended Inverted List (15.2MB)
Query Type                          Index Lookup   Query Answer     Index Lookup   Query Answer
                                        (ms)           (ms)             (ms)           (ms)
Pred Query with leaf attributes           2             22                4              6
Pred Query with branch attributes         3             43                4              6
Pred Query with associations              3             88                6             17
Neighborhood Keyword Query               18           4174               48             97
XML Index [Kaushik et al, Sigmod’05]
   Three indexes
     Inverted list: index each attribute value on its text
     Structured index: index each attribute value on the
      labels of the attribute and its ancestor attributes
     Relationship index: index each instance on its
      associated instances
Our Indexing Method Performs Better Than XML Indexes

                                    XML Index (28.1MB)              Extended Inverted List (15.2MB)
Query Type                          Index Lookup   Query Answer     Index Lookup   Query Answer
                                        (ms)           (ms)             (ms)           (ms)
Pred Query with leaf attributes           7              9                4              6
Pred Query with branch attributes         7             11                4              6
Pred Query with associations            301            415                6             17
Neighborhood Keyword Query              365            488               48             97
Our Indexing Method Scales Well

                                    Wikipedia     XMark w/o asso   XMark with asso
Index                               4.15hr        6.64hr           12.72hr
                                    (1.13GB)      (3.04GB)         (4.08GB)
Pred Query with leaf attributes     156           94               116
Pred Query with branch attributes   -             67               93
Pred Query with associations        -             -                217
Neighborhood Keyword Query          1646          1838             13468
Conclusions
   Contributions: An index for heterogeneous
    data
     Index heterogeneous data from multiple sources
      through a (virtual) central triple base
     Extend inverted lists to capture both text and
      structure of data
   Future Work
     Support value heterogeneity
     Incorporate approximate matching of schema
      terms and object instances