Scalable Hybrid Keyword Search on Distributed Database

Jungkee Kim
Florida State University
Community Grids Laboratory, Indiana University
Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM 2005)

Motivation
- Where is the information?
- Internet

Outline
- Two Typical Search Paradigms
- Problems of Current Search Approaches
- Local Hybrid Keyword Search
- Hybrid Search on Distributed Databases

Two Typical Search Paradigms
- Searching over structured data: relational databases
- Searching over unstructured data: information retrieval
- Internet environment: semistructured data (XML)
- Web search engines: technologies from information retrieval
- Keyword search in DB
- Hybrid keyword search?

Current Approaches – Keyword-only Search
- Web search engines
  - Web crawlers visit Web pages and collect keyword-based text indexes.
  - Fast information retrieval
- Keyword search in databases
  - Web integration on legacy DBMS
  - Dynamic Web publication through embedded DB
  - Easy to use without knowledge of the DB schema

Problems of Current Approaches – Keyword-based
- Web search engines
  - Cannot collect every connected resource
  - Query results are often unrelated
- Keyword search in databases
  - Loses the inherent meaning of the schema
  - Query results are not based on the semantic schema

Current Approaches – Semantic
- Semantic Web
  - Multiple relation links with directed labeled graphs; machines can understand the relationships between different resources
  - Describes metadata about resources
  - Represents the relations of objects on the Web; the object terms are defined under a specific description, an ontology

Problems of Current Approaches – Semantic Web
- Ontology design is sophisticated
- Lack of unified definition
- Limited adoption

Our Approach
- Hybrid search mechanisms: semantic metadata + keyword search
- Semantic solution
  - Semantic Web might be better than hybrid search
  - Hybrid search must be better than Web search engines
- Simplicity
  - Hybrid search is simpler than Semantic Web

Hybrid Keyword Search Service
- A search service retrieves the target data that matches a search query.
- Unstructured data: a file containing data (MS Word, PDF, PS documents)
- Metadata: structured or semistructured data (XML)
- We used an XML-enabled relational DBMS, and a native XML DB along with a text search library (Apache Xindice + Jakarta Lucene), to search against both metadata and text.

How to Combine? (1)
- Two entity sets and a relationship in a relational DBMS
- We can obtain the hybrid search result using a nested subquery

How to Combine? (2)
- A hash table is used for joining search results in the non-DBMS-based system (Apache Xindice + Lucene)

Local Query Processing – XML (1)
[Figure: Average XML Query Time]
- XML-enabled RDB
- DBLP XML records (1,000 to 10,000)
- Non-indexed matches except the year match are bound by the number of matches.
- Combined query time depends on the number of year query results

Local Query Processing – XML (2)
[Figure: Average XML Query Time]
- Apache Xindice
- DBLP XML records (1,000 to 10,000)
- Indexed approximate matches for text elements in XML instances are as bad as non-indexed queries
- Exact matches are bound by the number of matches.

Local Query Processing – Hybrid (1)
- Hybrid search query performance measurement
- XML-enabled RDB
- For 100,000 XML instances and 100,000 text documents
- Small result set: 4 XML and a keyword matches
- Large result set: 7,752 XML and 41,889 documents (3,227)

  Metadata        Author      Year (nested subquery)   Year (hash table)
  Few keywords    0.04 sec.   82.9 sec.                5.70 sec.
  Many keywords   0.48 sec.   Half hour                6.96 sec.

Local Query Processing – Hybrid (2)
- Hybrid search query performance measurement
- Apache Xindice + Jakarta Lucene
- For 10,000 XML instances and 10,000 text documents
- Small result set: 2 XML and a keyword matches
- Large result set: 192 XML and 4,562 documents (41)

Discussion – Local Hybrid Search
- The XML-enabled RDB responds properly except under some extreme query loads.
- Inefficient query plan and query optimization in an old version; better performance in a newer version
- A native XML DB (Apache Xindice) had very limited scalability. (No accurate query result over 16,000 XML instances)
- We will generalize hybrid search to a distributed environment.

Hybrid Search on Distributed Databases
- Data independence: logically and physically independent; the same schema (no change), data encapsulation in each machine
- Network transparency: depends on the MOM or P2P framework
- No replication: restricted to a computer cluster
- Fragment: full partition; horizontal fragmentation
- The query result for the distributed databases is the collection of query results from individual database queries.

Scalable Hybrid Search Architecture on DDBS
[Figure: a client acts as publisher for a query topic and sends a query message through a message broker in a cooperating broker network; each search service acts as subscriber for the query topic, then as publisher for a temporary topic; the client, as subscriber for the temporary topic, receives the result messages.]
- Distributed databases based on NaradaBrokering

Query Processing – DDBS (1)
- 100,000 XML and 100,000 documents in 8 machines (12,500 each)
- Few keyword match (1-3) on 1 machine only
- RDB: 0.04 sec. for few keyword match
[Figure: Avg. response time for an author exact match query over 8 search services]

Query Processing – DDBS (2)
- 100,000 XML and 100,000 documents in 8 machines (12,500 each)
- RDB: half hour or 6.96 sec. (hash table)
[Figure: Avg. response time for a year match query over 8 search services]

Coupling vs. Scalability
- From ICDE 2002 Tutorial

Query Propagate and Results back on a P2P Network

Peer group architecture of the P2P Search

Conclusion
- We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web
- Our architecture contributed a performance improvement for some queries
- It extends the scalability of Xindice XML queries, which were limited to a small data size on a single machine
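
Example sketch: hash-table join over horizontal fragments

The slides describe joining metadata and keyword results with a hash table, and collecting per-database results as the answer for the distributed system. The following is a minimal sketch of that idea in Python, not the authors' implementation. It uses plain in-memory lists as stand-ins for the XML store and the text index, and calls no Xindice, Lucene, or NaradaBrokering APIs. All names (Record, doc_id, metadata_search, keyword_search, and so on) and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    doc_id: str  # hypothetical key linking a metadata instance to its document
    author: str
    year: int
    text: str    # stand-in for the unstructured document body


def metadata_search(partition, year):
    """Stand-in for an XML metadata query: doc_ids whose year matches."""
    return [r.doc_id for r in partition if r.year == year]


def keyword_search(partition, keyword):
    """Stand-in for a text-index lookup: doc_ids whose text has the keyword."""
    kw = keyword.lower()
    return [r.doc_id for r in partition if kw in r.text.lower().split()]


def hash_join(metadata_hits, keyword_hits):
    """Build a hash table on one result list and probe it with the other."""
    table = set(metadata_hits)                      # build phase
    return [d for d in keyword_hits if d in table]  # probe phase


def local_hybrid_search(partition, year, keyword):
    """Hybrid search on one machine: metadata match AND keyword match."""
    return hash_join(metadata_search(partition, year),
                     keyword_search(partition, keyword))


def distributed_hybrid_search(partitions, year, keyword):
    """Collect the local results of every fragment. With full horizontal
    partitioning and no replication, concatenation is the whole answer."""
    results = []
    for partition in partitions:
        results.extend(local_hybrid_search(partition, year, keyword))
    return results


# Toy data: two horizontal fragments, as if held by two search services.
partitions = [
    [Record("d1", "author-a", 2005, "hybrid keyword search"),
     Record("d2", "author-b", 2004, "keyword search in databases")],
    [Record("d3", "author-c", 2005, "message broker network"),
     Record("d4", "author-a", 2005, "scalable keyword search")],
]

print(distributed_hybrid_search(partitions, 2005, "keyword"))  # ['d1', 'd4']
```

The join cost here is linear in the sizes of the two result lists, which is one plausible reading of why the hash-table timings in the table stay in seconds while the nested subquery does not. In the architecture on the slides, the loop over partitions would be replaced by a query message published to all search services, with results arriving asynchronously.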
