

DB & IR from a DB Viewpoint
Gerhard Weikum

Adding Ranking to DB
(= Adding “Semantics” to IR)

Four quadrants (unstructured vs. structured search × structured vs. unstructured data):
• Unstructured search (keywords) on structured data (records):
  keyword search on relational graphs (BANKS, Discover, DBExplorer, …) + Web 2.0
• Unstructured search on unstructured data (documents):
  IR systems, search engines + digital libraries + enterprise search
• Structured search (SQL, XQuery) on structured data:
  DB systems + text + relaxation & approximation + ranking
• Structured search on unstructured data:
  querying entities & relations from IE (Libra, ExDB, NAGA, …)

Trend: quadrants getting blurred
       towards DB&IR technology integration

     Why DB&IR ? – Application Needs
Simplify life for application areas like:
 • Global health-care management for monitoring epidemics
 • News archives for journalists, press agencies, etc.
 • Product catalogs for houses, cars, vacation places, etc.
 • Customer support & CRM in insurance, telecom, retail, software, etc.
 • Bulletin boards for social communities
 • Enterprise search for projects, skills, know-how, etc.
 • Personalized & collaborative search in digital libraries, Web, etc.
 • Comprehensive archive of blogs with time-travel search

Typical data:
Disease (DId, Name, Category, Pathogen …)      UMLS-Categories ( … )
Patient (… Age, HId, Date, Report, TreatedDId) Hospital (HId, Address …)
Typical query:
symptoms of tropical virus diseases and reported anomalies
with young patients in central Europe in the last two weeks

DB Tag Cloud

[Word cloud: Probability Theory, Information Extraction, Text Mining,
Statistics, Pay-As-You-Go, End-Users, Ranking, Schema-Free, Dataspaces,
Heterogeneity, Record Linkage, Uncertainty, E-Science, Schema Evolution,
Lineage, Data Integration, Focus on Programmer, Structure, Logic, XQuery,
Workflows, Scalability, Database, XML, Indexing, SQL, Sensors, Algebra,
Top-k Query, Execution Plan, Query Optimizer, Threshold Algorithm,
Selectivity Estimation, Query Rewriting, Similarity Search, Hash Sketches,
Histogram, Join Order, Magic Sets]

• DB & IR Motivation 1: Text Matching
• DB & IR Motivation 2: Too-Many-Answers Problem
• DB & IR Motivation 3: Schema Relaxation
• DB & IR Motivation 4: Information Extraction & Entity Search

                DB & IR Motivation 1:
                  Text Matching

• Add keyword search to structured data (relations)
  or semistructured data (XML trees)
• Combine this with structured predicates (SQL, XPath)
• Add scoring and ranking
  (tf*idf, Prob. IR, … + PageRank or HITS when applicable)

  WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98]
Add text-similarity selection and join to relational algebra
Example: Select * From Movies M, Reviews R
                Where M.Plot ~ ”fight“ And M.Year > 1990 And R.Rating > 3
                And M.Title ~ R.Title And M.Plot ~ R.Comment
Movies
Title    Plot                                              Year
Matrix   In the near future … computer hacker Neo …
         … fight training …                                1999
Hero     In ancient China … sword fights … Broken Sword …  2002
Shrek 2  In Far Far Away … our lovely hero
         fights with cat killer …                          2004

Reviews
Title                Comment                               Rating
Matrix 1             … cool fights … new techniques …      4
Matrix Reloaded      … fights … and more fights,
                     fairly boring …                       1
Matrix Eigenvalues   … matrix spectrum … orthonormal …     5
Ying xiong aka Hero  … fight for peace … sword fight …
                     dramatic colors …                     5

• DB&IR for query-time data integration
• More recent work: MinorThird, Spider, DBLife, …
• scoring models fairly ad hoc
Scoring and ranking:
  x_j ~ tf (word j in x) * idf (word j), with dampening & normalization
  s (<x,y>, q: A~B) = cosine (x.A, y.B)
  s (<x,y>, q1 ∧ … ∧ qm) = ∏_{i=1..m} s (<x,y>, qi)
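As a concrete illustration of this soft-matching score, here is a minimal sketch of dampened tf*idf vectors with cosine similarity; the function names `tfidf_vectors` and `cosine` are my own, not from the WHIRL paper:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Bag-of-words tf*idf vectors with dampened tf and L2 normalization."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # +1 keeps ubiquitous words nonzero
    vecs = []
    for d in docs:
        v = {w: (1.0 + math.log(c)) * idf[w] for w, c in d.items()}  # dampened tf
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse unit-length vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())
```

A conjunctive query q1 ∧ … ∧ qm would then be scored by multiplying the per-condition cosines, as in the formula above.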

       XML Search for Higher Precision
    Keyword query: Max Planck                Keyword query: Madonna child


               Concept query:                 Entity-Relation query:
               Person = „Max Planck“          Person x = „Madonna“
                                              & HasChild (x,y)

Semantic XPath Full-Text query:
    /Article [ftcontains(//Person, ”Max Planck“)]
or even:
    /Article
        [ftcontains(//Person, ”Max Planck“)]
        [ftcontains(//Work, ”quantum physics“)]
      //Children[@Gender = ”female“]//Birthdates
should ideally be inferred automatically from the keyword query
+ user background and implicit feedback
XML Text Search with Result Ranking
Which professors from Germany give courses on IR and run projects on XML?
//Professor
  [//Country = ”Germany“]
  [//Course = ”IR“]
  [//Research = ”XML“]

[Example document tree: Professor – Gerhard Weikum – Address (City: SB,
Country: Germany) – Teaching (Course; Title: IR; Syllabus: Information
retrieval, Book, Article, …) – Research (Project: Intelligent Search of
Heterogeneous XML Data; Funding: …)]

Query predicates:
• tag-term conditions (subtree contains terms)

Scoring and ranking:
• element-specific tf*idf or BM25 for content conditions (tag-term scores)
• score aggregation with prob. independence
• extended TA for fast query execution
       From Tables and Trees to Graphs
          [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]

Schema-agnostic keyword search over multiple tables:
graph of tuples with foreign-key relationships as edges
Conferences (CId, Title, Location, Year) Journals (JId, Title)
CPublications (PId, Title, CId)          JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person)                    Editors (CId, Person)
Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95
Result is a connected tree with nodes that contain
as many query keywords as possible
(query processing over a relational DB exploits the schema;
also applies to ER graphs, e.g. from IE)

Related use cases:
• XML beyond trees
• RDF graphs
• generate meaningful joins
• social networks

Ranking:
  score(T) = aggr { nodeScore(v) | v ∈ T } combined with aggr { edgeScore(e) | e ∈ T }
  with nodeScore based on tf*idf or prob. IR
  and edgeScore reflecting importance of relationships (or confidence, authority, etc.)
Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
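The node/edge score aggregation can be sketched as follows; `tree_score`, the linear α weighting, and the averaging over edges are illustrative assumptions, not the exact BANKS/Discover formula:

```python
def tree_score(tree_nodes, tree_edges, node_score, edge_score, alpha=0.7):
    """Hypothetical aggregation: alpha weights keyword-content quality
    (node scores) against connection quality (average edge score)."""
    ns = sum(node_score.get(v, 0.0) for v in tree_nodes)
    es = sum(edge_score.get(e, 0.0) for e in tree_edges) / max(len(tree_edges), 1)
    return alpha * ns + (1 - alpha) * es

def top_k_trees(candidates, node_score, edge_score, k=2):
    """Rank candidate answer trees (each a (nodes, edges) pair) and keep the best k."""
    scored = [(tree_score(n, e, node_score, edge_score), i)
              for i, (n, e) in enumerate(candidates)]
    return sorted(scored, reverse=True)[:k]
```

Enumerating the candidate trees themselves is the hard part (Steiner trees, as noted above); this sketch only covers the scoring step.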
          Keyword Search on Graphs: Semantics
Subtleties of Interconnection Semantics
[S. Cohen et al. 2005, B. Kimelfeld et al. 2007]
[Example entity graph with nodes Arjen de Vries, CWI, Netherlands, EDBT School,
Bolzano, Italy, VLDB Endowment, Weikum, MPII, Saarbrücken, Germany, EU and
edges such as member, director, citizen, speaker, trustee, city, country]
• directed vs. undirected graphs, strict vs. relaxed
• conditions on nodes, conditions on edges (node pairs)
• all conditions mandatory or some optional
• dependencies among conditions
                     DB & IR Motivation 2:
                 Too-Many-Answers Problem
Precise queries yield too many or too few results
(actually indicating uncertainty in the search goal)

Search paradigms:
• top-k, n-nearest-neighbors, score aggregation
• preference search for skylines
Top-k results from ranked retrieval on
• product catalog data: aggregate preference scores for
   properties such as price, rating, sports facilities, beach type, etc.
• multimedia data: aggregate similarity scores for color, shape, etc.
• text, XML, Web documents: tf*idf, authority, recency, spamRisk, etc.
• Internet sources: aggregate properties from distributed sources
   e.g., site1: restaurant rating, site2: prices, site3: driving distances, etc.
• social networks: ranking based on friends, tags, ratings, etc.
         Probabilistic Ranking for SQL
               [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS‘06]
SQL queries that return many answers need ranking
• Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …)
 Select * From Houses Where View = ”Lake“ And City In (”Redmond“, ”Bellevue“)
• Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …)
  Select * From Movies Where Genre = ”Romance“ And Era = ”90s“

Rank by the odds that tuple d with attribute values over X ∪ Y is relevant
for query q: X1 = x1 ∧ … ∧ Xm = xm
(X: the attributes specified in q; Y: the remaining attributes)

Estimate probabilities, exploiting workload W:
 Example: frequent queries
    • … Where Genre = ”Romance“ And Actor1 = ”Hugh Grant“
    • … Where Actor1 = ”Hugh Grant“ And Actor2 = ”Julia Roberts“
    boosts HG and JR movies in ranking for Genre = ”Romance“ And Era = ”90s“
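A heavily simplified sketch of this workload idea (the smoothing constant 0.1 and the helper name `rank_by_workload` are my own; the TODS'06 model is a proper odds-ratio estimate):

```python
def rank_by_workload(workload, query, tuples):
    """Boost tuples whose unspecified attribute values co-occur with the
    query's conditions in past workload queries (each a dict of conditions)."""
    # workload queries sharing at least one condition with the current query
    related = [w for w in workload if any(w.get(a) == v for a, v in query.items())]

    def score(t):
        s = 1.0
        for a, v in t.items():
            if a in query:          # specified attributes don't discriminate
                continue
            freq = sum(1 for w in related if w.get(a) == v)
            s *= (freq + 0.1) / (len(related) + 0.1)  # smoothed co-occurrence odds
        return s

    return sorted(tuples, key=score, reverse=True)
```

On the movie example above, a workload full of Hugh Grant queries pushes his movies to the top of the Romance/90s result list.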
Threshold Algorithm (TA)                 [Fagin 01, Güntzer 00, Nepal 99]
simple & DB-style; needs only O(k) memory; for monotone score aggregation

Data items: d1, …, dn, e.g. d1 with s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3); aggr: summation

Threshold algorithm (TA):
scan index lists; consider d at position pos_i in L_i;
high_i := s(t_i, d);
if d ∉ top-k then {
  look up s_ν(d) in all lists L_ν with ν ≠ i;
  score(d) := aggr { s_ν(d) | ν = 1..m };
  if score(d) > min-k then {
    add d to top-k and remove min-score d’;
    min-k := min { score(d’) | d’ ∈ top-k };
  }
};
threshold := aggr { high_ν | ν = 1..m };
if threshold ≤ min-k then exit;
Index lists (k = 2, aggr = sum):
t1:  d78: 0.9   d23: 0.8   d10: 0.8   d1: 0.7    d88: 0.2   …
t2:  d64: 0.9   d23: 0.6   d10: 0.6   d12: 0.2   d78: 0.1   …
t3:  d10: 0.7   d78: 0.5   d64: 0.3   d99: 0.2   d34: 0.1   …

[Scan depths 1–4 shown on the slide; each newly seen d is completed by
random accesses to the other lists]
Final ranking: 1. d10 (score 2.1), 2. d78 (score 1.5)
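The TA pseudocode translates almost directly into Python; this sketch assumes in-memory lists of (doc, score) pairs sorted by descending score, with aggr = sum:

```python
import heapq

def threshold_algorithm(index_lists, k, aggr=sum):
    """TA over sorted index lists of (doc, score) pairs, descending by score."""
    seen = {}        # doc -> aggregated score (after random accesses)
    top = []         # min-heap of (score, doc), size <= k
    pos = 0
    while pos < max(len(L) for L in index_lists):
        high = []    # score at current scan position in each list
        for L in index_lists:
            if pos >= len(L):
                high.append(0.0)
                continue
            d, s = L[pos]
            high.append(s)
            if d not in seen:
                # random accesses: look up d in every list
                # (a real system would use an index instead of dict(L))
                seen[d] = aggr(dict(Lj).get(d, 0.0) for Lj in index_lists)
                if len(top) < k:
                    heapq.heappush(top, (seen[d], d))
                elif seen[d] > top[0][0]:
                    heapq.heapreplace(top, (seen[d], d))
        threshold = aggr(high)
        if len(top) == k and threshold <= top[0][0]:
            break    # no unseen doc can beat the current min-k
        pos += 1
    return sorted(top, reverse=True)
```

On the example lists above this stops at scan depth 4 and returns d10 (2.1) and d78 (1.5).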

TA with Sorted Access Only (NRA)         [Fagin 01, Güntzer et al. 01]
sequential access (SA) faster than random access (RA) by factor of 20–1000

Data items: d1, …, dn, e.g. d1 with s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3); aggr: summation

No-random-access algorithm (NRA):
scan index lists; consider d at position pos_i in L_i;
E(d) := E(d) ∪ {i}; high_i := s(t_i, d);
worstscore(d) := aggr { s(t_ν, d) | ν ∈ E(d) };
bestscore(d)  := aggr { worstscore(d), aggr { high_ν | ν ∉ E(d) } };
if worstscore(d) > min-k then {
  add d to top-k;
  min-k := min { worstscore(d’) | d’ ∈ top-k };
}
else if bestscore(d) > min-k then
  cand := cand ∪ {d};
threshold := max { bestscore(d’) | d’ ∈ cand };
if threshold ≤ min-k then exit;
Index lists (k = 1, aggr = sum):
t1:  d78: 0.9   d23: 0.8   d10: 0.8   d1: 0.7    d88: 0.2   …
t2:  d64: 0.8   d23: 0.6   d10: 0.6   d12: 0.2   d78: 0.1   …
t3:  d10: 0.7   d78: 0.5   d64: 0.4   d99: 0.2   d34: 0.1   …

Scan depth 1: d78 [worstscore 0.9, bestscore 2.4], d64 [0.8, 2.4], d10 [0.7, 2.4]
Scan depth 2: d78 [1.4, 2.0], d23 [1.4, 1.9], …
Scan depths 3–4: d10 [2.1, 2.1]; no candidate’s bestscore exceeds min-k = 2.1 → STOP
Final result: top-1 = d10 (score 2.1)
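A corresponding sketch of NRA with sorted access only; the stopping test checks whether any non-top-k candidate's bestscore still exceeds min-k:

```python
def nra(index_lists, k):
    """No-Random-Access algorithm: sorted access only, maintaining
    [worstscore, bestscore] intervals per document (aggr = sum)."""
    m = len(index_lists)
    known = {}               # doc -> {list index: score seen so far}
    high = [0.0] * m         # score at current scan position in each list
    for pos in range(max(len(L) for L in index_lists)):
        for i, L in enumerate(index_lists):
            if pos < len(L):
                d, s = L[pos]
                known.setdefault(d, {})[i] = s
                high[i] = s
            else:
                high[i] = 0.0
        # worstscore: assume 0 in unseen lists; bestscore: assume high_i there
        worst = {d: sum(sc.values()) for d, sc in known.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in known[d])
                for d in known}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = worst[topk[-1]] if len(topk) == k else 0.0
        if len(topk) == k and all(best[d] <= min_k for d in best if d not in topk):
            break            # no candidate can still overtake the top-k
    return [(worst[d], d) for d in topk]
```

On the example lists above this terminates early (around scan depth 3–4) with d10 as the top-1 answer.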
 History of TA Family and Related Top-k Work
                                  Bruno 02 Cao 04    Michel 05 (KLEE)
                                  (distr.TA) (TPUT)  Balke 05 (P2P top-k)
                                                     Theobald 05 (XML)
                                           Kaushik 04 (XML)
                           Ciaccia 00      Theobald 04     Bast 06
                           (PAC top-k)     (prob. top-k)   (sched.)
                                  Chang 02 (exp. pred.)       Li 06    Xin 07
              Chaudhuri 96        Kersten 02 (PQs)            (ad-hoc) (ad-hoc)
                 Fagin 97 (A0) Fagin 01 (TA) Fagin 03
                        Nepal 99 Natsev 01
                           Güntzer 00
                                Güntzer 01    Agrawal 03
                                     Balke 02 (many answers)
                                          Ilyas 03 Ilyas 05

                Moffat 96        Anh 01                    Anh 06
Buckley 85         Pfeifer 97                              Fuhr 06
                Persin 96

1985         1995               2000               2005


                    DB & IR Motivation 3:
                    Schema Relaxation

• Traditional DB focus was one DB with perfectly clean data
• New focus is many DBs with on-the-fly fusion (partial integration)
• Modern apps (mashups etc.) even require many DBs
  with non-DB sources (blogs, maps, sensors, etc.), many with
  no schema, partial typing, or rapidly evolving schema
• More appropriate abstraction is data spaces
  (with pay-as-you-go schema)
• Calls for schema-free or schema-relaxed querying,
  which entails inherent need for ranking

XML IR with Background Ontology
//Professor
  [//Country = ”Germany“]
  [//Course = ”IR“]
  [//Research = ”XML“]

[Example data: Professor Gerhard Weikum – Address (City: SB, Country: Germany),
Teaching (Course; Title: IR; Syllabus: Information retrieval, Book, Article, …),
Research (Project; Title: Intelligent Search of Heterogeneous XML Data;
Funding: EU) vs. Lecturer Ralf Schenkel – Address: Saarland D-66123,
Activities (Seminar; Contents: Future of Web search …; Literature: …;
Scientific: INEX task coordinator (Initiative for the Evaluation of XML …);
Other: …; Sponsor: …)]

[Ontology excerpt around “professor”: scientist (Hyponym, 0.749) with synonyms
researcher, scholar, academic, academician, faculty member; Related (0.48):
lecturer, teacher, mentor; other nearby concepts: magician, artist, director,
wizard, intellectual, investigator]

Query expansion for tags & terms based on
• related concepts in ontology/thesaurus
• strength of relatedness (or correlation)
→ robustness and efficiency issues
            XML: Structural Similarity
Structural similarity and ranking
based on tree edit distance (FleXPath, Timber, …)

Query 1: //movie [ftcontains(/plot/location, ”Tibet“)]
Query 2: //movie [ftcontains(//plot, ”happily ever after“)] //actor

[Tree sketches: the two query trees (movie → plot → location; movie with plot
and actor children) matched against data trees containing movie, actor,
director, plot, and location nodes in varying structures]
Score = min. relaxation cost to unify structures of query and data trees
        based on node insertion, node deletion,
        node generalization, edge generalization,
        subtree promotion, etc.
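One simple instance of such a relaxation cost, restricted to root-to-leaf tag paths: an edit distance in which node generalization (via a hypothetical `generalizes` map) is cheaper than node insertion or deletion. FleXPath's full tree-level model is richer; this is only a sketch of the idea:

```python
def path_relax_cost(query_path, data_path, generalizes=None):
    """Minimum relaxation cost between a query tag path and a data tag path.
    Unit cost for node insertion/deletion; cheaper cost (0.5, an assumed
    constant) when the data tag is a known generalization of the query tag."""
    generalizes = generalizes or {}
    m, n = len(query_path), len(data_path)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = float(i)              # delete all remaining query nodes
    for j in range(n + 1):
        D[0][j] = float(j)              # insert all remaining data nodes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            q, d = query_path[i - 1], data_path[j - 1]
            if q == d:
                sub = 0.0               # exact tag match
            elif d in generalizes.get(q, ()):
                sub = 0.5               # node generalization is cheap
            else:
                sub = 1.0               # full relabel
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + sub)
    return D[m][n]
```

For example, relaxing //movie/plot/location to a data path //movie/plot costs one deletion, while generalizing "actor" to "person" (if declared) costs only 0.5.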
               DB & IR Motivation 4:
      Information Extraction & Entity Search

• Best content producer (rate*quality) is text and speech
  (scientific literature, news, blogs, photo/video tags, etc.)
• Extract entities, attributes, relations from text
• Build knowledge bases as ER graphs (with uncertainty)
• Move from keyword search to the level of
  entity/relation querying
• Most likely already in use by leading search engines
  for attractive vertical domains (travel, electronics, entertainment)

Entity Search: Example Google
Which politicians are also scientists?

What is lacking?
• data is not knowledge
  → extraction and organization
• keywords cannot express advanced user intentions
  → concepts, entities, properties, …

Information Extraction (IE): Text to Records

Person           BirthDate    BirthPlace   …
Max Planck       4/23, 1858   Kiel
Albert Einstein  3/14, 1879   Ulm
Mahatma Gandhi   10/2, 1869   Porbandar

Person       ScientificResult
Max Planck   Quantum Theory

Constant            Value          Dimension
Planck‘s constant   6.226×10^23    Js

extracted facts often have confidence < 1
→ DB with uncertainty (probabilistic DB)

Person       Collaborator
Max Planck   Albert Einstein
Max Planck   Niels Bohr

Person       Organization
Max Planck   KWG / MPG

combine NLP, pattern matching, lexicons, statistical learning

  Entity Reconciliation (Fuzzy Matching,
  Entity Matching/Resolution, Record Linkage)
• same entity appears in
    • different spellings (incl. mis-spellings, abbr., multilingual, etc.)
     e.g. Brittnee Speers vs. Britney Spears, M-31 vs. NGC 224,
          Microsoft Research vs. MS Research, Rome vs. Roma vs. Rom
    • different levels of completeness
     e.g. Joe Hellerstein (UC Berkeley) vs. Prof. Joseph M. Hellerstein, CA
          Larry Page (born Mar 1973) vs. Larry Page (born 26/3/73)
          Microsoft (Redmond, USA) vs. Microsoft (Redmond, WA 98002)
• different entities happen to look the same
    e.g. George W. Bush vs. George W. Bush, Paris vs. Paris
• Problem even occurs within structured databases and
   requires data cleaning when integrating multiple databases
• Current approaches are:
  edit distance measures, context consideration, reference dictionaries,
  statistical learning in graph models, etc.
          Entity-Search Ranking with LM
       [Z. Nie et al.: WWW 2007; see also T. Cheng: VLDB 2007]
Standard LM for docs with background model (smoothing):
   P(q | d) = ∏_{t ∈ q} [ λ P(t | d) + (1 − λ) P(t | corpus) ]

Assume entity e was seen in k records r1, …, rk
extracted from k pages d1, …, dk with accuracy α1, …, αk

record-level LM:
   P(q | e) ∝ Σ_{i=1..k} αi P(q | ri)
   with context window around ri in di (default: only ri itself)

alternatively consider individual attributes e.aj with importance βj
extracted from page di with accuracy γij
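A runnable sketch of the two ingredients, smoothed query likelihood and an accuracy-weighted mixture over an entity's records; the exact combination in the cited papers differs in details, and the normalization by Σαi is my own simplification:

```python
from collections import Counter

def lm_prob(query, text, corpus, lam=0.8):
    """Jelinek-Mercer smoothed query likelihood P(q | text)."""
    tf, ctf = Counter(text.split()), Counter(corpus.split())
    dlen, clen = max(sum(tf.values()), 1), max(sum(ctf.values()), 1)
    p = 1.0
    for t in query.split():
        p *= lam * tf[t] / dlen + (1 - lam) * ctf[t] / clen
    return p

def entity_score(query, records, accuracies, corpus, lam=0.8):
    """Accuracy-weighted mixture of the LMs of the records r_i in which
    entity e was seen (extraction accuracies alpha_i as weights)."""
    z = sum(accuracies)
    return sum((a / z) * lm_prob(query, r, corpus, lam)
               for r, a in zip(records, accuracies))
```

The background model keeps scores nonzero for query terms missing from a record, so entities with partially matching records still receive a (lower) score.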

      Entity-Search Ranking by Link Analysis
[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007]

EntityAuthority (ObjectRank, PopRank, HubRank, EVA, etc.):
• define authority transfer graph
  among entities and pages with edges:
   • entity ® page if entity appears in page
   • page ® entity if entity is extracted from page
   • page1 ® page2 if there is hyperlink or implicit link between pages
   • entity1 ® entity2 if there is a semantic relation between entities
• edges can be typed and (degree- or weight-) normalized and
  are weighted by confidence and type-importance
• also applicable to graph of DB records with foreign-key relations
 (e.g. bibliography with different weights of publisher vs. location for conference record)
• compared to standard Web graph, ER graphs of this kind
  have higher variation of edge weights
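The authority-transfer idea reduces to power iteration over a weighted graph; this sketch (the graph shape and function name are illustrative) normalizes edge weights per source node, so confidence- and type-weighted edges split the transferred authority proportionally:

```python
def entity_authority(out_edges, damping=0.85, iters=50):
    """PageRank-style power iteration over an authority-transfer graph.
    out_edges: node -> list of (target, weight); weights encode extraction
    confidence and relation-type importance, normalized per source node."""
    nodes = set(out_edges) | {t for outs in out_edges.values() for t, _ in outs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in out_edges.items():
            total = sum(w for _, w in outs)
            for t, w in outs:
                nxt[t] += damping * rank[v] * w / total
        rank = nxt
    return rank
```

An entity mentioned in many pages, or linked by high-weight semantic relations, accumulates more authority than one reached only via low-weight edges.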

YAGO Knowledge Base from Wikipedia & WordNet

              Entities     Facts
KnowItAll       30 000       n/a
SUMO            20 000    60 000
WordNet        120 000   800 000
Cyc            300 000    5 Mio.
TextRunner         n/a    8 Mio.
YAGO          1.7 Mio.   15 Mio.
             (1.9 Mio.) (103 Mio.)

[Example ER graph: concepts Location –subclass→ City, Country; individuals
Kiel –instanceOf→ City; Max_Planck –bornIn→ Kiel, –bornOn→ April 23, 1858,
–diedOn→ October 4, 1947; FatherOf edge between Max_Planck and Erwin_Planck;
Nobel Prize node linked to Max_Planck; means: “Max Planck”,
“Max Karl Ernst Ludwig Planck”, “Dr. Planck”]

   Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
               Graph IR on Knowledge Bases
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on confidence and informativeness
discovery queries:
  $x –bornIn→ Kiel, $x –isa→ scientist, $x –hasWon→ Nobel prize,
  $x –diedOn→ $a, $x –hasSon→ $b, $b –diedOn→ $y, $a > $y

connectedness queries:
  German ←isa– Thomas Mann –*– Goethe

queries with regular expressions:
  Ling –(hasFirstName | hasLastName)→ $x, $x –isa→ scientist,
  $x –(coAuthor | advisor)*→ Beng Chin Ooi, $x –worksFor→ $y, $y: Zhejiang

Ranking Factors

Confidence: prefer results that are likely to be correct
  ➢ certainty of IE
  ➢ authenticity and authority of sources
  e.g. bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“
  vs. livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars“
  (Martian Bloggeria)

Informativeness: prefer results that are likely important
(may prefer results that are likely new to the user)
  ➢ frequency in answer
  ➢ frequency in corpus (e.g. Web)
  ➢ frequency in query log
  e.g. for q: isa (Einstein, $y) prefer isa (Einstein, scientist)
       over isa (Einstein, vegetarian);
       for q: isa ($x, vegetarian) prefer isa (Einstein, vegetarian)
       over isa (Al Nobody, vegetarian)

Prefer results that are tightly connected
  ➢ size of answer graph
  [Example graph: Einstein and Tom Cruise both –isa→ vegetarian;
   Einstein –won→ Nobel Prize; Tom Cruise –bornIn→ 1962; Bohr –diedIn→ 1962]
Entity-Relation Search: Example NAGA
                           $x isa politician
                           $x isa scientist

                           Benjamin Franklin
                           Paul Wolfowitz
                           Angela Merkel

Major Trends in DB and IR

Database Systems               Information Retrieval
malleable schema (later)       deep NLP, adding structure
record linkage                 info extraction
graph mining                   entity-relationship graph IR
dataspaces                     Web entities
ontologies                     statistical language models
data uncertainty               ranking
programmability                search as Web Service
Web 2.0                        Web 2.0

Thank You !

