Hypertext Databases and Data Mining (SIGMOD 1999 Tutorial)


          Soumen Chakrabarti
 Indian Institute of Technology Bombay
http://www.cse.iitb.ernet.in/~soumen
 http://www.cs.berkeley.edu/~soumen
      soumen@cse.iitb.ernet.in
                        The Web
•      350 million static HTML pages, 2 terabytes
•      0.8–1 million new pages created per day
•      600 GB of pages change per month
•      Average page changes in a few weeks
•      Average page has about ten links
•      Increasing volume of active pages and views
•      Boundaries between repositories blurred
•      Bigger than the sum of its parts
Soumen Chakrabarti
IIT Bombay                                           2
                     Hypertext databases
• Academia
         – Digital library, web publication
• Consumer
         – Newsgroups, communities, product reviews
• Industry and organizations
         – Health care, customer service
         – Corporate email



                     What to expect
• Write in decimal the exact circumference of a
  circle of radius one inch
• Is the distance between Tokyo and Rome more
  than 6000 miles?
• What is the distance between Tokyo and Rome?
• java
• java +coffee -applet
• “uninterrupt* power suppl*” ups -parcel
                     Search products and services
• Verity
• Fulcrum
• PLS
• Oracle text extender
• DB2 text extender
• Infoseek Intranet
• SMART (academic)
• Glimpse (academic)
• Inktomi (HotBot)
• Alta Vista
• Google!
• Yahoo!
• Infoseek Internet
• Lycos
• Excite
[Tutorial roadmap: local data, FTP, Gopher, and HTML feed crawling, indexing, and search; on top of search sit relevance ranking, latent semantic indexing, and clustering (collaborative filtering, Scatter-Gather, topic directories); hyperlinks form a social network supporting web communities and topic distillation; semi-supervised learning, automatic classification, and focused crawling connect to web servers and browsers, which are monitored, mined, and modified for user profiling; "more structure" leads toward WebSQL, WebL, and XML.]
Basic indexing and search
                     Keyword indexing
• Boolean search
         – care AND NOT old
• Stemming
         – gain*
• Phrases and proximity
         – “new care”
         – loss <NEAR/5> care
         – <SENTENCE>

Example documents:
         D1: “My care is loss of care with old care done”
         D2: “Your care is gain of care with new care won”

Posting lists (term → document: positions):
         care   D1: 1, 5, 8;  D2: 1, 5, 8
         new    D2: 7
         old    D1: 7
         loss   D1: 3
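The posting lists above can be reproduced with a small positional inverted index. This is an illustrative sketch, not code from the tutorial; the 0-based positions, whitespace tokenization, and helper names are assumptions:

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: [0-based positions]}, matching the slide's postings."""
    index = defaultdict(dict)
    for did, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(did, []).append(pos)
    return index

docs = {
    "D1": "my care is loss of care with old care done",
    "D2": "your care is gain of care with new care won",
}
index = build_index(docs)

def boolean_and_not(index, term, not_term):
    """Boolean query: term AND NOT not_term (e.g. care AND NOT old)."""
    return set(index.get(term, {})) - set(index.get(not_term, {}))

def phrase(index, t1, t2):
    """Two-word phrase: t1 immediately followed by t2 (e.g. "new care")."""
    return {did for did in set(index.get(t1, {})) & set(index.get(t2, {}))
            if any(a + 1 == b for a in index[t1][did] for b in index[t2][did])}

def near(index, t1, t2, k):
    """Proximity: t1 and t2 within k positions (e.g. loss <NEAR/5> care)."""
    hits = set()
    for did in set(index.get(t1, {})) & set(index.get(t2, {})):
        if any(abs(a - b) < k for a in index[t1][did] for b in index[t2][did]):
            hits.add(did)
    return hits
```

With these documents the postings for "care" come out as positions 1, 5, 8 in both D1 and D2, exactly as on the slide.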
                       Tables and queries 1

POSTING(tid, did, pos):

         tid    did   pos
         care   d1    1
         care   d1    5
         care   d1    8
         care   d2    1
         care   d2    5
         care   d2    8
         new    d2    7
         old    d1    7
         loss   d1    3
         …      …     …

care AND NOT gain*:

         select distinct did from POSTING where tid = 'care'
         except
         select distinct did from POSTING where tid like 'gain%'

“new care” and proximity:

         with
         TPOS1(did, pos) as
                  (select did, pos from POSTING where tid = 'new'),
         TPOS2(did, pos) as
                  (select did, pos from POSTING where tid = 'care')
         select distinct did from TPOS1, TPOS2
                  where TPOS1.did = TPOS2.did
                  and proximity(TPOS1.pos, TPOS2.pos)

         proximity(a, b) ::=
                  a + 1 = b        (adjacent phrase)
                  abs(a - b) < 5   (<NEAR/5>)
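The relational formulation runs nearly verbatim on any SQL engine. A sketch using Python's sqlite3; since stock SQL has no proximity() predicate, the adjacent-phrase case is inlined as a position comparison, and the toy rows (including a "gain" posting for d2) are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table POSTING(tid text, did text, pos int)")
rows = [("care", "d1", 1), ("care", "d1", 5), ("care", "d1", 8),
        ("care", "d2", 1), ("care", "d2", 5), ("care", "d2", 8),
        ("new", "d2", 7), ("old", "d1", 7), ("loss", "d1", 3),
        ("gain", "d2", 3)]
conn.executemany("insert into POSTING values (?,?,?)", rows)

# care AND NOT gain* -- d2 contains "gain", so only d1 survives
q1 = """select distinct did from POSTING where tid = 'care'
        except
        select distinct did from POSTING where tid like 'gain%'"""

# phrase "new care": proximity(a, b) inlined as a + 1 = b
q2 = """with TPOS1(did, pos) as (select did, pos from POSTING where tid = 'new'),
             TPOS2(did, pos) as (select did, pos from POSTING where tid = 'care')
        select distinct TPOS1.did from TPOS1, TPOS2
        where TPOS1.did = TPOS2.did and TPOS1.pos + 1 = TPOS2.pos"""
```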
                     Relevance ranking
• Recall = coverage
         – What fraction of relevant documents were reported
• Precision = accuracy
         – What fraction of reported documents were relevant
• Trade-off

[Figure: the search engine's output sequence is compared against the “true response” to the query, considering a prefix of k results; the precision-recall curve falls from high precision at low recall toward low precision at recall 1.]
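The "consider prefix k" comparison above can be sketched directly; the ranking and relevance set here are made-up toy data:

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall measured over the first k reported documents."""
    prefix = ranked[:k]
    hits = sum(1 for d in prefix if d in relevant)
    precision = hits / k              # fraction of reported docs that are relevant
    recall = hits / len(relevant)     # fraction of relevant docs that were reported
    return precision, recall

# hypothetical output sequence vs. a 4-document "true response"
ranked = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d4", "d8"}
p, r = precision_recall_at_k(ranked, relevant, 4)
```

Sweeping k from 1 to len(ranked) traces out the precision-recall trade-off curve.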
                     Vector space model and TFIDF
• Some words are more important than others
• W.r.t. a document collection D
         – d+ = documents containing the term, d- = documents that do not
         – “Inverse document frequency”:
                  IDF(t) = 1 + log( (|d+| + |d-|) / |d+| )
• “Term frequency” (TF)
         – Many variants:
                  n(d,t) / Σ_τ n(d,τ)   or   n(d,t) / max_τ n(d,τ)
• Probabilistic models
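One concrete combination of the TF and IDF variants above, sketched in Python; whitespace tokenization and the first (length-normalized) TF variant are arbitrary choices, and |d+| + |d-| is just the collection size |D|:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight(d,t) = TF * IDF with TF = n(d,t)/sum_t n(d,t)
    and IDF = 1 + log(|D| / |d+|)."""
    tfs = {did: Counter(text.split()) for did, text in docs.items()}
    n_docs = len(docs)
    df = Counter(t for tf in tfs.values() for t in tf)  # |d+| per term
    weights = {}
    for did, tf in tfs.items():
        length = sum(tf.values())
        weights[did] = {t: (n / length) * (1 + math.log(n_docs / df[t]))
                        for t, n in tf.items()}
    return weights

docs = {"d1": "care loss care old care", "d2": "care gain care new care"}
w = tfidf(docs)
```

"care" appears in every document, so its IDF bottoms out at 1; rare terms like "old" get boosted by the log factor.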
                         Tables and queries 2
        VECTOR(did, tid, elem) ::=
        With
        TEXT(did, tid, freq) as
                  (select did, tid, count(distinct pos) from POSTING
                  group by did, tid),
        LENGTH(did, len) as
                  (select did, sum(freq) from TEXT group by did),
        DOCFREQ(tid, df) as
                  (select tid, count(distinct did) from TEXT
                  group by tid)
        select did, tid,
        (freq / len) * (1 + log((select count(distinct did) from POSTING) / df))
        from TEXT, LENGTH, DOCFREQ
        where TEXT.did = LENGTH.did
        and TEXT.tid = DOCFREQ.tid
                         Relevance ranking
select did, cosine(did, query)
from corpus
where candidate(did, query)
order by cosine(did, query) desc
fetch first k rows only

[Figure: documents and the query as vectors over term axes 'now', 'auto', 'car'; ranking finds documents whose vectors make the smallest angle with the query.]

Find the largest k entries of qᵀA, where q = (q₁, …, q_T) is the query vector and A is the T × n term-document matrix:
• Exact computation: O(n²)
• All entries above the mean can be estimated with error e within O(n e⁻²) time
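The ranking query can be sketched directly over sparse term-weight vectors. This is the exact computation, not the sampling estimate; the names and toy weights are illustrative:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term->weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(corpus, query, k):
    """order by cosine(did, query) desc, fetch first k rows only."""
    ranked = sorted(corpus, key=lambda did: cosine(corpus[did], query),
                    reverse=True)
    return ranked[:k]

corpus = {
    "d1": {"auto": 2.0, "car": 1.0},
    "d2": {"now": 3.0},
    "d3": {"car": 2.0, "now": 1.0},
}
query = {"car": 1.0, "auto": 1.0}
```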
Similarity and clustering
                          Clustering
• Given an unlabeled collection of documents,
  induce a taxonomy based on similarity
• Need document similarity measure
         – Distance between normalized document vectors
         – Cosine of angle between document vectors
• Top-down clustering is difficult because of
  huge number of noisy dimensions
         – k-means, expectation maximization
• Quadratic-time bottom-up clustering
                         Document model
• Vocabulary V, term wᵢ; document δ represented by
         c(δ) = ( f(wᵢ, δ) : wᵢ ∈ V )
• f(wᵢ, δ) is the number of times wᵢ occurs in document δ
• Most f's are zeroes for a single document
• Monotone component-wise damping function g, such as log or square-root:
         g(c(δ)) = ( g(f(wᵢ, δ)) : wᵢ ∈ V )
                                    Similarity
                 s(δ, γ) = ⟨g(c(δ)), g(c(γ))⟩ / ( |g(c(δ))| · |g(c(γ))| )

         ⟨·,·⟩ is the inner product, |·| the vector norm

         Normalized document profile:
                 p(δ) = g(c(δ)) / |g(c(δ))|

         Profile for document group Γ:
                 p(Γ) = Σ_{δ∈Γ} p(δ) / | Σ_{δ∈Γ} p(δ) |
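These definitions translate almost one-for-one into code. A sketch with square-root damping; the choice of g and the toy counts are illustrative:

```python
import math

def damped_profile(counts, g=math.sqrt):
    """p(d): apply component-wise damping g to term counts, then normalize."""
    gc = {t: g(n) for t, n in counts.items()}
    norm = math.sqrt(sum(x * x for x in gc.values()))
    return {t: x / norm for t, x in gc.items()}

def similarity(p, q):
    """s(d, g): inner product of two normalized profiles (cosine)."""
    return sum(w * q.get(t, 0.0) for t, w in p.items())

def group_profile(profiles):
    """p(Gamma): normalized sum of member profiles."""
    total = {}
    for p in profiles:
        for t, w in p.items():
            total[t] = total.get(t, 0.0) + w
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {t: x / norm for t, x in total.items()}

p1 = damped_profile({"care": 3, "old": 1, "loss": 1})
p2 = damped_profile({"care": 3, "new": 1, "gain": 1})
```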
                     Group average clustering 1
Self-similarity of a group Γ:

         s(Γ) = ( 1 / (|Γ| (|Γ| − 1)) ) · Σ_{δ≠γ∈Γ} s(δ, γ)

• Initially G is a collection of singleton groups, each with one document
• Repeat
         – Find Γ, Δ in G with max s(Γ ∪ Δ)
         – Merge group Γ with group Δ
• For each Γ keep track of the best Δ
• O(n²) algorithm
                       Group average clustering 2
   Un-normalized group profile:
         p̂(Γ) = Σ_{δ∈Γ} p(δ)

   Can show:

         s(Γ) = ( ⟨p̂(Γ), p̂(Γ)⟩ − |Γ| ) / ( |Γ| (|Γ| − 1) )

         s(Γ ∪ Δ) = ( ⟨p̂(Γ∪Δ), p̂(Γ∪Δ)⟩ − (|Γ| + |Δ|) )
                    / ( (|Γ| + |Δ|) (|Γ| + |Δ| − 1) )

         ⟨p̂(Γ∪Δ), p̂(Γ∪Δ)⟩ = ⟨p̂(Γ), p̂(Γ)⟩ + ⟨p̂(Δ), p̂(Δ)⟩ + 2 ⟨p̂(Γ), p̂(Δ)⟩
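The identity above is what makes the merge step cheap: a candidate merge's score needs only each group's cached self inner product ⟨p̂, p̂⟩ plus one cross inner product. A sketch over pre-normalized document profiles; the dictionary representation and field names are assumptions:

```python
def dot(u, v):
    """Sparse inner product of two term->weight vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def merge_score(g1, g2):
    """s(G1 u G2) from cached <p,p> values and one cross inner product."""
    n = g1["size"] + g2["size"]
    pp = g1["pp"] + g2["pp"] + 2 * dot(g1["profile"], g2["profile"])
    return (pp - n) / (n * (n - 1))

def group_average_cluster(profiles, k):
    """Agglomerate singletons until k groups remain (quadratic variant)."""
    groups = [{"members": [i], "size": 1, "pp": dot(p, p), "profile": dict(p)}
              for i, p in enumerate(profiles)]
    while len(groups) > k:
        best = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: merge_score(groups[ij[0]], groups[ij[1]]))
        g1, g2 = groups[best[0]], groups[best[1]]
        merged_profile = dict(g1["profile"])
        for t, w in g2["profile"].items():
            merged_profile[t] = merged_profile.get(t, 0.0) + w
        merged = {"members": g1["members"] + g2["members"],
                  "size": g1["size"] + g2["size"],
                  "pp": g1["pp"] + g2["pp"] + 2 * dot(g1["profile"], g2["profile"]),
                  "profile": merged_profile}
        groups = [g for idx, g in enumerate(groups) if idx not in best] + [merged]
    return [sorted(g["members"]) for g in groups]
```

For singletons the cached ⟨p̂, p̂⟩ is 1, so the merge score of two singletons reduces to their cosine similarity, as it should.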
          “Rectangular time” algorithm Buckshot
• Randomly sample O(√(kn)) documents
• Run the group average clustering algorithm to reduce to k groups or clusters
• Iterate assign-to-nearest O(1) times
         – Move each document to the cluster Γ with max s(δ, Γ)
• Total time taken is O(kn)
                     Extended similarity
• auto and car co-occur often
• Therefore they must be related
• Documents having related words are related
• Useful for search and clustering
• Two basic approaches
         – Hand-made thesaurus (WordNet)
         – Co-occurrence and associations

[Figure: snippets such as “… auto … car …” and “… car … auto …” illustrate co-occurrence; car ≈ auto, so a document mentioning “auto” matches one mentioning “car”.]
                         Latent semantic indexing

[Figure: SVD of the term-document matrix. Documents are columns and terms rows of the t × d matrix A; SVD factors A into U · D · Vᵀ of rank r, and truncating to the k largest singular values maps each document to a k-dim vector.]
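A sketch of the projection using numpy's SVD; the toy term-document matrix is illustrative:

```python
import numpy as np

def lsi(A, k):
    """Project the t x d term-document matrix A to k dimensions
    via truncated SVD: A = U D V^T, keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # each document (column of A) becomes a k-dim vector
    doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
    return U[:, :k], doc_vectors

# toy matrix: d1 and d2 share the auto/car vocabulary, d3 does not
A = np.array([[2.0, 1.0, 0.0],   # auto
              [1.0, 2.0, 0.0],   # car
              [0.0, 0.0, 3.0]])  # now
Uk, docs_k = lsi(A, 2)
```

In the reduced space the auto/car documents d1 and d2 collapse onto the same latent direction, while d3 stays orthogonal to them.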
                     Collaborative recommendation
• People = records, movies = features; cluster people
• Both people and features can be clustered
• For hypertext access, time of access is a feature
• Need advanced models

[Table: people Lyle, Ellen, Jason, Fred, Dean, Karen vs. movies Batman, Rambo, Andre, Hiver, Whispers, StarWars.]
                     A model for collaboration
• People and movies belong to unknown classes
• Pk = probability a random person is in class k
• Pl = probability a random movie is in class l
• Pkl = probability of a class-k person liking a
  class-l movie
• Gibbs sampling: iterate
         – Pick a person or movie at random and assign to a
           class with probability proportional to Pk or Pl
         – Estimate new parameters
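A minimal sketch of one Gibbs sweep, with movie classes held fixed and hard class assignments; the binary like/dislike data, Laplace smoothing, and all names are assumptions, and a full sampler would also re-sample movie classes and priors:

```python
import random

def gibbs_step(R, pclass, mclass, K, L):
    """One sweep: estimate Pkl = Pr[class-k person likes class-l movie]
    from the current assignments, then re-sample each person's class."""
    n_people, n_movies = len(R), len(R[0])
    likes = [[1.0] * L for _ in range(K)]   # Laplace-smoothed counts
    total = [[2.0] * L for _ in range(K)]
    for i in range(n_people):
        for j in range(n_movies):
            likes[pclass[i]][mclass[j]] += R[i][j]
            total[pclass[i]][mclass[j]] += 1
    Pkl = [[likes[k][l] / total[k][l] for l in range(L)] for k in range(K)]
    # assign each person to a class with probability proportional to likelihood
    for i in range(n_people):
        weights = []
        for k in range(K):
            w = 1.0
            for j in range(n_movies):
                p = Pkl[k][mclass[j]]
                w *= p if R[i][j] else (1.0 - p)
            weights.append(w)
        pclass[i] = random.choices(range(K), weights=weights)[0]
    return Pkl

# toy data: first three people like the first three movies, last three the rest
R = [[1, 1, 1, 0, 0, 0] for _ in range(3)] + [[0, 0, 0, 1, 1, 1] for _ in range(3)]
pclass = [random.randrange(2) for _ in R]
mclass = [0, 0, 0, 1, 1, 1]
for _ in range(20):
    Pkl = gibbs_step(R, pclass, mclass, 2, 2)
```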
Supervised learning
                     Supervised learning (classification)
• Many forms
         – Content: automatically organize the web per Yahoo!
         – Type: faculty, student, staff
         – Intent: education, discussion, comparison,
           advertisement
• Applications
         – Relevance feedback for re-scoring query responses
         – Filtering news, email, etc.
         – Narrowing searches and selective data acquisition
                           Difficulties
• Dimensionality
         – Decision tree classifiers: dozens of columns
         – Vector space model: 50,000 'columns'
• Context-dependent noise
         – 'Can' (v.) considered a 'stopword'
         – 'Can' (n.) may not be a stopword in
           /Yahoo/SocietyCulture/Environment/Recycling
                       More difficulties
• Need for scalability
         – High dimension needs more data to learn
• Class labels are from a hierarchy
         – All documents belong to the root node
         – Highest probability leaf may have low confidence




                            Techniques
• Nearest neighbor
         + Standard keyword index also supports classification
         – How to define similarity? (TFIDF may not work)
         – Wastes space by storing individual document info
• Rule-based, decision-tree based
         – Very slow to train (but quick to test)
         + Good accuracy (but brittle rules)
• Model-based
         + Fast training and testing with small footprint
                     More document models
• Boolean vector (word counts ignored)
         – Toss one coin for each term in the universe
• Bag of words (multinomial)
         – Repeatedly toss coin with a term on each face
• Limited dependence models
         – Bayesian network where each feature has at most k
           features as parents
         – Maximum entropy estimation

                                “Bag-of-words”
• Decide topic; topic c is picked with prior probability π(c);  Σ_c π(c) = 1
• Each topic c has parameters θ(c,t) for terms t
• Coin with face probabilities;  Σ_t θ(c,t) = 1
• Fix document length and keep tossing the coin
• Given c, probability of document d is

         Pr[d | c] = ( n(d) choose {n(d,t)} ) · Π_{t∈d} θ(c,t)^n(d,t)
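The bag-of-words likelihood yields the classic multinomial naive Bayes classifier. A sketch with Laplace smoothing; the multinomial coefficient does not depend on c, so it cancels when comparing classes, and the vocabulary and training data are toy assumptions:

```python
import math
from collections import Counter

def train(labeled_docs, vocab):
    """Estimate prior pi(c) and theta(c,t) with Laplace smoothing."""
    classes = {}
    for c, text in labeled_docs:
        classes.setdefault(c, []).append(text.split())
    model = {}
    n_total = len(labeled_docs)
    for c, docs in classes.items():
        counts = Counter(t for d in docs for t in d)
        denom = sum(counts.values()) + len(vocab)
        theta = {t: (counts[t] + 1) / denom for t in vocab}
        model[c] = (len(docs) / n_total, theta)
    return model

def log_posterior(model, text):
    """log pi(c) + sum_t n(d,t) log theta(c,t), up to the shared
    multinomial coefficient."""
    scores = {}
    for c, (prior, theta) in model.items():
        s = math.log(prior)
        for t in text.split():
            if t in theta:
                s += math.log(theta[t])
        scores[c] = s
    return scores

vocab = {"bike", "race", "helmet", "stock", "market", "fund"}
train_docs = [("cycling", "bike race helmet bike"),
              ("cycling", "race helmet"),
              ("finance", "stock market fund stock")]
model = train(train_docs, vocab)
scores = log_posterior(model, "bike helmet race")
```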
                                     Limitations
• With the model
         – The 100th occurrence of a term is as surprising as the first
         – No inter-term dependence
• With using the model
         – Most observed θ(c,t) are zero and/or noisy
         – Have to pick a low-noise subset of the term universe
                  • Improves space, time, and accuracy
         – Have to “fix” low-support statistics
                               Feature selection
[Figure: a model with unknown parameters p₁, p₂, … (one class) and q₁, q₂, … (the other) over the term universe T; the observed data are N rows of 0/1 term occurrences; confidence intervals around each pₜ and qₜ shrink, and separate, as N grows.]

Pick F ⊆ T such that models built over F have high separation confidence.
                                      Tables and queries 3
TAXONOMY(pcid, kcid, kcname):

         pcid   kcid   kcname
                1
         1      2      Arts
         1      3      Science
         3      4      Math
         3      5      Physics

(Tree: 1 is the root; 2 and 3 are its children; 4 and 5 are children of 3.)

EGMAP(did, kcid) maps example documents to classes; TEXT(did, tid, freq) as before.

         EGMAPR(did, kcid) ::=
              ((select did, kcid from EGMAP) union all
              (select e.did, t.pcid from
              EGMAPR as e, TAXONOMY as t
              where e.kcid = t.kcid))

         STAT(pcid, tid, kcid, ksmc, ksnc) ::=
                (select pcid, tid, TAXONOMY.kcid,
                count(distinct TEXT.did), sum(freq)
                from EGMAPR, TAXONOMY, TEXT
                where TAXONOMY.kcid = EGMAPR.kcid
                and EGMAPR.did = TEXT.did
                group by pcid, tid, TAXONOMY.kcid)
Analyzing hyperlink structure
                      Hyperlink graph analysis
• Hypermedia is a social network
         – Telephoned, advised, co-authored, paid, cited
• Social network theory (cf. Wasserman & Faust)
         – Extensive research applying graph notions
         – Centrality
         – Prestige and reflected prestige
         – Co-citation
• Can be applied directly to Web search
         – HITS, Google, CLEVER, topic distillation
                     Hypertext models for classification
• c = class, t = text, N = neighbors
• Text-only model: Pr[t|c]
• Using neighbors' text to judge my topic: Pr[t, t(N) | c]
• Better model: Pr[t, c(N) | c]
• Non-linear relaxation

[Figure: a page of unknown class “?” linked to labeled neighbor pages.]
                     Exploiting link features
• 9600 patents from 12 classes marked by USPTO
• Patents have text and cite other patents
• Expand test patent to include neighborhood
• 'Forget' a fraction of neighbors' classes

[Figure: %Error (0-40) vs. %Neighborhood known (0-100) for Text, Link, and Text+Link classifiers.]
                     Google and HITS
• In-degree ≈ prestige
• Not all votes are worth the same
• Prestige of a page is the sum of the prestige of citing pages: p = Ep
• Pre-compute query-independent prestige score
• High prestige → good authority
• High reflected prestige → good hub
• Bipartite iteration
         – a = Eh
         – h = Eᵀa
         – h = EᵀEh
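The bipartite iteration can be sketched as a power iteration. Here E is encoded as an edge list with (u, v) meaning u links to v, so under this convention hubs satisfy h = Ea and authorities a = Eᵀh; the toy graph is illustrative:

```python
import math

def hits(edges, n, iters=50):
    """Hub/authority power iteration with per-step normalization;
    h converges to the principal eigenvector of E E^T."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # authority of w sums the hub scores of pages pointing at w
        auth = [sum(hub[u] for u, v in edges if v == w) for w in range(n)]
        # hub score of w sums the authority scores of pages w points at
        hub = [sum(auth[v] for u, v in edges if u == w) for w in range(n)]
        na = math.sqrt(sum(a * a for a in auth)) or 1.0
        nh = math.sqrt(sum(h * h for h in hub)) or 1.0
        auth = [a / na for a in auth]
        hub = [h / nh for h in hub]
    return hub, auth

# toy graph: 0 and 1 are hubs pointing at authorities 2 and 3; 4 cites only 2
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 2)]
hub, auth = hits(edges, 5)
```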
                           Tables and queries 4
delete from HUBS;
insert into HUBS(url, score)
         (select urlsrc, sum(score * wtrev) from AUTH, LINK
         where authwt is not null and type = 'non-local'
         and ipdst <> ipsrc and url = urldst
         group by urlsrc);
update HUBS set (score) = score /
         (select sum(score) from HUBS);

update LINK as X set (wtfwd) = 1. /
       (select count(ipsrc) from LINK
       where ipsrc = X.ipsrc
       and urldst = X.urldst)
       where type = 'non-local';

Tables:
         HUBS(url, score)
         AUTH(url, score)
         LINK(urlsrc, urldst, ipsrc, ipdst, wtfwd, wtrev, type)

[Figure: an edge from urlsrc@ipsrc to urldst@ipdst carries a forward weight wtfwd and a reverse weight wtrev.]
Querying/mining semi-structured data
                     Semi-structured database systems
• Lore (Stanford)
         – Object exchange model, dataguides
• WebSQL (Toronto), WebL (Compaq SRC)
         – Structured query languages for the Web
• WHIRL (AT&T Research)
         – Approximate matches on multiple textual columns
• Strudel (AT&T Research, U. Washington)
         – Web site generation and management

        Queries combining structure and content
• Select x.url, x.title from Document x such that x is reachable by a link
  path from “http://www.cs.wisc.edu” where x mentions “semi-structured data”
• Apart from cycling, find the most common topic found within link radius 2
  of pages on cycling
                     Answer: “first-aid”
• In the last year, how many links were made from environment protection
  pages to Exxon?
                           Resource discovery

[Architecture diagram: a taxonomy editor with example browser and feedback maintains the taxonomy database; the topic distiller and scheduler drive crawler workers over the crawl database; a hypertext classifier is trained (learn) from the taxonomy and applied (apply) to fetched pages through shared topic models.]
                                                   Resource discovery results 1
• High rate of 'harvesting' relevant pages
         – Standard crawling neither necessary nor adequate
           for answering specific queries

[Figure: harvest rate on the topic “cycling”: average relevance vs. #URLs fetched, unfocused (left, up to 10000 URLs) vs. soft focus (right, up to 6000 URLs), with curves averaged over 100 and 1000 URLs.]
                                                                 Resource discovery results 2
• Robust to perturbations of starting URLs
• Great resources found 12 links from start set

[Figure: left, URL coverage: fraction of reference crawl vs. #URLs crawled by test crawler (up to 3000); right, distance to top authorities: frequency histogram of the shortest distance found (1-12 links).]
                            Database issues
• Useful features
         + Concurrency and recovery (crawlers)
         + I/O-efficient representation of mining algorithms
         + Ad-hoc queries combining structure and content
• Need better support for
         –     Flexible choices for concurrency and recovery
         –     Indexed scans over temporary table expressions
         –     Efficient string storage and operations
         –     Answering complex queries approximately
Resources
                     Research areas
•      Modeling, representation, and manipulation
•      More applications of machine learning
•      Approximate structure and content matching
•      Answering questions in specific domains
•      Interactive refinement of ill-defined queries
•      Tracking emergent topics in a discussion group
•      Content-based collaborative recommendation
•      Semantic prefetching and caching
                     Events and activities
• Text REtrieval Conference (TREC)
         – Mature ad-hoc query and filtering tracks (newswire)
         – New track for web search (2GB and 100GB corpus)
         – New track for question answering
• DIMACS special years on Networks (-2000)
         – Includes applications such as information retrieval,
           databases and the Web, multimedia transmission
           and coding, distributed and collaborative computing
• Conferences: WWW, SIGIR, SIGMOD/VLDB?