					           Generalized Vector Model
   Classic models assume that index terms are independent.
   For the Vector model:
        The set of term vectors {k1, k2, ..., kt} is linearly
        independent and forms a basis for the subspace of
        interest.
   Frequently, this is interpreted as pairwise orthogonality:
        $\forall i \neq j: \ \vec{k}_i \cdot \vec{k}_j = 0$
   In 1985, Wong, Ziarko, and Wong proposed an
    interpretation in which the set of terms is linearly
    independent, but not pairwise orthogonal.
Key Idea:
         In the generalized vector model, two index terms
          might be non-orthogonal and are represented in
          terms of smaller components (minterms).
         As before, let
               wij be the weight associated with [ki,dj]
               {k1, k2, ..., kt} be the set of all terms

         If these weights are all binary, all patterns of
            occurrence of terms within documents can be
            represented by the minterms:
                    m1 = (0,0,0, ..., 0)
                    m2 = (1,0,0, ..., 0)
                    m3 = (0,1,0, ..., 0)
                    m4 = (1,1,0, ..., 0)
                    m5 = (0,0,1, ..., 0)
                    ...
                    m_{2^t} = (1,1,1, ..., 1)
              Here, m2 denotes the documents in which only the
               term k1 occurs.
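
The minterm ordering above can be reproduced with a short sketch. The function name `minterms` is an illustrative choice, not something from the slides; it assumes that k1 corresponds to the lowest-order bit, which matches the listing m2 = (1,0,...,0), m3 = (0,1,...,0), m4 = (1,1,...,0).

```python
def minterms(t):
    """Return all 2**t binary occurrence patterns over t terms.

    Position r - 1 in the returned list holds minterm m_r, with k1 as the
    lowest-order bit (an assumption that matches the slide's listing).
    """
    return [tuple((r >> i) & 1 for i in range(t)) for r in range(2 ** t)]


patterns = minterms(3)
for r, m in enumerate(patterns, start=1):
    print(f"m{r} = {m}")
```

For t = 3 this yields eight patterns, from m1 = (0, 0, 0) to m8 = (1, 1, 1).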
Key Idea:
     The basis for the generalized vector model is
     formed by a set of 2^t vectors defined over the set
     of minterms, with one component per minterm:
       $\vec{m}_1 = (1, 0, 0, ..., 0, 0)$
       $\vec{m}_2 = (0, 1, 0, ..., 0, 0)$
       $\vec{m}_3 = (0, 0, 1, ..., 0, 0)$
            ...
       $\vec{m}_{2^t} = (0, 0, 0, ..., 0, 1)$

     Notice that
       $\forall i \neq j: \ \vec{m}_i \cdot \vec{m}_j = 0$, i.e., the minterm vectors are pairwise orthogonal
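
As a small sanity check of the orthogonality claim, the basis can be built as one-hot vectors and tested pair by pair. This is only a sketch; `minterm_basis` and `dot` are illustrative helper names.

```python
def minterm_basis(t):
    """Return the 2**t one-hot vectors m_1 ... m_{2^t} as lists of 0/1."""
    n = 2 ** t
    return [[1 if col == row else 0 for col in range(n)] for row in range(n)]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


basis = minterm_basis(3)  # t = 3 terms gives 8 basis vectors
orthogonal = all(
    dot(basis[i], basis[j]) == 0
    for i in range(len(basis))
    for j in range(len(basis))
    if i != j
)
print(len(basis), orthogonal)
```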
Key Idea:
  Minterm vectors are pairwise orthogonal. But this
  does not mean that the index terms are independent:
    The minterm m4 is given by:
           m4 = (1, 1, 0, ..., 0, 0)
    This minterm indicates the occurrence of the terms k1
     and k2 within the same document. If such a document
     exists in a collection, we say that the minterm m4 is
     active and that a dependency between these two terms
     is induced.
    The generalized vector model adopts as a basic
     foundation the notion that co-occurrence of terms within
     documents induces dependencies among them.
                Forming the Term Vectors
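 The vector associated with the term ki is computed as:

   $\vec{k}_i = \frac{\sum_{\forall r,\ g_i(m_r)=1} c_{i,r}\, \vec{m}_r}{\sqrt{\sum_{\forall r,\ g_i(m_r)=1} c_{i,r}^2}}$

   $c_{i,r} = \sum_{d_j \,\mid\, \forall l,\ g_l(d_j)=g_l(m_r)} w_{ij}$

   Here g_i(m_r) is the i-th component (0 or 1) of the minterm mr,
    and g_l(dj) is 1 if the term kl occurs in dj and 0 otherwise.
   The weight c_{i,r} associated with the pair [ki,mr] sums up
    the weights of the term ki in all the documents which
    have a term occurrence pattern given by mr.
   Notice that for a collection of size N, at most N minterms
    affect the ranking (and not 2^t).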
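
The two formulas can be sketched with sparse dictionaries keyed by occurrence pattern, so that only active minterms are ever stored. The function name `term_vectors` and the three-document toy collection are hypothetical and serve only as an illustration.

```python
import math
from collections import defaultdict


def term_vectors(docs):
    """docs: list of weight tuples (w_1j, ..., w_tj), one per document.

    Returns (c, k): c[i][pattern] is c_{i,r} for the minterm with that
    occurrence pattern, and k[i][pattern] is the matching component of
    the normalized term vector k_i.
    """
    t = len(docs[0])
    c = [defaultdict(float) for _ in range(t)]
    for weights in docs:
        # Occurrence pattern of the document: g_l(d_j) for every term l.
        pattern = tuple(1 if w > 0 else 0 for w in weights)
        for i, w in enumerate(weights):
            if w > 0:  # only minterms with g_i(m_r) = 1 contribute to k_i
                c[i][pattern] += w
    k = []
    for ci in c:
        norm = math.sqrt(sum(v * v for v in ci.values()))
        k.append({m: v / norm for m, v in ci.items()} if norm else {})
    return c, k


# Hypothetical toy collection: 3 documents over 2 terms.
docs = [(1, 0), (2, 1), (0, 3)]
c, k = term_vectors(docs)
active = {tuple(1 if w > 0 else 0 for w in d) for d in docs}
print(len(active), "active minterms out of", 2 ** len(docs[0]))
```

Because each document contributes to exactly one occurrence pattern, the number of active minterms can never exceed the number of documents.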
    Dependency between Index Terms
 A degree of correlation between the terms ki and kj
 can now be computed as:

   $\vec{k}_i \cdot \vec{k}_j = \sum_{\forall r \,\mid\, g_i(m_r)=1 \,\wedge\, g_j(m_r)=1} c_{i,r} \times c_{j,r}$

 This degree of correlation sums up (in a weighted
 form) the dependencies between ki and kj induced by
 the documents in the collection (represented by the mr
 minterms).
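
A minimal sketch of this sum, using the c_{i,r} values from the worked example later in these slides. The dictionary layout and the name `correlation` are illustrative assumptions; only minterms in which both terms occur contribute.

```python
# c_{i,r} values copied from the example table later in the slides;
# minterms with a zero factor are omitted.
c = {
    "k1": {"m2": 3, "m6": 2, "m8": 1},
    "k2": {"m3": 5, "m7": 3, "m8": 2},
    "k3": {"m6": 1, "m7": 5, "m8": 4},
}


def correlation(ci, cj):
    """Sum c_{i,r} * c_{j,r} over the minterms shared by both terms."""
    shared = ci.keys() & cj.keys()
    return sum(ci[r] * cj[r] for r in shared)


print(correlation(c["k1"], c["k2"]))  # only m8 is shared: 1*2 = 2
print(correlation(c["k1"], c["k3"]))  # m6 and m8: 2*1 + 1*4 = 6
print(correlation(c["k2"], c["k3"]))  # m7 and m8: 3*5 + 2*4 = 23
```

The sum follows the slide's formula literally and is therefore unnormalized; dividing by the norms of the two c vectors would give the inner product of the normalized term vectors.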
The Generalized Vector Model:
        An Example

[Figure: the documents d1 to d7 positioned relative to the index terms k1, k2, and k3]


                k1        k2                  k3
     d1         2         0                   1
     d2         1         0                   0
     d3         0         1                   3
     d4         2         0                   0
     d5         1         2                   4
     d6         0         2                   2
     d7         0         5                   0

      q         1         2                   3
                    Computation of c_{i,r}
  Weights wij (left) and the binary occurrence pattern with its matching minterm (right):
              k1          k2           k3                         k1          k2          k3
  d1           2           0            1        d1 = m6           1           0           1
  d2           1           0            0        d2 = m2           1           0           0
  d3           0           1            3        d3 = m7           0           1           1
  d4           2           0            0        d4 = m2           1           0           0
  d5           1           2            4        d5 = m8           1           1           1
  d6           0           2            2        d6 = m7           0           1           1
  d7           0           5            0        d7 = m3           0           1           0

  q           1           2            3         q = m8           1           1           1




$c_{i,r} = \sum_{d_j \,\mid\, \forall l,\ g_l(d_j)=g_l(m_r)} w_{ij}$
                                                           c1,r        c2,r        c3,r
                                            m1              0           0           0
                                            m2              3           0           0
                                            m3              0           5           0
                                            m4              0           0           0
                                            m5              0           0           0
                                            m6              2           0           1
                                            m7              0           3           5
                                            m8              1           2           4
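
The c_{i,r} table can be recomputed from the weights on this slide with a short sketch. The helper name `minterm_number` is illustrative; it assumes the minterm numbering used throughout, with k1 as the lowest-order bit and m1 as the all-zero pattern.

```python
# Term weights w_ij copied from the table on this slide (order: k1, k2, k3).
weights = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (0, 2, 2), "d7": (0, 5, 0),
}


def minterm_number(w):
    """Map a weight tuple to r such that m_r matches its occurrence pattern."""
    return 1 + sum((1 if x > 0 else 0) << i for i, x in enumerate(w))


# c[r] holds [c_1r, c_2r, c_3r]; keys 1..8 correspond to m1..m8.
c = {r: [0, 0, 0] for r in range(1, 9)}
for w in weights.values():
    r = minterm_number(w)
    for i, x in enumerate(w):
        c[r][i] += x

for r in range(1, 9):
    print(f"m{r}", c[r])
```

Each document adds its weights to the row of the single minterm that matches its pattern, which is why d2 and d4 together produce c_{1,2} = 3 and d3 and d6 together produce the m7 row.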
Computation of Index Term Vectors
                  c1,r   c2,r   c3,r
           m1      0      0      0
           m2      3      0      0
           m3      0      5      0
           m4      0      0      0
           m5      0      0      0
           m6      2      0      1
           m7      0      3      5
           m8      1      2      4



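 $\vec{k}_1 = \frac{3\,\vec{m}_2 + 2\,\vec{m}_6 + \vec{m}_8}{\sqrt{3^2 + 2^2 + 1^2}}$

 $\vec{k}_2 = \frac{5\,\vec{m}_3 + 3\,\vec{m}_7 + 2\,\vec{m}_8}{\sqrt{5^2 + 3^2 + 2^2}}$

 $\vec{k}_3 = \frac{1\,\vec{m}_6 + 5\,\vec{m}_7 + 4\,\vec{m}_8}{\sqrt{1^2 + 5^2 + 4^2}}$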
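
To see the term vectors numerically, the nonzero c_{i,r} entries can be normalized directly. This is a sketch; the dictionary layout and the name `normalize` are illustrative choices.

```python
import math

# Nonzero c_{i,r} entries from the table above, grouped by term.
c = {
    "k1": {"m2": 3, "m6": 2, "m8": 1},
    "k2": {"m3": 5, "m7": 3, "m8": 2},
    "k3": {"m6": 1, "m7": 5, "m8": 4},
}


def normalize(ci):
    """Divide each component by the Euclidean norm of the c vector."""
    norm = math.sqrt(sum(v * v for v in ci.values()))
    return {r: v / norm for r, v in ci.items()}


k = {term: normalize(ci) for term, ci in c.items()}
for term, vec in k.items():
    print(term, {r: round(v, 4) for r, v in vec.items()})
```

The three norms are sqrt(14), sqrt(38), and sqrt(42), so every resulting term vector has unit length.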
Computation of Document Vectors
                    k1    k2     k3
             d1      2     0      1
             d2      1     0      0
             d3      0     1      3
             d4      2     0      0
             d5      1     2      4
             d6      0     2      2
             d7      0     5      0

              q     1     2      3

    d1   = 2 k1 +          k3
    d2   = k1
    d3   =          k2 + 3 k3
    d4   = 2 k1
    d5   = k1 + 2 k2 + 4 k3
    d6   =        2 k2 + 2 k3
    d7   =        5 k2
    q    = k1 + 2 k2 + 3 k3
Ranking Computation
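        $\vec{k}_1 = \frac{3\,\vec{m}_2 + 2\,\vec{m}_6 + \vec{m}_8}{\sqrt{3^2 + 2^2 + 1^2}}$

        $\vec{k}_2 = \frac{5\,\vec{m}_3 + 3\,\vec{m}_7 + 2\,\vec{m}_8}{\sqrt{5^2 + 3^2 + 2^2}}$

        $\vec{k}_3 = \frac{1\,\vec{m}_6 + 5\,\vec{m}_7 + 4\,\vec{m}_8}{\sqrt{1^2 + 5^2 + 4^2}}$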
           d1    = 2 k1 +          k3
           d2    = k1
           d3    =          k2 + 3 k3
           d4    = 2 k1
           d5    = k1 + 2 k2 + 4 k3
           d6    =        2 k2 + 2 k3
           d7    =        5 k2
           q     = k1 + 2 k2 + 3 k3
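
The slides stop short of the final scores. The sketch below assumes the ranking uses the cosine between the query and each document in minterm space, as in the standard vector model; that choice, and the helper names `unit`, `to_minterm_space`, and `cosine`, are assumptions rather than something the slides state.

```python
import math

MINTERMS = ["m2", "m3", "m6", "m7", "m8"]  # the active minterms in the example

c = {
    "k1": {"m2": 3, "m6": 2, "m8": 1},
    "k2": {"m3": 5, "m7": 3, "m8": 2},
    "k3": {"m6": 1, "m7": 5, "m8": 4},
}
weights = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (0, 2, 2), "d7": (0, 5, 0),
}
query = (1, 2, 3)


def unit(ci):
    """Normalized term vector as a dict over minterm names."""
    norm = math.sqrt(sum(v * v for v in ci.values()))
    return {r: v / norm for r, v in ci.items()}


k = [unit(c[name]) for name in ("k1", "k2", "k3")]


def to_minterm_space(w):
    """Expand sum_i w_i * k_i into components over the active minterms."""
    return [sum(wi * ki.get(m, 0.0) for wi, ki in zip(w, k)) for m in MINTERMS]


def cosine(u, v):
    """Cosine of the angle between two vectors (assumed similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


q_vec = to_minterm_space(query)
scores = {name: cosine(to_minterm_space(w), q_vec) for name, w in weights.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 4))
```

Under these assumptions d5 ranks first, and d2 and d4 tie at the bottom because both are multiples of k1.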
               Conclusions
   The model considers correlations among index terms
   It is not clear in which situations it is superior to the
    standard Vector model
   Computation costs are higher than in the standard Vector model
   The model does introduce interesting new ideas

				