# 02e


```
Generalized Vector Model

  Classic models enforce independence of index terms.

  For the Vector model:
    The set of term vectors {k1, k2, ..., kt} is linearly
    independent and forms a basis for the subspace of interest.

    Frequently, this is interpreted as pairwise orthogonality:

      for all i,j with i != j:  ki . kj = 0

  In 1985, Wong, Ziarko, and Wong proposed an interpretation in
  which the set of term vectors is linearly independent but not
  pairwise orthogonal.
Key Idea:

  In the generalized vector model, two index terms may be
  non-orthogonal; each term is represented in terms of smaller
  components, called minterms.

  As before, let:
    wij be the weight associated with the pair [ki, dj]
    {k1, k2, ..., kt} be the set of all index terms

  If these weights are all binary, every pattern of occurrence
  of terms within documents can be represented by one of the
  2^t minterms:

    m1 = (0, 0, 0, ..., 0)
    m2 = (1, 0, 0, ..., 0)
    m3 = (0, 1, 0, ..., 0)
    m4 = (1, 1, 0, ..., 0)
    m5 = (0, 0, 1, ..., 0)
    ...
    m_2^t = (1, 1, 1, ..., 1)

  Here, m2 identifies the documents in which the term k1
  occurs alone.
Key Idea:
 Thebasis for the generalized vector model is
formed by a set of 2 vectors defined over the set
of minterms, as follows:
t
0 1 2 ...     2
 m1  = (1, 0, 0, ..., 0, 0)
 m2 = (0, 1, 0, ..., 0, 0)

 m3 = (0, 0, 1, ..., 0, 0)
•
•
•
t
 m2 = (0, 0, 0, ..., 0, 1)

 Notice   that,
 i,j    mi  mj = 0 i.e., pairwise orthogonal
Key Idea:

  Minterm vectors are pairwise orthogonal, but this does not
  mean that the index terms are independent:

  The minterm m4 is given by:

    m4 = (1, 1, 0, ..., 0, 0)

  This minterm indicates the occurrence of the terms k1 and k2
  within the same document. If such a document exists in the
  collection, we say that the minterm m4 is active and that a
  dependency between these two terms is induced.

  The generalized vector model adopts as its basic foundation
  the notion that co-occurrence of terms within documents
  induces dependencies among them.
Forming the Term Vectors

  The vector associated with the term ki is computed as:

              sum_{r | gi(mr)=1}  ci,r mr
    ki = -----------------------------------
          sqrt( sum_{r | gi(mr)=1} ci,r^2 )

    ci,r = sum_{dj | gl(dj)=gl(mr) for all l}  wij

  The weight ci,r associated with the pair [ki, mr] sums up
  the weights of the term ki in all the documents whose term
  occurrence pattern is given by mr.

  Notice that for a collection of size N, at most N minterms
  are active and affect the ranking (and not 2^t).
Dependency between Index Terms

  A degree of correlation between the terms ki and kj can now
  be computed as:

    ki . kj = sum_{r | gi(mr)=1 and gj(mr)=1}  ci,r * cj,r

  This degree of correlation sums up (in a weighted form) the
  dependencies between ki and kj induced by the documents in
  the collection (represented by the mr minterms).
The Generalized Vector Model: An Example

  [Figure: documents d1 to d7 placed in the regions induced by
   the index terms k1, k2, and k3]

  Term weights wij:

          k1    k2    k3
    d1     2     0     1
    d2     1     0     0
    d3     0     1     3
    d4     2     0     0
    d5     1     2     4
    d6     0     2     2
    d7     0     5     0

    q      1     2     3
Computation of ci,r

  Each document maps to the minterm matching its binary pattern
  of term occurrence:

         wij                          gl(dj)
         k1    k2    k3               k1    k2    k3
    d1    2     0     1    d1 = m6     1     0     1
    d2    1     0     0    d2 = m2     1     0     0
    d3    0     1     3    d3 = m7     0     1     1
    d4    2     0     0    d4 = m2     1     0     0
    d5    1     2     4    d5 = m8     1     1     1
    d6    0     2     2    d6 = m7     0     1     1
    d7    0     5     0    d7 = m3     0     1     0

    q     1     2     3    q  = m8     1     1     1

  Applying  ci,r = sum_{dj | gl(dj)=gl(mr) for all l} wij :

          c1,r   c2,r   c3,r
    m1      0      0      0
    m2      3      0      0
    m3      0      5      0
    m4      0      0      0
    m5      0      0      0
    m6      2      0      1
    m7      0      3      5
    m8      1      2      4
Computation of Index Term Vectors

  Using the ci,r values above:

    k1 = (3 m2 + 2 m6 + 1 m8) / sqrt(3^2 + 2^2 + 1^2)

    k2 = (5 m3 + 3 m7 + 2 m8) / sqrt(5^2 + 3^2 + 2^2)

    k3 = (1 m6 + 5 m7 + 4 m8) / sqrt(1^2 + 5^2 + 4^2)
Computation of Document Vectors

  Using the term weights wij from the example, each document
  and the query are expressed in terms of the ki vectors:

    dj = sum_i wij ki        q = sum_i wiq ki

    d1 = 2 k1        +   k3
    d2 =   k1
    d3 =          k2 + 3 k3
    d4 = 2 k1
    d5 =   k1 + 2 k2 + 4 k3
    d6 =        2 k2 + 2 k3
    d7 =        5 k2
    q  =   k1 + 2 k2 + 3 k3
Ranking Computation

  Expanding each dj and the query q in the minterm basis, via
  the ki vectors above, the ranking is computed with the usual
  cosine measure:

    sim(dj, q) = (dj . q) / (|dj| |q|)

  Since the minterm vectors are pairwise orthogonal, the dot
  products reduce to sums of products of minterm coordinates.
Conclusions

  - The model considers correlations among index terms
  - It is not clear in which situations it is superior to the
    standard Vector model
  - Its computational costs are higher
  - The model does introduce interesting new ideas

```