# Query Operations: Automatic Global Analysis

## Motivation
- Methods of local analysis extract information from the local set of retrieved documents to expand the query.
- An alternative is to expand the query using information from the whole set of documents in the collection.
- Until the beginning of the 1990s, these global techniques failed to yield consistent improvements in retrieval performance.
- Now, with modern variants, sometimes based on a thesaurus, this perception has changed.
## Automatic Global Analysis

There are two modern variants based on a thesaurus-like structure built using all documents in the collection:

- Query expansion based on a similarity thesaurus
- Query expansion based on a statistical thesaurus
## Similarity Thesaurus

The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.

- These relationships are not derived directly from co-occurrence of terms inside documents.
- They are obtained by considering that the terms are concepts in a concept space.
- In this concept space, each term is indexed by the documents in which it appears.

Terms assume the original role of documents, while documents are interpreted as indexing elements.
## Similarity Thesaurus vs. Vector Model

The frequency factor:

- In the vector model: f(i,j) = freq(term k_i in doc d_j) / freq(most common term in d_j)
- In the similarity thesaurus: f(i,j) = freq(term k_i in doc d_j) / freq(k_i in the doc where it appears most often)

The inverse frequency factor:

- In the vector model: idf(i) = log(# of docs in collection / # of docs containing term k_i)
- In the similarity thesaurus: itf(j) = log(# of terms in collection / # of terms in doc d_j)
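As a concrete illustration, the two pairs of factors can be computed as follows. The toy collection and all names are hypothetical, chosen only to make the contrast between the two models visible:

```python
import math

# Toy collection: each document is a list of terms (hypothetical data).
docs = [
    ["ir", "query", "query", "model"],
    ["ir", "thesaurus"],
]

def freq(term, doc):
    return doc.count(term)

# Vector model: normalize by the most frequent term in the same document.
def f_vector(term, j):
    return freq(term, docs[j]) / max(freq(t, docs[j]) for t in set(docs[j]))

# Similarity thesaurus: normalize by the highest frequency of that term
# across all documents in the collection.
def f_thesaurus(term, j):
    return freq(term, docs[j]) / max(freq(term, d) for d in docs)

# Inverse document frequency (vector model): log(N / n_i).
def idf(term):
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i)

# Inverse term frequency (similarity thesaurus): log(t / t_j), where t is
# the vocabulary size of the collection and t_j that of document d_j.
def itf(j):
    t = len({term for d in docs for term in d})
    return math.log(t / len(set(docs[j])))
```

Note the role reversal: idf penalizes terms that occur in many documents, while itf penalizes documents that contain many terms.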
## Similarity Thesaurus

Definitions:

- t: number of terms in the collection
- N: number of documents in the collection
- f_{i,j}: frequency of occurrence of the term k_i in the document d_j
- t_j: number of distinct terms (vocabulary) of document d_j
- itf_j: inverse term frequency for document d_j
The inverse term frequency for document $d_j$ is

$$ itf_j = \log \frac{t}{t_j} $$

For each term $k_i$, a vector over the $N$ documents is computed:

$$ \vec{k_i} = (w_{i,1}, w_{i,2}, \ldots, w_{i,N}) $$

with

$$ w_{i,j} = \frac{\left( 0.5 + 0.5\, \frac{f_{i,j}}{\max_j(f_{i,j})} \right) itf_j}{\sqrt{\sum_{l=1}^{N} \left( 0.5 + 0.5\, \frac{f_{i,l}}{\max_l(f_{i,l})} \right)^2 \, itf_l^2}} $$

where $w_{i,j}$ is the weight associated with the term-document pair $[k_i, d_j]$.
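A minimal sketch of computing a term vector k_i = (w_{i,1}, ..., w_{i,N}) from this formula; the frequency table, vocabulary sizes, and collection statistics below are all hypothetical:

```python
import math

# Toy term-document frequency table f[i][j] (hypothetical data):
# keys = terms k_i, values = frequencies in documents d_0..d_{N-1}.
f = {
    "ir":    [2, 1, 0],
    "query": [1, 0, 3],
}
N = 3                        # number of documents
t = 5                        # vocabulary size of the collection (assumed)
t_j = [3, 2, 4]              # vocabulary size of each document (assumed)
itf = [math.log(t / tj) for tj in t_j]

def weight(i, j):
    """w_{i,j}: normalized tf-itf weight of document d_j in term k_i's vector."""
    def tf(l):
        mx = max(f[i])       # max frequency of term k_i over all documents
        return 0.5 + 0.5 * f[i][l] / mx
    num = tf(j) * itf[j]
    den = math.sqrt(sum((tf(l) * itf[l]) ** 2 for l in range(N)))
    return num / den

# Each term is a vector over documents: k_i = (w_{i,1}, ..., w_{i,N}).
k = {term: [weight(term, j) for j in range(N)] for term in f}
```

The denominator normalizes each term vector to unit length, so dot products between term vectors behave like cosine similarities.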
## Similarity Thesaurus

The relationship between two terms k_u and k_v is computed as a correlation factor c_{u,v} given by

$$ c_{u,v} = \vec{k_u} \cdot \vec{k_v} = \sum_{d_j} w_{u,j} \times w_{v,j} $$

The global similarity thesaurus is built by computing the correlation factor c_{u,v} for each pair of indexing terms [k_u, k_v] in the collection.

This computation is expensive, but it only has to be done once and can be updated incrementally.
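A minimal sketch of building the global thesaurus from precomputed term vectors; the vectors below are hypothetical:

```python
# Term vectors over documents (hypothetical, assumed already normalized).
k = {
    "ir":    [0.8, 0.6, 0.0],
    "query": [0.6, 0.8, 0.0],
    "web":   [0.0, 0.0, 1.0],
}

def correlation(u, v):
    """c_{u,v} = k_u . k_v, summed over all documents d_j."""
    return sum(wu * wv for wu, wv in zip(k[u], k[v]))

# Global similarity thesaurus: correlation factor for every pair of
# distinct index terms in the collection.
thesaurus = {
    (u, v): correlation(u, v)
    for u in k for v in k if u != v
}
```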
## Query Expansion based on a Similarity Thesaurus

Query expansion is done in three steps as follows:

1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, k_v).
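The three steps might be sketched as below, assuming a precomputed thesaurus of correlation factors c_{u,v} (all values hypothetical) and taking sim(q, k_v) as the sum of w_{i,q} * c_{i,v} over the query terms k_i:

```python
# Hypothetical global similarity thesaurus: c[(u, v)] = correlation factor.
c = {
    ("ir", "retrieval"): 0.9, ("ir", "web"): 0.2,
    ("query", "retrieval"): 0.5, ("query", "expansion"): 0.8,
}

def expand(query_weights, r):
    """Expand a query with the top-r terms ranked by sim(q, k_v).

    query_weights: {term: w_{i,q}} for the original query terms (step 1:
    the query represented in the concept space of the index terms).
    """
    # Step 2: accumulate sim(q, k_v) for every term correlated to the query.
    sim = {}
    for (u, v), cuv in c.items():
        if u in query_weights and v not in query_weights:
            sim[v] = sim.get(v, 0.0) + query_weights[u] * cuv
    # Step 3: add the top-r ranked terms to the query.
    top = sorted(sim, key=sim.get, reverse=True)[:r]
    return list(query_weights) + top

expanded = expand({"ir": 1.0, "query": 0.5}, r=1)
```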
## Statistical Thesaurus

The global thesaurus is composed of classes which group correlated terms in the context of the whole collection.

- Such correlated terms can then be used to expand the original user query.
- These terms must be low-frequency terms.
- However, it is difficult to cluster low-frequency terms.
- To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes.
- This clustering algorithm must produce small and tight clusters.
Document clustering algorithm:

1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity.
4. Merge the clusters C_u and C_v.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.

The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.

- Use of the minimum ensures small, focussed clusters.
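The steps above can be sketched as a small agglomerative procedure; the document-to-document similarity matrix is hypothetical:

```python
# Hypothetical document-to-document similarity matrix.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

def cluster_sim(cu, cv):
    # Minimum similarity over all inter-cluster document pairs:
    # this keeps the merged clusters small and focussed.
    return min(sim[i][j] for i in cu for j in cv)

def cluster(n_docs, stop_at=1):
    clusters = [frozenset([i]) for i in range(n_docs)]    # step 1
    hierarchy = []                                        # record of merges
    while len(clusters) > stop_at:                        # step 5: stop criterion
        # steps 2-3: find the pair with highest inter-cluster similarity
        pairs = [(cu, cv) for a, cu in enumerate(clusters)
                 for cv in clusters[a + 1:]]
        cu, cv = max(pairs, key=lambda p: cluster_sim(*p))
        # step 4: merge the pair
        clusters = [c for c in clusters if c not in (cu, cv)]
        clusters.append(cu | cv)
        hierarchy.append((cu, cv, cluster_sim(cu, cv)))
    return hierarchy                                      # step 6

hierarchy = cluster(4)
```

On this matrix the tightly-similar pairs merge first, and the final merge happens at a low similarity, which is exactly the signal TC uses later to cut the hierarchy into classes.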
## Generating the Thesaurus

Given the document cluster hierarchy for the whole collection:

- Which clusters become classes?
- Which terms represent classes?

The answers are specified by the operator, based on characteristics of the collection, through three parameters:

- TC: threshold class
- NDC: number of documents in a class
- MIDF: minimum inverse document frequency
## Selecting Thesaurus Classes

TC is the minimum similarity between two subclusters for the parent cluster to be considered a class.

- A high value makes classes smaller and more focussed.

NDC is an upper limit on the size of clusters.

- A low value of NDC restricts the selection to smaller, more focussed clusters.
## Picking Terms for Each Class

Consider the set of documents in each class selected above. Only the lower-frequency terms are used for the thesaurus classes: the parameter MIDF defines the minimum inverse document frequency for any term which is selected to participate in a thesaurus class.
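Putting the three parameters together, class and term selection might look like the following sketch; the hierarchy nodes, document counts, and threshold values are all hypothetical:

```python
import math

TC, NDC, MIDF = 0.5, 3, 1.0   # operator-chosen parameters (assumed values)

# Candidate nodes from the cluster hierarchy: the similarity at which the
# node's two subclusters merged, its documents, and their terms.
hierarchy = [
    {"sim": 0.8, "docs": [0, 1], "terms": ["web", "search", "crawl"]},
    {"sim": 0.3, "docs": [0, 1, 2, 3], "terms": ["web", "ir", "model"]},
]
N = 4                                            # documents in the collection
doc_freq = {"web": 4, "search": 1, "crawl": 1, "ir": 2, "model": 1}

def idf(term):
    return math.log(N / doc_freq[term])

# Selecting thesaurus classes: TC bounds the merge similarity from below,
# NDC bounds the class size from above.
classes = [
    node for node in hierarchy
    if node["sim"] >= TC and len(node["docs"]) <= NDC
]

# Picking terms: keep only low-frequency terms, i.e. those whose inverse
# document frequency reaches MIDF.
thesaurus = {
    i: [t for t in node["terms"] if idf(t) >= MIDF]
    for i, node in enumerate(classes)
}
```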
## Query Expansion with Statistical Thesaurus

For each thesaurus class C:

- Compute an average term weight wtc:

$$ wtc = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|} $$

- Compute the thesaurus class weight wc:

$$ wc = \frac{wtc}{0.5 \times |C|} $$

where |C| is the number of terms in the thesaurus class and w_{i,C} is a precomputed weight associated with the term-class pair [k_i, C].
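A minimal sketch of the two weights, reading the class weight as wc = wtc / (0.5 * |C|), which is one way to reconstruct the garbled formula; the term-class weights below are hypothetical:

```python
# Precomputed weights w_{i,C} for the terms of one thesaurus class C
# (hypothetical values).
w = {"search": 0.9, "crawl": 0.7, "spider": 0.8}

# Average term weight of the class: wtc = (sum of w_{i,C}) / |C|.
wtc = sum(w.values()) / len(w)

# Thesaurus class weight: wc = wtc / (0.5 * |C|), so that larger classes
# contribute proportionally less to the expanded query.
wc = wtc / (0.5 * len(w))
```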
## Initializing TC, NDC, and MIDF

TC depends on the collection:

- Inspection of the cluster hierarchy is almost always necessary to assist with the setting of TC.
- A high value of TC might yield classes with too few terms.
- A low value of TC yields too few classes.

NDC is easier to set once TC is set. MIDF can be difficult to set.
## Conclusions

- An automatically generated thesaurus is an efficient method to expand queries.
- Thesaurus generation is expensive, but it is executed only once.
- Query expansion based on a similarity thesaurus uses term frequencies to expand the query.
- Query expansion based on a statistical thesaurus needs well-defined parameters.