Computing Term Translation Probabilities with Generalized Latent Semantic Analysis

Irina Matveeva and Gina-Anne Levow
Department of Computer Science, University of Chicago
Chicago, IL 60637
Abstract

Term translation probabilities proved an effective method of semantic smoothing in the language modelling approach to information retrieval tasks. In this paper, we use Generalized Latent Semantic Analysis to compute semantically motivated term and document vectors. The normalized cosine similarity between the term vectors is used as term translation probability in the language modelling framework. Our experiments demonstrate that GLSA-based term translation probabilities capture semantic relations between terms and improve performance on document classification.
1 Introduction modelling to compute the similarity between doc-
Many recent applications such as document summarization, passage retrieval and question answering require a detailed analysis of semantic relations between terms, since often there is no large context that could disambiguate a word's meaning. Many approaches model the semantic similarity between documents using the relations between semantic classes of words, such as representing dimensions of the document vectors with distributional term clusters (Bekkerman et al., 2003) or expanding the document and query vectors with synonyms and related terms as discussed in (Levow et al., 2005). These approaches improve the performance on average, but also introduce some instability and thus increased variance (Levow et al., 2005).

The language modelling approach (Ponte and Croft, 1998; Berger and Lafferty, 1999) proved very effective for the information retrieval task. Berger et al. (Berger and Lafferty, 1999) used translation probabilities between terms to account for synonymy and polysemy. However, their model of such probabilities was computationally demanding.

Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is one of the best known dimensionality reduction algorithms. Using bag-of-words document vectors (Salton and McGill, 1983), it computes a dual representation for terms and documents in a lower dimensional space. The resulting document vectors reside in the space of latent semantic concepts which can be expressed using different words. The statistical analysis of the semantic relatedness between terms is performed implicitly, in the course of a matrix decomposition.

In this project, we propose to use a combination of dimensionality reduction and language modelling to compute the similarity between documents. We compute term vectors using the Generalized Latent Semantic Analysis (Matveeva et al., 2005). This method uses co-occurrence based measures of semantic similarity between terms to compute low dimensional term vectors in the space of latent semantic concepts. The normalized cosine similarity between the term vectors is used as term translation probability.

2 Term Translation Probabilities in the Language Modelling Framework

The language modelling approach (Ponte and Croft, 1998) proved very effective for the information retrieval task. This method assumes that every document defines a multinomial probability distribution p(w|d) over the vocabulary space. Thus, given a query q = (q_1, ..., q_m), the likelihood of the query is estimated using the document's distribution: p(q|d) = ∏_{i=1}^{m} p(q_i|d), where the q_i are query terms. Relevant documents maximize p(d|q) ∝ p(q|d)p(d).
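As a concrete illustration of this scoring rule, the sketch below computes the query likelihood with a maximum likelihood estimate p(w|d) = tf(w, d)/|d|; this particular estimate is our simplifying assumption rather than a choice made in the paper.

    from collections import Counter

    def query_likelihood(query_terms, doc_terms):
        """Unigram query likelihood p(q|d) = prod_i p(q_i|d), with
        p(w|d) estimated as term frequency over document length."""
        tf = Counter(doc_terms)
        doc_len = float(len(doc_terms))
        score = 1.0
        for q in query_terms:
            score *= tf[q] / doc_len  # zero whenever q does not occur in d
        return score

With such an unsmoothed estimate, a single query term that is absent from the document drives the whole score to zero; the translation probabilities discussed next are one way of smoothing over exactly this problem.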
Many relevant documents may not contain the same terms as the query. However, they may contain terms that are semantically related to the query terms and thus have a high probability of being "translations", i.e. re-formulations of the query words.

Berger et al. (Berger and Lafferty, 1999) introduced translation probabilities between words into the document-to-query model as a way of semantic smoothing of the conditional word probabilities. Thus, the query-document similarity is computed as

p(q|d) = ∏_{i=1}^{m} ∑_{w ∈ d} t(q_i|w) p(w|d).   (1)

Each document word w is a translation of a query term q_i with probability t(q_i|w). This approach showed improvements over the baseline language modelling approach (Berger and Lafferty, 1999). The estimation of the translation probabilities is, however, a difficult task. Lafferty and Zhai used a Markov chain on words and documents to estimate the translation probabilities (Lafferty and Zhai, 2001). We use the Generalized Latent Semantic Analysis to compute the translation probabilities.
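Equation 1 can be sketched as follows; the nested dictionary t_table holding t(q|w) is a hypothetical stand-in for whatever estimate of the translation probabilities is available, and p(w|d) is again estimated by term frequency for simplicity.

    from collections import Counter

    def translation_query_likelihood(query_terms, doc_terms, t_table):
        """Equation 1: p(q|d) = prod_i sum_{w in d} t(q_i|w) p(w|d).
        t_table[w][q] is assumed to hold the translation probability t(q|w)."""
        tf = Counter(doc_terms)
        doc_len = float(len(doc_terms))
        score = 1.0
        for q in query_terms:
            inner = 0.0
            for w, count in tf.items():
                inner += t_table.get(w, {}).get(q, 0.0) * (count / doc_len)
            score *= inner
        return score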
2.1 Document Similarity
We propose to use low dimensional term vectors for inducing the translation probabilities between terms. We postpone the discussion of how the term vectors are computed to section 2.2. To evaluate the validity of this approach, we applied it to document classification.

We used two methods of computing the similarity between documents. First, we computed the language modelling score using term translation probabilities. Second, once the term vectors are computed, the document vectors are generated as linear combinations of term vectors; therefore, we also used the cosine similarity between the documents to perform classification.
We computed the language modelling score of a test document d relative to a training document d_i as

p(d|d_i) = ∏_{v ∈ d} ∑_{w ∈ d_i} t(v|w) p(w|d_i).   (2)

Appropriately normalized values of the cosine similarity measure between pairs of term vectors cos(v, w) are used as the translation probability t(v|w) between the corresponding terms.

In addition, we used the cosine similarity between the document vectors,

⟨d_i, d_j⟩ = ∑_{w ∈ d_i} ∑_{v ∈ d_j} α_w^{d_i} β_v^{d_j} ⟨w, v⟩,   (3)

where α_w^{d_i} and β_v^{d_j} represent the weights of the terms w and v with respect to the documents d_i and d_j, respectively.

In this case, the inner products between the term vectors are also used to compute the similarity between the document vectors. Therefore, the cosine similarity between the document vectors also depends on the relatedness between pairs of terms.

We compare these two document similarity scores to the cosine similarity between bag-of-words document vectors. Our experiments show that these two methods offer an advantage for document classification.
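The two scores can be sketched as follows. The GLSA term vectors are assumed to be the rows of a NumPy array indexed through a vocabulary dictionary, and term frequency is used for the weights and for p(w|d_i); these representational choices are ours, made only to keep the sketch self-contained.

    import numpy as np
    from collections import Counter

    def lm_score(test_terms, train_terms, t_matrix, vocab):
        """Equation 2: p(d|d_i) = prod_{v in d} sum_{w in d_i} t(v|w) p(w|d_i),
        where t_matrix[i, j] holds the translation probability t(t_j|t_i)
        and p(w|d_i) is estimated by term frequency."""
        tf = Counter(w for w in train_terms if w in vocab)
        n = float(sum(tf.values()))
        score = 1.0
        for v in test_terms:
            if v not in vocab:
                continue
            inner = sum(t_matrix[vocab[w], vocab[v]] * (c / n) for w, c in tf.items())
            score *= inner
        return score

    def glsa_doc_cosine(terms_i, terms_j, term_vecs, vocab):
        """Equation 3 in vector form: each document vector is a weighted linear
        combination of GLSA term vectors, so the inner products between term
        vectors enter through the dot product of the two document vectors."""
        def doc_vec(terms):
            vec = np.zeros(term_vecs.shape[1])
            for w, c in Counter(terms).items():
                if w in vocab:
                    vec += c * term_vecs[vocab[w]]
            return vec
        di, dj = doc_vec(terms_i), doc_vec(terms_j)
        return float(di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj) + 1e-12))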
2.2 Generalized Latent Semantic Analysis

We use the Generalized Latent Semantic Analysis (GLSA) (Matveeva et al., 2005) to compute semantically motivated term vectors.

The GLSA algorithm computes the term vectors for the vocabulary V of the document collection C using a large corpus W. It has the following outline:

1. Construct the weighted term-document matrix D based on C

2. For the vocabulary words in V, obtain a matrix of pair-wise similarities, S, using the large corpus W

3. Obtain the matrix U^T of the low dimensional vector space representation of terms that preserves the similarities in S, U^T ∈ R^{k×|V|}

4. Compute document vectors by taking linear combinations of term vectors, D̂ = U^T D

The columns of D̂ are documents in the k-dimensional space.
In step 2 we used point-wise mutual information (PMI) as the co-occurrence based measure of semantic association between pairs of the vocabulary terms. PMI has been successfully applied to semantic proximity tests for words (Turney, 2001; Terra and Clarke, 2003) and was also successfully used as a measure of term similarity to compute document clusters (Pantel and Lin, 2002). In our preliminary experiments, the GLSA with PMI showed a better performance than with other co-occurrence based measures such as the likelihood ratio and the χ² test.

PMI between random variables representing two words, w_1 and w_2, is computed as

PMI(w_1, w_2) = log [ P(W_1 = 1, W_2 = 1) / ( P(W_1 = 1) P(W_2 = 1) ) ].

We used the singular value decomposition (SVD) in step 3 to compute GLSA term vectors. LSA (Deerwester et al., 1990) and some other related dimensionality reduction techniques, e.g. Locality Preserving Projections (He and Niyogi, 2003), compute a dual document-term representation. The main advantage of GLSA is that it focuses on term vectors, which allows for a greater flexibility in the choice of the similarity matrix.
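A compact sketch of steps 2 through 4 of the outline is given below. Using an eigendecomposition of the symmetric PMI matrix, with negative eigenvalues clipped, is one standard way to obtain low dimensional vectors whose inner products approximate S; the paper describes the step only as SVD-based, so these details are our assumptions.

    import numpy as np

    def pmi_matrix(joint, marginal):
        """Step 2: S[i, j] = log P(W_i=1, W_j=1) / (P(W_i=1) P(W_j=1)).
        joint is a |V| x |V| array of co-occurrence probabilities and
        marginal a length-|V| vector of occurrence probabilities."""
        denom = np.outer(marginal, marginal)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(denom > 0, joint / denom, 0.0)
            return np.where(ratio > 0, np.log(ratio), 0.0)

    def glsa_term_vectors(S, k):
        """Step 3: term vectors preserving the similarities in S, here taken
        from the top-k eigenpairs of the symmetric similarity matrix."""
        eigvals, eigvecs = np.linalg.eigh(S)
        order = np.argsort(eigvals)[::-1][:k]
        lam = np.clip(eigvals[order], 0.0, None)
        return eigvecs[:, order] * np.sqrt(lam)   # |V| x k, one row per term

    def glsa_document_vectors(term_vecs, D):
        """Step 4: document vectors as linear combinations of term vectors,
        D_hat = U^T D, with D the weighted term-document matrix from step 1."""
        return term_vecs.T @ D                    # k x number-of-documents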
3 Experiments

The goal of the experiments was to understand whether the GLSA term vectors can be used to model the term translation probabilities. We used a simple k-NN classifier and a basic baseline to evaluate the performance. We used the GLSA-based term translation probabilities within the language modelling framework and GLSA document vectors.

We used the 20 news groups data set because previous studies showed that the classification performance on this document collection can noticeably benefit from additional semantic information (Bekkerman et al., 2003). For the GLSA computations we used the terms that occurred in at least 15 documents, which gave a vocabulary of 9732 terms. We removed documents with fewer than 5 words. Here we used 2 sets of 6 news groups. Group_d contained documents from dissimilar news groups (os.ms, sports.baseball, rec.autos, sci.space, misc.forsale, religion-christian), with a total of 5300 documents. Group_s contained documents from more similar news groups (politics.misc, politics.mideast, politics.guns, religion.misc, religion.christian, atheism) and had 4578 documents.
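A rough sketch of this preprocessing is shown below, under the assumption that the collection is available as a list of token lists; whether the document length filter is applied before or after restricting to the vocabulary is not stated in the paper, so the order used here is arbitrary.

    from collections import Counter

    def filter_corpus(docs, min_doc_freq=15, min_doc_len=5):
        """Keep terms occurring in at least min_doc_freq documents and drop
        documents with fewer than min_doc_len words, as described above."""
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc))
        vocab = {w for w, df in doc_freq.items() if df >= min_doc_freq}
        kept = [[w for w in doc if w in vocab] for doc in docs]
        return [doc for doc in kept if len(doc) >= min_doc_len], sorted(vocab)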
3.1 GLSA Computation
To collect the co-occurrence statistics for the similarities matrix S we used the English Gigaword collection (LDC). We used 1,119,364 New York Times articles labeled "story", with 771,451 terms. We used the Lemur toolkit (http://www.lemurproject.org/) to tokenize and index the documents; we used stemming and a list of stop words. Unless stated otherwise, for the GLSA methods we report the best performance over different numbers of embedding dimensions.

The co-occurrence counts can be obtained using either term co-occurrence within the same document or within a sliding window of a certain fixed size. In our experiments we used the window-based approach, which was shown to give better results (Terra and Clarke, 2003). We used a window of size 4.
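The window-based counts can be collected roughly as follows; treating every window position as a sampling unit for the probabilities in the PMI formula of section 2.2 is our reading of the description above, not a detail stated in the paper.

    from collections import Counter
    from itertools import combinations

    def window_cooccurrence(docs, vocab, window=4):
        """Count, over a tokenized corpus, how many sliding windows contain
        each vocabulary term and each pair of vocabulary terms. Dividing
        these counts by the total number of windows gives the probability
        estimates used in the PMI formula."""
        pair_counts, term_counts, total_windows = Counter(), Counter(), 0
        for tokens in docs:
            for start in range(len(tokens)):
                span = {w for w in tokens[start:start + window] if w in vocab}
                total_windows += 1
                term_counts.update(span)
                for pair in combinations(sorted(span), 2):
                    pair_counts[pair] += 1
        return pair_counts, term_counts, total_windows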
3.2 Classification Experiments

We ran the k-NN classifier with k=5 on ten random splits of training and test sets, with different numbers of training documents. The baseline was to use the cosine similarity between the bag-of-words document vectors weighted with term frequency. Other weighting schemes such as maximum likelihood and Laplace smoothing did not perform better.
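The evaluation loop can be sketched as below, with the similarity function passed in as a parameter so that the same loop serves the tf baseline, the GLSA document cosine, and the LM score; the helper names are illustrative.

    import random
    from collections import Counter

    def knn_predict(test_doc, train_docs, train_labels, similarity, k=5):
        """Label a test document by majority vote among the k training
        documents most similar under the supplied similarity function."""
        scored = sorted(((similarity(test_doc, d), y)
                         for d, y in zip(train_docs, train_labels)),
                        key=lambda pair: pair[0], reverse=True)
        votes = Counter(label for _, label in scored[:k])
        return votes.most_common(1)[0][0]

    def random_split(docs, labels, n_train, seed=0):
        """One random train/test split with n_train training documents."""
        indices = list(range(len(docs)))
        random.Random(seed).shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        pick = lambda idx: ([docs[i] for i in idx], [labels[i] for i in idx])
        return pick(train), pick(test)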
Table 1 shows the results. We computed the score between the training and test documents using two approaches: cosine similarity between the GLSA document vectors according to Equation 3 (denoted as GLSA), and the language modelling score which included the translation probabilities between the terms as in Equation 2 (denoted as LM). We used the term frequency as an estimate for p(w|d). To compute the matrix of translation probabilities P, where P[i][j] = t(t_j|t_i), for the LM approach we first obtained the matrix of cosine similarities P[i][j] = cos(t_i, t_j). We set the negative and zero entries in P to a small positive value. Finally, we normalized the rows of P to sum up to one.
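This construction of P amounts to the short routine below; the value of the small positive floor is not given in the paper, so the constant here is an arbitrary placeholder.

    import numpy as np

    def translation_matrix(term_vecs, floor=1e-6):
        """P[i][j] = t(t_j|t_i): pairwise cosine similarities between GLSA
        term vectors, with negative and zero entries replaced by a small
        positive value and each row normalized to sum to one."""
        norms = np.linalg.norm(term_vecs, axis=1, keepdims=True)
        unit = term_vecs / np.clip(norms, 1e-12, None)
        P = unit @ unit.T                      # cosine similarity matrix
        P[P <= 0.0] = floor                    # clamp negative and zero entries
        return P / P.sum(axis=1, keepdims=True)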
          Group_d                Group_s
#L        tf     GLSA   LM       tf     GLSA   LM
100       0.58   0.75   0.69     0.42   0.48   0.48
200       0.65   0.78   0.74     0.47   0.52   0.51
400       0.69   0.79   0.76     0.51   0.56   0.55
1000      0.75   0.81   0.80     0.58   0.60   0.59
2000      0.78   0.83   0.83     0.63   0.64   0.63

Table 1: k-NN classification accuracy for 20NG.

Table 1 shows that for both settings GLSA and LM outperform the tf document vectors. As expected, the classification task was more difficult for the similar news groups. However, in this case both GLSA-based approaches outperform the baseline. In both cases, the advantage is more significant with smaller sizes of the training set. GLSA and LM performance usually peaked at around 300-500 dimensions, which is in line with results for other SVD-based approaches (Deerwester et al., 1990). When the highest accuracy was achieved at higher dimensions, the increase after 500 dimensions was rather small, as illustrated in Figure 1.

Figure 1: k-NN with 400 training documents.

These results illustrate that the pair-wise similarities between the GLSA term vectors add important semantic information which helps to go beyond term matching and deal with synonymy and polysemy.
4 Conclusion and Future Work
We used the GLSA to compute term translation probabilities and used them within a measure of semantic similarity between documents. We showed that the GLSA term-based document representation and GLSA-based term translation probabilities improve performance on document classification.

The GLSA term vectors were computed for all vocabulary terms. However, different measures of similarity may be required for different groups of terms, such as content-bearing general vocabulary words and proper names as well as other named entities. Furthermore, different measures of similarity work best for nouns and for verbs. To extend this approach, we will use a combination of similarity measures between terms to model the document similarity. We will divide the vocabulary into general vocabulary terms and named entities and compute a separate similarity score for each group of terms. The overall similarity score is a function of these two scores. In addition, we will use the GLSA-based score together with syntactic similarity to compute the similarity between the general vocabulary terms.

References

Ron Bekkerman, Ran El-Yaniv, and Naftali Tishby. 2003. Distributional word clusters vs. words for text categorization.

Adam Berger and John Lafferty. 1999. Information retrieval as statistical translation. In Proc. of the 22nd ACM SIGIR.

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407.

Xiaofei He and Partha Niyogi. 2003. Locality preserving projections. In Proc. of NIPS.

John Lafferty and Chengxiang Zhai. 2001. Document language models, query models, and risk minimization for information retrieval. In Proc. of the 24th ACM SIGIR, pages 111-119, New York, NY, USA. ACM Press.

Gina-Anne Levow, Douglas W. Oard, and Philip Resnik. 2005. Dictionary-based techniques for cross-language information retrieval. Information Processing and Management: Special Issue on Cross-language Information Retrieval.

Irina Matveeva, Gina-Anne Levow, Ayman Farahat, and Christian Royer. 2005. Generalized latent semantic analysis for term representation. In Proc. of

Patrick Pantel and Dekang Lin. 2002. Document clustering with committees. In Proc. of the 25th ACM SIGIR, pages 199-206. ACM Press.

Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proc. of the 21st ACM SIGIR, pages 275-281, New York, NY, USA. ACM Press.

Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.

Egidio L. Terra and Charles L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proc. of HLT-NAACL.

Peter D. Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167:491-502.