Latent Semantic Indexing Defined
Latent semantic indexing, by definition, is a
mathematical or statistical technique for extracting
and representing the similarity of meaning of words
and passages by analysis of large bodies of text.
The definition may be a little difficult to
understand, but basically latent semantic indexing
takes the keywords you put into your search engine and
goes through each and every web page, searching out the
best results for the keywords you are seeking.
There are several different mappings for latent
semantic indexing from high dimensional to low
dimensional spaces. LSI chooses the mapping that is
optimal in the sense that it minimizes the distance
between the original data and its low dimensional
representation. Choosing the number of dimensions is
a problem in its own right. A reduction can remove
much of the noise, but keeping too few dimensions may
lose important information.
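
The trade-off above can be sketched in a few lines of
Python. This is only a minimal illustration, assuming
the mapping is computed with a truncated singular value
decomposition (the usual choice for LSI) and that
"distance" means the Frobenius norm of the difference.
The toy vocabulary, the counts, and the helper names
truncated_svd and reconstruction_error are made up for
this example and do not come from any particular
search engine.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# The vocabulary and the counts are invented for illustration only.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # auto
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)


def truncated_svd(matrix, k):
    """Return the factors of the rank-k truncated SVD of `matrix`."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def reconstruction_error(matrix, k):
    """Frobenius distance between `matrix` and its rank-k approximation."""
    U_k, s_k, Vt_k = truncated_svd(matrix, k)
    approx = U_k @ np.diag(s_k) @ Vt_k
    return np.linalg.norm(matrix - approx)


# The distance shrinks as more dimensions are kept; with very few
# dimensions, more of the original information is lost.
errors = [reconstruction_error(A, k) for k in range(1, 5)]
for k, err in enumerate(errors, start=1):
    print(f"k={k}: distance to original = {err:.3f}")
```

Among all approximations of the same rank, the
truncated decomposition gives the smallest such
distance, which is the sense in which the mapping is
optimal. How many dimensions to keep still has to be
tuned for each collection of text.
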
LSI performance improves considerably after ten to
twenty dimensions and peaks at seventy to one hundred
dimensions. Then it slowly begins to diminish again.
This pattern of performance is observed with other
datasets as well.
Latent semantic indexing looks for pages that have
many words in common and are close in meaning, sorts
them out, and presents them to the seeker.
The result is an LSI indexed database with similarity
values that are calculated for every content word
and phrase. In response to a query, the LSI database
returns the pages it sees as the best fit for the
keywords.
The algorithm doesn't understand anything about what
the words mean and does not require an exact match to
return results that are useful to the seeker.
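
To make the "no exact match" point concrete, here is a
small hedged sketch of how a query could be answered
from such an index. It assumes the index is a truncated
singular value decomposition of a term-document matrix
and that pages are ranked by cosine similarity in the
reduced space. The toy data and the helper names
build_index, fold_in_query, and cosine_scores are
hypothetical and chosen only for this example.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Document 1 contains "auto" and "engine" but never the word "car".
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # auto
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)


def build_index(matrix, k):
    """Return term factors, singular values, and one k-dim row per document."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in_query(query_terms, vocabulary, U_k, s_k):
    """Map a bag of query words into the same k-dimensional space."""
    q = np.array([float(query_terms.count(t)) for t in vocabulary])
    return (q @ U_k) / s_k


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between the query and every document vector."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    return (doc_vecs @ query_vec) / norms


U_k, s_k, doc_vecs = build_index(A, k=2)
query_vec = fold_in_query(["car"], terms, U_k, s_k)
scores = cosine_scores(query_vec, doc_vecs)

# Documents ordered from best to worst match for the query "car".
ranking = np.argsort(-scores)
for doc_id in ranking:
    print(f"document {doc_id}: similarity = {scores[doc_id]:.3f}")
```

In this toy collection, the page that only mentions
"auto" and "engine" still scores highly for the query
"car", because those words appear together elsewhere in
the collection. The page about flowers scores close to
zero. Nothing in the code knows what any of the words
mean; the ranking comes purely from patterns of
co-occurrence.
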