Lecture 6: Linguistic Methods
for Searching
   Stemming
   Thesaurus
   Online resources
   Automatic construction of thesaurus
Outline of Stemming Methods
   Goal of Stemming Process
   Algorithm
   Affix Removal (Porter’s Algorithm)
   Dictionary Look-up Stemmers
   Successor Variety
   n-Gram Stemming
   Applications
   Stemming was originally designed to improve
performance by reducing the demand on
system resources.
   With the continued growth in storage and
computing power, stemming for
performance reasons is no longer as
important.
Other Potentials
   It may improve recall.
   There may be an associated decline in precision.
   System designers make their own choice of
whether to include stemming.
   Google does not use stemming
   Hotbot offers word stemming as a user
option
Porter Stemming Algorithm
   The Porter algorithm is the most widely
accepted stemming algorithm.
   It is based on a set of conditions on the stem,
suffix and prefix, and the actions associated
with each condition.
   See, e.g.,
   http://www.tartarus.org/~martin/PorterStemmer/
Porter Stemming (Condition)
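   m, the measure of a stem, counts the
sequences of vowels (a, e, i, o, u, and y after a
consonant) followed by consonants.
   Every stem has the form [C](VC)^m[V], where C is a
run of consonants, V is a run of vowels, the initial C
and final V are optional, and m is the number of
times VC repeats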

Measure          Example
m=0              free, why
m=1              frees, whose
m=2              prologue, compute
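The measure can be checked mechanically. The following is a minimal sketch, not Porter's reference code: the helper names `is_consonant`, `cv_form` and `measure` are made up for this illustration, and words are assumed to be lowercase.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(stem):
    """Collapse the stem into alternating runs, e.g. 'prologue' -> 'CVCVCV'."""
    form = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != kind:
            form.append(kind)
    return "".join(form)


def measure(stem):
    """m = number of VC pairs in the collapsed [C](VC)^m[V] form."""
    return cv_form(stem).count("VC")


for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, cv_form(word), measure(word))
```

Running it on the six example words reproduces the measures in the table above.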
Porter Stemming (Condition)
   *<X> -stem ends with letter X
   *v*      -stem contains a vowel
   *d       -stem ends in double consonant
   *o       -stem ends with consonant-
vowel-consonant sequence where the
final consonant is not w, x, or y
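Each condition is a simple predicate on the stem. A minimal sketch follows; the function names are hypothetical (Porter's paper only gives the notation), and stems are assumed to be lowercase.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; y counts as a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends in the same consonant twice."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, final one not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")
```

For instance, `ends_double_consonant("controll")` holds, while `ends_cvc("snow")` does not because the final consonant is w.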
Rules
Step   Condition   Suffix   Replacement   Example
1a     NULL        sses     ss            stresses -> stress
1b     *v*         ing      NULL          making -> mak
1b1    NULL        at       ate           inflat(ed) -> inflate
1c     *v*         y        i             happy -> happi
Rules (continued)
Step   Condition             Suffix   Replacement     Example
2      m>0                   aliti    al              formaliti -> formal
3      m>0                   icate    ic              duplicate -> duplic
5a     m>1                   e        NULL            inflate -> inflat
5b     m>1 and *d and *<L>   NULL     single letter   controll -> control
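Every rule in these tables has the same shape: if the word ends with the suffix and the remaining stem satisfies the condition, replace the suffix. The sketch below illustrates that shape only; it is not the full Porter algorithm (which applies its steps in a fixed order), and `apply_rule` and `measure` are names invented for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions (the m of the stem)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def apply_rule(word, suffix, replacement, condition=lambda stem: True):
    """Rewrite the suffix if the stem before it meets the condition."""
    if not word.endswith(suffix):
        return word
    stem = word[: len(word) - len(suffix)]
    return stem + replacement if condition(stem) else word


print(apply_rule("stresses", "sses", "ss"))                          # rule 1a
print(apply_rule("formaliti", "aliti", "al", lambda s: measure(s) > 0))  # rule 2
print(apply_rule("duplicate", "icate", "ic", lambda s: measure(s) > 0))  # rule 3
print(apply_rule("inflate", "e", "", lambda s: measure(s) > 1))      # rule 5a
```

A word such as "rate" is left unchanged by rule 5a, because its stem "rat" has m=1.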
Example
   duplicatable
   duplicat               rule 4
   duplicate              rule 1b1
   duplic                 rule 3
Dictionary Look-Up Stemmer
   A dictionary stores the pairing of each word
with its stem, for all the words.
   The structure of the dictionary should be
designed for fast search

TERM              STEM
computer          comput
compute           comput
computation       comput
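A hash table is one structure that gives fast look-up. The sketch below uses the three pairs from the table; the function name `lookup_stem` and the fallback of returning an unknown term unchanged are assumptions of this example.

```python
# Word -> stem pairs taken from the table above.
STEM_TABLE = {
    "computer": "comput",
    "compute": "comput",
    "computation": "comput",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the term itself if it is not in the table."""
    return table.get(term.lower(), term)


print(lookup_stem("Computation"))
print(lookup_stem("thesaurus"))
```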
Successor Variety Stemming
   Hafer and Weiss (1974) “word segmentation by
letter successor varieties”, Information Storage
and Retrieval 10, 371-385.
   Main Idea: Determine word and morpheme
boundaries based on
   the distribution of phonemes in a large body of
utterances.
Note
   Morpheme: smallest meaningful part into which
a word can be divided
   Run-s contains two morphemes
   un-like-ly contains three morphemes
   Phoneme: unit of the system of sounds in a
language
   English has 24 consonant phonemes
Overall approach
   Hafer and Weiss use
   letters in place of phonemes
   texts in place of phonemically transcribed utterances
Formal Definition
   Let w be a word of length n
   w_i is the length-i prefix of w
   Let D be a collection of words
   D(w_i) is the subset of D containing the terms whose
first i letters match w_i exactly
   S(w_i), the successor variety of w_i, is the number
of distinct letters that occupy the (i+1)st
position of words in D(w_i).
   A test word of length n has n successor varieties
S(w_1), S(w_2), …, S(w_n).
Informal Definition
   The successor variety of a string in a collection
D of words is the number of different characters
that follow it in D.
   That is, it depends on
   the string
   the collection D of words under consideration
An example
   D={able, axle, accident, ape, about, be}
   The successor variety for
   a: 4 (b,x,c,p)
   ap: 1 (e)
   app: 0
   ab: 2 (l, o)
   b: 1 (e)
   Using a trie, the successor variety of a string is the
number of children of the node that the string
reaches in the trie (a terminal node is treated as
having one child)
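The counts in this example can be reproduced directly from the definition. This is a minimal sketch that scans the word list instead of building a trie; the function name and the `count_end` flag (which models the "terminal node has one child" convention) are assumptions of this example.

```python
def successor_variety(prefix, words, count_end=False):
    """Number of distinct letters that follow `prefix` in `words`.

    With count_end=True, a word that ends exactly at the prefix contributes
    one extra "end of word" successor, as in the trie convention.
    """
    followers = set()
    for word in words:
        if word.startswith(prefix):
            if len(word) > len(prefix):
                followers.add(word[len(prefix)])
            elif count_end:
                followers.add("$")  # end-of-word marker
    return len(followers)


D = ["able", "axle", "accident", "ape", "about", "be"]
for p in ["a", "ap", "app", "ab", "b"]:
    print(p, successor_variety(p, D))
```

A linear scan is enough for a toy collection; for a large corpus a trie avoids rescanning every word for every prefix.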
Trie for the corpus of data D
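[Figure: trie for D. The root branches on a and b. The a node branches on b, x, c and p. The ab node branches on l and o. Leaves: able, about, axle, accident, ape, be.]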
Segment in Words
   In a large body of text, the successor variety
of a substring usually decreases as each character
is added, until a segment boundary is reached
   Consider the following example
   r              3      (e, i, o)
   re             2      (a, d)
   rea            1      (d)
   read is a segment (or stem)
Selecting segments of words
   Cutoff method:
 a boundary is identified whenever some cutoff value is
reached.
   Peak and plateau method:
 a segment break is made after a character whose
successor variety is larger than that of both the
character immediately before and the character
immediately after it.
   Complete word method:
 a break is made after a segment if the segment is a
complete word in the corpus
   Entropy method:
 the cutoff method applied to an entropy defined for words.
Peak and Plateau Method
   r           3         (e, i, o)
   re          2         (a, d)
   rea         1         (d)
   the successor variety of “read” is 3, larger than
that of both “rea” and “reada”
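The peak test is easy to state in code. In this sketch the successor varieties are passed in as a list, one per prefix; the values for r, re, rea and read come from the example above, while the values used for the longer prefixes of "readable" are made up purely for illustration. The function names are also hypothetical.

```python
def peak_breaks(varieties):
    """Prefix lengths after which to cut.

    varieties[i] is the successor variety of the prefix of length i + 1.
    A cut is made after a prefix whose variety is strictly larger than
    that of both its neighbours.
    """
    breaks = []
    for i in range(1, len(varieties) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            breaks.append(i + 1)
    return breaks


def segment(word, varieties):
    """Split the word at every peak."""
    pieces, start = [], 0
    for cut in peak_breaks(varieties):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces


# r=3, re=2, rea=1, read=3 from the slide; the trailing 1s are assumed.
print(segment("readable", [3, 2, 1, 3, 1, 1, 1, 1]))
```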
Peak and Plateau Method
   Input: A document of many terms.
   Output: each term is segmented.
Stem method of Hafer and Weiss

   Determine the successor varieties of a word
   Use this information to segment the word using one of
the previous methods (say peak and plateau)
   Choose one of the segments as the stem
   if (first segment occurs in <= 12 words in the corpus)
       first segment is stem
   else
       // comment: first segment may be a prefix
       second segment is stem
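The selection rule can be sketched as follows. Two points are assumptions of this example rather than part of the slide: "occurs in a word" is read as "the word starts with the segment", and a word with a single segment is returned unchanged. The function name is hypothetical.

```python
def choose_stem(segments, corpus, max_words=12):
    """Pick the stem from a segmented word using the 12-word rule.

    A first segment shared by more than `max_words` corpus words is taken
    to be a prefix, so the second segment is used instead.
    """
    first = segments[0]
    if len(segments) == 1:
        return first
    occurrences = sum(1 for word in corpus if word.startswith(first))
    return first if occurrences <= max_words else segments[1]


small_corpus = ["readable", "reading", "reads"]
print(choose_stem(["read", "able"], small_corpus))
```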
Stem method of Hafer and Weiss

   Input: segmented word
   Output: the stem of the word
   For example:
 // “able” may be the output, depending on what
happens in the algorithm
Accessor Variety Method in
Chinese
   The notion was introduced by Feng, Chen, Zheng, Deng
for Chinese word extraction.
   The idea is similar to successor variety
   It is used for Chinese text segmentation, since it
is difficult to separate words in Chinese text. In
comparison, English words are separated by a space
symbol in text.
Definition: Accessor Variety
   We treat each Chinese character as a letter
   For each string (a potential word) consisting of several
consecutive characters, we define successor variety as
in English.
   Symmetrically, we also define a predecessor variety for
each string.
   A string is considered a word if it has a large successor
variety and a large predecessor variety.
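A minimal sketch of the two counts, assuming the text is one plain string with no punctuation handling. The function name, the sample sentence, and the choice to ignore text boundaries are all assumptions of this example, not details taken from the paper.

```python
def accessor_variety(s, text):
    """Return (predecessor variety, successor variety) of s in text.

    Counts the distinct characters seen immediately before and
    immediately after each occurrence of s. Occurrences at the very
    start or end of the text contribute nothing on that side.
    """
    before, after = set(), set()
    start = text.find(s)
    while start != -1:
        if start > 0:
            before.add(text[start - 1])
        end = start + len(s)
        if end < len(text):
            after.add(text[end])
        start = text.find(s, start + 1)
    return len(before), len(after)


# Hypothetical sample text: three occurrences of the candidate string.
text = "我喜欢学习他喜欢唱歌她不喜欢学习"
print(accessor_variety("喜欢", text))
```

A candidate would then be accepted as a word when both counts exceed some chosen threshold.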
Testing Results
   The accessor variety method turns out to be a very simple yet
efficient way to recognize Chinese words when
combined with some simple grammar rules.
   For details, see our paper:
   http://www.cs.cityu.edu.hk/~deng/5286/feng.pdf
Applications
   Finding new words created in e-communities.
   Use the extracted words as base units to do
segmentation.
Word similarity
   N-gram method:
   break a word of length n into (n-1) digrams, each a
substring of two adjacent characters of the word.
   Count the number of distinct digrams
   Let A (B) be the number of distinct digrams in
word 1 (2). Let C be the number of distinct
digrams shared by word 1 and word 2.
   The similarity of the two words is
   S=2C/(A+B)
Example of Word similarity
   statistics: st, ta, at, ti, is, st, ti, ic, cs
 its distinct digrams:
 at, cs, ic, is, st, ta, ti
   statistical: st, ta, at, ti, is, st, ti, ic, ca, al
 its distinct digrams:
 al, at, ca, ic, is, st, ta, ti
   A=7, B=8, C=6
   Similarity = 2x6/(7+8) = 12/15 = 4/5 = 80%
   One may build a similarity matrix for all words in a
corpus, calculated as above, and then apply a
cutoff value (set an entry to 0 if it is less than a certain
value, and to 1 otherwise)
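The calculation above can be reproduced in a few lines. This is a minimal sketch; the function names are made up for the example, and returning 0 for two words with no digrams is an assumed convention.

```python
def digrams(word):
    """Set of distinct two-character substrings of the word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def digram_similarity(word1, word2):
    """S = 2C / (A + B) over the distinct digrams of the two words."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0  # assumed convention for words shorter than two letters
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(digrams("statistics")))
print(sorted(digrams("statistical")))
print(digram_similarity("statistics", "statistical"))
```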
Project Question
   Would this method be useful in comparing
Chinese texts, without doing
segmentation?
Thesaurus
   Vocabulary control in an information
retrieval system
   Thesaurus construction
   Manual construction
   Automatic construction
Vocabulary control
   Standard vocabulary for both indexing
and searching (for the constructors of
the system and the users of the system)
Objectives of vocabulary control
   To promote the consistent representation of
subject matter by indexers and
searchers, thereby avoiding the dispersion of
related materials.
   To facilitate the conduct of a comprehensive
search on some topic by linking together
terms whose meanings are related
Thesaurus
   Not like a common dictionary
   which lists words with their explanations
   May contain all the words in a language
   Or only the words in a specific domain.
   With a lot of other information, especially the
relationships between words
   Classification of words in the language
   Word relationships like synonyms, antonyms
On-Line Thesaurus
   http://www.thesaurus.com
   http://www.dictionary.com/
   http://www.cogsci.princeton.edu/~wn/
Dictionary vs. Thesaurus
Look up “information” using http://www.thesaurus.com
Dictionary:
   in·for·ma·tion n.
   Knowledge derived from study, experience,
or instruction.
   Knowledge of specific events or situations that
has been gathered or received by communication;
intelligence or news. See Synonyms at
knowledge.
   ......
Thesaurus:
   [Nouns] information, enlightenment,
acquaintance ……
   [Verbs] tell; inform, inform of; acquaint,
acquaint with; impart, ……
   [Adjectives] informed; communique;
reported; published
Use of Thesaurus
   To control the terms used in
indexing: for a specific domain, use only
the terms in the thesaurus as indexing
terms
   To help users form proper queries
with the information contained in the
thesaurus
Construction of Thesaurus
   Stemming can be used to reduce the
size of the thesaurus
   A thesaurus can be constructed either manually or
automatically
WordNet: manually constructed
   WordNet® is an online lexical reference
system whose design is inspired by
current psycholinguistic theories of
human lexical memory. English nouns are
organized into synonym sets, each
representing one underlying lexical
concept.
Relations in WordNet
Automatic Thesaurus Construction

   A variety of methods can be used to
construct the thesaurus
   Term similarity can be used for
constructing the thesaurus
Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Doc1      0      4      0      0      0      2      1      3
Doc2      3      1      4      3      1      2      0      1
Doc3      3      0      0      0      3      0      3      0
Doc4      0      1      0      3      0      0      2      0
Doc5      2      2      2      3      1      4      0      2
The term-document relationship can be calculated using a variety of methods,
such as tf-idf
Term similarity can be calculated based on the term-document relationship,
 for example:
$$\mathrm{Sim}(\mathrm{Term}_i, \mathrm{Term}_j) = \sum_{\text{all documents } k} \mathrm{DocTerm}_{k,i} \cdot \mathrm{DocTerm}_{k,j}$$
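The similarity of two terms is the dot product of their columns in the document-term table. The sketch below uses the five documents and eight terms from the table; the names `DOC_TERM` and `term_similarity` are made up for this example, and terms are indexed from 0 (Term1 is index 0).

```python
# Rows are Doc1..Doc5, columns are Term1..Term8 (values from the table).
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]


def term_similarity(doc_term, i, j):
    """Sum over all documents k of DocTerm[k][i] * DocTerm[k][j]."""
    return sum(row[i] * row[j] for row in doc_term)


print(term_similarity(DOC_TERM, 0, 1))  # Term1 vs Term2
print(term_similarity(DOC_TERM, 0, 2))  # Term1 vs Term3
```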
Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Term1            7     16     15     14     14      9      7
Term2     7             8     12      3     18      6     17
Term3    16      8            18      6     16      0      8
Term4    15     12     18             6     18      6      9
Term5    14      3      6      6             6      9      3
Term6    14     18     16     18      6             2     16
Term7     9      6      0      6      9      2             3
Term8     7     17      8      9      3     16      3

Set threshold to 10
Complete Term Relation Method
[Figure: graph on terms T1 to T8, linking each pair of terms whose similarity exceeds the threshold]
Groups:
   T1,T3,T4,T6
   T1,T5
   T2,T4,T6
   T2,T6,T8
   T7
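The groups listed here coincide with the maximal cliques of the thresholded graph, i.e. the largest sets of terms in which every pair is related. Reading "complete term relation" this way is an interpretation, not something the slide states. The sketch below recomputes the similarities from the document-term table and enumerates the cliques with a basic Bron-Kerbosch search; all names are made up for the example.

```python
from itertools import combinations

# Rows are Doc1..Doc5, columns are Term1..Term8.
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]


def similarity(i, j):
    return sum(row[i] * row[j] for row in DOC_TERM)


def related_terms(threshold):
    """Adjacency sets: terms i and j are linked if similarity > threshold."""
    n = len(DOC_TERM[0])
    nbrs = {t: set() for t in range(n)}
    for i, j in combinations(range(n), 2):
        if similarity(i, j) > threshold:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def maximal_cliques(nbrs):
    """Basic Bron-Kerbosch enumeration (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & nbrs[v], x & nbrs[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(nbrs), set())
    return sorted(cliques)


def term_groups(threshold=10):
    cliques = maximal_cliques(related_terms(threshold))
    return [["T%d" % (t + 1) for t in clique] for clique in cliques]


for group in term_groups():
    print(",".join(group))
```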

