
									                      CS5286 Search Engine Technology and Algorithms/Xiaotie Deng




Lecture 6: Linguistic Methods for Searching
   Stemming
   Thesaurus
       Online resources
       Automatic construction of thesaurus




Outline of Stemming Methods
   Goal of Stemming Process
   Algorithm
       Affix Removal (Porter’s Algorithm)
       Dictionary Look-up Stemmers
       Successor Variety
       n-Gram Stemming
   Applications




The advantage
   Stemming was originally designed to improve
    performance by reducing the demand on
    system resources.
       With the continued significant increase in
        storage and computing power, using
        stemming for performance reasons is no
        longer as important.




Other Potentials
   Stemming may improve recall.
       There may be an associated decline in precision.
   System designers make their own choice of
    whether to include stemming.
       Google does not use stemming
       HotBot offers word stemming as a user
        option




          Porter Stemming Algorithm
   The Porter algorithm is the most commonly
    accepted stemming algorithm.
   It is based upon a set of conditions on the stem,
    suffix and prefix, and associated actions taken
    when a condition holds.
   See, e.g.,
       http://www.tartarus.org/~martin/PorterStemmer/




    Porter Stemming (Condition)
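   m, the measure of a stem, counts the
    sequences of vowels (a, e, i, o, u, and y when it
    follows a consonant) followed by consonants.
   A stem has the form [C](VC)^m[V], where C is a run
    of consonants, V is a run of vowels, the initial C
    and final V are optional, and m is the number of
    times VC repeats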

    Measure          Example
    m=0              free, why
    m=1              frees, whose
    m=2              prologue, compute
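
The measure can be computed with a single pass over the stem. The following is a minimal sketch, assuming lowercase ASCII input; the function names `is_consonant` and `measure` are illustrative and are not taken from the reference implementation.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] acts as a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant (as in "why").
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the VC repetitions m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_was_vowel = not consonant
    return m


for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```

Running the loop reproduces the table: m=0 for free and why, m=1 for frees and whose, m=2 for prologue and compute.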




Porter Stemming (Condition)
   *<X>     - stem ends with letter X
   *v*      - stem contains a vowel
   *d       - stem ends in a double consonant
   *o       - stem ends with a consonant-vowel-consonant
              sequence where the final consonant is not
              w, x, or y
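
Each condition is a small predicate on the stem. The sketch below is one possible encoding, assuming lowercase ASCII stems; the function names are hypothetical and chosen to mirror the notation above.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's consonant test: 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with_letter(stem: str, letter: str) -> bool:
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem: str) -> bool:
    """*v*: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:
    """*d: the stem ends in two identical consonants."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem: str) -> bool:
    """*o: the stem ends consonant-vowel-consonant, last letter not w, x, y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


print(contains_vowel("mak"), ends_double_consonant("controll"), ends_cvc("hop"))
```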



     Rules
Step   Condition   Suffix   Replacement   Example
1a     NULL        sses     ss            stresses -> stress
1b     *v*         ing      NULL          making -> mak
1b1    NULL        at       ate           inflat(ed) -> inflate
1c     *v*         y        i             happy -> happi



     Rules (continued)
Step   Condition              Suffix   Replacement     Example
2      m>0                    aliti    al              formaliti -> formal
3      m>0                    icate    ic              duplicate -> duplic
4      m>1                    able     NULL            adjustable -> adjust
5a     m>1                    e        NULL            inflate -> inflat
5b     m>1 and *d and *<L>    NULL     single letter   controll -> control




Example
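   duplicatable
       duplicat               rule 4
       duplicate              rule 1b1
       duplic                 rule 3
   Note: this trace applies the rules repeatedly for illustration.
    The published algorithm runs the steps once, in the fixed order
    1a to 5b, so it stops at duplicat.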




    Dictionary Look-Up Stemmer
   A dictionary stores, for every word, the pairing
    of the word and its stem.
   The structure of the dictionary should be
    designed for fast look-up

              TERM              STEM
              computer          comput
              compute           comput
              computation       comput
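
A hash table is one structure that gives fast look-up. The sketch below uses a Python `dict` holding the three pairs from the table; `STEM_TABLE` and `lookup_stem` are illustrative names, and returning the word unchanged when it is missing from the dictionary is an assumption, not something the slide specifies.

```python
# Term -> stem pairs taken from the table above.
STEM_TABLE = {
    "computer": "comput",
    "compute": "comput",
    "computation": "comput",
}


def lookup_stem(term: str, table: dict = STEM_TABLE) -> str:
    """Return the stored stem, or the lowercased term if it is not listed."""
    key = term.lower()
    return table.get(key, key)


print(lookup_stem("Computer"))
print(lookup_stem("search"))
```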




    Successor Variety Stemming
   Hafer and Weiss (1974), “Word segmentation by
    letter successor varieties”, Information Storage
    and Retrieval 10, 371-385.
   Main Idea: Determine word and morpheme
    boundaries based on
       the distribution of phonemes in a large body of
        utterances.




    Note
   Morpheme: smallest meaningful part into which
    a word can be divided
          Run-s contains two morphemes
          un-like-ly contains three morphemes
   Phoneme: unit of the system of sounds in a
    language
          English has 24 consonant phonemes




    Overall approach
   Hafer and Weiss use
          letters in place of phonemes
          texts in place of phonemically transcribed utterances


Formal Definition
   Let w be a word of length n
   w_i is the length-i prefix of w
   Let D be a collection of words
   D(w_i) is the subset of D containing the terms whose
    first i letters match w_i exactly
   S(w_i), the successor variety of w_i, is the number
    of distinct letters that occupy the (i+1)st
    position of words in D(w_i).
   A test word of length n has n successor varieties
    S(w_1), S(w_2), …, S(w_n).


Informal Definition
   The successor variety of a string in a collection
    D of words is the number of different characters
    that follow it in D.
   That is, it depends on
       the string
       the collection D of words under consideration


An example
   D={able, axle, accident, ape, about, be}
   The successor variety for
       a: 4 (b,x,c,p)
       ap: 1 (e)
       app: 0
       ab: 2 (l, o)
       b: 1 (e)
   Using a trie, the successor variety of a string is the
    number of children of the node that the string
    reaches in the trie (a terminal node is treated as
    having one child)
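
The counts above can be checked with a short function. This is a minimal sketch that scans the word list directly instead of building a trie; the name `successor_variety` is illustrative, and treating a complete word with no continuation as having one child follows the convention stated on the slide.

```python
def successor_variety(prefix: str, words: list) -> int:
    """Number of distinct letters that follow `prefix` in the word list."""
    successors = {
        w[len(prefix)]
        for w in words
        if w.startswith(prefix) and len(w) > len(prefix)
    }
    if successors:
        return len(successors)
    # A complete word with no continuation counts as one child;
    # a string that never occurs as a prefix has variety 0.
    return 1 if prefix in words else 0


D = ["able", "axle", "accident", "ape", "about", "be"]
for s in ["a", "ap", "app", "ab", "b"]:
    print(s, successor_variety(s, D))
```

A trie gives the same numbers with one node visit per character, which matters once D is a large corpus.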




       Trie for the corpus of data D
   (root)
      a
         b
            l -> able
            o -> about
         c -> accident
         p -> ape
         x -> axle
      b -> be


Segment in Words
   In a large body of text, the successor variety
    of a substring usually decreases as each character
    is added, until a segment boundary is reached
   Consider the following example
       D={able, ape, beatable, fixable, read, readable,
        reading, reads, red, rope, ripe}
       r              3      (e,i,o)
       re             2      (a,d)
       rea            1      (d)
       read           3      (a,i,s)
   read is a segment (or stem)


Selecting segments of words
   Cutoff method
       a boundary is identified whenever the successor
        variety reaches some cutoff value.
   Peak and plateau method
       a segment break is made after a character whose
        successor variety is larger than that of both the
        character immediately before and the character
        immediately after it.
   Complete word method
       a break is made after a segment if the segment is a
        complete word in the corpus.
   Entropy method
       the cutoff method applied to the entropy of the
        distribution of successor letters rather than to
        the raw successor variety.


Peak and Plateau Method
   D={able, ape, beatable, fixable, read, readable,
    reading, reads, red, rope, ripe}
       r           3         (e,i,o)
       re          2         (a,d)
       rea         1         (d)
       read        3         (a,i,s)
       reada       1         (b)
       readab      1         (l)
       readabl     1         (e)
       readable    1         (blank)
   The successor variety of “read” is 3, which is larger
    than that of both “rea” (1) and “reada” (1), so a
    break is made after “read”.
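
The peak test can be sketched in a few lines. The code below is an illustration under stated assumptions: it scans a word list rather than a trie, it only breaks at interior positions that have both a predecessor and a successor value, and the names `successor_variety` and `peak_plateau_segments` are hypothetical.

```python
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]


def successor_variety(prefix: str, words: list) -> int:
    """Distinct letters following `prefix`; a terminal word counts as 1."""
    successors = {
        w[len(prefix)]
        for w in words
        if w.startswith(prefix) and len(w) > len(prefix)
    }
    if successors:
        return len(successors)
    return 1 if prefix in words else 0


def peak_plateau_segments(word: str, words: list) -> list:
    """Break after every prefix whose variety exceeds both neighbours."""
    # sv[i] is the successor variety of the prefix of length i + 1.
    sv = [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]
    segments, start = [], 0
    for i in range(1, len(sv) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


print(peak_plateau_segments("readable", CORPUS))
```

For "readable" the varieties are 3, 2, 1, 3, 1, 1, 1, 1, so the only peak is after "read" and the word splits into read and able.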


Peak and Plateau Method
   Input: a document of many terms.
   Output: each term, segmented.
   E.g., the output for readable is read-able


Stem method of Hafer and Weiss

   Determine the successor varieties of a word
   Use this information to segment the word using one of
    the previous methods (say peak and plateau)
   Choose one of the segments as the stem
       if (first segment occurs in <= 12 words in the corpus)
            first segment is stem
       else
            //comment: the first segment is probably a prefix
            second segment is stem
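
The selection rule translates directly into code. In this sketch, "occurs in" is interpreted as "the corpus word begins with the segment", which is an assumption; `choose_stem` and its `max_words` parameter are illustrative names, with 12 as the default taken from the rule above.

```python
def choose_stem(segments: list, words: list, max_words: int = 12) -> str:
    """Pick the first segment unless it is frequent enough to be a prefix."""
    first = segments[0]
    # Assumption: a segment "occurs in" a word when the word starts with it.
    occurrences = sum(1 for w in words if w.startswith(first))
    if occurrences <= max_words or len(segments) == 1:
        return first
    # A very frequent first segment is probably a prefix, so skip it.
    return segments[1]


corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
print(choose_stem(["read", "able"], corpus))
print(choose_stem(["read", "able"], corpus, max_words=3))
```

With the default threshold the stem is read; lowering `max_words` to 3 makes read (which starts four corpus words) look like a prefix, so able is returned instead.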


Stem method of Hafer and Weiss

   Input: segmented word
   Output: the stem of the word
   For example:
       read-able is the input
       read is the output
       (able may be the output instead, depending on how
        often read occurs in the corpus)


Accessor Variety Method in Chinese
   The notion was introduced by Feng, Chen, Zheng, Deng
    for Chinese word extraction.
   The idea is similar to successor variety
   It is used to determine Chinese text segmentation, since it
    is difficult to separate words in Chinese text. In
    comparison, English words are separated by a space
    symbol in text.




Definition: Accessor Variety
   We treat each Chinese character as a letter
   For each string (a potential word) consisting of several
    consecutive characters, we define successor variety as
    in English.
   Symmetrically, we also define a predecessor variety for
    each string.
   A string is considered a word if it has a large successor
    variety and a large predecessor variety.
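
The definition can be illustrated with a small sketch. To keep the example in plain ASCII, it uses unsegmented English strings as a stand-in for Chinese text, with each character treated as a letter. The names `accessor_variety` and `looks_like_word`, the threshold of 3, and ignoring sentence boundaries are all assumptions made for illustration, not details from the paper.

```python
def accessor_variety(candidate: str, sentences: list) -> tuple:
    """Return (predecessor variety, successor variety) of a candidate string."""
    predecessors, successors = set(), set()
    for s in sentences:
        start = s.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                predecessors.add(s[start - 1])  # character just before
            if end < len(s):
                successors.add(s[end])          # character just after
            start = s.find(candidate, start + 1)
    return len(predecessors), len(successors)


def looks_like_word(candidate: str, sentences: list, threshold: int = 3) -> bool:
    """Accept the string when both varieties reach the threshold."""
    pv, sv = accessor_variety(candidate, sentences)
    return min(pv, sv) >= threshold


TEXT = ["mycatsat", "acatran", "thecatate"]
print(accessor_variety("cat", TEXT), looks_like_word("cat", TEXT))
print(accessor_variety("ca", TEXT), looks_like_word("ca", TEXT))
```

Here "cat" is preceded and followed by three different characters, while the fragment "ca" is always followed by "t", so only "cat" passes the test.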




Testing Results
   The accessor variety method turns out to be a very simple yet
    efficient way to recognize Chinese words when
    combined with some simple grammar rules.
   For details, see our paper:
   http://www.cs.cityu.edu.hk/~deng/5286/feng.pdf




Applications
   Finding new words created in e-communities.
   Use the extracted words as base units to do
    segmentation.


Word similarity
   N-gram method:
       break a word of length n into (n-1) digrams, each a
        substring of two consecutive characters of the word.
       Count the number of distinct digrams
   Let A (B) be the number of distinct digrams in
    word 1 (2). Let C be the number of distinct
    digrams shared by word 1 and word 2.
   The similarity of the two words is
   S = 2C/(A+B)


Example of Word similarity
   statistics: st, ta, at, ti, is, st, ti, ic, cs
       its distinct digrams:
       at, cs, ic, is, st, ta, ti
   statistical: st, ta, at, ti, is, st, ti, ic, ca, al
       its distinct digrams:
       al, at, ca, ic, is, st, ta, ti
   A=7, B=8, C=6
   Similarity = 2x6/(7+8) = 12/15 = 4/5 = 80%
   One may build a similarity matrix of all words in a
    corpus, calculated as above, and then apply a
    cutoff value (set an entry to 0 if it is less than a
    certain value, and to 1 otherwise)
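
The calculation is easy to reproduce. The sketch below computes S = 2C/(A+B), a measure commonly known as the Dice coefficient, and then builds the thresholded matrix; the function names and the cutoff of 0.6 are illustrative assumptions.

```python
def digrams(word: str) -> set:
    """Distinct two-character substrings of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def digram_similarity(word1: str, word2: str) -> float:
    """S = 2C / (A + B) over the distinct digrams of the two words."""
    d1, d2 = digrams(word1), digrams(word2)
    if not d1 and not d2:
        return 0.0  # words too short to have any digram
    return 2 * len(d1 & d2) / (len(d1) + len(d2))


def binary_similarity_matrix(words: list, cutoff: float) -> list:
    """1 where the similarity reaches the cutoff, 0 otherwise."""
    return [[1 if digram_similarity(a, b) >= cutoff else 0 for b in words]
            for a in words]


print(digram_similarity("statistics", "statistical"))
print(binary_similarity_matrix(["statistics", "statistical", "search"], 0.6))
```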


Project Question
   Would this method be useful in comparing
    Chinese texts, without doing
    segmentation?




Thesaurus
   Vocabulary control in an information
    retrieval system
   Thesaurus construction
       Manual construction
       Automatic construction




Vocabulary control
   Standard vocabulary for both indexing
    and searching (for the constructors of
    the system and the users of the system)




Objectives of vocabulary control
   To promote the consistent representation of
    subject matter by indexers and
    searchers, thereby avoiding the dispersion of
    related materials.
   To facilitate the conduct of a comprehensive
    search on some topic by linking together
    terms whose meanings are related
    paradigmatically.




Thesaurus
   Unlike a common dictionary
       which lists words with their explanations
   May contain all the words in a language
   Or only the words in a specific domain.
   Carries a lot of other information, especially the
    relationships between words
       Classification of words in the language
       Word relationships such as synonyms, antonyms




On-Line Thesaurus
   http://www.thesaurus.com
   http://www.dictionary.com/
   http://www.cogsci.princeton.edu/~wn/




     Dictionary vs. Thesaurus
Example: look up "information" (see http://www.thesaurus.com)
Dictionary
   in·for·ma·tion
    n.
        Knowledge derived from study, experience,
         or instruction.
        Knowledge of specific events or situations that
         has been gathered or received by
         communication; intelligence or news.
         See Synonyms at knowledge.
        ......
Thesaurus
   [Nouns] information, enlightenment,
    acquaintance ……
   [Verbs] tell; inform, inform of; acquaint,
    acquaint with; impart, ……
   [Adjectives] informed; communique;
    reported; published




Use of Thesaurus
   To control the terms used in
    indexing: for a specific domain, use only
    the terms in the thesaurus as indexing
    terms
   To assist users in forming proper queries
    with the help of the information contained in the
    thesaurus




Construction of Thesaurus
   Stemming can be used to reduce the
    size of the thesaurus
   Can be constructed either manually or
    automatically




WordNet: manually constructed
   WordNet® is an online lexical reference
    system whose design is inspired by
    current psycholinguistic theories of
    human lexical memory. English nouns,
    verbs, adjectives and adverbs are
    organized into synonym sets, each
    representing one underlying lexical
    concept. Different relations link the
    synonym sets.




Relations in WordNet




Automatic Thesaurus Construction

   A variety of methods can be used in
    constructing the thesaurus
   Term similarity can be used for
    constructing the thesaurus




           Complete Term Relation Method
         Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
  Doc1     0      4      0      0      0      2      1      3
  Doc2     3      1      4      3      1      2      0      1
  Doc3     3      0      0      0      3      0      3      0
  Doc4     0      1      0      3      0      0      2      0
  Doc5     2      2      2      3      1      4      0      2
   The term-document relationship can be calculated using a variety of methods
         such as tf-idf
   Term similarity can be calculated based on the term-document relationship,
          for example:
\[ \mathrm{Sim}(\mathrm{Term}_i, \mathrm{Term}_j) = \sum_{k \,\in\, \text{all documents}} \mathrm{DocTerm}_{k,i} \, \mathrm{DocTerm}_{k,j} \]
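
The formula is the dot product of two term columns of the document-term table. A minimal sketch in plain Python, using the five-document table above, follows; `DOC_TERM` and `term_similarity` are illustrative names.

```python
# Rows are Doc1..Doc5, columns are Term1..Term8 (values from the table above).
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]


def term_similarity(doc_term: list) -> list:
    """Sim(i, j) = sum over documents k of DocTerm[k][i] * DocTerm[k][j]."""
    n_terms = len(doc_term[0])
    return [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]


SIM = term_similarity(DOC_TERM)
print(SIM[0][1], SIM[0][2], SIM[2][6])  # Term1-Term2, Term1-Term3, Term3-Term7
```

The diagonal entries are each term's similarity with itself and are usually ignored when grouping terms.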




   Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Term1     -      7     16     15     14     14      9      7
Term2     7      -      8     12      3     18      6     17
Term3    16      8      -     18      6     16      0      8
Term4    15     12     18      -      6     18      6      9
Term5    14      3      6      6      -      6      9      3
Term6    14     18     16     18      6      -      2     16
Term7     9      6      0      6      9      2      -      3
Term8     7     17      8      9      3     16      3      -


          Set threshold to 10




      Complete Term Relation Method
   [Figure: graph over terms T1 to T8, with an edge between two
    terms whose similarity meets the threshold]
   Groups
       T1, T3, T4, T6
       T1, T5
       T2, T4, T6
       T2, T6, T8
       T7
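
The groups are the maximal sets of terms in which every pair is related. One way to find them is a maximal-clique search; the sketch below uses the basic Bron-Kerbosch recursion, which is a technique swapped in for illustration since the slides do not name an algorithm. It also assumes that "related" means a similarity of at least the threshold, and the function names are hypothetical.

```python
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]


def related_terms(doc_term: list, threshold: int = 10) -> dict:
    """Adjacency sets (1-based term numbers) for pairs with sim >= threshold."""
    n = len(doc_term[0])
    adj = {t: set() for t in range(1, n + 1)}
    for i in range(n):
        for j in range(i + 1, n):
            sim = sum(row[i] * row[j] for row in doc_term)
            if sim >= threshold:  # assumption: the threshold is inclusive
                adj[i + 1].add(j + 1)
                adj[j + 1].add(i + 1)
    return adj


def maximal_cliques(adj: dict) -> list:
    """Basic Bron-Kerbosch enumeration of maximal cliques."""
    cliques = []

    def expand(r: set, p: set, x: set) -> None:
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended any further
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


GROUPS = maximal_cliques(related_terms(DOC_TERM, threshold=10))
for group in GROUPS:
    print(", ".join(f"T{t}" for t in group))
```

A term can belong to several groups (T6 appears in three), and a term with no related partner, such as T7, forms a group of its own.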

								