       Finding Word Groups in Spoken Dialogue
        with Narrow Context Based Similarities

              Leif Grönqvist & Magnus Gunnarsson

Presentation for the GSLT course: Statistical Methods 1

Växjö University, 2002-05-02: 16:00


2002-05-02               Växjö: Statistical Methods I     1
                 Background
   NordTalk and SweDanes:
    Jens Allwood, Elisabeth Ahlsén, Peter Juel
    Henrichsen, Leif & Magnus
   Comparable Danish and Swedish corpora
   1.3 MToken each, natural spoken interaction
   We are mainly working with spoken language –
    not written


             Peter Juel Henrichsen’s ideas
   Words with similar context distributions are
    called Siblings
   Some pairs (seed pairs) of Swedish and Danish
    words with "the same" meaning are carefully
    selected: Cousins
   Groups of siblings in each corpus together with
    seed pairs give new probable cousins.


             Siblings as word groups
   Drop the Cousins for now – focus on Siblings
   Traditional parts-of-speech are not necessarily
    valid
   What we have is the corpus. Only the corpus
   We will take information from the 1+1-word
    context (one word to the left, one to the right)
   Nothing else: no morphology, no lexica
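The 1+1-word context above can be counted in a few lines of C. This is an illustrative sketch, not the authors' code (their implementation was a Perl script, later rewritten in C); the function name `context_counts`, the token-id encoding, and the choice to merge left and right counts into one vector are all assumptions.

```c
#include <string.h>

/* Sketch of the 1+1-word context: for a target word, count how often
   each vocabulary word occurs immediately to its left or right.
   corpus[] is the transcript as an array of token ids in 0..vocab-1.
   (A real implementation might keep left and right counts separate.) */
void context_counts(const int *corpus, size_t len, int target,
                    long *counts, size_t vocab)
{
    memset(counts, 0, vocab * sizeof *counts);
    for (size_t i = 0; i < len; i++) {
        if (corpus[i] != target)
            continue;
        if (i > 0)
            counts[corpus[i - 1]]++;   /* left neighbour */
        if (i + 1 < len)
            counts[corpus[i + 1]]++;   /* right neighbour */
    }
}
```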


             The original Sibling formula

   [formula shown as an image in the original slides]




         Improvements to the Sibling measure

       Symmetry: sib(x1, x2) = sib(x2, x1)
       Similarity should be possible even if the context on
        one of the sides differs




             Trees instead of groups
      Iterative use of the ggsib similarity measure
1.     Calculate ggsib between all word pairs above a
       frequency threshold
2.     Pairs with similarity above a rather high score
       threshold Sth are collected in a list L
3.     For each pair in L: replace the less frequent of
       the words with the other, in the corpus


        Trees instead of groups (cont.)
4.     If L is empty: decrement Sth slightly
5.     Run from step 1 again if Sth is above the
       minimum score threshold.

      The result may be interpreted as trees
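Steps 1-5 can be sketched as follows. This is a simplification: the toy `sim` matrix stands in for precomputed ggsib scores and is not recomputed after each merge round, whereas the real procedure re-runs step 1 on the rewritten corpus. The names `merge_siblings` and `NW` and the fixed-size arrays are illustrative.

```c
#define NW 4  /* toy vocabulary size for this sketch */

/* sim[][] plays the role of ggsib scores, freq[] the corpus
   frequencies.  Each qualifying pair has its less frequent word
   replaced by the more frequent one; parent[] records the
   replacements, which is the tree structure the slides mention. */
void merge_siblings(double sim[NW][NW], long freq[NW], int parent[NW],
                    double sth, double sth_min, double step)
{
    for (int i = 0; i < NW; i++)
        parent[i] = i;                    /* every word starts as a root */
    while (sth >= sth_min) {
        int merged = 0;
        for (int i = 0; i < NW; i++) {
            if (parent[i] != i) continue; /* already replaced */
            for (int j = i + 1; j < NW; j++) {
                if (parent[j] != j) continue;
                if (sim[i][j] >= sth) {   /* pair would enter list L */
                    int lo = (freq[i] < freq[j]) ? i : j;
                    int hi = (lo == i) ? j : i;
                    parent[lo] = hi;      /* replace the rarer word */
                    freq[hi] += freq[lo];
                    merged = 1;
                    if (lo == i)
                        break;            /* i is gone; next i */
                }
            }
        }
        if (!merged)
            sth -= step;                  /* L empty: lower threshold */
    }
}
```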




             An example tree

   [example tree shown as an image in the original slides]




                Implementation
   Easy to implement: Peter made a Perl script
   But… One step in the iteration with ~5000
    word types took 100 hours
   Our heavily optimized C program ran in less
    than 60 minutes, and 100 iterations in less than
    100 hours




         Most important optimizations
Starting point: we have enough memory but
  not enough time
   A compiled low-level language instead of an
    interpreted high-level one
   Frequencies for words and word pairs are stored
    in letter trees instead of hash tables
   Try to move computation and counting outwards
    in the loop hierarchy


             Optimizations (letter trees)
   Retrieving information from the letter trees is
    done in constant time with respect to the size of
    the lexicon (and without the hashing overhead of
    hash tables)
   Lookup is linear in the length of the word, but
    average word length stays roughly constant as the
    lexicon grows
   Another drawback: our example needs 1 GB to
    run (each node in the tree is an array over all
    possible characters), but who cares.
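A minimal letter tree (trie) along these lines might look as below. The names and the 128-slot ASCII child array are assumptions for the sketch (the real program handled Swedish and Danish characters, hence the larger footprint); the point the slide makes survives: lookup walks one node per character, so its cost depends on word length only, never on lexicon size.

```c
#include <stdlib.h>

#define ALPHA 128  /* child slot per possible character, as the slide
                      describes; 7-bit ASCII assumed for this sketch */

/* One node per prefix; freq counts occurrences of the word ending
   here.  The per-node child array is what makes the tree both fast
   and memory-hungry. */
typedef struct TrieNode {
    struct TrieNode *child[ALPHA];
    long freq;
} TrieNode;

static TrieNode *node_new(void)
{
    return calloc(1, sizeof(TrieNode));  /* children NULL, freq 0 */
}

void trie_add(TrieNode *root, const char *word)
{
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        if (!root->child[*p])
            root->child[*p] = node_new();
        root = root->child[*p];
    }
    root->freq++;                        /* count one occurrence */
}

long trie_freq(const TrieNode *root, const char *word)
{
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        root = root->child[*p];
        if (!root)
            return 0;                    /* word never seen */
    }
    return root->freq;
}
```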

             Optimizations (more)
   An example of moving computation to an outer
    loop: the set of all context words for a word is
    calculated once, then reused for comparisons
    with all other words
   The set may be stored as an array of pointers into
    the letter tree (to the nodes between the words of
    a word pair)



             Personal pronouns

   [tree of personal pronouns shown as an image in the original slides]




             Colours

   [tree of colour words shown as an image in the original slides]




                       Problems
       Sparse data
       Homonyms
       When to stop
       Memory and time complexity




                  Conclusions
   Our method is an interesting way of finding
    word groups
   It works for all kinds of words (syncategorematic
    as well as categorematic)
   Difficult to handle low-frequency words and
    homonyms



