# Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities

```
Finding Word Groups in Spoken Dialogue
with Narrow Context Based Similarities

Presentation for the GSLT course: Statistical Methods 1

Växjö University, 2002-05-02: 16:00

2002-05-02               Växjö: Statistical Methods I     1
Background
- NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
- Comparable Danish and Swedish corpora
- 1.3 MToken each, natural spoken interaction
- We are mainly working with spoken language, not written

Peter Juel Henrichsen's ideas
- Words with similar context distributions are called Siblings
- Some pairs (seed pairs) of Swedish and Danish words with "the same" meaning are carefully selected: Cousins
- Groups of siblings in each corpus, together with the seed pairs, give new probable cousins

Siblings as word groups
- Drop the Cousins for now; focus on Siblings
- Traditional parts of speech are not necessarily valid
- What we have is the corpus, and only the corpus
- We take our information from the 1+1 word context (one word to the left, one to the right)
- Nothing else, such as morphology or lexica

The original Sibling formula
[formula shown as a figure in the original slides; not reproduced in this text sample]

Improvements of the Sibling measure
- Symmetry: sib(x1, x2) = sib(x2, x1)
- Similarity should be possible even if the context on one of the sides is different

Iterative use of the ggsib similarity measure
1. Calculate ggsib between all word pairs above a frequency threshold
2. Collect pairs with similarity above a rather high score threshold Sth in a list L
3. For each pair in L: replace the less frequent of the two words with the other, throughout the corpus

4. If L is empty: decrement Sth slightly
5. Run from step 1 again if Sth is still above a lowest score threshold

The result may be interpreted as trees

An example tree
[tree figure not reproduced in this text sample]

Implementation
- Easy to implement: Peter made a Perl script
- But one step of the iteration with ~5000 word types took 100 hours
- Our heavily optimized C program ran a step in less than 60 minutes, and 100 iterations in less than 100 hours

Most important optimizations
Starting point: we have enough memory, but not enough time
- A compiled low-level language instead of an interpreted high-level one
- Frequencies for words and word pairs are stored in letter trees instead of hash tables
- Try to move computation and counting outward in the loop hierarchy

Optimizations (letter trees)
- Retrieving information from the letter trees takes constant time with respect to the size of the lexicon (compared to log(n) for hash tables)
- Lookup is linear in the average word length, but that stays roughly constant as the lexicon grows
- Another drawback: our example needs 1 GB to run (each node in the tree is an array over all possible characters), but who cares

Optimizations (more)
- An example of moving computation to an outer loop: calculate the set of all context words once, and reuse it for the comparisons with all other words
- The set may be stored as an array of pointers to nodes (between words in word pairs) in the letter tree

Personal pronouns
[result figures not reproduced in this text sample]

Colours
[result figure not reproduced in this text sample]

Problems
- Sparse data
- Homonyms
- When to stop
- Memory and time complexity

Conclusions
- Our method is an interesting way of finding word groups
- It works for all kinds of words (syncategorematic as well as categorematic)
- Low-frequency words and homonyms are difficult to handle


```
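The deck defines Siblings by similar 1+1 word context distributions, but the sib formula itself only appears as a figure in the original slides. As a stand-in, here is a minimal C sketch that scores two words by how many left/right neighbour words they share; the function names and the crude overlap count are illustrative assumptions, not the actual sib/ggsib measure. Note that, like the improved measure on the slides, it is symmetric in its two arguments.

```c
#include <string.h>

#define MAXCTX 64

/* Collect the distinct 1+1 context words (left and right neighbours)
   of `word` in a tokenised corpus. Returns the number collected. */
static int contexts(const char **tok, int n, const char *word,
                    const char *ctx[], int max)
{
    int k = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(tok[i], word) != 0) continue;
        const char *nb[2] = { i > 0 ? tok[i - 1] : NULL,
                              i < n - 1 ? tok[i + 1] : NULL };
        for (int j = 0; j < 2; j++) {
            if (!nb[j]) continue;
            int seen = 0;
            for (int m = 0; m < k; m++)
                if (strcmp(ctx[m], nb[j]) == 0) { seen = 1; break; }
            if (!seen && k < max) ctx[k++] = nb[j];
        }
    }
    return k;
}

/* Crude context-overlap score |C(w1) ∩ C(w2)| — an illustrative
   stand-in for the sib measure, not the measure itself. */
int shared_contexts(const char **tok, int n, const char *w1, const char *w2)
{
    const char *c1[MAXCTX], *c2[MAXCTX];
    int k1 = contexts(tok, n, w1, c1, MAXCTX);
    int k2 = contexts(tok, n, w2, c2, MAXCTX);
    int shared = 0;
    for (int i = 0; i < k1; i++)
        for (int j = 0; j < k2; j++)
            if (strcmp(c1[i], c2[j]) == 0) { shared++; break; }
    return shared;
}
```

On a toy corpus like "i see a red car and a blue car", the words "red" and "blue" share both context words ("a" on the left, "car" on the right), so they would score as sibling candidates.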
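Step 3 of the slides' iterative procedure, replacing the less frequent word of a high-scoring pair throughout the corpus, can be sketched as below. The function names are mine; recording which word absorbed which over successive iterations is what yields the merge trees the slides mention.

```c
#include <string.h>

/* Frequency of `word` in a tokenised corpus. */
static int freq(const char **tok, int n, const char *word)
{
    int c = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(tok[i], word) == 0) c++;
    return c;
}

/* Given a high-scoring pair (a, b), replace every occurrence of the
   less frequent word with the other one, in place. Returns the
   surviving word; survivor->victim links form the result trees. */
const char *merge_pair(const char **tok, int n, const char *a, const char *b)
{
    const char *keep = freq(tok, n, a) >= freq(tok, n, b) ? a : b;
    const char *drop = (keep == a) ? b : a;
    for (int i = 0; i < n; i++)
        if (strcmp(tok[i], drop) == 0) tok[i] = keep;
    return keep;
}
```

The outer loop of the slides would call this for every pair in L, then lower Sth slightly whenever L comes back empty, stopping once Sth falls below the lowest score threshold.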
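The letter trees on the "Optimizations (letter trees)" slide are tries whose nodes hold one child slot per possible character, which is exactly the memory-for-time trade the slide describes: lookup cost depends only on word length, never on lexicon size. A minimal sketch in C (the names and the byte-indexed child array are assumptions, not the original program):

```c
#include <stdlib.h>

#define ALPHA 256  /* one child slot per possible byte: fast but memory-hungry */

typedef struct Node {
    struct Node *child[ALPHA];
    int count;                 /* word (or word-pair) frequency */
} Node;

static Node *node_new(void) { return calloc(1, sizeof(Node)); }

/* Walk/extend the path for `word` and bump its counter; time is
   linear in the word's length and independent of how many words
   the tree already holds. */
Node *trie_add(Node *root, const char *word)
{
    Node *p = root;
    for (; *word; word++) {
        unsigned char c = (unsigned char)*word;
        if (!p->child[c]) p->child[c] = node_new();
        p = p->child[c];
    }
    p->count++;
    return p;
}

int trie_count(const Node *root, const char *word)
{
    const Node *p = root;
    for (; p && *word; word++)
        p = p->child[(unsigned char)*word];
    return p ? p->count : 0;
}
```

Each node costs a full pointer array regardless of how many children are actually used, which is why the slides report a roughly 1 GB footprint for their corpus.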
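The last optimization slide suggests computing the set of context words once and keeping it as an array of pointers into the letter tree, so the hot inner loop never repeats a lookup. A small self-contained illustration of that pointer-caching idea, using a flat lexicon array as a stand-in for the tree (all names here are hypothetical):

```c
#include <string.h>

typedef struct { const char *word; int pair_count; } Entry;

/* Stand-in for a letter-tree lookup: find the counter for `word`. */
static Entry *lookup(Entry *lex, int n, const char *word)
{
    for (int i = 0; i < n; i++)
        if (strcmp(lex[i].word, word) == 0) return &lex[i];
    return NULL;
}

/* Resolve each context word once (the expensive part), keep the
   pointers, and let the hot loop touch only the cached pointers. */
void count_with_cache(Entry *lex, int nlex,
                      const char **ctx, int nctx, int reps)
{
    Entry *cache[64];
    for (int i = 0; i < nctx && i < 64; i++)
        cache[i] = lookup(lex, nlex, ctx[i]);      /* one lookup per word... */
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < nctx && i < 64; i++)
            if (cache[i]) cache[i]->pair_count++;  /* ...reused every pass */
}
```

The design point is the same as on the slide: the lookup cost is paid once in the outer loop, while the inner comparisons against all other words only dereference cached pointers.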