Linguistica


 This presentation borrows heavily from
  slides written by John Goldsmith who has
  graciously given me permission to use
  them. Thanks, John.
 He also says I should enjoy my trip, and
  one way to do that is to not have to write
  as many slides while I’m here!
   A C++ program that runs under Windows,
    Mac OS X, and Linux that is available at:

There are explanations, papers, and other
 downloadable tools available there.
References (for the 1st part)
Goldsmith (2001) “Unsupervised Learning of
 the Morphology of a Natural Language”
 Computational Linguistics
 Look at Linguistica in action:
     English, French
 Theoretical foundations
 Underlying heuristics
 Further work
 A program that takes in a text in an
  “unknown” language…
 …and produces a morphological analysis:
    a list of stems, prefixes, suffixes;
    more deeply embedded morphological structure;
    regular allomorphy
[Screenshot of the Linguistica window, with three areas: actions and outlines of information; lists of stems, affixes, signatures, etc.; some messages from the analyst to the user.]
Read a corpus
 Brown corpus: 1,200,000 words of typical English
 French Encarta
 or anything else you like, in a text file.
 Set the number of words you want read,
  then select the file.
List of stems

A stem’s signature is the list of suffixes it appears with in the corpus,
in alphabetical order.
         Stem            Signature       Words
         abilit          ies.y           abilities, ability
         aboli           tion            abolition
         absen           ce.t            absence, absent
         absolute        NULL.ly         absolute, absolutely
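The definition of a signature can be sketched in a few lines of Python. This is only an illustration, not Linguistica's own code: the function name `signatures` and the input format (a list of stem/suffix pairs) are assumptions, the empty suffix is written NULL as in the signature notation used elsewhere in these slides, and "alphabetical order" is taken to be a plain character-code sort, which happens to put NULL first.

```python
from collections import defaultdict

def signatures(analyses):
    """Map each stem to its signature: its suffixes, sorted and joined by '.'.

    `analyses` is a list of (stem, suffix) pairs; an empty suffix is
    written as NULL. Both the name and the input format are hypothetical.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_by_stem[stem].add(suffix or "NULL")
    return {stem: ".".join(sorted(sufs))
            for stem, sufs in suffixes_by_stem.items()}

# The four stems from the slide, written out as stem + suffix cuts.
analyses = [
    ("abilit", "ies"), ("abilit", "y"),
    ("aboli", "tion"),
    ("absen", "ce"), ("absen", "t"),
    ("absolute", ""), ("absolute", "ly"),
]
sigs = signatures(analyses)
for stem, sig in sigs.items():
    print(stem, sig)
```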
List of signatures
for example,
account accounted accounting      accounts
add        added         adding   adds
      We’ll see how we can find a more sophisticated signature…

Signature   <e>ion . NULL
composite       concentrate      corporate         détente
discriminate    evacuate         inflate           opposite
participate     probate          prosecute         tense

What is this?

composite       and      composition

composite → composit → composit + ion

It infers that ion deletes a stem-final ‘e’ before attaching.
Top signatures in English
Over-arching theory
   The selection of a grammar, given the data, is
    an optimization problem.
   Optimization means finding a maximum or
    minimum of some objective function
   Minimum Description Length provides us with a
    means for understanding grammar selection as
    minimizing a function.
   (We’ll get to MDL in a moment)
What’s being minimized by
writing a good morphology?
   The number of letters is part of it

   Compare:
Naive Minimum Description Length
Corpus:                     Analysis:
jump, jumps, jumping        Stems: jump laugh sing sang dog (20 letters)
laugh, laughed, laughing    Suffixes: s ing ed (6 letters)
sing, sang, singing         Unanalyzed: the (3 letters)
the, dog, dogs
total: 61 letters           total: 29 letters

               Notice that the description length
              goes UP if we analyze sing into s+ing
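The letter counts on this slide can be checked with a short Python sketch. The helper `letters` is a hypothetical name. The last part is one possible reading of "analyze sing into s+ing", and it is an assumption: the stem becomes "s", and covering "singing" then needs an extra suffix "inging".

```python
def letters(items):
    """Naive description length: the total number of letters."""
    return sum(len(item) for item in items)

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

corpus_length = letters(corpus)
analysis_length = letters(stems) + letters(suffixes) + letters(unanalyzed)

# Assumed alternative: sing = s + ing, so singing = s + inging.
alt_stems = ["jump", "laugh", "s", "sang", "dog"]
alt_suffixes = ["s", "ing", "ed", "inging"]
alt_length = letters(alt_stems) + letters(alt_suffixes) + letters(unanalyzed)

print(corpus_length, analysis_length, alt_length)
```

Under that assumption the alternative costs more letters than the analysis on the slide, which is the sense in which the description length goes up.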
Minimum Description Length (MDL)
    Rissanen (1989) (not a CL paper)
    The best “theory” of a set of data is the
     one which is simultaneously:
    1. most compact or concise, and
    2. provides the best modeling of the data
    “Most compact” can be measured in
     bits, using information theory
    “Best modeling” can also be measured
     in bits…
              Essence of MDL

[Figure: chart comparing two quantities, "Length of morphology" and "Log prob of corpus", for three cases: "Best analysis", "Elegant theory that works badly", and "Complex theory modeled from …".]
Description Length =
   Conciseness: Length of the morphology. It’s
    almost as if you count up the number of symbols
    in the morphology (in the stems, the affixes, and
    the rules).
   Length of the modeling of the data. We want a
    measure which gets bigger as the morphology is
    a worse description of the data.
   Add these two lengths together = Description Length
Conciseness of the morphology
Sum all the letters, plus all the structure
  inherent in the description, using information theory.
Remember Entropy?
$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

Entropy is the sum, weighted by p(x), of the
  information content, or optimal compressed
  length, –log2 p(x), of each x. It is called the
  optimal compressed length because it is always
  possible to develop a compression scheme by
  which a symbol x, emitted with probability p(x),
  is represented by a placeholder of length
  –log2 p(x) bits.
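As a small worked example, the Python sketch below computes –log2 p(x) for each symbol and the entropy of a toy string. The function names and the toy data are assumptions for illustration, and p(x) is simply estimated from relative frequencies.

```python
import math
from collections import Counter

def code_lengths(symbols):
    """Optimal compressed length, -log2 p(x), of each distinct symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: -math.log2(c / total) for s, c in counts.items()}

def entropy(symbols):
    """H(X) = -sum over x of p(x) * log2 p(x), in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "aaaabbcd"            # p(a) = 1/2, p(b) = 1/4, p(c) = p(d) = 1/8
lengths = code_lengths(text)
h = entropy(text)
print(lengths, h)
```

Frequent symbols get short codes (1 bit for a), rare ones get long codes (3 bits for c and d), and the entropy is the average code length per symbol.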
Optimal Compressed Length
The reason this is mentioned is that we will have
lots of pieces of information in our model, and we’d
like to figure out how much “space” it takes up.

Remember, we want the smallest model possible,
so we are going to want the best compression for
anything in our model
Also, remember this: $-\log p(x) = \log \frac{1}{p(x)}$
Conciseness of stem list and suffix list

$$\text{(ii) Suffix list:}\quad \sum_{f \in \text{Suffixes}} \left( \lambda \cdot |f| + \log \frac{[W_A]}{[f]} \right)$$

$$\text{(iii) Stem list:}\quad \sum_{t \in \text{Stems}} \left( \lambda \cdot |t| + \log \frac{[W]}{[t]} \right)$$

 |f| = number of letters in suffix; |t| = number of letters in stem
 λ = number of bits/letter < 5
 The log term is the cost of setting up this entity: length of pointer in bits
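The cost of a stem list or suffix list can be sketched as follows: each entry pays for its letters (bits per letter times number of letters) plus a pointer whose length is the log of the total count divided by the entry's own count. The function name `list_cost`, the toy counts, and the choice of log2(26) bits per letter (which is below 5) are all assumptions made for illustration.

```python
import math

def list_cost(counts, total, bits_per_letter):
    """Sum over entries of: bits_per_letter * |entry| + log2(total / count).

    `counts` maps each stem (or suffix) to how often it occurs;
    `total` is the count that the pointers are measured against.
    """
    return sum(bits_per_letter * len(entry) + math.log2(total / count)
               for entry, count in counts.items())

# Hypothetical counts, not taken from any real corpus.
stem_counts = {"jump": 4, "laugh": 2, "dog": 2}
total_words = 8
bits_per_letter = math.log2(26)   # assumed: 26 equally likely letters

cost = list_cost(stem_counts, total_words, bits_per_letter)
print(round(cost, 2), "bits")
```

The frequent stem gets the shortest pointer (1 bit here) and the rarer stems get 2 bits each, so frequent entries are cheaper to refer to.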
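Signature list length

$$\sum_{\sigma \in \text{Signatures}} \log \frac{[W]}{[\sigma]} \qquad \text{(list of pointers to signatures)}$$

$$+ \sum_{\sigma \in \text{Signatures}} \left( \log \langle \text{stems}(\sigma) \rangle + \log \langle \text{suffixes}(\sigma) \rangle \right)$$

$$+ \sum_{\sigma \in \text{Sigs}} \left( \sum_{t \in \text{Stems}(\sigma)} \log \frac{[W]}{[t]} + \sum_{f \in \text{Suffixes}(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]} \right)$$

<X> indicates the number of distinct elements in X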
Length of the modeling of the data

 Probabilistic morphology: the measure:

    -1 * log probability ( data )

 where the morphology assigns a probability to any
   data set.
 This is known in information theory as the optimal
   compressed length of the data (given the morphology).
Probability of a data set?
A grammar can be used not (just) to specify
   what is grammatical and what is not, but to
   assign a probability to each string (or
If we have two grammars that assign
   different probabilities, then the one that
   assigns a higher probability to the
   observed data is the better one.
This follows from the basic principle of
 rationality in the Universe:

Maximize the
probability of the
observed data.
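To make the comparison concrete, here is a minimal Python sketch of the measure -1 * log probability (data) for two competing models. The two probability tables are invented for illustration, and the sketch assumes each word is generated independently, which is a simplification and not a claim about how Linguistica assigns probabilities.

```python
import math

def compressed_length(data, prob):
    """-log2 probability of the data, in bits, assuming independent words."""
    return sum(-math.log2(prob[word]) for word in data)

data = ["the", "dog", "the", "dogs"]

# Two hypothetical models that assign different probabilities.
model_a = {"the": 0.5, "dog": 0.25, "dogs": 0.25}
model_b = {"the": 0.25, "dog": 0.25, "dogs": 0.5}

length_a = compressed_length(data, model_a)
length_b = compressed_length(data, model_b)
better = "A" if length_a < length_b else "B"
print(length_a, length_b, better)
```

The model that assigns the higher probability to the observed data is the one with the shorter compressed length, so maximizing probability and minimizing this length are the same thing.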
From all this, it follows:
There is an objective answer to the
 question: which of two analyses of a given
 set of data is better?
However, there is no general, practical
 guarantee of being able to find the best
 analysis of a given set of data.
Hence, we need to think of (this sort of)
 linguistics as being divided into two parts:
 An evaluator (which computes the
  Description Length); and
 A set of heuristics, which create
  grammars from data, and which propose
  modifications of grammars, in the hopes
  of improving the grammar.
(Remember, these “things” are
  mathematical things: algorithms.)
Let’s step back for a minute
 Why is this problem so hard at first?
 Because figuring out the best analysis of
  any given word generally requires having
  figured out the rough outlines of the whole
  overall morphology. (Same is true for other
  parts of the grammar!).
How do we start?
 You all know the answer to this question
 We start with Zellig Harris’ successor frequency
 Although we got some good answers, we
  also saw that it made lots of mistakes
 So…
As a boot-strapping method
to construct a first approximation
of the signatures:
 Harris’ method is pretty good.
 We accept only stems of 5 letters or more;
 Only cuts where the SuccFreq is > 1, and
  where the neighboring SuccFreq is 1.
 (This setup was experiment 16 from the
  lab on Monday)
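The bootstrapping conditions above can be sketched in Python. Several details are assumptions rather than anything stated on the slide: end-of-word is counted as one possible successor (written "#"), "neighboring SuccFreq" is read as the prefix one letter shorter and the prefix one letter longer, and the first qualifying cut is taken. Function names are hypothetical.

```python
from collections import defaultdict

END = "#"  # assumed end-of-word marker, counted as a successor

def successor_table(words):
    """Map each prefix to the set of symbols that follow it in the word list."""
    successors = defaultdict(set)
    for word in set(words):
        marked = word + END
        for i in range(len(word) + 1):
            successors[word[:i]].add(marked[i])
    return successors

def harris_cuts(words, min_stem=5):
    """Cut word = stem + suffix where SuccFreq(stem) > 1 and both
    neighboring prefixes have SuccFreq 1; stems need min_stem letters."""
    successors = successor_table(words)

    def succ_freq(prefix):
        return len(successors.get(prefix, ()))

    cuts = {}
    for word in sorted(set(words)):
        for i in range(min_stem, len(word)):
            if (succ_freq(word[:i]) > 1
                    and succ_freq(word[:i - 1]) == 1
                    and succ_freq(word[:i + 1]) == 1):
                cuts[word] = (word[:i], word[i:])
                break
    return cuts

words = ["account", "accounted", "accounting", "accounts",
         "add", "added", "adding", "adds"]
cuts = harris_cuts(words)
print(cuts)
```

On this toy list the account family is cut, while the add family is left alone because its stem would be shorter than 5 letters.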
Let’s look at how the work
is done (in the abstract),
step by step...

1.   Pick a large corpus from a language --
     5,000 to 1,000,000 words.
2.   Feed it into the “bootstrapping” heuristic...
3.   Out of which comes a preliminary morphology,
     which need not be superb.
4.   Feed it to the incremental heuristics
     (…which we haven’t seen yet)
5.   Out comes a modified morphology.
6.   Is the modification an improvement? Ask MDL!
7.   If it is an improvement, replace the morphology...
8.   Send it back to the heuristics again...
9.   Continue until there are no improvements to try.
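The loop just described (propose a modification, ask MDL, keep it only if it is an improvement, repeat until nothing improves) might be sketched like this in Python. Everything concrete here is a toy assumption: the morphology is a triple of stems, suffixes, and unanalyzed words, the description length is the naive letter count, and the single heuristic `split_final_s` is invented for the example.

```python
def letters(items):
    return sum(len(item) for item in items)

def description_length(morphology):
    """Toy evaluator: naive letter count over all three lists."""
    stems, suffixes, unanalyzed = morphology
    return letters(stems) + letters(suffixes) + letters(unanalyzed)

def split_final_s(morphology):
    """Toy heuristic: analyze word+s as stem + s when the word is known."""
    stems, suffixes, unanalyzed = morphology
    known = stems | unanalyzed
    movable = {w for w in unanalyzed if w.endswith("s") and w[:-1] in known}
    if not movable:
        return None  # nothing to propose
    new_stems = stems | {w[:-1] for w in movable}
    return (frozenset(new_stems),
            frozenset(suffixes | {"s"}),
            frozenset(unanalyzed - movable - new_stems))

def improve(morphology, heuristics, evaluate):
    """Keep applying heuristics while the description length goes down."""
    best, best_dl = morphology, evaluate(morphology)
    improved = True
    while improved:
        improved = False
        for heuristic in heuristics:
            candidate = heuristic(best)
            if candidate is None:
                continue
            candidate_dl = evaluate(candidate)
            if candidate_dl < best_dl:      # ask MDL
                best, best_dl, improved = candidate, candidate_dl, True
    return best, best_dl

start = (frozenset(), frozenset(),
         frozenset({"dog", "dogs", "cat", "cats", "the"}))
start_dl = description_length(start)
final, final_dl = improve(start, [split_final_s], description_length)
print(start_dl, final_dl, final)
```

The evaluator and the heuristics are kept separate on purpose: heuristics only propose, and the evaluator alone decides whether a proposal is kept.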

The details of learning morphology

   There is nothing sacred about the
    particular choice of heuristic steps
 Successor Frequency: strict
 Extend signatures to cases where a word
  is composed of a known stem and a
  known suffix.
 Loose fit: Look at all unanalyzed words.
  See whether they can be cut as stem + suffix,
  where the suffix already exists. Do this in
  all possible ways. See if any of these
  lead to stems with signatures that already
  exist. If so, take the “best” one. If not,
  compute the utility of the signature using MDL.
 Check existing signatures: Using MDL
  to find best stem/suffix cut. Examples…
Check signatures (English)
 on/ve → ion/ive
 an/en → man/men
 l/tion → al/ation
 m/t → alism/alist, etc.

Check signatures
 Signature l/tion with stems:
federa     inaugura orienta substantia
We need to compute the Description Length
  of the analysis
as it stands          versus
as it would be if we shifted varying parts of
  the stems to the suffixes.
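A rough Python sketch of this comparison follows, using the four stems and the signature l/tion from the slide. It is a simplification: the naive letter count stands in for the real Description Length, and the function names are hypothetical.

```python
def letters(items):
    return sum(len(item) for item in items)

def shifted(stems, suffixes, k):
    """Move the last k letters of every stem onto the front of every suffix.

    Returns None if the stems do not all end in the same k letters.
    """
    if k == 0:
        return list(stems), list(suffixes)
    tails = {stem[-k:] for stem in stems}
    if len(tails) != 1:
        return None
    tail = tails.pop()
    return [stem[:-k] for stem in stems], [tail + f for f in suffixes]

def best_shift(stems, suffixes):
    """Try k = 0, 1, 2, ... and report the naive length of each analysis."""
    lengths = {}
    for k in range(min(len(stem) for stem in stems)):
        analysis = shifted(stems, suffixes, k)
        if analysis is None:
            break
        new_stems, new_suffixes = analysis
        lengths[k] = letters(new_stems) + letters(new_suffixes)
    return min(lengths, key=lengths.get), lengths

stems = ["federa", "inaugura", "orienta", "substantia"]
suffixes = ["l", "tion"]
k, lengths = best_shift(stems, suffixes)
new_stems, new_suffixes = shifted(stems, suffixes, k)
print(lengths, new_stems, new_suffixes)
```

With this stand-in measure, shifting one letter is cheaper than the analysis as it stands, which matches the l/tion → al/ation correction listed for English.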
“Check signatures” French:
   NULL nt r >> a ant ar         me te >> ume ute
   NULL nt >> i int              eurs ion >> teurs tion
   ent t >> oient oit            f ve >> dif dive
   NULL r >> i ir                it nt >> ait ant
   f on ve >> sif sion sive      que sme >> ïque ïsme
   eur ion >> seur sion          NULL s ur >> e es eur
   ce t >> ruce rut              ient nt >> aient ant
   se x >> ouse oux              f on >> sif sion
   l ux >> al aux                nt r >> ent er
100,000 tokens, 12,208 types
Step                 Stems   Signatures   Suffixes
Zellig redux         1,403   140          68
Extend signatures            226
Loose fit            2,395   702          68
Check                2,409   730          110
Smooth               2,400   735          115
 Find relations among stems: find principles
  of allomorphy, like
“delete stem-final e before –ing” on the
  grounds that this simplifies the collection
  of Signatures:
Compare the signatures
    e.ing and NULL.ing:
 NULL.ing: its stems do not end in –e
 -ing (almost) never appears after
  stem-final e. (ex. singeing)
 So e.ing and NULL.ing can both be
  subsumed under:
 <e>ing.NULL, where <e>ing means a
  suffix ing which deletes a preceding e.
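How a suffix like <e>ing behaves can be shown with a small Python sketch. The function `attach` and the way the "<e>" marker is parsed are assumptions for illustration, not Linguistica's internal representation.

```python
def attach(stem, suffix):
    """Attach a suffix to a stem.

    NULL adds nothing; a suffix written <e>xyz deletes a stem-final e
    (if there is one) before adding xyz.
    """
    if suffix == "NULL":
        return stem
    if suffix.startswith("<e>"):
        bare = suffix[len("<e>"):]
        base = stem[:-1] if stem.endswith("e") else stem
        return base + bare
    return stem + suffix

signature = ["<e>ing", "NULL"]
words = {stem: [attach(stem, f) for f in signature]
         for stem in ["love", "jump"]}
print(words)
```

One signature now covers both kinds of stem: those that end in e and those that do not.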
Find layers of affixation
   Find roots (from among the Stem collection)

   In other words, recursively look through our list
    of Stems and see if we could (or should) be
    analyzing them again:

   readings = reading+s = read+ing+s
   Etc.
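The recursive look through the stem list might be sketched like this in Python. The function name `peel`, the toy stem and suffix sets, and the rule of trying longer suffixes first are assumptions for illustration; the real program would let MDL decide whether a reanalysis is worth keeping.

```python
def peel(word, stems, suffixes):
    """Recursively split a word into a root followed by its suffixes.

    A suffix is peeled off only if what remains is itself a known stem.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        base = word[:-len(suffix)]
        if word.endswith(suffix) and base in stems:
            return peel(base, stems, suffixes) + [suffix]
    return [word]

stems = {"read", "reading"}     # toy stem collection
suffixes = {"ing", "s"}         # toy suffix list

layers = peel("readings", stems, suffixes)
print(" + ".join(layers))
```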
What’s the future work?
1.   Identifying suffixes through syntactic
     behavior (syntax)
2.   Better allomorphy (phonology)
3.   Languages with more morphemes per word
     (“rich” morphology)
   “Using eigenvectors of the bigram graph to
    infer grammatical features and categories”
    (Belkin & Goldsmith 2002)
   Build a graph in which “similar” words are connected;
   Compute the normalized Laplacian (linear
    algebra -- it just sounds fancy!) of that graph;
   Compute the eigenvectors with the lowest
    non-zero eigenvalues; (more linear algebra)
   Plot them.
Map 1,000 English words by left-hand neighbors
 ?: and, to, in, that, for, he, as, with, on, by, at, or, from…
 finite verbs: was, had, has, would, said, could, did, might, went, thought, told, knew, took
 world, way, same, united, right, system, city, case, church, problem, company, past, field, cost, department, university, rate, door
 non-finite verbs: be, do, go, make, see, get, take, go, say, put, find, give, provide, keep, run…
Map 1,000 English words by right-hand neighbors
 Prepositions: of in for on by at from into after through under since during against among within along across including near
 social national white local political personal private strong medical final black French technical nuclear British
