Lecture 3: Walk-ing the walk, talk-ing the talk – My Fair Lady Lecture




        Professor Robert C. Berwick
           berwick@csail.mit.edu
                  The Menu Bar

• Administrivia
   • How to do Lab 1, parts 1 and 2
• My Fair Lady: “words, words, words, I’m so sick of
  words…”
• What can the two-level (finite-state) transducers do?
• What can’t fst’s do? Complexity issues &
  representational issue: decentralize the method?



                    6.863J/9.611J SP09 Lecture 3
     The fundamental insight: Computational
  morphology is mostly finite-state (hence, can be
                    done fast)

Analysis:    leaves  →  leaf N Pl | leave N Pl | leave V Sg3
Generation:  hang V Past  →  hanged | hung


     A regular relation consists of ordered pairs of strings
            leaf+N+Pl : leaves             hang+V+Past : hung

   Any finite collection of such pairs is a regular relation.
 Describing (surface, underlying) forms as pairs of languages (a
 language being just a set of strings):
String 1         f        a         t                +Adj        +Comp

String 2         f        a         t        t               e    r



                     Relation(String 1, String 2)
       Forget about the terms ‘input’ and ‘output’

                          Lexical transducer

vouloir +IndP +SG +P3   <->   [Finite-state transducer]   <->   veut

• Bidirectional: generation or analysis
• Compact and fast
• Comprehensive systems have been built for over 40 languages:
   • English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish,
     Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew,
     Bulgarian, …
• Even for translation (demo example)

citation form + inflection codes:   v o u l o i r +IndP +SG +P3
inflected form:                     v e u t
Two knowledge sources




Must check all the spelling changes as well



                   Outfoxed!

[Figure: transducer for fox+s / foxes, with arcs f:f, o:o, x:x, +:0, 0:e, s:s, #:#]


F    O         X   +         0            S       # lexical
f    o         x   0         e            s       # surface
What about… spy vs. spies?




Here there are several spelling change rules…
         This example requires two

   Epenthesis rule                                  y-i rule
   s p y 0 + s                              s p y 0 + s


   s p i e 0 s                              s p i e 0 s



   y:i <=> _ 0:e                 0:e <=> Cons: y: _ +:0 s:s




No… more than 1 Spelling change rule
Name               Description                          Example

Consonant          1-letter consonant doubled           beg/begging
Doubling           before -ing/-ed
(gemination, G)
E deletion         Silent e dropped before -ing, -ed    make/making
(elision, EL)
E insertion        e added after -s, -z, -x, -ch, -sh   fox/foxes
(epenthesis, EP)   before -s
Y replacement      -y changes to -i before -ed          try/tried
(Y)
I spelling (I)     i goes to y before vowel             lie/lying

      How was this done in linguistic theory?

• Insert e after ‘sh’, ‘x’, etc: “epenthesis”
• Statement of rule is actually quite complex:
   • Rewrite rule: x → y / α _ β (Chomsky & Halle)
   • 0:e → [Csib (c h) (s h) y:i] +:0 _ s
                             (transducer notation)




Sequential model: based on the notion of
             ‘derivation’

  Lexical form → fst 1 → Intermediate form → fst 2 → ... → fst n → Surface form

An ordered sequence of rewrite rules (Chomsky & Halle 1968)
can be modeled by a cascade of finite-state transducers
(Johnson 1972) if (what? HW question)

Issue: ordering


Sequential application (Lab 1, question 4)


  k a N p a n
                          N -> m / _ p

  k a m p a n
                          p -> m / m _

  k a m m a n
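
The derivation above can be replayed as a small sketch. It assumes each rewrite rule can be written as a Python regular expression with a lookahead or lookbehind for its context; the names RULES and derive are made up for this illustration and are not part of the lab software.

```python
import re

# The two ordered rewrite rules from the slide, as (pattern, replacement).
RULES = [
    (r"N(?=p)", "m"),   # N -> m / _ p
    (r"(?<=m)p", "m"),  # p -> m / m _
]

def derive(lexical, rules=RULES):
    """Apply the rules in order and return every intermediate form."""
    forms = [lexical]
    for pattern, replacement in rules:
        forms.append(re.sub(pattern, replacement, forms[-1]))
    return forms

print(derive("kaNpan"))               # ['kaNpan', 'kampan', 'kamman']
print(derive("kaNpan", RULES[::-1]))  # reversed order stops at 'kampan'
```

Running the rules in the opposite order gives a different surface form, which is the ordering issue in miniature.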




 From underlying forms bubbling to the
     surface (from Halle, 1960)…


            accede recede assign resign

R1 dupe     a+ked re+ked a+sin re+sin
R2 s-to-z   a+kked re+ked a+ssin re+zin
R3 k-to-s   a+ksed re+sed a+ssin re+zin
R4 Vowel    akseyd reseyd assayn rezayn
  Shift     accede recede assign resign




4 ordered rules - write out lex:surf pairs
                     (L:S pairs)

• LR: a+ked re+ked a+sin re+sin
  SR: akseyd reseyd assayn rezayn

Pad out so LR and SR of equal length, also noting +:0
  correspondence

  a+ked       re+ked              a+sin            re+sin
  aksed       re0sed              assin            re0zin
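
As a rough sketch of the padding step, the padded strings above can be zipped into lexical:surface pairs. The helper name pair_up is hypothetical; the only assumption is that both strings have already been padded to the same length.

```python
def pair_up(lexical, surface):
    """Zip two equal-length padded strings into 'lexical:surface' pairs."""
    if len(lexical) != len(surface):
        raise ValueError("pad both strings with 0 to the same length first")
    return [f"{l}:{s}" for l, s in zip(lexical, surface)]

for lr, sr in [("a+ked", "aksed"), ("re+ked", "re0sed"),
               ("a+sin", "assin"), ("re+sin", "re0zin")]:
    print(" ".join(pair_up(lr, sr)))
```

The pairs that are not identity pairs (+:k, k:s, +:0, s:z, and so on) are the ones a spelling-change rule has to license.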




How hairy can these rules be, after all?




    And the root-affix fst constraints!
         Koskenniemi 1983

          Lexical form



rule 1    rule 2     ...      rule n            +       FSTLEX


          Surface form




Our implementation is a so-called two-level
                 system

         Lexical (‘upper’) form
  FST1 (dict)      fst 2     ...     fst n
         Surface (‘lower’) form

A set of parallel two-level rules (constraints), compiled into
finite-state automata interpreted as transducers; one fst is a
‘letter tree’ for the roots+affixes

Each fst must pass every (lexical, surface) character pair:
thus, the fst’s are really acting as constraint filters
(they are failure driven)
This is the intersection of all the fst’s: the ‘big’
stem+ending fst plus the bunch of spelling-change fst’s
    But is this way of doing things
semantically correct? How do we want to
           combine the fst’s?
      As it happens….Not! Unless… Let’s see why…




    Morphology is finite-state: regular relations
•   We can use regular relations (or rational relations) to define the
    pairings between surface and lexical forms
• Recall the formal definition of this as a transducer:
A finite-state transducer (FST) is a sextuple
 (Q, Σ1, Σ2, δ, I, F) where
1. Q is a finite, non-empty set of states;
2. Σ1 a finite set of input symbols;
3. Σ2 a finite set of output symbols;
4. δ ⊆ Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) × Q, the transition mapping
5. I ⊆ Q, the initial states
6. F ⊆ Q, the final states
    A finite-state transducer T defines the regular relation R(T),
    the set of pairs (x,y) s.t. δ*(q, x, y) ∩ F ≠ ∅ for some q ∈ I
         An example regular relation – a pairing of sets
           of strings – each of which is finite-state –
                       described by an fst
  {((ab)^n, (ba)^n) | n > 0}, i.e., {(ab,ba), (abab, baba), (ababab,
  bababa), … } (“Interchanges” a’s and b’s)

                                    a:b
                       S                               B
                                    b:a
Note that the relation specifies no ‘directionality’ between X and Y (there is no ‘input’
and no ‘output,’ so neutral between ‘parsing’ & ‘generation’)
What are the properties of regular (rational) relations?
How do we implement these in our ‘word parsing’ application?
How do you implement them in your lab 1a?
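
One way to see the lack of directionality is a minimal sketch of the two-state machine above. It assumes S is both the initial and the only final state (the slide does not mark this), and the function names related and complete are invented here.

```python
# Arcs of the two-state machine: (state, x symbol, y symbol) -> next state.
ARCS = {("S", "a", "b"): "B", ("B", "b", "a"): "S"}

def related(x, y):
    """True if (x, y) is in the relation {((ab)^n, (ba)^n) | n > 0}."""
    if not x or len(x) != len(y):
        return False
    state = "S"
    for pair in zip(x, y):
        state = ARCS.get((state, *pair))
        if state is None:
            return False
    return state == "S"

def complete(known, side):
    """Fill in the other string; side=0 means `known` is x, side=1 means y."""
    state, out = "S", []
    for ch in known:
        for (src, a, b), dst in ARCS.items():
            if src == state and (a, b)[side] == ch:
                out.append((a, b)[1 - side])
                state = dst
                break
        else:
            return None  # no arc fits: `known` is not in the relation
    return "".join(out) if known and state == "S" else None

print(related("abab", "baba"))  # True
print(complete("abab", 0))      # baba
print(complete("baba", 1))      # abab
```

The same arc table answers both questions; only the side we read from changes.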

             Properties of regular (rational)
                relations/transductions
• Key differences: (important for implementation)
1. Not closed under intersection (unlike fsa’s, unless they
   obey the same-length constraint)
2. Are closed under composition
3. Cannot always be determinized (unlike nondeterministic
   fsa’s)

         Definition & example of composition, and then
         why not closed under intersection & why this
         matters


     Remembrance of languages past…
• Finite-state automaton defined by having a finite
  number of states… (duh)
• These define a finite # of equivalence classes or
  bins, i.e., an equivalence relation R such that for all
  strings x, y over the defined alphabet Σ, either
  xRy or else ¬xRy.
• (This equivalence relation is of finite rank)
• No fsa can properly ‘bin’ a^n b^n, ∀n>1


Any given fsa must sometimes fail to classify an
 arbitrarily long string of this form properly…
                 aaabbb
                 aabb
                 aaaabb
                 a^500 b^500

[Figure: a chain of states 1, 2, …, 50]

   FST’s (regular relations) not closed
       under intersection (gulp!)
• Regular relation 1: a pair of finite-state languages:
      {(a^n, b^n c*)} Claim: this is a regular relation
      [Figure: fst with an a:b loop followed by 0:c arcs]
• Regular relation 2: a similar pair of fsl’s:
       {(a^n, b* c^n)} Claim: this too is a regular relation
• What is the intersection of (1) and (2)?
• Ans: {(a^n, b^n c^n)}. But this is not a regular relation, by
  definition, because b^n c^n is not a finite-state (regular)
  language (Why not? Salivate appropriately as bell rings…)
    FSTs not closed under intersection…!
When we intersect two regular relations, do we always
get back a regular relation?
Recall: both parts of the relation (left and right) must be
finite-state languages
Regular relation: R1 = {(a^n, b^n c*) | n > 0}
Regular relation: R2 = {(a^n, b* c^n) | n > 0}

Intersection R1 ∩ R2 = {(a^n, b^n c^n) | n > 0}
Is this a rational relation?
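
A quick sanity check of the intersection, as a sketch only: it enumerates finite slices of R1 and R2 up to an arbitrary bound (the bound and the function names r1 and r2 are choices made for this illustration), so it shows the pattern but proves nothing about the infinite relations.

```python
def r1(bound):
    """Finite slice of R1 = {(a^n, b^n c^k)}: n in 1..bound, k in 0..bound."""
    return {("a" * n, "b" * n + "c" * k)
            for n in range(1, bound + 1) for k in range(bound + 1)}

def r2(bound):
    """Finite slice of R2 = {(a^n, b^k c^n)}: n in 1..bound, k in 0..bound."""
    return {("a" * n, "b" * k + "c" * n)
            for n in range(1, bound + 1) for k in range(bound + 1)}

common = sorted(r1(3) & r2(3))
print(common)  # [('a', 'bc'), ('aa', 'bbcc'), ('aaa', 'bbbccc')]
```

Every surviving pair has the shape (a^n, b^n c^n): the two relations each count one thing, and the intersection has to count two at once.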

Definition of rational relation (fst) composition

  • FST1= (Q1, Σ1, Σ2, δ1, I1, F1)
  • FST2 =(Q2, Σ2, Σ3, δ2, I2, F2)
  • Define composition, FST2 ° FST1 =
     FSTC = (Q1 × Q2, Σ1, Σ3, δ, I1 × I2, F1 × F2)
  Define δ: for all a∈Σ1, c∈Σ3, q1,r1∈Q1; q2,r2∈Q2
  ([q1,q2],a,c,[r1,r2]) ∈ δ iff for some b∈Σ2, (q1,a,b,r1) ∈ δ1 and
                                (q2,b,c,r2) ∈ δ2


         Who cares?

           Why we care: Composition of Fst’s
             relates to cascading of rules
 • A o B: The relation C such that if A maps x to y
   and B maps y to z, C maps x to z:
A:    a:a           b:b           c:c


              b:B
B:                   c:C
     a:A
                                 C:
                                       a:A                b:B   c:C

                     d:D
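
The product construction in the definition of composition can be sketched in a few lines for epsilon-free machines. This is an illustration under assumptions: arcs are (source, input, output, target) tuples, A is taken to be a straight-line machine, and B is assumed to be a single state with loops a:A, b:B, c:C, d:D (the figure does not pin down B's states). The names compose and outputs are made up.

```python
def compose(arcs_a, arcs_b):
    """Arcs of 'A then B': pair up states, match A's output to B's input."""
    return {((q1, q2), a, c, (r1, r2))
            for (q1, a, b, r1) in arcs_a
            for (q2, b2, c, r2) in arcs_b
            if b == b2}

def outputs(arcs, start, finals, x):
    """All output strings an epsilon-free transducer pairs with input x."""
    frontier = {(start, "")}
    for sym in x:
        frontier = {(dst, out + o)
                    for (state, out) in frontier
                    for (src, i, o, dst) in arcs
                    if src == state and i == sym}
    return {out for (state, out) in frontier if state in finals}

# A: straight-line machine a:a b:b c:c (states 0..3).
A = {(0, "a", "a", 1), (1, "b", "b", 2), (2, "c", "c", 3)}
# B: assumed one-state machine that upper-cases a, b, c, d.
B = {(0, s, s.upper(), 0) for s in "abcd"}

C = compose(A, B)
print(outputs(C, (0, 0), {(3, 0)}, "abc"))  # {'ABC'}
```

B's d:D arc never appears in C because A never outputs a d for it to read.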




There are several spelling change rules, typically
           This example requires two
 We cascade them - by composition, or, in our
             system, by intersection
    s p y 0 + s                            s p y 0 + s


    s p i e 0 s                            s p i e 0 s



    y:i <=> _ 0:e               0:e <=> Cons: y: _ +:0 s:s
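
To make the 'intersection' reading concrete, here is a hedged sketch that checks both rules in parallel over one sequence of lexical:surface pairs. It is not the lab's rule compiler: the encoding of contexts as (lexical, surface) patterns with None as a wildcard, the consonant set, and the names obeys and accepts are all assumptions made for this example.

```python
CONS = set("bcdfghjklmnpqrstvwxyz")

def sym_ok(want, got):
    # None is a wildcard; otherwise `want` is a collection of allowed symbols.
    return want is None or got in want

def ctx_ok(pairs, pattern):
    return len(pairs) == len(pattern) and all(
        sym_ok(wl, l) and sym_ok(ws, s)
        for (wl, ws), (l, s) in zip(pattern, pairs))

def obeys(pairs, target, left, right):
    """Two-level <=>: `target` appears exactly where the context matches."""
    for i, (lex, surf) in enumerate(pairs):
        in_ctx = (ctx_ok(pairs[max(0, i - len(left)):i], left)
                  and ctx_ok(pairs[i + 1:i + 1 + len(right)], right))
        if (lex, surf) == target and not in_ctx:
            return False  # => direction violated
        if lex == target[0] and surf != target[1] and in_ctx:
            return False  # <= direction violated
    return True

RULES = [
    (("y", "i"), [], [("0", "e")]),                                      # y:i <=> _ 0:e
    (("0", "e"), [(CONS, None), ("y", None)], [("+", "0"), ("s", "s")]), # 0:e <=> Cons: y: _ +:0 s:s
]

def accepts(lexical, surface):
    """Every rule must accept the same pair sequence (parallel filters)."""
    pairs = list(zip(lexical, surface))
    return len(lexical) == len(surface) and all(
        obeys(pairs, *rule) for rule in RULES)

print(accepts("spy0+s", "spie0s"))  # True:  spies
print(accepts("spy0+s", "spye0s"))  # False: the y:i rule objects
print(accepts("spy0+s", "spy00s"))  # False: the 0:e rule objects
```

Neither rule needs to know about the other or about any intermediate form; each just vetoes pairings it dislikes.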




 Extract contexts to find declarative
             constraints

a+ked     re+ked             a+sin            re+sin
aksed     re0sed             assin            re0zin

Rule:   +:k    a:a __ k:s
        +:s    a:a __ s:s

Rule:   k:s    e +:0 __ V | +:k __ V
        +:s    a:a __ s:s


      Why don’t we need to look at the
         derivational steps now??

• We can look at the lexical:surface substrings
  simultaneously
• So, if a rule has applied (conversely, not applied),
  its effects should be visible via the joint l:s pairs
  surrounding it in some lexical:surface pairing
  example (or else not visible if the rule did not
  apply)
• Otherwise, the rule must have been superfluous (it
  has no visible effects on the relation between any
  lexical:surface pairs)

        But there is no free lunch…

• To avoid the ‘context-free explosion’ there is one
  constraint the re-write rules must obey such that
  we can use fst’s (regular relations) to describe
  them; but, alas, we cannot escape nondeterminism
• There is one additional constraint such that the
  method of intersecting spelling-change rules will
  work correctly




Constraint 1: avoiding the road to context-free
                 complexity
• Discovery by C.D. Johnson, 1972.
• Consider 2 rules:
  Rule 0: S → ab      Rule 1: ε → ab / ___ b
• Start with S and apply Rule 0:
  S is replaced by ab. But now this can be read as:
  aεb
• So Rule 1 applies, and aεb is rewritten as:
  aabb
So, what can the output be, ultimately? Is this a finite-state
  language? Where does the rule apply in the string (after
  this first step)?
But there is another possible derivation where the rule applies
  at a different place in the string…
        aεb → aabε → aabab → … → a(ab)^n
              What’s the difference?
• In derivation 1, Rule 1 applies to its own output (the ‘b’
  portion of the output is replaced)
• In derivation 2, Rule 1 applies to the left of its own output
  – it never applies to its own output (i.e., recursively)
  That’s why the language remains finite-state
• Johnson showed that all the ‘normal’ uses of rewrite rules
  in phonology were derivations of the second sort; therefore,
  they could be represented as finite-state languages, and so, in
  surface:lexical form as regular relations
• Still, there is no escape from the following…



Nondeterministic to deterministic conversion
 is not always possible with a finite transducer




       Try ‘subset construction’ trick on this!
       What possible union of states could the
       machine be in after seeing an x?

              Q: Why should we care?
              A: how do we implement?
Constraint 2: so that the ‘run automata in
  parallel’ (via intersection) works…




 But, if multiple spelling changes, can we
             always intersect?
• Ah, there’s one more constraint…
• Are FSTs closed under intersection?


  F     O    X     +         0          S             #   lexical
  f     o    x     0         e          s             #   surface

  S     P    Y     0         +          S             #   lexical
  s     p    i     e         0          s             #   surface

      The same-length constraint
      What is its practical import?
          The ‘same-length’ constraint
• Imposed to ensure that closure under intersection
  works
• Practically, it means we must have a way to
  ensure that the (lexical, surface) pairs are of the
  same length - pad them out
• So we use a special character 0, whose semantics
  is not the same as ‘epsilon,’ the empty string.
• 0 has length exactly 1. Never more.
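
A tiny sketch of the difference between 0 and epsilon, assuming 0 is used only as the padding character: inside the padded strings it occupies a position like any other symbol, and it disappears only when we read off the actual word. The helper name strip_nulls is hypothetical.

```python
def strip_nulls(padded):
    """Drop the padding character 0 to recover the actual string."""
    return padded.replace("0", "")

lexical, surface = "spy0+s", "spie0s"

# While paired, each 0 holds exactly one position, so the lengths agree.
print(len(lexical), len(surface))                    # 6 6
# Only when reading the words off do the 0s vanish.
print(strip_nulls(lexical), strip_nulls(surface))    # spy+s spies
```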



              Examples for ‘e’ insertion…
Fox - foxes; church - churches; bus-buses
What elses?
Must look at non-examples as well… as - ases?
What is the rule?
    ε rewrites to the letter e in left context x+ or s+ or z+ and right context s#
    i.e. insert e after the + when you see x+s# or s+s# or z+s#
    in particular, we have x+s# → x+es#

In traditional ‘rewrite form’:
   ε → e / {x,s,z}+__ s#
Now redo this in terms of (lexical, surface) pairs, which tells us how to build
    the transducer:
Lexical nothing (epsilon= 0) is paired with e, or 0:e in context of:
  x:x, +:0 ____ s:s, #:#



          Turning this into an fst

• Write down the left, center, and right context
• In this case:
      x:x +:0 0:e          s:s #:#
  Csib:Csib
• Pad out with nulls (0’s) [Why? We shall see…]
• Write an FST that accepts exactly this string, and
  rejects everything else (we want the FSTs to work
  basically as filters)


       Start with straightline fst



Csib:Csib +:0           0:e                    s:s       #:#
1      2           3                 4               5         6=1




     Now add rejection notices…
                         reject                 reject    reject

                 @:@                   @:@           @:@
Csib:Csib +:0           0:e                    s:s       #:#
1      2           3                 4               5         6




          And acceptance (cook until done)
                                   reject               reject    reject

                           @                   @             @
      Csib:Csib +:0             0:e                    s:s       #:#
@:@
      1       2            3                 4               5         6
                  @:@




        How we do this in the lab

In file (eg, english.yaml):
1. Specify the boundary marker (usually #)
2. Specify where to find the lexicon automaton
    (e.g., english2.lex)
3. Specify the lexical, surface alphabet as a set of
    ‘defaults’
4. Specify any special subsets for abbreviatory
    purposes
5. Specify the spelling change automata

                       The file itself
boundary: '#'
lexicon: english2.lex
defaults: "a:a b:b c:c d:d e:e f:f g:g h:h i:i j:j k:k
  l:l m:m n:n o:o p:p q:q r:r s:s t:t u:u v:v w:w x:x y:y
  z:z +:0 `:0 #:# ':' -:- -:0"
subsets:
  "@": "a b c d e f g h i j k l m n o p q r s t u v w x y
  z ' ` + # 0"
  "C": b c d f g h j k l m n p q r s t v w x y z
  "Csib": s x z
  "V": a e i o u
  "Vbk": a o u
(automata follow)



                  Tabular format for fst
         Rules:
          Epenthesis
                              Pairs
          lexical     c   h   s   Csib   +   #   0   @
          surface     c   h   s   Csib   0   #   e   @
States         1:     2   1   4   3      1   1   0   1
               2:     2   3   3   3      1   1   0   1
               3:     2   1   3   3      5   1   0   1
               4:     2   3   3   3      5   1   0   1
               5:     2   1   2   2      1   1   6   1
               6.     0   0   7   0      0   0   0   0
               7.     0   0   0   0      0   1   0   0

0 = failure state; colon after state number: accepting state; period: non-accepting state

    Sequential application as transducers

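  k a N p a n
 0 0 0 2 0 0 0        (states of fst 1, N -> m / _ p)
  k a m p a n
 0 0 0 1 0 0 0        (states of fst 2, p -> m / m _)
  k a m m a n

[Figure: the two transducers. fst 1 has states 0, 1, 2 with arcs N:m, N:N,
m:m, p:p, @:@; fst 2 has states 0, 1 with arcs m:m, p:p, p:m, @:@]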
So, (1) does the order matter?
(2) what does the ‘composed’ machine look like? (Your
homework, Q4)
(3) Can we always do this? How do we implement this?
Spanish: Lab 1 last part – Your questions

 1. What is your name?

 2. What is your quest?

 3. What is your favorite color?




          Laboratory 1: Spanish

• What phenomena you’re covering
• How to build spelling-change fst’s - details
• How to build morpheme fst - details




                         The phenomena
•   You are given the orthography, including some special characters to
    stand for the accented ones á,é,ó,ü,ñ ; and some underlying characters
    you may find essential, such as J, C, Z.
•   Wise to proceed by first building the automata (yaml) file; then the
    lexicon(s) - because you can test the rules without any lexicon by
    generation of a surface form
•   The automata can be built (roughly) by considering each phenomenon
    separately
•   4 kinds of phenomena & 2 morpheme patterns




                    Some format details
# this is a comment at the top of my spanish.yaml file
boundary: '#'
lexicon: spanish.lex
defaults: "a e i o u a' e' i' o' u' b c d f g h j k l m n n~ p q r s t
   v w x y z +:0 #"
subsets:
 "Cons": "b c d f g h j k l m n n~ p q r s t v w x y z"
 "V": " a a' e e' i i' o o' u u'"
 "FRONT": "e i e' i'"
 "BACK": "u o a u' o' a'"
 "LOW": "e o a a' e' o'"
 "HIGH": "i i' u u'"
 "@": "a e i o u a' e' i' o' u' b c d f g h j k l m n n~ p q r s t v w
   x y z + ` # 0"
rules:




                    The phenomena
Spelling changes:
(freebie: u-insertion)
1. g-j mutation
2. z-c-z mutation
3. z-c mutation
4. Pluralization

Morpheme automaton:
Noun endings
Verb conjugation - 1 form


                       u-insertion
• Let’s see how to turn this into a spelling-change automaton
• The data: must insert u after g if followed by front vowel
  (e, i, é, í)
Accept:
         pague… (1st person subjunctive)
         pag0e…
More generally, Accept:
         XguF
         Xg0F
But Reject:
u… ,     Xgua, Xg F
0… ,     Xg0a, Xg ¬F



                 In words…

• Loop until we find a g:g

• Pair with 0:u and see if a front vowel (FRONT)
  follows; if so, accept; otherwise, reject




                     In a picture:
[Figure: three states. Start loops on @:@ until it sees g:g; Found (g:g) moves
to Inserted on 0:u; from Inserted, @:FRONT accepts and starts over; anything
else rejects.]

Now insert ‘rejects’:
 • If @:FRONT after g:g
 • If 0:u at Start
Now insert ‘idling’:
 • g:g Stay put
 • @:0 Stay put

   pagar:   yo pago (1st person present);
            yo pague (1st person subjunctive)




u-insertion:
 start:
  'g': found_g
  '0:u': reject
  '@': start
 found_g:
  '0:u': inserted
  '@:0': found_g
  'g': found_g
  '@:FRONT': reject
  '@': start
 inserted:
  '@:FRONT': start
  '@': reject
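
To try the listing above by hand, here is a rough Python simulation of it. Several things are assumptions, not facts about the lab software: that the first matching label in a state wins, that '@' matches any symbol, that start and found_g are accepting while inserted is not, and that only the unaccented front vowels e and i need handling. The names RULE, label_matches and accepts are invented for this sketch.

```python
FRONT = set("ei")  # unaccented front vowels only, for brevity

# The u-insertion automaton, one list of (label, next state) per state.
# Labels are (lexical, surface); a bare 'g' in the yaml is written g:g here.
RULE = {
    "start":    [(("g", "g"), "found_g"), (("0", "u"), "reject"),
                 (("@", "@"), "start")],
    "found_g":  [(("0", "u"), "inserted"), (("@", "0"), "found_g"),
                 (("g", "g"), "found_g"), (("@", "FRONT"), "reject"),
                 (("@", "@"), "start")],
    "inserted": [(("@", "FRONT"), "start"), (("@", "@"), "reject")],
}
ACCEPTING = {"start", "found_g"}  # assumption: 'inserted' must see a vowel

def label_matches(label, lex, surf):
    want_lex, want_surf = label
    lex_ok = want_lex == "@" or want_lex == lex
    if want_surf == "FRONT":
        surf_ok = surf in FRONT
    else:
        surf_ok = want_surf == "@" or want_surf == surf
    return lex_ok and surf_ok

def accepts(lexical, surface):
    """Run the automaton over equal-length lexical/surface strings."""
    if len(lexical) != len(surface):
        return False
    state = "start"
    for lex, surf in zip(lexical, surface):
        # Assumption: the first matching label listed for the state wins.
        state = next((dst for label, dst in RULE[state]
                      if label_matches(label, lex, surf)), "reject")
        if state == "reject":
            return False
    return state in ACCEPTING

print(accepts("pag0e", "pague"))  # True:  u inserted before a front vowel
print(accepts("page", "page"))    # False: front vowel right after g
print(accepts("pago", "pago"))    # True:  no insertion needed
print(accepts("pag0a", "pagua"))  # False: u inserted before a back vowel
```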

            Phenomenon 2: z-c mutation

• z-c mutation
  z → c before front vowels, z otherwise
  cruzar (to cross); cruzo, cruzas, cruza, cruzamos,
  cruzan, cruce
• If s causes a front vowel (e.g., e) to surface, then
  the rule still applies:
  lápiz, lápices (pencil, pencils) [ la’piz, la’pices]



   Example: look at phenomenon, then see
           first how to describe

• What is the left and right context of the change?
• Write it as a declarative constraint
• Remember that you can use both the surface and
  the lexical characters to admit or to rule out a
  possibility
• Thinking in terms of constraints (what is ruled out
  by the rule) is the most difficult ‘mindset’ to
  attain…


Build automaton for lexical, surface pairs

 • But what are the lexical pairs?
 • Ah, your job!
 • Trying pairings – not generally the infinitive, e.g.
    cruzar, cruzamos → legit pair?
    cruzar
    cruzamos
 Look at the other pairs – what do you think the root
   is?


                          Writing rules
•  cruzar/cruzamos cruzar/cruce ?
•  We can try a (tentative) lexical/surface pair, and from that extract the
   right spelling change
• Do it step by step: use the alignment to write down the ‘straight-line’
   acceptance path:
  cruz
  cruce
Pad out length by using 0’s (nulls) (why is this important?) – Remember the
   equal length constraint?
  cruz0            cruz0
  cruce            cruzo
Outline context – hmm, perhaps we do need root?




                           Writing rules

      From context to rule:
        cruz0, cruce: c:c, r:r, u:u, z:c, 0:e - accept
[Figure: straight-line fst with arcs c:c, r:r, u:u, … then z:c, then 0:e]
        cruz+    cruz+
        cruce    cruzo
But… is this the correct root?
         Design of morpheme automaton

• One big fsa, that handles two phenomena:
  • plurals and
  • verb endings




     Automaton design for lexicon
                     initial
Root: noun                      Root: verb




Q: what do we need to add to noun sequence?



     The morpheme tree: Adding plurals - ciudades

[Figure: tree. Begin branches to Noun_root and verb; Noun_root → Suffix →
plural | singular → End]

Output, emitted piece by piece:   [   Noun(city)   +Number: Plural   ]
Final output: [Noun(city)+Number: Plural]

                The lexicon – take 2

You will deal with two types of ‘endings’
Noun endings: plural suffix +s
Verb endings: verb stem + tense markers
  Simplest: infinitive marker +ar, +er, +ir
  See table in lab file for details: 5 x 3 table for Present
      tense; ditto for Subjunctive tense (“I might….”)




                   Lexicon specification details
•   Lowercase: alternation states - epsilon transitions
•   Uppercase: lexical states – actual spell-out of prefixes, roots, suffixes
Begin: Prefix Root
Prefix: ADJ_PREFIX V_PREFIX
Root: N_ROOT ADJ_ROOT V_ROOT_PREF V_ROOT_NO_PREF

V_PREFIX:
re+   V_ROOT_PREF       REP+
un+   V_ROOT_PREF       REV+
…
N_ROOT:
`cat    AfterNoun       Noun(cat)
`dog    AfterNoun       Noun(dog)
…
End:     #

[Figure: Begin –ε→ Prefix and Begin –ε→ Root; N_ROOT –(`cat, Noun(cat))→ AfterNoun]

         Is 2-level morphology sufficient?
• So, this lets us think what the system might not be good
  for… let’s look at English first….
• Is morphology really purely linear???
• There seem to be some kinds of ‘long distance’
  constraints…
• Prefix/suffix links: only some prefixes tied to some
  suffixes
   • Un___________able
   • Undoable, uncanny, ?uncannyable, unthinkable,
     *unthink, thinkable, readable, unreadable, unkind,
     *unkindable
• So, we have to ‘keep track’ that the un is first or not –
  what does lexicon look like?

  The limits of finite-state machines



An FSA must be an associative system algebraically:
(a · b) · c ≡ a · (b · c)

 ?   dark blue sky