					    Finite State Machines


            LING 362
           Intro to NLP




    Overview
        Regular Expressions
        FSAs

        Properties of Regular
         Languages




    Regular Expressions
      A regular expression (RE) is a formula in a
       specialized language, used to characterize sets of strings.
        - A string is a sequence of characters
        - REs allow us to search for patterns
      A finite-state machine is a device for
       recognizing/generating the strings that a regular expression describes
      We’ll use a “Perlish” notation for writing regular
       expressions, based on regular expressions in the
       Perl programming language.
        - The concepts are the important thing
        - NB: Perlish isn’t exactly the same as Perl
      We will write REs between slashes: /…/
    Regular expression inventory (1)

       Character Literals and Classes
         - Characters: /abcd/
         - Set: /p[aeiou]p/
         - Range: /ab[a-z]d/
       Operators (disjunction, negation)
         - Disjunction:
              Set elements: /[Aa]ardvark/
              Sequences of characters: /ant(eater|farm)/
         - Negation:
              Single item: /[^a]/ (any character but a)
              Range: /[^a-z]/ (not a lowercase letter)
    Regular expression inventory (2)
       Counters
         - ?: Optionality (0 or 1 occurrence): /colou?r/
         - * (Kleene star): Any number of occurrences: /[0-9]*/
         - +: At least one occurrence: /[0-9]+/
         - {n}: exactly n occurrences: /[0-9]{4}/
       Wildcard: matches any single character (.)
         - /beg.n/




    Regular expression inventory (3)
       Parentheses: used to group items together
         - /ant(farm)?/: all of farm is optional
       Escaped characters: needed to match characters that have a
       special meaning literally: *, +, ?, (, ), |, [, ], .
         - Use a backslash: /why\?/
         - Period expressed as: /\./
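A small check of grouping and escaping, using Python's re module as a stand-in for the Perlish notation (an assumption; the slides do not name a tool). re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

# Parentheses make the whole of "farm" optional, not just the final "m".
ant_ok = [s for s in ["ant", "antfarm", "antf"]
          if re.fullmatch(r"ant(farm)?", s)]

# Unescaped, ? is a counter: /why?/ describes "wh" or "why", not "why?".
unescaped_matches_literal = re.fullmatch(r"why?", "why?") is not None

# Escaped, \? is a literal question mark.
escaped_matches_literal = re.fullmatch(r"why\?", "why?") is not None

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.escape("why?")
auto_matches_literal = re.fullmatch(auto_escaped, "why?") is not None
```
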




    Regular expression inventory (4)
       Anchors: anchor expressions to various parts of the string
         - ^ start of line
              do not confuse with [^..] used to express negation:
               ^ negates only as the first character inside brackets;
               anywhere else it’s a start of line
         - $ end of line
         - \b word boundary (the position between a word character
               and a non-word character)
              word characters are digits, underscores, or letters, i.e.,
               [0-9A-Za-z_]
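A short sketch of the anchors in Python's re module (used here as a stand-in for Perlish; the sample text is made up). Note that \b matches a position, the boundary between a word character and a non-word character, rather than a character.

```python
import re

text = "the cat sat\nconcatenate"

# With re.MULTILINE, ^ and $ anchor at the start and end of each line.
starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \bcat\b only finds "cat" as a whole word; plain cat also finds
# the "cat" inside "concatenate".
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
substring = [m.start() for m in re.finditer(r"cat", text)]

# As the first character inside brackets, ^ negates the set.
non_lower = re.findall(r"[^a-z]", "aB3")
```
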




    Examples of Regular Expressions
      /fire/ a sequence of f followed immediately by i, then
       immediately by r, then immediately by e
      /fires?/ matches fire or fires
      /fires\?/ matches fires? (with a literal question mark)
      /[abcd]/ matches a, b, c, or d
      /[0-9]/ matches any character in the range 0 to 9 (inclusive)
      /[^0-9]/ matches any non-digit character, i.e., any character
       except those in the set 0 thru 9
      /[0-9]+/ matches 0, 1, 11, 12, 367, …
      /[0-9]*/ matches 0, 1, 11, 12, 367, … and also matches the empty string
      /fir./ matches fire, fir9, firm, firp, …
      /fir.*/ matches fir, fire, fir987, firppery, …
      /[fFHhs]ire/ matches fire, Fire, Hire, hire, sire
      /f|Fire/ matches f or Fire (but not fire)
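The examples above can be checked mechanically. This is a minimal sketch using Python's re.fullmatch, which asks whether a pattern describes the entire string; the helper name describes is made up for this illustration.

```python
import re

def describes(pattern: str, s: str) -> bool:
    """True if the pattern matches the whole of s."""
    return re.fullmatch(pattern, s) is not None

# (pattern, string) pairs that should match.
positives = [
    (r"fires?", "fire"), (r"fires?", "fires"),
    (r"fires\?", "fires?"),
    (r"[0-9]+", "367"), (r"[0-9]*", ""),
    (r"fir.", "firm"), (r"fir.*", "fir"), (r"fir.*", "firppery"),
    (r"[fFHhs]ire", "Hire"),
    (r"f|Fire", "f"), (r"f|Fire", "Fire"),
]

# Pairs that should not match.
negatives = [
    (r"fires\?", "fire"),   # the literal ? is required
    (r"[0-9]+", ""),        # + needs at least one digit
    (r"fir.", "fir"),       # . needs exactly one more character
    (r"f|Fire", "fire"),    # disjunction is over whole sequences
]

all_positive = all(describes(p, s) for p, s in positives)
none_negative = not any(describes(p, s) for p, s in negatives)
```
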




    Precedence
      /fire|ings?/   the sequence fire or the sequence ing (the latter
       optionally followed by s)
      Why?
      Because sequences have precedence over disjunction
      To override precedence, use parentheses
      /fir(e|ings)/ the sequence fir followed by either the
       sequence e or the sequence ings
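A quick comparison of the two patterns in Python's re module (a stand-in for Perlish), showing how the parentheses change what the disjunction covers.

```python
import re

def describes(pattern: str, s: str) -> bool:
    return re.fullmatch(pattern, s) is not None

words = ["fire", "ing", "ings", "firings", "firing"]

# Without parentheses the disjunction splits the whole pattern:
# (fire) | (ings?)
ungrouped = [w for w in words if describes(r"fire|ings?", w)]

# With parentheses the disjunction is limited to what follows fir.
grouped = [w for w in words if describes(r"fir(e|ings)", w)]
```
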




     Precedence Rules
      1) Parentheses have the highest precedence.
      2) Then come counters, *, +, ?, {}
      3) Then come sequences and anchors
         •   so, /good.*/ matches goodies, etc., and not (just)
             goodgood
         •   /echo{3}/          the sequence ech followed by ooo
         •   /(echo){3}/        the sequence echoechoecho
      4) Then comes disjunction
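The counter examples above, checked in Python's re module (a stand-in for Perlish). A counter binds to the single item before it unless parentheses widen its scope.

```python
import re

def describes(pattern: str, s: str) -> bool:
    return re.fullmatch(pattern, s) is not None

# {3} applies only to the o: ech followed by ooo.
counter_on_char = describes(r"echo{3}", "echooo")
counter_not_on_word = describes(r"echo{3}", "echoechoecho")

# Parentheses make the whole sequence the counted item.
counter_on_group = describes(r"(echo){3}", "echoechoecho")

# In /good.*/ the star applies to the wildcard, not to good.
star_on_wildcard = describes(r"good.*", "goodies")
```
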




     Aliases
        Use aliases to designate particular recurrent sets of
        characters
        \d                     [0-9]: digit
        \D                     [^\d]: non-digit
        \w                     [a-zA-Z0-9_]: alphanumeric or underscore
        \W                     [^\w]: any other character
        \s                     [ \r\t\n\f]: whitespace character
                                space, \r: carriage return, \t: tab,
                                \n: newline, \f: formfeed
        \S                     [^\s]: non-whitespace
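A sketch of the aliases in Python's re module. One assumption to flag: in Python 3, \d, \w and \s also match non-ASCII characters by default, so the re.ASCII flag is used here to get the bracketed definitions listed above. The sample string is made up.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)
spaces = re.findall(r"\s", sample, flags=re.ASCII)
non_words = re.findall(r"\W", sample, flags=re.ASCII)
```
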



     Example 1

      /\$[0-9]+(\.[0-9][0-9])?/
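The slide gives no gloss for this pattern; it appears to describe dollar amounts with an optional two-digit cents part. A minimal check in Python's re module, under that reading (the name is_price is made up):

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")

def is_price(s: str) -> bool:
    """True if the whole string fits the pattern."""
    return PRICE.fullmatch(s) is not None

accepted = [s for s in ["$5", "$12.99", "$12.9", "12.99", "$.99"]
            if is_price(s)]
```

The escaped \$ and \. are literals here; unescaped, $ would be an end-of-line anchor and . a wildcard.
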




     Example 2


      Times on a digital watch (hours and minutes)


      /([1-9]|1[012]):[0-5][0-9]/




     Overgeneration

      /\d\d:\d\d/

      recognizes watch times, but also other sequences such as 99:99.
        In other words, the pattern overgenerates, covering expressions
        which aren’t in the target language




     Undergeneration
      /1[012]:[0-5][0-9]/


      undergenerates, i.e., does not cover all watch times (it misses
        the hours 1 through 9).
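Over- and undergeneration can be seen by running the patterns against a few strings. This sketch assumes the target is a 12-hour display: an hour from 1 to 12, a colon, and minutes from 00 to 59, written with the hour alternatives grouped so the disjunction does not swallow the minutes.

```python
import re

TARGET = r"([1-9]|1[012]):[0-5][0-9]"
OVER = r"\d\d:\d\d"
UNDER = r"1[012]:[0-5][0-9]"

def describes(pattern: str, s: str) -> bool:
    return re.fullmatch(pattern, s) is not None

samples = ["9:30", "12:59", "13:00", "99:99"]

target_hits = [s for s in samples if describes(TARGET, s)]
over_hits = [s for s in samples if describes(OVER, s)]
under_hits = [s for s in samples if describes(UNDER, s)]
```
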




     Representing sentences

      ‘handling’ agreement:
      /the (student solves|students solve) the problem/

      an optional adjective:
      /the clever? (student solves|students solve) the problem/

      generating an infinite number of sentences
         /the clever? (student solves|students solve) the problem (and
        the clever? (student solves|students solve) the problem)*/

      NOTE: here the symbols are words, not characters! Be sure to
       define the symbol type
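Why the symbol type matters: a character-level engine such as Python's re reads clever? as "cleve" plus an optional r. One way to get the word-level reading in such an engine, sketched here as an assumption rather than the slides' own method, is to group each optional word together with its trailing space.

```python
import re

CLAUSE = r"the (clever )?(student solves|students solve) the problem"
SENTENCE = re.compile(CLAUSE + r"( and " + CLAUSE + r")*")

def is_sentence(s: str) -> bool:
    return SENTENCE.fullmatch(s) is not None

tests = [
    "the student solves the problem",
    "the clever students solve the problem",
    "the student solve the problem",
    "the student solves the problem and the clever students solve the problem",
]
accepted = [is_sentence(s) for s in tests]
```
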




     Overview
         Regular Expressions
         FSAs

         Properties of Regular
          Languages




     A Simple Finite State Automaton (or FSA)
        Example: FSA to recognize strings of the form: /[ab]+/
        i.e., L ={a, b, ab, ba, aab, bab, aba, bba, …}

        Transition Table
          initial = 0; final = {1}
          0->a->1
          0->b->1
          1->a->1
          1->b->1




     How an FSA accepts or rejects a string
        The behavior of an FSA is completely determined by its
        transition table. The assumption is that there is a tape, and
        the input symbols are read off consecutive cells of the tape.
        The machine starts in the start (initial) state, about to read
        the contents of the first cell on the input ‘tape’.
        The FSA uses the transition table to decide where to go at
        each step
        A string is rejected in exactly two cases:
          - 1. a transition on an input symbol takes you nowhere
          - 2. the state you’re in after processing the entire input is
            not an accept (final) state
        Otherwise, the string is accepted.
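The procedure above can be sketched in a few lines of Python. The table is the one for /[ab]+/ (initial state 0, final state 1); storing it as a dictionary keyed by (state, symbol) is a choice made for this illustration.

```python
TRANSITIONS = {
    (0, "a"): 1,
    (0, "b"): 1,
    (1, "a"): 1,
    (1, "b"): 1,
}
INITIAL = 0
FINAL = {1}

def accepts(tape: str) -> bool:
    """Read the tape cell by cell, following the transition table."""
    state = INITIAL
    for symbol in tape:
        if (state, symbol) not in TRANSITIONS:
            return False            # case 1: the transition goes nowhere
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL           # case 2: must end in a final state
```

For example, accepts("abba") holds, while accepts("") fails by case 2 and accepts("abc") fails by case 1.
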

     FSA formally
       A finite state automaton is defined by the following parameters:
         - Q: finite set of N states: q0, q1, …, qN-1
         - Σ: finite input alphabet
         - q0: designated start state
         - F: set of final states (subset of Q)
         - δ(q, i): transition function; given a state q in Q and an
           input symbol i in Σ, it returns the next state in Q
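The five parameters map directly onto a small record type. This is a sketch only: the class name FSA, the field names, and the consistency checks are choices made here, and the example machine is the one-state automaton for zero or more a's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # the input alphabet
    start: int           # q0
    finals: frozenset    # F
    delta: dict          # transition function: (q, i) -> next state

    def __post_init__(self):
        # Check that the parts are consistent with each other.
        if self.start not in self.states:
            raise ValueError("start state must be in Q")
        if not self.finals <= self.states:
            raise ValueError("F must be a subset of Q")
        for (q, i), target in self.delta.items():
            if q not in self.states or target not in self.states:
                raise ValueError("transition uses an unknown state")
            if i not in self.alphabet:
                raise ValueError("transition uses an unknown symbol")

    def step(self, q, i):
        """Return the next state, or None if no transition is defined."""
        return self.delta.get((q, i))

a_star = FSA(
    states=frozenset({0}),
    alphabet=frozenset({"a"}),
    start=0,
    finals=frozenset({0}),
    delta={(0, "a"): 0},
)
```
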




     More Examples of FSA’s
       Let’s design FSA’s to recognize
         - the set of zero or more a’s
         - the set of all lowercase alphabetic strings ending in a b.
         - the set of all strings in [ab]* with exactly two a’s.
         - simple NPs, PPs, Ss
         - etc.




     The set of zero or more a’s
        L ={, a, aa, aaa, aaaa, …}

        Transition Table
           initial =0; final = {0}
           0–>a-> 0




     FSA for set of all lowercase alphabetic strings
     ending in b
        /[a-z]*b/
        initial =0; final ={1}
        0->[a, c-z]->0
        0->b->1
        1->b->1
        1->[a, c-z]->0




     The set of all strings in [ab]* with exactly 2 a’s
        Do this yourself
        It might help to first rewrite a more precise regular
        expression for this




     FSA for simple NPs, PPs, S, …
      initial=0; final={2}
      0->D->1
      0->ε->1
      1->N->2

      Another FSA for NPs:
      initial=0; final={2}
      0->N->2
      0->D->1
      1->N->2
      2->N->2

 • D is an alias for [the, a, an, all,…], N for [dog, cat, robin,…]
 • What if we wanted to add adjectives? Or recognize PPs?
 • What about one for simple sentences?
     • /(Prep D? A* N+)* (D? N) (Prep D? A* N+)* (V_tns|Aux
     V_ing) (Prep D? A* N+)*/
      • Note: FSA1 concat FSA2 recognizes L(FSA1) concat L(FSA2)

     Deterministic and Non-Deterministic FSA’s
        An FSA is non-deterministic (NFSA) when, for some state and
         input, there is more than one state it can go to
        Occurs when transition table allows for a transition to two or
         more states from one state on a given input symbol.
         - e.g., 1->a->2, 1->a->4
        Whenever epsilon-transitions occur, these can be taken without
         consuming input.
         - So, whenever epsilon-transitions occur, the machine could
           either take the epsilon-transition, or consume an input
           symbol, introducing non-determinism.
        Any NFSA can be converted to an equivalent DFSA (deterministic),
         at the expense of possibly many more states.
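A non-deterministic machine can be simulated by tracking the set of states it could be in, which is the idea behind converting an NFSA to a DFSA. The sketch below uses a small NP machine with an epsilon-transition (0->D->1, 0->ε->1, 1->N->2); the input symbols are category labels, and using the empty string as the epsilon label is a convention chosen here.

```python
EPS = ""  # label used here for epsilon-transitions

# (state, symbol) -> set of possible next states
NFA = {
    (0, "D"): {1},
    (0, EPS): {1},
    (1, "N"): {2},
}
INITIAL = 0
FINAL = {2}

def eps_closure(states):
    """All states reachable from `states` without consuming input."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in NFA.get((q, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def accepts(symbols) -> bool:
    current = eps_closure({INITIAL})
    for sym in symbols:
        moved = set()
        for q in current:
            moved |= NFA.get((q, sym), set())
        current = eps_closure(moved)
    # Accept if at least one possible path ends in a final state.
    return bool(current & FINAL)
```

Each distinct value of current corresponds to one state of the equivalent deterministic machine, which is why the deterministic version may need more states.
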


     FAQ: Why Are These Machines Finite-State?
        Finite number of states
        Number of states bounded in advance -- determined by its
         transition table
         - Therefore, the machine has a limit to the amount of memory
           it uses.
         Its behavior at each stage is based on the transition table, and
         depends just on the state it’s in, and the input. So, the current
         state reflects the history of the processing so far.
        Certain classes of formal languages (and linguistic phenomena)
         which are not regular require additional memory to keep track
         of previous information (beyond current state and input)
         - e.g., center-embedding constructions (discussed later)


     Overview
         Regular Expressions
         FSAs

         Properties of Regular
          Languages




     Formal Languages Revisited
       We will view any formal language as a set of expressions
       The language will use a finite vocabulary  (called an alphabet),
        and a set of expression-combining operations
       Regular languages are the simplest class of formal languages

       Note: Kleene closure of a set
         Let L = {a, b}.
         Then L* = the set of strings formed by concatenating a’s and
          b’s zero or more times
             = {ε, a, b, ab, aab, aaab, aaaab, ba, baa, …}.
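Since L* is infinite, it can only be listed up to some bound. A minimal sketch that enumerates every string built from at most a given number of members of L (the function name and the bound are choices made here):

```python
from itertools import product

def kleene_up_to(L, max_parts):
    """Strings formed by concatenating 0..max_parts members of L."""
    result = set()
    for n in range(max_parts + 1):
        # n = 0 contributes the empty string.
        for parts in product(sorted(L), repeat=n):
            result.add("".join(parts))
    return result

closure_2 = kleene_up_to({"a", "b"}, 2)
```
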



     Properties of Regular Languages
        The class of regular languages over Σ is defined as follows:
         1. ∅ (the empty set) is a regular language.
         2. ∀a ∈ Σ ∪ {ε}, {a} is a regular language.
            (Σ = alphabet of symbols, ε = the empty string)
         3. If L1 and L2 are regular languages, so are:
             a. L1 ∪ L2, the union (or disjunction) of L1 and L2
             b. L1.L2 = {xy | x ∈ L1, y ∈ L2}, concatenation of L1 and L2
             c. L1*, the Kleene closure of L1 (set formed by
               concatenating members of L1 zero or more times)
        So, if the language L is a regular language, L must be
         buildable from these base cases using only the three
         operations of concatenation, disjunction, and Kleene closure.
     General Closure Properties of Regular Languages

       Concatenation, Union, Kleene Closure
       Intersection: If L1 and L2 are regular languages, so is
        L1 ∩ L2.
       Set Difference: If L1 and L2 are regular languages, so is
        L1 - L2.
       Reversal: If L1 is a regular language, so is L1^R, the language
        formed by reversing all the strings in L1
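Closure under intersection can be shown constructively: run two deterministic machines in parallel and accept only when both accept (the standard product construction). The two example languages below, strings over {a, b} with an even number of a's and strings ending in b, are made up for this sketch.

```python
def run(dfa, s):
    """dfa is a triple (delta, start, finals)."""
    delta, start, finals = dfa
    state = start
    for ch in s:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals

def intersect(d1, d2):
    """Product machine: its states are pairs (state of d1, state of d2)."""
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    delta = {}
    for (p, a), p_next in t1.items():
        for (q, b), q_next in t2.items():
            if a == b:
                delta[((p, q), a)] = (p_next, q_next)
    finals = {(p, q) for p in f1 for q in f2}
    return delta, (s1, s2), finals

EVEN_A = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

both = intersect(EVEN_A, ENDS_B)
in_both = [s for s in ["aab", "b", "ab", "aa", ""] if run(both, s)]
```
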




     What sorts of expressions aren’t regular
        In natural language, examples include center-
        embedding constructions.
          The cat loves Mozart.
          The cat the dog chased loves Mozart.
          The cat the dog the rat bit chased loves Mozart.
          The cat the dog the rat the elephant admired bit chased
           loves Mozart.
          (the noun)^n (transitive-verb)^(n-1) loves Mozart
        These aren’t regular
          - though /A*B* loves Mozart/ is regular (it does not require
            the two counts to match)
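The difference between the regular approximation and the real pattern is the counting. In this sketch each "the noun" is coded as the letter N and each transitive verb as V (a coding assumed for illustration), and the final "loves Mozart" is left out.

```python
import re

def regular_approx(s: str) -> bool:
    """The regular pattern: any number of N's, then any number of V's."""
    return re.fullmatch(r"N*V*", s) is not None

def center_embedded(s: str) -> bool:
    """n nouns followed by exactly n-1 verbs, with n at least 1."""
    m = re.fullmatch(r"(N+)(V*)", s)
    # Comparing the two lengths is the step an FSA cannot do for
    # unbounded n: it would need a separate state for every count.
    return m is not None and len(m.group(2)) == len(m.group(1)) - 1

samples = ["N", "NNV", "NNNVV", "NNVV", "VVV"]
approx_hits = [s for s in samples if regular_approx(s)]
exact_hits = [s for s in samples if center_embedded(s)]
```

The regular pattern accepts every sample, including the two that have the wrong number of verbs.
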



     Regular Expressions and FSAs
       Regular expressions are equivalent to FSA’s: both describe
        exactly the regular languages

       So, the language of any FSA can be built up using just
        concatenation, union, and Kleene *


       Question: how would you (graphically) combine FSA’s using:
         - Concatenation
         - Union
         - Kleene *
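One possible answer, sketched in code rather than pictures: glue machines together with epsilon-transitions, in the style of Thompson's construction. Everything here (the NFA class, the function names, None as the epsilon label) is a choice made for this illustration, and each machine should be used only once when combining, since copies are not made.

```python
from itertools import count

EPS = None          # label for epsilon-transitions
_fresh = count()    # supplies unused state numbers

class NFA:
    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges  # list of (source, label, destination)

def symbol(a):
    s, f = next(_fresh), next(_fresh)
    return NFA(s, f, [(s, a, f)])

def concat(m1, m2):
    # Link the final state of m1 to the start state of m2.
    link = [(m1.final, EPS, m2.start)]
    return NFA(m1.start, m2.final, m1.edges + m2.edges + link)

def union(m1, m2):
    # New start branches into both machines; both exit to a new final.
    s, f = next(_fresh), next(_fresh)
    extra = [(s, EPS, m1.start), (s, EPS, m2.start),
             (m1.final, EPS, f), (m2.final, EPS, f)]
    return NFA(s, f, m1.edges + m2.edges + extra)

def star(m):
    # Allow skipping m entirely, and looping back after each pass.
    s, f = next(_fresh), next(_fresh)
    extra = [(s, EPS, m.start), (m.final, EPS, f),
             (s, EPS, f), (m.final, EPS, m.start)]
    return NFA(s, f, m.edges + extra)

def closure(m, states):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in m.edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(m, s):
    current = closure(m, {m.start})
    for ch in s:
        moved = {dst for src, label, dst in m.edges
                 if src in current and label == ch}
        current = closure(m, moved)
    return m.final in current

# Machine for /(a|b)*c/
machine = concat(star(union(symbol("a"), symbol("b"))), symbol("c"))
```
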



