
lecture03

									                        CS1622

                    Lecture 3
                 Lexical Analysis



                          CS 1622 Lecture 3                      1




Equivalence of DFA and NFA
   Theorem:
       For every non-deterministic finite-state machine M, there
        exists a deterministic machine M' such that M and M' accept
        the same language.


   Why is the theorem important for scanner
    generation?

   Theorem is not enough: what do we need for
    automatic scanner generation?







How to Implement an FSM
A table-driven approach:
 table:

       one row for each state in the machine, and
       one column for each possible character.
   Table[j][k]
       which state to go to from state j on character k,
       an empty entry corresponds to the machine
        getting stuck.


    The table-driven program for a DFA
          state = S                          // S is the start state
          repeat {
                k = next character from the input
                if k == EOF then             // end of input
                    if state is a final state then accept
                    else reject
                state = T[state,k]
                if state == empty then reject // got stuck
          }
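The loop above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the name `run_dfa`, the dictionary layout of the table, and the two-state example machine are assumptions made for illustration.

```python
EMPTY = None  # an empty table entry: the machine gets stuck


def run_dfa(table, start, finals, text):
    """Return True if the DFA accepts text, False if it rejects or gets stuck."""
    state = start
    for k in text:                          # k = next character from the input
        state = table[state].get(k, EMPTY)  # state = T[state, k]
        if state is EMPTY:
            return False                    # got stuck
    return state in finals                  # end of input: accept iff final state


# Hypothetical DFA over {a, b}: accepts exactly the strings that end in "b".
# One row per state, one column per character.
TABLE = {
    "S": {"a": "S", "b": "F"},
    "F": {"a": "S", "b": "F"},
}
```

For example, `run_dfa(TABLE, "S", {"F"}, "aab")` accepts, while `"abc"` is rejected because the entry for `c` is empty.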





    Generating a scanner

        Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA





    Regular Expressions

   FAs are not a good way to specify tokens: diagrams are hard
    to write down
   regular expressions are another specification
    technique
       a compact way to define a language that can be accepted by
        an automaton.
   used as the input to a scanner generator
       define each token, and
       define white-space, comments, etc
             these do not correspond to tokens,
             but must be recognized and ignored.



         Example: Simple identifier
          English: A letter, followed by zero or
           more letters or digits.
          RE: letter . (letter | digit)*

         Operators:
    |         means "or"
    .         means "followed by" (usually just use position)
    *         means zero or more instances
    ()        are used for grouping
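As a rough illustration, the identifier RE can be written with Python's `re` module, where concatenation is expressed by position rather than by `.`. The names `LETTER`, `DIGIT` and `is_identifier` are assumptions of this sketch, and "letter" is taken to mean ASCII letters only.

```python
import re

LETTER = "[A-Za-z]"   # shorthand for a | b | ... | z | A | ... | Z
DIGIT = "[0-9]"       # shorthand for 0 | 1 | ... | 9

# letter . (letter | digit)*
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")


def is_identifier(s):
    """True if the whole string s matches the identifier RE."""
    return IDENTIFIER.fullmatch(s) is not None
```

`fullmatch` is used so that the entire string has to match the expression, not just a prefix of it.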




         Operands of a regular
         expression
            Operands are same as labels on the edges of an
             FSM
                single characters, or
                the special character ε (the empty string)
            "letter" is a shorthand for
                a | b | c | ... | z | A | ... | Z
            "digit" is a shorthand for
                0|1|…|9
            sometimes we put the characters in quotes
                necessary when denoting the operator characters: | . *







         Precedence of | . * operators.
     Regular Expression Operator   Analogous Arithmetic Operator   Precedence
        |                          plus                            lowest
        .                          times                           middle
        *                          exponentiation                  highest
   Consider regular expressions:
        letter.letter | digit*
        letter.(letter | digit)*
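The effect of precedence on these two expressions can be checked with a small sketch using Python's `re` module, which follows the same precedence order for `|`, concatenation and `*`. The helper name `matches` and the ASCII character classes are assumptions made for illustration.

```python
import re

LETTER, DIGIT = "[A-Za-z]", "[0-9]"

# letter.letter | digit*   groups as   (letter.letter) | (digit*)
FIRST = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")

# letter.(letter | digit)*   the parentheses override the default grouping
SECOND = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")


def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None
```

Under this reading, `FIRST` matches exactly two letters or any run of digits (including the empty string), while `SECOND` matches a letter followed by any mix of letters and digits.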
        Examples
           Describe (in English) the language defined by
            each of the following regular expressions:
               letter (letter | digit*)




               digit digit* "." digit digit*








        Example: Integer Literals
           An integer literal with an optional sign can be
            defined in English as:
               “(nothing or + or -) followed by one or more digits”
           The corresponding regular expression is:
               ("+" | "-" | epsilon).(digit.digit*)
           A new convenient operator ‘+’
             digit.digit*           is the same as
             digit+                 which means "one or more digits"
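A short sketch of the same idea in Python's `re` syntax: the literal sign characters are escaped (or left as a plain `-`), epsilon is written as an empty alternative, and the two spellings of "one or more digits" are compared. The names `LONG_FORM`, `SHORT_FORM` and `is_int` are hypothetical.

```python
import re

DIGIT = "[0-9]"

# ("+" | "-" | epsilon).(digit.digit*)
LONG_FORM = re.compile(rf"(\+|-|)({DIGIT}{DIGIT}*)")

# The same language using the one-or-more operator: digit+
SHORT_FORM = re.compile(rf"(\+|-|){DIGIT}+")


def is_int(pattern, s):
    """True if the whole string s is an integer literal according to pattern."""
    return pattern.fullmatch(s) is not None
```

On inputs such as `"+12"`, `"-7"`, `"42"`, `"+"` and `""`, both patterns give the same answer.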






        Language Defined by a
        Regular Expression
           Recall: language = set of strings
           Language defined by an automaton / RE
               the set of strings accepted by the automaton
               the set of strings that match the expression.
Regular Exp.                 Corresponding Set of Strings
epsilon                      {""}
a                            {"a"}
a.b.c                        {"abc"}
a|b|c                        {"a", "b", "c"}
(a | b | c)*                 {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
             REs describe regular
             languages
             Patterns form a regular language
                *** any finite language is regular ***
             Regular Expression (RE) (over alphabet Σ)
             ε is an RE denoting the set {ε}
             If a is in Σ, then a is an RE denoting {a}
             If x and y are REs denoting L(x) and L(y) then
                x is an RE denoting L(x); y is an RE denoting L(y);
                x | y is an RE denoting L(x) ∪ L(y)
                xy is an RE denoting L(x)L(y)
                x* is an RE denoting L(x)*
                REs can be combined to form other REs
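The inductive definition above can be turned into a small evaluator that lists the strings of L(x) up to a length bound (the bound is needed because x* usually denotes an infinite set). This is only a sketch: the nested-tuple encoding of REs and the function name `lang` are assumptions made for illustration.

```python
def lang(x, n):
    """Return the set of strings of length <= n in L(x).

    x is a nested tuple (hypothetical encoding):
      ("eps",), ("sym", a), ("alt", x, y), ("cat", x, y), ("star", x)
    """
    kind = x[0]
    if kind == "eps":                      # ε denotes {ε}
        return {""}
    if kind == "sym":                      # a denotes {a}
        return {x[1]} if n >= 1 else set()
    if kind == "alt":                      # L(x) ∪ L(y)
        return lang(x[1], n) | lang(x[2], n)
    if kind == "cat":                      # L(x)L(y)
        return {u + v for u in lang(x[1], n) for v in lang(x[2], n)
                if len(u + v) <= n}
    if kind == "star":                     # L(x)*
        base = lang(x[1], n)
        result, frontier = {""}, {""}
        while frontier:                    # keep appending until nothing new fits
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown RE node: {kind}")


A, B, C = ("sym", "a"), ("sym", "b"), ("sym", "c")
A_OR_B_OR_C = ("alt", A, ("alt", B, C))    # a | b | c
ABC = ("cat", A, ("cat", B, C))            # a.b.c
```

With these definitions, `lang(ABC, 3)` is `{"abc"}` and `lang(("star", A_OR_B_OR_C), 2)` contains the empty string, the three single characters and all nine two-character strings.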




     Example

 Consider the problem of recognizing
  register names
         Register → r (0|1|2| … | 9) (0|1|2| … | 9)*

 •       Allows registers of arbitrary number
 •       Requires at least one digit
 •       RE corresponds to a recognizer (or DFA)

                        Recognizer for Register

         S0 --r--> S1 --(0|1|2| … 9)--> S2
         S2 --(0|1|2| … 9)--> S2

         ***With implicit transitions on other inputs to error state




             Example (continued)
     •   Start in state S0 & take transitions on
         each input character
     •   DFA accepts a word x iff x leaves it in a
         final state (S2)

                           Recognizer for Register

         S0 --r--> S1 --(0|1|2| … 9)--> S2 (accepting state)
         S2 --(0|1|2| … 9)--> S2

     So,
     r17 takes it through s0, s1, s2 and accepts
     r takes it through s0, s1 and fails
     a takes it straight to se




      Example
char ← next character;
state ← s0;
call action(state,char);
while (char ≠ eof)
  state ← δ(state,char);
  call action(state,char);
  char ← next character;

if Τ(state) = final then
    report acceptance;
else
    report failure;

action(state,char)
  switch(Τ(state))
    case start:
      word ← char;
      break;
    case normal:
      word ← word + char;
      break;
    case final:
      word ← word + char;
      break;
    case error:
      report error;
      break;
  end;

Τ         action
S0        start
S1        normal
S2        final
Se        error

δ         r         0,1,2,3,4,5,6,7,8,9     other
S0        S1        Se                      Se
S1        Se        S2                      Se
S2        Se        S2                      Se
Se        Se        Se                      Se

• The recognizer translates directly into code
• To change DFAs, just change the tables
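A minimal Python sketch of the register recognizer, driven by the δ and Τ tables above. It only reports acceptance and leaves out the `word` bookkeeping done by the action routine; the names `char_class`, `DELTA`, `KIND` and `recognize` are assumptions of this sketch.

```python
DIGITS = "0123456789"


def char_class(ch):
    """Map a character to its column in the δ table."""
    if ch == "r":
        return "r"
    return "digit" if ch in DIGITS else "other"


# δ: one row per state, one column per character class
DELTA = {
    "S0": {"r": "S1", "digit": "Se", "other": "Se"},
    "S1": {"r": "Se", "digit": "S2", "other": "Se"},
    "S2": {"r": "Se", "digit": "S2", "other": "Se"},
    "Se": {"r": "Se", "digit": "Se", "other": "Se"},
}

# Τ: the kind of each state
KIND = {"S0": "start", "S1": "normal", "S2": "final", "Se": "error"}


def recognize(word):
    """True if word is a register name: r followed by one or more digits."""
    state = "S0"
    for ch in word:
        state = DELTA[state][char_class(ch)]
    return KIND[state] == "final"
```

Recognizing a different language only requires swapping `DELTA`, `KIND` and `char_class`; the loop in `recognize` stays the same.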




      The Role of Regular
      Expressions
         Theorem:
              for every regular expression, there is a finite-state
               machine that defines the same language, and vice
               versa.

         Why is the theorem important for scanner
          generation?

         Theorem is not enough: what do we need for
          automatic scanner generation?






      Non-deterministic Finite
      Automata
  Each RE corresponds to a deterministic finite
    automaton (DFA)
   Recall the recognizer for Register → r (0|1|2| … | 9) (0|1|2| … | 9)*
  What about an RE such as ( a | b )* abb ?

          S0 --ε--> S1 --a--> S2 --b--> S3 --b--> S4
          S1 --(a|b)--> S1

  This is a little different
   • S0 has a transition on ε
   • S1 has two transitions on a
  This is a non-deterministic finite automaton (NFA)




        Non-deterministic Finite
        Automata
   An NFA accepts a string x iff ∃ a path through the
    transition graph from s0 to a final state & the edge
    labels spell x
    •   Transitions on ε consume no input
    •   To “run” the NFA, start in s0 and take all the transitions
        for each character
     Clone the NFA at each non-deterministic choice
      (guess correctly)
   NFAs are the key to automating the RE→DFA construction
    •   We can paste together NFAs with ε-transitions

            NFA --ε--> NFA   becomes an   NFA
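"Taking all the transitions for each character" can be sketched by tracking the set of states the NFA could be in. The example below uses the NFA for ( a | b )* abb from the previous slide (S0 goes to S1 on ε, S1 loops on a and b, then a, b, b lead to S4). The dictionary encoding, the use of `""` as the ε label, and the function names are assumptions of this sketch.

```python
EPS = ""  # label used in this sketch for ε-transitions

# (state, label) -> set of next states
NFA = {
    ("S0", EPS): {"S1"},
    ("S1", "a"): {"S1", "S2"},   # two transitions on a: the non-deterministic choice
    ("S1", "b"): {"S1"},
    ("S2", "b"): {"S3"},
    ("S3", "b"): {"S4"},
}
START, FINALS = "S0", {"S4"}


def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(text):
    """True if some path from START to a final state spells text."""
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:                 # follow every possible transition
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & FINALS)
```

Keeping a set of states plays the role of "cloning" the NFA at each choice: no guess is ever needed because every alternative is followed at once.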




        Relationship between NFAs
        and DFAs
         DFA is a special case of an NFA
         • DFA has no ε transitions
         • DFA’s transition function is single-valued

         NFA can be simulated with a DFA (less obvious)
         • Simulate sets of possible states
         • Possible exponential blowup in the state space
         • Still, one state transition per character in the input stream




        Automating Scanner
        Construction
         To convert a specification into code:
         1.  Write down the RE for the input language
         2.  Build an NFA
         3.  Build the DFA that simulates the NFA
         4.  Systematically shrink the DFA
         5.  Turn it into code
         •   Scanner generators
             1.   Lex, Flex, and JLex work along these lines
             2.   Algorithms are well-known and well-understood
             3.   Key issue is interface to parser (define all parts of speech)

          Automating Scanner
          Construction
          RE→NFA (Thompson’s construction)
                   •    Build an NFA for each term
                   •    Combine them with ε-moves

          NFA→DFA (subset construction)
                   •    Build the simulation

          DFA→Minimal DFA
                   •    Hopcroft’s algorithm

          DFA→RE
                   •    All pairs, all paths problem

          The Cycle of Constructions: RE → NFA → DFA → minimal DFA, and DFA → RE




          Regular Expressions to NFA
          (1)
         For each kind of RE, define an NFA -
          essentially combine REs
                 Notation: NFA for RE M

                                                M

         • For ε
                                           ε

         • For input a
                                           a




          RE →NFA using Thompson’s
          Construction
              •   NFA pattern for each symbol & each operator
              •   Join them with ε moves in precedence order

              NFA for a:      S0 --a--> S1

              NFA for ab:     S0 --a--> S1 --ε--> S3 --b--> S4

              NFA for a | b:  S0 --ε--> S1 --a--> S2 --ε--> S5
                              S0 --ε--> S3 --b--> S4 --ε--> S5

              NFA for a*:     S0 --ε--> S1 --a--> S3 --ε--> S4
                              S3 --ε--> S1
                              S0 --ε--> S4

              Ken Thompson, CACM, 1968
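The four patterns can be sketched as Python functions that each return an NFA fragment `(start, final, edges)` and glue sub-fragments together with ε-moves. This is an illustrative sketch in the spirit of Thompson's construction, not a reproduction of any particular tool; the fragment representation and all function names are assumptions.

```python
import itertools

EPS = None                      # label used in this sketch for ε-moves
_fresh = itertools.count()      # generator of new state numbers


def sym(a):
    """NFA for a single symbol a."""
    s, f = next(_fresh), next(_fresh)
    return s, f, [(s, a, f)]


def cat(x, y):
    """NFA for xy: join x's final state to y's start state with an ε-move."""
    xs, xf, xe = x
    ys, yf, ye = y
    return xs, yf, xe + ye + [(xf, EPS, ys)]


def alt(x, y):
    """NFA for x | y: new start and final states, ε-moves into and out of each branch."""
    xs, xf, xe = x
    ys, yf, ye = y
    s, f = next(_fresh), next(_fresh)
    return s, f, xe + ye + [(s, EPS, xs), (s, EPS, ys), (xf, EPS, f), (yf, EPS, f)]


def star(x):
    """NFA for x*: ε-moves to enter, leave, repeat, and skip x."""
    xs, xf, xe = x
    s, f = next(_fresh), next(_fresh)
    return s, f, xe + [(s, EPS, xs), (xf, EPS, f), (xf, EPS, xs), (s, EPS, f)]


def accepts(nfa, text):
    """Simulate the NFA on text by tracking the set of possible states."""
    start, final, edges = nfa

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            u = stack.pop()
            for p, label, q in edges:
                if p == u and label is EPS and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    current = closure({start})
    for ch in text:
        current = closure({q for p, label, q in edges
                           if p in current and label == ch})
    return final in current


# (a | b)* abb, built bottom-up in precedence order
A_OR_B_STAR_ABB = cat(star(alt(sym("a"), sym("b"))),
                      cat(sym("a"), cat(sym("b"), sym("b"))))
```

The resulting NFA has more states and ε-moves than a hand-drawn one, which is expected: the construction favors a uniform recipe over compactness.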




    Example of RE -> NFA
    conversion
       Consider the regular expression
                        (1 | 0)*1
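       The NFA is (start state A, accepting state J):

            A --ε--> B        A --ε--> H
            B --ε--> C        B --ε--> D
            C --1--> E        D --0--> F
            E --ε--> G        F --ε--> G
            G --ε--> H        G --ε--> A
            H --ε--> I        I --1--> J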





    NFA to DFA. The Trick
       Simulate the NFA
       Each state of DFA
            = a non-empty subset of states of the NFA
       Start state
            = the set of NFA states reachable through ε-moves
              from NFA start state
       Add a transition S →a S’ to DFA iff
            •   S’ is the set of NFA states reachable from any
                state in S after seeing the input a, considering
                ε-moves as well
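The rules above can be sketched as a worklist algorithm. The example NFA is the one for (1 | 0)*1 used in this lecture, with states A through J; its edge list is an assumption here, read off the diagram and chosen to be consistent with the DFA state names on the example slide. The function names and the dictionary encoding are also assumptions of this sketch.

```python
EPS = ""  # label used in this sketch for ε-moves

# (state, label) -> set of next states; edges assumed from the (1 | 0)*1 diagram
NFA = {
    ("A", EPS): {"B", "H"}, ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},      ("D", "0"): {"F"},
    ("E", EPS): {"G"},      ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"}, ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
START, FINAL, ALPHABET = "A", "J", "01"


def eps_closure(states):
    """All NFA states reachable from `states` through ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def subset_construction():
    """Return (start, dfa) where dfa maps each DFA state to its transitions."""
    start = frozenset(eps_closure({START}))     # DFA start state
    dfa, worklist = {}, [start]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in ALPHABET:
            moved = set()
            for s in S:
                moved |= NFA.get((s, a), set())
            T = frozenset(eps_closure(moved))
            if T:                               # only non-empty subsets are DFA states
                dfa[S][a] = T
                worklist.append(T)
    return start, dfa


DFA_START, DFA = subset_construction()
ACCEPTING = {S for S in DFA if FINAL in S}      # DFA states that contain J
```

Under these assumptions only three of the possible subsets are ever reached, which is typical: the exponential bound is a worst case.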





    NFA to DFA. Remark
       An NFA may be in many states at any
        time
       How many different states ?
       If there are N states, the NFA must be
        in some subset of those N states
       How many subsets are there?
       2^N - 1 (the non-empty subsets) = finitely many


    NFA -> DFA Example
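    [Figure: the NFA for (1 | 0)*1 from the conversion example, states A through J]

    Resulting DFA (start state ABCDHI; EJGHIABCD is accepting because it contains J):

        ABCDHI     --0--> FGHIABCD
        ABCDHI     --1--> EJGHIABCD
        FGHIABCD   --0--> FGHIABCD
        FGHIABCD   --1--> EJGHIABCD
        EJGHIABCD  --0--> FGHIABCD
        EJGHIABCD  --1--> EJGHIABCD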





    NFA to DFA: the practice
        NFA -> DFA conversion is at the heart
         of tools such as flex
        But, DFAs can be huge
        In practice, flex-like tools trade off
         speed for space in the choice of NFA
         and DFA representations






    Putting it all together
        Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA




 Example: a scanner for a very
 simple language

The language of assignment statements:
         left-hand side of assignment is an identifier:
              a letter followed by zero or more letters or digits
         followed by a =

         right-hand side is one of the following:

              ID + ID
              ID * ID
              ID == ID






      Step 1: Define tokens
   The language has five tokens,
         they can be defined by five regular
          expressions:
                  Token         Regular Expression
                  ASSIGN        "="
                  ID            letter (letter | digit)*
                  PLUS          "+"
                  TIMES         "*"
                  EQUALS        "=" "="








      Step 2: Convert REs to NFAs
ASSIGN:                                                 “=”

ID:                                                    letter
                                                                    letter |
                                                                    digit
PLUS:                                                   “+”

TIMES:                                                  “*”

EQUALS:                             “=”                 “=”


Step 3: Convert NFAs to
DFAs
   Subset construction algorithm (aka
    Büchi’s algorithm)
       will learn soon








Step 4: Combining per-token
DFAs
   Goal of a scanner:
       find the longest prefix of the current input that
        corresponds to a token.

   This has two consequences:
       lookahead:
            Examine if the next input character can “extend” the
             current token. If yes, keep building a larger token.
       a real scanner cannot get stuck:
            What if we get stuck building the larger token?
             Solution: return characters back to input.
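One common way to realize "longest prefix plus returning characters" is to remember the last position at which the DFA was in an accepting state and fall back to it when the machine gets stuck. The sketch below uses that approach, which is an assumption here rather than something these slides prescribe; the function name, the table layout and the two-token example are hypothetical.

```python
def longest_match(table, start, accepting, text, pos=0):
    """Return (token, end) for the longest token that is a prefix of text[pos:].

    Returns None if no prefix is a token. `accepting` maps each final state
    to the token it recognizes.
    """
    state, i = start, pos
    last = None                                  # last accepting point seen so far
    while i < len(text):
        state = table.get(state, {}).get(text[i])
        if state is None:
            break                                # stuck: cannot extend the token
        i += 1
        if state in accepting:
            last = (accepting[state], i)         # remember it, keep looking ahead
    return last                                  # text[end:] is returned to the input


# Hypothetical two-token DFA: "=" is ASSIGN, "==" is EQUALS
TABLE = {"S": {"=": "A"}, "A": {"=": "E"}}
ACCEPTING = {"A": "ASSIGN", "E": "EQUALS"}
```

On the input `"==x"` the scanner prefers EQUALS (two characters) over ASSIGN (one character); on `"=x"` it reads `x`, gets stuck, and falls back to ASSIGN.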





Operation Notes
   A value (the current token) must be
    returned when the regular expression is
    matched
       to be able to match input of more than one
        token
   Scanner should start up again trying to
    match another regular expression after
    throwing out whitespace
    Extend the DFA
           modify the DFA so that an edge can have
                   an associated action to
                            "put back one character" or
                            "return token XXX",


           we must combine the DFAs for all of the
            tokens into a single DFA, and
           we must write a program for the "combined"
            DFA.





    Step 4: Example of extending
    the DFA
   The DFA that recognizes simple identifiers must be
    modified as follows:

           S --letter--> identifier state
           identifier state --(letter | digit)--> identifier state
           identifier state --any char except letter or digit--> final state
               action:
                   • put back 1 char
                   • return ID

           recall that scanner is called by parser
            (one token is returned per call)
           hence action return puts the scanner into state S





    Implementing the extended
    DFA
       The table-driven technique works, with a few
        small modifications:
               Include a column for end-of-file
                           e.g., to find an identifier when it is the last token in the
                            input.
               besides ‘next state’, a table entry includes
                           an (optional) action: put back n characters, return token
               Instead of repeating
                           "read a character; update the state variable"
                            until the machine gets stuck or the entire input is read,
               repeat
                           "read a character; update the state variable;
                            perform the action"
               (eventually, the action will be to return a value, so
                the scanner code will stop).




      Step 4: Example: Combined DFA
      for our language
      S   --“+”-->                             F3    return PLUS
      S   --“*”-->                             F4    return TIMES
      S   --letter-->                          ID
      ID  --letter | digit-->                  ID
      ID  --any char except letter or digit--> F2    put back 1 char; return ID
      S   --“=”-->                             TMP
      TMP --“=”-->                             F5    return EQUALS
      TMP --any char except “=”-->             F1    put back 1 char; return ASSIGN




      Transition Table (part 1)
         +                     *                     =

S        F3,                   F4,                   TMP
         return PLUS           return TIMES

ID       F2,                   F2,                   F2,
         put back 1 char;      put back 1 char;      put back 1 char;
         return ID             return ID             return ID

TMP      F1,                   F1,                   F5,
         put back 1 char;      put back 1 char;      return EQUALS
         return ASSIGN         return ASSIGN





      Transition Table (part 2)
         letter                digit                 EOF

S        ID

ID       ID                    ID                    F2,
                                                     put back 1 char;
                                                     return ID

TMP      F1,                   F1,                   F1,
         put back 1 char;      put back 1 char;      put back 1 char;
         return ASSIGN         return ASSIGN         return ASSIGN
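The combined DFA and its transition table can be sketched as a small table-driven scanner in Python. Several simplifications here are assumptions of the sketch, not part of the slides: the final states F1 to F5 are folded into "return the token and go back to S", the repeated "any char except ..." entries are stored once per state in `OTHERWISE`, whitespace is skipped in state S, and the whole input is scanned in one call instead of one token per call.

```python
EOF = "EOF"


def column(ch):
    """Map an input character (or None at end of input) to a table column."""
    if ch is None:
        return EOF
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return ch                       # "+", "*" and "=" index their own columns


# Each entry: (next state, number of chars to put back, token to return or None)
STAY_IN_ID = ("ID", 0, None)
TABLE = {
    "S":   {"+": ("S", 0, "PLUS"), "*": ("S", 0, "TIMES"),
            "=": ("TMP", 0, None), "letter": STAY_IN_ID},
    "ID":  {"letter": STAY_IN_ID, "digit": STAY_IN_ID},
    "TMP": {"=": ("S", 0, "EQUALS")},
}
# Entry used for every other column, including EOF
OTHERWISE = {
    "ID":  ("S", 1, "ID"),          # put back 1 char; return ID
    "TMP": ("S", 1, "ASSIGN"),      # put back 1 char; return ASSIGN
}


def scan(text):
    """Return the list of token names found in text."""
    tokens, pos, state = [], 0, "S"
    while True:
        ch = text[pos] if pos < len(text) else None
        if state == "S" and ch is None:
            return tokens                        # all input consumed
        if state == "S" and ch.isspace():
            pos += 1                             # throw out whitespace
            continue
        entry = TABLE[state].get(column(ch), OTHERWISE.get(state))
        if entry is None:
            raise ValueError(f"scanner stuck in state {state} at position {pos}")
        pos += 1                                 # read a character
        state, put_back, token = entry           # update the state variable
        pos -= put_back                          # perform the action
        if token is not None:
            tokens.append(token)
```

For instance, `scan("x1 = a + b")` yields ID, ASSIGN, ID, PLUS, ID, and `scan("y=a==b")` yields ID, ASSIGN, ID, EQUALS, ID. A scanner called by a parser would return after each token rather than collecting a list.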

