# lecture03

```
CS 1622

Lecture 3
Lexical Analysis

CS 1622 Lecture 3                      1

Equivalence of DFA and NFA
   Theorem:
   For every non-deterministic finite-state machine M, there
exists a deterministic machine M' such that M and M' accept
the same language.

   Why is the theorem important for scanner
generation?

   Theorem is not enough: what do we need for
automatic scanner generation?


How to Implement an FSM
A table-driven approach:
   the table has one row for each state in the machine, and
   one column for each possible character.
   Table[j][k] gives
   the state to go to from state j on character k;
   an empty entry corresponds to the machine getting stuck.

The table-driven program for a DFA
state = S                          // S is the start state
repeat {
    k = next character from the input
    if k == EOF then               // end of input
        if state is a final state then accept
        else reject
    state = T[state, k]
    if state == empty then reject  // got stuck
}


Generating a scanner

Lexical specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA


Regular Expressions

   FAs are not a good way to specify tokens: diagrams are hard to write down
   regular expressions are another specification technique
   a compact way to define a language that can be accepted by an automaton.
   used as the input to a scanner generator
   define each token, and
   also define constructs such as whitespace:
    these do not correspond to tokens,
    but must be recognized and ignored.

Example: Simple identifier
 English: A letter, followed by zero or
more letters or digits.
 RE: letter . (letter | digit)*

Operators:
|         means "or"
.         means "followed by" (usually just use position)
*         means zero or more instances
()        are used for grouping

Operands of a regular expression
   Operands are the same as the labels on the edges of an FSM
   single characters, or
   the special character ε (the empty string)
   "letter" is a shorthand for
   a | b | c | ... | z | A | ... | Z
   "digit" is a shorthand for
   0 | 1 | … | 9
   sometimes we put the characters in quotes
   necessary when denoting the operator characters themselves: | . *


Precedence of | . * operators
Regular Expression Operator    Analogous Arithmetic Operator    Precedence
|                              plus                             lowest
.                              times                            middle
*                              exponentiation                   highest
   Consider regular expressions:
    letter.letter | digit*
    letter.(letter | digit)*
Examples
   Describe (in English) the language defined by
each of the following regular expressions:
   letter (letter | digit*)

   digit digit* "." digit digit*


Example: Integer Literals
   An integer literal with an optional sign can be
defined in English as:
   “(nothing or + or -) followed by one or more digits”
   The corresponding regular expression is:
   ("+" | "-" | epsilon).(digit.digit*)
   A new convenient operator '+' (not to be confused with the quoted sign character "+"):
   digit.digit* is the same as digit+, which means "one or more digits"


Language Defined by a
Regular Expression
   Recall: language = set of strings
   Language defined by an automaton / RE
   the set of strings accepted by the automaton
   the set of strings that match the expression.
Regular Exp.                 Corresponding Set of Strings
epsilon                      {""}
a                            {"a"}
a.b.c                        {"abc"}
a|b|c                        {"a", "b", "c"}
(a | b | c)*                 {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
REs describe regular
languages
Patterns form a regular language
*** any finite language is regular ***
Regular Expression (RE) (over alphabet Σ)
ε is an RE denoting the set {ε}
If a is in Σ, then a is an RE denoting {a}
If x and y are REs denoting L(x) and L(y), then
    x | y is an RE denoting L(x) ∪ L(y)
    xy is an RE denoting L(x)L(y)
    x* is an RE denoting L(x)*
REs can be combined to form other REs

Example

Consider the problem of recognizing register names
Register → r (0|1|2| … | 9) (0|1|2| … | 9)*

•       Allows registers of arbitrary number
•       Requires at least one digit
•       RE corresponds to a recognizer (or DFA)

Recognizer for Register (with implicit transitions on other inputs to an error state):
    S0 --r--> S1
    S1 --(0|1|2| … 9)--> S2
    S2 --(0|1|2| … 9)--> S2

Example (continued)
•   Start in state S0 & take transitions on each input character
•   DFA accepts a word x iff x leaves it in a final state (S2)

Recognizer for Register (S2 is the accepting state):
    S0 --r--> S1
    S1 --(0|1|2| … 9)--> S2
    S2 --(0|1|2| … 9)--> S2

So,
    r17 takes it through s0, s1, s2 and accepts
    r takes it through s0, s1 and fails
    a takes it straight to se
Example
char ← next character;
state ← s0;
call action(state,char);
while (char ≠ eof)
    state ← δ(state,char);
    call action(state,char);
    char ← next character;
if Τ(state) = final then
    report acceptance;
else
    report failure;

action(state,char)
    switch(Τ(state))
        case start:
            word ← char;
            break;
        case normal:
            word ← word + char;
            break;
        case final:
            word ← word + char;
            break;
        case error:
            break;
    end;

Τ         action
S0        start
S1        normal
S2        final
Se        error

δ         r         0,1,2,3,4,5,6,7,8,9     other
S0        S1        Se                      Se
S1        Se        S2                      Se
S2        Se        S2                      Se
Se        Se        Se                      Se

• The recognizer translates directly into code
• To change DFAs, just change the tables

The Role of Regular
Expressions
   Theorem:
    for every regular expression, there is a finite-state
machine that defines the same language, and vice
versa.

   Why is the theorem important for scanner
generation?

   Theorem is not enough: what do we need for
automatic scanner generation?


Non-deterministic Finite Automata
Each RE corresponds to a deterministic finite automaton (DFA)
•   Recall the recognizer for Register → r (0|1|2| … | 9) (0|1|2| … | 9)*

What about an RE such as ( a | b )* abb ?
    S0 --ε--> S1 --a--> S2 --b--> S3 --b--> S4
    S1 --(a|b)--> S1

This is a little different
•   S0 has a transition on ε
•   S1 has two transitions on a
This is a non-deterministic finite automaton (NFA)

Non-deterministic Finite Automata
   An NFA accepts a string x iff ∃ a path through the transition graph from s0 to a final state & the edge labels spell x
•   Transitions on ε consume no input
•   To "run" the NFA, start in s0 and take all the transitions for each character
    •   Clone the NFA at each non-deterministic choice (guess correctly)
   NFAs are the key to automating the RE→DFA construction
•   We can paste together NFAs with ε-transitions:
        NFA --ε--> NFA   becomes an NFA

Relationship between NFAs and DFAs
DFA is a special case of an NFA
•   DFA has no ε transitions
•   DFA's transition function is single-valued

NFA can be simulated with a DFA (less obvious)
•   Simulate sets of possible states
•   Possible exponential blowup in the state space
•   Still, one state per character in the input stream

Automating Scanner
Construction
To convert a specification into code:
1.  Write down the RE for the input language
2.  Build an NFA
3.  Build the DFA that simulates the NFA
4.  Systematically shrink the DFA
5.  Turn it into code
•   Scanner generators
1.   Lex, Flex, and Jlex work along these lines
2.   Algorithms are well-known and well-understood
3.   Key issue is interface to parser (define all parts of speech)

Automating Scanner Construction
RE → NFA (Thompson's construction)
•    Build an NFA for each term
•    Combine them with ε-moves

NFA → DFA (subset construction)
•    Build the simulation

DFA → Minimal DFA
•    Hopcroft's algorithm

DFA → RE
•    All pairs, all paths problem

The Cycle of Constructions
    RE → NFA → DFA → minimal DFA, and DFA → RE

Regular Expressions to NFA
(1)
    For each kind of RE, define an NFA - essentially combine REs
       Notation: NFA for RE M [figure: a machine labeled M]

• For ε:         start --ε--> accept
• For input a:   start --a--> accept

RE → NFA using Thompson's Construction
•   NFA pattern for each symbol & each operator
•   Join them with ε moves in precedence order

NFA for a:
    S0 --a--> S1
NFA for ab:
    S0 --a--> S1 --ε--> S3 --b--> S4
NFA for a | b:
    S0 --ε--> S1 --a--> S2 --ε--> S5
    S0 --ε--> S3 --b--> S4 --ε--> S5
NFA for a*:
    S0 --ε--> S1 --a--> S3 --ε--> S4
    S3 --ε--> S1   (repeat)
    S0 --ε--> S4   (skip)

Ken Thompson, CACM, 1968
Example of RE -> NFA
conversion
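   Consider the regular expression
(1 | 0)*1
   The NFA is (start state A, accepting state J):
    A --ε--> B      A --ε--> H
    B --ε--> C      B --ε--> D
    C --1--> E      D --0--> F
    E --ε--> G      F --ε--> G
    G --ε--> A      G --ε--> H
    H --ε--> I      I --1--> J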

NFA to DFA. The Trick
   Simulate the NFA
   Each state of DFA
= a non-empty subset of states of the NFA
   Start state
= the set of NFA states reachable through ε-moves
from NFA start state
   Add a transition S →a S’ to DFA iff
•   S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well

NFA to DFA. Remark
   An NFA may be in many states at any
time
   How many different states ?
   If there are N states, the NFA must be
in some subset of those N states
   How many subsets are there?
   2^N - 1 = finitely many

NFA -> DFA Example
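NFA: the one for (1 | 0)*1 from the RE -> NFA conversion example (states A to J, accepting state J)

Resulting DFA (start state ABCDHI; accepting state EJGHIABCD, the only one containing J):
    ABCDHI     --0--> FGHIABCD      ABCDHI     --1--> EJGHIABCD
    FGHIABCD   --0--> FGHIABCD      FGHIABCD   --1--> EJGHIABCD
    EJGHIABCD  --0--> FGHIABCD      EJGHIABCD  --1--> EJGHIABCD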

NFA to DFA: the practice
    NFA -> DFA conversion is at the heart
of tools such as flex
    But, DFAs can be huge
    In practice, flex-like tools trade off
speed for space in the choice of NFA
and DFA representations


Putting it all together
Lexical specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA
Example: a scanner for a very
simple language

The language of assignment statements:
   left-hand side of assignment is an identifier:
   a letter followed by zero or more letters or digits
   followed by a =

   right-hand side is one of the following:

   ID + ID
   ID * ID
   ID == ID


Step 1: Define tokens
   The language has five tokens,
   they can be defined by five regular expressions:
Token         Regular Expression
ASSIGN        "="
ID            letter (letter | digit)*
PLUS          "+"
TIMES         "*"
EQUALS        "=" "="

Step 2: Convert REs to NFAs
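ASSIGN:   start --"="--> accept

ID:       start --letter--> accept, with a loop on letter | digit at the accepting state

PLUS:     start --"+"--> accept

TIMES:    start --"*"--> accept

EQUALS:   start --"="--> intermediate --"="--> accept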
Step 3: Convert NFAs to
DFAs
   Subset construction algorithm (also known as the
powerset construction, due to Rabin and Scott)
   will learn soon


Step 4: Combining per-token
DFAs
   Goal of a scanner:
   find the longest prefix of the current input that
corresponds to a token.

   This has two consequences:
   Examine if the next input character can “extend” the
current token. If yes, keep building a larger token.
   a real scanner cannot get stuck:
   What if we get stuck building the larger token?
Solution: return characters back to input.


Operation Notes
   A value (the current token) must be
returned when the regular expression is
matched
   to be able to match input of more than one
token
   Scanner should start up again trying to
match another regular expression after
throwing out whitespace
Extend the DFA
       modify the DFA so that an edge can have
       an associated action to
    "put back one character" or
    "return a token"

       we must combine the DFAs for all of the
tokens into a single DFA, and
       we must write a program for the "combined"
DFA.


Step 4: Example of extending
the DFA
   The DFA that recognizes simple identifiers must be
modified as follows:
    S --letter--> identifier state
    identifier state --(letter | digit)--> identifier state
    identifier state --(any char except letter or digit)--> final state
        action: • put back 1 char  • return ID

   recall that the scanner is called by the parser
(one token is returned per call)
   hence the action "return" puts the scanner back into state S

Implementing the extended
DFA
   The table-driven technique works, with a few
small modifications:
   Include a column for end-of-file
       e.g., to find an identifier when it is the last token in the
input.
   besides 'next state', a table entry includes an optional action
   the loop changes from
       "read a character; update the state variable"
       until the machine gets stuck or the entire input is read,
   to
       "read a character; update the state variable;
       perform the action"
   (eventually, the action will be to return a value, so
   the scanner code will stop).
Step 4: Example: Combined DFA for our language
    S   --"+"--> F3:  return PLUS
    S   --"*"--> F4:  return TIMES
    S   --letter--> ID
    ID  --(letter | digit)--> ID
    ID  --(any char except letter or digit)--> F2:  put back 1 char; return ID
    S   --"="--> TMP
    TMP --"="--> F5:  return EQUALS
    TMP --(any char except "=")--> F1:  put back 1 char; return ASSIGN

Transition Table (part 1)
State | +                                  | *                                  | =
S     | F3, return PLUS                    | F4, return TIMES                   | TMP
ID    | F2, put back 1 char; return ID     | F2, put back 1 char; return ID     | F2, put back 1 char; return ID
TMP   | F1, put back 1 char; return ASSIGN | F1, put back 1 char; return ASSIGN | F5, return EQUALS

Transition Table (part 2)
State | letter                             | digit                              | EOF
S     | ID                                 |                                    |
ID    | ID                                 | ID                                 | F2, put back 1 char; return ID
TMP   | F1, put back 1 char; return ASSIGN | F1, put back 1 char; return ASSIGN | F1, put back 1 char; return ASSIGN

```
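
The table-driven recognizer for register names (`r` followed by one or more digits) from the lecture could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the state names S0, S1, S2, Se and the δ table follow the slides, while the names `DELTA`, `FINAL`, `char_class`, and `accepts` are hypothetical and not part of the lecture.

```python
# Illustrative sketch of a table-driven DFA for: Register -> r digit digit*
# State names follow the lecture; all Python identifiers are assumptions.

DELTA = {
    "S0": {"r": "S1", "digit": "Se", "other": "Se"},
    "S1": {"r": "Se", "digit": "S2", "other": "Se"},
    "S2": {"r": "Se", "digit": "S2", "other": "Se"},
    "Se": {"r": "Se", "digit": "Se", "other": "Se"},  # error state is a trap
}
FINAL = {"S2"}


def char_class(ch):
    """Map an input character to a column of the transition table."""
    if ch == "r":
        return "r"
    if "0" <= ch <= "9":
        return "digit"
    return "other"


def accepts(word):
    """Run the DFA over word; accept iff it ends in a final state."""
    state = "S0"
    for ch in word:
        state = DELTA[state][char_class(ch)]
    return state in FINAL
```

Changing the recognized language only requires changing `DELTA` and `FINAL`; the driver loop stays the same, which is the point the slides make about table-driven code.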
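
The subset construction described in the lecture (each DFA state is a set of NFA states, and the start state is the ε-closure of the NFA start state) can be tried out on the NFA for `(1 | 0)*1`. The sketch below is a minimal illustration, not the lecture's own code: the edge list is reconstructed from the example's state sets, and the names `NFA`, `EPS`, `eps_closure`, `move`, `subset_construction`, and `dfa_accepts` are assumptions made for this example.

```python
# Illustrative subset construction for the NFA of (1 | 0)*1.
# EPS is the label this sketch uses for epsilon-moves (an assumption).
EPS = ""

NFA = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
NFA_START, NFA_FINAL = "A", "J"


def eps_closure(states):
    """All NFA states reachable from states using only epsilon-moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, symbol):
    """NFA states reachable from states by one transition on symbol."""
    return {t for s in states for t in NFA.get((s, symbol), ())}


def subset_construction(alphabet=("0", "1")):
    """Build the DFA whose states are non-empty sets of NFA states."""
    start = eps_closure({NFA_START})
    dfa, worklist = {}, [start]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = eps_closure(move(S, a))
            if T:  # the empty set means the machine is stuck
                dfa[S][a] = T
                worklist.append(T)
    finals = {S for S in dfa if NFA_FINAL in S}
    return start, dfa, finals


def dfa_accepts(word):
    """Run the constructed DFA on word."""
    start, dfa, finals = subset_construction()
    state = start
    for ch in word:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in finals
```

For this NFA the construction yields three DFA states, matching the sets ABCDHI, FGHIABCD, and EJGHIABCD in the lecture's example, far fewer than the 2^N - 1 upper bound.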
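
The combined DFA and its transition table for the assignment-statement language (tokens ID, ASSIGN, PLUS, TIMES, EQUALS) can be sketched as a small table-driven scanner with "put back 1 char" actions. This is a hedged illustration only: the state names S, ID, TMP, F1 to F5 follow the slides, but the `other` character class, the whitespace skipping between tokens, and the names `TABLE`, `char_class`, and `scan` are assumptions added for this example.

```python
import string

# Illustrative table-driven scanner for the lecture's assignment language.
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def char_class(ch):
    """Map a character (or None for end of input) to a table column."""
    if ch is None:
        return "EOF"
    if ch in LETTERS:
        return "letter"
    if ch in DIGITS:
        return "digit"
    if ch in ("+", "*", "="):
        return ch
    return "other"  # assumed extra column, e.g. for whitespace


# (state, column) -> (next state, put back 1 char?, token to return or None)
TABLE = {
    ("S", "+"): ("F3", False, "PLUS"),
    ("S", "*"): ("F4", False, "TIMES"),
    ("S", "="): ("TMP", False, None),
    ("S", "letter"): ("ID", False, None),
    ("ID", "letter"): ("ID", False, None),
    ("ID", "digit"): ("ID", False, None),
    ("TMP", "="): ("F5", False, "EQUALS"),
}
for col in ("+", "*", "=", "EOF", "other"):
    TABLE[("ID", col)] = ("F2", True, "ID")
for col in ("+", "*", "letter", "digit", "EOF", "other"):
    TABLE[("TMP", col)] = ("F1", True, "ASSIGN")


def scan(text):
    """Return a list of (token, lexeme) pairs using longest-prefix matching."""
    tokens, pos = [], 0
    while True:
        # Whitespace is recognized and ignored between tokens.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return tokens
        state, start = "S", pos
        while True:
            ch = text[pos] if pos < len(text) else None
            entry = TABLE.get((state, char_class(ch)))
            if entry is None:  # empty table entry: the machine is stuck
                raise ValueError(f"scanner stuck at position {pos}")
            state, put_back, token = entry
            pos += 1  # read a character (or the end-of-input marker)
            if put_back:
                pos -= 1  # action: put back 1 char
            if token is not None:  # action: return a token, restart in S
                tokens.append((token, text[start:pos]))
                break
```

Each call of the inner loop corresponds to one call of the scanner by the parser: it starts in state S and stops as soon as an action returns a token.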