POSTED ON: 2/13/2012
CS 1622 Lecture 3: Lexical Analysis

Equivalence of DFA and NFA
• Theorem: For every non-deterministic finite-state machine M, there exists a deterministic machine M' such that M and M' accept the same language.
• Why is the theorem important for scanner generation?
• The theorem alone is not enough: what else do we need for automatic scanner generation?

How to Implement an FSM
• A table-driven approach.
• Table: one row for each state in the machine, and one column for each possible character.
• Table[j][k] gives the state to go to from state j on character k. An empty entry corresponds to the machine getting stuck.

The table-driven program for a DFA
    state = S                          // S is the start state
    repeat {
        k = next character from the input
        if k == EOF then               // end of input
            if state is a final state then accept else reject
        state = T[state, k]
        if state == empty then reject  // got stuck
    }

Generating a scanner
    Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA

Regular Expressions
• FAs are not a good way to specify tokens: diagrams are hard to write down.
• Regular expressions are another specification technique:
    • a compact way to define a language that can be accepted by an automaton
    • used as the input to a scanner generator
• Define each token, and also define white-space, comments, etc. These do not correspond to tokens, but must be recognized and ignored.

Example: Simple identifier
• English: a letter, followed by zero or more letters or digits.
• RE: letter . (letter | digit)*
• Operators:
    |   means "or"
    .   means "followed by" (usually just use position)
    *   means zero or more instances
    ()  are used for grouping

Operands of a regular expression
• Operands are the same as the labels on the edges of an FSM: single characters, or the special character ε (the empty string).
• "letter" is a shorthand for a | b | c | ... | z | A | ...
| Z
• "digit" is a shorthand for 0 | 1 | … | 9
• Sometimes we put the characters in quotes; this is necessary when denoting the characters | . *

Precedence of the | . * operators
    Regular Expression Operator    Analogous Arithmetic Operator    Precedence
    |                              plus                             lowest
    .                              times                            middle
    *                              exponentiation                   highest
Consider the regular expressions:
    letter.letter | digit*
    letter.(letter | digit)*

Examples
Describe (in English) the language defined by each of the following regular expressions:
    letter (letter | digit*)
    digit digit* "." digit digit*

Example: Integer Literals
• An integer literal with an optional sign can be defined in English as: "(nothing or + or -) followed by one or more digits"
• The corresponding regular expression is: (+|-|epsilon).(digit.digit*)
• A new convenient operator '+': digit.digit* is the same as digit+, which means "one or more digits"

Language Defined by a Regular Expression
• Recall: language = set of strings
• Language defined by an automaton: the set of strings accepted by the automaton
• Language defined by an RE: the set of strings that match the expression

Regular Exp.
Corresponding Set of Strings
    epsilon          {""}
    a                {"a"}
    a.b.c            {"abc"}
    a|b|c            {"a", "b", "c"}
    (a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

REs describe regular languages
• Patterns form a regular language (note: any finite language is regular)
• Regular Expression (RE) (over alphabet Σ):
    • ε is an RE denoting the set {ε}
    • If a is in Σ, then a is an RE denoting {a}
    • If x and y are REs denoting L(x) and L(y), then
        x is an RE denoting L(x); y is an RE denoting L(y)
        x | y is an RE denoting L(x) ∪ L(y)
        xy is an RE denoting L(x)L(y)
        x* is an RE denoting L(x)*
• REs can be combined to form other REs

Example
Consider the problem of recognizing register names:
    Register → r (0|1|2| … | 9) (0|1|2| … | 9)*
• Allows registers of arbitrary number
• Requires at least one digit
• The RE corresponds to a recognizer (or DFA)
Recognizer for Register (with implicit transitions on other inputs to an error state, Se):
    S0 --r--> S1 --(0|1|2| … 9)--> S2
    S2 --(0|1|2| … 9)--> S2

Example (continued)
• Start in state S0 and take transitions on each input character
• The DFA accepts a word x iff x leaves it in a final state (S2)
Recognizer for Register (S2 is the accepting state):
    S0 --r--> S1 --(0|1|2| … 9)--> S2
    S2 --(0|1|2| … 9)--> S2
So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se

Example
    char ← next character;
    state ← s0;
    call action(state, char);
    while (char ≠ eof)
        state ← δ(state, char);
        call action(state, char);
        char ← next character;

    if Τ(state) = final then
        report acceptance;
    else
        report failure;

    action(state, char)
        switch (Τ(state))
            case start:
                word ← char;
                break;
            case normal:
                word ← word + char;
                break;
            case final:
                word ← word + char;
                break;
            case error:
                report error;
                break;
        end;

    Τ     action
    S0    start
    S1    normal
    S2    final
    Se    error

    δ     r     0,1,2,3,4,5,6,7,8,9    other
    S0    S1    Se                     Se
    S1    Se    S2                     Se
    S2    Se    S2                     Se
    Se    Se    Se                     Se

• The recognizer translates directly into code
• To change DFAs, just change the tables

The
Role of Regular Expressions
• Theorem: for every regular expression, there is a finite-state machine that defines the same language, and vice versa.
• Why is the theorem important for scanner generation?
• The theorem alone is not enough: what else do we need for automatic scanner generation?

Non-deterministic Finite Automata
• Each RE corresponds to a deterministic finite automaton (DFA)
• Recall the recognizer for Register → r (0|1|2| … | 9) (0|1|2| … | 9)*
• What about an RE such as ( a | b )* abb ?
    S0 --ε--> S1 --a--> S2 --b--> S3 --b--> S4
    S1 --(a | b)--> S1
• This is a little different:
    • S0 has a transition on ε
    • S1 has two transitions on a
• This is a non-deterministic finite automaton (NFA)

Non-deterministic Finite Automata
• An NFA accepts a string x iff ∃ a path through the transition graph from s0 to a final state such that the edge labels spell x
    • Transitions on ε consume no input
• To "run" the NFA, start in s0 and take all the transitions for each character
    • Clone the NFA at each non-deterministic choice (guess correctly)
• NFAs are the key to automating the RE→DFA construction
    • We can paste together NFAs with ε-transitions:
      NFA --ε--> NFA becomes an NFA

Relationship between NFAs and DFAs
• A DFA is a special case of an NFA
    • A DFA has no ε transitions
    • A DFA's transition function is single-valued
• An NFA can be simulated with a DFA (less obvious)
    • Simulate sets of possible states
    • Possible exponential blowup in the state space
    • Still, one state transition per character in the input stream

Automating Scanner Construction
To convert a specification into code:
1. Write down the RE for the input language
2. Build an NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
• Scanner generators
1. Lex, Flex, and JLex work along these lines
2. Algorithms are well-known and well-understood
3.
Key issue is interface to parser (define all parts of speech)

Automating Scanner Construction
• RE → NFA (Thompson's construction)
    • Build an NFA for each term
    • Combine them with ε-moves
• NFA → DFA (subset construction)
    • Build the simulation
• DFA → Minimal DFA
    • Hopcroft's algorithm
• DFA → RE
    • All pairs, all paths problem
The Cycle of Constructions: RE → NFA → DFA → minimal DFA, and DFA → RE closes the cycle

Regular Expressions to NFA (1)
• For each kind of RE, define an NFA (essentially, combine REs)
• Notation: the NFA for RE M is drawn as a box labeled M
• For ε: start --ε--> accept
• For input a: start --a--> accept

RE → NFA using Thompson's Construction
• NFA pattern for each symbol and each operator
• Join them with ε moves in precedence order
NFA for a:
    S0 --a--> S1
NFA for ab:
    S0 --a--> S1 --ε--> S3 --b--> S4
NFA for a | b:
    S0 --ε--> S1 --a--> S2 --ε--> S5
    S0 --ε--> S3 --b--> S4 --ε--> S5
NFA for a*:
    S0 --ε--> S1 --a--> S3 --ε--> S4
    S0 --ε--> S4
    S3 --ε--> S1
Ken Thompson, CACM, 1968

Example of RE -> NFA conversion
• Consider the regular expression (1 | 0)*1
• The NFA is:
    A --ε--> B        A --ε--> H
    B --ε--> C        B --ε--> D
    C --1--> E        D --0--> F
    E --ε--> G        F --ε--> G
    G --ε--> A        G --ε--> H
    H --ε--> I        I --1--> J   (J is the accepting state)

NFA to DFA. The Trick
• Simulate the NFA
• Each state of the DFA = a non-empty subset of states of the NFA
• Start state = the set of NFA states reachable through ε-moves from the NFA start state
• Add a transition S →a S' to the DFA iff S' is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well

NFA to DFA. Remark
• An NFA may be in many states at any time
• How many different states? If there are N states, the NFA must be in some subset of those N states
• How many subsets are there?
2^N - 1 = finitely many

NFA -> DFA Example
NFA: the NFA for (1 | 0)*1, with states A through J.
DFA (start state ABCDHI; accepting state EJGHIABCD, since it contains J):
    ABCDHI     --0--> FGHIABCD
    ABCDHI     --1--> EJGHIABCD
    FGHIABCD   --0--> FGHIABCD
    FGHIABCD   --1--> EJGHIABCD
    EJGHIABCD  --0--> FGHIABCD
    EJGHIABCD  --1--> EJGHIABCD

NFA to DFA: the practice
• NFA -> DFA conversion is at the heart of tools such as flex
• But DFAs can be huge
• In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

Putting it all together
    Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA

Example: a scanner for a very simple language
The language of assignment statements:
• the left-hand side of the assignment is an identifier: a letter followed by zero or more letters or digits
• followed by a =
• the right-hand side is one of the following:
    ID + ID
    ID * ID
    ID == ID

Step 1: Define tokens
The language has five tokens; they can be defined by five regular expressions:
    Token    Regular Expression

Step 2: Convert REs to NFAs
    ASSIGN: "="
    ID:     letter, then a loop on letter | digit
    PLUS:   "+"
    TIMES:  "*"
    EQUALS: "=" "="

Step 3: Convert NFAs to DFAs
• Subset construction algorithm (also known as the powerset construction); we will learn it soon

Step 4: Combining per-token DFAs
• Goal of a scanner: find the longest prefix of the current input that corresponds to a token.
• This has two consequences:
    • lookahead: examine whether the next input character can "extend" the current token. If yes, keep building a larger token.
    • a real scanner cannot get stuck: what if we get stuck building the larger token? Solution: return characters back to the input.
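The longest-prefix rule from Step 4 can be sketched in Python roughly as follows. This is only an illustrative sketch, not the lecture's implementation: the names `DELTA`, `ACCEPTING`, `char_class`, `next_token`, and `tokenize` are assumptions made up for this example, and the small DFA covers just the five tokens of the example language. The scanner keeps running the DFA while it can, remembers the last accepting position, and when it gets stuck it returns the extra characters to the input by resuming from that position.

```python
# Hypothetical combined DFA for the tokens ID, ASSIGN, EQUALS, PLUS, TIMES.
# DELTA[state][character class] -> next state; a missing entry means "stuck".
DELTA = {
    "start": {"letter": "id", "=": "assign", "+": "plus", "*": "times"},
    "id": {"letter": "id", "digit": "id"},
    "assign": {"=": "equals"},
}

# Accepting states and the token each one recognizes.
ACCEPTING = {
    "id": "ID",
    "assign": "ASSIGN",
    "equals": "EQUALS",
    "plus": "PLUS",
    "times": "TIMES",
}


def char_class(c):
    """Map a character to the column label used in DELTA."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return c


def next_token(text, pos):
    """Return (token, lexeme, new_pos) for the longest token starting at pos."""
    state = "start"
    last_accept = None  # (token, end position) of the longest match so far
    i = pos
    while i < len(text):
        state = DELTA.get(state, {}).get(char_class(text[i]))
        if state is None:
            break  # the DFA is stuck: stop looking ahead
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        raise ValueError(f"no token starts at position {pos}")
    token, end = last_accept
    # Resuming at `end` returns any characters read past the match to the input.
    return token, text[pos:end], end


def tokenize(text):
    """Split text into (token, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        token, lexeme, pos = next_token(text, pos)
        tokens.append((token, lexeme))
    return tokens
```

For instance, `tokenize("a===b")` yields an EQUALS token followed by an ASSIGN token, because the scanner always prefers the longer match `==` before falling back to `=`.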
Operation Notes
• A value (the current token) must be returned when the regular expression is matched
• To be able to match input of more than one token, the scanner should start up again, trying to match another regular expression after throwing out whitespace

Extend the DFA
• Modify the DFA so that an edge can have an associated action to "put back one character" or "return token XXX".
• We must combine the DFAs for all of the tokens into a single DFA.
• We must write a program for the "combined" DFA.

Step 4: Example of extending the DFA
The DFA that recognizes simple identifiers must be modified as follows:
    S --letter--> identifier state, which loops on letter | digit
    identifier state --any char except letter or digit--> final state
        action: • put back 1 char
                • return ID
• Recall that the scanner is called by the parser (one token is returned per call); hence the action "return" puts the scanner back into state S.

Implementing the extended DFA
The table-driven technique works, with a few small modifications:
• Include a column for end-of-file, e.g., to find an identifier when it is the last token in the input.
• Besides 'next state', a table entry includes an (optional) action: put back n characters, return token.
• Instead of repeating "read a character; update the state variable" until the machine gets stuck or the entire input is read, repeat "read a character; update the state variable; perform the action" (eventually, the action will be to return a value, so the scanner code will stop).

Step 4: Example: Combined DFA for our language
    S   --"+"--> F3                                 action: return PLUS
    S   --"*"--> F4                                 action: return TIMES
    S   --letter--> ID
    ID  --letter | digit--> ID
    ID  --any char except letter or digit--> F2     action: put back 1 char; return ID
    S   --"="--> TMP
    TMP --"="--> F5                                 action: return EQUALS
    TMP --any char except "="--> F1                 action: put back 1 char; return ASSIGN

Transition Table (part 1)
    State   +                                     *                                     =
    S       F3, return PLUS                       F4, return TIMES                      TMP
    ID      F2, put back 1 char; return ID        F2, put back 1 char; return ID        F2, put back 1 char; return ID
    TMP     F1, put back 1 char; return ASSIGN    F1, put back 1 char; return ASSIGN    F5, return EQUALS

Transition Table (part 2)
    State   letter                                digit                                 EOF
    S       ID                                    (empty)                               (empty)
    ID      ID                                    ID                                    F2, put back 1 char; return ID
    TMP     F1, put back 1 char; return ASSIGN    F1, put back 1 char; return ASSIGN    F1, put back 1 char; return ASSIGN
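The transition table above can be turned into a table-driven scanner along the following lines. This is a hedged Python sketch under stated assumptions rather than the course's reference code: the class name `Scanner`, the helpers `column` and `scan_all`, and the tuple layout `(next_state, put_back, token)` are all invented for illustration. The extra "other" column is also an assumption; it stands for the diagram's "any char except ..." edges, which the two-part table does not list as a separate column, and whitespace skipping is done outside the table.

```python
EOF = "EOF"
KNOWN = {"+", "*", "="}


def column(c):
    """Map an input character (or EOF) to a table column."""
    if c == EOF:
        return EOF
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return c if c in KNOWN else "other"  # "other" is an assumed catch-all


# Each entry is (next_state, chars_to_put_back, token_to_return_or_None).
STAY_ID = ("ID", 0, None)
ID_DONE = ("F2", 1, "ID")
ASSIGN_DONE = ("F1", 1, "ASSIGN")

TABLE = {
    "S": {"+": ("F3", 0, "PLUS"), "*": ("F4", 0, "TIMES"),
          "=": ("TMP", 0, None), "letter": STAY_ID},
    "ID": {"+": ID_DONE, "*": ID_DONE, "=": ID_DONE, "other": ID_DONE,
           "letter": STAY_ID, "digit": STAY_ID, EOF: ID_DONE},
    "TMP": {"+": ASSIGN_DONE, "*": ASSIGN_DONE, "=": ("F5", 0, "EQUALS"),
            "other": ASSIGN_DONE, "letter": ASSIGN_DONE,
            "digit": ASSIGN_DONE, EOF: ASSIGN_DONE},
}


class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self):
        """Return the next (token, lexeme) pair, or None at end of input."""
        # Whitespace is thrown out before matching starts.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return None
        state, start = "S", self.pos  # every call restarts in state S
        while True:
            # Read a character (EOF once the input is exhausted).
            c = self.text[self.pos] if self.pos < len(self.text) else EOF
            self.pos += 1
            entry = TABLE[state].get(column(c))
            if entry is None:  # empty table entry: the machine is stuck
                raise ValueError(f"unexpected {c!r} in state {state}")
            # Update the state variable, then perform the action.
            state, put_back, token = entry
            self.pos -= put_back
            if token is not None:
                return token, self.text[start:self.pos]


def scan_all(text):
    """Call the scanner repeatedly, as a parser would, and collect the tokens."""
    scanner, tokens = Scanner(text), []
    while (tok := scanner.next_token()) is not None:
        tokens.append(tok)
    return tokens
```

Changing the language then mostly means changing `TABLE`; the driver loop in `next_token` stays the same, which mirrors the earlier remark that to change DFAs one just changes the tables.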