POSTED ON: 10/1/2012 Public Domain
Regular Expressions
• A finite state machine is a good "visual" aid
  – but it is not very suitable as a specification (its textual description is too clumsy)
• Regular expressions are a suitable specification
  – a more compact way to define a language that can be accepted by an FSM
• Used to give the lexical description of a programming language
  – define each "token" (keywords, identifiers, literals, operators, punctuation, etc.)
  – define white-space, comments, etc.
    • these are not tokens, but must be recognized and ignored

Example: Pascal identifier
• Lexical specification (in English):
  – a letter, followed by zero or more letters or digits
• Lexical specification (as a regular expression):
  – letter . (letter | digit)*

  |     means "or"
  .     means "followed by" (dot may be omitted)
  *     means zero or more instances of
  ( )   are used for grouping

Operands of a regular expression
• Operands are the same as the labels on the edges of an FSM
  – single characters, or
  – the special character ε (the empty string)
• "letter" is a shorthand for
  – a | b | c | ... | z | A | B | C | ... | Z
• "digit" is a shorthand for
  – 0 | 1 | 2 | … | 9
• Sometimes we put the characters in quotes
  – necessary when denoting | . * ( )

Precedence of | . * operators

  Regular Expression Operator   Analogous Arithmetic Operator   Precedence
  |                             plus                            lowest
  .                             times                           middle
  *                             exponentiation                  highest

• Consider regular expressions:
  – letter.letter | digit*
  – letter.(letter | digit)*

TEST YOURSELF
Question 1: Describe (in English) the language defined by each of the following regular expressions:
  – letter (letter* | digit*)
  – (letter | _ ) (letter | digit | _ )*
  – digit* "." digit*
  – digit digit* "." digit digit*

TEST YOURSELF
Question 2: Write a regular expression for each of these languages:
  – The set of all C++ reserved words
    • Examples: if, while, for, class, int, case, char, true, false
  – C++ string literals that begin with " and end with " and don't contain any other " except possibly in the escape sequence \"
    • Example: "The escape sequence \" occurs in this string"
  – C++ comments that begin with /* and end with */ and don't contain any other */ within the string
    • Example: /* This is a comment * still the same comment */

Example: Integer Literals
• An integer literal with an optional sign can be defined in English as:
  – "(nothing or + or -) followed by one or more digits"
• The corresponding regular expression is:
  – (+|-|ε) (digit.digit*)
• A new convenient operator '+'
  – same precedence as '*'
  – digit digit* is the same as digit+, which means "one or more digits"

Language Defined by a Regular Expression
• Recall: language = set of strings
• Language defined by an automaton
  – the set of strings accepted by the automaton
• Language defined by a regular expression
  – the set of strings that match the expression

  Regular Exp.     Corresponding Set of Strings
  ε                {""}
  a                {"a"}
  a.b.c            {"abc"}
  a|b|c            {"a", "b", "c"}
  (a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

Concept of Reg Exp Generating a String
Rewrite the regular expression until only a sequence of letters (a string) is left.

  Replacement Rules
  1) r1 | r2 ––> r1
  2) r1 | r2 ––> r2
  3) r* ––> r r*
  4) r* ––> ε

  Example
  (0|1)* 2 (0|1)*
  (0|1) (0|1)* 2 (0|1)*
  1 (0|1)* 2 (0|1)*
  1 2 (0|1)*
  1 2 (0|1) (0|1)*
  1 2 (0|1)
  120

Non–determinism in Generation
• Different rule applications may yield different final results

  Example 1                  Example 2
  (0|1)* 2 (0|1)*            (0|1)* 2 (0|1)*
  (0|1) (0|1)* 2 (0|1)*      (0|1) (0|1)* 2 (0|1)*
  1 (0|1)* 2 (0|1)*          0 (0|1)* 2 (0|1)*
  1 2 (0|1)*                 0 2 (0|1)*
  1 2 (0|1) (0|1)*           0 2 (0|1) (0|1)*
  1 2 (0|1)                  0 2 (0|1)
  120                        021

Concept of Language Generated by Reg Exp
• The set of all strings generated by a regular expression is the language of the regular expression
• In general, the language may be infinite
• A string in the language generated by a regular expression is often called a "token"

Examples of Languages and Reg Exp
• Σ = { 0, 1, . }
  – (0 | 1)+ "." (0 | 1)* | (0 | 1)* "." (0 | 1)+    binary floating point numbers
  – (0 0)*    even-length all-zero strings
  – 1* (0 1* 0 1*)*    binary strings with an even number of zeros
• Σ = { a, b, c, 0, 1, 2 }
  – (a|b|c)(a|b|c|0|1|2)*    alphanumeric identifiers
  – (0|1|2)+    trinary numbers

Reg Exp Notational Shorthand
• R+        one or more strings of R: R(R*)
• R?        optional R: (R|ε)
• [abcd]    one of the listed characters: (a|b|c|d)
• [a-z]     one character from this range: (a|b|c|d|...|z)
• [^abc]    any one character except the listed chars
• [^a-z]    any one character not from this range

Equivalence of FSM and Regular Expressions
• Theorem:
  – For each finite state machine M, we can construct a regular expression R such that M and R accept the same language.
  – [proof omitted]
• Theorem:
  – For each regular expression R, we can construct a finite state machine M such that R and M accept the same language.
  – [proof outline follows]

Regular Expressions to NFSM (1)
• For each kind of reg exp, define a NFSM
  – Notation: NFSM for reg exp M
    [Figure: box labeled M]
• For ε
  [Figure: NFSM for ε]
• For input a
  [Figure: NFSM with one transition labeled a]

Regular Expressions to NFSM (2)
• For A . B
  [Figure: NFSM built from the NFSMs for A and B]
• For A | B
  [Figure: NFSM built from the NFSMs for A and B]

Regular Expressions to NFSM (3)
• For A*
  [Figure: NFSM built from the NFSM for A]

Example of RegExp -> NFSM conversion
• Consider the regular expression (1|0)*1
• The NFSM is
  [Figure: NFSM with states A through J; surviving edge labels: 1, 0, 1]

Converting NFSM to DFSM
• Simulate the NFSM
• Each state of DFSM
  – is a non-empty subset of states of the NFSM
• Start state of DFSM
  – is the set of NFSM states reachable from the NFSM start state using only ε-moves
• Add a transition S --a--> S' to DFSM iff
  – S' is the set of NFSM states reachable from any state in S after consuming only the input a, considering ε-moves as well

Remarks on converting NFSM to DFSM
• An NFSM may be in many states at any time
• How many different states?
• If there are N states, the NFSM must be in some subset of those N states
• How many subsets are there?
• 2^N = finitely many
• For example, if N = 5 then 2^N = 32 subsets

NFSM -> DFSM Example
  [Figure: the NFSM for (1|0)*1 with states A through J, and the resulting DFSM with states ABCDHI, FGHIABCD and EJGHIABCD; DFSM edges labeled 0 and 1]

TEST YOURSELF
Question 3: First convert each of these regular expressions to a NFSM
  – (a | b | ε) (a | b)
  – (ab | ba)* (aa | bb)
Question 4: Next convert each resulting NFSM to a DFSM
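
The subset construction described under "Converting NFSM to DFSM" can be sketched in Python as follows. This is a minimal sketch, not code from the slides. The NFSM diagram for (1|0)*1 is not recoverable from the text, so the edge table `NFSM` below is an assumption, chosen to be consistent with the three DFSM states listed in the NFSM -> DFSM example (ABCDHI, FGHIABCD, EJGHIABCD). Taking A as the start state and J as the final state is also an assumption, and the names `epsilon_closure`, `move`, `subset_construction` and `dfsm_accepts` are hypothetical helpers introduced only for illustration.

```python
from collections import deque

EPSILON = ""  # label used in this sketch for moves that consume no input

# ASSUMED edge table for the NFSM of (1|0)*1 (reconstructed, see lead-in).
NFSM = {
    ("A", EPSILON): {"B", "H"},
    ("B", EPSILON): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPSILON): {"G"},
    ("F", EPSILON): {"G"},
    ("G", EPSILON): {"A", "H"},
    ("H", EPSILON): {"I"},
    ("I", "1"): {"J"},
}
NFSM_START, NFSM_FINAL, ALPHABET = "A", "J", ("0", "1")  # assumed


def epsilon_closure(states):
    """All NFSM states reachable from `states` using only epsilon-moves."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for nxt in NFSM.get((state, EPSILON), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def move(states, symbol):
    """NFSM states reachable from `states` by consuming exactly one `symbol`."""
    return {nxt for s in states for nxt in NFSM.get((s, symbol), ())}


def subset_construction():
    """Build the DFSM and return (start, transitions, accepting)."""
    start = epsilon_closure({NFSM_START})
    transitions, accepting = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if NFSM_FINAL in current:
            accepting.add(current)
        for symbol in ALPHABET:
            target = epsilon_closure(move(current, symbol))
            if not target:  # DFSM states are non-empty subsets only
                continue
            transitions[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, transitions, accepting


def dfsm_accepts(text):
    """Run the constructed DFSM on `text`; a missing transition rejects."""
    start, transitions, accepting = subset_construction()
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting
```

Under these assumptions, `subset_construction()` starts from the state {A, B, C, D, H, I} and discovers three DFSM states in total, of which only the one containing J is accepting. Consequently `dfsm_accepts("0101")` is true and `dfsm_accepts("10")` is false, which agrees with (1|0)*1 accepting exactly the binary strings that end in 1. Only subsets that are actually reachable are built, so the DFSM here is much smaller than the 2^N bound.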