# Regular expressions by lZKSKZ

```									Regular Expressions

• finite state machine is a good “visual” aid
– but it is not very suitable as a specification
(its textual description is too clumsy)
• regular expressions are a suitable specification
– a more compact way to define a language that can be accepted
by an FSM
• used to give the lexical description of a programming
language
– define each “token” (keywords, identifiers, literals, operators,
punctuation, etc.)
– define white space, comments, etc.
• these are not tokens, but must be recognized and ignored

1
Example: Pascal identifier

• Lexical specification (in English):
– a letter, followed by zero or more letters or digits
• Lexical specification (as a regular expression):
– letter . (letter | digit)*

| means "or"
. means "followed by" (dot may be omitted)
* means zero or more instances of
( ) are used for grouping

2
Operands of a regular expression

• Operands are same as labels on the edges of an FSM
– single characters, or
– the special character ε (the empty string)

• "letter" is a shorthand for
– a | b | c | ... | z | A | B | C | ... | Z
• "digit" is a shorthand for
– 0 | 1 | 2 | ... | 9
• sometimes we put the characters in quotes
– necessary when denoting | . * ( )

3
Precedence of | . * operators.

Regular Expression   Analogous Arithmetic   Precedence
Operator             Operator
|                    plus                   lowest
.                    times                  middle
*                    exponentiation         highest

• Consider regular expressions:
– letter.letter | digit*     is parsed as (letter.letter) | (digit*)
– letter.(letter | digit)*   the parentheses make | apply before * and .
4
TEST YOURSELF

Question 1: Describe (in English) the language
defined by each of the following regular
expressions:

– letter (letter* | digit*)

– (letter | _ ) (letter | digit | _ )*

– digit* "." digit*

– digit digit* "." digit digit*
5
TEST YOURSELF

Question 2: Write a regular expression for each
of these languages:

– The set of all C++ reserved words
• Examples: if, while, for, class, int, case, char, true, false
– C++ string literals that begin with " and end with "
and don’t contain any other " except possibly in the
escape sequence \"
• Example: "The escape sequence \" occurs in this string"
– C++ comments that begin with /* and end with */
and don’t contain any other */ within the string
• Example: /* This is a comment * still the same comment */
6
Example: Integer Literals

• An integer literal with an optional sign can be
defined in English as:
– “(nothing or + or -) followed by one or more digits”
• The corresponding regular expression is:
– (+|-|) (digit.digit*)
• A new convenient operator ‘+’
– same precedence as ‘*’
– digit digit*  is the same as
– digit + which means "one or more digits"

7
Language Defined by a Regular
Expression
• Recall: language = set of strings
• Language defined by an automaton
– the set of strings accepted by the automaton
• Language defined by a regular expression
– the set of strings that match the expression

Regular Exp.          Corresponding Set of Strings
ε                     {""}
a                     {"a"}
a.b.c                 {"abc"}
a|b|c                 {"a", "b", "c"}
(a | b | c)*          {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
8
Concept of Reg Exp Generating a
String

Rewrite the regular expression until only a
sequence of letters (a string) is left

Replacement Rules
1) r1 | r2 ––> r1
2) r1 | r2 ––> r2
3) r* ––> r r*
4) r* ––> ε

Example
(0|1)* 2 (0|1)*
(0|1) (0|1)* 2 (0|1)*
1 (0|1)* 2 (0|1)*
1 2 (0|1)*
1 2 (0|1) (0|1)*
1 2 (0|1)
120

9
Non–determinism in Generation

• Different rule applications may yield different
final results
Example 1                 Example 2
(0|1)* 2 (0|1)*           (0|1)* 2 (0|1)*
(0|1) (0|1)* 2 (0|1)*     (0|1) (0|1)* 2 (0|1)*
1 (0|1)* 2 (0|1)*         0 (0|1)* 2 (0|1)*
1 2 (0|1)*                0 2 (0|1)*
1 2 (0|1) (0|1)*          0 2 (0|1) (0|1)*
1 2 (0|1)                 0 2 (0|1)
120                       021

10
Concept of Language Generated
by Reg Exp

• Set of all strings generated by a regular
expression is the language of the regular
expression
• In general, language may be infinite
• A string in the language of a regular expression
is often called a “token”

11
Examples of Languages and Reg
Exp

• Σ = { 0, 1, . }
– (0 | 1)+ "." (0 | 1)* | (0 | 1)* "." (0 | 1)+
denotes binary floating point numbers
– (0 0)*  denotes even-length all-zero strings
– 1* (0 1* 0 1*)*  denotes binary strings with an even number
of zeros

• Σ = { a, b, c, 0, 1, 2 }
– (a|b|c)(a|b|c|0|1|2)*  denotes alphanumeric identifiers
– (0|1|2)+  denotes trinary numbers
12
Reg Exp Notational Shorthand

• R+ one or more strings of R: R(R*)
• R? optional R: (R|ε)
• [abcd] one of listed characters: (a|b|c|d)
• [a-z] one character from this range:
(a|b|c|d...|z)
• [^abc] anything but one of the listed chars
• [^a-z] any one character not from this range

13
Equivalence of FSM and Regular
Expressions

• Theorem:
– For each finite state machine M, we can construct a
regular expression R such that M and R accept the
same language.
– [proof omitted]
• Theorem:
– For each regular expression R, we can construct a
finite state machine M such that R and M accept
the same language.
– [proof outline follows]

14
Regular Expressions to NFSM (1)

• For each kind of reg exp, define a NFSM
– Notation: NFSM for reg exp M
[diagram: box labeled M]
• For ε
[diagram: a single ε-edge]
• For input a
[diagram: a single edge labeled a]

15
Regular Expressions to NFSM (2)

• For A . B
[diagram: NFSM for A joined to NFSM for B by an ε-edge]

• For A | B
[diagram: NFSMs for A and B side by side, joined by ε-edges]

16
Regular Expressions to NFSM (3)

• For A*
[diagram: NFSM for A with added ε-edges]

17
Example of RegExp -> NFSM
conversion

• Consider the regular expression
(1|0)*1
• The NFSM is
[diagram: NFSM with states A through J; C ––1––> E, D ––0––> F, I ––1––> J; the other edges are ε-moves]

18
Converting NFSM to DFSM

• Simulate the NFSM
• Each state of DFSM
– is a non-empty subset of states of the NFSM
• Start state of DFSM
– is the set of NFSM states reachable from the NFSM
start state using only ε-moves
• Add a transition S ––a––> S’ to DFSM iff
– S’ is the set of NFSM states reachable from any
state in S after consuming only the input a,
considering ε-moves as well

19
Remarks on converting NFSM to
DFSM

• An NFSM may be in many states at any time
• How many different states?
• If there are N states, the NFSM must be in
some subset of those N states
• How many subsets are there?
• 2^N = finitely many
• For example, if N = 5 then 2^N = 32 subsets

20
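NFSM -> DFSM Example

[diagram: the NFSM for (1|0)*1 from the previous example, states A through J]

DFSM state     on 0         on 1
ABCDHI         FGHIABCD     EJGHIABCD
FGHIABCD       FGHIABCD     EJGHIABCD
EJGHIABCD      FGHIABCD     EJGHIABCD

Start state: ABCDHI. Accepting state: EJGHIABCD (it contains J).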
21
TEST YOURSELF

Question 3: First convert each of these regular
expressions to a NFSM
– (a | b | ε) (a | b)

– (ab | ba)* (aa | bb)

Question 4: Next convert each resulting NFSM
to a DFSM

22

```
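The slide notation maps closely onto the syntax of practical regex libraries. The sketch below writes the Pascal identifier and signed integer examples with Python's standard `re` module. The names `LETTER`, `DIGIT`, `PASCAL_IDENT`, `INT_LITERAL`, `INT_LITERAL_SHORT`, and `matches` are illustrative choices, not part of the slides, and `fullmatch` is used so that the whole string must belong to the language.

```python
import re

# "letter" and "digit" shorthands from the slides, as character classes.
LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# letter . (letter | digit)*
PASCAL_IDENT = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

# (+|-|ε) (digit . digit*): the literal + must be escaped, and the
# empty alternative after the last | plays the role of ε.
INT_LITERAL = re.compile(rf"(\+|-|){DIGIT}{DIGIT}*")

# The same language written with the shorthand operators ? and +.
INT_LITERAL_SHORT = re.compile(rf"[+-]?{DIGIT}+")


def matches(pattern, text):
    """Return True if the entire text is in the language of the pattern."""
    return pattern.fullmatch(text) is not None
```

For instance, `matches(PASCAL_IDENT, "x1")` is true while `matches(PASCAL_IDENT, "1x")` is false, and the two integer patterns agree on every input because `R?` abbreviates `(R|ε)` and `R+` abbreviates `R(R*)`.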
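The NFSM to DFSM conversion described in the slides can be sketched in a few lines of Python. The slide diagram for `(1|0)*1` did not survive extraction, so the edge list below is an assumption: it keeps the state names A through J and is chosen so that the resulting DFSM states are the ones named in the slides (ABCDHI, FGHIABCD, EJGHIABCD). The names `NFSM`, `eps_closure`, `subset_construction`, and `accepts` are hypothetical helpers for this sketch.

```python
from collections import deque

EPS = ""  # label used here for ε-moves

# Assumed edges of the NFSM for (1|0)*1, keyed by (state, label).
NFSM = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
START, FINAL, ALPHABET = "A", "J", "01"


def eps_closure(states):
    """All NFSM states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFSM.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction():
    """Build the DFSM: each state is a non-empty set of NFSM states."""
    start = eps_closure({START})
    table, todo = {}, deque([start])
    while todo:
        S = todo.popleft()
        if S in table:
            continue
        table[S] = {}
        for a in ALPHABET:
            moved = set()
            for s in S:
                moved |= NFSM.get((s, a), set())
            if moved:  # only non-empty subsets become DFSM states
                T = eps_closure(moved)
                table[S][a] = T
                todo.append(T)
    return start, table


def accepts(text):
    """Run the DFSM on text; accept if the final subset contains J."""
    S, table = subset_construction()
    for ch in text:
        S = table[S].get(ch)
        if S is None:
            return False
    return FINAL in S
```

Under these assumed edges, only 3 of the 2^10 = 1024 possible subsets are reachable, which is why the DFSM in the example is so small: the construction only ever creates subsets it can actually reach from the start state.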