Regular expressions by lZKSKZ


									Regular Expressions

• A finite state machine (FSM) is a good “visual” aid
   – but it is not very suitable as a specification
     (its textual description is too clumsy)
• Regular expressions are a suitable specification
   – a more compact way to define a language that can be accepted
     by an FSM
• They are used to give the lexical description of a programming
  language
   – define each “token” (keywords, identifiers, literals, operators,
     punctuation, etc.)
   – define white-space, comments, etc.
       • these are not tokens, but must be recognized and ignored

Example: Pascal identifier

• Lexical specification (in English):
  – a letter, followed by zero or more letters or digits
• Lexical specification (as a regular expression):
  – letter . (letter | digit)*


    |   means "or"
    .   means "followed by" (dot may be omitted)
    *   means "zero or more instances of"
    ( ) are used for grouping

Operands of a regular expression

• Operands are the same as the labels on the edges of an FSM
   – single characters, or
   – the special character ε (the empty string)


• "letter" is a shorthand for
   – a | b | c | ... | z | A | B | C | ... | Z
• "digit" is a shorthand for
   – 0 | 1 | 2 | ... | 9
• sometimes we put the characters in quotes
   – necessary when denoting the characters | . * ( ) themselves


Precedence of | . * operators.

   Regular Expression    Analogous Arithmetic    Precedence
   Operator              Operator
        |                plus                    lowest
        .                times                   middle
        *                exponentiation          highest

• Consider regular expressions:
  – letter.letter | digit*
      is grouped as (letter.letter) | (digit*)
  – letter.(letter | digit)*
      the parentheses make * apply to the whole alternation
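
The difference between the two expressions can be checked with a small sketch. It assumes Python's `re` module as a stand-in for the slide notation: concatenation is written by juxtaposition (the dot is omitted), and the hypothetical names `LETTER` and `DIGIT` are ASCII character classes standing in for "letter" and "digit".

```python
import re

LETTER = "[A-Za-z]"   # assumption: ASCII letters only
DIGIT = "[0-9]"

# letter.letter | digit*    is read as    (letter.letter) | (digit*)
first = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")
# letter.(letter | digit)*  keeps the explicit parentheses
second = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def matches(pattern, s):
    """True if the whole string s is in the language of the pattern."""
    return pattern.fullmatch(s) is not None

for s in ["ab", "123", "a1", ""]:
    print(repr(s), matches(first, s), matches(second, s))
```

Only the second expression describes identifiers: the first accepts either exactly two letters or a (possibly empty) run of digits.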
TEST YOURSELF

Question 1: Describe (in English) the language
 defined by each of the following regular
 expressions:

  – letter (letter* | digit*)

  – (letter | _ ) (letter | digit | _ )*

  – digit* "." digit*

  – digit digit* "." digit digit*
TEST YOURSELF

Question 2: Write a regular expression for each
 of these languages:

  – The set of all C++ reserved words
     • Examples: if, while, for, class, int, case, char, true, false
  – C++ string literals that begin with " and end with "
    and don't contain any other " except possibly in the
    escape sequence \"
     • Example: "The escape sequence \" occurs in this string"
  – C++ comments that begin with /* and end with */
    and don’t contain any other */ within the string
     • Example: /* This is a comment * still the same comment */
Example: Integer Literals

• An integer literal with an optional sign can be
  defined in English as:
  – “(nothing or + or -) followed by one or more digits”
• The corresponding regular expression is:
  – (+|-|) (digit.digit*)
• A new convenient operator ‘+’
  – same precedence as ‘*’
  – digit digit*  is the same as
  – digit + which means "one or more digits"

Language Defined by a Regular Expression
• Recall: language = set of strings
• Language defined by an automaton
    – the set of strings accepted by the automaton
• Language defined by a regular expression
    – the set of strings that match the expression

Regular Exp.          Corresponding Set of Strings
                     {""}
a                     {"a"}
a.b.c                 {"abc"}
a|b|c                 {"a", "b", "c"}
(a | b | c)*          {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
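
The table can be reproduced mechanically for short strings. The sketch below is an illustration only: it assumes Python's `re` syntax (concatenation by juxtaposition) and a hypothetical helper `language_up_to` that lists the matching strings up to a length bound, since the full language of (a | b | c)* is infinite.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """List every string over `alphabet` of length <= max_len that matches `pattern`."""
    compiled = re.compile(pattern)
    result = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if compiled.fullmatch(s):
                result.append(s)
    return result

print(language_up_to("abc", "abc", 3))        # a.b.c
print(language_up_to("a|b|c", "abc", 3))      # a|b|c
print(language_up_to("(a|b|c)*", "abc", 2))   # (a|b|c)*, cut off at length 2
```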
Concept of Reg Exp Generating a String

 Rewrite the regular expression until only a
        sequence of letters (a string) is left

    Replacement Rules
  1) r1 | r2 ––> r1
  2) r1 | r2 ––> r2
  3) r* ––> r r*
  4) r* ––> ε

    Example
  (0|1)* 2 (0|1)*
  (0|1) (0|1)* 2 (0|1)*
  1 (0|1)* 2 (0|1)*
  1 2 (0|1)*
  1 2 (0|1) (0|1)*
  1 2 (0|1)
  120
Non–determinism in Generation

• Different rule applications may yield different
  final results
        Example 1                 Example 2
       (0|1)* 2 (0|1)*           (0|1)* 2 (0|1)*
       (0|1) (0|1)* 2 (0|1)*     (0|1) (0|1)* 2 (0|1)*
       1 (0|1)* 2 (0|1)*         0 (0|1)* 2 (0|1)*
       1 2 (0|1)*                0 2 (0|1)*
       1 2 (0|1) (0|1)*          0 2 (0|1) (0|1)*
       1 2 (0|1)                 0 2 (0|1)
       120                       021
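
Non-deterministic generation can be imitated with random choices. This is a minimal sketch, not a definitive implementation: the tuple encoding of expressions (`"sym"`, `"alt"`, `"cat"`, `"star"`) and the 50/50 choice between the two star rules are assumptions made for illustration.

```python
import random

def generate(r, rng):
    """Rewrite expression r into one string by picking rules at random."""
    kind = r[0]
    if kind == "sym":
        return r[1]
    if kind == "alt":                    # rules 1 and 2: keep one alternative
        return generate(rng.choice(r[1:]), rng)
    if kind == "cat":                    # concatenation: generate each part in order
        return "".join(generate(part, rng) for part in r[1:])
    if kind == "star":                   # rule 3 (r r*) repeatedly, then rule 4 (ε)
        out = ""
        while rng.random() < 0.5:
            out += generate(r[1], rng)
        return out
    raise ValueError(f"unknown node kind: {kind}")

BIT = ("alt", ("sym", "0"), ("sym", "1"))
EXPR = ("cat", ("star", BIT), ("sym", "2"), ("star", BIT))   # (0|1)* 2 (0|1)*

rng = random.Random(0)
samples = [generate(EXPR, rng) for _ in range(5)]
print(samples)
```

Different seeds give different strings, but every result contains exactly one 2 surrounded by 0s and 1s.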
Concept of Language Generated by Reg Exp

• The set of all strings generated by a regular
  expression is the language of the regular
  expression
• In general, the language may be infinite
• A string in the language of a regular
  expression is often called a “token”




Examples of Languages and Reg Exp

• Σ = { 0, 1, . }
   – (0 | 1)+ "." (0 | 1)* | (0 | 1)* "." (0 | 1)+
               binary floating point numbers
   – (0 0)*  even-length all-zero strings
   – 1* (0 1* 0 1*)*  binary strings with an even number
     of zeros


• Σ = { a, b, c, 0, 1, 2 }
   – (a|b|c)(a|b|c|0|1|2)*  alphanumeric identifiers
   – (0|1|2)+  ternary (base-3) numbers
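
A claim such as "binary strings with an even number of zeros" can be spot-checked by brute force. The sketch below assumes Python's `re` syntax for the expression and a hypothetical helper `agrees_up_to`; it only gives evidence for strings up to a chosen length, not a proof.

```python
import re
from itertools import product

EVEN_ZEROS = re.compile(r"1*(01*01*)*")   # 1* (0 1* 0 1*)*

def agrees_up_to(max_len):
    """Compare the expression with a direct zero count on all short binary strings."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            expected = s.count("0") % 2 == 0
            if (EVEN_ZEROS.fullmatch(s) is not None) != expected:
                return False
    return True

print(agrees_up_to(8))
```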
Reg Exp Notational Shorthand

• R+ one or more strings of R: R(R*)
• R? optional R: (R|ε)
• [abcd] one of the listed characters: (a|b|c|d)
• [a-z] one character from this range:
  (a|b|c|d|...|z)
• [^abc] any one character except the listed ones
• [^a-z] any one character not from this range
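
Each shorthand can be compared with its expansion on all short strings. This sketch assumes Python's `re` module, where an empty alternative such as `(a|)` plays the role of (a|ε); `same_language` is a hypothetical helper, and the complement `[^abc]` is checked relative to the small alphabet passed in.

```python
import re
from itertools import product

def same_language(p1, p2, alphabet, max_len=4):
    """True if p1 and p2 agree on every string over `alphabet` up to max_len."""
    a, b = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (a.fullmatch(s) is None) != (b.fullmatch(s) is None):
                return False
    return True

print(same_language("a+", "aa*", "ab"))                # R+  vs  R(R*)
print(same_language("a?", "(a|)", "ab"))               # R?  vs  (R|ε)
print(same_language("[abcd]", "(a|b|c|d)", "abcde"))   # character list
print(same_language("[^abc]", "(d|e)", "abcde"))       # complement within {a,...,e}
```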


Equivalence of FSM and Regular Expressions

• Theorem:
  – For each finite state machine M, we can construct a
    regular expression R such that M and R accept the
    same language.
  – [proof omitted]
• Theorem:
  – For each regular expression R, we can construct a
    finite state machine M such that R and M accept
    the same language.
  – [proof outline follows]

Regular Expressions to NFSM (1)

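• For each kind of reg exp, define an NFSM
  – Notation: NFSM for reg exp M
    [Diagram: a box labeled M]

  • For ε
    [Diagram: a single ε-move from the start state to the accepting state]

  • For input a
    [Diagram: a single transition labeled a from the start state to the accepting state]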
Regular Expressions to NFSM (2)

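• For A . B
  [Diagram: the NFSM for A, followed by an ε-move into the NFSM for B]

• For A | B
  [Diagram: the NFSMs for A and B side by side, connected by four ε-moves: two lead into A and B, two lead out of A and B]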
Regular Expressions to NFSM (3)

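• For A*
  [Diagram: the NFSM for A with three ε-moves that allow A to be skipped entirely or repeated any number of times]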
Example of RegExp -> NFSM conversion

• Consider the regular expression
                      (1|0)*1
• The NFSM is
  [Diagram: NFSM with states A through J. Labeled transitions: C ––1––> E, D ––0––> F, I ––1––> J. All remaining edges are ε-moves.]
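
The construction rules can be sketched in code. This is one possible encoding, not the original figures: the class name `NFSM`, the use of `None` as the ε label, the integer state numbering, and the exact placement of the ε-moves for `|` and `*` are assumptions chosen to follow the rules above.

```python
from itertools import count

EPS = None            # assumption: None marks an ε-move
_ids = count()        # fresh integer state names

class NFSM:
    def __init__(self, start, accept, edges):
        self.start, self.accept = start, accept
        self.edges = edges            # list of (source, label, target)

def symbol(a):
    s, f = next(_ids), next(_ids)
    return NFSM(s, f, [(s, a, f)])

def concat(A, B):                     # A . B: ε-move from A's end into B's start
    return NFSM(A.start, B.accept, A.edges + B.edges + [(A.accept, EPS, B.start)])

def union(A, B):                      # A | B: branch out with ε, merge back with ε
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, A.start), (s, EPS, B.start), (A.accept, EPS, f), (B.accept, EPS, f)]
    return NFSM(s, f, A.edges + B.edges + extra)

def star(A):                          # A*: skip A, or run A and return to the start
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, A.start), (s, EPS, f), (A.accept, EPS, s)]
    return NFSM(s, f, A.edges + extra)

def accepts(m, text):
    """Simulate the NFSM by tracking the set of states it may be in."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in m.edges:
                if src == q and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({m.start})
    for ch in text:
        moved = {dst for src, label, dst in m.edges if src in current and label == ch}
        current = closure(moved)
    return m.accept in current

# (1|0)*1
machine = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
print(accepts(machine, "0101"), accepts(machine, "10"))
```

Built this way, the machine for (1|0)*1 has ten states, the same count as states A through J in the example.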
Converting NFSM to DFSM

• Simulate the NFSM
• Each state of DFSM
  – is a non-empty subset of states of the NFSM
• Start state of DFSM
  – is the set of NFSM states reachable from the NFSM
    start state using only ε-moves
• Add a transition S ––a––> S’ to DFSM iff
  – S’ is the set of NFSM states reachable from any
    state in S after consuming only the input a,
    considering ε-moves as well
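
The steps above can be sketched as a short subset construction. The NFSM used here is an assumption: its ε-edges are one set consistent with the construction rules for (1|0)*1 and the state names A through J, not a transcription of the original figure. `None` as the ε label and the function names are also hypothetical.

```python
EPS = None  # assumption: None marks an ε-move

# Assumed NFSM for (1|0)*1 with start state A and accepting state J
NFSM_EDGES = [
    ("A", EPS, "B"), ("A", EPS, "H"),
    ("B", EPS, "C"), ("B", EPS, "D"),
    ("C", "1", "E"), ("D", "0", "F"),
    ("E", EPS, "G"), ("F", EPS, "G"),
    ("G", EPS, "A"),
    ("H", EPS, "I"), ("I", "1", "J"),
]

def eps_closure(states, edges):
    """All states reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def subset_construction(start, edges, alphabet):
    start_set = eps_closure({start}, edges)
    table, todo = {}, [start_set]
    while todo:
        S = todo.pop()
        if S in table:
            continue
        table[S] = {}
        for a in alphabet:
            moved = {dst for src, label, dst in edges if src in S and label == a}
            T = eps_closure(moved, edges)
            if T:                         # only non-empty subsets become DFSM states
                table[S][a] = T
                todo.append(T)
    return start_set, table

start_set, table = subset_construction("A", NFSM_EDGES, "01")
for S, row in table.items():
    print("".join(sorted(S)), {a: "".join(sorted(T)) for a, T in row.items()})
```

Only subsets that are actually reachable from the start set are created, which is usually far fewer than all possible subsets.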

Remarks on converting NFSM to DFSM

• An NFSM may be in many states at any time
• How many different states?
• If there are N states, the NFSM must be in
  some subset of those N states
• How many subsets are there?
• 2^N = finitely many
• For example, if N = 5 then 2^N = 32 subsets
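
The count for N = 5 can be confirmed by listing the subsets directly. A small sketch, assuming five hypothetical state names A through E:

```python
from itertools import combinations

def subsets(states):
    """Yield every subset of `states`, including the empty set."""
    for k in range(len(states) + 1):
        for combo in combinations(states, k):
            yield frozenset(combo)

all_subsets = list(subsets("ABCDE"))
non_empty = [s for s in all_subsets if s]
print(len(all_subsets), len(non_empty))
```

Since a DFSM state is a non-empty subset, at most 2^N - 1 of these subsets can appear as DFSM states.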


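NFSM -> DFSM Example

[Diagram: the NFSM for (1|0)*1 with states A through J. Labeled transitions: C ––1––> E, D ––0––> F, I ––1––> J. All remaining edges are ε-moves.]

Resulting DFSM (start state ABCDHI):

   State          on 0          on 1
   ABCDHI         FGHIABCD      EJGHIABCD
   FGHIABCD       FGHIABCD      EJGHIABCD
   EJGHIABCD      FGHIABCD      EJGHIABCD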
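
The DFSM can be run on input strings with a simple table lookup. In this sketch the transition table is a reading of the diagram (one 0-edge and one 1-edge per state), and treating EJGHIABCD as the only accepting state assumes that J is the accepting state of the NFSM.

```python
# Transition table of the three-state DFSM, keyed by the subset names
DFSM = {
    "ABCDHI":    {"0": "FGHIABCD", "1": "EJGHIABCD"},
    "FGHIABCD":  {"0": "FGHIABCD", "1": "EJGHIABCD"},
    "EJGHIABCD": {"0": "FGHIABCD", "1": "EJGHIABCD"},
}
START = "ABCDHI"
ACCEPTING = {"EJGHIABCD"}   # assumption: the only subset containing J

def run(text):
    """Follow exactly one transition per input character, then test acceptance."""
    state = START
    for ch in text:
        state = DFSM[state][ch]
    return state in ACCEPTING

for s in ["1", "001", "10", ""]:
    print(repr(s), run(s))
```

The machine accepts exactly the binary strings that end in 1, which matches (1|0)*1.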
TEST YOURSELF

Question 3: First convert each of these regular
 expressions to an NFSM
  – (a | b | ε) (a | b)


  – (ab | ba)* (aa | bb)


Question 4: Next convert each resulting NFSM
 to a DFSM

