regular_expressions by wuzhengqin


									C311 Regular Grammars, Scanners, DFA                    

         Dana Vrajitoru
         C311 Programming Languages

                                Regular Grammars, Scanners, DFA
         DFA versus NFA

               DFA - deterministic finite state automaton.
               A state transition machine where from each state we can move to one of the others based on the
               current terminal symbol.
               There at most one outgoing branch for each possible symbol.
               NFA - non-deterministic finite state automaton.
               The difference with a DFA is that there can be more than one outgoing branch with the same
               Regular grammars can be translated as NFAs.

         NFA to DFA

               To write a scanner more easily, we can transform an NFA into a DFA.
               For this we construct a DFA where the states are combinations of states from the NFA that
               represent alternatives that can be reached from any given state with the same symbol.
               For example in the first grammar we would add a state called TV:

         Example: NFA (top) and equivalent DFA (bottom)

1 of 4                                                                                                       1/13/12 4:25 PM
C311 Regular Grammars, Scanners, DFA                      

         Regular Expressions

               Recognized by regular grammars.
               They can be parsed by a linear algorithm that doesn't need to store the global structure of the
               parsing tree.
               Only one non-terminal symbol is active at any moment.
               This type of parsing algorithm is called a scanner.
               The first compilation step, the lexical analysis, is done by a scanner.


               An identifier in Lisp:
               I    =>   letter J
               J    =>   letter J
               J    =>   digit J
               J    =>   _ J
               J    =>   - J
               J    =>   e

               Condense the rules:
               J =>{_|-} J
               where e is the empty string.

               An integer number:
               N     =>   UN
               N     =>   - UN
               N     =>   + UN
               UN    =>   digit UN
               UN    =>   digit

2 of 4                                                                                                         1/13/12 4:25 PM
C311 Regular Grammars, Scanners, DFA                        

                 Condensing the rules:
                 N => {e|+|-} UN

         Finite States Machines

                 A string (simple character means one that is not a quote, simple quote, backslash):
                 S    =>   " T
                 T    =>   simple-character T
                 T    =>   \ V
                 V    =>   {"|'|\|n|t...} T
                 T    =>   "

                 Real Number
                 N => {-|+|e} UN
                 UN => UI . UI
                 UI => digit UI
                 UI => digit

                 The second rule above is not a regular rule. It needs a parser and not a scanner to be processed.
                 We can transform it the following way:
                 UN    =>   digit UN
                 UN    =>   digit
                 UN    =>   . UF
                 UF    =>   digit UF
                 UF    =>   digit

         Scanner / Parser

                 The lexical analysis of a program is done by a scanner.
                 A scanner is based on a regular grammar where the end of any chain of rules (a non-terminal
                 converted to a single terminal) produces a token as a result.
                 The parsing process calls the scanner during its execution to retrieve every new token, then
                 processes it before invoking the scanner again.
                 They are not two separate phases: the scanner is called repeatedly from the parser.

         Scanner in Context

           A scanner that works in the context of a program in which it recognizes individual units.
           It can be written to recognize the termination of a unit by a character that is not part of it.
           We can replace:
         UN => digit UN
         UN => digit

         UN => digit UN
         UN => e

         Table Driven Scanner

3 of 4                                                                                                           1/13/12 4:25 PM
C311 Regular Grammars, Scanners, DFA                  

         If the grammar is equivalent to a DFA, then we can organize the rules in a table:

         I   =>   letter J
         I   =>   _ J
         J   =>   letter J
         J   =>   digit J
         J   =>   _ J
         J   =>   - J
         J   =>   e

             letter digit _ -   else
          I J       err   J err err
          J J       J     J J   end

4 of 4                                                                                                     1/13/12 4:25 PM

To top