Discussion #1 Finite State Machines

					    Discussion #1

Finite State Machines &
  Regular Expressions
                     Topics
•   Compilers and Interpreters
•   Lexical Analyzers
•   Regular Expressions
•   Finite State Machines
•   Project 1
      Compilers for Programming Languages


Program → Compiler → Code

Program → Lexical Analyzer → Tokens → Parser (Syntax Analysis) → Internal Data → Code Generator → Code
                                                                               Or Interpreter (Executed directly)

•   Tokens: Keywords, String literals, Variables, …
•   Error messages
                  Series of 6 Projects:
                  Datalog Interpreter
Example Input:
Schemes:
  snap(S,N,A,P)
  csg(C,S,G)
  cn(C,N)
  ncg(N,C,G)

Facts:
  snap('12345','C. Brown','12 Apple St.','555-1234').
  snap('22222','P. Patty','56 Grape Blvd','555-9999').
  snap('33333','Snoopy','12 Apple St.','555-1234').
  csg('CS101','12345','A').
  csg('CS101','22222','B').
  csg('CS101','33333','C').
  csg('EE200','12345','B+').
  csg('EE200','22222','B').

Rules:
  cn(c,n) :- snap(S,n,A,P),csg(c,S,G).
  ncg(n,c,g) :- snap(S,n,A,P),csg(c,S,g).

Queries:
  cn('CS101',Name)?
  ncg('Snoopy',Course,Grade)?

Example Output:
cn('CS101',Name)? Yes(3)
  Name='C. Brown'
  Name='P. Patty'
  Name='Snoopy'

ncg('Snoopy',Course,Grade)? Yes(1)
  Course='CS101', Grade='C'
     Project 1: Lexical Analyzer
Example Input:
Queries:
         IsInRoomAtDH('Snoopy',R,'M',H)
#SchemesFactsRules
.
#|comment >=
wow|#

Example Output:
(QUERIES,"Queries",1)
(COLON,":",1)
(ID,"IsInRoomAtDH",2)
(LEFT_PAREN,"(",2)
(STRING,"'Snoopy'",2)
(COMMA,",",2)
(ID,"R",2)
(COMMA,",",2)
(STRING,"'M'",2)
(COMMA,",",2)
(ID,"H",2)
(RIGHT_PAREN,")",2)
(COMMENT,"#SchemesFactsRules",3)
(PERIOD,".",4)
(COMMENT,"#|comment >=
wow|#",5)
(EOF,"",7)
Total Tokens = 16
                    The Point of CS 236
•   Use mathematics to write better code.
     – in Project 1: some sample code to help get started
     – in later projects: continue this process independently
•   Project 1: Use a Finite State Machine to write a Lexical Analyzer.
                    Regular Expressions
•   Pattern description for strings
•   Standard patterns:
     – Concatenation: abc matches …abc… but not …abdc… or …ac…
     – Boolean or: ab|ac matches …ab… and …ac… but not …cba…
     – Kleene closure: ab* matches …a… and …ab… and …abb… and …
•   Common shorthand patterns
     – Optional: ab?c matches …ac… and …abc… but not …abbc…
       short for ac|abc
     – One or more: ab+ matches …ab… and …abb… and … but not …a…
       short for abb*
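The patterns above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the helper names `found` and `agree` are illustrative and not part of the course material. `found` mirrors the "…x…" notation (the pattern occurs somewhere inside the text), and `agree` checks by brute force over short strings that a shorthand and its expansion accept the same strings.

```python
import re
from itertools import product

def found(pattern, text):
    """True if the pattern occurs somewhere inside text (the "...x..." notation)."""
    return re.search(pattern, text) is not None

def agree(p, q, alphabet="abc", max_len=5):
    """True if p and q accept exactly the same strings over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (re.fullmatch(p, s) is None) != (re.fullmatch(q, s) is None):
                return False
    return True

# Concatenation, Boolean or, Kleene closure
print(found("abc", "xxabcxx"), found("abc", "xxabdcxx"))   # True False
print(found("ab|ac", "xxacxx"), found("ab|ac", "cba"))     # True False
print(found("ab*", "xxaxx"), found("ab*", "xxabbxx"))      # True True

# Shorthand patterns and their expansions
print(agree("ab?c", "ac|abc"))   # True
print(agree("ab+", "abb*"))      # True
```

The brute-force comparison only covers strings up to `max_len`, so it is evidence of equivalence rather than a proof.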
    Regular Expressions & Parens
• Parens group regular expressions as expected
• Examples:
  – (a|b)c matches …ac… and …bc…
  – (a|b)*c matches …c… and …ac… and …bac… and
    …ababababbbabbabaaaababaababbbbc… and …
  – (a|b)?c matches …c… and …ac… and …bc…
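As a quick check of how grouping changes meaning, the sketch below compares `(a|b)c` with the unparenthesized `a|bc` using Python's `re` module. The helper `full` is an illustrative name; it tests whether the whole string matches, not just a substring.

```python
import re

def full(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# With parens, the choice is between a and b, and c always follows.
print(full(r"(a|b)c", "ac"), full(r"(a|b)c", "bc"))

# Without parens, concatenation binds tighter than |, so a|bc means "a" or "bc".
print(full(r"a|bc", "ac"), full(r"a|bc", "a"), full(r"a|bc", "bc"))

# Closure and optional apply to the whole group.
print(full(r"(a|b)*c", "ababababbbabbabaaaababaababbbbc"))
print(full(r"(a|b)?c", "c"), full(r"(a|b)?c", "abc"))
```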
     Regular Expression Extensions
            (e.g. Google Regular Expressions)
• Additional shorthand and notation
   – [A-Z] = A|B|…|Z
   – [ABC] = A|B|C
   – \ is an escape character: \* matches …*…
• Languages and language extensions/packages
   – Perl
   – Java regular-expression packages
• Example: Google Regular Expressions
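Python's `re` module supports the same bracket and escape notation, so it can serve as a small illustration. This is only a sketch; the names `UPPER`, `ABC`, `STAR`, and `first_star` are made up for the example.

```python
import re

UPPER = re.compile(r"[A-Z]")   # same as A|B|...|Z
ABC = re.compile(r"[ABC]")     # same as A|B|C
STAR = re.compile(r"\*")       # backslash escapes the closure operator

def first_star(text):
    """Index of the first literal '*' in text, or -1 if there is none."""
    m = STAR.search(text)
    return m.start() if m else -1

print(bool(UPPER.fullmatch("Q")), bool(UPPER.fullmatch("q")))
print(bool(ABC.fullmatch("B")), bool(ABC.fullmatch("D")))
print(first_star("a*b"), first_star("ab"))
```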
           Regular Expressions &
           Finite State Machines
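• abc
  [Diagram: transitions labeled a, b, c in sequence, ending in an accepting state.]
  Note the special double-circle designation of an accepting state.

• a(b|c)
  [Diagram: a transition labeled a, followed by a branch labeled b or c.]

• ab*
  [Diagram: a transition labeled a, followed by a loop labeled b.]

• (a(b?c))+
  [Diagram: transitions labeled a, b, c, with a second c transition that skips b and an a transition leading back.]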
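One way to read the last machine is as a transition table. The sketch below is an assumed reconstruction of the (a(b?c))+ machine, not a copy of the slide's diagram; the state names (`start`, `after_a`, `after_b`, `accept`) are invented labels. Any missing table entry is treated as a rejection.

```python
# Transition table for (a(b?c))+ ; state names are illustrative labels.
TRANSITIONS = {
    ("start", "a"): "after_a",
    ("after_a", "b"): "after_b",
    ("after_a", "c"): "accept",    # b is optional, so c may follow a directly
    ("after_b", "c"): "accept",
    ("accept", "a"): "after_a",    # the + repeats the whole group
}
ACCEPTING = {"accept"}

def accepts(text):
    """Run the machine over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

for s in ["ac", "abc", "acabc", "", "a", "abbc"]:
    print(repr(s), accepts(s))
```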
             Finite State Machine:
             Mathematical Model

A deterministic finite state machine is a quintuple (Σ,S,s0,δ,F), where:
    • Σ is the input alphabet (a finite, non-empty set of symbols).
    • S is a finite, non-empty set of states.
    • s0 is an initial state, an element of S.
    • δ is the state-transition function: δ : S × Σ → S.
    • F is the set of final states, a (possibly empty) subset of S.
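The quintuple maps naturally onto a small data structure. The following is a sketch, assuming δ is stored as a Python dict keyed by (state, symbol); the class name `DFA`, its field names, and the `dead` state are illustrative choices. The constructor checks the conditions in the definition, including that δ is total. The example machine recognizes ab*.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DFA:
    alphabet: frozenset   # Σ
    states: frozenset     # S
    start: str            # s0
    delta: dict           # δ, keyed by (state, symbol)
    finals: frozenset     # F

    def __post_init__(self):
        if not self.alphabet or not self.states:
            raise ValueError("alphabet and state set must be non-empty")
        if self.start not in self.states or not self.finals <= self.states:
            raise ValueError("start state and final states must belong to S")
        for s in self.states:
            for x in self.alphabet:
                # δ must be defined for every (state, symbol) pair
                if self.delta.get((s, x)) not in self.states:
                    raise ValueError(f"delta undefined or invalid at {(s, x)}")

    def accepts(self, text):
        state = self.start
        for ch in text:
            if ch not in self.alphabet:
                raise ValueError(f"symbol {ch!r} is not in the alphabet")
            state = self.delta[(state, ch)]
        return state in self.finals

# ab*: an explicit "dead" state keeps δ total.
AB_STAR = DFA(
    alphabet=frozenset("ab"),
    states=frozenset({"q0", "q1", "dead"}),
    start="q0",
    delta={
        ("q0", "a"): "q1", ("q0", "b"): "dead",
        ("q1", "a"): "dead", ("q1", "b"): "q1",
        ("dead", "a"): "dead", ("dead", "b"): "dead",
    },
    finals=frozenset({"q1"}),
)
print(AB_STAR.accepts("abbb"), AB_STAR.accepts("ba"))
```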

A finite state transducer is a 6-tuple (Σ,Γ,S,s0,δ,F) as above except:
    • Γ is the output alphabet (a finite, non-empty set of symbols).
    • δ is the state-transition function: δ : S × Σ → S × Γ.
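A transducer differs from the plain machine only in that each transition also emits an output symbol. As a small illustration (an invented example, not one from the slides), the sketch below tracks the parity of the 1s read so far: Σ = {0, 1}, Γ = {e, o}, S = {even, odd}, s0 = even, F = {even}.

```python
# δ : S × Σ → S × Γ, stored as a dict from (state, input) to (next state, output).
DELTA = {
    ("even", "0"): ("even", "e"),
    ("even", "1"): ("odd", "o"),
    ("odd", "0"): ("odd", "o"),
    ("odd", "1"): ("even", "e"),
}
START = "even"
FINALS = {"even"}

def run(text):
    """Return (output string, whether the machine ended in a final state)."""
    state, out = START, []
    for ch in text:
        state, symbol = DELTA[(state, ch)]
        out.append(symbol)
    return "".join(out), state in FINALS

print(run("1101"))   # ('oeeo', False)
print(run("11"))     # ('oe', True)
```

A lexical analyzer can be viewed the same way: the input alphabet is characters and the outputs are token types.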
               Project 1: Lexical Analyzer
Varieties, with description and examples:

<String>
  Description: Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line-number counts, count all '\n's within a string. A string token’s line number is the line where the string starts.
  Examples:
    'quoted string'
    'this isn''t two strings'
    '' (empty string)
    'don''t forget
    about multi-
    line strings'

<Keyword>
  Description: One of the following four character sequences: Schemes, Facts, Rules, Queries. These keywords are case sensitive.
  Example: Schemesa is a single identifier and not a keyword and an identifier.

<Identifier>
  Description: An identifier is a letter followed by a sequence of zero or more letters or numbers. No underscores.
  Legal identifiers: Identifier1, Person
  Invalid identifiers: 1stPerson, Person_Name

<Symbol>
  Description: One of the following character sequences:
    :       ,       <       >        =      (         *        ?
    :-      .       <=      >=       !=     )         +
  Examples:
    <=('a','b')
    ( + ()
    ::- ???

White Space
  Description: Ignore white space; that is, do not output a token for white space, just skip over it. White space includes any encountered spaces, tabs, new lines, and carriage returns. Be sure to count the lines when skipping over white space.

<Undefined>
  Description: Any character not tokenized as a string, keyword, identifier, symbol, or white space. Any non-terminating string or non-terminating comment is undefined. In both of the latter two cases we reached EOF before finding the end of string or end of comment.
  Examples:
    $&^ (Three individual tokens.)
    'any string that doesn''t end

<Comment>
  Description: A line comment starts with # and ends at newline. A block comment starts at #| and ends with |#. The comment’s line number is the line where the comment started.
  Examples:
    #this is a comment
    #|this is a
    multiline comment|#

<EOF>
  Description: End of input file.
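The keyword and identifier rules can be sketched with one regular expression plus a lookup. This is a minimal illustration, not the project's required structure: `classify_word` is a made-up helper, letters are assumed to be ASCII, and the token names SCHEMES, FACTS, and RULES are assumed by analogy with the QUERIES and ID names in the sample output.

```python
import re

# Keywords are case sensitive; token names other than QUERIES are assumed.
KEYWORDS = {"Schemes": "SCHEMES", "Facts": "FACTS",
            "Rules": "RULES", "Queries": "QUERIES"}

# A letter followed by zero or more letters or digits, no underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def classify_word(text):
    """Return a keyword token name, "ID", or None if text is not a legal identifier."""
    if IDENTIFIER.fullmatch(text) is None:
        return None
    return KEYWORDS.get(text, "ID")

for word in ["Queries", "queries", "Schemesa", "Identifier1", "1stPerson", "Person_Name"]:
    print(word, classify_word(word))
```

Checking the keyword table only after the whole word has been read is what makes Schemesa a single identifier.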
Partial Finite State Machine for
             Project 1
[State diagram. States: Start, String, String Quote, Id, Colon_Or_Colon_Dash, Colon_Dash. Edge labels: ' (single quote), <any character but single quote>, <letter>, <letter> | <digit>, :, -]
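The string part of the machine can be sketched on its own. The code below is an assumed reading of the Start, String, and String Quote states, combined with the string rules from the token table: a doubled quote stays inside the string, and reaching end of input inside a string yields an undefined token. The function name `scan_string` and the token name UNDEFINED are illustrative.

```python
from enum import Enum, auto

class S(Enum):
    START = auto()
    STRING = auto()         # inside the quotes
    STRING_QUOTE = auto()   # just saw a quote that may close the string

def scan_string(text, pos=0):
    """Scan one string token starting at text[pos].

    Returns (token type, lexeme, newlines inside the lexeme),
    or None if text[pos] does not start a string.
    """
    state, i = S.START, pos
    while i < len(text):
        ch = text[i]
        if state is S.START:
            if ch != "'":
                return None
            state = S.STRING
        elif state is S.STRING:
            if ch == "'":
                state = S.STRING_QUOTE
        else:  # S.STRING_QUOTE
            if ch == "'":
                state = S.STRING      # '' is an apostrophe, still inside
            else:
                break                 # previous quote closed the string
        i += 1
    if state is S.START:
        return None
    lexeme = text[pos:i]
    kind = "STRING" if state is S.STRING_QUOTE else "UNDEFINED"
    return kind, lexeme, lexeme.count("\n")

print(scan_string("'this isn''t two strings' rest"))
print(scan_string("'any string that doesn''t end"))
```

The newline count lets the caller advance its line counter while still reporting the line where the string started.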
              Get the Design Right
Code must directly represent a state machine:
•   Set of states (enum)
•   Transition function for each state:
     – Input: the next character
     – Output: a new Transition, i.e.:
         the next state
         a TokenType (if the current token is now complete),
           or null (if the current token is incomplete)
•   State machine loop:
     – Evaluates state transitions
     – Builds and emits tokens
     – Dirty work: discards whitespace tokens, tracks line numbers, etc.
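The design above can be sketched in a few dozen lines. This is a hedged illustration covering only identifiers, keywords, `:`, `:-`, and white space; the names `State`, `Transition`, `step`, and `tokenize`, the WHITESPACE state, and the COLON_DASH token name are assumptions, and an empty string is used to signal end of input. A character that completes one token also begins the next.

```python
from enum import Enum, auto
from typing import NamedTuple, Optional

class State(Enum):
    START = auto()
    ID = auto()
    COLON_OR_COLON_DASH = auto()
    COLON_DASH = auto()
    WHITESPACE = auto()

class Transition(NamedTuple):
    next_state: State
    token_type: Optional[str]   # None while the current token is incomplete

# Token completed when a state is left (token names partly assumed).
COMPLETES = {State.ID: "ID", State.COLON_OR_COLON_DASH: "COLON",
             State.COLON_DASH: "COLON_DASH", State.WHITESPACE: "WHITESPACE"}
KEYWORDS = {"Schemes", "Facts", "Rules", "Queries"}

def step(state, ch):
    """Transition function; ch == "" marks end of input."""
    # Transitions that extend the current token.
    if state is State.ID and ch.isalnum():
        return Transition(State.ID, None)
    if state is State.COLON_OR_COLON_DASH and ch == "-":
        return Transition(State.COLON_DASH, None)
    if state is State.WHITESPACE and ch.isspace():
        return Transition(State.WHITESPACE, None)
    # Otherwise the current token (if any) is complete and ch starts the next.
    if ch == "":
        nxt = State.START
    elif ch.isalpha():
        nxt = State.ID
    elif ch == ":":
        nxt = State.COLON_OR_COLON_DASH
    elif ch.isspace():
        nxt = State.WHITESPACE
    else:
        raise ValueError(f"character {ch!r} is not handled in this sketch")
    return Transition(nxt, COMPLETES.get(state))

def tokenize(text):
    """State machine loop: evaluate transitions, build and emit tokens."""
    tokens, state, lexeme, line, start_line = [], State.START, "", 1, 1
    for ch in list(text) + [""]:
        t = step(state, ch)
        if t.token_type is not None:
            kind = lexeme.upper() if lexeme in KEYWORDS else t.token_type
            if kind != "WHITESPACE":          # discard whitespace tokens
                tokens.append((kind, lexeme, start_line))
            lexeme = ""
        if not lexeme:
            start_line = line                 # a new token starts here
        lexeme += ch
        if ch == "\n":
            line += 1
        state = t.next_state
    tokens.append(("EOF", "", line))
    return tokens

for token in tokenize("Queries:\n  foo :- bar"):
    print(token)
```

Keeping `step` free of side effects leaves all bookkeeping (lexeme buffer, line numbers, discarding white space) in the loop, which matches the split between transition functions and "dirty work" described above.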

				