CS 412
Introduction to Compilers

Andrew Myers
Cornell University

Lecture 4: Syntactic Analysis
31 Jan 01

CS 412/413 Introduction to Compilers Spring '01 -- Andrew Myers


Outline
• Context-Free Grammars (CFGs)
• Derivations
• Parse trees and abstract syntax
• Ambiguous grammars




Where we are

Source code (character stream):
    if (b == 0) a = b;
        ↓ Lexical analysis
Token stream:
    if ( b == 0 ) a = b ;
        ↓ Syntactic Analysis (specification)
Abstract syntax tree (AST):
    if
    ├─ ==
    │  ├─ b
    │  └─ 0
    ├─ =
    │  ├─ a
    │  └─ b
    └─ ;
        ↓ Semantic Analysis


What is Syntactic Analysis?

Source code (token stream):
    {
        if (b == (0)) a = b;
        while (a != 1) {
            stdio.print(a);
            a = a - 1;
        }
    }

Abstract Syntax Tree:
    block
    ├─ if_stmt
    │  ├─ bin_op ==
    │  │  ├─ variable b
    │  │  └─ constant 0
    │  └─ ...
    └─ while_stmt
       ├─ bin_op !=
       │  ├─ variable a
       │  └─ constant 1
       └─ block
          ├─ expr_stmt
          │  └─ call
          │     ├─ .
          │     │  ├─ stdio
          │     │  └─ print
          │     └─ variable a
          └─ =
             └─ ...
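The lexical-analysis step in the "Where we are" pipeline can be sketched in a few lines of Python. This is only an illustration, not the course's lexer: the token pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and the pattern covers just the symbols that appear in the slide's examples.

```python
import re

# Hypothetical token pattern: two-character operators first, then
# identifiers/keywords, integer literals, and single-character symbols.
TOKEN_RE = re.compile(r"\s*(==|!=|[A-Za-z_]\w*|\d+|[(){};=.\-+])")

def tokenize(source):
    """Split a character stream into a list of token strings."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

tokens = tokenize("if (b == 0) a = b;")
print(" ".join(tokens))  # if ( b == 0 ) a = b ;
```

The resulting token list is what the syntactic-analysis phase consumes as its input.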
Parsing
• Parsing: recognizing whether a program (or sentence) is grammatically well-formed & identifying the function of each component.

    “I gave him the book”

    sentence
    ├─ subject: I
    ├─ verb: gave
    ├─ indirect object: him
    └─ object
       └─ noun phrase
          ├─ article: the
          └─ noun: book


Overview of Syntactic Analysis
• Input: stream of tokens
• Output: abstract syntax tree
• Implementation:
  – Parse token stream to traverse concrete syntax (parse tree)
  – During traversal, build abstract syntax tree
  – Abstract syntax tree removes extra syntax:
    a + b, (a) + (b), and ((a)+((b))) all yield the same tree

    bin_op +
    ├─ a
    └─ b
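The claim that the abstract syntax tree removes extra syntax can be checked with Python's own parser as a stand-in (an assumption for illustration; the course compiler is not written this way). Python's `BinOp` node plays the role of the slide's `bin_op`, and `tree_shape` is a hypothetical helper name.

```python
import ast

def tree_shape(expression):
    """Return a textual dump of the AST for one expression."""
    # ast.dump omits line/column attributes by default, so only the
    # tree structure is compared, not where the tokens appeared.
    return ast.dump(ast.parse(expression, mode="eval"))

variants = ["a + b", "(a) + (b)", "((a)+((b)))"]
shapes = {tree_shape(text) for text in variants}
print(len(shapes))  # 1: the redundant parentheses leave no trace in the AST
```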




What Parsing doesn’t do
• Doesn’t check many things: type agreement, variables declared, variables initialized, etc.
    int x = true;
    int y;
    z = f(y);
• Deferred until semantic analysis


Specifying Language Syntax
• First problem: how to describe language syntax precisely and conveniently
• Last time: can describe tokens using regular expressions
• Regular expressions are easy to implement and efficient (by converting to DFA)
• Why not use regular expressions (on tokens) to specify programming language syntax?
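A rough Python analogue of "What Parsing doesn’t do": the statement below parses without complaint even though none of its names are declared. This is only an analogy, since Python defers the check to run time rather than to a separate semantic-analysis phase; the variable names `SOURCE`, `parsed_ok`, and `failure` are made up for the sketch.

```python
import ast

SOURCE = "z = f(y)"  # Python spelling of the slide's `z = f(y);`

# Syntactic analysis: succeeds, because the statement is well-formed.
tree = ast.parse(SOURCE)
parsed_ok = isinstance(tree, ast.Module)

# The undeclared names are only noticed later, when the code is run.
try:
    exec(compile(tree, "<slide>", "exec"), {})
    failure = None
except NameError as error:
    failure = error

print(parsed_ok, type(failure).__name__)
```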
Limits of REs
• Programming languages are not regular -- cannot be described by regular exprs
• Consider: language of all strings that contain balanced parentheses (easier than PLs)
    balanced:      ()   (())   ()()()   (())()((()()))
    not balanced:  ((   )(   ())   (()()
• Problem: need to keep track of number of parentheses seen so far: unbounded counting


Need more power!
• RE = DFA
• DFA has only finite number of states; cannot perform unbounded counting

    [Figure: DFA with five "(" transitions and five ")" transitions; maximum depth: 5 parens]
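The counting argument can be made concrete with a small sketch. `is_balanced` uses an unbounded integer counter, which no DFA can have; `bounded_dfa_accepts` simulates a DFA whose states are the depths 0 through `max_depth`, like the five-deep automaton in the figure. Both function names and the `max_depth` parameter are assumptions for illustration.

```python
def is_balanced(text):
    """Recognize balanced parentheses with an unbounded counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ")" with no matching "("
                return False
        else:
            return False           # only parentheses are in the alphabet
    return depth == 0

def bounded_dfa_accepts(text, max_depth=5):
    """Simulate a DFA whose states 0..max_depth record the nesting depth."""
    state = 0
    for ch in text:
        if ch == "(" and state < max_depth:
            state += 1
        elif ch == ")" and state > 0:
            state -= 1
        else:
            return False           # no transition: the DFA rejects
    return state == 0

deep = "(" * 6 + ")" * 6
print(is_balanced(deep), bounded_dfa_accepts(deep))  # True False
```

Raising `max_depth` only moves the problem: for any fixed number of states there is a balanced string nested one level deeper.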




Context-Free Grammars
• A specification of the balanced-parenthesis language:
    S → (S)S
    S → ε
• The definition is recursive
• A context-free grammar
  – More expressive than regular expressions
  – S = (S) ε = ((S) S) ε = ((ε) ε) ε = (())

If a grammar accepts a string, there is a derivation of that string using the productions of the grammar


Definition of CFG
• Terminals
  – Token or ε
• Non-terminals
  – Syntactic variables
• Start symbol
  – A special nonterminal is designated (S)
• Productions
  – Specify how non-terminals may be expanded to form strings
  – LHS: single non-terminal, RHS: string of terminals or non-terminals
• Vertical bar is shorthand for multiple prod’ns

    S → (S)S
    S → ε
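Because the grammar S → (S)S | ε is recursive, a recognizer for it can mirror the productions with a recursive function. The following is a minimal sketch under that assumption, not the parser developed in the course; `match_s` and `accepts` are hypothetical names.

```python
def match_s(text, pos=0):
    """Match one S starting at pos; return the end position or None on error."""
    if pos < len(text) and text[pos] == "(":
        # S -> ( S ) S
        inner_end = match_s(text, pos + 1)
        if inner_end is None or inner_end >= len(text) or text[inner_end] != ")":
            return None
        return match_s(text, inner_end + 1)
    # S -> epsilon: consume nothing
    return pos

def accepts(text):
    """A string is in the language if one S covers all of it."""
    return match_s(text) == len(text)

print(accepts("(())"), accepts("(()"))  # True False
```

The recursion depth tracks the nesting depth of the input, which is exactly the unbounded counting a DFA cannot do.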
RE is subset of CFG
Regular Expression defn of real numbers:
    digit → [0-9]
    posint → digit+
    int → -? posint
    real → int . (ε | posint)
• RE symbolic names are only shorthand: no recursion, so all symbols can be fully expanded:
    real → -? [0-9]+ . (ε | ([0-9]+))


Sum grammar
    S → E + S | E
    E → number | ( S )
e.g. (1 + 2 + (3+4))+5

    S → E + S
    S → E
    E → number
    E → ( S )

    4 productions
    2 non-terminals (S, E)
    4 terminals: (, ), +, number
    start symbol S
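The four parts of the sum grammar can be written down as plain data, and the counts on the slide then fall out mechanically. The representation below (a list of `(lhs, rhs)` pairs named `PRODUCTIONS`) is an assumption chosen for illustration.

```python
# Each production is (left-hand side, list of right-hand-side symbols).
PRODUCTIONS = [
    ("S", ["E", "+", "S"]),
    ("S", ["E"]),
    ("E", ["number"]),
    ("E", ["(", "S", ")"]),
]
START_SYMBOL = "S"

# Non-terminals are the symbols that appear on some left-hand side;
# every other symbol used on a right-hand side is a terminal.
nonterminals = {lhs for lhs, _ in PRODUCTIONS}
terminals = {symbol for _, rhs in PRODUCTIONS for symbol in rhs} - nonterminals

print(len(PRODUCTIONS), sorted(nonterminals), sorted(terminals))
```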




Derivation Example
    S → E + S | E
    E → number | ( S )
Derive (1+2+(3+4))+5:
    S → E + S → ( S ) + S → ( E + S ) + S
      → (1 + S) + S → (1 + E + S) + S
      → (1 + 2 + S) + S → (1 + 2 + E) + S
      → (1 + 2 + ( S )) + S → (1 + 2 + ( E + S )) + S
      → (1 + 2 + (3 + S)) + S
      → (1 + 2 + (3 + E)) + S
      → (1 + 2 + (3 + 4)) + S
      → (1 + 2 + (3 + 4)) + E
      → (1 + 2 + (3 + 4)) + 5
    (figure labels: non-terminal being expanded, replacement string)


Constructing a derivation
• Start from start symbol (S)
• Productions are used to derive a sequence of tokens from the start symbol
• For arbitrary strings α, β and γ and a production A → β, a single step of derivation is
    αAγ ⇒ αβγ
  – i.e., substitute β for an occurrence of A
    (S + E) + E → (E + S + E) + E      (A = S, β = E + S)
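A single derivation step, substituting a production's right-hand side for one occurrence of its left-hand side, is a list splice. The sketch below reproduces the slide's example step; `derive_step` and the token-list representation are assumptions for illustration.

```python
def derive_step(form, index, lhs, rhs):
    """Replace the non-terminal lhs at form[index] with the symbols in rhs."""
    if form[index] != lhs:
        raise ValueError(f"expected {lhs} at position {index}, found {form[index]}")
    # prefix + replacement + suffix
    return form[:index] + rhs + form[index + 1:]

before = ["(", "S", "+", "E", ")", "+", "E"]          # (S + E) + E
after = derive_step(before, 1, "S", ["E", "+", "S"])  # apply S -> E + S
print(" ".join(after))  # ( E + S + E ) + E
```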
Derivation & Parse Tree
    S → E + S | E
    E → number | ( S )
    (1+2+(3+4))+5

• Tree representation of the derivation
• Leaves of tree are terminals; in-order traversal yields string
• Internal nodes: non-terminals
• No information about order of derivation steps

Parse Tree
    S
    ├─ E
    │  ├─ (
    │  ├─ S
    │  │  ├─ E ── 1
    │  │  ├─ +
    │  │  └─ S
    │  │     ├─ E ── 2
    │  │     ├─ +
    │  │     └─ S ── E
    │  │             ├─ (
    │  │             ├─ S
    │  │             │  ├─ E ── 3
    │  │             │  ├─ +
    │  │             │  └─ S ── E ── 4
    │  │             └─ )
    │  └─ )
    ├─ +
    └─ S ── E ── 5

Derivation
    S → E + S → ( S ) + S → ( E + S ) + S → (1 + S) + S → (1 + E + S) + S
      → (1 + 2 + S) + S → … → (1 + 2 + ( S )) + S → (1 + 2 + ( E + S )) + S
      → … → (1 + 2 + (3 + E)) + S → … → (1 + 2 + (3 + 4)) + 5


Parse Tree
• Also called “concrete syntax”

parse tree / concrete syntax: the tree shown above

abstract syntax tree (discards/abstracts unneeded information):
    +
    ├─ +
    │  ├─ 1
    │  └─ +
    │     ├─ 2
    │     └─ +
    │        ├─ 3
    │        └─ 4
    └─ 5
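The relationship between the parse tree, the input string, and the abstract syntax tree can be sketched with nested tuples. Everything here is an illustrative assumption: the helpers `S` and `E`, the `(label, children)` node shape, and the tuple form `("+", left, right)` for AST nodes.

```python
def S(*children):
    return ("S", list(children))

def E(*children):
    return ("E", list(children))

# Parse tree for (1+2+(3+4))+5 under S -> E + S | E, E -> number | ( S ).
INNER = E("(", S(E("3"), "+", S(E("4"))), ")")                  # (3+4)
LEFT = E("(", S(E("1"), "+", S(E("2"), "+", S(INNER))), ")")    # (1+2+(3+4))
TREE = S(LEFT, "+", S(E("5")))

def leaves(node):
    """In-order list of the terminals at the leaves of a parse tree."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1]:
        result.extend(leaves(child))
    return result

def to_ast(node):
    """Drop parentheses and single-child chains, keeping only + and numbers."""
    label, kids = node
    if label == "S":
        if len(kids) == 3:                      # S -> E + S
            return ("+", to_ast(kids[0]), to_ast(kids[2]))
        return to_ast(kids[0])                  # S -> E
    if len(kids) == 3:                          # E -> ( S )
        return to_ast(kids[1])
    return int(kids[0])                         # E -> number

print("".join(leaves(TREE)))  # (1+2+(3+4))+5
print(to_ast(TREE))           # ('+', ('+', 1, ('+', 2, ('+', 3, 4))), 5)
```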




Derivation order
• Can choose to apply productions in any order; select any non-terminal A
    αAγ ⇒ αβγ
• Two standard orders: left- and right-most -- useful for different kinds of automatic parsing
• Leftmost derivation: in the string, find the left-most non-terminal and apply a production to it
    E + S → 1 + S
• Rightmost derivation: find right-most non-terminal…etc.
    E + S → E + E + S


Example
    S → E + S | E
    E → number | ( S )

• Left-most derivation
    S → E+S → (S)+S → (E+S)+S → (1+S)+S → (1+E+S)+S
      → (1+2+S)+S → (1+2+E)+S → (1+2+(S))+S
      → (1+2+(E+S))+S → (1+2+(3+S))+S → (1+2+(3+E))+S
      → (1+2+(3+4))+S → (1+2+(3+4))+E → (1+2+(3+4))+5
• Right-most derivation
    S → E+S → E+E → E+5 → (S)+5 → (E+S)+5 → (E+E+S)+5
      → (E+E+E)+5 → (E+E+(S))+5 → (E+E+(E+S))+5
      → (E+E+(E+E))+5 → (E+E+(E+4))+5 → (E+E+(3+4))+5
      → (E+2+(3+4))+5 → (1+2+(3+4))+5
• Same parse tree: same productions chosen, diff. order
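The two standard orders differ only in which non-terminal gets picked for the next step. A small sketch, assuming sentential forms are lists of symbols and using the hypothetical helper `expand`, reproduces the two example steps from the "Derivation order" slide.

```python
NONTERMINALS = {"S", "E"}

def expand(form, lhs, rhs, leftmost=True):
    """Apply lhs -> rhs to the left-most (or right-most) non-terminal in form."""
    positions = [i for i, symbol in enumerate(form) if symbol in NONTERMINALS]
    if not positions:
        raise ValueError("no non-terminal left to expand")
    index = positions[0] if leftmost else positions[-1]
    if form[index] != lhs:
        raise ValueError(f"expected {lhs} at position {index}, found {form[index]}")
    return form[:index] + rhs + form[index + 1:]

form = ["E", "+", "S"]
left_step = expand(form, "E", ["1"])                            # E -> number (here 1)
right_step = expand(form, "S", ["E", "+", "S"], leftmost=False) # S -> E + S
print(" ".join(left_step))   # 1 + S
print(" ".join(right_step))  # E + E + S
```

The sketch does not check that `lhs -> rhs` is actually a production of the grammar; a fuller version would look it up in the production list.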
Ambiguous Grammars
• In example grammar, left-most and right-most derivations produced identical parse trees
• + operator associates to right in parse tree regardless of derivation order

    (1+2+(3+4))+5

    +
    ├─ +
    │  ├─ 1
    │  └─ +
    │     ├─ 2
    │     └─ +
    │        ├─ 3
    │        └─ 4
    └─ 5


An Ambiguous Grammar
• + associates to right because of right-recursive production S → E + S
• Consider another grammar:
    S → S + S | S * S | number
• Different derivations produce different parse trees: ambiguous grammar
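One way to see the ambiguity of S → S + S | S * S | number is to count how many distinct parse trees a token sequence has. The sketch below does this by trying every operator as the root of the tree; `count_parse_trees` is a hypothetical helper, and tokens are assumed to be strings of digits or the operators + and *.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under S -> S + S | S * S | number."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways to derive tokens[lo:hi] from S.
        if hi - lo == 1:
            return 1 if tokens[lo].isdigit() else 0      # S -> number
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):                  # operator at the root
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees("1 + 2 * 3".split()))  # 2
```

Any count above 1 means the grammar is ambiguous for that input; the count grows quickly with the number of operators.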




Differing Parse Trees
    S → S + S | S * S | number
• Consider expression 1 + 2 * 3
• Derivation 1: S → S + S → 1 + S → 1 + S * S → 1 + 2 * S → 1 + 2 * 3
• Derivation 2: S → S * S → S * 3 → S + S * 3 → S + 2 * 3 → 1 + 2 * 3

    Derivation 1:
    +
    ├─ 1
    └─ *
       ├─ 2
       └─ 3

        ≠

    Derivation 2:
    *
    ├─ +
    │  ├─ 1
    │  └─ 2
    └─ 3


Impact of Ambiguity
• Different parse trees correspond to different evaluations!
• Meaning of program not defined

    +                  = 7
    ├─ 1
    └─ *
       ├─ 2
       └─ 3

    *                  = 9
    ├─ +
    │  ├─ 1
    │  └─ 2
    └─ 3
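Evaluating the two trees for 1 + 2 * 3 shows why the ambiguity matters. The tuple encoding `(operator, left, right)` and the function `evaluate` are assumptions made for this sketch.

```python
def evaluate(node):
    """Evaluate a tree whose leaves are ints and whose nodes are (op, left, right)."""
    if isinstance(node, int):
        return node
    op, left, right = node
    left_value, right_value = evaluate(left), evaluate(right)
    return left_value + right_value if op == "+" else left_value * right_value

tree_1 = ("+", 1, ("*", 2, 3))   # from derivation 1
tree_2 = ("*", ("+", 1, 2), 3)   # from derivation 2
print(evaluate(tree_1), evaluate(tree_2))  # 7 9
```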
Eliminating Ambiguity
• Often can eliminate ambiguity by adding non-terminals & allowing recursion only on right or left
    S → S + T | T
    T → T * num | num

    S
    ├─ S ── T ── 1
    ├─ +
    └─ T
       ├─ T ── 2
       ├─ *
       └─ 3

• T non-terminal enforces precedence
• Left-recursion : left-associativity


Limits of CFGs
• Syntactic analysis can’t catch all “syntactic” errors
• Example: C++
    HashTable<Key,Value> x;
• Need to know whether HashTable is the name of a type to understand syntax! Problem: “<”, “,” are overloaded
• Iota:
    f(4)[1][2] = 0;
• Difficult to write grammar for LHS of assign – may be easier to allow all exprs, check later
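The effect of the unambiguous grammar S → S + T | T, T → T * num | num can be sketched with a small hand-written parser. Since a naive recursive function cannot follow a left-recursive production directly, this sketch replaces each left recursion with a loop that builds the tree leftward, which yields the same left-associative trees. The function `parse_sum` and the tuple encoding of trees are assumptions, and error handling for malformed input is minimal.

```python
def parse_sum(tokens):
    """Parse a token list under S -> S + T | T, T -> T * num | num."""
    pos = 0

    def parse_t():
        nonlocal pos
        node = int(tokens[pos])                 # T -> num
        pos += 1
        while pos < len(tokens) and tokens[pos] == "*":
            right = int(tokens[pos + 1])        # T -> T * num
            pos += 2
            node = ("*", node, right)
        return node

    node = parse_t()                            # S -> T
    while pos < len(tokens) and tokens[pos] == "+":
        pos += 1
        node = ("+", node, parse_t())           # S -> S + T
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return node

print(parse_sum("1 + 2 * 3".split()))  # ('+', 1, ('*', 2, 3))
print(parse_sum("1 + 2 + 3".split()))  # ('+', ('+', 1, 2), 3)
```

Because `*` is only handled inside `parse_t`, it binds tighter than `+`, and each input now has exactly one tree.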




CFGs
• Context-free grammars allow concise specification of programming languages
• CFG specifies how to convert token stream to parse tree (if unambiguous!)
• Read Appel 3.1, 3.2

Next time: implementing a top-down parser (leftmost derivation)


