					Chapter 4
Lexical and Syntax Analysis

   CS 350 Programming Language Design
    Indiana University – Purdue University
                 Fort Wayne
Chapter 4 topics
 Lexical Analysis
 Recursive-Descent Parsing
 Bottom-Up Parsing

 The syntax analysis portion of a compiler typically
 consists of two parts:
   A low-level part called a lexical analyzer
       A deterministic finite automaton (DFA)
       Based on a regular grammar
   A high-level part called a parser (or syntax analyzer)
       A push-down automaton
       Based on a context-free grammar described with BNF

 Reasons to use BNF to describe syntax
    Provides a clear and concise syntax description
    The parser can be based directly on the BNF
    Parsers based on BNF are easy to maintain
 Reasons to separate lexical and syntax analysis:
        Less complex approaches can be used for lexical analysis
        Separating them out simplifies the parser
        Separation allows optimization of the lexical analyzer
        Parts of the lexical analyzer may not be portable, but the parser always is
Lexical Analysis
 A lexical analyzer is a pattern matcher for
 character strings
 A lexical analyzer is a “front-end” for the parser
 Identifies substrings of the source program that
 belong together (lexemes)
   Lexemes match a character pattern, which is
   associated with a lexical category called a token
        myCount is a lexeme
        The token for myCount might be called IDENT

Lexical Analysis
 A lexical analyzer also . . .
    Skips comments
    Skips blanks outside lexemes
    Inserts lexemes for identifiers and literals into a
    symbol table
    Detects syntactic errors in lexemes
        Ill-formed floating-point literals, for example

Lexical Analysis
 The lexical analyzer is typically a function that is called by
 the parser when it needs the next token
 Three common approaches to building a lexical analyzer
    Write a formal description of the tokens and use a software tool
    that constructs table-driven lexical analyzers using the description
        A UNIX tool that does this is lex
    Draw a state transition diagram that describes the tokens and
    write a program that implements the state diagram
    Draw a state transition diagram that describes the tokens and
    hand-construct a table-driven implementation of the state diagram
 We examine the second approach
Lexical Analysis
 State diagram design
    A naïve state diagram would have a transition from every state
    resulting from every character in the source language
    Such a diagram would be very large!
 Transitions can usually be combined to simplify the state diagram
    To recognize an identifier, all uppercase and lowercase letters
    are equivalent
        Use a character class that includes all letters
    To recognize an integer literal, all digits are equivalent
        Use a digit class
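
A character class can be computed with one small function, so the state diagram needs one transition per class rather than one per character. The sketch below is only an illustration: the enum names and the `classify` function are assumptions, since the slides name only the letter and digit classes.

```c
#include <ctype.h>

/* Hypothetical class codes; the slides name only the letter and digit classes. */
enum CharClass { LETTER, DIGIT, OTHER, END_OF_INPUT };

/* Map one input character (or a negative end-of-input marker such as EOF)
   to its character class. */
enum CharClass classify(int c) {
    if (c < 0)
        return END_OF_INPUT;
    if (isalpha((unsigned char)c))
        return LETTER;      /* all uppercase and lowercase letters are equivalent */
    if (isdigit((unsigned char)c))
        return DIGIT;       /* all digits are equivalent */
    return OTHER;
}
```

With this in place, a transition labeled LETTER covers all 52 letters at once.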

Example state transition diagram

Lexical Analysis
 Reserved words and identifiers can be recognized together
    Then use a table lookup to determine whether a
    possible identifier is in fact a reserved word
    The alternative is to have a separate part of the diagram for
    each reserved word
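
The table-lookup approach might look like the following sketch. The token codes and the three reserved words are made up for illustration; a real language would supply its own list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes, not taken from the slides. */
enum { IDENT = 1, KW_IF, KW_WHILE, KW_RETURN };

struct reserved { const char *word; int token; };

/* Illustrative reserved-word table. */
static const struct reserved RESERVED[] = {
    { "if", KW_IF }, { "while", KW_WHILE }, { "return", KW_RETURN },
};

/* Return the reserved-word token for lexeme, or IDENT if it is not reserved. */
int lookup(const char *lexeme) {
    size_t n = sizeof RESERVED / sizeof RESERVED[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, RESERVED[i].word) == 0)
            return RESERVED[i].token;
    return IDENT;
}
```

A linear scan is enough for a short list; a larger reserved-word set would more likely use a hash table or a sorted array with binary search.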

Lexical Analysis
 A lexical analyzer typically has several instance variables
    Character nextChar
    CharClass charClass (letter, digit, etc.)
    String lexeme
    int tokenType
 Some convenient utility subprograms are . . .
    getChar
        Gets the next character of input, puts it in nextChar, determines its
         class, and puts the class in charClass
    addChar
        Adds the character from nextChar to the lexeme string
    lookup
        Determines whether the string in lexeme is a reserved word
        Returns the code for the Ident token or for the appropriate reserved word token

A simple lexical analyzer
/* a simple lexical analyzer */
int lex( ) {
   getChar( );
   switch ( charClass ) {

   /* Parse identifiers and reserved words */
   case LETTER:
      addChar( );
      getChar( );
      while ( charClass == LETTER || charClass == DIGIT ) {
         addChar( );
         getChar( );
      }
      tokenType = lookup( lexeme );
      break;
      // … continued on next slide …

A simple lexical analyzer
   /* Parse integer literals */
   case DIGIT:
      addChar( );
      getChar( );
      while ( charClass == DIGIT ) {
         addChar( );
         getChar( );
      }
      tokenType = INT_LIT;
      break;
   } /* End of switch */
   return tokenType;
} /* End of function lex */
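
For experimentation, here is a self-contained sketch of the same idea that reads from a string. It is not the slide code: the token codes, `lex_init`, the fixed-size `lexeme` buffer, and the whitespace skipping are assumptions, and the reserved-word lookup is omitted. It also primes `nextChar` once in `lex_init` instead of calling `getChar` at the top of every `lex` call, so the character that ended one lexeme is still available when the next call starts.

```c
#include <ctype.h>

enum { LETTER, DIGIT, OTHER, END };               /* character classes */
enum { IDENT = 1, INT_LIT, UNKNOWN, EOF_TOKEN };  /* hypothetical token codes */

static const char *src;     /* remaining input text */
static int nextChar;        /* most recently read character */
static int charClass;       /* class of nextChar */
static char lexeme[64];     /* text of the current lexeme */
static int lexLen;

/* Read the next character into nextChar and classify it. */
static void getChar(void) {
    nextChar = (unsigned char)*src;
    if (nextChar == '\0') { charClass = END; return; }
    src++;
    if (isalpha(nextChar))      charClass = LETTER;
    else if (isdigit(nextChar)) charClass = DIGIT;
    else                        charClass = OTHER;
}

/* Append nextChar to lexeme, silently truncating overlong lexemes. */
static void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = (char)nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* Start scanning text; reads the first character ahead. */
void lex_init(const char *text) { src = text; getChar(); }

/* Return the next token code and leave its text in lexeme. */
int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    while (charClass != END && isspace(nextChar))
        getChar();                       /* skip blanks outside lexemes */
    switch (charClass) {
    case LETTER:                         /* identifiers */
        do { addChar(); getChar(); }
        while (charClass == LETTER || charClass == DIGIT);
        return IDENT;
    case DIGIT:                          /* integer literals */
        do { addChar(); getChar(); } while (charClass == DIGIT);
        return INT_LIT;
    case END:
        return EOF_TOKEN;
    default:                             /* any other single character */
        addChar();
        getChar();
        return UNKNOWN;
    }
}
```

Calling `lex_init("sum1 42 +")` and then `lex()` repeatedly yields an identifier, an integer literal, an unknown single-character token, and finally the end-of-input token.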

 A parser is a recognizer for a context-free language
 Given an input program, a parser . . .
   Finds all syntax errors
       For each syntax error, an appropriate diagnostic message is produced
       Recovery is attempted to find additional syntax errors
   Produces the parse tree for the program
       Possibly just a traversal of the nodes of the parse tree in

 Two categories of parsers
   Top down
       Produce the parse tree, beginning at the root
       Order is that of a leftmost derivation
   Bottom up
       Produce the parse tree, beginning at the leaves
       Order is that of the reverse of a rightmost derivation
 Parsers look only one token ahead in the input

Top-down parsers
 A top-down parser traces the parse tree in preorder
 It produces a leftmost derivation of the program
 Partway through the leftmost derivation, suppose that the
 sentential form, xAα, has been derived
    Using the notational conventions of pp. 182-183 . . .
        x is a string of terminal symbols
        A is a single nonterminal symbol
        α is a mixed string of terminals and/or nonterminals
 Nonterminal A must be replaced next (leftmost derivation)
 As a nonterminal symbol, A has a nonempty set of
 production rules
    Call these the A-rules
    If more than one, which A-rule should be used?
Top-down parsers
 The parser must choose the correct A-rule to get the
 next sentential form in the leftmost derivation
   The parser is guided by the single lookahead token
   The chosen A-rule must uniquely produce the lookahead token
 The most common top-down parsing algorithms . . .
   Recursive descent parser
       A coded implementation
   LL parsers
       Table driven implementation
       Left-to-right scan of tokens produces a Leftmost derivation
Bottom-up parsers
 Start with the tokens of the program and work back
 to the start symbol
   We end up with a rightmost derivation in reverse order
 Try to match the RHS of some production rule with
 a substring of tokens and replace the substring
 with the LHS of the production rule
   This is called a reduction
 The goal is to find a series of reductions
   Each reduction should produce the previous sentential
   form in a rightmost derivation
Bottom-up parsers
   More than one RHS may match input
   The correct RHS must be correctly selected based only
   on the lookahead token
   The correct RHS is called the handle
 The most common bottom-up parsing algorithms
 are in the LR family
   Table driven implementation
   Left-to-right scan of tokens produces a Rightmost
   derivation (in reverse order)
The Complexity of Parsing
 Parsers that work for any unambiguous grammar
 are complex and inefficient
   The big-O is O(n^3), where n is the length of the input
   General parsers often reach dead ends and must back
   up and reparse
 Practical parsers only work for a subset of all
 unambiguous grammars
   The big-O of these is O(n), where n is the input length
   Such grammars can usually be found

Recursive-descent parsing
 This involves a subprogram for each nonterminal
 in the grammar
   This subprogram parses the sub-sentences that can be
   generated by that nonterminal
 Recursive production rules lead to recursive subprograms
 EBNF is ideally suited for being the basis for a
 recursive-descent parser
   EBNF minimizes the number of nonterminals

Recursive-descent parsing
 Consider a grammar for simple expressions
 <expr> → <term> { ( + | - ) <term> }
 <term> → <factor> { ( * | / ) <factor> }
 <factor> → id | ( <expr> )

 For a production rule LHS with only one RHS . . .
   Work through the RHS, symbol-by-symbol
   For any terminal symbol, compare it with the lookahead
        If they match, continue; else there is an error
   For any nonterminal symbol, call the symbol’s
   associated parsing subprogram
Recursive-descent parsing
 Assume we have a lexical analyzer named lex,
 which puts the next token code in nextToken
 /* Function expr parses strings in the language generated by the rule:
                 <expr> → <term> { ( + | - ) <term> }                       */
 void expr( ) {
    /* Parse the first term */
    term( );
    /* As long as the next token is + or -, call lex to get the next token,
       and parse the next term                                              */
    while ( nextToken == PLUS_CODE || nextToken == MINUS_CODE ) {
        lex( );
        term( );
    }
 }
 Convention: term() and every other parsing subprogram leaves the next
 token in nextToken when it finishes

 This particular routine does not detect errors
Recursive-descent parsing
 A production rule LHS that has more than one
 RHS requires an additional step to determine
 which RHS it is to parse
   The correct RHS is chosen on the basis of the
   lookahead token
   The lookahead is compared with the first token that can
   be generated by each RHS until a match is found
       The possible tokens that can be generated must be
        determined by analysis when the compiler is constructed
   If no match is found, it is a syntax error

Recursive-descent parsing
/* Function factor parses strings in the language generated by the rule
                         <factor> → id | ( <expr> )                                  */
 void factor( ) {
    /* Determine which RHS */
    if ( nextToken == ID_CODE )
       /* For the RHS id, just call lex */
       lex( );
    else if ( nextToken == LEFT_PAREN_CODE ) {
       /* If the RHS is (<expr>), call lex to pass over the left parenthesis,
          call expr, and check for the right parenthesis                      */
       lex( );
       expr( );
       if ( nextToken == RIGHT_PAREN_CODE )
          lex( );
       else
          error( );
    }
    else error( ); /* Neither RHS matches */
 } /* end of factor */
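
The three parsing subprograms can be put together into a small recognizer for the whole expression grammar. The sketch below is one possible arrangement, not the course's reference code: the tiny built-in `lex`, the `END_CODE` and `BAD_CODE` token codes, the error counter, and the `parse` driver are all assumptions added so the example runs on its own.

```c
#include <ctype.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *input;   /* remaining source text */
static int nextToken;       /* lookahead token code */
static int errorCount;

/* Minimal stand-in lexer: identifiers, operators, and parentheses only. */
static void lex(void) {
    while (*input == ' ') input++;
    unsigned char c = (unsigned char)*input;
    if (c == '\0') { nextToken = END_CODE; return; }
    input++;
    if (isalpha(c)) {
        while (isalnum((unsigned char)*input)) input++;
        nextToken = ID_CODE;
        return;
    }
    switch (c) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

static void error(void) { errorCount++; }

static void expr(void);

/* <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE) {
        lex();
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    } else {
        error();                /* neither RHS matches */
    }
}

/* <term> -> <factor> { ( * | / ) <factor> } */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> -> <term> { ( + | - ) <term> } */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Return 1 if text is a complete expression, 0 otherwise. */
int parse(const char *text) {
    input = text;
    errorCount = 0;
    lex();                      /* load the first lookahead token */
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

Each subprogram follows the convention stated earlier: it leaves the first token after its construct in `nextToken`, which is why `parse` can check for end of input immediately after `expr` returns.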
Recursive-descent parsing
 The LL grammar class has a problem with left recursion
    If a grammar has left recursion, either direct or indirect,
    it cannot be the basis for a top-down parser
    For example, no production rule may have the form A → A α
    A recursive descent parser subprogram for A would
    immediately call itself, resulting in an infinite chain of
    recursive calls
    Fortunately, a grammar can be modified to remove left recursion
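
As a small illustration of one common repair, take the hypothetical left-recursive rule E → E - T | T, where T is a single digit (this grammar is an assumption, not one from the slides). Rewritten in EBNF as E → T { - T }, the repetition becomes a loop, and accumulating the result inside the loop keeps subtraction left associative. The sketch assumes well-formed input and does no error checking.

```c
/* Evaluate single digits separated by '-' using the rewritten rule
   E -> T { - T }. Assumes well-formed input such as "8-3-2". */
int eval_diff(const char *s) {
    int value = *s++ - '0';      /* the first T */
    while (*s == '-') {          /* zero or more occurrences of "- T" */
        s++;                     /* pass over the '-' */
        value -= *s++ - '0';     /* fold the next T in from the left */
    }
    return value;
}
```

For "8-3-2" the loop computes (8 - 3) - 2, the same grouping the left-recursive rule describes, without any subprogram calling itself before consuming input.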

Recursive-descent parsing
 The LL grammar class also has a problem with
 pairwise disjointness
   Lack of pairwise disjointness is another characteristic
   of grammars that disallows top-down parsing
   This is the inability to determine the correct RHS on the
   basis of one lookahead token

Pairwise disjointness problem
 Define the FIRST set of a symbol string α by
 FIRST(α) = { a | α =>* aβ }
    =>* means 0 or more production rule replacements
    If α =>* ε is possible, ε is in FIRST(α)
        Here ε represents the empty string
 The production rule A → α is only possible in a leftmost
 derivation when A is the leftmost nonterminal and the
 lookahead symbol is in FIRST(α)
 Pairwise disjointness test
    Let A be any LHS nonterminal with more than one RHS
    Then, for each pair of rules, A → α_i and A → α_k, it must be true that
         FIRST(α_i) ∩ FIRST(α_k) = ∅
Pairwise disjointness problem
    The following group of production rules passes the pairwise
    disjointness test
            A → a | bB | cAb
    This group of production rules does not pass
            A → a | aB
 A grammar that fails the pairwise disjointness test
 can often be modified successfully using left factoring
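
Once the FIRST sets are known, the pairwise disjointness test itself is mechanical. The sketch below assumes terminals are the lowercase letters 'a' through 'z' and that each FIRST set has already been computed by hand and stored as a bit mask; the `TERM` macro and the function name are illustrative.

```c
/* FIRST sets as bit masks: bit (c - 'a') is set when terminal c
   can begin a string derived from that RHS. */
#define TERM(c) (1u << ((c) - 'a'))

/* Return 1 if every pair of the n FIRST sets is disjoint, 0 otherwise. */
int pairwise_disjoint(const unsigned first[], int n) {
    for (int i = 0; i < n; i++)
        for (int k = i + 1; k < n; k++)
            if (first[i] & first[k])     /* nonempty intersection */
                return 0;
    return 1;
}
```

For A → a | bB | cAb the FIRST sets are {a}, {b}, and {c}, so the test passes. For A → a | aB both sets are {a}, so it fails.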

Left factoring example
 The production rule group
    <id_list> → identifier | identifier , <id_list>
 fails the pairwise disjointness test
 Replace the group with
    <id_list> → identifier <new>
    <new> → , <id_list> | ε
 Recall that ε represents the empty string
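
The left-factored rules translate directly into two parsing subprograms. In this sketch the character-level scanning, the `ok` flag, and the names `new_tail` and `parse_id_list` are assumptions made so the example is self-contained.

```c
#include <ctype.h>

static const char *p;   /* current position in the input */
static int ok;          /* cleared when a syntax error is found */

static void skip(void) { while (*p == ' ') p++; }

/* Match one identifier: a letter followed by letters or digits. */
static void identifier(void) {
    skip();
    if (!isalpha((unsigned char)*p)) { ok = 0; return; }
    while (isalnum((unsigned char)*p)) p++;
}

static void id_list(void);

/* <new> -> , <id_list> | ε */
static void new_tail(void) {
    skip();
    if (*p == ',') {        /* lookahead selects the first RHS */
        p++;
        id_list();
    }
    /* otherwise the empty RHS applies and nothing is consumed */
}

/* <id_list> -> identifier <new> */
static void id_list(void) {
    identifier();
    if (ok) new_tail();
}

/* Return 1 if text is a complete identifier list, 0 otherwise. */
int parse_id_list(const char *text) {
    p = text;
    ok = 1;
    id_list();
    skip();
    return ok && *p == '\0';
}
```

After factoring, the decision between the two alternatives of `<new>` depends only on whether the lookahead is a comma, which is exactly what one-token lookahead can decide.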

Bottom-up parsing
 Recall that a bottom-up parser produces a
 rightmost derivation in reverse order by reading
 input from left to right
  Simple grammar        Rightmost derivation
  E → E + T | T         E
  T → T * F | F         => E + T
  F → ( E ) | id        => E + T * F

Bottom-up parsing
 Given a right sentential form, the bottom-up parsing
 problem is to find the correct RHS (the handle) to reduce
 to a LHS to get the previous right sentential form in a
 rightmost derivation
 Some handle definitions
    Definition: β is the handle of the right sentential form
    γ = αβw if and only if S =>* αAw => αβw
    Definition: β is a phrase of the right sentential form γ
    if and only if S =>* γ = α_1 A α_2 =>+ α_1 β α_2
    Definition: β is a simple phrase of the right sentential form γ
    if and only if S =>* γ = α_1 A α_2 => α_1 β α_2

Bottom-up parsing
 Intuition about handles
   The handle of a right sentential form is its leftmost
   simple phrase
   Given a parse tree, it is now easy to find the handle
       Of course, you are not given the parse tree in advance
   Parsing can be thought of as handle pruning
 Examples from derivation of slide 31
   Worked out in class

Bottom-up parsing
 Bottom-up parsers are often called shift-reduce parsers
 The focus of parser activity is a parse stack
   Holds current right sentential form during reverse
   rightmost derivation
 Shift activity and reduce activity
   Reduce is the action of replacing the handle on the top
   of the parse stack with its corresponding LHS
   Shift is the action of moving the next input token to the
   top of the parse stack
Bottom-up parsing
 Advantages of LR parsers
   They will work for nearly all grammars that describe
   programming languages.
   They work on a larger class of grammars than other
   bottom-up algorithms, but are as efficient as any other
   bottom-up parser.
   They can detect syntax errors as soon as it is possible
       LL parsers also have this property
   The LR class of grammars is a superset of the class of
   grammars that can be parsed by LL parsers

Bottom-up parsing
 LR parsers
   Are table driven
   It is usually not practical to construct a table by hand
   The table must be constructed automatically from the
   grammar by a program
       For example, the UNIX program yacc

Bottom-up parsing
 LR parsing was discovered by Donald Knuth (1965)
 Knuth’s insight
   A bottom-up parser can use the entire history of the
   parse, up to the current point, to make parsing decisions
   There are only a finite and relatively small number of
   different parse situations that could have occurred, so
   the history can be stored as a sequence of states S_m, on
   the parse stack

Bottom-up parsing
 An LR configuration is the entire state of an LR parser
 It can be represented by
      (S_0 X_1 S_1 X_2 S_2 … X_m S_m, a_i a_{i+1} … a_n $)
    The uppercase letters represent the parse stack
    Letters X_1, X_2, …, X_m represent the current right
    sentential form within the parse stack
    The lowercase letters represent the unread input
        a_i is the lookahead symbol
    There is one state S_k for each grammar symbol X_k on
    the parse stack
Bottom-up parsing
 LR parser operation

Bottom-up parsing
 LR parser table has two components
   ACTION table
       The ACTION table specifies the action of the parser, given
        the parser state and the next token (see next slide)
          • Rows are state names
              – Current state is on top of the parse stack
          • Columns are terminals
              – Current lookahead symbol is used
          • Action R3 indicates to reduce by production rule #3
          • Action S5 indicates to shift and then push state #5 on the stack
   GOTO table
       The GOTO table specifies which state to put on top of the
        parse stack after a reduction action has taken place
          • Rows are state names
          • Columns are nonterminals
Form of an LR parsing table

     Note: an empty box indicates an error situation
Form of an LR parsing table
 This parsing table resulted from the grammar
             1. E → E + T
             2. E → T
             3. T → T * F
             4. T → F
             5. F → ( E )
             6. F → id

 Initial LR configuration: (S_0, a_1 … a_n $)

Parser actions
 Assume the configuration is currently (S_0 X_1 S_1 X_2 S_2 … X_m S_m, a_i … a_n $)
    The lookahead symbol is a_i
    S_m is on the top of the stack
 If ACTION[S_m, a_i] = Shift S, the next configuration is
           (S_0 X_1 S_1 X_2 S_2 … X_m S_m a_i S, a_{i+1} … a_n $)
 after pushing a_i and S on the stack
 If ACTION[S_m, a_i] = Reduce A → β and S = GOTO[S_{m-r}, A],
 where r = the length of β, the next configuration is
      (S_0 X_1 S_1 X_2 S_2 … X_{m-r} S_{m-r} A S, a_i a_{i+1} … a_n $)
    after popping 2r items off the stack and pushing A and S
 If ACTION[S_m, a_i] = Accept, the parse is complete and no errors were found
 If ACTION[S_m, a_i] = Error, the parser calls an error-handling routine
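
The four parser actions can be sketched as a short table-driven loop. The tables below are the SLR(1) tables commonly given in textbooks for the six-rule expression grammar; since the table figure did not survive in these notes, treat them as an assumed stand-in for it. The encoding (positive = shift, negative = reduce, 0 = error) and all names are illustrative. As a simplification, this stack holds states only, so a reduction by a rule of length r pops r entries rather than 2r.

```c
enum { ID, PLUS, STAR, LP, RP, END };   /* terminal column indexes */
enum { NT_E, NT_T, NT_F };              /* nonterminal column indexes */
#define ACC 99                          /* accept marker */

/* ACTION[state][terminal]: n > 0 shift and go to state n,
   n < 0 reduce by rule -n, 0 error, ACC accept. */
static const int ACTION[12][6] = {
    /*         id   +    *    (    )    $  */
    /*  0 */ {  5,   0,   0,   4,   0,   0 },
    /*  1 */ {  0,   6,   0,   0,   0, ACC },
    /*  2 */ {  0,  -2,   7,   0,  -2,  -2 },
    /*  3 */ {  0,  -4,  -4,   0,  -4,  -4 },
    /*  4 */ {  5,   0,   0,   4,   0,   0 },
    /*  5 */ {  0,  -6,  -6,   0,  -6,  -6 },
    /*  6 */ {  5,   0,   0,   4,   0,   0 },
    /*  7 */ {  5,   0,   0,   4,   0,   0 },
    /*  8 */ {  0,   6,   0,   0,  11,   0 },
    /*  9 */ {  0,  -1,   7,   0,  -1,  -1 },
    /* 10 */ {  0,  -3,  -3,   0,  -3,  -3 },
    /* 11 */ {  0,  -5,  -5,   0,  -5,  -5 },
};

/* GOTO[state][nonterminal]; 0 marks entries that are never consulted. */
static const int GOTO[12][3] = {
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {8, 2, 3}, {0, 0, 0},
    {0, 9, 3}, {0, 0, 10}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};

/* LHS nonterminal and RHS length of rules 1..6 (index 0 unused). */
static const int RULE_LHS[7] = { 0, NT_E, NT_E, NT_T, NT_T, NT_F, NT_F };
static const int RULE_LEN[7] = { 0, 3, 1, 3, 1, 3, 1 };

/* Parse a token array terminated by END; return 1 on accept, 0 on error. */
int lr_parse(const int *tokens) {
    int stack[256];
    int top = 0;
    stack[0] = 0;                                /* initial state S_0 */
    for (;;) {
        int act = ACTION[stack[top]][*tokens];
        if (act == ACC)
            return 1;
        if (act > 0) {                           /* shift */
            if (top + 1 >= 256) return 0;        /* stack would overflow */
            stack[++top] = act;
            tokens++;
        } else if (act < 0) {                    /* reduce by rule -act */
            int rule = -act;
            top -= RULE_LEN[rule];               /* pop the handle */
            int next = GOTO[stack[top]][RULE_LHS[rule]];
            stack[++top] = next;
        } else {
            return 0;                            /* error entry */
        }
    }
}
```

For the token sequence id + id * id $, the loop performs the reductions F → id, T → F, E → T, F → id, T → F, F → id, T → T * F, E → E + T, and then accepts, which is the rightmost derivation in reverse.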

