Chap. 4, Formal Grammars and Parsing
J. H. Wang
Mar. 18, 2011

Outline
• Introduction
• Context-Free Grammars
• Properties of CFGs
• Transforming Extended Grammars
• Parsers and Recognizers
• Grammar Analysis Algorithms

Introduction
• A natural language's grammar: captures a small but important aspect of a sentence's validity with respect to that language
• Regular sets: guide the actions of automatically constructed scanners
  – Chap. 3
• Grammar: guides the actions of parsers
  – Chap. 5, 6
• Semantic analysis: enforces programming language rules that are not easily expressed by grammars
  – Chap. 7, 8, 9

The Role of the Parser
[Figure: source program → Lexical Analyzer → (token) → Parser → (parse tree) → Rest of Front End → intermediate representation. The parser requests tokens from the lexical analyzer ("get next token"); a Symbol Table box accompanies these components.]

Context-Free Grammars
• Components: G = (N, Σ, P, S)
  – A finite terminal alphabet Σ: the set of tokens produced by the scanner
  – A finite nonterminal alphabet N: variables of the grammar
  – A start symbol S: S ∈ N that initiates all derivations
    • Goal symbol
  – A finite set of productions P: A → X1…Xm, where A ∈ N, Xi ∈ N ∪ Σ, 1 ≤ i ≤ m and m ≥ 0.
    • Rewriting rules
• Vocabulary V = N ∪ Σ
  – N ∩ Σ = ∅
• CFG: recipe for creating strings
• Derivation: a rewriting step using the production A → α replaces the nonterminal A with the vocabulary symbols in α
  – Left-hand side (LHS): A
  – Right-hand side (RHS): α
• Context-free language of grammar G, L(G): the set of terminal strings derivable from S

Notations
Names Beginning with        Represent Symbols In    Examples
Uppercase                   N                       A, B, C, Prefix
Lowercase and punctuation   Σ                       a, b, c, if, then, (, ;
X, Y                        N ∪ Σ                   Xi, Y3
Other Greek letters         (N ∪ Σ)*                α, β, γ

• "Or" notation:
  – A → α | β | … | ζ
  – abbreviates A → α, A → β, …, A → ζ
• αAβ => αγβ: one step of derivation using the production A → γ
  – =>+: derives in one or more steps
  – =>*: derives in zero or more steps
• S =>* β: β is a sentential form of the CFG
• SF(G): the set of sentential forms of G
• L(G) = { w ∈ Σ* | S =>+ w }
  – L(G) = SF(G) ∩ Σ*
• Two conventions under which nonterminals are rewritten in a systematic order
  – Leftmost derivation: from left to right
  – Rightmost derivation: from right to left

Leftmost Derivation
• A derivation that always chooses the leftmost possible nonterminal at each step
  – =>lm, =>+lm, =>*lm
  – A left sentential form
    • A sentential form produced via a leftmost derivation
    • E.g. the production sequence in top-down parsers
• (Fig. 4.1)
• E.g. a leftmost derivation of f ( v + v )
  – E =>lm Prefix ( E )
      =>lm f ( E )
      =>lm f ( v Tail )
      =>lm f ( v + E )
      =>lm f ( v + v Tail )
      =>lm f ( v + v )

Rightmost Derivations
• The rightmost possible nonterminal is always expanded
  – Canonical derivation
  – =>rm, =>+rm, =>*rm
  – A right sentential form
    • A sentential form produced via a rightmost derivation
    • E.g. produced by bottom-up parsers (Ch. 6)
• (Fig. 4.1)
• E.g. a rightmost derivation of f ( v + v )
  – E =>rm Prefix ( E )
      =>rm Prefix ( v Tail )
      =>rm Prefix ( v + E )
      =>rm Prefix ( v + v Tail )
      =>rm Prefix ( v + v )
      =>rm f ( v + v )

Parse Trees
• Parse tree: graphical representation of a derivation
  – Root: start symbol S
  – Each node: either a grammar symbol or λ
  – Interior nodes: nonterminals
    • An interior node and its children: a production
  – E.g. Fig. 4.2
• Phrase of a sentential form: a sequence of symbols descended from a single nonterminal in the parse tree
• A simple or prime phrase: a phrase that contains no smaller phrase
• Handle of a sentential form: the leftmost simple phrase
• E.g. f ( v Tail ) in Fig. 4.2

Other Types of Grammars
• Regular grammars: less powerful
• Context-sensitive and unrestricted grammars: more powerful

Regular Grammars
• A CFG that is limited to productions of the form A → aB or C → d
  – RHS: either a symbol from Σ ∪ {λ} followed by a nonterminal symbol, or a symbol from Σ ∪ {λ}
  – Regular set
• E.g. { [^i ]^i | i ≥ 1 } is not regular
  – S → T
    T → [ T ] | λ
• Regular sets are a proper subset of the context-free languages

Beyond Context-Free Grammars
• Context-sensitive grammar: nonterminals are rewritten only when they appear in a particular context (αAβ → αγβ), provided the rule never causes the sentential form to contract in length
• Unrestricted grammar (type-0 grammar): the most general
• More powerful, but less useful
  – Efficient parsers for such grammars do not exist
  – It is difficult to prove properties about such grammars
• CFGs: a nice balance between generality and practicality

Properties of CFGs
• Some grammars might have problems:
  – Include useless symbols
  – Allow multiple, distinct derivations for some input string
  – Include strings not in the language, or exclude strings in the language

Reduced Grammars
• Each of its nonterminals and productions participates in the derivation of some string
  – Useless nonterminals: can be safely removed
  – E.g.
    • S → A | B
      A → a
      B → B b
      C → c
  – Algorithms to detect useless nonterminals
    • Ex. 16 and Ex. 19

Ambiguity
• An ambiguous grammar allows a derived string to have two or more different parse trees
  – E.g.
    • Expr → Expr – Expr | id
    • Two different parse trees for id – id – id
  – Fig. 4.3
  – No algorithm exists for checking an arbitrary CFG for ambiguity
    • Undecidable

Faulty Language Definition
• Terminal strings derivable by the grammar do not correspond exactly to the strings in the language
• Determining in general whether two CFGs generate the same language is an undecidable problem

Transforming Extended Grammars
• BNF (Backus-Naur form)
  – Optional symbols: enclosed in square brackets
    • A → α [X1…Xn] β
  – Repeated symbols: enclosed in braces
    • B → γ {X1…Xm} δ
  – E.g. Java-like declaration
    • Declaration → [final] [static] [const] Type identifier {, identifier }
  – Transforming extended BNF grammars into standard form
    • Fig. 4.4

Parsers and Recognizers
• Recognizer: determines whether input string x ∈ L(G)
• Parser: determines the string's validity and structure (parse tree)
  – Top-down: starting at the root, expanding the tree in a depth-first manner
    • Preorder traversal, predictive
  – Bottom-up: starting at the leaves
    • Postorder traversal
• E.g. grammar
  – Program → begin Stmts end $
    Stmts → Stmt ; Stmts | λ
    Stmt → simplestmt | begin Stmts end
  – String: begin simplestmt; simplestmt; end $
    • Top-down parse: Fig. 4.5
    • Bottom-up parse: Fig. 4.6
• Parsing techniques
  – E.g.
    LL(1), LR(1) are the best-known top-down and bottom-up parsing strategies
    • L: token sequence is processed from left to right
    • L, R: Leftmost or Rightmost parse
    • 1: the number of lookahead symbols

Grammar Analysis Algorithms
• Grammar representation
  – Programming language constructs:
    • A set: an unordered collection of distinct entities
    • A list: an ordered collection of entities
    • An iterator: a construct that enumerates the contents of a set or list
  – Observations
    • Symbols are rarely deleted from a grammar
    • Transformations can add symbols and productions to a grammar
    • Algorithms typically visit all rules for a nonterminal, or visit all occurrences of a symbol in productions
    • A production's RHS is processed one symbol at a time

Grammar Utilities
• Creating or adding:
  – Grammar(S)
  – Production(A, rhs)
  – Nonterminal(A)
  – Terminal(x)
• Iterators:
  – Productions()
  – Nonterminals()
  – Terminals()
  – RHS(p)
  – LHS(p)
  – ProductionsFor(A)
  – Occurrences(X)
  – Tail(y)
• Others
  – IsTerminal(X)
  – Production(y)

Deriving the Empty String
• It is common to need to determine which nonterminals can derive λ
  – Not trivial because the derivation can take more than one step
    • A => BCD => BC => B => λ
  – Fig. 4.7
• The algorithm establishes two structures
  – RuleDerivesEmpty(p)
  – SymbolDerivesEmpty(A)
  – Useful in grammar analysis and parsing algorithms in Chap. 4, 5, and 6

First Sets
• The set of all terminal symbols that can begin a sentential form derivable from the string α
  – First(α) = { a ∈ Σ | α =>* aβ }
  – We never include λ in First(α), even if α =>* λ
  – E.g. (in Fig. 4.1)
    • First(Tail) = {+}
    • First(Prefix) = {f}
    • First(E) = {v, f, (}
  – Fig. 4.8, Fig. 4.9, Fig. 4.10

Follow Sets
• The set of terminals that can follow a nonterminal A in some sentential form
  – For A ∈ N,
    • Follow(A) = { b ∈ Σ | S =>+ αAbβ }
  – The right context associated with A
  – Fig. 4.11
• First and Follow sets can be generalized to include strings of length k
  – Firstk(α), Followk(A)
  – Useful in parsing techniques that use k-symbol lookaheads (e.g. LL(k), LR(k))

More on FIRST and FOLLOW
• Two functions, FIRST and FOLLOW, allow us to choose which production to apply, based on the next input symbol
• FIRST(α): the set of terminals that begin strings derived from α
  – Ex: (Fig. 4.15) A =>* cγ, c is in FIRST(A)
• FOLLOW(A): the set of terminals a that can appear immediately to the right of A in some sentential form
  – Ex: S =>* αAaβ
• To compute FIRST(X) for all grammar symbols X
  – If X is a terminal, FIRST(X) = {X}
  – If X is a nonterminal and X → Y1Y2…Yk, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and Y1…Yi-1 =>* ε
  – If X → ε is a production, add ε to FIRST(X)
• To compute FOLLOW(A) for all nonterminals A
  – Place $ in FOLLOW(S)
  – If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B)
  – If there is a production A → αB, or A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)

Example
• Ex: (4.28)
  – E → T E'
    E' → + T E' | ε
    T → F T'
    T' → * F T' | ε
    F → ( E ) | id
  – FIRST(F) = FIRST(T) = FIRST(E) = {(, id}
  – FIRST(E') = {+, ε}
  – FIRST(T') = {*, ε}
  – FOLLOW(E) = FOLLOW(E') = {), $}
  – FOLLOW(T) = FOLLOW(T') = {+, ), $}
  – FOLLOW(F) = {+, *, ), $}

Thanks for Your Attention!
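
Sketch: Computing FIRST and FOLLOW
The FIRST and FOLLOW rules from the slides can be tried out on the grammar of Example 4.28 with a small fixed-point iteration. The following is a minimal sketch and not the algorithms of Fig. 4.7 to Fig. 4.11: the dictionary-based grammar encoding and the function names (derives_empty, first_of_string, first_sets, follow_sets) are assumptions made for illustration. It follows the "First Sets" convention of never placing λ in a First set and instead records the nonterminals that derive the empty string separately, so FIRST(E') comes out as {+} with E' marked as deriving empty, where the example lists {+, ε}.

```python
# Grammar 4.28: each nonterminal maps to a list of right-hand sides (tuples).
# An empty tuple stands for the empty string (lambda / epsilon).
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
START = "E"


def derives_empty(grammar):
    """Nonterminals A with A =>* lambda, found by iterating to a fixed point."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs not in nullable and any(
                all(sym in nullable for sym in rhs) for rhs in alternatives
            ):
                nullable.add(lhs)
                changed = True
    return nullable


def first_of_string(symbols, first, nullable, grammar):
    """Terminals that can begin a string derived from the symbol sequence."""
    result = set()
    for sym in symbols:
        if sym not in grammar:       # terminal: it begins the string, stop here
            result.add(sym)
            break
        result |= first[sym]
        if sym not in nullable:      # later symbols cannot reach the front
            break
    return result


def first_sets(grammar, nullable):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first, nullable, grammar)
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    return first


def follow_sets(grammar, start, first, nullable):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")           # rule 1: place $ in FOLLOW(S)
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue     # only nonterminals have Follow sets
                    tail = rhs[i + 1:]
                    # rule 2: FIRST of what follows the occurrence
                    new = first_of_string(tail, first, nullable, grammar)
                    # rule 3: if the tail can vanish, add FOLLOW(lhs)
                    if all(s in nullable for s in tail):
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


NULLABLE = derives_empty(GRAMMAR)
FIRST = first_sets(GRAMMAR, NULLABLE)
FOLLOW = follow_sets(GRAMMAR, START, FIRST, NULLABLE)
```

Each loop repeats until no set grows, which terminates because the sets only ever gain terminals from a finite alphabet. Printing FIRST and FOLLOW after running the block allows a direct comparison with the sets listed in the example.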