VIEWS: 21 PAGES: 16 POSTED ON: 7/30/2012 Public Domain
COMPILER CONSTRUCTION WEEK-2: LANGUAGE DESCRIPTION- SYNTACTIC STRUCTURE: An Overview • Clear and complete descriptions of a language are needed by programmers, implementers, and even language designers. • The syntax of a language specifies how programs in the language are built up. • The semantics of the language specifies what programs mean. • For example, dates are built up from digits represented by D and the symbol / as follows: • DD/DD/DDDD • According to this syntax, 01/02/2001 is a date. • The day this date refers to is not identified by the syntax. • In the United States, this date refer to January 2, 2001, but elsewhere 01 is interpreted as the day and 02 as the month, so the date refers to February 1, 2001. • The same syntax therefore has different semantics in different parts of the world. Expression Notations: • Expression such as a+b*c have been in use for centuries and were a starting point for design of programming languages. • For example, a expression in Fortran can be written as: • (- b + b2 – 4 * a * c ) / (2 * a) • (- b + sqrt (b * b – 4.0 * a * c)) / (2.0 * a) • Programming languages use a mix of infix, prefix, and postfix notations. (Assignment) • A binary operator is applied to two operands. • In infix notation, a binary operator is written between its operands, as in the expression a+b. • Other alternative are prefix notation, in which the operator is written first, as + a b, and postfix notation, in which the operator is written last, as a b +. • An expression can be enclosed within parentheses without affecting its value. • Expression E has the same value as (E), as a rule. • Prefix and Postfix notations are sometimes called parenthesis-free because as we shall see, the operands of each operator can be found unambiguously, without the need for parentheses. Prefix Notation: • An expression in prefix notation is written as follows: • The prefix notation for a constant or a variable is the constant or variable itself. • The application of an operator op to sub-expressions E1 and E2 is written in prefix notation as op E1 E2. • An advantage of prefix notation is that it is easy to decode during a left-to-right scan of an expression. • If a prefix expression begins with operator +, the next expression after + must be the first operands of + and the expression after that must be the second operand of +. • For example, the sum of x and y is written in prefix notation as + x y. • The product of + x y and z is written as * + x y z. • Thus + 20 30 equals to 50 and * + 20 30 60 = * 50 60 = 3000 • Or * 20 + 30 60 = * 20 90 = 1800 Postfix Notation: • An expression in postfix notation is written as follows: • The postfix notation for a constant or a variable is the constant or variable itself. • The application of an operator op to sub-expressions E1 and E2 is written in postfix notation as E1 E2 op. • An advantage of postfix notation is that they can be mechanically evaluated with the help of a stack data structure. • For example, the sum of x and y is written in postfix notation as x y +. • The product of x y + and z is written as x y + z *. • Thus 20 30 + equals to 50 and 20 30 + 60 * = 50 60* = 3000 • Or 20 30 60 + * = 20 90 * = 1800 Infix Notation: • In infix notation, operators appear between their operands; + appear between a and b in the sum a + b. • An advantage of infix notation is that it is familiar and hence easy to read. • Infix notation comes with rules for precedence and associativity. • How is an expression like a + b * c to be decoded? • Is it the sum of a and b * c, or is it the product of a + b and c? • The operator * usually takes its operands before + does. • An operator at a higher precedence level takes its operands before an operator at a lower precedence level. • BODMAS rules is an example. Mixfix Notation: • Operations specified by a combination of symbols do not fit neatly into the prefix, infix, postfix classification. • For example the keywords, if, then, and else are used together in the expression • if a > b then a else b • The meaningful components of this expression are the condition a>b and the expressions a and b. • If a>b evaluates to true, then the value of the expression is a, otherwise, it is b. • When symbols or keywords appear interspersed with the components of an expression, the operation will be said to be in mixfix notation. Abstract Syntax Trees: • The abstract syntax of a language identifies the meaningful components of each construct in the language. • The prefix expression +ab, the infix expression a+b, and the postfix expression ab+ all have the same meaningful components; the operator + and the sub-expressions a and b. • A corresponding tree representation is a better grammar can be designed if the abstract syntax of a language is known before the grammar is specified. + a b • An operator and its operands are represented by a node and its children. • A tree consists of a node with k 0 trees as its children. • When k = 0, a tree consists of just a node, with no children. • A node with no children is called a leaf. • The root of a tree is a node with no parent; that is, it is not a child of any node. Lexical Syntax: • Keyword like if and symbol like <= are treated as units in a programming language, just as words are treated as units in English. • The meaning of the word dote (love / admire) bears no relation to the meaning of dot, despite the similarity of their written representations. • The two-characters symbol <= is treated as a unit in Pascal and C. • It is distinct from the one-character < and =, which have different meaning of their own. • For example: • <> in Pascal and != in C • mod in Pascal and % in C etc. • Grammars deal with units called tokens • The syntax of a programming language is specified in terms of units called tokens or terminals. • A lexical syntax for language specifies the correspondence between the written representation of the language and the tokens or terminals in a grammar for the language. • Alphabetic character sequences that are treated as units in a language are called keywords. Lexical Syntax: • Similarly comments between tokens are ignored. • Informal descriptions usually suffice for white space, comments and the correspondence between tokens and their spellings, so lexical syntax will not be formalized. • Real numbers are a possible exception. • The most complex rules in a lexical syntax are typically the ones describing the syntax of real numbers, because parts of the syntax are optional. • The following some of the ways of writing the same number: • 314.E-2 = 3.14 = 0.314E+1 = 0.313E1 • and leading 0 can sometimes be dropped as: .314E1 Context-Free Grammars: • The concrete syntax of a language describes its written representation, including lexical details such as the placement of keywords and punctuation marks. • Context-free grammars, or simply grammars, are a notation for specifying concrete syntax. • BNF-form, Backus-Nour Form, is a one way of writing grammars. (Assignment) • A grammar for a language imposes a hierarchical structure, called a parse tree on programs in the language. • The following is a parse tree for the string 3.14 in a language of real numbers: Context-Free Grammars: real number Integer part fraction fraction digit digit digit 1 3 4 . Context-Free Grammars: • The leaves at the bottom of a parse tree are labeled with terminals or tokens like 3; tokens represent themselves. • By contrast, the other nodes of a parse tree are labeled with non- terminals like real-number and digit; non-terminal represent language constructs. • Each node in the parse tree is based on a production, a rule that defines a non-terminal in terms of a sequence of terminals and non- terminals. • The root of the parse tree for 3.14 is based on the following informally stated production: • “A real number consists of an integer part, a point, and a fraction part”. • Together the tokens, the non-terminals, the productions, and a distinguished non-terminal, called the starting non-terminal, constitute a grammar for a language. • The starting non-terminal may represent a portion of a complete program when fragments of a programming language are studies. • Both tokens and non-terminals are referred to as grammar symbols, or simply symbols. Definition of Context-Free Grammars: • Given a set of symbols, a starting over the set is a finite sequence of zero or more symbols from the set. • The number of symbol in the sequence is said to be the length of the string. • The length of the string “teddy” is 5. • An empty string is a string of length zero. • A context-free grammar, or simply grammar, has four parts: – A set of tokens or terminal; these are the atomic symbols in the language – A set of non-terminals; these are the variable representing constructs in the language. – A set of rules called productions for identifying the components of a construct. Each production has a non-terminals as its left side, the symbol =, and a string over the sets of terminals and non-terminals as its right side – A non-terminal chosen as the starting non-terminal; it represents the main construct of the language. • Unless otherwise stated, the production for the starting non-terminal appear first. BNF: Backus-Naur Form: • The concept of a context-free grammar, consisting of terminals, non- terminals, productions, and a string non-terminal, is independent of the notation used to write grammars. • BNF is one such notation, made popular by its use to organize the report on the Algol-60 programming language. Grammars for Expressions: • A well-designed grammar can make it easy to pick out the meaningful components of a construct. • In other words, with a well-designed grammar, parse trees are similar enough to abstract syntax trees that the grammar can be used to organize a language description or a program that exploits the syntax. • An example of a program that exploits syntax is an “expression evaluator” that analyzes and evaluates expressions. • After expressions, the remaining syntax is often easy. Variants of Grammars: • The other ways of grammars are Extended BNF and Syntax Charts. • EBNF is an extension of BNF that allows lists and optional elements to be specified. Lists or sequences of elements appear frequently in the syntax of programming language. The appeal of EBNF is convenience, not additional capability, since anything that can be specified with EBNF can also be specified using BNF. • Syntax charts are a graphical notation for grammars. They have visual appeal; again, anything that can be specified using syntax charts can also be specified using BNF. (Assignment)