COMPILER CONSTRUCTION

       WEEK-2:
LANGUAGE DESCRIPTION - SYNTACTIC STRUCTURE:
                 An Overview
• Clear and complete descriptions of a language are needed
  by programmers, implementers, and even language
  designers.
• The syntax of a language specifies how programs in the
  language are built up.
• The semantics of the language specifies what programs
  mean.
• For example, dates are built up from digits represented by
  D and the symbol / as follows:
• DD/DD/DDDD
• According to this syntax, 01/02/2001 is a date.
• The day this date refers to is not identified by the syntax.
• In the United States, this date refers to January 2, 2001, but
  elsewhere 01 is interpreted as the day and 02 as the
  month, so the date refers to February 1, 2001.
• The same syntax therefore has different semantics in
  different parts of the world.
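• A minimal sketch of this point, assuming Python's standard datetime module as the reader of the date: the string is the same in both calls, and only the format string (the interpretation) changes.

```python
from datetime import datetime

text = "01/02/2001"  # matches the syntax DD/DD/DDDD

# Same syntax, two semantics: only the reading convention differs.
us_reading = datetime.strptime(text, "%m/%d/%Y")     # month first
other_reading = datetime.strptime(text, "%d/%m/%Y")  # day first

print(us_reading.month, us_reading.day)        # 1 2  (January 2)
print(other_reading.month, other_reading.day)  # 2 1  (February 1)
```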
            Expression Notations:
• Expressions such as a+b*c have been in use for centuries and were a
  starting point for the design of programming languages.
• For example, the formula (-b + √(b² - 4ac)) / (2a) can be written as an
  expression in Fortran:
•        (- b + sqrt (b * b - 4.0 * a * c)) / (2.0 * a)
• Programming languages use a mix of infix, prefix, and postfix
  notations. (Assignment)
• A binary operator is applied to two operands.
• In infix notation, a binary operator is written between its operands, as
  in the expression a+b.
• The alternatives are prefix notation, in which the operator is written
  first, as + a b, and postfix notation, in which the operator is written last,
  as a b +.
• An expression can be enclosed within parentheses without affecting
  its value.
• Expression E has the same value as (E), as a rule.
• Prefix and Postfix notations are sometimes called parenthesis-free
  because as we shall see, the operands of each operator can be found
  unambiguously, without the need for parentheses.
                Prefix Notation:
• An expression in prefix notation is written as follows:
• The prefix notation for a constant or a variable is the
  constant or variable itself.
• The application of an operator op to sub-expressions E1
  and E2 is written in prefix notation as op E1 E2.
• An advantage of prefix notation is that it is easy to decode
  during a left-to-right scan of an expression.
• If a prefix expression begins with operator +, the next
  expression after + must be the first operand of + and the
  expression after that must be the second operand of +.
• For example, the sum of x and y is written in prefix
  notation as + x y.
• The product of + x y and z is written as * + x y z.
• Thus + 20 30 equals 50, and * + 20 30 60 = * 50 60 =
  3000
• Or * 20 + 30 60 = * 20 90 = 1800
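• The left-to-right decoding described above can be sketched as a short recursive function. This is an illustration only: the names OPS and eval_prefix are hypothetical, and the sketch assumes tokens separated by spaces, integer constants, and binary operators.

```python
import operator

# Assumed operator set; every operator takes exactly two operands.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_prefix(tokens):
    """Evaluate a prefix expression given as a list of tokens."""
    def walk(pos):
        # Returns (value, position of the next unread token).
        tok = tokens[pos]
        if tok in OPS:
            left, pos = walk(pos + 1)  # first operand follows the operator
            right, pos = walk(pos)     # second operand follows the first
            return OPS[tok](left, right), pos
        return int(tok), pos + 1       # a constant is its own value

    value, end = walk(0)
    if end != len(tokens):
        raise ValueError("unexpected tokens after the expression")
    return value

print(eval_prefix("* + 20 30 60".split()))  # 3000
print(eval_prefix("* 20 + 30 60".split()))  # 1800
```

No parentheses are needed: each operator knows how many operands follow it, so the scan never has to back up.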
              Postfix Notation:
• An expression in postfix notation is written as follows:
• The postfix notation for a constant or a variable is the
  constant or variable itself.
• The application of an operator op to sub-expressions E1
  and E2 is written in postfix notation as E1 E2 op.
• An advantage of postfix notation is that expressions can be
  mechanically evaluated with the help of a stack data
  structure.
• For example, the sum of x and y is written in postfix
  notation as x y +.
• The product of x y + and z is written as x y + z *.
• Thus 20 30 + equals 50, and 20 30 + 60 * = 50 60 * =
  3000
• Or 20 30 60 + * = 20 90 * = 1800
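• A minimal sketch of stack-based evaluation, under the same assumptions as the examples above (space-separated tokens, integer constants, binary operators). The names OPS and eval_postfix are illustrative, not part of the course material.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_postfix(tokens):
    """Evaluate a postfix expression with an explicit stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()  # the second operand is on top
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))  # operands are pushed as they are read
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]

print(eval_postfix("20 30 + 60 *".split()))  # 3000
print(eval_postfix("20 30 60 + *".split()))  # 1800
```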
                  Infix Notation:
• In infix notation, operators appear between their operands;
  + appears between a and b in the sum a + b.
• An advantage of infix notation is that it is familiar and
  hence easy to read.
• Infix notation comes with rules for precedence and
  associativity.
• How is an expression like a + b * c to be decoded?
• Is it the sum of a and b * c, or is it the product of a + b and
  c?
• The operator * usually takes its operands before + does.
• An operator at a higher precedence level takes its
  operands before an operator at a lower precedence level.
• The BODMAS rule is an example.
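• One way to see precedence at work is to convert infix to postfix, where the order of operations is explicit. The sketch below uses the well-known shunting-yard approach, which is not described in these slides. It assumes space-separated tokens, balanced parentheses, and four left-associative operators, and the precedence table is an assumption.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}  # higher number binds tighter

def infix_to_postfix(tokens):
    """Convert infix tokens to a postfix string using a stack of operators."""
    output, pending = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pending operators of equal or higher precedence
            # take their operands first.
            while (pending and pending[-1] != "("
                   and PRECEDENCE[pending[-1]] >= PRECEDENCE[tok]):
                output.append(pending.pop())
            pending.append(tok)
        elif tok == "(":
            pending.append(tok)
        elif tok == ")":
            while pending[-1] != "(":
                output.append(pending.pop())
            pending.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand
    while pending:
        output.append(pending.pop())
    return " ".join(output)

print(infix_to_postfix("a + b * c".split()))      # a b c * +
print(infix_to_postfix("( a + b ) * c".split()))  # a b + c *
```

The first result shows that a + b * c is decoded as the sum of a and b * c, because * takes its operands before + does.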
                Mixfix Notation:
• Operations specified by a combination of symbols do not
  fit neatly into the prefix, infix, postfix classification.
• For example, the keywords if, then, and else are used
  together in the expression
• if a > b then a else b
• The meaningful components of this expression are the
  condition a>b and the expressions a and b.
• If a>b evaluates to true, then the value of the expression is
  a, otherwise, it is b.
• When symbols or keywords appear interspersed with the
  components of an expression, the operation will be said to
  be in mixfix notation.
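• Python's conditional expression is itself written in mixfix notation, although it orders the components differently from the if-then-else form above. A small illustration, with larger as a hypothetical function name:

```python
def larger(a, b):
    # The keywords if and else are interspersed with three components:
    # the value when true, the condition, and the value when false.
    return a if a > b else b

print(larger(3, 7))  # 7
print(larger(9, 2))  # 9
```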
            Abstract Syntax Trees:
• The abstract syntax of a language identifies the meaningful
  components of each construct in the language.
• The prefix expression +ab, the infix expression a+b, and the postfix
  expression ab+ all have the same meaningful components; the
  operator + and the sub-expressions a and b.
• A better grammar can be designed if the abstract syntax of a language is
  known before the grammar is specified.
• The corresponding tree representation of these expressions is:
                +
              /   \
            a       b
•   An operator and its operands are represented by a node and its
    children.
•   A tree consists of a node with k ≥ 0 trees as its children.
•   When k = 0, a tree consists of just a node, with no children.
•   A node with no children is called a leaf.
•   The root of a tree is a node with no parent; that is, it is not a child of
    any node.
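•   A minimal sketch of such a tree, assuming a hypothetical Node class with a label and a list of children. The three functions write the same tree in prefix, postfix, and infix notation; the infix writer assumes binary operators and adds parentheses around non-leaf operands.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # k >= 0 subtrees

    def is_leaf(self):
        return not self.children

def prefix(node):
    # Operator first, then each operand.
    return " ".join([node.label] + [prefix(c) for c in node.children])

def postfix(node):
    # Each operand first, then the operator.
    return " ".join([postfix(c) for c in node.children] + [node.label])

def infix(node):
    if node.is_leaf():
        return node.label
    left, right = node.children  # binary operators only

    def wrap(child):
        text = infix(child)
        return text if child.is_leaf() else f"({text})"

    return f"{wrap(left)} {node.label} {wrap(right)}"

tree = Node("+", [Node("a"), Node("b")])
print(prefix(tree))   # + a b
print(infix(tree))    # a + b
print(postfix(tree))  # a b +
```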
                   Lexical Syntax:
• Keywords like if and symbols like <= are treated as units in a
  programming language, just as words are treated as units in English.
• The meaning of the word dote (love / admire) bears no relation to the
  meaning of dot, despite the similarity of their written representations.
• The two-character symbol <= is treated as a unit in Pascal and C.
• It is distinct from the one-character symbols < and =, which have
  different meanings of their own.
• The same token can be spelled differently in different languages, for
  example:
• <> in Pascal and != in C
• mod in Pascal and % in C, etc.
• Grammars deal with units called tokens.
• The syntax of a programming language is specified in terms of units
  called tokens or terminals.
• A lexical syntax for a language specifies the correspondence between
  the written representation of the language and the tokens or terminals
  in a grammar for the language.
• Alphabetic character sequences that are treated as units in a
  language are called keywords.
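• A minimal sketch of a tokenizer that treats <= as a single unit. The token kinds, the keyword set, and the small symbol set are assumptions made for illustration and do not describe the lexical syntax of any particular language.

```python
import re

# Longer spellings come first so that <= is matched as one unit,
# not as < followed by =. Leading white space is skipped.
TOKEN = re.compile(r"\s*(<=|<>|<|=|>|[A-Za-z]+|\d+)")
KEYWORDS = {"if", "then", "else"}

def tokenize(text):
    """Split text into (kind, spelling) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        spelling = m.group(1)
        if spelling in KEYWORDS:
            kind = "keyword"
        elif spelling.isalpha():
            kind = "name"
        elif spelling.isdigit():
            kind = "number"
        else:
            kind = "symbol"
        tokens.append((kind, spelling))
        pos = m.end()
    return tokens

print(tokenize("if a <= b then a else b"))
```

Writing a < = b with a space between the two characters yields two separate symbol tokens instead of one.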
                Lexical Syntax:
• Comments between tokens are ignored, just as white space is.
• Informal descriptions usually suffice for white space,
  comments and the correspondence between tokens and
  their spellings, so lexical syntax will not be formalized.
• Real numbers are a possible exception.
• The most complex rules in a lexical syntax are typically the
  ones describing the syntax of real numbers, because parts
  of the syntax are optional.
• The following are some of the ways of writing the same
  number:
• 314.E-2 = 3.14 = 0.314E+1 = 0.314E1
• A leading 0 can sometimes be dropped, as in:
       .314E1
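• A minimal sketch of such a rule as a regular expression. The pattern is an assumption that covers only the spellings listed above (digits on at least one side of the point, then an optional exponent with an optional sign); it is not the rule of any particular language.

```python
import re

# Optional parts: the digits before or after the point, the exponent,
# and the sign of the exponent.
REAL = re.compile(r"(\d+\.\d*|\.\d+)(E[+-]?\d+)?")

spellings = ["314.E-2", "3.14", "0.314E+1", "0.314E1", ".314E1"]

for s in spellings:
    matched = REAL.fullmatch(s) is not None
    print(s, matched, float(s))

# Every spelling denotes the same number.
values = {float(s) for s in spellings}
print(len(values))  # 1
```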
       Context-Free Grammars:
• The concrete syntax of a language describes its written
  representation, including lexical details such as the
  placement of keywords and punctuation marks.
• Context-free grammars, or simply grammars, are a
  notation for specifying concrete syntax.
• BNF, Backus-Naur Form, is one way of writing
  grammars. (Assignment)
• A grammar for a language imposes a hierarchical
  structure, called a parse tree, on programs in the language.
• The following is a parse tree for the string 3.14 in a
  language of real numbers:
                  real-number
                 /     |     \
      integer-part     .      fraction
            |                 /      \
          digit           digit     fraction
            |               |          |
            3               1        digit
                                       |
                                       4
         Context-Free Grammars:
• The leaves at the bottom of a parse tree are labeled with terminals or
  tokens like 3; tokens represent themselves.
• By contrast, the other nodes of a parse tree are labeled with
  non-terminals like real-number and digit; non-terminals represent
  language constructs.
• Each node in the parse tree is based on a production, a rule that
  defines a non-terminal in terms of a sequence of terminals and non-
  terminals.
• The root of the parse tree for 3.14 is based on the following informally
  stated production:
• “A real number consists of an integer part, a point, and a fraction part”.
• Together the tokens, the non-terminals, the productions, and a
  distinguished non-terminal, called the starting non-terminal, constitute
  a grammar for a language.
• The starting non-terminal may represent a portion of a complete
  program when fragments of a programming language are studied.
• Both tokens and non-terminals are referred to as grammar symbols, or
  simply symbols.
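• A minimal sketch of a parser that builds the parse tree for 3.14 as nested tuples. The productions for integer-part and fraction are assumptions read off the tree for 3.14, since the slides state only the production for real-number; all function names are hypothetical.

```python
def parse_real_number(text):
    """Return a parse tree as nested tuples: (label, child, ...)."""
    pos = 0

    def at_digit():
        return pos < len(text) and text[pos] in "0123456789"

    def digit():
        nonlocal pos
        if not at_digit():
            raise ValueError(f"digit expected at position {pos}")
        pos += 1
        return ("digit", text[pos - 1])  # leaf: the token represents itself

    def integer_part():
        # Assumed: integer-part is a digit, or an integer-part and a digit.
        tree = ("integer-part", digit())
        while at_digit():
            tree = ("integer-part", tree, digit())
        return tree

    def point():
        nonlocal pos
        if pos < len(text) and text[pos] == ".":
            pos += 1
            return "."
        raise ValueError(f"point expected at position {pos}")

    def fraction():
        # Assumed: fraction is a digit, or a digit followed by a fraction.
        first = digit()
        if at_digit():
            return ("fraction", first, fraction())
        return ("fraction", first)

    # A real number consists of an integer part, a point, and a fraction part.
    tree = ("real-number", integer_part(), point(), fraction())
    if pos != len(text):
        raise ValueError(f"unexpected character at position {pos}")
    return tree

print(parse_real_number("3.14"))
```

Each tuple corresponds to one node of the parse tree: the first element is the non-terminal that labels the node, and the remaining elements are its children in left-to-right order.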
   Definition of Context-Free Grammars:
• Given a set of symbols, a string over the set is a finite sequence of
  zero or more symbols from the set.
• The number of symbols in the sequence is said to be the length of the
  string.
• The length of the string “teddy” is 5.
• An empty string is a string of length zero.
• A context-free grammar, or simply grammar, has four parts:
    – A set of tokens or terminals; these are the atomic symbols in the
      language.
    – A set of non-terminals; these are the variables representing
      constructs in the language.
    – A set of rules called productions for identifying the components of a
      construct. Each production has a non-terminal as its left side, the
      symbol =, and a string over the sets of terminals and non-terminals
      as its right side.
    – A non-terminal chosen as the starting non-terminal; it represents the
      main construct of the language.
• Unless otherwise stated, the productions for the starting non-terminal
  appear first.
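• The four parts can be written down directly as a data structure. The sketch below is an illustration with a hypothetical Grammar class; the productions for the real-number language are assumptions, apart from the production for real-number itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    terminals: frozenset
    nonterminals: frozenset
    productions: tuple  # pairs: (left side, right side as a tuple of symbols)
    start: str

    def is_well_formed(self):
        """Check the structural conditions in the definition above."""
        symbols = self.terminals | self.nonterminals
        return (
            self.terminals.isdisjoint(self.nonterminals)
            and self.start in self.nonterminals
            # Each left side is a single non-terminal.
            and all(left in self.nonterminals for left, _ in self.productions)
            # Each right side is a string over terminals and non-terminals.
            and all(s in symbols for _, right in self.productions for s in right)
        )

real_numbers = Grammar(
    terminals=frozenset("0123456789."),
    nonterminals=frozenset({"real-number", "integer-part", "fraction", "digit"}),
    productions=(
        ("real-number", ("integer-part", ".", "fraction")),
        ("integer-part", ("digit",)),
        ("integer-part", ("integer-part", "digit")),
        ("fraction", ("digit",)),
        ("fraction", ("digit", "fraction")),
    ) + tuple(("digit", (d,)) for d in "0123456789"),
    start="real-number",
)

print(real_numbers.is_well_formed())                         # True
print(real_numbers.productions[0][0] == real_numbers.start)  # True
```

The second check reflects the convention that the productions for the starting non-terminal appear first.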
BNF: Backus-Naur Form:
• The concept of a context-free grammar, consisting of terminals,
  non-terminals, productions, and a starting non-terminal, is independent
  of the notation used to write grammars.
• BNF is one such notation, made popular by its use to organize the
  report on the Algol-60 programming language.

Grammars for Expressions:
• A well-designed grammar can make it easy to pick out the meaningful
  components of a construct.
• In other words, with a well-designed grammar, parse trees are similar
  enough to abstract syntax trees that the grammar can be used to
  organize a language description or a program that exploits the syntax.
• An example of a program that exploits syntax is an “expression
  evaluator” that analyzes and evaluates expressions.
• After expressions, the remaining syntax is often easy.
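• A minimal sketch of such an expression evaluator, with one function per non-terminal. The grammar is an assumption chosen for illustration: an expression is a sequence of terms separated by + or -, a term is a sequence of factors separated by * or /, and a factor is an integer or a parenthesized expression. Characters the tokenizer does not recognize are silently skipped.

```python
import re

def evaluate(text):
    """Analyze and evaluate an infix expression over integers."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():
        # expression: term, then any number of (+ or -) term
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():
        # term: factor, then any number of (* or /) factor
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():
        # factor: an integer, or a parenthesized expression
        tok = peek()
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise ValueError("expected )")
            advance()
            return value
        if tok is not None and tok.isdigit():
            advance()
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    value = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return value

print(evaluate("20 + 30 * 60"))    # 1820
print(evaluate("(20 + 30) * 60"))  # 3000
```

Because term is called from expr, the operators * and / take their operands before + and - do, so the structure of the grammar carries the precedence rules.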
Variants of Grammars:
• Other ways of writing grammars are Extended BNF (EBNF) and syntax charts.
• EBNF is an extension of BNF that allows lists and optional elements to
  be specified. Lists or sequences of elements appear frequently in the
  syntax of programming languages. The appeal of EBNF is
  convenience, not additional capability, since anything that can be
  specified with EBNF can also be specified using BNF.
• Syntax charts are a graphical notation for grammars. They have
  visual appeal; again, anything that can be specified using syntax
  charts can also be specified using BNF. (Assignment)

								