Posted: 3/5/2010
Parsers
by John A. Green

For some of us who are graduates only of the University of Hard Knocks, the term parser might mean little more than "something which reads text". For us, parsing is something that we've done only in some small amount, usually in the form of IMPORT statements which load tabular data from text files into database records.

This article explains parsers in two parts. The first part is a very brief introduction to parsers and compilers. It describes what it takes to parse text written in a programming language. The second part is a discussion of a few ways in which parsers are used, and why parsers can be interesting even for those with no intention of writing compilers.

Part I – What Is a Parser?

Parsing a nontrivial language normally involves at least two major components: one for lexical analysis and one for syntactic analysis.

The lexical analysis is done by a lexer (also called a scanner). Its job is to scan the input - a character stream - one character at a time. It understands what to do with particular sequences of characters. It combines particular sequences of characters into tokens (the text matched for a token is often called its lexeme). The lexer’s fundamental job is to generate a token stream.

This token stream is the input for the syntactic part of the parser. Built into the parser is an understanding of the syntax of the language. It combines particular sequences of tokens into a hierarchical data structure called a parse tree. This parse tree is the desired result.

That's the executive summary of lexers and parsers, but let's talk a bit more about what each of these does for us.

The lexer, when analyzing the character stream, discards whitespace because once the input has been tokenized the whitespace normally has no impact on the meaning of the source code. There may also be a step prior to the lexer: the preprocessor. The preprocessor normally discards comments. It also deals with macro expansion and include files. Sometimes (but not often), the preprocessor is built right into the lexer. The stream of tokens which we get out of the lexer is normally the bare minimum necessary for syntactic analysis.

Sometimes the text which made up a token is discarded once the token has been created. For example, token types are normally expressed as integers so that the parser can use integer comparisons, which are very fast. For keywords, the text that made up that token (keyword) is no longer important – internally “FIND” can be represented as an integer, and the text string would be unnecessary baggage.

The parser also normally does a good deal of discarding. Once a hierarchical data structure has been built, much of the information from the token stream is no longer necessary. For example, end-of-statement tokens can be discarded because the structure of the hierarchical tree is enough to determine where statements begin and end.
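
The lexer's behavior described in Part I can be made concrete with a small sketch. What follows is a hand-written lexer for a made-up toy language, not the lexer of any real product: the names TokenType, KEYWORDS, PUNCTUATION, and lex are hypothetical, and treating keywords as case-insensitive and the period as the end-of-statement token are assumptions made only for illustration. It reads one character at a time, throws whitespace away, represents token types as integers, and keeps the original text only for tokens where the text still matters.

```python
from enum import IntEnum


class TokenType(IntEnum):
    """Token types are plain integers, so comparisons in the parser are cheap."""
    FIND = 1      # a keyword
    IDENT = 2     # a name such as a table or variable
    NUMBER = 3
    PLUS = 4
    STAR = 5
    PERIOD = 6    # end-of-statement in this toy language (an assumption)


# Hypothetical keyword and punctuation tables for the toy language.
KEYWORDS = {"FIND": TokenType.FIND}
PUNCTUATION = {"+": TokenType.PLUS, "*": TokenType.STAR, ".": TokenType.PERIOD}


def lex(text):
    """Scan text one character at a time and return a list of (type, text) pairs.

    The text is None for keywords and punctuation, because the integer
    token type already says everything the parser needs to know.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                       # whitespace is discarded
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(text) and (text[i].isalnum() or text[i] == "_"):
                i += 1
            word = text[start:i]
            keyword = KEYWORDS.get(word.upper())
            if keyword is not None:
                tokens.append((keyword, None))        # keyword text dropped
            else:
                tokens.append((TokenType.IDENT, word))
        elif ch.isdigit():
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append((TokenType.NUMBER, text[start:i]))
        elif ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], None))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

Calling lex("FIND customer.") yields three tokens: a FIND token with no text, an IDENT token carrying "customer", and a PERIOD token with no text. The spaces are gone, and the keyword survives only as the integer 1.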

There are different types of trees which can be built, such as parse trees and syntax trees. (I won’t get into the distinction here.) Parsers often generate a refinement of what would have been a full syntax tree, and such a refined tree is called an Abstract Syntax Tree (AST). For the purpose of this article, I refer to “parse tree” as the output of a parser – regardless of the form of tree generated.

For a simple example, let’s say we’re parsing an expression like “A + B * C”. A resulting parse tree might look like this:

         PLUS
         /  \
        A   MULT
            /  \
           B    C

The PLUS and MULT nodes in the parse tree would probably be represented just as integers. The nodes in the parse tree which represent the variables A, B, and C might contain pointers to the respective variable’s representation in a symbol table.

Building a parse tree is only a preparatory step. It is the first step for a number of different types of tools – parsing itself is only a means to an end. Typically, once the parse tree is built, the useful work comes from a process of walking the parse tree to extract useful information. Extracting information and meaning (semantics) from the source code is what programmers do when they read the source code. Automatically extracting semantics is done by way of programs which walk a parse tree.

A compiler will walk the parse tree and may use one or more processes to generate the output code. As an aside, the definition of compile is sometimes unclear – some people might mistakenly understand it to be defined strictly as the process of turning source code into executable code. In fact, compiling is simply the process of translating. It doesn’t matter if the target code is machine code, source code in another language, or something in between.

In some relatively straightforward translations, it is possible for target source code to be written out directly by the parser while it consumes tokens, rather than having the parser generate an intermediate tree. Because the parser is doing a straightforward translation from one syntax to another, this is called syntax-directed translation. Normally though, a parse tree is built, and multiple passes are made through that parse tree to transform it into something which more closely resembles the structure of the desired target code.

Additionally, there may be optimization passes. For example, if we have "1 + 2" in the source code, an optimizing pass can recognize this as a constant expression, and evaluate it to 3 at compile time, rather than at run time.

Part II – What Is a Parser Good For?

In this part of the article, we’ll do a brief review of a few of the general classes of tools which make use of parsers (leaving aside compilers, which are an obvious example).

There is a class of tools called lint tools. From an old Unix “man” page for C lint, you might have seen a description like “picks little bits of fluff from your source code”. Lint tools find things in your source code which, although they
compile, probably aren’t what you meant to do. For example, lint might find unused variables, or statements which cannot possibly have an effect on the operation of the program. Lint tools can be used for finding performance mistakes (missing NO-UNDO on any of your variable definitions?), and they can be used for checking that your code follows your company’s code style conventions (are you supposed to have comments at the top of every program?).

For a lint tool to do its work, it must sometimes be able to follow the syntax, and sometimes get into the semantics, of a particular program. For example, it would take a relatively sophisticated sed or perl script to even find DEFINE VARIABLE statements which are missing the NO-UNDO option. However, even with sophisticated sed or perl scripts it would be rather difficult to determine if a variable’s value is ever accessed.

Code documenters are another type of tool which benefits from a parser. Typically, a code documenter builds summary reference pages for each unit of a given system. With a parser, one can pick through the code to pull out only the most interesting bits. A code documenter might want to present the parameters for a particular unit. It might also want to present a summary of each of the public methods (functions or procedures) available within that unit. We may also want to see a summary of the parameters required for each of those methods.

Additionally, a code documenter probably wants to pull out specific comments from the source code, so that descriptions of methods and parameters can be added to the reference materials. This actually requires a different sort of parser than is typically used for a compiler. As mentioned in the first part of this article, comments and whitespace are not necessary for a compiler and are usually discarded at a very early stage.

Along the same line as the code documenter, parsers can be used for printing out source code in a format which is easier for the programmer to follow. Code could be pretty-printed, or written out with markup showing where include files begin and end; comments could be stripped or highlighted; particular include files could be left out of the listing while all other include files are expanded; and so on.

Code browsers are a class of tool in which the user interface is very important. Code browsers allow you to quickly and easily navigate through source code. They allow you to browse through lists of modules. They allow you to take high level views of your system, and zoom in on the piece in which you are interested. They allow you to quickly jump from one code location to another when you want to see the definition of something which is being referred to in the code currently being viewed. Code browsers typically employ a minimalist and very fast parser to extract only the bits of information necessary for code browsing. ED for Windows is an example of a popular programmer’s editor which has built-in code browsing capability for a number of languages.

Parsers can also be used for building ad hoc queries about your source code. Do you need to find and report every place in your system where you have a database access FOR EACH loop which
contains UI? Examining every FOR EACH loop in your system may take days or weeks. Using a parser would help you reduce this analysis time dramatically.

As a final example, parsers can be used when you want to make automated changes to your code. Refactoring has been defined as the process of improving the structure of the source code without changing its behavior. Normally when refactoring is discussed, people are talking about manually restructuring object-oriented systems to follow better design patterns. However, I tend to use the term for describing any restructuring process, whether dealing with object-oriented design patterns or not.

Automated code refactoring is best done in a manner similar to the way compiling is done. First, the code is parsed and a parse tree is built. Second, the tree is analyzed with one or more passes through the tree. With each pass, the parse tree would likely be marked up with special attributes which are meaningful to later passes. Third, the parse tree is transformed. Branches of the tree may be moved or deleted, and new nodes may be added to the tree. Fourth and finally, the modified parse tree is written out as source code.

Progress code refactoring is the particular area of interest for Proparse, a parser built by us at Joanju. We have used it to build an example of a Progress code documentation system: AutoDox. Proparse and AutoDox can be found at our website: www.joanju.com . Jurjen Dijkstra (of www.global-shared.com fame) has built a splendid Progress lint tool called Prolint, which can be found at his website.

Although Proparse today maintains information about comments, whitespace, include files, and line numbers, it is not yet capable of regenerating the source code’s original preprocessor directives. This means that it is capable of doing automated refactoring, but it would be as if the refactoring were done on the code output from COMPILE with PREPROCESS. The ability to regenerate original code as it appeared before preprocessing is something that will be available in the coming months.

Feedback? Write me: john@joanju.com. Thanks to Judy Hoffman Green, Gerry Winning, and Greg Wutzke for reviewing.

---
John Allen Green has started using all three names because there are too many people named John Green. He is not the John Green who wrote books about Sasquatch, so please don’t ask him about it.
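
As a postscript, the four refactoring steps described above (parse, analyze, transform, write out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Proparse works: the toy language has just numbers, variable names, "+" and "*"; the names Node, parse, mark_constants, fold, write, and refactor are hypothetical; and the transformation chosen is the constant folding mentioned in Part I, simply because it is small. There is no error handling, and characters the tokenizer does not recognize are silently skipped.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                        # "NUM", "VAR", "PLUS" or "MULT"
    value: object = None             # int for NUM, name for VAR
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    constant: bool = False           # attribute set by the analysis pass


def parse(text):
    """Step 1: build a parse tree in which MULT binds tighter than PLUS."""
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[+*]", text)
    pos = 0

    def operand():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return Node("NUM", int(tok)) if tok.isdigit() else Node("VAR", tok)

    def term():
        nonlocal pos
        node = operand()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = Node("MULT", left=node, right=operand())
        return node

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            node = Node("PLUS", left=node, right=term())
        return node

    return expr()


def mark_constants(node):
    """Step 2: analysis pass; mark every subtree that contains no variables."""
    if node.kind == "NUM":
        node.constant = True
    elif node.kind in ("PLUS", "MULT"):
        mark_constants(node.left)
        mark_constants(node.right)
        node.constant = node.left.constant and node.right.constant
    return node


def evaluate(node):
    """Compute the value of a subtree that has been marked constant."""
    if node.kind == "NUM":
        return node.value
    a, b = evaluate(node.left), evaluate(node.right)
    return a + b if node.kind == "PLUS" else a * b


def fold(node):
    """Step 3: transformation pass; replace marked subtrees with one NUM node."""
    if node.kind in ("PLUS", "MULT"):
        if node.constant:
            return Node("NUM", evaluate(node), constant=True)
        node.left, node.right = fold(node.left), fold(node.right)
    return node


def write(node):
    """Step 4: write the (possibly modified) tree back out as source text."""
    if node.kind in ("NUM", "VAR"):
        return str(node.value)
    op = " + " if node.kind == "PLUS" else " * "
    return write(node.left) + op + write(node.right)


def refactor(text):
    """Run all four steps in order."""
    return write(fold(mark_constants(parse(text))))
```

With these definitions, refactor("1 + 2") returns "3" and refactor("A + 2 * 3") returns "A + 6", while refactor("A + B * C") comes back unchanged because nothing in it is constant. Parsing "A + B * C" also produces the same shape as the tree drawn in Part I: a PLUS node whose left child is A and whose right child is a MULT node over B and C.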

						