

Parsers

by John A. Green

For some of us who are graduates only of the University of Hard Knocks, the term parser might mean little more than "something which reads text". For us, parsing is something that we've done only in some small amount, usually in the form of IMPORT statements which load tabular data from text files into database records.

This article explains parsers in two parts. The first part is a very brief introduction to parsers and compilers. It describes what it takes to parse text written in a programming language. The second part is a discussion of a few ways in which parsers are used, and why parsers can be interesting even for those with no intention of writing compilers.

Part I – What Is a Parser?

Parsing a nontrivial language normally involves at least two major components: one for lexical analysis and one for syntactic analysis.

The lexical analysis is done by a lexer (also called a scanner). Its job is to scan the input, a character stream, one character at a time. It understands what to do with particular sequences of characters. It combines particular sequences of characters into tokens (strictly speaking, the matched text of a token is its lexeme, though the two terms are often used interchangeably). The lexer’s fundamental job is to generate a token stream.

This token stream is the input for the syntactic part of the parser. Built into the parser is an understanding of the syntax of the language. It combines particular sequences of tokens into a hierarchical data structure called a parse tree. This parse tree is the desired result.

That's the executive summary of lexers and parsers, but let's talk a bit more about what each of these does for us.

The lexer, when analyzing the character stream, discards whitespace because once the input has been tokenized the whitespace normally has no impact on the meaning of the source code. There may also be a step prior to the lexer: the preprocessor. The preprocessor normally discards comments. It also deals with macro expansion and include files. Sometimes (but not often), the preprocessor is built right into the lexer. The stream of tokens which we get out of the lexer is normally the bare minimum necessary for syntactic analysis.

Sometimes the text which made up a token is discarded once the token has been created. For example, token types are normally expressed as integers so that the parser can use integer comparisons, which are very fast. For keywords, the text that made up that token is no longer important: internally, “FIND” can be represented as an integer, and the text string would be unnecessary baggage.

The parser also normally does a good deal of discarding. Once a hierarchical data structure has been built, much of the information from the token stream is no longer necessary. For example, end-of-statement tokens can be discarded because the structure of the hierarchical tree is enough to determine where statements begin and end.
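
To make the lexer and parser roles described above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the design of any particular tool: the token type names, the `Token` class, and the functions `lex` and `parse_statement` are all hypothetical, and the tiny "language" (identifiers, numbers, `+`, `*`, and a period as end-of-statement) was invented for this example. It shows whitespace being discarded, keywords being reduced to integer token types, and the end-of-statement token being consumed without being stored in the tree.

```python
from typing import List, NamedTuple, Optional

# Hypothetical integer token types; a real lexer would define many more.
FIND, IDENT, NUMBER, PLUS, MULT, PERIOD = range(6)

KEYWORDS = {"FIND": FIND}                      # keyword text -> integer type
SYMBOLS = {"+": PLUS, "*": MULT, ".": PERIOD}  # single-character tokens


class Token(NamedTuple):
    type: int
    text: Optional[str]  # None once the text is no longer needed


def lex(source: str) -> List[Token]:
    """Scan the character stream one character at a time and build tokens."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1                              # whitespace is discarded
        elif ch in SYMBOLS:
            tokens.append(Token(SYMBOLS[ch], None))
            i += 1
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(Token(NUMBER, source[start:i]))
        elif ch.isalpha():
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            keyword = KEYWORDS.get(word.upper())
            if keyword is not None:
                tokens.append(Token(keyword, None))  # keyword text is dropped
            else:
                tokens.append(Token(IDENT, word))
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {i}")
    return tokens


def parse_operand(tokens, pos):
    if pos < len(tokens) and tokens[pos].type in (IDENT, NUMBER):
        return tokens[pos], pos + 1
    raise SyntaxError(f"expected an operand at token {pos}")


def parse_product(tokens, pos):
    left, pos = parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos].type == MULT:
        right, pos = parse_operand(tokens, pos + 1)
        left = (MULT, left, right)              # tree node: (type, left, right)
    return left, pos


def parse_sum(tokens, pos):
    left, pos = parse_product(tokens, pos)
    while pos < len(tokens) and tokens[pos].type == PLUS:
        right, pos = parse_product(tokens, pos + 1)
        left = (PLUS, left, right)
    return left, pos


def parse_statement(tokens):
    """Parse `expression PERIOD`; the PERIOD is checked, then discarded."""
    tree, pos = parse_sum(tokens, 0)
    if pos >= len(tokens) or tokens[pos].type != PERIOD:
        raise SyntaxError("expected end-of-statement '.'")
    if pos + 1 != len(tokens):
        raise SyntaxError("unexpected tokens after end of statement")
    return tree


print([t.type for t in lex("FIND   customer.")])  # [0, 1, 5]
print(parse_statement(lex("A + B * C.")))
```

The second call prints a nested tuple whose outer node has type PLUS and whose right-hand child is a MULT node, so the multiplication sits lower in the tree than the addition. Nested tuples were chosen only to keep the sketch short; a real parser would normally use dedicated node objects.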

There are different types of trees which can be built, such as parse trees and syntax trees. (I won’t get into the distinction here.) Parsers often generate a refinement of what would have been a full syntax tree, and such a refined tree is called an Abstract Syntax Tree (AST). For the purpose of this article, I use “parse tree” to refer to the output of a parser, regardless of the form of tree generated.

For a simple example, let’s say we’re parsing an expression like “A + B * C”. A resulting parse tree might look like this:

        PLUS
       /    \
      A     MULT
           /    \
          B      C

The PLUS and MULT nodes in the parse tree would probably be represented just as integers. The nodes in the parse tree which represent the variables A, B, and C might contain pointers to the respective variable’s representation in a symbol table.

Building a parse tree is only a preparatory step. It is the first step for a number of different types of tools; parsing itself is only a means to an end. Typically, once the parse tree is built, the useful work comes from a process of walking the parse tree to extract useful information. Extracting information and meaning (semantics) from the source code is what programmers do when they read the source code. Automatically extracting semantics is done by way of programs which walk a parse tree.

A compiler will walk the parse tree and may use one or more processes to generate the output code. As an aside, the definition of compile is sometimes unclear: some people might mistakenly understand it to be defined strictly as the process of turning source code into executable code. In fact, compiling is simply the process of translating. It doesn’t matter if the target code is machine code, source code in another language, or something in between.

In some relatively straightforward translations, it is possible for target source code to be written out directly by the parser while it consumes tokens, rather than having the parser generate an intermediate tree. Because the parser is doing a straightforward translation from one syntax to another, this is called syntax-directed translation. Normally though, a parse tree is built, and multiple passes are made through that parse tree to transform it into something which more closely resembles the structure of the desired target code.

Additionally, there may be optimization passes. For example, if we have "1 + 2" in the source code, an optimizing pass can recognize this as a constant expression, and evaluate it to 3 at compile time, rather than at run time.

Part II – What Is a Parser Good For?

In this part of the article, we’ll do a brief review of a few of the general classes of tools which make use of parsers (leaving aside compilers, which are an obvious example).

There is a class of tools called lint tools. From an old Unix “man” page for C lint, you might have seen a description like “picks little bits of fluff from your source code”. Lint tools find things in your source code which, although they
compile, probably aren’t what you meant to do. For example, lint might find unused variables, or statements which cannot possibly have an effect on the operation of the program. Lint tools can be used for finding performance mistakes (missing NO-UNDO on any of your variable definitions?), and they can be used for checking that your code follows your company’s code style conventions (are you supposed to have comments at the top of every program?).

For a lint tool to do its work, it must sometimes be able to follow the syntax, and sometimes get into the semantics, of a particular program. For example, it would take a relatively sophisticated sed or perl script to even find DEFINE VARIABLE statements which are missing the NO-UNDO option. However, even with sophisticated sed or perl scripts it would be rather difficult to determine if a variable’s value is ever accessed.

Code documenters are another type of tool which benefits from a parser. Typically, a code documenter builds summary reference pages for each unit of a given system. With a parser, one can pick through the code to pull out only the most interesting bits. A code documenter might want to present the parameters for a particular unit. It might also want to present a summary of each of the public methods (functions or procedures) available within that unit. We may also want to see a summary of the parameters required for each of those methods.

Additionally, a code documenter probably wants to pull out specific comments from the source code, so that descriptions of methods and parameters can be added to the reference materials. This actually requires a different sort of parser than is typically used for a compiler. As mentioned in the first part of this article, comments and whitespace are not necessary for a compiler and are usually discarded at a very early stage.

Along the same lines as the code documenter, parsers can be used for printing out source code in a format which is easier for the programmer to follow. Code could be pretty-printed, or could be written out with markup showing where include files begin and end, comments could be stripped or highlighted, particular include files could be left out of the listing while all other include files are expanded, etc.

Code browsers are a class of tool in which the user interface is very important. A code browser allows you to quickly and easily navigate through source code. Code browsers allow you to browse through lists of modules. They allow you to take high-level views of your system, and zoom in on the piece in which you are interested. They allow you to quickly jump from one code location to another when you want to see the definition of something which is being referred to in the code currently being viewed. Code browsers typically employ a minimalist and very fast parser to extract only the bits of information necessary for code browsing. ED for Windows is an example of a popular programmer’s editor which has built-in code browsing capability for a number of languages.

Parsers can also be used for building ad hoc queries about your source code. Do you need to find and report every place in your system where you have a database access FOR EACH loop which contains UI? Examining every FOR
EACH loop in your system may take days or weeks. Using a parser would help you reduce this analysis time dramatically.

As a final example, parsers can be used when you want to make automated changes to your code. Refactoring has been defined as the process of improving the structure of the source code without changing its behavior. Normally when refactoring is discussed, people are talking about manually restructuring object-oriented systems to follow better design patterns. However, I tend to use the term for describing any restructuring process, whether dealing with object-oriented design patterns or not.

Automated code refactoring is best done in a manner similar to the way compiling is done. First, the code is parsed and a parse tree is built. Second, the tree is analyzed with one or more passes through the tree. With each pass, the parse tree would likely be marked up with special attributes which are meaningful to later passes. Third, the parse tree is transformed. Branches of the tree may be moved or deleted, and new nodes may be added to the tree. Fourth and finally, the modified parse tree is written out as source code.

Progress code refactoring is the particular area of interest for Proparse, a parser built by us at Joanju. We have used it to build an example of a Progress code documentation system: AutoDox. Proparse and AutoDox can be found at our website: www.joanju.com. Jurjen Dijkstra (of www.global-shared.com fame) has built a splendid Progress lint tool called Prolint, which can be found at his website.

Although Proparse today maintains information about comments, whitespace, include files, and line numbers, it is not yet capable of regenerating the source code’s original preprocessor directives. This means that it is capable of doing automated refactoring, but it would be as if the refactoring were done on the code output from COMPILE with PREPROCESS. The ability to regenerate original code as it appeared before preprocessing is something that will be available in the coming months.

Feedback? Write me: john@joanju.com. Thanks to Judy Hoffman Green, Gerry Winning, and Greg Wutzke for reviewing.

---
John Allen Green has started using all three names because there are too many people named John Green. He is not the John Green who wrote books about Sasquatch, so please don’t ask him about it.