Parsers
by John A. Green

For some of us who are graduates only of the University of Hard Knocks, the term parser might mean little more than "something which reads text". For us, parsing is something that we've done only in some small amount, usually in the form of IMPORT statements which load tabular data from text files into database records.

This article explains parsers in two parts. The first part is a very brief introduction to parsers and compilers. It describes what it takes to parse text written in a programming language. The second part is a discussion of a few ways in which parsers are used, and why parsers can be interesting even for those with no intention of writing compilers.

Part I – What Is a Parser?

Parsing a nontrivial language normally involves at least two major components: one for lexical analysis and one for syntactic analysis.

The lexical analysis is done by a lexer (also called a scanner). Its job is to scan the input, a character stream, one character at a time. It understands what to do with particular sequences of characters. It combines particular sequences of characters into tokens (also called lexemes). The lexer's fundamental job is to generate a token stream.

This token stream is the input for the syntactic part of the parser. Built into the parser is an understanding of the syntax of the language. It combines particular sequences of tokens into a hierarchical data structure called a parse tree. This parse tree is the desired result.

That's the executive summary of lexers and parsers, but let's talk a bit more about what each of these does for us.

The lexer, when analyzing the character stream, discards whitespace because once the input has been tokenized the whitespace normally has no impact on the meaning of the source code. There may also be a step prior to the lexer: the preprocessor. The preprocessor normally discards comments. It also deals with macro expansion and include files. Sometimes (but not often), the preprocessor is built right into the lexer. The stream of tokens which we get out of the lexer is normally the bare minimum necessary for syntactic analysis.

Sometimes the text which made up a token is discarded once the token has been created. For example, token types are normally expressed as integers so that the parser can use integer comparisons, which are very fast. For keywords, the text that made up that token (keyword) is no longer important: internally "FIND" can be represented as an integer, and the text string would be unnecessary baggage.

The parser also normally does a good deal of discarding. Once a hierarchical data structure has been built, much of the information from the token stream is no longer necessary. For example, end-of-statement tokens can be discarded because the structure of the hierarchical tree is enough to determine where statements begin and end.

There are different types of trees which can be built, such as parse trees and syntax trees. (I won't get into the distinction here.) Parsers often generate a refinement of what would have been a full syntax tree, and such a refined tree is called an Abstract Syntax Tree (AST). For the purpose of this article, I refer to "parse tree" as the output of a parser, regardless of the form of tree generated.

For a simple example, let's say we're parsing an expression like "A + B * C". A resulting parse tree might look like this:

        PLUS
       /    \
      A     MULT
           /    \
          B      C

The PLUS and MULT nodes in the parse tree would probably be represented just as integers. The nodes in the parse tree which represent the variables A, B, and C might contain pointers to the respective variable's representation in a symbol table.

Building a parse tree is only a preparatory step. It is the first step for a number of different types of tools; parsing itself is only a means to an end. Typically, once the parse tree is built, the useful work comes from a process of walking the parse tree to extract useful information. Extracting information and meaning (semantics) from the source code is what programmers do when they read the source code. Automatically extracting semantics is done by way of programs which walk a parse tree.

A compiler will walk the parse tree and may use one or more processes to generate the output code. As an aside, the definition of compile is sometimes unclear: some people might mistakenly understand it to be defined strictly as the process of turning source code into executable code. In fact, compiling is simply the process of translating. It doesn't matter if the target code is machine code, source code in another language, or something in between.

In some relatively straightforward translations, it is possible for target source code to be written out directly by the parser while it consumes tokens, rather than having the parser generate an intermediate tree. Because the parser is doing a straightforward translation from one syntax to another, this is called syntax-directed translation. Normally though, a parse tree is built, and multiple passes are made through that parse tree to transform it into something which more closely resembles the structure of the desired target code.

Additionally, there may be optimization passes. For example, if we have "1 + 2" in the source code, an optimizing pass can recognize this as a constant expression, and evaluate it to 3 at compile time, rather than at run time.

Part II – What Is a Parser Good For?

In this part of the article, we'll do a brief review of a few of the general classes of tools which make use of parsers (leaving aside compilers, which are an obvious example).

There is a class of tools called lint tools. From an old Unix "man" page for C lint, you might have seen a description like "picks little bits of fluff from your source code". Lint tools find things in your source code which, although they compile, probably aren't what you meant to do. For example, lint might find unused variables, or statements which cannot possibly have an effect on the operation of the program. Lint tools can be used for finding performance mistakes (missing NO-UNDO on any of your variable definitions?), and they can be used for checking that your code follows your company's code style conventions (are you supposed to have comments at the top of every program?).

For a lint tool to do its work, it must sometimes be able to follow the syntax, and sometimes get into the semantics, of a particular program. For example, it would take a relatively sophisticated sed or perl script to even find DEFINE VARIABLE statements which are missing the NO-UNDO option. However, even with sophisticated sed or perl scripts it would be rather difficult to determine if a variable's value is ever accessed.

Code documenters are another type of tool which benefits from a parser. Typically, a code documenter builds summary reference pages for each unit of a given system. With a parser, one can pick through the code to pull out only the most interesting bits. A code documenter might want to present the parameters for a particular unit. It might also want to present a summary of each of the public methods (functions or procedures) available within that unit. We may also want to see a summary of the parameters required for each of those methods.

Additionally, a code documenter probably wants to pull out specific comments from the source code, so that descriptions of methods and parameters can be added to the reference materials. This actually requires a different sort of parser than is typically used for a compiler. As mentioned in the first part of this article, comments and whitespace are not necessary for a compiler and are usually discarded at a very early stage.

Along the same line as the code documenter, parsers can be used for printing out source code in a format which is easier for the programmer to follow. Code could be pretty-printed, or could be written out with markup showing where include files begin and end, comments could be stripped or highlighted, particular include files could be left out of the listing while all other include files are expanded, etc.

Code browsers are a class of tool in which the user interface is very important. A code browser allows you to quickly and easily navigate through source code. They allow you to browse through lists of modules. They allow you to take high-level views of your system, and zoom in on the piece in which you are interested. They allow you to quickly jump from one code location to another when you want to see the definition of something which is being referred to in the code currently being viewed. Code browsers typically employ a minimalist and very fast parser to extract only the bits of information necessary for code browsing. ED for Windows is an example of a popular programmer's editor which has built-in code browsing capability for a number of languages.

Parsers can also be used for building ad hoc queries about your source code. Do you need to find and report every place in your system where you have a database access FOR EACH loop which contains UI? Examining every FOR EACH loop in your system may take days or weeks. Using a parser would help you reduce this analysis time dramatically.

As a final example, parsers can be used when you want to make automated changes to your code. Refactoring has been defined as the process of improving the structure of the source code without changing its behavior. Normally when refactoring is discussed, people are talking about manually restructuring object oriented systems to follow better design patterns. However, I tend to use the term for describing any restructuring process, whether dealing with object oriented design patterns or not.

Automated code refactoring is best done in a manner similar to the way compiling is done. First, the code is parsed and a parse tree is built. Second, the tree is analyzed with one or more passes through the tree. With each pass, the parse tree would likely be marked up with special attributes which are meaningful to later passes. Third, the tree is transformed. Branches of the tree may be moved or deleted, and new nodes may be added to the tree. Fourth and finally, the modified parse tree is written out as source code.

Progress code refactoring is the particular area of interest for Proparse, a parser built by us at Joanju. We have used it to build an example of a Progress code documentation system: AutoDox. Proparse and AutoDox can be found at our website: www.joanju.com . Jurjen Dijkstra (of www.global-shared.com fame) has built a splendid Progress lint tool called Prolint, which can be found at his website.

Although Proparse today maintains information about comments, whitespace, include files, and line numbers, it is not yet capable of regenerating the source code's original preprocessor directives. This means that it is capable of doing automated refactoring, but it would be as if the refactoring was done on the code output from COMPILE with PREPROCESS. The ability to regenerate original code as it appeared before preprocessing is something that will be available in the coming months.

Feedback? Write me: email@example.com. Thanks to Judy Hoffman Green, Gerry Winning, and Greg Wutzke for reviewing.

---
John Allen Green has started using all three names because there are too many people named John Green. He is not the John Green who wrote books about Sasquatch, so please don't ask him about it.
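
A postscript for readers who would like to see Part I in code: the following is a minimal Python sketch of the lexer, parser, and constant-folding pass described there, applied to expressions like "A + B * C". It is only an illustration under stated assumptions. The names Tok, lex, parse, and fold are hypothetical, the tiny grammar (operands joined by + and *) is invented for the example, and none of it reflects how Proparse or any Progress tool is actually implemented.

```python
import re
from enum import IntEnum


class Tok(IntEnum):
    # Token types are small integers, so the parser compares ints, not text.
    NUM = 1
    IDENT = 2
    PLUS = 3
    MULT = 4


# Leading whitespace is skipped here, which is how the lexer "discards" it.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\+)|(\*))")


def lex(text):
    """Turn a character stream into a list of (type, value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input near position {pos}")
        num, ident, plus, _mult = m.groups()
        if num is not None:
            tokens.append((Tok.NUM, int(num)))
        elif ident is not None:
            tokens.append((Tok.IDENT, ident))
        elif plus is not None:
            tokens.append((Tok.PLUS, None))  # operator text is not kept
        else:
            tokens.append((Tok.MULT, None))
        pos = m.end()
    return tokens


def parse(tokens):
    """Recursive descent: expr := term ('+' term)*, term := atom ('*' atom)*."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def atom():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind, value = tokens[pos]
        if kind not in (Tok.NUM, Tok.IDENT):
            raise SyntaxError(f"expected an operand, got {kind.name}")
        pos += 1
        return (kind, value)

    def term():
        nonlocal pos
        node = atom()
        while peek() == Tok.MULT:
            pos += 1
            node = (Tok.MULT, node, atom())
        return node

    def expr():
        nonlocal pos
        node = term()
        while peek() == Tok.PLUS:
            pos += 1
            node = (Tok.PLUS, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return tree


def fold(node):
    """Optimization pass: evaluate constant subexpressions ahead of run time."""
    if node[0] in (Tok.NUM, Tok.IDENT):
        return node  # leaves are returned unchanged
    op, left, right = node[0], fold(node[1]), fold(node[2])
    if left[0] == Tok.NUM and right[0] == Tok.NUM:
        value = left[1] + right[1] if op == Tok.PLUS else left[1] * right[1]
        return (Tok.NUM, value)
    return (op, left, right)
```

Calling parse(lex("A + B * C")) gives a nested tuple with PLUS at the root and MULT as its right child, the same shape as the tree drawn in Part I, because the term rule binds * more tightly than +. Running fold over the tree for "1 + 2" collapses it to a single NUM node holding 3. A real parser would also attach line numbers and symbol table references to the nodes; those are left out to keep the sketch short.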