									    Efficient Semantic Analysis for Text Editors∗
                                   Elad Lahav
              School of Computer Science, University of Waterloo
                           elahav@uwaterloo.ca

                                August 16, 2004


                                     Abstract
          Meddle is a programmer’s text editor designed to provide as-you-type
      semantic information to the user. This is accomplished by using algo-
      rithms for tracking changes to the editor’s text buffer, incremental scan-
      ning and incremental parsing. These algorithms are presented and ex-
      plained.


1    Introduction
Almost all modern text editors provide syntactic information, usually in the
form of syntax highlighting. Some text editors, such as Emacs1 and Kate2, even
provide limited semantic information, such as displaying matching sets of paren-
theses. It is hard, however, to find a text editor that provides full semantic in-
formation of the file being edited. Two examples of editors that accomplish this
task are SourceInsight3 and Visual SlickEdit4, both of which are closed-source,
commercial editors with price tags of several hundred dollars. These
applications, however, show a decisive advantage over the aforementioned free
editors in that they give the programmer a much better view of the edited files,
including such information as the distinction between global and local variables,
function declarations vs. function calls, type resolution for non-primitive types,
etc. This information cannot be obtained without the construction of a complete
parse tree for the file being edited.
    It is easy to see that semantic information is much harder to generate than
syntactic information: while the latter task only requires scanning the text
and extracting tokens with a simple finite automaton, the former relies on a
parser for context-free grammars. Even though some semantic information can
be extracted from the text by simple extensions to the scanner’s automaton
(e.g., a parentheses stack), the construction of a complete parse tree requires a
full-blown parser.
   ∗ Submitted as a final project for CS842: Programming Language Design and Implementation. Sadly, this work is self-funded.
   1 www.emacs.org
   2 kate.kde.org
   3 www.sourceinsight.com
   4 www.slickedit.com




Figure 1: The editing-scanning-parsing cycle. The user makes changes to the
editor’s text buffer, which are recorded for later use by the incremental scan-
ner. Scanning is triggered by a timer, and the incremental scanner generates a
stream of tokens for the parser. These tokens can be used to display syntactic
information, which is not dependent upon parsing, such as syntax highlighting.
A second timer invokes the incremental parser, which generates the parse tree,
and provides semantic information that can be displayed in various forms.


    The real problem, however, is not the construction of parsers for program-
ming languages. This is relatively easy to do using the numerous parser gener-
ators available, such as Yacc and Bison. These parsers, however, are designed
to work in batch mode: they require, as input, the complete stream of tokens
extracted from a text document in order to build a parse tree. This method,
while appropriate for tasks such as compilation, is much too time consuming for
rapidly-changing environments, such as text editors.
    Luckily, not every change to the editor’s text buffer requires a complete re-
parse of the entire stream of tokens. In many cases, a correct parse tree can be
obtained by reusing some of the sub-trees from the previous version of the tree,
a procedure referred to as incremental parsing. Incremental parsing is based
upon incremental scanning of the text, a process that identifies the modified
tokens. Incremental scanning, in turn, requires a method for tracking changes
to the text buffer. The complete editing-scanning-parsing cycle is depicted in
figure 1.
    Strangely enough, even though algorithms for incremental parsing have been
known for more than 20 years [3], it seems that they are rarely incorporated into
modern text editors. A possible reason for this can be that previous papers on
these algorithms were either incomplete, or described the algorithms in a manner
that was not implementation-oriented. The primary goal of this paper is to give
a complete description of the incremental parsing process, as implemented in a
demo text editor called Meddle. The source code for Meddle is freely available,
and its documentation follows this paper, so it can be used as a tutorial for
constructing other semantic text editors.
    Since this paper is implementation-oriented, some of the algorithms pre-
sented here are not necessarily optimal. Instead, I have tried to maintain a
balance between performance and ease of implementation.
    The remainder of this paper is structured as follows: Section 2 discusses
some of the previous work published on incremental scanning and parsing; Sec-
tions 3 to 5 describe the implementation of the editing-scanning-parsing cycle
in Meddle; finally, section 6 suggests some ideas for future work.


2     Related Work
Jalili and Gallier were the first to present a complete and correct algorithm for
incremental parsing of LR(1) grammars [3]. Their algorithm is based on a state-
matching criterion, which assigns to each node in the parse tree the state of the
parser at the time it was shifted. Note that unlike traditional parsing, in this
case the parse stack does not hold just the states of the finite state machine, but
rather (state, node) pairs. Jalili and Gallier showed that sub-trees of the parse
tree can be reused when parsing the new token stream, based on the following
key observation:
Proposition 1. Let G = (V_T, V_N, P, S) be an LR(1) grammar, let x, y ∈ (V_T ∪
V_N)+ be (non-empty) strings in the grammar vocabulary, and let a ∈ V_T be a
terminal symbol. Assume that the text to parse is xya, and let s be the top state
on the parse stack after parsing x. If y is reduced to a single non-terminal A
based on the look-ahead symbol a, then it will also be reduced to A when parsing
x′ya for any string x′ such that the top state after parsing x′ is s.
    In other words, a deterministic LR(1) parser will reduce a string to a single
symbol based solely on the current state and look-ahead symbol, and regardless
of the symbols and actions that have led to that state. This means that a sub-
tree rooted at a node A can be reused if, after performing all reductions using
First(A), the following conditions hold:

    1. The parser is at the same state as it was when A was shifted during the
       construction of the original tree; and
    2. The look-ahead symbol is the same one that was used to determine that
       A should be shifted.

The term “reuse” means that the parser can push the state goto(s, A) on the
top of the stack without further examination of the contents of the terminal
string spanned by A.
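
The reuse test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Meddle or given in [3]: the names Node, can_reuse, reuse and goto_table are hypothetical, and the sketch assumes that each node records both the parser state and the look-ahead terminal that were current when it was shifted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    symbol: str                      # grammar symbol at the root of the sub-tree
    state: int                       # parser state on top of the stack when the node was shifted
    lookahead: Optional[str] = None  # look-ahead terminal seen at that time (assumed to be stored)
    children: List["Node"] = field(default_factory=list)


def can_reuse(node: Node, current_state: int, lookahead: Optional[str]) -> bool:
    """State-matching test: same parser state and same look-ahead symbol."""
    return node.state == current_state and node.lookahead == lookahead


def reuse(stack: List[Tuple[int, Optional[Node]]], node: Node,
          goto_table: Dict[Tuple[int, str], int]) -> None:
    """Push (goto(s, A), A) without re-examining the text spanned by A."""
    top_state = stack[-1][0]
    stack.append((goto_table[(top_state, node.symbol)], node))
```

Note that the stack holds (state, node) pairs, as described above, so a reused sub-tree is pushed in a single step.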
    Another key feature of the Jalili-Gallier algorithm is that if a sub-tree cannot
be reused in its entirety, it can be broken into sub-trees that can in turn be
considered for reuse. Meddle uses the state-matching test as well as the Divide,
Undo-Reductions and Replace operations described in [3]. The Delete
operation is implemented differently. The use of two stacks, a parse stack and a
tree-reuse stack, is replaced by a data structure that is described in section 5.1.

    Unfortunately, [3] assumes that the changes to the token list are given, and
so does not provide any details as to how these changes are recorded.
    A complete description of the editing-scanning-parsing cycle for a text editor
is given by Beetem and Beetem in [2]. Their method for tracking changes, based
on a data structure called a kmn-list, is used by Meddle, and is described in
detail in section 3. However, while [2] presents a single complex procedure for
manipulating the kmn-list, Meddle uses two separate procedures, one for text
insertions and one for text deletion. The use of the complex procedure is only
required in order to support text replacement, which can be easily broken into
two separate actions of first deleting the old text and then inserting the new
one.
    Meddle also follows the gist of the algorithm presented in [2] for incremental
scanning, as described in section 4. However, the algorithm given in that paper
is based on different data structures than those used here, and its presentation, in
the form of pseudo-code, is difficult to comprehend. Therefore the algorithm
presented in section 4 was developed from scratch, based only on the guidelines
given in [2].
    On the other hand, the incremental parsing algorithm given in [2] is based
on a recursive descent parser, suitable for languages described by LL(1) gram-
mars. Most modern programming languages, however, are based on LALR(1)
grammars, and common parser generators such as Yacc require that languages
be specified using such grammars. The parser described in [2] is therefore inad-
equate for a real-life text editor (in fact, the editor described in [2] is tailored
around a specific programming language used for research purposes.)
    Much of this paper has been inspired by the work of the “Harmonia” project5,
led by Susan L. Graham. This project has resulted in numerous papers on
text editing in general and incremental parsing in particular. Unfortunately, it
seems that none of the editors mentioned in these papers (Pan, Ensemble and
Harmonia) were ever released for public use6, so the actual implementation of
the data structures and algorithms presented in these papers is unknown. I
hope to remedy this by providing the full source code for Meddle along with
this paper.
    In [1], Ballance, Butcher and Graham suggest the use of grammar abstrac-
tions for text editors. Informally, one grammar is referred to as an abstraction
of another grammar, if every structure described by the abstract grammar is
also described by the original grammar. While abstract grammars are usually
not suitable for compilation, as they may ignore structures required for correct
code generation, they may be suitable for text editors, which are not required to
provide all the semantics of the edited text. For example, abstract grammars
may ignore operator precedence in algebraic expressions. Abstract grammars
may thus result in faster parsers, making them more adequate for dynamic
environments such as text editors.
    All of the papers mentioned so far describe methods for incremental pars-
ing of unambiguous programming languages. In real life, however, many of the
more commonly used languages, such as C, contain semantic ambiguities. This
happens when syntactic analysis is dependent upon semantic information, such
as with the use of type definitions in C: a scanner cannot identify a token as a
  5 harmonia.cs.berkeley.edu
  6 A Harmonia plug-in for Emacs is available in binary form only.



type, without knowing that it has been declared as such (except for primitive
types, such as int and char.) This knowledge, in turn, depends upon correct
parsing of the preceding code (including header files.) This problem is described
and resolved in [4], where generalised grammars are used for the creation of in-
cremental parsers. Semantic information, such as type definitions, is propagated
through the parse tree for correct scanning.
    Finally, Wagner and Graham presented in [5] an optimal incremental parsing
algorithm, based on the Jalili-Gallier algorithm, that is guaranteed to reuse as
much of the previous parsing tree as possible. The algorithm used by Meddle is
sub-optimal, but hopefully simpler and easier to understand. As opposed to the
text editor described in that paper, Meddle does not require multiple versions
of the parse tree to be kept. Instead, it generalises the notion of a parse tree to
that of an ordered parse forest, which is changed in-place. This is described in
section 5. Another difference between the two algorithms is that Meddle uses the
state matching criterion, while [5] suggests a new mechanism, called “sentential-
form” parsing. The first reason for using the latter is ascribed by the authors to
the additional space required for holding the parser state for each node in the
state-matching method. Since the size of each node in Meddle’s parse tree is 56
bytes, saving 4 bytes can hardly justify a change in the algorithm. The second,
and more substantial reason, stated in [5], is that the state-matching condition,
while suitable for LALR(1) parsers, is too restrictive for LR(1) parsers. Meddle
uses the output of Bison to generate the parser’s tables, which are given in
LALR(1) form. Thus the state-matching algorithm of Jalili and Gallier seems
to be adequate in this case.
    Wagner and Graham have also shown that the use of two stacks, a parsing
stack and a tree-reuse stack, is not necessary, as these can be inferred with the
use of a correct data structure. The P-Tree structure, presented in section 5.1,
follows this result.


3    Tracking Changes to the Text Buffer
A typical scenario for a text editor session would start with the user loading
a file into the editor’s buffer. The contents of this file can then be scanned
and parsed using batch-like methods, which create the initial stream of tokens
and parse tree. The major difference between a compiler-like tool and a text
editor is unveiled when the user begins modifying the file. This usually involves
rapid changes to the buffer at random, unpredictable locations. In order for the
changes to be reflected, first in the token stream and then in the parse tree, the
editor needs to rescan and re-parse the buffer. Since an editor needs to be highly
responsive, the batch methods are replaced with more efficient algorithms for
incremental scanning and parsing.
    A natural assumption would be that incremental scanning and parsing should
not commence while the user is actively editing the file, that is, that rapid
changes should be logged until the editor is free of processing user commands,
and can dedicate its resources to text analysis. Note that a multi-threaded ap-
proach would usually not be useful in this case, as we would like the results of
the scanner and parser to affect the visual appearance of the text, so changes
must not occur in the buffer before scanning and parsing are completed.
    One trivial, yet important, observation is that although modern text editors
support a variety of user commands for text manipulation, such as cutting,
pasting, multiple undo/redo levels, etc., the underlying text buffer can only be
modified in two ways: text insertion and range deletion. Some papers ([3],[2])
also suggest that text replacement is a primitive operation, which leads to com-
plex algorithms to support it. It is easy to see that text replacement is actually
a combination of range deletion followed by text insertion, which significantly
simplifies the algorithms involved in the editing-scanning-parsing cycle.
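
As a minimal illustration of this decomposition, the following Python sketch expresses replacement in terms of the two primitive operations. The function names are hypothetical, and a Python string stands in for the editor's text buffer.

```python
def insert_text(buf: str, pos: int, text: str) -> str:
    """Primitive 1: insert `text` at position `pos`."""
    return buf[:pos] + text + buf[pos:]


def delete_range(buf: str, pos: int, length: int) -> str:
    """Primitive 2: delete `length` characters starting at `pos`."""
    return buf[:pos] + buf[pos + length:]


def replace_range(buf: str, pos: int, length: int, text: str) -> str:
    """Replacement is a range deletion followed by a text insertion."""
    return insert_text(delete_range(buf, pos, length), pos, text)


print(replace_range("commons", 5, 2, "dity"))  # commodity
```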
    An easy and efficient way to keep track of changes to the text buffer is given
in [2]. The algorithm described by Beetem and Beetem is based on a data
structure called a kmn-list. This structure is a linked list of nodes, each of
which is composed of 3 fields:

   1. k: The number of (consecutive) unmodified characters;
   2. m: The number of following characters that were deleted;
   3. n: The number of characters inserted after the unmodified ones.

For example, a node written as (5, 2, 4) describes a segment of the text composed
of 5 unmodified characters, followed by 2 deleted characters and 4 inserted
characters (as in the transformation of commons to commodity.) In the rest of
this paper the segment of the text matching the k field will be referred to as
the stable region of the node; the one matching the m field as the deleted region;
and the one matching the n field as the inserted region.
    Consider a text buffer that holds k characters. This buffer can be represented
by a kmn-list that consists of a single node, namely (k, 0, 0). As changes are ap-
plied to the buffer, this list may grow and take the form (k1 , m1 , n1 ), ..., (kl , ml , nl ).
An important property of this list is that the old buffer, before any modifications,
is depicted by the sequence k1 , m1 , k2 , m2 , ..., kl , ml : the buffer contained k1
characters that were not changed, followed by m1 characters that were deleted,
followed by k2 characters that were not changed, etc. Similarly, the new buffer
is depicted by the sequence k1 , n1 , k2 , n2 , ..., kl , nl .
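
The two sequences above give a simple consistency check: the k and m fields must add up to the length of the old buffer, and the k and n fields to the length of the new one. The Python sketch below illustrates this; the KMN class and the function names are assumptions made for the example, and a Python list stands in for the linked list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KMN:
    k: int  # unmodified (stable) characters
    m: int  # characters deleted after the stable region
    n: int  # characters inserted after the stable region


def old_length(kmn_list: List[KMN]) -> int:
    """Length of the buffer before the recorded changes: k1 + m1 + k2 + m2 + ..."""
    return sum(node.k + node.m for node in kmn_list)


def new_length(kmn_list: List[KMN]) -> int:
    """Length of the buffer after the recorded changes: k1 + n1 + k2 + n2 + ..."""
    return sum(node.k + node.n for node in kmn_list)


changes = [KMN(5, 2, 4)]  # "commons" -> "commodity"
print(old_length(changes), new_length(changes))  # 7 9
```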
Example 1. A text buffer holds the following sentence:

          The brown f[ox] jumps over the [lazy ]dog.

We now change this sentence to read

          The brown f[erret] jumps[ all] over the dog.

The bracketed segments in the original sentence were thus deleted, and the
bracketed segments in the new sentence were inserted. These changes are
represented by the following kmn-list:

                          (11, 2, 5); (6, 0, 4); (10, 5, 0); (4, 0, 0)
This list is visually depicted in figure 2. Note that the list is compact: the first
node contains both the deleted region and the inserted region that follow the
first stable region.




Figure 2: The kmn-list described in example 1. The upper bar illustrates the
buffer before the changes, as described by the k and m fields of the list nodes.
The lower bar shows the state of affairs in the new buffer, as described by the
k and n fields of the list.

    In order to maintain the kmn-list, [2] suggests the use of a single function,
called KMN-Insert. This function accepts the position of a change pos, the
number of characters deleted from pos, and the number of characters inserted at
pos. The function then modifies the kmn-list to include the changes specified by
these parameters. The use of a single, unnecessarily complex, function is derived
from the need to support text replacement as a primitive buffer operation. This
function can be replaced by two, much simpler, ones, for the separate handling
of text insertions and range deletions.
    We first look at insertions. The insert operation is handled by the function
KMN-Insert, described in algorithm 1. Line 1 skips all kmn nodes that are
not affected by the change. Note that in order to calculate the segment of the
buffer spanned by a node we take into account both its stable and its inserted
regions, which are the fields that describe the state of the current buffer. Once
we have found the node that is affected by the change, we are left with 2 cases
to consider:

Case 1 Insertion was made inside the stable region of a node (lines 3-6.) The
node is then split into two: the first node holds the first part of the stable
region (up to the insert position), and the newly inserted region; the second
node holds the other part of the original stable region (after the insert
position) as well as all previous changes recorded by the original node.

Case 2 Insertion was made inside the inserted region of a node, or immediately
after it (lines 7-8.) In this case the inserted region of the node is simply
increased to include the new region.

  Note that, unlike the KMN-Insert function described in [2], text insertion
does not require any node merges in order for the list to remain compact.
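
The two cases can be sketched in Python as follows. This is an illustration under stated assumptions rather than Meddle's actual code: the KMN class and the function name kmn_insert are hypothetical, a Python list stands in for the linked list, and node positions are accumulated while walking the list. The sketch skips only nodes that end strictly before the insert position, so that an insertion immediately after an inserted region falls under case 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KMN:
    k: int  # stable characters
    m: int  # deleted characters
    n: int  # inserted characters


def kmn_insert(kmn: List[KMN], inspos: int, length: int) -> None:
    """Record the insertion of `length` characters at buffer position `inspos`."""
    pos = 0  # start of the current node in the current buffer
    i = 0
    # Skip nodes that end strictly before the insertion point.
    while i < len(kmn) - 1 and pos + kmn[i].k + kmn[i].n < inspos:
        pos += kmn[i].k + kmn[i].n
        i += 1
    node = kmn[i]
    if pos + node.k > inspos:
        # Case 1: split the stable region around the insertion point.
        head = KMN(inspos - pos, 0, length)
        node.k -= head.k
        kmn.insert(i, head)
    else:
        # Case 2: grow the existing inserted region.
        node.n += length


changes = [KMN(20, 0, 0)]      # a 20-character buffer with no changes
kmn_insert(changes, 5, 1)      # type one character at position 5
kmn_insert(changes, 6, 1)      # type the next character right after it
print(changes)                 # two nodes: (5, 0, 2) and (15, 0, 0)
```

Typing consecutive characters therefore extends a single inserted region instead of adding a node per keystroke.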
    Range deletion, on the other hand, is somewhat harder. This operation is
handled by the function KMN-Delete, described in algorithm 2. As with text
insertion, we begin by finding the first node N affected by the change (lines
1-2.) We then consider two cases:

Case 1 The deleted range is contained in its entirety within the stable region
of N (lines 3-8.) We split N in a way that is similar to that described in
algorithm 1, only with the m field now taken into account.




Case 2 The deleted range goes to the end of the stable region of N, and
perhaps beyond that (lines 9-31.) This case involves several stages (from
stage 2 onwards, the variable d holds the number of characters that still need
to be deleted):

    1. Delete as many characters as possible from the stable region of N, updating
       its stable and deleted regions (lines 10-13.)
    2. For this stage we note that the inserted region of a node along with the
       stable region of the next node always form a contiguous segment of the text
       buffer. Remove all nodes for which the inserted region of their predecessor
       along with their stable region is completely included within the deleted
       range (lines 14-19.) All previous deletions in these nodes are incorporated
       into the deleted region of N , and the inserted region of N is set to be the
       inserted region of the last node removed (since all other previously-inserted
       regions were deleted.)
    3. Delete as many characters as possible from the inserted region of N (lines
       20-21.)
    4. If there are still characters to delete, remove them from the stable region
       of the next node (lines 22-24.)
    5. N may now contain no changes at all. If that is the case, merge it with
       the next node (lines 25-30.)
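
The stages above can be sketched in Python as follows. This is an illustration under stated assumptions, not Meddle's actual code: KMN, kmn_delete and the local variable keep are hypothetical names, and a Python list stands in for the linked list. Two details are choices of this sketch rather than statements of the text: stable characters that are deleted are added to the m field, so that the k and m fields still add up to the length of the old buffer, and keep counts the inserted characters that precede a deletion starting inside an inserted region (as happens when typing is followed by a backspace).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KMN:
    k: int  # stable characters
    m: int  # deleted characters
    n: int  # inserted characters


def kmn_delete(kmn: List[KMN], delpos: int, length: int) -> None:
    """Record the deletion of `length` characters starting at `delpos`."""
    pos, i = 0, 0
    # Find the first node whose span (stable + inserted) extends past delpos.
    while i < len(kmn) - 1 and pos + kmn[i].k + kmn[i].n <= delpos:
        pos += kmn[i].k + kmn[i].n
        i += 1
    node = kmn[i]
    if pos + node.k > delpos + length:
        # Case 1: the range lies strictly inside the stable region; split.
        head = KMN(delpos - pos, length, 0)
        node.k -= head.k + length
        kmn.insert(i, head)
        return
    # Case 2, stage 1: delete from the tail of the stable region.
    d = min(length, max(0, pos + node.k - delpos))
    node.k -= d
    node.m += d
    d = length - d                  # characters still to be deleted
    keep = delpos - (pos + node.k)  # inserted characters before the deleted range
    # Stage 2: remove following nodes that are deleted in their entirety.
    while i + 1 < len(kmn) and d >= (node.n - keep) + kmn[i + 1].k:
        nxt = kmn.pop(i + 1)
        d -= (node.n - keep) + nxt.k
        node.m += nxt.k + nxt.m
        node.n = keep + nxt.n
    # Stage 3: delete from the inserted region.
    r = min(d, node.n - keep)
    node.n -= r
    d -= r
    # Stage 4: delete from the head of the next stable region.
    if d > 0:
        kmn[i + 1].k -= d
        node.m += d
    # Stage 5: a node left without changes is merged with its successor.
    if node.m == 0 and node.n == 0 and i + 1 < len(kmn):
        nxt = kmn.pop(i + 1)
        node.k += nxt.k
        node.m, node.n = nxt.m, nxt.n


changes = [KMN(5, 0, 3), KMN(15, 0, 0)]  # three characters typed at position 5
kmn_delete(changes, 7, 1)                # backspace over the last typed character
print(changes)                           # two nodes: (5, 0, 2) and (15, 0, 0)
```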

     The time complexity of both algorithms is linear in the length of the kmn-
list. The original paper raises concerns as to the increase in this length, and
suggests that it should be bounded by a fixed size [2]. Such a bound requires
further handling by the algorithms. Beetem and Beetem give 12 as a reasonable
upper bound, based on their experiments. My own experiments, however, show
that the list never grows, in practice, beyond two nodes: Meddle rescans the
text every 300 milliseconds, and successful scanning resets the kmn-list to a
single node (see section 4.) For the list to grow beyond two nodes, a user
needs to perform two non-consecutive changes in less than 0.3 seconds, which is
practically impossible. The only way for the list to grow further is for scanner
errors to occur, which is quite rare.
     As a result of this analysis I have decided not to complicate the algorithms
by using an upper bound. In fact, even though kmn-lists correctly handle
any number of changes, they may not be necessary in practice: if scanning is
triggered by the first non-consecutive change instead of a timer, we can simply
use a single record of the change’s position, type and length. This technique is
left for future study.


4     Incremental Scanning
Scanning refers to the process that takes as input a stream of characters, and
outputs a stream of tokens. In the case of programming languages, tokens are
the terminal symbols defined by the grammar of the language7. The objective
of incremental scanning is to scan as few tokens as possible, and still achieve
   7 Some tokens required for syntactic analysis, such as comments and preprocessor directives, are not defined in the grammar, and will be treated in a special way.


Algorithm 1 KMN-Insert(inspos, len)
 1: Skip all nodes for which pos[node] + node.k + node.n < inspos
 2: node ← first node for which the above condition does not hold
 3: if pos[node] + node.k > inspos then
 4:    Add a node new before node
 5:    new.k ← inspos − pos[node], new.m ← 0, new.n ← len
 6:    node.k ← node.k − new.k
 7: else
 8:    node.n ← node.n + len
 9: end if




Algorithm 2 KMN-Delete(delpos, len)
 1: Skip all nodes for which pos[node] + node.k + node.n ≤ delpos
 2: node ← first node for which the above condition does not hold
 3: if pos[node] + node.k > delpos + len then
 4:    Add a node new before node
 5:    new.k ← delpos − pos[node]
 6:    new.m ← len
 7:    new.n ← 0
 8:    node.k ← node.k − new.k − len
 9: else
10:    d ← min{len, max{0, pos[node] + node.k − delpos}}
11:    node.k ← node.k − d
12:    node.m ← node.m + d
13:    d ← len − d
14:    while next[node] ≠ Nil and d ≥ node.n + next[node].k do
15:       d ← d − node.n − next[node].k
16:       node.n ← next[node].n
17:       node.m ← node.m + next[node].k + next[node].m
18:       Remove next[node]
19:    end while
20:    r ← min{d, node.n}
21:    node.n ← node.n − r, d ← d − r
22:    if d > 0 then
23:       next[node].k ← next[node].k − d, node.m ← node.m + d
24:    end if
25:    if node.m = 0 and node.n = 0 and next[node] ≠ Nil then
26:       node.k ← node.k + next[node].k
27:       node.m ← next[node].m
28:       node.n ← next[node].n
29:       Remove next[node]
30:    end if
31: end if




Figure 3: The token streams for the code segments given in example 2. Each
token holds a terminal symbol from the grammar, and a pair of coordinates
(p, l), which specify its position and length, respectively.


a complete and correct stream of tokens for the edited text. This requires an
iteration over a list of modifications made to the buffer since the last scan, and
correct identification of the tokens that were affected by these changes.
    The task of identifying the tokens that need to be rescanned may be harder
than it seems at first sight. First of all, not all tokens affected by a change to
the buffer need to actually be rescanned: all tokens that follow a change (either
insertion or deletion) to the text are shifted, but do not necessarily change
their contents. On the other hand, some tokens that reside outside the changed
region, may still need to be rescanned, as shown in example 2.
Example 2. Consider the following code written in some programming lan-
guage:
      nn := 10
      pri nt n
The errors in the code are then corrected by the programmer, and the new code
has the following form:
      n := 10
      print n
The token streams for the original and modified code are given in figure 3a and
figure 3b, respectively. Each token shows the terminal symbol it represents, as
well as its position in the buffer and its length (print is assumed to be a reserved
keyword.) The first token was changed, and needs to be rescanned. The next 3
tokens changed their positions, but no scanning is required for them. The next
token, however, must be rescanned, even though no change was made to the
region spanned by it. If the scanner does not recognise that, it will leave pri
as a separate token from nt, which results in a completely different semantic
interpretation.
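
The token positions in this example can be reproduced with a small batch scanner. The Python sketch below is only an illustration: the token classes, the regular expression and the function name scan are assumptions made for this example, and the newline is assumed to be a token of its own (which is consistent with three tokens lying between the first token and pri).

```python
import re
from typing import List, Tuple

# Hypothetical token classes for the toy language of example 2.
TOKEN_RE = re.compile(
    r"(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<ASSIGN>:=)|(?P<NEWLINE>\n)"
)
KEYWORDS = {"print"}


def scan(text: str) -> List[Tuple[str, str, int, int]]:
    """Batch-scan `text` into (symbol, lexeme, position, length) tuples."""
    tokens = []
    for match in TOKEN_RE.finditer(text):  # blanks match nothing and are skipped
        kind, lexeme = match.lastgroup, match.group()
        if kind == "ID" and lexeme in KEYWORDS:
            kind = lexeme.upper()          # reserved keyword
        tokens.append((kind, lexeme, match.start(), len(lexeme)))
    return tokens


for source in ("nn := 10\npri nt n", "n := 10\nprint n"):
    print([(kind, p, l) for kind, _, p, l in scan(source)])
```

Comparing the two printed lists shows that the tokens between the first identifier and pri keep their symbols and lengths and only change position, while pri and nt are replaced by a single print token.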
  The algorithm for incremental scanning is based on the structure of the
kmn-list. Informally, this algorithm can be stated as follows:

          As long as we are in a stable region, and the current token is fully
      contained within that region, advance to the next token. Otherwise,
      start scanning tokens and separators until we are again in a stable
      region. [2]


    While this definition is correct, and captures the gist of the algorithm, it is
not accurate enough for a concrete algorithm to be developed. For instance,
it does not specify where exactly scanning should start, and which tokens are
replaced during (re)scanning. The pseudo-code for incremental scanning given
in [2] is, on the other hand, difficult to comprehend and relies heavily on the
implementation of the parse tree given in that paper. These difficulties required
that a new algorithm for incremental scanning be developed for Meddle. The
function Inc-Scan, given in algorithm 3, is based on the general idea quoted
above, but provides complete details as to the parts of the text that need to be
rescanned, and the tokens that are replaced as a result.
    Inc-Scan accepts a list of text modifications (in the form of a kmn-list),
and a list of tokens. The variable node holds the head of the kmn-list, shift is
the value by which to change the position of tokens (positive or negative), and
last points to the last token that does not require scanning (initially set to Nil
to suggest that all tokens are candidates for rescanning.)
    The algorithm begins by evaluating the changes in the first node of the kmn-
list (since the list is compact, each node in that list, with the exception of the
last, must contain some changes.) It first skips all tokens that are contained in
the stable region defined by this node (lines 3-6.) A token token is said to be
within the stable region of the node node if

           pos[token] + length[token] + shift < pos[node] + node.k
Note that this is a strict inequality, or else tokens that abut unstable regions
will not be rescanned (such as pri in example 2.)
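
The effect of the strict inequality can be checked on example 2. The Python sketch below uses a hypothetical helper, in_stable_region, and assumes that the two edits of example 2 were recorded as one deleted character near the start of the buffer and one deleted blank between pri and nt. Once the first change has been consumed, the current node then starts at position 0 with 11 stable characters, and shift is -1.

```python
def in_stable_region(token_pos: int, token_len: int, shift: int,
                     node_pos: int, node_k: int) -> bool:
    """True if the shifted token ends strictly inside the node's stable region."""
    return token_pos + token_len + shift < node_pos + node_k


# Old token coordinates (p, l) from example 2; node at position 0 with k = 11.
print(in_stable_region(3, 2, -1, 0, 11))  # ":=" at (3, 2): True, only shifted
print(in_stable_region(9, 3, -1, 0, 11))  # "pri" at (9, 3): False, rescanned
```

The token pri ends exactly where the stable region ends (9 + 3 - 1 = 11), so a non-strict comparison would wrongly keep it.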
    The next stage is to rescan tokens (lines 7-17.) Scanned tokens are inserted
into a temporary list, so scanner errors do not damage the original token list.
Scanning is then started on the first character that immediately follows last
(lines 8-12.) Recall that last holds the last token that does not need to be
rescanned. This means that token delimiters may be rescanned, even if they are
contained within a stable region. Though this may seem somewhat less efficient,
it greatly simplifies the incremental scanning algorithm (and overall increases
its efficiency.)
    The tokens are scanned and appended to the temporary list, until a stable
region is reached again (lines 13-17.) Once a token has been scanned, the
function KMN-Move, described in algorithm 4, sets the current kmn node
to be the one that spans the new position of the scanner. This, however, is
not done by a simple iteration over the kmn-list. Instead, all nodes before
the position passed to KMN-Move are considered as “consumed”, i.e., that
the changes that were represented by these nodes were applied. KMN-Move
therefore resets the current node, and merges the changes represented by the
next one into it. This is why node does not change, in effect, but is rather
expanded to contain its successors. The shift value is updated while kmn nodes
are consumed.
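
The consumption of kmn nodes can be sketched in Python as follows. This is an illustration under stated assumptions, not Meddle's actual code: KMN and kmn_move are hypothetical names, a Python list stands in for the linked list, and the current node is assumed to be the head of the list (so its position is 0). The extra loop condition, which stops once a single node without changes remains, is an addition of this sketch that keeps the loop from spinning when the scanner reaches the end of the buffer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KMN:
    k: int  # stable characters
    m: int  # deleted characters
    n: int  # inserted characters


def kmn_move(kmn: List[KMN], pos: int, shift: int) -> int:
    """Consume all changes that end at or before `pos`; return the new shift."""
    node = kmn[0]
    while node.k + node.n <= pos and (node.m or node.n or len(kmn) > 1):
        shift += node.n - node.m   # applied changes move the following tokens
        node.k += node.n           # inserted text now counts as stable
        if len(kmn) > 1:
            nxt = kmn.pop(1)       # merge the next node into the current one
            node.k += nxt.k
            node.m, node.n = nxt.m, nxt.n
        else:
            node.m = node.n = 0
    return shift


# Example 2, assuming the edits were recorded as (1, 1, 0); (10, 1, 0); (4, 0, 0).
changes = [KMN(1, 1, 0), KMN(10, 1, 0), KMN(4, 0, 0)]
shift = kmn_move(changes, 1, 0)       # after rescanning "n", which ends at 1
print(shift, changes)                 # -1, nodes (11, 1, 0) and (4, 0, 0)
shift = kmn_move(changes, 13, shift)  # after rescanning "print", which ends at 13
print(shift, changes)                 # -2, single node (15, 0, 0)
```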
    Before the new tokens can be merged into the token list, the ones invalidated
by the recent changes need to be removed (lines 18-19.) The tokens that should
be deleted are all the successors of last that end before the new position of the
scanner. However, this position needs to be translated first (line 18), as it is
given in coordinates relative to the current buffer, while the tokens to be deleted
hold their positions relative to the old buffer.


Algorithm 3 Inc-Scan(kmn_list, token_list)
 1: node ← head[kmn_list], shift ← 0, last ← Nil
 2: repeat
 3:   for all token in token_list such that token ∈ stable[node] do
 4:      pos[token] ← pos[token] + shift
 5:      last ← token
 6:   end for
 7:   Create an empty token list temp_list
 8:   if last ≠ Nil then
 9:      Start scanning at pos[last] + length[last]
10:   else
11:      Start scanning at position 0
12:   end if
13:   repeat
14:      Read a new token token
15:      Append token to temp_list
16:      shift ← KMN-Move(pos[token] + length[token], node, shift)
17:   until token ∈ stable[node]
18:   old_scan_pos ← pos[token] + length[token] − shift
19:   Remove all tokens after last that end before old_scan_pos
20:   Merge temp_list into token_list after last
21:   last ← last scanned token
22: until The kmn_list contains no changes, or a scanner error has occurred




Algorithm 4 KMN-Move(pos, node, shift)
 1: while Position[node] + node.k + node.n ≤ pos do
 2:   shift ← shift + node.n − node.m
 3:   node.k ← node.k + node.n
 4:   if Next[node] ≠ Nil then
 5:      node.k ← node.k + Next[node].k
 6:      node.m ← Next[node].m
 7:      node.n ← Next[node].n, Remove Next[node]
 8:   else
 9:      node.m ← 0
10:      node.n ← 0
11:   end if
12: end while
13: return shift




               File Size   Batch Scanner     Incremental Scanner
                1 Kb           3.2 ms              0.2 ms
                10 Kb         13.1 ms              0.4 ms
               100 Kb        143.2 ms              3.3 ms

Table 1: The average scan times for different file sizes. Each file was modified
using the same set of tests, described in appendix A. Note that the times
specified for the incremental scanner include the initial (batch-like) scan.


    The final step is to merge the temporary list into the original stream, re-
placing the deleted tokens (lines 20-21.) The algorithm goes back to the first
stage, using the updated kmn node, until all changes are applied, or a scanner
error occurs. A successful scan resets the kmn list, so that the single remaining
node does not contain any changes (i.e., node.m = node.n = 0.) A scanner
error causes the algorithm to abort. This should happen immediately after the
scanner returns with an error result, so that the kmn nodes holding the next
changes are not consumed.
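
    As an illustration, Algorithm 4 might be transcribed as follows. This is a
sketch under stated assumptions, not Meddle's code: the KmnNode class and its
field names are hypothetical, position is taken to be the node's starting offset
in the current buffer, and k, m and n are assumed to count kept, removed and
inserted characters. Two details that the pseudocode leaves open are also
assumed here: the merged successor is unlinked from the list, and the loop stops
once the last node has been folded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KmnNode:
    # Hypothetical layout: field names are assumptions, not Meddle's.
    position: int                      # start offset in the current buffer
    k: int                             # characters kept
    m: int                             # characters removed
    n: int                             # characters inserted
    next: Optional["KmnNode"] = None


def kmn_move(pos: int, node: KmnNode, shift: int) -> int:
    """Fold every change that ends at or before pos into node."""
    while node.position + node.k + node.n <= pos:
        shift += node.n - node.m       # new position = old position + shift
        node.k += node.n               # inserted text now counts as kept
        succ = node.next
        if succ is None:
            node.m = node.n = 0
            break                      # assumption: nothing left to consume
        node.k += succ.k
        node.m, node.n = succ.m, succ.n
        node.next = succ.next          # assumption: merged node is unlinked
    return shift


# Keep 5 characters, replace 2 by 3, then keep 10 more.
head = KmnNode(0, 5, 2, 3, KmnNode(8, 10, 0, 0))
shift = kmn_move(8, head, 0)
```

Once the scanner has passed offset 8 the change is fully consumed, so the
returned shift is 1 and the single remaining node describes 18 kept characters.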
    Even though only modified tokens are rescanned, Inc-Scan is linear in the
number of tokens in the stream. This is because it needs to iterate over the entire
stream in order to determine which tokens need to be rescanned, and to shift
the tokens that are kept. This raises the natural question of whether incremental
scanning is at all necessary. This concern is further justified by the nature of
the scanning process itself: scanners are based on deterministic finite automata,
so their time complexity is linear in the size of the input text.
    Empirical tests show, however, that incremental scanning is significantly faster
than batch-mode scanning (see table 1.) Moreover, the difference between the
scanning times of the two methods increases with the size of the text. These
results cannot be solely ascribed to the difference between linearity in the size of
the token stream vs. linearity in the size of the text (characters), as this should
have resulted in a constant improvement only. A better suggestion is that scan-
ning involves more than just identifying tokens. It involves allocating memory,
maintaining the stream (which, in the case of Meddle, involves more than just
a linked list, as explained in section 5) and other tasks, such as updating the
syntax highlighting tags.
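
    The claim that the gap widens with file size can be checked directly
against the figures of table 1. The short calculation below only restates the
published numbers; the variable names are illustrative.

```python
# Average scan times from table 1, in milliseconds: (batch, incremental).
table1 = {1: (3.2, 0.2), 10: (13.1, 0.4), 100: (143.2, 3.3)}

# Ratio of batch time to incremental time for each file size (in Kb).
speedup = {kb: batch / inc for kb, (batch, inc) in table1.items()}

for kb, ratio in speedup.items():
    print(f"{kb:>3} Kb: incremental scanning is about {ratio:.0f} times faster")
```

The ratios come to roughly 16, 33 and 43, so the improvement is not a constant
factor.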


5     Incremental Parsing
The most challenging part of the editing-scanning-parsing cycle is incremental
parsing. Not only is parsing inherently more difficult than scanning, identifying
and managing sub-trees is significantly harder than identifying and managing
linear streams of tokens.

5.1    The P-Tree Data Structure
The purpose of the P-Tree (for Parse Tree) data structure is to facilitate
incremental parsing, based on the Jalili-Gallier algorithm. Moreover, a P-Tree
provides multiple views, both as a parse tree and as a list of tokens8, which
allows it to be used for the scanning procedure as well as for parsing. Finally, a
P-Tree is used to implement both stacks (the parse stack and tree-reuse stack)
required by the algorithm.
    The P-Tree is actually not a parse tree, but rather an ordered parse forest.
At any given point, a P-Tree holds a list of correctly parsed trees, referred
to as the stream, which are candidates for reuse as parts of a new tree being
constructed. These trees may also be singletons, in which case the node is a
terminal node that needs to be parsed.
    Each node in a P-Tree has the following fields:

    • The grammar symbol (terminal or non-terminal) represented by this node;
    • Position in the text buffer (leftmost character);
    • Length (total number of characters);
    • The state of the scanner after reading this token (for terminal nodes only);
    • The state of the parser before shifting this node;
    • The state of the parser after shifting this node9 ;
    • The look-ahead symbol that was used to determine that this node needs
      to be shifted;
    • Pointers to the node’s parent, siblings, first child and last child.
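
    The fields listed above could be grouped into a record such as the one
sketched below. The class and attribute names are assumptions made for
illustration, as is the use of integers for scanner and parser states; the
is_terminal test further assumes that no non-terminal has an empty yield.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False, repr=False)       # identity semantics; avoids cyclic repr
class PTreeNode:
    symbol: str                             # grammar symbol
    pos: int = 0                            # leftmost character in the buffer
    length: int = 0                         # total number of characters
    scan_state: Optional[int] = None        # terminals only
    prev_state: Optional[int] = None        # parser state before shifting
    next_state: Optional[int] = None        # parser state after shifting
    lookahead: Optional[str] = None         # symbol that triggered the shift
    parent: Optional["PTreeNode"] = None
    prev: Optional["PTreeNode"] = None      # previous sibling or stream node
    next: Optional["PTreeNode"] = None      # next sibling or stream node
    first_child: Optional["PTreeNode"] = None
    last_child: Optional["PTreeNode"] = None

    @property
    def is_terminal(self) -> bool:
        # Assumption: a node without children is a token.
        return self.first_child is None


leaf = PTreeNode("identifier", pos=4, length=3)
root = PTreeNode("expr", pos=4, length=3, first_child=leaf, last_child=leaf)
leaf.parent = root
```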

    The stream of trees is maintained by keeping two permanent nodes, or sen-
tinels [5], to the beginning of the stream (BOS) and to its end (EOS.) The
sibling pointers of each node on the stream point to the immediate neighbours
of that node on the stream, which allows the implementation of the stream as
a linked list. Note that this is a generalisation of the sibling definition for parse
trees: nodes that are not on the stream are internal nodes or leaves of the trees
in the parse forest, and point to their siblings in the usual way. Nodes on the
stream, on the other hand, are roots of trees in the parse forest, so their siblings
are roots of other trees.
    In addition to the stream of trees, the P-Tree structure also maintains the list
of tokens, as generated by the scanner. This list is required both for incremental
scanning, as described by the algorithms presented in section 4, as well as for
parsing, as will be shown later. The first token in the list is found by taking the
node on the stream immediately following the BOS, and descending along its
left branch until a terminal node (the leaf) is found. The next token operation
is implemented as follows:

    • If the current token has an immediate successor, it is the next token in
      the list;
    • Otherwise, the current token is the right-most child of its parent. Move
      up the tree until a node with a successor is found. Take that successor,
      and move along the left branch of its subtree, until a terminal node is
      found.

Figure 4: Implementing reduction on a P-Tree. The string aBb is reduced to the
single non-terminal C, based on the look-ahead symbol c. A new non-terminal
node for C replaces the string nodes on the stream, which become its children.
The rectangular nodes are terminals, while the circular nodes are non-terminals.

   8 Henceforth, we will use the term tokens to refer to the P-Tree leaves whenever we discuss
the scanner, and the term terminal nodes whenever we discuss the parser. Note, however,
that in effect these are the same nodes.
   9 Recall that the stack holds pairs of nodes and states. This is implemented by letting the
nodes hold the state of the parser, which is also useful for tree-reuse, as will be shown later.
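
    The token-list view described above can be sketched as two small
functions. The Node class below is a stripped-down, hypothetical stand-in for a
P-Tree node, and link is a helper introduced only for this example. The sketch
also descends along the left branch when the immediate successor is itself a
non-terminal, which the first rule leaves implicit.

```python
class Node:
    """A P-Tree node reduced to its symbol and pointers (illustrative)."""

    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None
        if children:
            link(children)
            for child in children:
                child.parent = self
            self.first_child, self.last_child = children[0], children[-1]


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def first_token(bos):
    node = bos.next
    while node.first_child is not None:    # descend along the left branch
        node = node.first_child
    return node


def next_token(token):
    node = token
    while node.next is None:               # right-most child: move up
        node = node.parent
        if node is None:
            return None                    # token was the EOS sentinel
    node = node.next
    while node.first_child is not None:    # descend along the left branch
        node = node.first_child
    return node


# Stream: BOS, C(a, B(x, y), b), c, EOS
a, x, y, b, c = (Node(s) for s in "axybc")
tree = Node("C", [a, Node("B", [x, y]), b])
bos, eos = Node("BOS"), Node("EOS")
link([bos, tree, c, eos])

seen, tok = [], first_token(bos)
while tok is not eos:
    seen.append(tok.symbol)
    tok = next_token(tok)
```

Iterating until the EOS sentinel is reached visits the leaves in text order,
regardless of how deeply they are nested.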

   A complete parse tree is represented by a stream that contains the BOS
node, followed by the root of the parse tree, and ends in the EOS node. Note,
however, that if an error occurs during parsing, the P-Tree structure still holds
a correct stream of terminals that were not yet shifted, and all non-terminals
that were shifted before the error was encountered. This stream can thus be
used for re-parsing after the error is corrected, as explained in the next sections.

5.2    The Parser
The parser works by keeping a pointer to a node on the stream, which symbolises
the top of the stack. All nodes on the stream to the left of this pointer are on
the stack, while the nodes to the right are tokens that were not yet parsed (in
the case of terminal nodes), or the roots of trees that are considered for reuse.
Thus the stacks suggested by Jalili and Gallier do not need to be implemented
separately, which greatly simplifies the algorithms [5].
    After the initial scan, the stream of the P-Tree is composed of terminal
nodes only, which correspond to the scanned stream of tokens. The parser then
works just as a batch parser: each token is considered as the look-ahead symbol
of the tokens preceding it, and once all reductions have been applied, is pushed on
the stack (if the scanned text is semantically correct.) Pushing is achieved by
simply advancing the parser’s pointer to the next node on the stream.
    Reducing a string to a single non-terminal symbol is also easy: a new node
is created for the non-terminal symbol. The first and last children of the new
node are set to be the left-most and right-most nodes in the string, respectively.
Finally, the string is detached from the stream, by having its immediate
predecessor and successor nodes point to the new node. Thus the new non-terminal
node replaces the nodes of the string on the stream. This procedure is illustrated in
figure 4.
    Note that very few pointers actually need to be changed in this procedure:
the structures of the trees rooted at the string nodes do not change, nor does
the order of the nodes in the string. Only the semantics of the order changes:
instead of pointing to the next and previous nodes on the stream, each node in
the string points to the next and previous child of the new tree root. The only
pointers that need to be handled are those of the new node, the parent pointers
of the string nodes, the previous node pointer of the left-most node in the string
and the next node pointer of the right-most node in the string (the last two are
set to Nil.)
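
    The pointer changes involved in a reduction can be made concrete with a
short sketch. The Node class, the link helper and the reduce function are
hypothetical names chosen for this example, and the stream is assumed to be
bracketed by the BOS and EOS sentinels so that the neighbours of the string
always exist.

```python
class Node:
    """A P-Tree node reduced to its symbol and pointers (illustrative)."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def reduce(symbol, first, last):
    """Replace the stream nodes first..last by one new non-terminal node."""
    parent = Node(symbol)
    before, after = first.prev, last.next  # sentinels keep these non-Nil
    node = first
    while True:                            # parent pointers of the string
        node.parent = parent
        if node is last:
            break
        node = node.next
    parent.first_child, parent.last_child = first, last
    first.prev = last.next = None          # ends of the string become Nil
    parent.prev, parent.next = before, after
    before.next = after.prev = parent
    return parent


# The situation of figure 4: the string a B b is reduced to C.
bos, a, B, b, c, eos = (Node(s) for s in ["BOS", "a", "B", "b", "c", "EOS"])
link([bos, a, B, b, c, eos])
C = reduce("C", a, b)
```

The sibling pointers inside the string are left untouched; they now describe
the order of the children of C rather than the order of the stream.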

5.3       Tree Reuse
The heart of the incremental parsing algorithm is tree reuse. The goal of the
algorithm is to identify as many trees as possible on the stream that can be
pushed on the stack without examining their internal structure.
    The incremental parsing algorithm implemented in Meddle is based on the
tree reuse rule suggested in proposition 1: a non-terminal node representing a
sub-tree created during a previous parse can be shifted immediately if the state
of the parser and the look-ahead symbol match the ones used during the last
parse.
    Note, however, that while this condition is sufficient, it is not necessary. In
fact, the non-terminal node can be shifted even if the look-ahead symbol has
changed, as long as it is in the look-ahead set for the state of the parser before
shifting [5]. However, since Meddle is based on the output of Bison, which does
not provide this information, it uses the original condition as specified by Jalili
and Gallier.
    If the tree reuse condition cannot be matched for a given node, the tree
rooted at this node needs to be broken down. There are two ways to do so,
based on the part of the condition that has failed:

Left break-down All non-terminal nodes on the left-most branch (that is, all
     nodes on that branch but the leaf) are removed. The root of the tree, which
     resides on the stream, is replaced by a string composed of the immediate
     children of the removed nodes, from left to right.
Right break-down All non-terminal nodes on the right-most branch are re-
    moved. The root of the tree is replaced by a string composed of the
    immediate children of the removed nodes, from left to right.

Figure 5 shows examples of left and right break-downs. Note that these operations
are referred to in [3] as Replace and Undo-Reductions, respectively.
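
    The two break-down operations might be implemented along the following
lines. This is a sketch only: Node, link, splice and the two function names are
illustrative, and the stream is assumed to be delimited by sentinels. Both
functions return the first node that ends up on the stream.

```python
class Node:
    """A P-Tree node reduced to its symbol and pointers (illustrative)."""

    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None
        if children:
            link(children)
            for child in children:
                child.parent = self
            self.first_child, self.last_child = children[0], children[-1]


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def splice(root, pieces):
    """Put pieces on the stream in place of root; return the first of them."""
    before, after = root.prev, root.next   # sentinels keep these non-Nil
    for piece in pieces:
        piece.parent = None
    link([before, *pieces, after])
    return pieces[0]


def left_breakdown(root):
    spine, node = [], root
    while node.first_child is not None:    # walk down the left-most branch
        spine.append(node)
        node = node.first_child
    pieces = [node]                        # the left-most leaf is kept
    for removed in reversed(spine):        # bottom-up gives left-to-right order
        sibling = removed.first_child.next
        while sibling is not None:
            pieces.append(sibling)
            sibling = sibling.next
    return splice(root, pieces)


def right_breakdown(root):
    pieces, node = [], root
    while node.first_child is not None:    # walk down the right-most branch
        child = node.first_child
        while child is not node.last_child:
            pieces.append(child)
            child = child.next
        node = node.last_child
    pieces.append(node)                    # the right-most leaf is kept
    return splice(root, pieces)


def build():
    """Stream: BOS, S(A(a, b), c, D(d, e)), EOS."""
    root = Node("S", [Node("A", [Node("a"), Node("b")]), Node("c"),
                      Node("D", [Node("d"), Node("e")])])
    bos = Node("BOS")
    link([bos, root, Node("EOS")])
    return bos, root


def stream_symbols(bos):
    out, node = [], bos.next
    while node.next is not None:           # stop at EOS
        out.append(node.symbol)
        node = node.next
    return out
```

For the tree S(A(a, b), c, D(d, e)), a left break-down leaves a, b, c and D on
the stream, while a right break-down leaves A, c, d and e.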
    Unfortunately, [3] does not clearly justify the use of these operations. To
explain them, we need to make the following two observations:
Proposition 2. Let $\alpha_1, \ldots, \alpha_n \in (V_T \cup V_N)^+$ be strings in an LR(1)
grammar $G = (V_T, V_N, P, S)$. If a parser performs the set of reductions
$\alpha_1 \Rightarrow \alpha_2 \Rightarrow \ldots \Rightarrow \alpha_n$, then the parser’s state before shifting the
left-most symbol of each string $\alpha_i$ is the same.10
Proof. Let $\sigma_i$ denote the first symbol of the string $\alpha_i$, and let $s$ be the state of
the parser just before shifting $\sigma_i$. Before the parser shifts the left-most symbol
in string $\alpha_{i+1}$, denoted by $\sigma_{i+1}$, it must pop all symbols (and states) that
correspond to the portion of the string in $\alpha_i$ that is spanned by that symbol.
By the correctness of the parsing algorithm, and since both $\sigma_i$ and $\sigma_{i+1}$ are on
the left-most branch of the tree, the last symbol to be popped is $\sigma_i$. This brings
the parser’s stack back to its state before $\sigma_i$ was shifted, and thus $\sigma_{i+1}$ is shifted
using the same parser state as $\sigma_i$. By induction, this is true for all symbols in
the left-most branch.
Proposition 3. Let $\alpha_1, \ldots, \alpha_n \in (V_T \cup V_N)^+$ be strings in an LR(1) grammar
$G = (V_T, V_N, P, S)$, and let $A_2, \ldots, A_{n+1}$ be non-terminal symbols. If a parser
performs the set of reductions $\alpha_1 \Rightarrow \alpha_2 A_2 \Rightarrow \ldots \Rightarrow \alpha_n A_n \Rightarrow A_{n+1}$, then all
these reductions are made based on the same look-ahead symbol.
Proof. Consider the first reduction $\alpha_1 \Rightarrow \alpha_2 A_2$. This reduction is performed
based on the first terminal symbol that follows $\alpha_1$. $A_2$ then replaces a substring
of $\alpha_1$ that is terminated by the right-most symbol of that string. The next
terminal symbol has thus not been shifted or reduced, nor was any terminal
generated before it. Similarly, the right-most symbol in the yield of $A_i$ is $A_{i-1}$
for $i \geq 3$, so the next terminal symbol, which serves as the look-ahead, remains
the same.

        Figure 5: The operations of left (a) and right (b) break-downs.

 10 The term “reduction” and the $\Rightarrow$ symbol are used here in their generalised form.
    According to proposition 2, if the current state of the parser does not match
the one saved on the root of the tree being considered for reuse, this tree cannot
be shifted. Moreover, none of the nodes on the left branch of this tree should
be shifted, and so this branch needs to be discarded (except for the leaf, which
is a terminal node.) This is achieved by the left break-down operation. Note
that breaking down the tree on the left branch does not mean that the internal
trees should be broken down as well, nor does it guarantee their reuse. Instead,
these trees are reconsidered in turn.
    On the other hand, if the state is matched, but the look-ahead symbol has
changed, then none of the reductions referred to by the right branch of the tree
should occur (as is suggested by proposition 3.) These reductions are therefore
“undone” by the right break-down operation.



Figure 6: The results of applying the Divide operation on a parse tree. The
tree is divided from the root to the terminal d node. The shaded nodes are the
ones being removed.


5.4    Handling Changes to the Token Stream
Trees that are considered for reuse can only be those for which the yield (the
string of tokens spanned by the root of the tree) has not changed since the last
parse. The stream of tokens, however, is modified between each two invocations
of the parser (or else there is no need for re-parsing.) The effects of these changes
on the stream and on the operation of the parser therefore need to be described.
    Note that unlike the algorithms presented in [3] and [5], the one given in this
paper leaves the task of modifying the parse tree as a result of changes to the
token list to the scanner, instead of the parser. This delegation of responsibility
is natural, as the scanner generates and modifies the token list. Two advantages
of this approach are that there is no need to keep a record of modified tokens
between parses, and that the P-Tree maintains a usable stream at all times.
One major drawback, on the other hand, is that decisions regarding the parse
tree may be taken prematurely, before the entire set of changes between parses
is examined. While this does not lead to incorrect results, it may result in
sub-optimal behaviour.
    Changes to the stream of tokens can be either the introduction of new tokens
or the removal of existing ones. As long as the P-Tree holds only terminal nodes
(i.e., before the first parse), handling changes to the stream is relatively easy.
Things become complicated when the P-Tree includes previously parsed trees,
as trees for which the yield has changed no longer correctly describe the text
being edited.
    In order to present the procedures for inserting and removing nodes from a
(fully or partially) parsed stream, we first need to describe the Divide proce-
dure, as suggested in [3]. Given a leaf (terminal) node in a parse tree, Divide
scans the path from the root to that node, and removes all nodes on this path
(except for the leaf.) All immediate children of the removed nodes are added
to the stream, in left-to-right order (see figure 6.) It is easy to see that left and
right break-downs are special cases of the divide operation, where the leaves are
the left-most and right-most ones, respectively.
    While [3] considers divisions from the root to a terminal node only, the
Divide operation can be generalised to internal (non-terminal) nodes. The al-
gorithm works the same way as it does for terminals. In essence, the generalised
Divide operation can be described as bringing a P-Tree node to the stream,
by discarding its ancestors. The siblings of this node are also brought to the
stream as a result.
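
    A possible rendering of the generalised Divide operation is given below.
The names Node, link, divide and build are assumptions made for the example,
and the stream is assumed to be delimited by sentinels. The function returns
the first node it places on the stream.

```python
class Node:
    """A P-Tree node reduced to its symbol and pointers (illustrative)."""

    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None
        if children:
            link(children)
            for child in children:
                child.parent = self
            self.first_child, self.last_child = children[0], children[-1]


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def divide(target):
    """Bring target to the stream by discarding all of its ancestors."""
    ancestors, root = set(), target
    while root.parent is not None:
        root = root.parent
        ancestors.add(root)

    def pieces(node):                      # left-to-right walk
        if node not in ancestors:
            yield node                     # kept as a whole tree
            return
        child = node.first_child
        while child is not None:
            yield from pieces(child)
            child = child.next

    forest = list(pieces(root))            # collect before relinking
    before, after = root.prev, root.next   # sentinels keep these non-Nil
    for node in forest:
        node.parent = None
    link([before, *forest, after])
    return forest[0]


def build():
    """Stream: BOS, S(A(a, b), c, D(d, e)), EOS; returns BOS and a name map."""
    n = {s: Node(s) for s in "abcde"}
    n["A"] = Node("A", [n["a"], n["b"]])
    n["D"] = Node("D", [n["d"], n["e"]])
    n["S"] = Node("S", [n["A"], n["c"], n["D"]])
    bos = Node("BOS")
    link([bos, n["S"], Node("EOS")])
    return bos, n


def stream_symbols(bos):
    out, node = [], bos.next
    while node.next is not None:           # stop at EOS
        out.append(node.symbol)
        node = node.next
    return out
```

Dividing towards the terminal d yields the stream A, c, d, e, whereas dividing
towards the internal node D only discards the root and yields A, c, D.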


     Deleting tokens from a P-Tree is achieved by the procedure Remove-Tokens,
described in algorithm 5. Recall from algorithm 3 that the scanner needs to re-
move all tokens after the last one that was not modified, and up to the new
position of the scanner (line 19.) Since the list of tokens is incorporated into
a P-Tree, we need to specify how this removal affects the trees in which these
tokens are included, and how it affects the stream. Naturally, if all tokens to
remove are on the stream (not yet parsed successfully), their removal is a simple
matter of arranging the pointers of the stream nodes immediately before and
immediately after them.
     The algorithm accepts the first token that needs to be removed, and the
last position of the deleted range. Based on this information, the algorithm
identifies the last token that is contained in the deleted range (line 1.) In order
to maintain a valid P-Tree, nodes can only be removed from the stream. That is,
a tree in the P-Tree forest can either be deleted or kept in its entirety. The next
step is, therefore, to bring the first and last tokens that need to be removed to
the stream. This is accomplished using the Divide operation (lines 2-3.) Now
all nodes on the stream between first and last are the roots of trees that need
to be completely removed, and are therefore deleted (line 4.)
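
    The steps of Remove-Tokens can be sketched as follows. All names are
illustrative, the EOS sentinel is recognised by its symbol purely for brevity,
and the first token is assumed to lie inside the deleted range.

```python
class Node:
    def __init__(self, symbol, children=(), pos=0, length=0):
        self.symbol, self.pos, self.length = symbol, pos, length
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None
        if children:
            link(children)
            for child in children:
                child.parent = self
            self.first_child, self.last_child = children[0], children[-1]


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def divide(target):
    """Bring target to the stream by discarding all of its ancestors."""
    ancestors, root = set(), target
    while root.parent is not None:
        root = root.parent
        ancestors.add(root)

    def pieces(node):
        if node not in ancestors:
            yield node
            return
        child = node.first_child
        while child is not None:
            yield from pieces(child)
            child = child.next

    forest = list(pieces(root))
    before, after = root.prev, root.next   # sentinels keep these non-Nil
    for node in forest:
        node.parent = None
    link([before, *forest, after])
    return forest[0]


def next_token(token):
    node = token
    while node.next is None:               # right-most child: move up
        node = node.parent
        if node is None:
            return None
    node = node.next
    while node.first_child is not None:    # descend along the left branch
        node = node.first_child
    return node


def remove_tokens(first, endpos):
    last = first
    while True:                            # line 1: last token in the range
        succ = next_token(last)
        if succ is None or succ.symbol == "EOS" \
                or succ.pos + succ.length > endpos:
            break
        last = succ
    divide(first)                          # line 2
    divide(last)                           # line 3
    link([first.prev, last.next])          # line 4: drop first..last


# Five one-character tokens under S(A(a, b), c, D(d, e)).
tok = {s: Node(s, pos=i, length=1) for i, s in enumerate("abcde")}
tree = Node("S", [Node("A", [tok["a"], tok["b"]]), tok["c"],
                  Node("D", [tok["d"], tok["e"]])])
bos, eos = Node("BOS"), Node("EOS")
link([bos, tree, eos])
remove_tokens(tok["b"], endpos=4)          # removes b, c and d
```

After the call the stream holds only the tokens a and e, both as singleton
trees, since every tree containing a removed token has been divided.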
     The procedure Merge-Tokens listed in algorithm 6 is used to insert tokens
into a P-Tree. This procedure accepts the token after which to insert, and a
list of new tokens. It begins by finding the highest ancestor of after that is not
directly affected by the new tokens11. The generalised Divide procedure is
then invoked on this node (line 2), after which it resides on the stream. Finally,
the tokens are merged into the stream (line 3.)
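
    Merge-Tokens admits a similarly short sketch. As before, Node, link and
divide are illustrative helpers and the stream is assumed to carry sentinels;
the line numbers in the comments refer to algorithm 6.

```python
class Node:
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None
        if children:
            link(children)
            for child in children:
                child.parent = self
            self.first_child, self.last_child = children[0], children[-1]


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def divide(target):
    """Bring target to the stream by discarding all of its ancestors."""
    ancestors, root = set(), target
    while root.parent is not None:
        root = root.parent
        ancestors.add(root)

    def pieces(node):
        if node not in ancestors:
            yield node
            return
        child = node.first_child
        while child is not None:
            yield from pieces(child)
            child = child.next

    forest = list(pieces(root))
    before, after = root.prev, root.next   # sentinels keep these non-Nil
    for node in forest:
        node.parent = None
    link([before, *forest, after])
    return forest[0]


def merge_tokens(after, token_list):
    node = after
    while node.parent is not None and node.parent.last_child is node:
        node = node.parent                 # line 1: highest such ancestor
    divide(node)                           # line 2
    link([node, *token_list, node.next])   # line 3


def stream_symbols(bos):
    out, node = [], bos.next
    while node.next is not None:           # stop at EOS
        out.append(node.symbol)
        node = node.next
    return out


# Insert the tokens x and y after b in S(A(a, b), c, D(d, e)).
a, b, c, d, e, x, y = (Node(s) for s in "abcdexy")
A = Node("A", [a, b])
tree = Node("S", [A, c, Node("D", [d, e])])
bos = Node("BOS")
link([bos, tree, Node("EOS")])
merge_tokens(b, [x, y])
```

Since b is the right-most descendant of A but not of S, only the root is
discarded: the stream becomes A, x, y, c, D and the tree rooted at A survives.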

5.5     The Algorithm
We are now ready to present the incremental parsing algorithm. This algorithm
is based on a regular, batch mode, LALR(1) parser. Incrementality is achieved
through the Next-Terminal function, described in algorithm 7. This function
is invoked whenever the parser requests the next token. In essence, Next-
Terminal finds the first available terminal node on the stream, while handling
all non-terminals before it.
    The function accepts two arguments, which describe the current condition
of the parser: top is a pointer to the stream node that represents the top of the
parse stack, and state is the current state of the parser (this last parameter is
not really needed, since each node also holds the state of the parser after it is
shifted, so the top node also holds the current state of the parser.)
    The function starts by iterating over nodes on the stream of the P-Tree.
Recall that the sibling pointers of every node on the stream point to other
nodes on the stream, so by starting with top and using the Next operation, we
examine only stream nodes. If the node is a non-terminal, the function uses
its left-most terminal as a look-ahead symbol to perform all reductions possible
(line 3.) The parser is then ready to consider a non-terminal for shifting. If
both the state of the parser and the look-ahead symbol match the ones used
during the previous parse (lines 4-6), the node is shifted (lines 7-9.) Shifting
is done by advancing the top pointer to the shifted node, and setting the new
state of the parser to the one saved by this node (recall that a node holds the
parser’s states before and after it was shifted.) If the state is the same, but
the look-ahead symbol has changed, the tree needs to be broken down along
its right branch (line 11), as suggested by proposition 3. Otherwise, the tree is
broken along the left branch (line 14), based on proposition 2. After the tree is
broken down (either to the left or to the right), its components are considered
for reuse in turn. It is assumed that the functions Left-Breakdown and
Right-Breakdown return the first node on the stream resulting from their
executions.

  11 This does not mean that the tree is correct: it may need to be modified as a result of a
change to its look-ahead symbol. Handling this case is deferred until the parsing phase.

                   File Size   Batch Parser       Incremental Parser
                    1 Kb          0.3 ms               < 0.1 ms
                    10 Kb         1.8 ms                0.2 ms
                   100 Kb        16.2 ms                2.2 ms

Table 2: Average parse times for different file sizes, based on the tests described
in appendix A. The results given for the incremental parser include the initial
parse, but do not include the scanning time.

Algorithm 5 Remove-Tokens(first, endpos)
 1: last ← the last token for which pos[last] + length[last] ≤ endpos
 2: Divide(first)
 3: Divide(last)
 4: Delete all trees on the stream from first to last (inclusive)

Algorithm 6 Merge-Tokens(after, token_list)
 1: node ← the highest node for which after is a right-most descendant
 2: Divide(node)
 3: Merge token_list after node and before next[node]

Algorithm 7 Next-Terminal(top, state)
 1: node ← next[top]
 2: while node is a non-terminal do
 3:   Do all reductions based on the left-most terminal in the yield of node
 4:   if prev_state[node] = state then
 5:      la ← the next terminal following the yield of node
 6:      if lookahead[node] = la then
 7:         top ← node
 8:         state ← next_state[node]
 9:         node ← next[node]
10:      else
11:         node ← Right-Breakdown(node)
12:      end if
13:   else
14:      node ← Left-Breakdown(node)
15:   end if
16: end while
17: return node
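
    The Next-Terminal function may be sketched as below. Several liberties
are taken and should be read as assumptions: the reductions of line 3 are
delegated to a caller-supplied reduce_all callback (hypothetical, since they
depend on the parser tables), the break-downs are expressed through a
Divide-style helper, and the updated top and state are returned together with
the node because the sketch has no reference parameters.

```python
class Node:
    def __init__(self, symbol, children=(), prev_state=None,
                 next_state=None, lookahead=None):
        self.symbol = symbol
        self.prev_state, self.next_state = prev_state, next_state
        self.lookahead = lookahead
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None
        if children:
            link(children)
            for child in children:
                child.parent = self
            self.first_child, self.last_child = children[0], children[-1]


def link(nodes):
    """Chain nodes through their sibling pointers, in the given order."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def divide(target):
    """Bring target to the stream; return the first node placed there."""
    ancestors, root = set(), target
    while root.parent is not None:
        root = root.parent
        ancestors.add(root)

    def pieces(node):
        if node not in ancestors:
            yield node
            return
        child = node.first_child
        while child is not None:
            yield from pieces(child)
            child = child.next

    forest = list(pieces(root))
    before, after = root.prev, root.next   # sentinels keep these non-Nil
    for node in forest:
        node.parent = None
    link([before, *forest, after])
    return forest[0]


def leftmost_leaf(node):
    while node.first_child is not None:
        node = node.first_child
    return node


def rightmost_leaf(node):
    while node.last_child is not None:
        node = node.last_child
    return node


def next_terminal(top, state, reduce_all):
    node = top.next
    while node.first_child is not None:            # node is a non-terminal
        top, state = reduce_all(top, state, leftmost_leaf(node).symbol)
        if node.prev_state == state:
            la = leftmost_leaf(node.next).symbol   # terminal after the yield
            if node.lookahead == la:               # reuse: shift whole tree
                top, state, node = node, node.next_state, node.next
            else:
                node = divide(rightmost_leaf(node))    # right break-down
        else:
            node = divide(leftmost_leaf(node))         # left break-down
    return node, top, state


# S1 is reusable; T was parsed with a look-ahead that no longer follows it.
a, b, c, d, e = (Node(s) for s in "abcde")
s1 = Node("S1", [a, b], prev_state=0, next_state=3, lookahead="c")
t = Node("T", [c, d], prev_state=3, next_state=5, lookahead="x")
bos = Node("BOS")
link([bos, s1, t, e, Node("EOS")])
no_reductions = lambda top, state, la: (top, state)    # stand-in callback
node, top, state = next_terminal(bos, 0, no_reductions)
```

In this run S1 is shifted as a whole, T is broken down along its right branch,
and the terminal c is returned with S1 on top of the stack and the state set
to 3.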

5.6     Results
The incremental parser was compared with a batch parser on files of different
sizes. The tests included manual insertions and deletions, as well as cut and
paste operations, in the beginning, middle and end of the files (see appendix A
for a description of these tests.) The results are given in table 2.
    While these numbers look promising, they fail to capture the true cost of
incremental parsing. Recall that some of the parser’s duties were delegated to
the scanner, which is usually invoked multiple times between parses12 . Thus
any true comparison between the batch and incremental parsers must include
the scanning times as well. Note that the scanning times given in table 1 cannot
be used here since the scanner was tested without the parsing code (the stream
was kept with terminals only, so no maintenance was required by the scanner.)
    The first tests that took into account both the parsing and scanning times
were disappointing: the incremental parser showed only a marginal improvement
over the batch parser. Indeed, Meddle felt sluggish while editing large source
files (100 kilobytes.) These poor results were due to the scanner, which, when
required to perform the additional tasks of maintaining the stream, exhibited
performance close to that of the batch scanner.
    At first it seemed that the approach described in this paper for implementing
incremental parsing had failed: if indeed the scanner cannot maintain the stream
efficiently, then this task needs to be returned to the parser. In that case, we no
longer have a valid P-Tree at all times, which, in turn, requires that some kind
of bookkeeping be kept of the token list between parses. Since this cannot be
done by the P-Tree in its current form, the entire system looked like a house of
cards.
  12 The results in this table may be misleading in one more sense. Comparing table 1 with

table 2 suggests that parsing is faster than scanning. This, however, is not the case, as
scanning times include updates to the visual tag system, while the parser, due to technical
difficulties, currently builds the parse tree and nothing more.




                 File Size   Batch Parser    Incremental Parser
                  1 Kb         11.3 ms             1.0 ms
                  10 Kb        37.1 ms             3.2 ms
                 100 Kb       624.8 ms            24.2 ms

Table 3: The combined scanning and parsing results for the batch and incre-
mental parsers. Scanning times were accumulated between parses, and added
to the run time of the parser itself.


     However, using a profiler to analyse the run-time behaviour of Meddle’s
scanner resulted in a surprising revelation: the poor performance displayed
by Meddle was not due to the code required for maintaining a valid stream by
the scanner (i.e., the functions that implement the Remove-Tokens and Divide
algorithms.) Instead, most of the time was consumed by two seemingly harmless
functions: ptree_next_term(), which is responsible for implementing the
token-list view of the P-Tree, and ptree_shift(), which updates the position
and length of nodes during scanning. While the footprint of these functions was quite small, thousands of calls
to each of them during a single scan of a large file resulted in a significant
performance loss.
     The solution for the problem posed by the first of these functions was to
introduce one more pointer to the P-Tree node structure. This pointer, imple-
mented in terminal nodes only, always points to the next token. Initially, it
has the same value as the pointer that points to the next node in the stream.
However, while the latter may change when a token is removed from the stream
and becomes the child of a non-terminal, the new pointer is kept during parsing.
On the other hand, it may change when new tokens are added to the P-Tree,
even if the terminal node itself is not on the stream. Thus iterations over the
token list, which are required on every scan, are now much faster. The over-
head of maintaining this new pointer is negligible, as this task fits well into the
maintenance of the stream pointers.
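
    The effect of the extra pointer can be illustrated with a small sketch.
The Token class, the next_tok field and the helper functions are hypothetical
names; the point is only that the token list is threaded separately from the
stream, so that walking it never requires climbing the tree.

```python
class Token:
    def __init__(self, text):
        self.text = text
        self.next = None       # stream or sibling pointer; changes on reduction
        self.next_tok = None   # token-list pointer; unaffected by reductions


def thread(tokens):
    """Chain tokens through their next_tok pointers."""
    for left, right in zip(tokens, tokens[1:]):
        left.next_tok = right


def insert_after(after, new_tokens):
    """Thread new_tokens into the token list right after the given token."""
    if not new_tokens:
        return
    tail = after.next_tok
    thread([after, *new_tokens])
    new_tokens[-1].next_tok = tail


def texts(first):
    out, tok = [], first
    while tok is not None:     # one pointer dereference per token
        out.append(tok.text)
        tok = tok.next_tok
    return out


a, b, c = Token("a"), Token("b"), Token("c")
thread([a, b, c])
insert_after(a, [Token("x"), Token("y")])
order = texts(a)
```

Each step of the iteration costs a single pointer dereference, whatever the
shape of the trees that contain the tokens.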
     The original idea behind the implementation of ptree_shift() was to use
the P-Tree as a search tree for tokens, based on their positions. Thus, the
scanner may not need to iterate over the entire token list to find a token that
needs to be changed. For the scanner to be able to find the right token, however,
all tree nodes must be kept up-to-date when the text changes, even if this change
only results in tokens being shifted. In order to achieve this, ptree_shift()
updates the requested terminal node, and then modifies all of its ancestors
that may have changed as a result. However, since the possible gain of this
method is apparently nullified by the overhead of ptree_shift(), the function
was changed and now updates the terminal nodes only.
     Once the above changes were implemented, the scanner again displayed good
performance. In fact, despite the overhead imposed by the P-Tree maintenance
tasks, the incremental parser worked almost as fast on complex parse trees as
on flat token lists. Table 3 shows a comparison between the batch parser and
the incremental parser, with scanning time taken into account.




5.7    Non-Grammar Tokens
An interesting problem, for which I could not find any references in the cited
papers, is how to handle tokens that should not be parsed. Most programming
languages allow for certain tokens to be included in a programme’s listing, even
though they are not part of the grammar. A common example of such tokens
is source comments.
    Although non-grammar tokens do not provide any semantic information,
they may still be useful from a syntactic point of view, and therefore need to
be scanned. The P-Tree structure, however, makes it difficult to support tokens
that are not a part of any parse tree. One solution is to discard these tokens
during the Merge-Tokens procedure. The main problem with this method
is that non-grammar tokens are not kept between scans, which leads to some
anomalies in the scanner’s operation.
    The next-token pointer, discussed in the previous section, gives more flexi-
bility when handling the list of tokens. Thus the actual token list can differ from
the (virtual) list of terminals which the parser discovers as it iterates over the
stream (note that the parser uses the ptree_next_term() function described
above in its original form, as it needs to find the look-ahead symbol of a non-
terminal.) Maintaining these separate lists, however, requires changes to the
Merge-Tokens and Remove-Tokens procedures. Currently, these changes
are not implemented in Meddle, so non-grammar tokens are not handled cor-
rectly by the parser.


6     Conclusion
The P-Tree data structure provides an easy way for implementing the Jalili-
Gallier algorithm for incremental parsing. The result is not optimal, as parse
trees may be broken down even if the exact same tree is rebuilt during the
following parse. However, interactive environments, such as text editors, are
measured by their responsiveness, and not necessarily by their absolute performance.
Empirical tests show that Meddle remains fast and responsive even
when editing large source files, while maintaining a correct semantic represen-
tation of the edited text at all times.
    The P-Tree structure, on the other hand, makes it difficult to separate the
token list, as seen by the scanner, from the list of terminals, as required by the
parser. This separation is sometimes useful, as in the case of tokens that are
not included in the language’s grammar. The inclusion of a next-token pointer
with terminal nodes, that differs from the next stream-node pointer, results in
a better distinction between the parser and scanner views of the P-Tree. These
views are still not completely independent, so the handling of non-grammar
tokens remains an arduous task.
    An issue that remains to be studied is the effect of incremental parsing on
attribute grammars. Meddle includes some attribute handling in its code for
maintaining the position and length of the yield of each node in the P-Tree
(both of which are synthesised attributes.) These attributes, however, form a
special case, since they provide no semantic information, and are usually not
needed for code generation. It would be safe to assume that in most cases, if a
tree is reused, then the values of the synthesised attributes remain unchanged
for all nodes in this tree. Inherited attributes, on the other hand, may change
upon shifting of the tree, and this change would need to be propagated to the
rest of the nodes.


References
[1] R. A. Ballance, J. Butcher, and S. L. Graham. Grammatical abstraction
    and incremental syntax analysis in a language-based editor. In Proceedings
    of the ACM SIGPLAN 1988 conference on Programming Language design
    and Implementation, pages 185–198. ACM Press, 1988.
[2] John F. Beetem and Anne F. Beetem. Incremental scanning and parsing
    with Galaxy. IEEE Transactions on Software Engineering, 17(7):641–651,
    1991.
[3] Fahimeh Jalili and Jean H. Gallier. Building friendly parsers. In Proceedings
    of the 9th ACM SIGPLAN-SIGACT symposium on Principles of program-
    ming languages, pages 196–206. ACM Press, 1982.

[4] Tim A. Wagner and Susan L. Graham. Incremental analysis of real program-
    ming languages. In Proceedings of the ACM SIGPLAN 1997 conference on
    Programming language design and implementation, pages 31–43. ACM Press,
    1997.
[5] Tim A. Wagner and Susan L. Graham. Efficient and flexible incremen-
    tal parsing. ACM Transactions on Programming Languages and Systems,
    20(5):980–1013, 1998.


A     Tests
The following tests were conducted to determine the performance of the incre-
mental scanner and parser:

Test 1 Type (i.e., one letter at a time) the following code near the beginning
of the file:

    typedef struct test1_s {
        char* string;
        int number;
    } test1_t;

Test 2 Type the following code at the end of the file (i.e., this text should
be appended):

    void test2(char* string, int number)
    {
        printf("%s,%d\n", string, number);
    }




Test 3   Paste the following code around the middle of the file:

    int test3(const char* str)
    {
        return strlen(str);
    }

Test 4   Delete the code added in test 1, by repeatedly using the backspace
key.

Test 5   Replace the text pasted in test 3 with the following code:

    const char* test5(int number)
    {
        char buf[20];

        sprintf(buf, "%d", number);
        return buf;
    }

   Replacing text can be done by first selecting the old text, and then pasting
the new one.



