A Concurrent Garbage Collector For leJOS

Andy Shaw

Introduction

leJOS is an open source system that provides an environment for the execution of programs written in a subset of the Java programming language on the LEGO NXT device (see http://lejos.sourceforge.net for more details). The LEGO NXT device supports a number of motors and sensors, the correct operation of which requires fast, predictable response times from the controlling software. One of the aims of leJOS is to do as much as possible in the application level software layer, rather than in the supporting firmware. To this end many of the highly time sensitive operations, like motor control and the handling of sensor devices, are implemented in Java and supplied by the leJOS class library. This presents particular challenges for a garbage collector, which must operate without introducing pauses that will disrupt the functioning of such control systems.

This document briefly describes the history of garbage collection within leJOS and the associated problems. It then presents details of a new collector that is designed to solve these issues and to meet the demands of a Java based real time system. Finally, results of various performance tests are presented, comparing the new system against its predecessors. It is assumed that the reader is reasonably familiar with Java and with garbage collection systems and their implementation.

leJOS garbage collector history

Early versions

The early releases of leJOS did not implement garbage collection at all. Instead they contained a simple allocator to enable the creation of new objects. With the exception of a few internally allocated objects (dynamic stacks), there was no way to free the memory associated with an object when it was no longer required. This model required the programmer to manage memory explicitly and to avoid operations (like some string manipulations) that would rapidly consume all available memory.
Although it was possible to write many interesting programs in this way, the resulting programs often looked a little strange to programmers familiar with Java (having a large number of static objects), and made solving some problems very awkward (if not impossible). However, a major advantage of these earlier versions was the lack of interference by a collector with the correct operation of the real time parts of the system.

leJOS 0.5 and the first mark sweep collector

For the 0.5 version of leJOS, Janusz Gorecki added a true garbage collector to the system. This was a huge leap forward, allowing much more complex programs to be created in a more conventional style. This implementation has provided the basis for all of the collectors that have followed and so is described in some detail below.

The collector was a classic mark/sweep collector using a "stop the world" model of collection. In order to minimize the impact on programs it was only triggered when an allocation failed (so existing programs would not trigger the new code). At this point the collector ran until completion (effectively suspending all of the threads in the system), freeing all currently unused objects. The collector operated in two major phases. First, starting at the program's roots (basically the thread run time stacks and static objects), the collector identified references and recursively followed them, visiting all of the "live" objects in the heap. As it visited each object, a bit in the header of the object was set to indicate that the object was in use. The second major phase was to sweep through the entire heap, visiting each object. Those objects that did not have the in use bit set were modified to indicate that the memory was free (which used another bit in the header), the free memory was merged with any adjacent free memory, and the in use bit of all objects was cleared (setting things up for the next mark phase).
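The header-bit bookkeeping of the sweep phase just described might be sketched as follows. The encoding (bit positions, word-based heap, sizes) is invented for the example and is not the actual leJOS layout:

```c
#include <stdint.h>

#define HEAP_WORDS 16

/* Each block starts with a header word: size in words (including the
 * header) in the low bits, a free bit and a mark (in-use) bit. */
#define FREE_BIT  0x4000u
#define MARK_BIT  0x8000u
#define SIZE_MASK 0x3fffu

static unsigned heap[HEAP_WORDS];

/* Sweep: unmarked objects are garbage and become free space; runs of
 * adjacent free/garbage blocks are merged; mark bits are cleared so the
 * heap is ready for the next mark phase. */
static void sweep(void)
{
    unsigned i = 0;
    while (i < HEAP_WORDS) {
        unsigned h = heap[i];
        unsigned size = h & SIZE_MASK;
        if (h & MARK_BIT) {
            heap[i] = size;              /* live object: clear mark bit */
        } else {
            /* free block or unmarked garbage: absorb every following
             * block that is also free or unmarked */
            unsigned j = i + size;
            while (j < HEAP_WORDS && !(heap[j] & MARK_BIT)) {
                size += heap[j] & SIZE_MASK;
                j = i + size;
            }
            heap[i] = FREE_BIT | size;
        }
        i += size;
    }
}
```

The size field lets the sweep step from header to header without any separate index of objects, which is why the real sweep must visit dead objects as well as live ones.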
Memory allocation was performed by searching through the heap (using the object and free headers), looking for a large enough section of free memory. Once found, the object would be created (by splitting the free memory if required), and a new object header with the allocated bit cleared inserted.

It is worth taking a closer look at the implementation of some of the above stages. The mark phase used recursive calls to mark the child objects of each visited item. So it contained code that looked like...

    mark(obj)
        for each reference contained in obj
            mark(reference)
        set "in use" bit for obj

When marking objects and arrays the collector used additional data structures to identify the type of the class and array members and determine if they needed to be marked. However, this data was not available for the contents of the run time stack (unlike some Java systems, leJOS does not maintain so called "stack maps"). To allow the correct processing of references held on the stack (in parameters and local variables for instance), the collector used a conservative marking strategy (a technique often used by collectors for languages like C/C++ which do not have explicit support for garbage collection). This requires that each stack word can be examined and that the collector can determine if the word could be a pointer to an object, and if so mark the object. To allow this operation the collector uses an auxiliary bit map that contains one bit per possible start of an object within the heap. The allocation and sweep stages maintain this bitmap.

leJOS 0.6 incremental sweep

The addition of the garbage collector made programming for leJOS much easier and the system soon started to take advantage of the new techniques available; in particular the use of string operations became much wider. This usage quickly highlighted a problem with the collector: the sweep phase could be very long.
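The conservative stack scan described above for the 0.5 collector can be illustrated with a small sketch. The bitmap helpers, heap size and word type here are invented for the example, not the actual leJOS code:

```c
#include <stdint.h>

#define HEAP_WORDS 1024
static uint16_t heap[HEAP_WORDS];

/* One bit per heap word, set by the allocator (and maintained by the
 * sweep) when an object header starts at that word. */
static uint8_t obj_start[HEAP_WORDS / 8];

static void set_obj_start(int w) { obj_start[w / 8] |= 1 << (w % 8); }
static int  is_obj_start(int w)  { return (obj_start[w / 8] >> (w % 8)) & 1; }

/* A stack word might be a pointer: treat it as a reference only if it
 * points at a word recorded as the start of an object. Anything else
 * (an int, a return address, a pointer into the middle of an object)
 * is ignored. */
static int could_be_object(uint16_t *p)
{
    if (p < heap || p >= heap + HEAP_WORDS)
        return 0;                       /* outside the heap entirely */
    return is_obj_start((int)(p - heap));
}
```

Because an integer that merely looks like a valid object address will also pass this test, conservative scanning can keep some dead objects alive; the benefit is that no stack maps are needed.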
Programs that used a lot of string operations quickly generated a large amount of garbage, requiring the collector to run frequently. The sweep phase needs to visit every object in the heap (both live and dead), and with short strings this could mean thousands of objects. Processing this many objects could result in sweep times that easily exceeded 10ms. While performing a sweep operation the entire system was halted (except for interrupt processing), so even threads that did not use the collector would be impacted. Clearly this was not good for the real time threads that were performing motor control (these threads assumed a maximum pause time of a few milliseconds at most).

A similar problem had also been seen with allocations. Sometimes the heap could contain thousands of objects, which resulted in long search times as the allocator worked its way through the heap looking for free space.

The 0.6 release contained modifications to resolve both problems. Firstly, the sweep phase was made incremental. Rather than running to completion and scanning all of the heap in one go, the sweep process simply swept forward until sufficient free space was available to meet the immediate requirement. The sweep would be continued the next time an allocation request failed, the process continuing until the entire heap had been swept. This modification effectively distributed the sweep time over a number of allocation calls, thus reducing the pause time seen by most applications. A similar change was made to the allocation routine, adding a pointer to remember the point in the heap that was last used to successfully allocate memory, thus avoiding searching through all of the heap for each allocation.

leJOS build 1722 deep marking

As 0.6 was used more widely, a number of reports of Data Aborts began to show up on the user forum. These are hardware exceptions usually thrown when the processor attempts to access non-existent memory.
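The incremental sweep introduced in 0.6 can be sketched as follows: a saved position persists between calls, and each increment stops as soon as the pending request can be met. The heap model and names are illustrative (and, for brevity, this sketch conflates "free" and "unmarked" into a single flag):

```c
#include <stddef.h>

#define N_OBJS 8
struct hdr { int in_use; int size; };   /* one header per heap object */
static struct hdr heap_objs[N_OBJS];

static int sweep_pos = 0;               /* persists between calls */

/* Sweep forward from the saved position until 'needed' words have been
 * freed or the end of the heap is reached. Returns the number of words
 * freed by this increment; the remainder of the heap is left for the
 * next failed allocation to continue. */
static int sweep_some(int needed)
{
    int freed = 0;
    while (sweep_pos < N_OBJS && freed < needed) {
        struct hdr *h = &heap_objs[sweep_pos++];
        if (!h->in_use)
            freed += h->size;           /* unmarked: reclaim it */
        h->in_use = 0;                  /* clear mark for next cycle */
    }
    if (sweep_pos == N_OBJS)
        sweep_pos = 0;                  /* full pass complete */
    return freed;
}
```

Spreading the sweep over several allocation calls is what turns one long pause into many short ones, at the cost of a little bookkeeping between calls.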
Further investigation indicated that the aborts had been caused by the firmware stack overflowing, and that the garbage collector was responsible. The cause turned out to be the use of more complex data structures. As noted above, the mark phase used recursive calls to mark child references within an object. Unfortunately recursive structures like trees and linked lists could result in very deeply nested calls. The C stack used by the firmware is only 1Kb in size, and a list structure of only ten nodes (resulting in 10 recursive mark calls) caused the stack to overflow.

The classic solution to this problem is to remove the actual recursion and to replace it with an explicit stack of references waiting to be marked. However this approach does have some issues:

● Although the memory required when using an explicit stack is typically less than that required to make a recursive call (no registers to save, return addresses etc.), it still requires memory (and in leJOS we do not have that much).
● Marking arrays using just a stack is typically much less efficient, requiring the entire array of nodes to be pushed on the stack rather than simply keeping track of the current offset within the array.

To minimize the two above problems, we make use of a hybrid solution (borrowed from the Sun Squawk VM). A recursive mark stage is used down to a fixed stack depth. This allows shallow structures (in particular arrays) to be marked efficiently but avoids the stack overflow problem. Once the fixed depth is reached, further references are pushed on to an explicit mark stack. This stack is then popped and the references marked using the recursive marker once the initial recursion has unwound. Because there is only a limited amount of memory available the mark stack must also be of limited size. So, as with other collectors, if the mark stack overflows we simply mark the headers of the objects and then make one or more passes through the heap until all objects have been processed.
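The hybrid scheme can be sketched as follows. The depth limit matches the one mentioned later in this document, but the stack size is shrunk and the node type is a stand-in for real heap objects:

```c
#include <stddef.h>

#define MAX_DEPTH  8
#define STACK_SIZE 4

struct node { int marked; int overflowed; struct node *child; };

static struct node *mark_stack[STACK_SIZE];
static int sp = 0;
static int overflow_seen = 0;   /* forces extra heap passes if set */

/* Recurse down to MAX_DEPTH; beyond that, defer children to the
 * explicit mark stack; if that is full, flag the object's header for
 * a later pass over the heap. */
static void mark(struct node *n, int depth)
{
    if (!n || n->marked)
        return;
    n->marked = 1;
    if (!n->child)
        return;
    if (depth < MAX_DEPTH) {
        mark(n->child, depth + 1);      /* cheap: real recursion */
    } else if (sp < STACK_SIZE) {
        mark_stack[sp++] = n->child;    /* defer to explicit stack */
    } else {
        n->child->overflowed = 1;       /* header-only mark: slow path */
        overflow_seen = 1;
    }
}

/* Once the initial recursion has unwound, drain the explicit stack,
 * restarting the recursive marker at depth 0 for each entry. */
static void drain_mark_stack(void)
{
    while (sp > 0)
        mark(mark_stack[--sp], 0);
}
```

A linked list of any length is handled with at most MAX_DEPTH C stack frames at a time, which is the property that fixes the Data Abort problem.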
Handling overflows this way is much slower than the other two marking mechanisms (due to the repeated passes through the heap), but it does guarantee that any complexity of data structure can be handled. An additional optimization was added in this build: the leJOS linker was modified to set an additional bit in the header of classes that do not contain any references. This is used to improve the marking process.

Evaluating the collectors

As part of the investigations into the above problems, a number of test programs have been created to measure the impact of various memory loads on the system. In particular, measurements were made of the allocation times and of the impact of the collector on the responsiveness of a "real time" thread. Several test cases have been created:

● Simple string manipulation (allocation, concatenation etc.).
● Large arrays of simple objects (simulating a typical internal map data structure as may be used by a mobile robot).
● Deeply recursive function calls. Because the thread run time stacks are allocated from the heap, this tested how well the reallocation of such stacks would work. It also created a large set of collector roots.
● Complex data structures (lists, and trees with large numbers of nodes).

The tests measured two parameters. The first was the time taken to perform a simple set of string operations (creating two strings and then appending one to the other). This operation was repeated 5000 times (10000 for the base string test), and the time taken for each operation recorded. The second test measured the responsiveness of a second "real time" thread. The thread simply performed a Thread.sleep(1) operation; additional code measured the actual time that this sleep lasted and recorded the results. For the more complex tests shown above, the test first created the complex data structure and then ran the string manipulation test, so any collection operation would have to process the additional "live" data during the test.
Although many tests have been created and run, the rest of this document concentrates on the array test as this:

● Illustrates the various problems well.
● Can be run on all versions of the collector (the more complex tests do not run on the 0.6 collector).
● Represents a reasonably "real world" data set for a Lego based robot.

[Figure 1: String test build 1722 — log-scale histogram of String Ops and Latency times (ms)]

Figure 1 (above) shows the results of running the basic string operations test using the build 1722 collector. It can be seen that the majority of times the test takes 1ms or less to perform (shown by the distribution of the blue bars). However, a small number of times the test takes longer, with a maximum time of 5ms. These longer times are a result of the collector running. For most of the operations there is sufficient memory and so the collector is not needed. However, when available memory is exhausted the collector runs, which produces the longer test times. In this test we are mainly seeing the time taken by the sweep phase (the mark phase for such a small set of live objects is very small); the spread of times is caused by the incremental sweep mechanism used by this collector.

Also shown in these results is the impact that running the string operations has on the real time thread. It can be seen that all of the times (shown by the red bar) are in the 2ms slot. The measured time is 2ms rather than the 1ms that might have been expected because the leJOS time slice (used by the thread scheduler) is 2ms. Since the string thread is cpu bound it will only release the cpu when its time slice expires, thus the other thread only gets to run every 2ms.

The above results look quite promising. The impact of a thread making heavy use of allocations does not seem to be detrimental to a thread that is not using the allocator, and things run well. We will now examine what happens when we introduce more "live" objects into the system.
The following graph (figure 2) shows the array100 test. This test uses a 100 element array, with each array element having nodes that contain 3 objects. Thus we introduce 301 live objects (300 objects plus the array) into the system.

[Figure 2: Array100 build 1722 — log-scale histogram of String Ops and Latency times (ms)]

The impact of the additional objects can clearly be seen. The maximum test time is now up to 13ms, with a further cluster of operations taking around 6ms. These two clusters reflect the impact of the extra objects on the two major phases of the garbage collector. The higher times show the impact on the mark phase: with many more live objects in the system, the mark phase is taking longer to trace all of them. The other cluster reflects the impact on the sweep phase: we now have more live objects, so locating free space for an allocation now requires the examination of a greater number of objects before sufficient free space is found.

More worryingly, the impact on the real time thread has now increased considerably. The allocation thread is now introducing pauses into the system of up to 11ms. This is due to the nature of the "stop the world" collector being used. When it runs, all threads are suspended and so will be delayed by the period taken to run the collection. A delay of 10ms or more is not good for a real time process; there are many operations within leJOS (for instance the motor drivers) that assume that any delay will be considerably less than this.

The above test however is still only using a small part of the available system memory (approximately 14Kb of the available 55Kb). When we increase this loading the results, as may be expected, get worse. Figure 3 shows the array300 test results. This test has an array of 300 nodes, which results in 901 objects occupying approximately 44Kb.
[Figure 3: Array300 build 1722 — log-scale histogram of String Ops and Latency times (ms)]

Unfortunately the results are as we might have expected. The increased number of live objects has extended both the string test times and the impact on the real time thread. The worst case is now over 30ms, with a significant number of additional delays being introduced into the over-5ms area. The build 1722 collector is effectively an enhanced version of the original collector used since the 0.5 release. Tests run on these older collectors show similar results (the incremental sweep implementation added in 0.6 does help; the figures for 0.5 are even worse). The problems seen are basically the result of the stop the world nature of the collector used. If we are to address these issues a new approach will be required.

A new collector for leJOS

So what are the characteristics required of a new collector?

● Minimize the impact on real time threads.
● Minimize any additional memory overhead.
● Minimize the impact on applications that do not need garbage collection.
● Allow the use of complex data structures.
● Allow the use of all available memory.

The problem that we are trying to address is a well known one with collectors (though the memory available to leJOS does add some additional constraints). There are many potential solutions (on the fly, concurrent, parallel and incremental collectors to name a few), but the class of collector that seems to best fit the above requirements is that of concurrent mark sweep collectors. This type of collector runs the various collector phases alongside the running application threads, allowing (on a single cpu system) the operations of the collector to be interleaved with those of the application. It is this style of collector that has been implemented as the new leJOS collector. The following sections give details of the actual implementation.
Providing concurrency

The leJOS firmware is not multi-threaded and the allocator is part of the firmware, so how do we provide concurrency? One solution would have been to move the collector into Java and use the threading provided by the VM. While in many ways attractive, the performance issues presented by this approach (and problems with things like expanding thread run time stacks) ruled against it. Instead the collector is run as a co-routine alongside the VM. On every VM task switch the VM calls into the collector, allowing it to run. The collector will run for no more than 1ms on each call, thus always leaving at least 1ms in each time slice for application threads to operate.

But what happens to requests made by the VM for memory when insufficient is available? In the current system these calls simply run the collector and so implicitly wait for the collection to create sufficient free memory to satisfy the request. But this is the cause of the major problem we are trying to solve, so how do we address it? One solution would be to ensure that there is always sufficient memory available. This is how many concurrent collectors operate: the collector is started whenever the amount of free memory falls below a certain point, and the system is tuned such that the rate of memory requests does not exceed the rate at which memory is recovered. However, to minimize the impact of the collector on programs that manage memory well, we do not want to run the collector unless we have to. So in leJOS we do not trigger a collection until all of memory has been used. This means that we will have to make threads that require memory wait until it becomes available, but without impacting threads that do not require memory. To achieve this we make use of a Java monitor.
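The 1ms co-routine budget described at the start of this section might look something like the sketch below. In the real firmware the tick flag is set by the 1ms timer interrupt; here it is just a variable, and the "unit of work" is a counter standing in for a bounded piece of mark or sweep work:

```c
static volatile int ms_tick = 0;    /* set by the 1ms timer interrupt */
static int work_done = 0;           /* observable progress, for the sketch */

/* Called by the VM on every task switch: perform small, bounded units
 * of collector work until either the pending work is done or the tick
 * flag shows the 1ms budget has been used up. */
static void gc_run(int units_pending)
{
    while (units_pending-- > 0 && !ms_tick)
        work_done++;                /* one bounded unit of mark/sweep */
    ms_tick = 0;                    /* re-arm the budget for next call */
}
```

Because the check is a single flag test, the collector can poll it cheaply inside its inner loops, which is what keeps each call close to (though, as noted later, not always strictly under) the 1ms target.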
When it is not possible to allocate memory, the firmware places the current thread on the queue of threads waiting for an (internal) monitor; the firmware also modifies the state of the current thread such that the current op-code will be restarted when the thread runs again. The thread then returns back to the virtual machine and is immediately suspended. Other threads continue to run, as does the collector. When the collector has completed the collection, it simply signals any waiting threads. The suspended threads will now run and will re-execute the op-code that requires the memory allocation. If there is now sufficient memory the request will succeed; if not, the thread will be aborted with an OutOfMemoryError.

The actual implementation is a little more complex than described. Care must be taken to ensure that instructions can be restarted (this required additional changes to the leJOS VM), and steps must be taken to detect if the thread has already waited for memory, to prevent a constant wait cycle when there really is insufficient memory available. There is also an optimization that tracks the amount of memory required and so allows the collector to notify waiting threads before the collection completes.

Snapshot at the beginning

One of the major challenges facing a concurrent collector is how to allow the application to continue to modify memory (and in particular references to objects) while the collector identifies the objects that are still live (in a mark sweep collector this identification takes place in the mark phase). There are several well known techniques for dealing with this problem and we use one of the simplest, the "snapshot at the beginning" approach. Basically, with this technique the collector needs to capture the current state of all references at a particular instant in time. Some collectors do this by taking a copy of all of the state (hence the snapshot).
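Returning briefly to the allocation path described earlier in this section: the failed-allocation handling, including the detection of a thread that has already waited once, can be sketched as follows. All names and the word-based accounting are invented for the example:

```c
/* Sketch of the failed-allocation path: rather than stopping the world,
 * a thread that cannot be satisfied is told to retry (the caller parks
 * it on the internal monitor and flags its opcode for restart); a thread
 * that has already waited once is genuinely out of memory. */
enum alloc_result { ALLOC_OK, ALLOC_RETRY, ALLOC_OOM };

struct gc_thread { int waited_for_gc; int restart_opcode; };

static int free_words = 0;

static enum alloc_result try_alloc(struct gc_thread *t, int words)
{
    if (words <= free_words) {
        free_words -= words;
        t->waited_for_gc = 0;          /* success clears the wait flag */
        return ALLOC_OK;
    }
    if (t->waited_for_gc)
        return ALLOC_OOM;              /* already waited: really OOM */
    t->waited_for_gc = 1;
    t->restart_opcode = 1;             /* re-execute the allocating opcode */
    return ALLOC_RETRY;                /* caller suspends on the monitor */
}
```

The wait flag is what prevents the constant wait cycle mentioned above: a thread gets exactly one collection between its first failure and the out-of-memory decision.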
However, taking a full copy is not practical for leJOS (or many other systems). Instead we start with a minimal snapshot and take steps to ensure that no changes made by the application "spoil" the view seen by the mark phase. In our collector the initial snapshot is defined as the set of root objects (the run time stacks etc.). So in our new collector we define the following phases:

● Create the snapshot
● Mark objects
● Sweep the heap

Once the initial snapshot has been created we need to ensure that changes made by the application do not modify things in such a way as to damage the "single image" of the heap that we need to mark all of the objects. Effectively the mark code will start with the root set and then move through the system, adding objects into the snapshot as it goes; we need to ensure that changes will not hide any objects from this process. To do this we make use of a so called "write barrier". This is code that is placed into the virtual machine to trap modifications to references. In particular we are looking to prevent the overwriting of a reference to an object that would otherwise have been located by the mark phase. We take a very simple approach to this: our write barrier will mark any object (and any references contained in the object) that is written to during the mark phase and which has not yet been marked. The effect of this is that any objects that are written to before they are reached by the marking code will have already been added to the snapshot. This is a conservative approach and it may result in some objects being marked as live that are not strictly still in use (these will however be picked up by the next mark phase), but it is simple to implement and guarantees that the process will terminate in a finite time.

Creating the snapshot

To create the initial snapshot the collector has to capture the current state of all of the roots. As described above, these consist of the static objects and the thread stacks.
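The write barrier described above can be sketched as follows, with a one-field object and a simple mark bit standing in for the real color scheme (names are illustrative):

```c
#include <stddef.h>

struct obj { int marked; struct obj *field; };

static int mark_phase_active = 0;

/* Stand-in for the marker: mark the object and everything reachable
 * from it (the real collector queues children rather than recursing). */
static void mark_now(struct obj *o)
{
    if (o && !o->marked) {
        o->marked = 1;
        mark_now(o->field);
    }
}

/* All reference stores in the VM go through this barrier: if a not yet
 * marked object is about to be modified during the mark phase, mark it
 * (and its children) first, so the old reference cannot be hidden from
 * the snapshot by the store. */
static void write_ref(struct obj *holder, struct obj *value)
{
    if (mark_phase_active && !holder->marked)
        mark_now(holder);       /* preserve the snapshot... */
    holder->field = value;      /* ...then perform the store */
}
```

Note that the barrier fires on the object being written to, not on the new value: the new value is either already reachable from some root or freshly allocated, so it is the overwritten reference that needs protecting.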
Each reference belonging to the root set is identified. If the object referred to does not contain any references it is marked immediately; otherwise it is placed on a queue waiting to be marked (see below for the handling of overflows of this queue). While capturing the roots all other threads must be suspended. This is currently achieved by always allowing the mark roots phase to run to completion. Normally this takes much less time than the allowed 1ms; however, large stacks with a large number of references can result in the target time being exceeded. There is currently no fix for this. A side effect of using the run time stack as the basis of the snapshot is that changes to the stack do not need to be protected by the write barrier. This has major benefits for the complexity of the barrier and for its impact on performance.

Mark phase

Once the roots have been captured the collector enters the mark phase. This uses many of the techniques described for the build 1722 collector; in particular the same hybrid marking scheme is used. However, there are a number of differences, either to improve efficiency or to work better incrementally. The major problem here is that the mark phase will often take longer than the allowed 1ms duration. This means that the algorithm must detect when the limit has been reached and then save sufficient state to be able to restart when next called. The basic operation of the mark phase is to remove references from the mark queue and to mark them, adding new items to the queue as required. To improve the efficiency of this process a small amount of actual recursion (a depth limit of 8 is currently used) is allowed. When the 1ms time period is used up (this is detected by monitoring a single bit set by the system 1ms interrupt routine), processing of objects stops and any partially marked items are placed on the mark queue for processing next time. For most objects this works well. However, large arrays present a particular problem.
A considerable amount of time can be wasted getting back to "the point last marked" in a large array. To avoid this problem, marking of a large array does not stop immediately when time runs out; instead all of the remaining references are placed on the mark queue (i.e. marking of the array is completed). For very large arrays this can result in the 1ms period being exceeded.

The previous collectors used a single mark bit in the object header to indicate the mark state. The new collector uses 2 bits to implement a modified version of the classic garbage collector coloring scheme. The 2 bits represent four colors:

WHITE: The object has not been marked.
LIGHTGREY: The object has been visited but not all child objects have been marked.
DARKGREY: The object is on the mark queue but child objects have not been marked.
BLACK: The object and all child objects have been marked.

LIGHTGREY objects are used to handle the mark queue overflow case. If the queue is full the object is marked as LIGHTGREY. If there are LIGHTGREY objects then the mark phase will need to make one or more passes through the heap to process them. The LIGHTGREY marking allows only the objects that have overflowed to be processed. When marking an object, if it is WHITE it is considered fully: if it does not contain any references then it is immediately marked BLACK. If it is DARKGREY (it will be processed later) or BLACK then nothing needs to be done. If it is LIGHTGREY then an attempt is made to place it on the mark queue (and re-color it DARKGREY), thus potentially avoiding a traversal of the heap later. The mark phase is complete when the mark queue is empty and there are no LIGHTGREY objects in the heap.

During the mark phase the write barrier is active. This intercepts modifications to the references in an object. If the object is not colored BLACK then the object is processed and all of the child references are placed on the mark queue before the object is changed.
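The color transitions described above, including the LIGHTGREY overflow case, can be sketched as follows (the queue size is shrunk for illustration, and the object layout is invented):

```c
enum color { WHITE, LIGHTGREY, DARKGREY, BLACK };

#define QUEUE_SIZE 2
struct obj { enum color c; int has_refs; };

static struct obj *queue[QUEUE_SIZE];
static int qn = 0;

static int enqueue(struct obj *o)
{
    if (qn == QUEUE_SIZE)
        return 0;               /* queue full: overflow */
    queue[qn++] = o;
    return 1;
}

/* Visiting an object during marking. */
static void visit(struct obj *o)
{
    switch (o->c) {
    case WHITE:
        if (!o->has_refs) {
            o->c = BLACK;       /* no children: fully marked at once */
            break;
        }
        /* queue it for tracing, or record the overflow in its header */
        o->c = enqueue(o) ? DARKGREY : LIGHTGREY;
        break;
    case LIGHTGREY:
        /* a later visit may find queue space, saving a heap pass */
        if (enqueue(o))
            o->c = DARKGREY;
        break;
    case DARKGREY:
    case BLACK:
        break;                  /* already queued or already done */
    }
}
```

Marking finishes only when the queue is empty and no LIGHTGREY objects remain, which is why the overflow color forces extra passes over the heap.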
The write barrier in effect extends the snapshot as described above.

The sweep process

The final stage of the collector is the sweep. Each item in the heap is visited; if the color of the object is still WHITE then it must be garbage and so can be released. Other objects are still live. The sweep is performed using a sweep pointer that moves through the heap. As objects are visited the mark bits are cleared, and adjacent free objects are merged together. The sweep process also links together all free objects in the heap to allow for faster allocation. This free list is rebuilt by the collector on each sweep. If the collector time allocation is exceeded, the sweep stops and is resumed next time. The collector keeps track of the amount of newly created free space, and uses this to determine if the requirements of any currently suspended threads have been met. If this is the case the threads are signaled as described above.

Allocation

Allocation is performed by searching the free list for a suitably sized object. Because allocation can take place concurrently with mark and sweep operations some care is needed. In particular, newly allocated objects are marked BLACK if they are allocated ahead of the sweep pointer and WHITE if they are behind it, to ensure they are not incorrectly released.

Evaluation of the new collector

Figure 4 shows the results for the Array 300 test when using the new collector. The reduced impact on the induced latency can clearly be seen; it is now completely contained in the 1 to 3ms range, which is much more acceptable for a real time thread. The downside can also be seen in the worst case for the string operations. With the new collector this is now 47ms compared with 38ms for the previous collector. This increase is mainly due to the extra work required to allow the collector to run concurrently (there really is no such thing as a free lunch!).
However these results are probably a worst case and are partially due to the large arrays in use; other tests using other data structures show a much smaller increase of only 2 or 3ms. Note also that the overall spread of test times has changed, with far fewer results lying above the 5ms point (154 v 288).

[Figure 4: Array 300 new collector — log-scale histogram of String Ops and Latency times (ms)]

This data also shows that, unlike the previous test, we now get a significant number of samples in the latency test showing up in the 1ms sector (approximately half of the samples in this test). This is due to the string test thread releasing the cpu early when it needs to wait for a collection, thus giving the real time thread a chance to run early.

The table below gives the overall performance (total test times in ms) for running various tests. The column marked "all" is for running a complete set of tests (25 in all) that exercise many different data structure types. The final row shows the percentage increase between the times for the new collector and those for build 1722.

Collector     string   array100  array200  array300  list300  tree300  all
leJOS 0.6     15295    8734      11001     33543     -        -        -
Build 1722    15541    8292      9393      13552     10741    8785     256625
new           15313    8316      9725      15164     11119    8718     265443
% increase    -1.47    0.29      3.53      11.89     3.52     -0.76    3.44

As can be seen, overall the new collector is around 4% slower. This seems to be in line with other reported results, which often report a 10% slow down when using a concurrent collector. As can also be seen from the above, the leJOS 0.6 collector was considerably slower for some of these tests (and was not able to run many of the more complex ones). This was mainly due to various bugs that have been fixed in build 1722.

Conclusions and future directions

The new collector seems to meet many of the set goals. It makes a considerable improvement in the latency incurred by threads not performing allocations. It consumes very little extra memory (approximately 100 bytes).
It has no additional impact on applications which do not require garbage collection (the collector only runs when memory is exhausted). It allows the use of large, complex data structures. All memory is available for use (alternate implementations that make use of a copying collector would potentially halve the available memory). The cost of these improvements is an increase in execution time of approximately 4% (for a very heavily allocation bound task).

There are however still some issues remaining:

● The use of large arrays and deep stacks with many references can force the mark phase to exceed the target time period. Better ways of handling these structures are needed. Other systems split large arrays into small "arraylets" to tackle the problem, but this approach has not been investigated here.
● The use of a simple mark/sweep collector does not address the issue of fragmentation. It may be necessary to add a compaction phase if this becomes an issue.
● The current snapshot at the beginning mechanism and relatively conservative write barrier are known to result in some garbage not being collected until the next iteration of the collector. So far this has not been seen as an issue, but in tight memory situations it may not work so well. A more aggressive approach may be required.