					The Management of Applications for Reconfigurable Computing using an Operating System
Grant Wigley & David Kearney
Advanced Computing Research Centre
School of Computer and Information Science
University of South Australia
Mawson Lakes SA 5095
Grant.Wigley@unisa.edu.au David.Kearney@unisa.edu.au

Abstract
As the number of system gates available on reconfigurable platforms increases beyond 10 million, the management of these resources and their sharing among many applications and users will become more of a concern. This paper identifies the fundamental services that must be provided by an operating system for reconfigurable computing and describes a prototype implementation. The paper defines a contract between the application designer and the operating system whereby the application is defined as a task graph whose nodes are pre-placed and routed modules and whose edges define the data flow dependencies between the nodes. After loading by the operating system, the task graph node modules are grouped into partitions that occupy contiguous area on the FPGA surface. The implementation features new versions of algorithms for the allocation of area to tasks, the partitioning of an application to fit selected allocated areas, and the placement and routing inside partitions. All the algorithms have small, deterministic and bounded run times with near linear time complexity, making them suitable to run in between time slices or at the initial loading stage of applications. Tests on the prototype with benchmark examples show that it is feasible and that fragmentation of the area of the FPGA among many users is manageable.

1. Introduction

As FPGA density increases with VLSI feature sizes below 0.3 micron, the need for time and space optimisation of reconfigurable designs will give way, as the major focus of users, to the need for tools to manage the complexity of designs in excess of 10 million system gate equivalents. This will lead to unprecedented demand for better design tools but will also open the way for the introduction of systems software for the management of pre-designed reconfigurable blocks. In traditional computing the latter is the preserve of an operating system (OS). Therefore operating-system-like software (and potentially reconfigurable hardware) will be needed for reconfigurable computing. This transformation is of course nothing new, as it mirrors what has already occurred in the general software area, where many aspects of software implementation that were once the preserve of the application programmer have been transferred to standard components of systems software. The reconfigurable computing research community thus needs to further investigate the requirements and technical implementation of reconfigurable systems software, and in particular an operating system for reconfigurable computing (Management of Applications for Reconfigurable Computing, MARC.1). There are some general axioms that underpin the work on MARC.1 reported in this paper.

First, the main focus of systems software for reconfigurable computing should not be solely the raw performance (area as measured by configurable logic block (CLB) usage, and delay) of applications using it. Based on the experience of traditional operating systems, there will be extra overheads in introducing systems software into reconfigurable computing. These overheads cannot of course be ignored, but they should not be the primary focus of the operating system design. The primary focus should be reducing complexity as seen by the user.

The second axiom is that an OS introduces a contract between two parties, the application designer and the systems software. If we take a top down view of the systems software for reconfigurable computing (as is appropriate for reducing the growing complexity of application development and implementation), a major focus of the requirements must be to identify services that such software can provide to potential users. Some of the tedious low-level tasks that face designers now should transfer to standard software (and RC hardware as appropriate). Dealing with low level input and output, fitting the application onto the computing surface, and loading and running pre-designed circuits are among the services that we believe the systems software should provide. This may involve standardisation of interfaces for applications in a way that forms a “reconfigurable OS API” and the creation of suitable abstractions for standard libraries in reconfigurable computing. At present there is no general agreement as to what form this “contract” between the application programmer and the systems software should take, and it is one of the aims of the research described here to provide guidance on the matter.

The consequence of this second axiom is that the systems software will be manipulating ‘chunks’ of reconfigurable logic that have been technology mapped, placed and routed by application design tools. There is a significant difference between reconfigurable computing and general purpose computing in the tasks that must be performed to do this manipulation. In the traditional software environment a linker manages pointers to software modules and a loader places code segments into a one dimensional memory space. In reconfigurable computing the pointers may, at their most complex, become two dimensional routes of connections between groups of CLBs, and the one dimensional memory space becomes a two dimensional array of available CLBs. Thus in reconfigurable computing, the systems software must perform more complex tasks in linking and loading the applications. Some of these tasks are common to those done by the application design and development tools. The common tasks include placement and routing. The systems software does not need to do technology mapping, for the same reason that traditional operating systems do not do compilation. It is shown here that the place and route algorithms used by the systems software must meet very different requirements to those assumed for their counterparts in the design process. It will be shown that the system place and route software must be deterministic, bounded and fast in run time, as compared with the design tools, which are often slow, stochastic, and unbounded in run time to allow for optimisation of application area and execution delay.

Tasks that are unique to MARC.1 include allocation and partitioning. Allocation, the process whereby an area of the FPGA is selected to load the application, seems to be unique to reconfigurable computing. Previous researchers have considered partitioning as an additional task at the design stage. We show that the needs of a MARC.1 partitioner are quite different to these better-studied applications. For example, it will be shown that relaxing the constraints on area and delay performance makes fast, deterministic and bounded partitioning algorithms feasible.

This paper is organized as follows. First there is a brief review of the research literature applicable to operating systems for reconfigurable computing. It will be shown that at present there is very little work in this area. Next, the most fundamental services that systems software should offer to users of reconfigurable computing are defined, and detailed descriptions of a prototype implementation of these services are provided. Finally, a framework is defined for the experimental investigation of the performance of these services, the results of the experiments are reported, and the current work underway on the operating system for reconfigurable computing is described.

2. A Review of General Work on Operating Systems for Reconfigurable Hardware

The work closest to MARC.1 relates to run time reconfiguration. Most of the work on the run-time management of reconfigurable systems considers the problem of managing the time evolution of a single application. However, there is increasing interest in using reconfigurable computers for horizontally and vertically integrated application domains such as image and signal processing. New FPGA architectures, such as the Time multiplexed FPGA (Trimberger 1997) and the partially reconfigurable FPGA (Xilinx 1997), and FPGA co-processing boards, such as SPACE.2 (Gunther 1997), are all capable of handling multiple dependent or independent circuits. With the development of architectures that can handle these types of circuits, there is a growing need from users for a more flexible run-time environment. It is then important that run-time management of the resource be consistent, efficient, and fair. We therefore believe it is timely to consider the possibility of providing OS support for reconfigurable logic.

Earlier work on operating systems for reconfigurable computers includes an implementation of an OS for a custom computing machine (CCM) based on the Xputer paradigm (XOS) (Kress 1997). XOS runs as an extension to the actual host OS and determines what parts of the application should execute on the configurable hardware and what parts should execute on the host hardware with conventional software. XOS supports multiple users by globally statically scheduling all of the Xputer processes. The OS described in this paper differs from the Xputer OS in that it is a true multi-user OS.

In a previous paper (Diessel 1999), two of the authors described a web-based multi-user operating system that shared the SPACE.2 architecture amongst eight simultaneous users. This OS used a web-based client to allow novices to run reconfigurable computing applications. Although this was a multi-user OS, it had limitations: each FPGA chip is assigned to a single user, which potentially wastes large amounts of FPGA logic, a problem that will get worse as large chips become the norm (Xilinx 2000).

Brebner raised the issue of managing a virtual hardware resource (Brebner 1996). He proposed decomposing reconfigurable computing applications into swappable logic units (SLUs), where an SLU was defined as a logic circuit capable of performing a function in terms of specified inputs. In our work, application partitions are equivalent to SLUs. SLUs are considered to be fixed in size, whereas application partitions are aggregations of logic modules and hence can be subdivided to make better use of the area of the FPGA.

Figure 1 – The FPGA task graph (labels: Application Partition, Communication Channel, Application Module, FPGA Chip No. 1 or Real Surface, FPGA Chip No. 2)

3. Architectural Model of MARC.1

The MARC.1 provides the basic services for loading and execution of pre-designed user applications onto one or more FPGAs that are shared among several users. To make use of these services, the application designer has to construct their applications according to a contract. The basis of this contract is that applications are composed of application modules that are separately technology mapped, placed, and routed at design time into logic modules. Logic modules may come from a pre-designed library of components, or the application could be designed monolithically and then partitioned into logic modules prior to place and route. In either case it is up to the designer to select the level of modularisation of the design and then to use one of the extensive list of available partitioners (Behrens 1996, Krupnova 1997) to prepare the design into a library file. In the current version of the MARC.1 the logic modules are fixed in area (a fixed number of CLBs).

The dependency relationship between modules in an application is represented by the edges of a data flow graph called the task graph, where the nodes of the graph are the application modules. The task graph should strictly be considered a hypergraph, because later in the partitioning process more than one node and its associated edges may be grouped together. Such a group is called an application partition.

The FPGA surface is divided into logic frames. The size of the logic frames is not fixed, unlike the frames in the real RAM of a traditional virtual memory system, which are commonly of fixed size. In the current version of the OS they are only allowed to be rectangular. When the OS is initialised, the real surface consists of just one frame that takes up the entire area. When the first logic modules are placed onto the real surface they form a frame, and the remainder of the real surface may be considered as forming one or more other frames. So the size of a frame is actually determined when the logic modules forming a partition are actually placed on the real surface. These concepts are shown in Figure 1.

4. Prototype Simulation Implementation

In this section we describe the algorithms and implementation of our first prototype Operating System for Reconfigurable Computing.

4.1. Allocation

The real surface at any time has area allocated to running applications, and the remainder of the area is free and available for new partitions. The available area is represented as a free list of logic modules, analogous to a free list of disk blocks. However, we are not really interested in free logic modules but rather in free logic frames, since the OS has no knowledge of the internal details of logic modules. Therefore the FPGA surface is broken into an array of logic frames together with their X and Y coordinates. The OS takes the next application to be executed on the FPGA surface and, given its known minimum area requirement (measured in logic modules, including used CLBs and CLBs wasted due to internal fragmentation of the logic module), searches the logic frame free list for an area-contiguous set of frames that will satisfy the area requirement, favouring sets of frames that form areas that are close to square. We call this process allocation.

The allocation algorithm determines if there is an available free frame on the real FPGA surface. The initial dimensions of the application partition are calculated, and the algorithm uses these dimensions to create a virtual application partition rectangle to assist the OS in finding an available frame. It initially places the virtual rectangle over the first logic module and determines if the space is available. If the virtual application partition rectangle overlaps another logic partition, the virtual rectangle is progressively and deterministically moved over each of the logic modules and the process is repeated until a feasible frame is found. Once this frame has been found, the OS marks the space in the free list as taken and the space is reserved for that application partition. In the event that the allocator cannot find a big enough frame for the entire application partition, the partitioner is then invoked.
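The allocation scan described in Section 4.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grid representation (`surface` as a list of rows of booleans, one entry per logic-module slot), the function names, and the row-major scan order are all assumptions made for illustration. The near-square rule for the initial dimensions is likewise one plausible reading of "favouring areas that are close to square".

```python
import math


def near_square_dims(n_modules):
    """Return (width, height) of a near-square rectangle holding n_modules slots."""
    if n_modules <= 0:
        raise ValueError("an application needs at least one logic module")
    width = math.isqrt(n_modules - 1) + 1      # ceil(sqrt(n_modules))
    height = math.ceil(n_modules / width)
    return width, height


def is_free(surface, x, y, width, height):
    """True if every slot under the virtual rectangle is unoccupied."""
    return all(not surface[row][col]
               for row in range(y, y + height)
               for col in range(x, x + width))


def allocate(surface, n_modules):
    """Slide a virtual rectangle over the surface in a fixed (row-major) order.

    Returns (x, y, width, height) of the reserved frame and marks it as taken,
    or None when no frame is large enough (the partitioner would then be invoked).
    """
    width, height = near_square_dims(n_modules)
    rows, cols = len(surface), len(surface[0])
    for y in range(rows - height + 1):
        for x in range(cols - width + 1):
            if is_free(surface, x, y, width, height):
                for row in range(y, y + height):
                    for col in range(x, x + width):
                        surface[row][col] = True   # reserve the slot
                return (x, y, width, height)
    return None
```

Because the scan visits positions in a fixed order and stops at the first fit, its run time is bounded by the number of slots times the rectangle area, which matches the deterministic and bounded behaviour the section asks for. On an empty 4 by 4 surface, a 4-module application is given the 2 by 2 frame at the origin, and a later 9-module request fails and would fall through to the partitioner.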

4.2. Partitioning

The process of selecting a hypergraph (Berge 1976) from a task graph that will fit into a particular contiguous area is our definition of partitioning. Although a wide range of algorithms has been published that address the partitioning problem for FPGAs (Brasen 1998, Chan 1996, Fang 1998), most of these algorithms have an objective function of dividing the netlist (or task graph) into clusters such that inter-cluster communication (interconnect) is minimized (Alpert 1995) or the application execution delay is minimized. In selecting a partitioning algorithm for use in a MARC.1, the objective is a small, deterministic and bounded algorithm run time. It must be small because (as with scheduling in a traditional OS) the run time is a direct overhead. A deterministic and bounded run time is important to avoid unpredictable response times that would reduce the usability of the operating system. The requirements for a bounded and deterministic run time exclude most partitioning based on combinatorial optimisation heuristics.

In this prototype OS we have modified an algorithm that was originally proposed for temporal partitioning (Purna 1999). Temporal partitioning aims to break a netlist that is a directed graph (called here a task graph) into segments that might be dynamically reconfigured by a process analogous to time slicing. This has been adapted for ‘space slicing’. It was chosen over other partitioning algorithms because it has near linear time complexity and partitions under area constraints. Although the partitioning algorithm used in the MARC.1 was modelled on the algorithm presented by Purna, a number of significant modifications were made to it so that it would work in a multi-user operating system for reconfigurable computing. Firstly, it has been interfaced to a fast deterministic placement algorithm (Gehring 1996, Ludwig 1997). Secondly, an interaction between the partitioner and the placer is introduced so that the partitioner can be called iteratively to find the closest fit to the available area in a logic frame. Thirdly, it is now possible to specify the size of the partitions. Finally, the new algorithm introduces a control variable (target fragmentation level) that allows the selection of the level of interaction between the partitioner and the placer. At one extreme of the control variable the partitioner will find a fitting partition with a single iteration (no interaction); at the other extreme it will iterate to find the closest partition that can be placed into the frame. Clearly, if the interaction is reduced there may be considerable inefficiency in the use of area in the FPGA, but there would be a reduction in the run time overhead of the partitioner.
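To make the idea of area-constrained 'space slicing' concrete, the following Python sketch partitions a task graph in dependency order so that each partition fits one free frame. It is a simplified illustration only, not the modified Purna algorithm used in the prototype: it assumes unit-area logic modules, a task graph given as a dictionary mapping every node to its list of successors, and frame capacities expressed as a number of logic modules. The interaction with the placer and the target fragmentation level are deliberately left out.

```python
from collections import deque


def topological_order(task_graph):
    """Kahn's algorithm: a deterministic, linear-time dependency ordering."""
    indegree = {node: 0 for node in task_graph}
    for successors in task_graph.values():
        for succ in successors:
            indegree[succ] += 1
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in task_graph[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    if len(order) != len(task_graph):
        raise ValueError("task graph contains a cycle")
    return order


def partition(task_graph, frame_capacities):
    """Cut the ordered modules into consecutive groups, one per free frame."""
    order = topological_order(task_graph)
    partitions, start = [], 0
    for capacity in frame_capacities:
        if start >= len(order):
            break                      # every module already has a partition
        if capacity <= 0:
            continue                   # skip unusable frames
        partitions.append(order[start:start + capacity])
        start += capacity
    if start < len(order):
        raise ValueError("not enough free area for the application")
    return partitions
```

Each node and edge is visited once, so the run time is linear in the size of the task graph and independent of the data, which is the property the section singles out as the selection criterion.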

4.3. Placement

Determining where the application modules of the application partition are located in logic partitions is defined as placement. Once a logic partition has been created and space reserved for it, the logic modules have to be individually placed within that logic frame. Again, in the traditional design method the placement algorithm produces a high quality placement (minimizing area and routing delays), but with no regard for run time. The placement algorithm chosen for the MARC.1 trades an increase in fragmentation for a run time speed up. It is simple, but produces a fast constructive placement. The algorithm simply places the connecting logic modules (modules that need to communicate with other modules) close to each other. One module is placed to the right, while the other one is placed directly above the connecting module. A single input logic module only requires one connection and is placed immediately to the right of the connecting module. This process is repeated until the entire application module has been placed. If the algorithm exhausts all the area allocated to it, it will inform the calling function to re-allocate or to reduce the number of logic modules contained in the application partition. The algorithm was chosen because its placement results yield a good chance of a quick successful route, its placement results are repeatable, and it avoids the iterative techniques commonly employed by commercial tools, which significantly extend run time.

4.4. Routing

In this paper, routing is defined as creating an electrical connection between two logic modules by setting the appropriate routing switches. Many different routing algorithms have been developed for FPGAs (Chan 2000, Nam 1999, Wood 1997). However, most of these algorithms are not suitable for a MARC.1, as they are stochastic (Brown 1992, Chan 1993) and a MARC.1 routing algorithm must have a predictable run time. The routing in a MARC.1 is not as complex as design time routing, because the number of items to route is much lower: all the logic modules have already been internally placed and routed. Thus, although the routing problem is known to be NP-hard (Wu 1994), this complexity is not so important for small problems. Restricting the types of routing resources that are searched to find a feasible route can further reduce the complexity of the router. The router used in this work takes advantage of these two factors to achieve approximately polynomial execution time in relation to the number of logic modules to be routed.

The first stage of the router is to route all nearest neighbour connections. By definition, nearest neighbour connections are those to logic modules that are placed directly to the north, south, east or west of the logic module being routed and that require a single connection. As the placement algorithm attempts to place all communicating logic modules next to each other, most of the routes should be nearest neighbour routes. The second stage of the router handles routes that are not nearest neighbour. The router uses a set list of rules to attempt to gain a successful route between modules that are not co-located. If a route fails, the complete application partition cannot be routed; this causes the OS to invoke the allocator with different specifications, and the process of attempting to instantiate the application restarts.
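The first routing stage described in Section 4.4 separates nearest neighbour connections from the rest. The Python sketch below shows only that classification step, under stated assumptions: `placement` is a hypothetical mapping from module name to grid coordinates produced by the placer, and "nearest neighbour" is taken to mean a Manhattan distance of exactly one slot (directly north, south, east or west). The rule list used by the second stage is not specified in the text and is not modelled here.

```python
def split_connections(placement, connections):
    """Split connections into nearest-neighbour routes and the remainder.

    placement:   dict mapping module name -> (x, y) slot on the logic frame
    connections: list of (source, destination) module-name pairs
    """
    nearest, remaining = [], []
    for src, dst in connections:
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        if abs(x1 - x2) + abs(y1 - y2) == 1:   # N, S, E or W neighbour
            nearest.append((src, dst))
        else:
            remaining.append((src, dst))       # left for the rule-based stage
    return nearest, remaining
```

A single pass over the connection list is enough, so this stage costs time linear in the number of connections; only the `remaining` list would need the more expensive second stage.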

5. Experimental Investigation

5.1. Area Fragmentation

In a traditional software OS environment, fragmentation is usually associated with a loss of contiguous locations in which to store a particular application program. In the two dimensional environment of an RC there is a need to generalize this concept. In this paper logic modules (assumed square) are area contiguous if they can be arranged within a rectangular bounding box. In the MARC.1 there are three types of area fragmentation. Within any logic module there can be fragmentation because the offline design tools used to create the logic modules cannot fit the design into the required square bounding box without some CLBs remaining unused. We call this logic block internal fragmentation. Within any application partition there may be area fragmentation because the algorithms that the MARC.1 uses to place and route the application partitions into the logic frame must do so in a time that is fast, deterministic, and bounded. To allow the algorithms to complete in reasonable time, some spare area must exist inside each partition. We call this partition internal fragmentation. The size and location of logic frames on the real reconfigurable surface determine the partition external fragmentation at a particular point in time. This fragmentation is shown in Figure 2. If the logic modules were permitted to be non-square there would be additional possibilities for an increase in the logic block internal fragmentation, and if the logic frames were allowed to be non-rectangular there would be additional possibilities for an increase in the external area fragmentation. Neither of the last two cases is treated in the current paper.

Figure 2 – An example of partition external fragmentation (labels: FPGA Surface, Logic Frames, Free Spaces, Occupied Spaces)

5.1.1. Quantification of area fragmentation

The allocation algorithm already performs a search of the available area on the FPGA to find the largest area-contiguous rectangle. To better quantify the area fragmentation, an aspect ratio is defined for this contiguous rectangle as the ratio of the shortest dimension to the longest dimension. If this aspect ratio is multiplied by the area, the result is the square of the shortest dimension of the rectangle. The shortest dimension is called the characteristic dimension of the area. This is the most basic measurement of the usefulness of the rectangle, in the sense that it both measures the size available and modifies this according to the usefulness of the area for the purposes of the OS. The larger the characteristic dimension, the more useful the space. A small characteristic dimension can mean that the logic frame is very small in area, or that its dimensions result in the space being poorly distributed, e.g. a logic frame with dimensions of 1 by 10. To obtain a more balanced figure, the search procedure is repeated to identify the next biggest contiguous area, until a distribution of characteristic dimensions is obtained. It should be evident that the areas found last will be of least interest, since they will have small characteristic dimensions. So the search is terminated when half the area has been assigned to the distribution, and the mean characteristic dimension of this distribution is used as the fragmentation measure.

5.2. Experimental Framework

There is little general agreement yet on which applications should perform well in a reconfigurable OS. Contrast this with the well-defined agreement on “representative” benchmarks in computer architecture research (gcc, SPICE, etc.). There is still a need to survey the applications that have become popular in reconfigurable computing so that there is some basis of agreement on what are the most representative benchmark applications. In the experiments reported here the benchmarks have been constructed specifically for the tests. There is also no general agreement yet about what constitutes good performance in an OS environment for reconfigurable platforms. We expect that traditional measures of OS performance, such as the overheads generated by the OS, would be of interest when using reconfigurable platforms. In this paper the execution time of the OS software is the major factor investigated as a measure of the likely overheads.

The two factors that have been investigated in the experiments are the size of the applications and the degree of partition external fragmentation, and their effects on the execution time of the MARC.1 algorithms. The effect of the application size on the run time performance of the OS algorithms is of interest because we expect there may be practical limits to the number of logic modules that the OS can handle. The size is measured in logic modules because these are the smallest unit that the OS has to manipulate. Given that the absolute size of a logic module (in equivalent gates or CLBs) is defined by the application designer, a restriction on the number of logic modules does not necessarily restrict the absolute size of the application as measured in equivalent gates or CLBs.

The effect of fragmentation on the run time performance of the OS algorithms is of interest because fragmentation is a measure of the resource usage and contention of the real surface. The fragmentation measure used in this paper takes into account both the available spare area and the distribution of that area. When the surface is externally fragmented, applications must be partitioned to fit the available area; thus the execution time of the partitioner would be expected to grow with fragmentation. However, it could be expected that the place and route stages would be quicker as the size of the partitions is reduced (and the number of partitions increases). The fragmentation of the area must be measured both in the size distribution of logic frames that are area contiguous and in the aspect ratio of these frames, because frames with small aspect ratios are likely to increase the internal partition fragmentation.
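The fragmentation measure of Section 5.1.1 can be sketched in a few lines of Python. The sketch makes several assumptions that are not in the paper: the free area has already been decomposed into disjoint rectangular frames given as `(width, height)` pairs, frames are visited from largest to smallest area (standing in for the repeated "next biggest contiguous area" search), and the characteristic dimension of a frame is taken to be its shortest side.

```python
def characteristic_dimension(width, height):
    """Shortest side of a rectangular frame (assumed reading of the definition)."""
    return min(width, height)


def fragmentation_measure(free_frames):
    """Mean characteristic dimension over the largest frames covering half the free area."""
    if not free_frames:
        raise ValueError("no free frames to measure")
    total_area = sum(w * h for w, h in free_frames)
    # Largest areas first, mirroring the repeated search for the next biggest frame.
    ordered = sorted(free_frames, key=lambda f: f[0] * f[1], reverse=True)
    assigned_area, dims = 0, []
    for width, height in ordered:
        if 2 * assigned_area >= total_area:    # half the free area is accounted for
            break
        dims.append(characteristic_dimension(width, height))
        assigned_area += width * height
    return sum(dims) / len(dims)
```

With this reading, a surface whose free area is a single 6 by 6 frame plus a few slivers scores 6.0, while a surface broken entirely into single-module holes scores 1.0. A 1 by 10 sliver contributes only 1 even though its area is 10, which is the behaviour the text motivates with its 1 by 10 example.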

perform the route for small partitions is only a small part of the total time used by the router. Per partition set up time and other overheads seem to dominate the overall routing times in these cases. It is interesting to note that the total OS run times are dominated by partition and place times rather than routing times. This is a positive result indicating that the traditional long routing times associated with many design tools are not likely to be a problem with the OS algorithms provided in this paper. It is also pleasing to note that the total times seem to be close to linear time. In regard to fragmentation the results can be interpreted to mean that even with a totally fragmented FPGA (that is one where each logic module must be placed in its own partition) the OS overheads of 300mS are not impractical in relation to the execution time of typical reconfigurable applications.

6.

Current Work

5.3.

Experimental Results

All the current experimental results are expressed as the execution time for the OS to start the application. This includes the time for allocation, partitioning and place and route. The experimental data is differentiated according the size of the task graph (measured in the number of nodes) and the degree of partition external fragmentation of the real FPGA surface. The execution time is measured as wall clock time on an otherwise idle Celeron i466 with 64 Meg of RAM, running the Linux OS kernel 2.2.15-5.0. Fragmentation can be seen to consistently increase the execution time of the partition and placement stages as expected. The partition and place appears to be close to linear time as might be conjectured from the algorithms. The curve showing fixed partition and place times is for the application that entirely fits within a single frame and hence the partitioner is not called at all. Intuitively routing within each application frame is likely to be at least polynomial in the size of the application partition. However if the application frames are small (high external partition fragmentation) the route time for each will be a small constant and the total time for the application, (the sum of a large number of small frames) will be more closely linear time. The data in Graph 2 do not contradict these conjectures. The curve for high fragmentation is more linear than for low fragmentation. The absolute routing times of Graph 2 however show (counter intuitively) that is takes less time to route an application that entirely fits in a partition than it does to route the application when it is spread across many partitions. This reflects that the time actually used to

After the successful simulation of the operating system for reconfigurable computer, it was decided to implement such a system using a real hardware platform. The reconfigurable platform chosen is an RC1000-pp reconfigurable development board from Celoxica (Celoxica 2000). The platform is a full length PCI based option card that contains a single Xilinx Virtex 1000 FPGA as its reconfigurable logic resource. This board was chosen because of a number of reasons. One, Xilinx completely supports the board with all their design software and as such will reduce any problems associated with incompatible software. Two, as the board has a PCI based connection, it has a 33Mhz bus connection to the host hardware and the 4 banks of 2 Mb of memory is accessible to the host processor. Several platforms, including the Wildcard from Annapolis Computing use the FPGA to complete their glue logic. This can be a problem with an operating system, as at minimum the FPGA requires a design to be loaded at all times and as such can not be swapped out, thus limiting the amount of logic resource available to user applications. Three, the board can perform partial reconfiguration using SelectMap. As application modules will be loaded at various times to the FPGA the board must support partial reconfiguration and using the SelectMap method is a Xilinx standard way to perform such a function. As the reconfigurable resource is in excess of 1 million system gates dense, reconfiguring the complete resource will cost too much run-time. Therefore partial reconfiguration is essential. As the bit-stream configuration protocol on Xilinx Virtex FPGAs is encrypted the use of Xilinx JBits is essential in an operating system that will be required to perform low level bit-stream manipulation. JBits is an API written in JAVA that allows the user to alter the bitstream of a Virtex FPGA without having to use the offline design tools. There are a number of problem with using the current offline design tools

Graph 1, 2 and 3 shows the results of the experiments (The numbers in the legend refer to the characteristic dimension)

[Plot "Partition & Place": time (milliseconds) against number of application modules, one curve per characteristic dimension (1, 1.66, 3, 6)]

Graph 1 The execution time for the partitioning and the placement stages

[Plot "Routing": time (milliseconds) against number of application modules, one curve per characteristic dimension (1, 1.66, 3, 6)]

Graph 2 The execution time for the routing stage

[Plot "Total": time (milliseconds) against number of application modules, one curve per characteristic dimension (1, 1.66, 3, 6)]

Graph 3 The total execution time for the complete process

with an operating system for a reconfigurable computer. One, most of the offline design tools require user interaction at some level. For an operating system to be useful there must be little or no user interaction, and the offline design tools were not written with this in mind. Two, using the current offline design tools to alter only one or two routes is very inefficient. Since all of the application modules are pre-placed and routed, the only placement and routing the OS has to perform is between application modules. Most of the algorithms used in the offline design tools are stochastic, and re-compiling the complete application partition would require too much run-time. Using JBits, only the routing switches that need to change are altered, without re-routing the complete application partition. This minimizes the overheads introduced by the OS.
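The idea that the OS only routes between pre-placed and routed modules can be illustrated with a minimal Python sketch. It is not the prototype's code and it does not use the JBits API; the `PlacedModule` model, the port offsets and the Manhattan-length estimate are all assumptions made for illustration. Given a task graph and the position of each module in a partition, it lists the inter-module connections that would need routing and leaves the internals of every module untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedModule:
    """Hypothetical model of a pre-placed and routed module in a partition."""
    name: str
    row: int                    # row of the module's top-left corner
    col: int                    # column of the module's top-left corner
    out_port: tuple[int, int]   # (row, col) offset of the output port
    in_port: tuple[int, int]    # (row, col) offset of the input port


def inter_module_nets(modules, edges):
    """Return one net per task-graph edge: (src, dst, src_pos, dst_pos, length).

    Only connections between modules are produced. Routing inside each
    module is assumed to be fixed already and is never revisited.
    """
    by_name = {m.name: m for m in modules}
    nets = []
    for src_name, dst_name in edges:
        src, dst = by_name[src_name], by_name[dst_name]
        src_pos = (src.row + src.out_port[0], src.col + src.out_port[1])
        dst_pos = (dst.row + dst.in_port[0], dst.col + dst.in_port[1])
        # Manhattan distance is a simple stand-in for the wiring required.
        length = abs(src_pos[0] - dst_pos[0]) + abs(src_pos[1] - dst_pos[1])
        nets.append((src_name, dst_name, src_pos, dst_pos, length))
    return nets


# Illustrative three-module pipeline A -> B -> C (made-up coordinates).
modules = [
    PlacedModule("A", row=0, col=0, out_port=(1, 3), in_port=(1, 0)),
    PlacedModule("B", row=0, col=6, out_port=(2, 3), in_port=(1, 0)),
    PlacedModule("C", row=5, col=6, out_port=(0, 3), in_port=(0, 0)),
]
edges = [("A", "B"), ("B", "C")]
nets = inter_module_nets(modules, edges)
```

The work in this sketch grows with the number of task-graph edges rather than with the amount of logic inside the modules, which is the property the text relies on to keep the run-time overhead of the OS small.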

execution time when the FPGA surface becomes fragmented, and we have developed a method to measure fragmentation within a logic frame, known as the characteristic dimension. We now believe an operating system for reconfigurable computing is feasible: execution times measured on our prototype do not exceed 250 milliseconds, the OS was able to manipulate the applications and still obtain correct outputs, and the algorithms developed are capable of handling larger, more complex circuits. The next stage of our research is to continue the development of the prototype into a full working product. The algorithms for allocation, placement and routing will be improved, and we will attempt to minimize the overheads introduced by the OS.

8. References

6.1 Application Development

To test the worth of the operating system, a number of applications are being developed with it in mind. An MP3 player, implemented partly in hardware and partly in software, is currently under development. The runtime-intensive part of the MP3 algorithm will execute in hardware, with a software interface connecting to the operating system to receive the output data from the FPGA. Another application being developed for the operating system is the well-known automatic target recognition problem. Again, the application partition will be broken down into small application modules that can easily be loaded onto the surface of the FPGA, with the operating system supplying the communication architecture. For any operating system to be accepted by FPGA application designers, current well-known applications have to be shown to execute under it without introducing overheads that reduce the benefits gained from using FPGAs.

ALPERT, C. and KAHNG, A. (1995): Recent Directions in Netlist Partitioning: A Survey. Integration, the VLSI Journal, Vol. 19, pp 1-81, 1995.
BEHRENS, K., HARBICH, K., and BARKE, E. (1996): Hierarchical Partitioning. In International Conference on Computer-Aided Design, pp 470-477, San Jose, CA, USA, November 1996.
BERGE, C. (1976): Graphs and Hypergraphs. New York, American Elsevier, 1976.
BRASEN, D. and SAUCIER, G. (1998): Using Cone Structures for Circuit Partitioning into FPGA Packages. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 7, pp 592-600, 1998.
BREBNER, G. (1996): A Virtual Hardware Operating System for the Xilinx XC6200. In 6th International Workshop on Field-Programmable Logic and Applications (FPL'96), pp 327-336, Darmstadt, Germany, September 1996.
BROWN, S., FRANCIS, R., ROSE, J., and VRANESIC, Z. (1992): Field Programmable Gate Arrays. Boston, USA, Kluwer Academic Publishers, 1992.
CELOXICA (2000): RC1000-PP Hardware Reference Manual.

7. Conclusions

In this paper we have identified the fundamental services that must be provided by MARC.1: allocation, partitioning, placement and routing. We have defined a contract between the application designer and the operating system whereby the application is defined as a task graph whose nodes are pre-placed and routed logic modules. We have described and developed a prototype implementation of MARC.1 that allocates FPGA area when required, partitions the application into any number of smaller applications if it is too large to fit into the largest available area, places the logic modules in the logic frame according to simple placement rules, and finally routes the application partition according to the data dependencies in the task graph. We have presented a set of results obtained from MARC.1, which include the effect on the algorithm

CHAN, P. (1993): On Routability Prediction for Field Programmable Gate Arrays. In IEEE Design Automation Conference DAC'93, pp 326-330, Dallas Texas, USA, June 1993.
CHAN, P. and SCHLAG, D. (2000): New Parallelization and Convergence Results for NC: A Negotiation-Based FPGA Router. In 8th ACM/SIGDA international symposium on Field-Programmable Gate Arrays (FPGA'00), pp 165-174, Monterey, CA, USA, February 2000.
CHAN, P., SCHLAG, D., and ZIEN, J. (1996): Spectral-Based Multiway FPGA Partitioning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 5, pp 554-560, 1996.
DIESSEL, O., KEARNEY, D., and WIGLEY, G. (1999): A Web-based Multi-user Operating System for Reconfigurable Computing. In IPPS/SPDP'99 Parallel and Distributed Processing, pp 579-587, San Juan, Puerto Rico, USA, April 1999.
FANG, W. and WU, A. (1998): Performance-Driven Multi-FPGA Partitioning Using Functional Clustering and Replication. In 35th annual conference on Design Automation Conference (DAC'98), pp 283-286, San Francisco, California, June 1998.
GEHRING, S. and LUDWIG, S. (1996): The Trianus System and its Application to Custom Computing. In 6th International Workshop on Field-Programmable Logic and Applications (FPL'96), pp 176-184, Darmstadt, Germany, September 1996.
GUNTHER, B. (1997): SPACE 2 as a Reconfigurable Stream Processor. In 4th Australasian Conference on Parallel and Realtime Systems (PART'97), pp 286-297, Singapore, September 1997.
KRESS, R. and HARTENSTEIN, U. (1997): An Operating System for Custom Computing Machines based on the Xputer Paradigm. In 7th International Workshop on Field-Programmable Logic and Applications (FPL'97), pp 304-313, London, UK, September 1997.
KRUPNOVA, H., ABBARA, A., and SAUCIER, G. (1997): A Hierarchy-Driven FPGA Partitioning Method. In 34th Design Automation Conference (DAC'97), pp 522-524, Anaheim, CA, USA, June 1997.
LUDWIG, S. (1997): Hades - Fast Hardware Synthesis Tools and Applications. 190 pp, PhD Dissertation, no. 12276, Zurich, Swiss Federal Institute of Technology, 1997.
NAM, G., SAKALLAH, K., and RUTENBAR, R. (1999): Satisfiability-Based Layout Revisited: Detailed Routing of Complex FPGAs Via Search-Based Boolean SAT. In 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays (FPGA'99), pp 167-175, Monterey, CA, USA, February 1999.
PURNA, K. and BHATIA, D. (1999): Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers. IEEE Transactions on Computers, Vol. 48, No. 6, pp 579-590, 1999.
TRIMBERGER, S. (1997): A Time-Multiplexed FPGA. In IEEE Workshop on FPGAs for Custom Computing Machines (FCCM'97), pp 22-28, Napa Valley, CA, USA, April 1997.
WOOD, R. and RUTENBAR, R. (1997): FPGA Routing and Routability Estimation Via Boolean Satisfiability. In 1997 ACM fifth international symposium on Field-programmable gate arrays (FPGA'97), pp 119-125, Monterey, CA, USA, February 1997.
WU, Y. and CHANG, D. (1994): On the NP-Completeness of Regular 2-D FPGA Routing Architectures and a Novel Solution. In 1994 IEEE/ACM international conference on Computer-aided design, pp 362-367, San Jose, CA, USA, November 1994.
XILINX (1997): XC6200 Field Programmable Gate Arrays. Xilinx, Inc., 1997.
XILINX (2000): Virtex-II Architecture Preview. URL http://www.xilinx.com/products/virtex/ss_vir2.htm