A Virtual Machine for Merit-Based Runtime Reconfiguration

Brian Greskamp, University of Illinois at Urbana-Champaign, greskamp@uiuc.edu
Ron Sass, University of Kansas, rsass@ittc.ku.edu


SRAM-based FPGAs can be quickly and repeatedly reconfigured. One advantage of this flexibility is that time-multiplexing the FPGA’s programmable logic can effectively increase the capacity of a resource-constrained system. Systems consisting of a processor and FPGA resources, where the FPGA’s programmable logic implements specific functionality to augment the processor and improve the performance of the system as a whole, are well-known [2, 10, 12, 13, 6, 4, 14]. Since not all of the functionality is needed simultaneously, Run-Time Reconfiguration (RTR) has been proposed by a number of groups to increase (virtual) capacity. The prevailing reconfiguration policy is “on demand.”

The on-demand policy stipulates that if a particular functionality (kernel) is needed by the application, then it must be present in the programmable logic. If it is not already present, the system stalls while the necessary reconfiguration takes place. Hence, researchers have focused on reducing reconfiguration latency as a means of improving system performance. Well-known techniques range from configuration caching [8] to configuration compression [5] to pre-fetching [9] and compile-time scheduling of reconfigurations [11].

As an alternative to on-demand reconfiguration, this work advocates a merit-based reconfiguration policy. In the merit-based approach, each kernel that has a hardware implementation also has a software implementation. A runtime system performs continuous profiling of the application to decide which hardware reconfigurations are most profitable at any given time. Such a system can minimize thrashing, keeping the set of hardware-resident functionality constant over short intervals while still adapting to phased behavior in the application. Consequently, the system is better able to tolerate reconfiguration latencies.

Other related work includes the RTR system for Java sketched in [3] and the HASTE [7] unified instruction set machine, which can quickly move kernels between the processor and reconfigurable unit at runtime. The main contribution of this work is the exhibition of a simple “merit” heuristic that can be used to direct reconfiguration.

In this synopsis we present a brief overview of our ongoing work. First, a simple online algorithm for calculating a figure-of-merit for RTR systems is formalized. Next we describe its implementation and demonstrate a working prototype. Included is a summary of our first results, which show (i) that the cost of profiling and calculating the proposed figure of merit is not prohibitive (indeed, it appears negligible) and (ii) that for a synthetic benchmark application, the proposed merit heuristic makes reasonable reconfiguration choices.

Online Algorithm

In merit-based reconfiguration, the goal of the runtime system is to migrate into hardware those kernels that will presently give the greatest speedup. Accordingly, a figure of merit is assigned to each kernel based on the amount of speedup the kernel might provide as well as the FPGA resources it consumes. Intuitively, speedup-per-area seems to be a good metric. We assume that all kernels are pre-synthesized before the application starts executing so that we know in advance the area consumed ($A$) and the hardware execution time ($T_{hw}$) for each kernel. The profiler provides continually updated sliding-window averages of the invocation frequency ($r$) and the software execution time ($T_{sw}$) for each kernel. Let $H$ be the set of kernels currently in the FPGA. We define the figure of merit $m(k)$ for a kernel $k$ as follows, giving hardware-resident kernels a factor-of-$\beta$ advantage to combat thrashing:

$$
m(k) =
\begin{cases}
\dfrac{r(k) \times \left(T_{sw}(k) - T_{hw}(k)\right)}{A(k)} \times \beta & : k \in H \\[2ex]
\dfrac{r(k) \times \left(T_{sw}(k) - T_{hw}(k)\right)}{A(k)} & : \text{otherwise}
\end{cases}
$$

$H$ is periodically recalculated as follows. A new set $H'$ is formed by including kernels in order of decreasing $m$ until no more kernels will fit. Kernels that were in $H$ but are not in $H'$ are reverted to their software implementations. Those in $H'$ but not $H$ are moved into hardware. Finally, $H'$ becomes the new $H$.

It is important to note that the proposed metric does not directly account for reconfiguration time. Instead, the hysteresis parameter $\beta$ serves to limit the frequency of reconfiguration, an approach that should work well when kernels on average are active (in use) for a period much longer than the reconfiguration latency. Metrics that more carefully consider reconfiguration time are certainly possible, and are a subject for future work.
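The periodic recalculation of the hardware-resident set can be sketched as follows. This is a minimal illustration under stated assumptions, not the RTR-JVM code: the names `Kernel`, `merit`, and `recompute_resident`, the `capacity` parameter, and every numeric value are hypothetical. The text says kernels are added "until no more kernels will fit"; the sketch reads that as skipping a kernel that does not fit and trying the next one, which is equivalent to stopping early whenever all kernels have the same area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    name: str
    area: float   # A(k): FPGA area, known in advance from pre-synthesis
    t_hw: float   # T_hw(k): hardware execution time per invocation
    t_sw: float   # T_sw(k): profiled software execution time per invocation
    rate: float   # r(k): profiled invocation frequency


def merit(k, resident, beta):
    """Figure of merit m(k): speedup per area, times beta if k is resident."""
    base = k.rate * (k.t_sw - k.t_hw) / k.area
    return base * beta if k.name in resident else base


def recompute_resident(kernels, resident, capacity, beta):
    """Build the new resident set H' greedily in order of decreasing merit.

    Returns (new_resident, to_evict, to_load) as sets of kernel names.
    Assumption: a kernel that does not fit is skipped, not a stopping point.
    """
    ranked = sorted(kernels, key=lambda k: merit(k, resident, beta), reverse=True)
    new_resident, used = set(), 0.0
    for k in ranked:
        if used + k.area <= capacity:
            new_resident.add(k.name)
            used += k.area
    to_evict = resident - new_resident   # in H but not in H'
    to_load = new_resident - resident    # in H' but not in H
    return new_resident, to_evict, to_load


# Made-up profile data: four unit-area kernels, room for two of them.
kernels = [
    Kernel("k1", area=1, t_hw=1, t_sw=10, rate=10),  # base merit 90
    Kernel("k2", area=1, t_hw=1, t_sw=5, rate=10),   # base merit 40
    Kernel("k3", area=1, t_hw=1, t_sw=6, rate=10),   # base merit 50
    Kernel("k4", area=1, t_hw=1, t_sw=3, rate=10),   # base merit 20
]
resident = {"k1", "k2"}

# With beta = 1.5 the resident k2 scores 60 and outranks k3 (50): no change.
stable = recompute_resident(kernels, resident, capacity=2, beta=1.5)

# With beta = 1.0 there is no hysteresis: k3 (50) displaces k2 (40).
greedy = recompute_resident(kernels, resident, capacity=2, beta=1.0)
```

In this toy data, `k3` is only slightly more profitable than the resident `k2`, so a value of $\beta$ above 1.25 keeps `k2` in hardware and avoids a reconfiguration, while $\beta = 1$ swaps them. The sketch does not filter out kernels whose hardware version is slower than software; the text does not say how such kernels are handled.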
Implementation

We implemented the merit-based reconfiguration strategy in the Kaffe [1] interpretive Java Virtual Machine, which is slower but simpler than a JIT JVM. The resulting RTR-JVM system comprises a 450 MHz Intel P3 host machine and a PCI-based reconfigurable computing card. The Pentium runs the modified JVM and the reconfigurable computing card (a TSI-Telsys AceII card with two Xilinx XC4085 FPGAs) provides the programmable logic. Since the venerable 4000-series parts do not support partial reconfiguration, each kernel effectively consumes an entire FPGA and has area $A = 1$.

In general, a “kernel” could be of any size (number of instructions). For the current RTR-JVM, we define each Java class to be a kernel, and only classes that invoke no external methods are considered for implementation in hardware. Our decision to use the Xilinx Forge Java-to-Verilog compiler for hardware synthesis influenced these constraints.

In addition to profiling and kernel migration, the RTR-JVM must maintain consistent state between the hardware and software kernels. For example, any class instance variables must be sent to the hardware when a kernel is placed on the FPGA and must be read back and synchronized with the software data structures when a kernel leaves the FPGA. Of course, the RTR-JVM must also be able to marshal data to and from the hardware kernels. The state maintenance and marshalling tasks are handled by class-specific interface libraries. Each kernel that has a hardware implementation also has an interface library containing functions that the RTR-JVM calls when the kernel enters or leaves hardware. In addition, the interface library contains a marshalling function for each method in the class. When the RTR-JVM encounters a call to a method of a hardware-resident class, it delegates to the corresponding interface library marshalling function. Along with the kernels, interface libraries are compiled before execution begins.

Results

To demonstrate that the overhead of profiling is negligible, the SciMark2 benchmark was executed with profiling both enabled and disabled. No kernels were actually placed into hardware, but with profiling on, the RTR-JVM recalculated the set of hardware-resident kernels ten times per second. The measured difference in execution time was not statistically significant.

A synthetic benchmark was employed to test the system. It models a simple communications task, transmitting and receiving encrypted ECC-coded packets on a network. It consists of two threads: a ‘transmitter’ thread processes $T$ packets per second, encrypting each with DES and generating Hamming ECC codes for the output stream. A ‘receiver’ thread does the inverse, processing $R$ packets per second. The total number of packets processed, $T + R$, is held constant, and each of the four kernels (DES encrypt, DES decrypt, Hamming generate, and Hamming check) has a hardware implementation.

If on-demand reconfiguration were used, reconfiguration overhead would dominate execution time, since the application quickly switches between the four kernels and only two can be hardware-resident at any time. Instead, the table below shows that the RTR-JVM chooses different kernels to place into hardware depending on the ratio $R/(T + R)$, maximizing performance in each case and keeping the set of hardware-resident features stable.

%R (R/(T + R), percent)   Execution Time   Hardware Kernels
0                         84s              DES encrypt, Hamming gen
0.4                       120s             DES encrypt, Hamming gen
50                        123s             DES encrypt, DES decrypt
99.6                      126s             DES decrypt, Hamming check
100                       76s              DES decrypt, Hamming check

Although the RTR-JVM system did give speedups over the software-only JVM, they are not very meaningful given the poor initial performance of the interpretive software JVM and are not reported here.

Conclusion

We have constructed the RTR-JVM as a proof-of-concept, showing that merit-based reconfiguration may be an effective alternative to the accepted demand-based approach. It remains to be seen how merit-based reconfiguration will compare against demand-based schemes in a JIT JVM system with tightly-coupled processor and FPGA.

References

[1] The kaffe JVM. http://www.kaffe.org.
[2] P. Athanas and H. Silverman. Processor reconfiguration through instruction-set metamorphosis. IEEE Computer, pages 11–18, 1993.
[3] Y. H. et al. Building a virtual framework for networked reconfigurable hardware and software objects. The Journal of Supercomputing, pages 131–144, 2002.
[4] S. Hauck, T. Fry, M. Hosler, and J. Kao. The Chimaera reconfigurable functional unit. In IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, USA, Apr. 1997.
[5] S. Hauck and Z. Li. Configuration compression for the Xilinx XC6200 FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1107–1113, Aug. 1999.
[6] J. R. Hauser and J. Wawrzynek. Garp: A MIPS processor with a reconfigurable coprocessor. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing, Napa, CA, USA, Apr. 1997.
[7] B. Levine and H. Schmit. Efficient application representation for HASTE: Hybrid architectures with a single, transformable executable. In FCCM ’03. IEEE Computer Society, 2003.
[8] Z. Li, K. Compton, and S. Hauck. Configuration caching management techniques for reconfigurable computing. In IEEE Symposium on FPGAs for Custom Computing Machines, pages 87–96, Napa, CA, USA, Apr. 2000.
[9] Z. Li and S. Hauck. Configuration prefetching techniques for partial reconfigurable coprocessor with relocation and defragmentation. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 187–195, Feb. 2002.
[10] R. Razdan and M. D. Smith. A high-performance microarchitecture with hardware-programmable functional units. In MICRO 27: Proceedings of the 27th annual international symposium on Microarchitecture, pages 172–180. ACM Press, 1994.
[11] X. Tang, M. Aalsma, and R. Jou. A compiler directed approach to hiding configuration latency in chameleon processors. In FPL, pages 29–38, 2000.
[12] M. J. Wirthlin and B. L. Hutchings. A dynamic instruction set computer. In IEEE Symposium on FPGAs for Custom Computing Machines, pages 99–107, Napa, CA, USA, Apr. 1995.
[13] R. D. Wittig and P. Chow. OneChip: An FPGA processor with reconfigurable logic. In IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, USA, Apr. 1996.
[14] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit. In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 225–235. ACM Press, 2000.



