JAVA PROCESSORS IN EMBEDDED SYSTEMS

					                             JAVA PROCESSORS
                           IN EMBEDDED SYSTEMS
                               SNART ’09 SUMMER SCHOOL
                                      AUGUST 19

                                           FLAVIUS.GRUIAN@CS.LTH.SE




Tuesday, August 18, 2009                                              1
                              EMBEDDED SYSTEMS

                     ALL NON-GENERAL-PURPOSE SYSTEMS

                     MORE STRICT CONSTRAINTS ON:

                           SIZE/COST (DEVELOPMENT, DEVICE, BATTERY)

                           TIMING (HANDLING EXTERNAL EVENTS)

                           FUNCTIONALITY (DOING THE RIGHT THING WELL)

                           RELIABILITY

                     THESE ARE NOT ORTHOGONAL!



                                   WHY JAVA?

                     SAFE, EASY TO PROGRAM IN
                           ARRAY BOUNDS CHECKING
                           BUILT-IN STRING SUPPORT
                           FIXED SIZE DATA-TYPES
                           AUTOMATIC MEMORY MANAGEMENT (GC)
                           BUILT-IN THREADS AND SYNCHRONIZATION
                           EXCEPTIONS
                           ...
                     TOOL SUPPORT (FOR DESKTOPS ANYWAY)

                     PORTABILITY



                               WHY NOT (JAVA)?


                     PERFORMANCE

                     MEMORY FOOTPRINT

                     PREDICTABILITY ISSUES

                     LOW LEVEL CONTROL OF SENSORS AND ACTUATORS

                     “ALREADY USING C JUST FINE”

                     KEEPING THE STANDARD IS NOT TRIVIAL




                           ...AND THE CHALLENGES ARE


                     MINIMAL SIZE/COST

                     CUSTOMIZABLE FEATURES/ADAPTABILITY

                     PREDICTABILITY (RT)

                     HIGH PERFORMANCE

                     MINIMAL POWER/ENERGY

                     ADHERENCE TO STANDARDS




                           APPROACHES OVERVIEW (I)


                     JAVA TO C TRANSLATION

                     PURE VIRTUAL MACHINES

                     JAVA CO-PROCESSORS

                     NATIVE JAVA PROCESSORS

                     JAVA TO HARDWARE TRANSLATION (SYNTHESIS)

                     ...MULTI-PROCESSORS? HETEROGENEOUS JAVA?




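                           APPROACHES OVERVIEW (II)

                     [FIGURE: PATHS FROM JAVA CODE TO THE EXECUTION TARGET]

                     JAVA CODE -> JAVA2C -> C CODE -> C FLOW -> CRT/OS -> EMBEDDED PROCESSOR
                     JAVA CODE -> JAVAC -> BYTECODE -> AOT -> C CODE
                     BYTECODE -> JIT -> JVM/OS -> EMBEDDED PROCESSOR
                     BYTECODE -> IMAGE GENERATOR/CLASS LOADER -> JRE -> NATIVE JAVA PROCESSOR
                     BYTECODE -> BC2H -> VHDL/VERILOG
                     JAVA CODE -> JAVA2H -> VHDL/VERILOG -> HW SYNTHESIS -> SPECIALIZED HARDWARE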




                                APPROACHES: JAVA2C

                     LJRT - THE LUND JAVA-BASED REAL-TIME SYSTEM
                     [ANDERS NILSSON, LTH]

                     AHEAD OF TIME (AOT) TRANSLATION TO C

                   ✓       USE THE EXISTING C FLOW TOOLS
                   ✓       USE THE OS SUPPORT FOR SCHEDULING AND SYNCH
                   ✓       ABLE TO GIVE REAL-TIME PREDICTABILITY
                   ✓       RATHER GOOD PERFORMANCE (COMPARABLE TO HANDWRITTEN C)
                   -       HARD TO TRACE BACK BUGS
                   -       NOT ALL JAVA FEATURES ARE AVAILABLE (MOST ARE:
                           JAVA5 WITH LIMITED SUPPORT FOR GENERICS UNDER WAY)




                           APPROACHES: TRUE JVM

                     USE A JVM ON AN EMBEDDED PROCESSOR
                     (TYPICAL IN MOBILE PHONES, PDA, ETC.)

                     ADVANTAGES:
                       RUN CLASSES AS THEY ARE
                       PORTING (FROM DESKTOP) IS EASY
                     DRAWBACKS:
                       VERY SLOW
                       NOT FOR REAL-TIME (ISOLATED FROM THE RT PART)
                       (TOO) LARGE MEMORY FOOTPRINT
                       POWER HUNGRY




                APPROACHES: JAVA PROCESSORS

             ASIC DESIGNS:

                     SUN’S PICO-JAVA, PICO-JAVA II (1999)
                     CISC, 6-stage pipeline, microprogrammed,
                     341 instructions, bytecode folding, GC support,
                     fastest but very large (5x, 7-11x)

                     AJILE’S AJ-80, AJ-100 (JEM2 CORE)
                     32-bit microprogrammed core, hardware support for Java threads,
                     multi-JVM Unit, SoC, J2ME CLDC/CDC, RTSJ compliant

                     CJIP
                     CISC, microprogrammed, thread scheduling and GC as
                     microcode, no OO bytecodes, slow (nop = 6 cycles)



                APPROACHES: JAVA PROCESSORS

             CO-PROCESSORS

                     NAZOMI’S (NOW ?) JSTAR
                     160 bytecodes
                     to native code (2002)
                     also JSMART for JavaCard

                     INSILICON’S (NOW SYNOPSYS) JVXTREME
                     ARM9 co-proc, 87 bytecodes in Hw, folding (2001)

                     ARM’S JAZELLE DBX
                     Extra instruction set for ARM (along with ARM and Thumb),
                     145 bytecodes in HW, the rest are emulated or undefined (2005)



                APPROACHES: JAVA PROCESSORS

             IN/FOR FPGA

                     VULCAN ASIC’S MOON, MOON2
                     32-bit microprogrammed stack machine, PPO folding, coprocessor
                     or standalone, J2ME CLDC

                     DCT’S LIGHTFOOT
                     8(i)/32(d)-bit Harvard architecture, 3-stage pipeline,
                     3 instruction formats (128 soft, 64 non-ret, 32 single byte ret),
                     Xilinx Alliance Core

                     XILINX’S LAVACORE
                     32-bit soft IP, user-configurable, GC, DES...




                APPROACHES: JAVA PROCESSORS


                     AUGSBURG UNI.’S KOMODO
                     4-stage pipeline, microprogrammed,
                     multi-threaded (4 hw threads)
                     UFRGS’ FEMTOJAVA
                     5-stage pipeline, Harvard architecture, operand forwarding,
                     application specific (similar to LavaCore)
                     DRESDEN UNI.’S SHAP
                     microprogrammed pipeline, method cache, memory management
                     unit with GC, subset/superset of Java bytecodes
                     SCHOEBERL’S JOP
                     CISC, 4-stage pipeline, microprogrammed, method cache,
                     predictable - for real-time applications




              EXAMPLE ARCHITECTURE: BLUEJEP




                     derived from JOP, redesigned in Bluespec SystemVerilog

                     6-stage pipeline, microprogrammed, stack machine



                           MORE BLUEJEP FEATURES


                     OPERAND FORWARDING

                     STAGE BYPASS

                     SPECULATIVE EXECUTION

                     METHOD CACHE (SINGLE METHOD OR MULTI BLOCK)

             SYNTHESIS:
               2x larger than JOP (70% of an XC2V1000)
               (but faster development, highly configurable)
               faster clock than JOP

                                     source: Gruian & Westmijze,
                                     A Flexible and High-performance Java Embedded Processor, JTRES 2007


                             BLUEJEP SOFTWARE


                     CREATING THE PROCESSOR:
                     FROM MICROCODE TO THE
                     DECODING HW & MICRO-ROM




                                           CREATING THE APPLICATION:
                                           FROM JAVA SOURCE TO THE
                                           RAM IMAGE




                           TYPICAL BLUEJEP SYSTEMS

                    SOFTWARE MEMORY
                    MANAGEMENT WITH/WITHOUT
                    GARBAGE COLLECTION




                    HARDWARE MEMORY
                    MANAGEMENT UNIT
                    (MARK-COMPACT GC)




                 DESIGN CHOICES FOR SIZE AND
                        PREDICTABILITY

                     OFFLINE JAVA TO EXECUTABLE IMAGE

                           LOAD CLASSES ONCE

                           ELIMINATE REDUNDANT CLASSES, METHODS, DATA

                     SPECIALIZED CACHES

                     COMPLEX AND SELDOM USED BYTECODES ARE
                     DISCARDED OR EMULATED THROUGH OTHERS

                     SUPPORT FOR MEMORY MANAGEMENT WITH GC




                             MORE DESIGN CHOICES


                     BYTECODES VARY FROM VERY SIMPLE (IADD) TO VERY
                     COMPLEX (INVOKE, ANEW):

                   ➡       USE MICROPROGRAMMING

                     JAVA IS A SAFE LANGUAGE, BUT EMBEDDED SYSTEMS
                     MUST EXPLICITLY ACCESS HARDWARE:

                   ➡       ADD DIRECT HW-ACCESS INSTRUCTIONS (MEMRD/WR)

                   ➡       PROVIDE CLASSES TO CONTROL ACCESSES (IOPORT)
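
A minimal sketch of the last two points, assuming nothing about the real JOP or BlueJEP libraries: `HwAccess.memRd`/`memWr` stand in for the direct hardware-access instructions (here simulated with an array), and the hypothetical `IOPort` class is the only code allowed to call them, so application code never handles a raw address unchecked.

```java
// Hypothetical sketch: IOPort and HwAccess are illustrative names,
// not the JOP or BlueJEP API.
final class IOPort {
    private final int address;

    IOPort(int address) {
        // The class, not the caller, decides which addresses are legal.
        if (address < 0 || address >= HwAccess.IO_WORDS) {
            throw new IllegalArgumentException("not an I/O address: " + address);
        }
        this.address = address;
    }

    int read() { return HwAccess.memRd(address); }

    void write(int value) { HwAccess.memWr(address, value); }

    public static void main(String[] args) {
        IOPort led = new IOPort(3);   // address 3 is an arbitrary example
        led.write(0x5);
        System.out.println("port 3 reads back " + led.read());
    }
}

// Stand-in for the direct hardware-access instructions; simulated with an array.
final class HwAccess {
    static final int IO_WORDS = 16;   // assumed size of the I/O space
    private static final int[] IO_SPACE = new int[IO_WORDS];

    private HwAccess() {}

    static int memRd(int address) { return IO_SPACE[address]; }

    static void memWr(int address, int value) { IO_SPACE[address] = value; }
}
```

On a real Java processor the two `HwAccess` methods would map to single instructions; the wrapper keeps the safety checks in one place.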




                       MICROPROGRAMMING IN JOP

                     NEW, NEWARRAY,... ARE IMPLEMENTED IN JAVA AND
                     CALLED USING INVOKE TRANSPARENTLY IN MICROCODE

                     source: M. Schoeberl, Evaluation of a Java Processor, 2005

                     ...and compared to BlueJEP
                     (1.42x faster clock, 85 MHz vs. 60 MHz)



                           JVM STACK IMPLEMENTATION

                     small, fast memory (usually 32-bit words)
                           scratch pad
                           register files
                     registers for TOS, TOS-1 to reduce the number of memory ports

                     spill/fill on context switch
                        via TOS
                        separate port

                     detect and handle stack overflow

                     33RD BIT FOR TRACKING REFERENCES ?
                        - EASIER GC
                        - COMPLEX SPILL/FILL

                     [FIGURE: STACK MEMORY ADDRESSED BY SP AND VP, WITH TOS AND TOS-1
                     REGISTERS AND A RD/WR PATH TO MAIN MEM]

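The effect of keeping TOS and TOS-1 in registers can be sketched in software. The class below is an illustrative model, not the JOP or BlueJEP design: the names, the overflow policy (an exception), and the access counters are assumptions. It shows that an `iadd` takes both operands from registers and needs only a single stack-memory read to refill TOS-1.

```java
// Illustrative model of a stack with TOS and TOS-1 held in registers.
final class TwoRegisterStack {
    private final int[] ram;   // small on-chip stack memory
    private int sp = -1;       // index of the topmost word kept in ram
    private int tos, tos1;     // the two register-held elements
    private int depth = 0;     // number of valid stack elements
    int ramReads, ramWrites;   // counters, only for illustration

    TwoRegisterStack(int words) { ram = new int[words]; }

    void push(int value) {
        if (depth >= 2) {      // TOS-1 must be spilled to stack memory
            if (sp + 1 >= ram.length) throw new IllegalStateException("stack overflow");
            ram[++sp] = tos1;
            ramWrites++;
        }
        tos1 = tos;
        tos = value;
        depth++;
    }

    int pop() {
        if (depth == 0) throw new IllegalStateException("stack underflow");
        int value = tos;
        tos = tos1;
        refill();
        return value;
    }

    void iadd() {
        if (depth < 2) throw new IllegalStateException("stack underflow");
        tos = tos1 + tos;      // both operands come from registers
        refill();
    }

    // Refill TOS-1 from stack memory after one element has been consumed.
    private void refill() {
        if (depth > 2) {
            tos1 = ram[sp--];
            ramReads++;
        }
        depth--;
    }

    public static void main(String[] args) {
        TwoRegisterStack s = new TwoRegisterStack(8);
        s.push(1); s.push(2); s.push(3);
        s.iadd();
        System.out.println("sum=" + s.pop() + " reads=" + s.ramReads + " writes=" + s.ramWrites);
    }
}
```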



                           INSTRUCTION CACHES (I)


                     METHODS ARE AT MOST 1KB LARGE,
                     BUT MOST ARE FAR SMALLER

                     UPDATING CACHES ON METHOD INVOKE/RETURN -
                     PREDICTABLE

                     SINGLE METHOD / MULTIPLE METHOD CACHE

                     ONE / VARIABLE BLOCK(S) PER METHOD




                           INSTRUCTION CACHES (II)

                     [FIGURE: CACHE CONTENTS, M1 FOR SINGLE METHOD; M1 AND M2 FOR MULTIPLE METHOD]

                     SINGLE METHOD (ONE 1KB BLOCK)
                           REFILL ON EACH INVOKE/RETURN
                           MUCH UNUSED SPACE

                     MULTIPLE METHOD (MANY 1KB BLOCKS)
                           USE LRU TO REFILL
                           STILL MUCH UNUSED SPACE

                     MULTIPLE BLOCKS (MANY SMALL BLOCKS)
                           USE LRU TO REFILL
                           BETTER SPACE USE
                           MORE COMPLEX HARDWARE



             INCREASED PERFORMANCE = DECREASED PREDICTABILITY
             (MORE COMPLEX WCET ESTIMATION METHODS ARE NEEDED)
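
The difference between the single-method and the multiple-method organisation can be sketched with a small software model. This is an assumption-laden illustration, not any processor's actual cache: one method occupies exactly one block, the cache is consulted on every invoke and return, and `MethodCacheSketch` with its `access` method is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Software model of a method cache: one method per block, LRU replacement.
final class MethodCacheSketch {
    private final LinkedHashMap<String, Boolean> resident;
    private int misses;

    MethodCacheSketch(final int capacity) {
        // accessOrder = true keeps the least recently used method first.
        this.resident = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;   // evict the LRU method when full
            }
        };
    }

    // Called on every invoke of, or return to, the given method.
    boolean access(String method) {
        if (resident.get(method) != null) {
            return true;                    // hit: no refill needed
        }
        misses++;                           // miss: the whole method is loaded
        resident.put(method, Boolean.TRUE);
        return false;
    }

    int misses() { return misses; }

    public static void main(String[] args) {
        // A calls B twice: invoke A, invoke B, return to A, invoke B, return to A.
        String[] trace = {"A", "B", "A", "B", "A"};
        for (int blocks = 1; blocks <= 2; blocks++) {
            MethodCacheSketch cache = new MethodCacheSketch(blocks);
            for (String m : trace) cache.access(m);
            System.out.println(blocks + " block(s): " + cache.misses() + " misses");
        }
    }
}
```

With one block every invoke and return refills the cache, which is easy to bound for WCET analysis; with two blocks the repeated call hits, but the analysis now has to track the cache contents.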




                           DOWN THE MEMORY LANE...

             Embedded systems usually demand:

                     Small memory footprint, ...yet:
                           J2SE (jar files) > 15MB
                           J2ME (CDC, CVM) needs about 2MB on top of OS
                           J2ME (CLDC, KVM) needs about 450KB on top of OS

                     Efficient (even RT) memory management, ...so:

                           J2SE standard GC will not do




                REDUCING MEMORY FOOTPRINT

                     JAVA, JVM restrictions (e.g. data types)
                     Configurations and Profiles (jars) [J2ME CDC, CLDC]
                     Preload/link all/some classes with the VM
                           C(L)DC: JavaCodeCompact/JavaMemberDepend
                     discard unused classes, attributes and methods based on closures
                     for a set of apps.
                           JOP: ImageGenerator (BCEL)
                           BlueJEP: BlueJIM (BCEL)
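
The closure-based pruning can be sketched as a reachability computation over a call graph. This is a simplified illustration, not how ImageGenerator or BlueJIM are implemented: the call graph is given as a plain map, the method names are invented, and virtual dispatch, static initialisers and reflection are ignored.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Keep only the methods in the transitive closure of the application roots.
final class ClosureSketch {
    static Set<String> reachable(Map<String, List<String>> calls, List<String> roots) {
        Set<String> kept = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String method = work.pop();
            if (kept.add(method)) {   // first visit: follow its callees
                work.addAll(calls.getOrDefault(method, Collections.<String>emptyList()));
            }
        }
        return kept;
    }

    // Invented example: Unused.helper is never called from App.main.
    static Map<String, List<String>> exampleCallGraph() {
        Map<String, List<String>> calls = new HashMap<>();
        calls.put("App.main", Arrays.asList("Motor.start", "Util.log"));
        calls.put("Motor.start", Arrays.asList("IOPort.write"));
        calls.put("Unused.helper", Arrays.asList("Util.log"));
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(reachable(exampleCallGraph(), Arrays.asList("App.main")));
    }
}
```

Everything outside the returned set can be left out of the memory image for that set of applications.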




                           MEMORY MANAGEMENT

                     STATIC (everything at start):
                       Immortal Memory (RTSJ)
                     DYNAMIC:

                           allocate once(?) - useless
                           Scoped Memory (RTSJ) - explicit, no sharing
                           Garbage Collection:
                              “Stop the World”
                              Concurrent - needs synchronization
                              Real-Time - careful analysis of memory usage
                              [Henriksson, Robertz]



                                          MM SUPPORT

                     Reference Maps (GCinfo)

                     Track references in the stack (Ref bit)

                     Concurrent/incremental GC:

                           use handles instead of references (addresses)
                           read/write locks for access control
                           GC bits (mark)

                     [FIGURE: BLUEJEP OBJECT LAYOUT]
                           Handle: M (mark bit), ClassPtr, InstancePtr
                           Instance: M, Handle, Size, Data
                           ClassInfo: GCInfoPtr
                           GC info: NEntries, then (Refs, Skips) pairs

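One way to read the GC info table is as a run-length encoded reference map. The sketch below assumes, and this is only an interpretation of the figure, that each entry means "`Refs` consecutive reference words, followed by `Skips` non-reference words"; the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Decode an assumed (Refs, Skips) run-length reference map into word offsets.
final class GcInfoSketch {
    // entries[i][0] = number of reference words, entries[i][1] = words to skip.
    static List<Integer> referenceOffsets(int[][] entries) {
        List<Integer> offsets = new ArrayList<>();
        int word = 0;                        // current offset in the object data
        for (int[] entry : entries) {
            int refs = entry[0];
            int skips = entry[1];
            for (int i = 0; i < refs; i++) {
                offsets.add(word++);         // these words hold references
            }
            word += skips;                   // these words hold plain data
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two references, three data words, then one more reference.
        int[][] gcInfo = {{2, 3}, {1, 0}};
        System.out.println(referenceOffsets(gcInfo));
    }
}
```

With such a map the collector can find the references inside an object without tagging every data word.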



                                 GC IN HARDWARE (I)


                     use a second processor just for GC...

                           makes sense only for concurrent GC

                     BlueJEP MMU:

                           Mark-Compact GC

                           Stop the World

                           Re-design in BSV of...
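
A stop-the-world mark-compact collector over a handle table can be sketched in a few lines. This is a software illustration under simplifying assumptions, not the BlueJEP MMU: every instance is laid out as `[handle, nFields, fields...]`, every field is a handle or `NULL`, marking is recursive, and all names are invented.

```java
import java.util.Arrays;

// Stop-the-world mark-compact over a word-addressed heap with a handle table.
final class MarkCompactSketch {
    static final int NULL = -1;
    final int[] heap;
    final int[] handleAddr;   // handle -> address of the instance, NULL if free
    final boolean[] mark;     // one mark bit per handle
    int top = 0;              // first free heap word

    MarkCompactSketch(int heapWords, int handles) {
        heap = new int[heapWords];
        handleAddr = new int[handles];
        mark = new boolean[handles];
        Arrays.fill(handleAddr, NULL);
    }

    int allocate(int nFields) {
        int size = 2 + nFields;
        if (top + size > heap.length) throw new IllegalStateException("out of memory");
        int h = 0;
        while (h < handleAddr.length && handleAddr[h] != NULL) h++;
        if (h == handleAddr.length) throw new IllegalStateException("out of handles");
        handleAddr[h] = top;
        heap[top] = h;                       // back-pointer to the handle
        heap[top + 1] = nFields;
        Arrays.fill(heap, top + 2, top + size, NULL);
        top += size;
        return h;
    }

    void setField(int h, int i, int target) { heap[handleAddr[h] + 2 + i] = target; }

    int getField(int h, int i) { return heap[handleAddr[h] + 2 + i]; }

    // The mutator is assumed to be halted while this runs.
    void collect(int... roots) {
        Arrays.fill(mark, false);
        for (int r : roots) markFrom(r);
        int from = 0, to = 0;
        while (from < top) {                 // slide live objects downwards
            int h = heap[from];
            int size = 2 + heap[from + 1];
            if (mark[h]) {
                System.arraycopy(heap, from, heap, to, size);
                handleAddr[h] = to;          // only the handle is updated
                to += size;
            } else {
                handleAddr[h] = NULL;        // handle becomes free
            }
            from += size;
        }
        top = to;
    }

    private void markFrom(int h) {
        if (h == NULL || mark[h]) return;
        mark[h] = true;
        int addr = handleAddr[h];
        for (int i = 0; i < heap[addr + 1]; i++) markFrom(heap[addr + 2 + i]);
    }

    public static void main(String[] args) {
        MarkCompactSketch gc = new MarkCompactSketch(32, 8);
        int a = gc.allocate(1), b = gc.allocate(0), c = gc.allocate(0);
        gc.setField(a, 0, c);
        gc.collect(a);                       // b is unreachable
        System.out.println("live words after GC: " + gc.top + ", b freed: " + (gc.handleAddr[b] == NULL));
    }
}
```

Because objects refer to each other through handles, compaction moves an instance by rewriting a single table entry; no field in any other object has to change.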




                                 GC IN HARDWARE (II)


                           JOP concurrent GCU:

                             uses read/write barriers
                             for synchronization



                                                        progress depends on
                                                        object size and access
                                                        pattern (not Real-Time)

                                                        split objects in equal
                                                        size blocks --> RT


                            BYTECODE FOLDING (I)

                     a way to shortcut stack accesses

                     usually carried out at bytecode level

                     STACK MACHINE (JVM)                        3-ADDRESS MACHINE (TARGET)

                     ILOAD_1    PUSH (FROM L1)
                     ILOAD_2    PUSH (FROM L2)
                     IADD       POP, POP, ADD, PUSH             ADD L1,L2,L3
                     ISTORE_3   POP (TO L3)

                     MEM: 2 READS, 1 WRITE                      MEM: 2 READS, 1 WRITE
                     STACK: 3 READS, 3 WRITES




                           BYTECODE FOLDING (II)

             Basic ideas:
                    classify bytecodes according to the effect on stack: producer,
                    consumer, operation, special (e.g. jumps)
                    identify patterns of 2+ bytecodes that can be folded into one
                    target instruction


                  PREVIOUS SLIDE:
                  ILOAD_1           P
                  ILOAD_2           P
                  IADD              O
                  ISTORE_3          C

                  COMPILE TIME OR RUN-TIME ?
                  SOFTWARE OR...HARDWARE ?
                  WHICH PATTERNS ?

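The classification and pattern matching can be sketched in software. The code below is a toy illustration, not the BlueJEP folding unit: it knows only a handful of integer bytecodes, recognises only the producer-producer-operation-consumer pattern, and ignores the fact that a real folder must not fold across branch targets. All names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Fold P P O C bytecode sequences into one 3-address instruction.
final class FoldingSketch {
    enum Kind { PRODUCER, OPERATION, CONSUMER, SPECIAL }

    static Kind classify(String bytecode) {
        if (bytecode.startsWith("ILOAD_")) return Kind.PRODUCER;
        if (bytecode.startsWith("ISTORE_")) return Kind.CONSUMER;
        if (bytecode.equals("IADD") || bytecode.equals("ISUB")) return Kind.OPERATION;
        return Kind.SPECIAL;                 // jumps and everything else
    }

    // ILOAD_1 -> L1, ISTORE_3 -> L3
    private static String local(String bytecode) {
        return "L" + bytecode.substring(bytecode.indexOf('_') + 1);
    }

    static List<String> fold(List<String> code) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < code.size()) {
            if (i + 3 < code.size()
                    && classify(code.get(i)) == Kind.PRODUCER
                    && classify(code.get(i + 1)) == Kind.PRODUCER
                    && classify(code.get(i + 2)) == Kind.OPERATION
                    && classify(code.get(i + 3)) == Kind.CONSUMER) {
                String op = code.get(i + 2).substring(1);   // IADD -> ADD
                out.add(op + " " + local(code.get(i)) + "," + local(code.get(i + 1))
                        + "," + local(code.get(i + 3)));
                i += 4;                      // four bytecodes became one instruction
            } else {
                out.add(code.get(i));        // not foldable: keep as is
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold(Arrays.asList("ILOAD_1", "ILOAD_2", "IADD", "ISTORE_3")));
    }
}
```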



                              BYTECODE FOLDING (III)

                     Hardware folding increases decoder complexity: Is it worth it?
                     Case Study: BlueJEP folding unit (microcode-level folding)
                     [Gruian & Westmijze ‘08], runs software GC Application

                     NOT IN THIS CASE!




                           APPROACHES: JAVA TO HW

                     Translate Java source or class bytecode into hardware described at
                     Register Transfer or Behavioral Level
                     Usually intended for extracting accelerators
                     (not for whole applications)

                     Advantages:
                       high performance, parallel (it is hardware)
                       let the synthesis tools optimize
                     Drawbacks:
                       large designs (method calls are tough)
                       hard to debug




                                MORE JAVA TO HW


                     Java source to Hardware Accelerator [Per Andersson, LTH]
                     (Single Method, focus on optimizing memory access)

                     Bytecode to RTL Hardware [Hanna]
                     (Single Method, Bypass Stack R/W, Low Level CTL)

                     Bytecode to Behavioral Hardware
                     (Bluespec SystemVerilog, with call frames, multiple bytecodes
                     per clock)




                                      BYTECODE TO HW

                           ICONST_3
                           SWAP
                           ISUB
                           ISTORE_1

                     [FIGURE: DATAFLOW FROM TOS THROUGH ICONST_3 (3), SWAP, ISUB (-)
                     AND ISTORE_1 INTO MEMORY]




                     eliminate stack accesses, thus data shuffle
                     modular
                     group bytecodes into single clock cycles depending on critical
                     path and memory access (folding)
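
How the stack disappears can be sketched by executing the bytecodes symbolically: the stack holds expressions instead of values, so what is left at the end is pure dataflow. This is an illustration only, not the translator from the slides; it supports just the four bytecodes of the example, and the name `x` for the value initially on top of the stack is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Symbolic execution of a bytecode sequence: the operand stack is resolved
// at translation time, leaving only an expression (the dataflow).
final class StackEliminationSketch {
    static String toDataflow(String[] code, String... initialStack) {
        Deque<String> stack = new ArrayDeque<>();
        for (String value : initialStack) stack.push(value);   // last one ends up on top
        StringBuilder out = new StringBuilder();
        for (String bc : code) {
            if (bc.startsWith("ICONST_")) {
                stack.push(bc.substring("ICONST_".length()));
            } else if (bc.equals("SWAP")) {
                String top = stack.pop();
                String below = stack.pop();
                stack.push(top);             // pure rewiring, no hardware needed
                stack.push(below);
            } else if (bc.equals("ISUB")) {
                String value2 = stack.pop();
                String value1 = stack.pop();
                stack.push("(" + value1 + " - " + value2 + ")");
            } else if (bc.startsWith("ISTORE_")) {
                out.append("L").append(bc.substring("ISTORE_".length()))
                   .append(" = ").append(stack.pop()).append(";");
            } else {
                throw new IllegalArgumentException("unsupported bytecode: " + bc);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] code = {"ICONST_3", "SWAP", "ISUB", "ISTORE_1"};
        System.out.println(toDataflow(code, "x"));
    }
}
```

For the sequence on this slide the four bytecodes collapse into one subtraction feeding one store, which is the kind of grouping into a single clock cycle mentioned above.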



                    BYTECODE TO BSV EXAMPLE (I)
                     [FIGURE: CODE PANELS FOR JAVA, BYTECODE AND BSV]




                   BYTECODE TO BSV EXAMPLE (II)
                     BSV Support Library
                     (Implements the JVM ...a subset)

                     [FIGURE: BSV CODE PANELS FOR TYPES, ILOAD_1, IADD AND INVOKESTATIC]




                                    THE MAGIC 8 BALL

                     Are we going to see more Java in embedded systems?
                     Definitely!
                     How about Java Processors?
                     Most likely.
                     In what form?
                     Co-processors, alongside other kinds in heterogeneous
                     multi-processors; standalone only for very specific applications.
                     What are they going to look like?
                     Follow the trends: highly re-configurable, customizable, pipelined,
                     microprogrammed



                                     OPEN RESEARCH

                     JAVA MULTI-PROCESSORS
                     PITTER & SCHOEBERL, TOWARDS A JAVA MULTIPROCESSOR, JTRES 2007


                     APPLICATION SPECIFIC JAVA PROCESSORS,
                     HW/SW CO-DESIGN OF JVM

                     HETEROGENEOUS JAVA SYSTEMS
                     REACTIVE-JAVA (ESTEREL-JAVA MIX: SYSTEMJ)


                     PREDICTABLE, YET STANDARD JAVA

                     ...




                           THANK YOU!



