

     Hardware Microkernels for Heterogeneous Manycores

“If You Build it He Will Come”…Field of Dreams 1989
            Yes, But Can He Use it ?

                     David Andrews
    Mullins Endowed Chair in Computer Engineering
                 University of Arkansas

       Computer Science &
      Computer Engineering
                    Today’s Agenda
• The Rise, Fall, and Rebirth of Parallel Processing

• Operating (Run Time) System Challenges (4-5)
    – Scalability and Heterogeneity
        • Focusing on Synchronization, Program Management, Scheduling
    – Monolithic to Micro to Hardware Microkernels
        • Where Will it All End ?
• Hthreads Prototype System (6-10)
    – Overview
    – Performance
• Conclusion (1)

     The Return of Parallel Processing
    “Groundhog Day” Back to Parallelism Once Again

•Dynamic ILP Has Run Its Course
   •Performance Scaling Ebbing
       •Not Much Juice to Be Squeezed in ILP
       •High Transistor Costs for Small Return
•Power + Memory Wall = Brick Wall
•Manycore Architectures Following Moore’s Law ?
   •Simpler CPU’s for Parallelism
       •Modern Apps: MIMD/SIMD Heterogeneity
          •Better Use of Dense Interconnect
          •Will This Break Memory Wall ?
•A Rebirth of Parallel Processing

              Manycore Status
• Paradigm Shift Occurred Without Considering
  Software Infrastructure
     • Concerning, as Prior Efforts Were a Failure
        – We did not Resolve Parallel Programming Models
        – We did not Resolve Run Time Systems
     • New Considerations
        – Magnitude of Parallelism Will Be Greater
        – Heterogeneity of New Applications

Complete Technology Infrastructure Riding on Success

            Parallel Processing Era: Lots of Fun !
                   The war of the machines

Peak Performers, Usually Vector/SIMD Data Parallelism
        -Hard to Program
                                 Which PP (Economics & Usability)?
Video Killed the Radio Star The Buggles 1979 (First MTV Video)

Economics := Commodity
       Sequential Languages
       Operating Systems

Usability := Familiarity
       MIMD Parallelism

Victim of Our Own Success:
       Side Effect Was That Broad Research in OSs Ebbed
       Research on Veneered Middleware Layers
      Today’s Operating Systems
Monolithic OS’s Large, Complex, Resistant to Change
  -Brittle, Insecure, Memory Hogs (Linux: ~6 MSLOC)
  -Shared Data Structures Prevent Scaling
  -Target Homogeneous Processors

Linux Kernel Hacking != OS Research

     Classic Monolithic OS May be Retired Along with Power
     Hungry Dynamic ILP Processors
Can We Simply Port ?

[Figure: Seq to Parallel]

Obvious Scalability Issues


[Figure: chan_attr()]


    Wasteful if large image replicated in memory hierarchy
    Shared data structures enforce sequentiality, contention
           Implications on Caches
    Scheduler focuses on time multiplexing, not space multiplexing
     Scalability/Thread Efficiency Issues
                  (A Little Scary to Me !)

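[Figure: per-thread timelines of alternating "os" and "app" segments]

Amdahl’s Law:
$$ \text{Speedup} = \frac{1}{(1-f) + \frac{f}{S_p}} $$
f => #threads > #Cores

Thread Efficiency:
$$ \text{ThrEff} = \frac{\text{app}}{\text{app} + \text{OS}} $$
1:  $\frac{10}{10 + 1} \approx 91\%$
3:  $\frac{3.3}{3.3 + 1} \approx 77\%$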
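The two formulas on this slide can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, the reading of the slide's "1" and "3" as thread counts with a fixed OS cost of 1 time unit per thread is an assumption, and the Amdahl inputs (f = 0.9, Sp = 10) are made-up example values.

```python
def amdahl_speedup(f, sp):
    """Overall speedup when a fraction f of the work is sped up by a factor sp."""
    return 1.0 / ((1.0 - f) + f / sp)


def thread_efficiency(app_time, os_time):
    """Share of a thread's time spent in application code rather than OS code."""
    return app_time / (app_time + os_time)


# Slide values (assumed meaning): OS cost stays at 1 unit per thread.
one = thread_efficiency(10, 1)     # all 10 units of app work in one thread
three = thread_efficiency(3.3, 1)  # the same work split three ways

print(f"ThrEff, 1 thread : {one:.1%}")
print(f"ThrEff, 3 threads: {three:.1%}")

# Hypothetical Amdahl example: 90% of the work parallelized across 10 cores.
print(f"Amdahl speedup   : {amdahl_speedup(0.9, 10):.2f}")
```

Under these assumptions the efficiency drops from about 91% to about 77% as the same work is split across three threads, because the fixed OS cost is paid once per thread.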
Focus on Deconstructing Operating Systems
                 (From the Berkeley View)

Resurgence of Interest in Virtual Machines
      Hypervisor: Thin SW Layer between Guest OS and HW

Future OS:=
       Libraries Where Only Functions Needed
       Hypervisor Provides Protection and Resource Sharing

Leverage HW Partitioning Support for Very Thin Hypervisors
      Allows Software Full Access to Hardware Within Partition

        Heterogeneity Issues
• Amdahl’s law pointing towards
  heterogeneous cores
  – Scalar cores for threaded data processing
  – SIMD cores for audio/video/signal processing
• Heterogeneity Issues permeate abstractions
  – Unifying Programming Languages/Models
  – Compilation
  – Run Time System

• Heterogeneous Processors as Schedulable Resources
   – All Under Scheduler Control
   – Enabling Asynchronous Model
• Equal Access to Unifying OS APIs
   – Synchronization Particularly Sticky
     • LL/SC Versus Test&Set
   – Creating/Managing Threads Across Heterogeneous Processors
     • ABIs, etc.

        Classic Synchronization
Classic SMP Synchronization Using LL/SC Atomic Pairs

             Time        PPC (left)         PPC (right)
             t1                             ll  Rx, lock
             t2          ll  Ry, lock       bne Rx, again
             t3          bne Ry, again      sc  Rx, lock
             t4          sc  Ry, lock       beq Rx, again
             t5          beq Ry, again

[Figure: two PPC cores, each with an LR, sharing a lock word (value 1). Labels: t1: Cache Miss, Update LR; t2; t3: Update Lock, Invalidate LR]
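The interleaving on this slide can be replayed with a toy model of load-linked/store-conditional. This is a sketch under stated assumptions, not the PowerPC implementation: the class `LLSCMemory`, its method names, and the CPU labels "A" (left column) and "B" (right column) are all hypothetical, and the cache-coherence traffic is reduced to a set of per-CPU reservations.

```python
class LLSCMemory:
    """Toy model of one memory word with per-CPU reservations."""

    def __init__(self, value=0):
        self.value = value
        self.reservations = set()  # CPUs currently holding a valid reservation

    def load_linked(self, cpu):
        # Read the word and record that this CPU is watching it.
        self.reservations.add(cpu)
        return self.value

    def store_conditional(self, cpu, new_value):
        # Succeeds only if no store hit the word since this CPU's load_linked.
        if cpu not in self.reservations:
            return False
        self.value = new_value
        self.reservations.clear()  # the write invalidates every reservation
        return True


lock = LLSCMemory(0)                    # 0 = free, 1 = held
seen_b = lock.load_linked("B")          # t1: B reads the lock, sees it free
seen_a = lock.load_linked("A")          # t2: A reads the lock, also sees it free
b_won = lock.store_conditional("B", 1)  # t3: B's sc succeeds, A's reservation dies
a_won = lock.store_conditional("A", 1)  # t4: A's sc fails, so A branches to "again"

print(seen_a, seen_b, b_won, a_won, lock.value)
```

Both CPUs observe a free lock, but only the first store-conditional succeeds; the second CPU's reservation was invalidated by that store, which is why it must loop back and retry.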

    Heterogeneous Synchronization
[Figure: PC1 (LL/SC) and PC2 (Test-and-Set) contend for a shared lock (value 0); compatibility marked "?"]
Different ISAs Collide
        -LL/SC versus TAS
Reliance on Snoopy Cache Protocol
        -Doesn’t Support Hetero Semantics
        -Doesn’t Scale Well
        -May not Even Be in System (à la Cell)

 Remote Procedure Call (RPC)
   Used by Cell & EXOCHI
• Approach Highly Flexible
• Scalability Issues
   – Long Latency of Calls
       » Interrupt/Exception Processing
       » Redundant Messaging
   – Does Not Scale Well
       » Centralized Bottleneck
       » “Master” CPU + Bus
       » Punishes “Master” Processor
   – Imposes Synchronous Model
       » Cell Uses “wrappers” for asynchronous model
       » Scheduler Sees Only Master

      Hthreads: Hardware Microkernel
• NSF Project Originally Developed as Unifying Programming Model
  for CPU/FPGA Hybrid Embedded Systems
   – Enable Programmers to Specify Computations that Seamlessly run
      on CPU/FPGA
       • Adopt Familiar Programming Models for hw/sw co-design
       • Open up custom hardware design to Software
         Engineers/Domain Scientists
   – Pthreads Model Adopted
       • Thread bodies synthesized and mapped into hardware
           – VHDL or C -> VHDL
           – Also have used Handel-C and Haskell -> VHDL
        • Operating System is Unifying Framework
            – Abstracts Interface
            – Enables Uniform API Calls from hardware/software threads
                  hthreads System

• Challenges
  – Asynchronous concurrency in FPGA
     • Some said couldn’t be done :-)
     • VHDL Processes and Threads Share Much Commonality
  – Uniform Policies but with Platform-Specific Mechanisms
     • APIs for hw/sw identical
     • Software Mechanisms: standard linkable libraries
     • Hardware Mechanisms: FSMs in “linkable” abstraction interface

     hthreads System (Original)

• Separate Cores Form Microkernel
   – Breaks up Monolithic Kernel Bottleneck
      • Fast lightweight messaging between cores (load/store)
          – Breaks up Global Data Structures
          – Allows Parallel Operations
          – Resolves Heterogeneity

 hthreads for Heterogeneous Manycores

• Difference is largely within Computational Units
   – Substitute Processors for Custom Circuits
• Hthreads OS Cores Serve as Unifying Framework
   – Cores did not change !
   – Back to linkable libraries in place of FSM interface
• Cores Interesting Enabling Technology
   – Resolves Heterogeneity (well almost…)
   – Provides Scalable Low Latency OS Services
   – Breaks up Monolithic Bottlenecks

Creating A Heterogeneous Thread

      Mutex Unlock

         Mutex Lock

        API Timings

 RPC/hthread Core Comparisons

• RPC Call from Custom Circuit to OS on PPC
  – create_thread( )
     • 160 usec versus 40.8 usec PPC & 12.5 usec MBlaze
  – join( )
     • 130 usec versus 65.7 usec PPC & 13.9 usec MBlaze
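The timings above can be turned into speedup ratios. This sketch assumes that the first number on each line is the RPC path and that the PPC and MBlaze numbers are the same call served by the hthread cores from each processor type; the dictionary layout and the function name `speedup_over_rpc` are hypothetical.

```python
# Timings in microseconds, copied from the slide.
timings = {
    "create_thread": {"rpc": 160.0, "ppc": 40.8, "mblaze": 12.5},
    "join": {"rpc": 130.0, "ppc": 65.7, "mblaze": 13.9},
}


def speedup_over_rpc(call, cpu):
    """How many times faster the hthread-core path is than the RPC path."""
    entry = timings[call]
    return entry["rpc"] / entry[cpu]


for call in timings:
    for cpu in ("ppc", "mblaze"):
        print(f"{call:14s} {cpu:7s} {speedup_over_rpc(call, cpu):.1f}x")
```

Under this reading, the hthread cores are roughly 2x to 4x faster than RPC when called from the PPC and roughly 9x to 13x faster when called from the MicroBlaze.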

 Scheduler Timings

     Mutex Timings

• Manycores Placing Success on Parallel
  – Operating Systems Research Stagnated
  – Need

                       A New World Order

• This Go Round Benefiting From Some Lessons Learned
   • Architectures: It’s not just about CPU’s
       • Much History on Interconnect Networks, Memory Hierarchies
   • Representing Parallelism
       • User Representation Versus Automated Compiler Extraction
       • Programming Models Versus Languages
            • CPU versus System Abstractions
            • MIMD/SIMD Pros and Cons
       • Will User Languages Move Towards DSL’s and Libraries ?
          • Abstracting Complexity & Correctness
          – Will This Kill Hallway Language Debates ?

• What About OS’s ?
   • Historical Domination of Clusters Good for Economics
       • Much OS work rests with Big Iron in Boneyard of Parallel Machines
   • Largely Still Building on “Son of” OS/360 Type Monolithic Kernels
   • But with Middleware to Fill in the System Abstraction Gap
 What is the Operating System ?
• Modern Parallel Programming Model:
  – Programming Language +
  – Middleware +
  – Operating System

• We Should Be Talking about More Than
  Classic Operating System

 Middleware            System Level Abstractions
                            Communications
 Operating System      CPU Level Abstractions
                            Program Mgt
                            I/O Mgt

    A New World Order
 Armed With Lessons Learned

  Operating Systems Challenges

• Complete Multiprocessor System on Chip
  – Computations/Communications Tradeoffs Different
  – Moore’s Law Applied to Processors
     • Doubling of Processors Every ~18 Months Is Exciting
     • But What About Memory Hierarchy & Interconnect ?
         – How Do We Exploit ?
  – Amdahl’s Law Applies to Speedup
     • What is Speedup on 1,000 cores running 10 threads ?
     • “Threads” Should Track Processors
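The question about 1,000 cores running 10 threads can be answered with a short calculation. This is a minimal sketch under an optimistic assumption: the work is perfectly parallel (f = 1.0) and each thread occupies exactly one core. The helper names are hypothetical.

```python
def amdahl_speedup(f, sp):
    """Overall speedup when a fraction f of the work is sped up by a factor sp."""
    return 1.0 / ((1.0 - f) + f / sp)


def usable_parallelism(threads, cores):
    """A core with no thread to run contributes nothing to speedup."""
    return min(threads, cores)


cores, threads = 1000, 10
sp = usable_parallelism(threads, cores)

best_case = amdahl_speedup(1.0, sp)  # f = 1.0: no serial fraction at all
utilization = sp / cores             # fraction of cores doing any work

print(f"best-case speedup: {best_case:.1f}x on {cores} cores")
print(f"core utilization : {utilization:.1%}")
```

Even in this best case the speedup is capped at the thread count, and 99% of the cores sit idle, which is the point behind "Threads Should Track Processors".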

      Redefining Bridging Abstractions
[Figure: Productivity Measure. Labels: Software Time/Ease; Language; Coordination Language; Comm/Synch Primitives; Scheduler; Parallel Systems; CPUs, IP Cores]

  Co-Ordination Language == Virtual Machine == Computational Model
        Provide Transparency Between Software/Hardware Apps
        Common API’s Accessible for Hw & Sw Computations

                 Scalability Issues
• Operating Systems Traditional Role:
   – Provide Virtual Machine Abstraction
      • Enable Portability Across Platforms
      • Abstract Physical Characteristics of Machine
          – Number of Processors, Interconnect, Memory Hierarchy, I/O
      • Functionality of Virtual Machine Provided Through API’s
          –   Program Management
          –   File & I/O Management
          –   Timers
          –   Synchronization/Communication Primitives

           Scary Scalability Issues

• Manycores bring about higher performance if
  parallelism is increased
   – Increased Burden on OS !
   – Efficiency = work/(latency + work)
      • with fixed latencies (thread_create(), mutex()), efficiency worsens…
         – What happens when thread is broken into more threads ?
      • with latencies that vary with #threads, efficiency worsens….
• New and more efficient services
  fundamental to success of Manycores
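The efficiency formula on this slide can be explored by splitting a fixed amount of work across more and more threads. The sketch below is illustrative only: the function names are hypothetical, and the workload (1000 units), the fixed OS latency (10 units per thread), and the latency growth rate (0.5 units per thread in the system) are made-up numbers chosen to show the trend.

```python
def efficiency(work, latency):
    """Efficiency = work / (latency + work), as on the slide."""
    return work / (latency + work)


def split_efficiency(total_work, n_threads, fixed_latency, per_thread_latency=0.0):
    """Per-thread efficiency when total_work is divided evenly among n_threads.

    fixed_latency is paid by every thread (e.g. create and mutex calls);
    per_thread_latency models OS services that slow down as #threads grows.
    """
    work = total_work / n_threads
    latency = fixed_latency + per_thread_latency * n_threads
    return efficiency(work, latency)


for n in (1, 10, 100):
    fixed = split_efficiency(1000.0, n, fixed_latency=10.0)
    growing = split_efficiency(1000.0, n, fixed_latency=10.0, per_thread_latency=0.5)
    print(f"{n:4d} threads  fixed latency: {fixed:.3f}  growing latency: {growing:.3f}")
```

With a fixed latency, each split shrinks the useful work per thread while the overhead stays the same, so efficiency falls; when the latency also grows with the thread count, it falls faster.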