Node-to-Network Interface in Scalable Multiprocessors

      CS 258, Spring 99
       David E. Culler
   Computer Science Division
        U.C. Berkeley
Recap: Common Challenges
• Input buffer overflow
      – N-1 queue over-commitment => must slow sources
      – reserve space per source (credit)
          » when available for reuse?
              • Ack or Higher level
      – Refuse input when full
          » backpressure in reliable network
          » tree saturation
          » deadlock free
          » what happens to traffic not bound for congested dest?
      – Reserve ack back channel
      – drop packets
      – Utilize higher-level semantics of programming model
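
The "reserve space per source (credit)" option can be sketched as follows. This is a minimal illustration, not any machine's actual protocol: the class names, the fixed per-source credit count, and the choice to return a credit when the receiver consumes a packet are all assumptions made for the example.

```python
from collections import deque


class CreditReceiver:
    """Input buffer with `credits_per_source` slots reserved for each source."""

    def __init__(self, sources, credits_per_source):
        self.queue = deque()
        self.capacity = len(sources) * credits_per_source

    def accept(self, src, payload):
        # Space was reserved in advance, so an arriving packet always fits.
        assert len(self.queue) < self.capacity, "credit protocol violated"
        self.queue.append((src, payload))

    def consume(self):
        # Freeing a slot is the point at which the credit can be reused.
        return self.queue.popleft()


class CreditSender:
    """Source that may only send while it holds a credit."""

    def __init__(self, name, credits):
        self.name = name
        self.credits = credits

    def try_send(self, receiver, payload):
        if self.credits == 0:
            return False  # no reserved space left: the source must slow down
        self.credits -= 1
        receiver.accept(self.name, payload)
        return True

    def credit_returned(self):
        # Called when the ack (or a higher-level reply) comes back.
        self.credits += 1


senders = {name: CreditSender(name, 2) for name in ("A", "B")}
rx = CreditReceiver(senders, 2)
results = [senders["A"].try_send(rx, i) for i in range(3)]
```

Here the third send from A is refused until the receiver consumes one of A's packets and the credit is returned.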



3/10/99                               CS258 S99                     2
Recap: Challenges (cont)
• Fetch Deadlock
      – For network to remain deadlock free, nodes must continue
        accepting messages, even when they cannot source msgs
      – what if incoming transaction is a request?
          » Each may generate a response, which cannot be sent!
          » What happens when internal buffering is full?
• logically independent request/reply networks
      – physical networks
      – virtual channels with separate input/output queues
• bound requests and reserve input buffer space
      – K(P-1) requests + K responses per node
      – service discipline to avoid fetch deadlock?
• NACK on input buffer full
      – NACK delivery?
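
A small sketch of the buffer bound and of one possible service discipline follows. The bound K(P-1) requests + K responses is taken from the slide; the `Node` class, its queue names, and the rule "always drain replies first, service a request only when the reply can be queued" are assumptions chosen to illustrate how logically separate request and reply queues avoid fetch deadlock.

```python
from collections import deque


def reserved_input_slots(K, P):
    """K requests from each of the P-1 other nodes, plus K responses."""
    return K * (P - 1) + K


class Node:
    """Node with logically separate request and reply input queues."""

    def __init__(self, out_capacity):
        self.requests = deque()
        self.replies = deque()
        self.out = deque()
        self.out_capacity = out_capacity
        self.completed = []

    def step(self):
        """Service one incoming transaction and report which kind it was."""
        if self.replies:
            # Replies are sinks: they never generate new traffic.
            self.completed.append(self.replies.popleft())
            return "reply"
        if self.requests and len(self.out) < self.out_capacity:
            # A request is serviced only if its response can be queued.
            req = self.requests.popleft()
            self.out.append(("reply-to", req))
            return "request"
        return None
```

With K = 4 and P = 64, `reserved_input_slots(4, 64)` reserves 256 slots per node. A node whose output queue is full still makes progress on replies, which is what lets the reply network drain.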
Network Transaction Processing
[Figure: two nodes, each with processor (P), memory (M), and communication assist (CA), exchange a message over a scalable network.]
 • Output Processing
      – checks
      – translation
      – formatting
      – scheduling
 • Input Processing
      – checks
      – translation
      – buffering
      – action

 • Key Design Issue:
 • How much interpretation of the message?
 • How much dedicated processing in the Comm.
   Assist?

Spectrum of Designs
     • None: Physical bit stream
          – blind, physical DMA                          nCUBE, iPSC, . . .
     • User/System
          – User-level port                              CM-5, *T
          – User-level handler                           J-Machine, Monsoon, . . .
     • Remote virtual address
          – Processing, translation                      Paragon, Meiko CS-2
     • Global physical address
          – Proc + Memory controller                     RP3, BBN, T3D
     • Cache-to-cache
          – Cache controller                             Dash, KSR, Flash
     Increasing HW Support, Specialization, Intrusiveness, Performance (???)


Net Transactions: Physical DMA
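[Figure: source and destination nodes, each with memory, processor (P), and DMA channels programmed through Addr, Length, and Rdy registers; the processor issues Cmd and receives Status and interrupt; the message carries Data and Dest.]
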
• DMA controlled by regs, generates interrupts
• Physical => OS initiates transfers
• Send-side
      – construct system “envelope” around user data in kernel area
      – envelope fields shown in the figure: sender, auth, dest addr
• Receive
      – must receive into system buffer, since no interpretation in CA
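
The send-side envelope can be illustrated with a fixed header packed in front of the user data. The field names sender, auth, and dest addr come from the slide; the field widths, the added length field, and the function names are hypothetical and chosen only to make the sketch self-contained.

```python
import struct

# Hypothetical layout: sender (16 bit), auth (16 bit), dest addr (32 bit),
# payload length (32 bit), all in network byte order.
ENVELOPE = struct.Struct("!HHII")


def wrap(sender, auth, dest_addr, user_data):
    """Kernel-side send: build the system envelope around the user data."""
    header = ENVELOPE.pack(sender, auth, dest_addr, len(user_data))
    return header + user_data


def unwrap(system_buffer):
    """Kernel-side receive: the message lands in a system buffer first."""
    sender, auth, dest_addr, length = ENVELOPE.unpack_from(system_buffer)
    body = system_buffer[ENVELOPE.size:ENVELOPE.size + length]
    return sender, auth, dest_addr, body


packet = wrap(3, 0xBEEF, 0x1000, b"hello")
```

Because the CA does no interpretation, it is the receiving kernel that runs something like `unwrap` before copying the body to its final destination.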
nCUBE Network Interface
[Figure: input ports and output ports connect through a switch; each port has a DMA channel with Addr and Length registers; the DMA channels reach memory over the memory bus shared with the processor.]

• independent DMA channel per link direction
      – leave input buffers always open
      – segmented messages
• routing interprets envelope
      – dimension-order routing on hypercube
      – bit-serial with 36 bit cut-through
• Overheads
      – Os: 16 ins, 260 cy, 13 us
      – Or: 18 ins, 200 cy, 15 us
      – includes interrupt
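
Dimension-order routing on a hypercube, as named on this slide, can be sketched in a few lines: the route corrects the address bits in which source and destination differ, one dimension at a time in a fixed order. Correcting the lowest dimension first is an assumption of this sketch; the slide does not say which order the nCUBE uses.

```python
def ecube_route(src, dst, dims):
    """Nodes visited when differing address bits are fixed in dimension order."""
    path = [src]
    node = src
    for d in range(dims):
        bit = 1 << d
        if (node ^ dst) & bit:
            node ^= bit  # cross the link in dimension d
            path.append(node)
    return path


route = ecube_route(0b000, 0b101, 3)
```

The number of hops equals the number of differing address bits, and the fixed dimension order is what makes the routing deadlock free.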
Conventional LAN NI
[Figure: host memory holds data buffers and linked lists of TX and RX descriptors (Addr, Len, Status, Next); the NIC controller sits on the IO bus and uses DMA (addr, len) to move data between host memory and the transceiver (trncv); the processor reaches host memory over the mem bus.]
User Level Ports
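[Figure: two nodes, each with memory (Mem) and processor (P); the message carries a User/system flag, Data, and Dest; the destination raises Status and interrupt.]
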
•   initiate transaction at user level
•   deliver to user without OS intervention
•   network port in user space
•   User/system flag in envelope
      – protection check, translation, routing, media access in src CA
      – user/sys check in dest CA, interrupt on system
User Level Network ports
[Figure: the net output port, net input port, and status are mapped into the virtual address space, alongside the processor's registers and program counter.]

• Appears to user as logical message queues plus status
• What happens if no user pop?
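
The "logical message queues plus status" view can be sketched as a bounded input queue with a status word. The class name, the status fields, and the decision to refuse delivery when the queue is full are assumptions of this sketch; refusing is only one possible answer to the question of what happens when the user does not pop.

```python
from collections import deque


class UserPort:
    """User-visible network input port: a bounded queue plus status."""

    def __init__(self, depth):
        self.depth = depth
        self.inq = deque()

    @property
    def status(self):
        return {
            "input_ready": bool(self.inq),
            "input_full": len(self.inq) >= self.depth,
        }

    def deliver(self, msg):
        """Network side. Returns False if the user has not popped in time."""
        if len(self.inq) >= self.depth:
            return False  # the message backs up into the network
        self.inq.append(msg)
        return True

    def pop(self):
        """User side. Returns None when no message is waiting."""
        return self.inq.popleft() if self.inq else None


port = UserPort(depth=2)
delivered = [port.deliver(m) for m in ("a", "b", "c")]
```

In this sketch a process that stops popping eventually causes refused deliveries, which a reliable network would turn into backpressure on other traffic.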

Example: CM-5
 • Input and output FIFO for each network
 • 2 data networks
 • tag per message
          – index NI mapping table
 • context switching?
 • *T integrated NI on chip
 • iWARP also
 • Overheads
          – Os: 50 cy, 1.5 us
          – Or: 53 cy, 1.6 us
          – interrupt: 10 us
[Figure: CM-5 organization. The diagnostics network, control network, and data network connect processing partitions, control processors, and an I/O partition. Each node has a SPARC with FPU, cache ($) with cache controller and SRAM, and an NI to the data networks and control network, all on the MBUS, plus DRAM controllers with DRAM and vector units.]
User Level Handlers
[Figure: two nodes, each with memory (Mem) and processor (P); the message carries a User/system flag, Data, Address, and Dest.]

• Hardware support to vector to address specified in message
      – message ports in registers
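
A software analogy of vectoring to the address specified in the message: the first word of the message names the handler, and the remaining words are its arguments. The handler table, the numeric handler addresses, and the two sample handlers are hypothetical; on the machines listed, the dispatch is done by hardware rather than by a dictionary lookup.

```python
def dispatch(message, handlers, state):
    """Vector to the handler named in the message head, passing the rest."""
    handler_addr, *args = message
    return handlers[handler_addr](state, *args)


def h_add(state, key, value):
    """Sample handler: accumulate a value at the destination."""
    state[key] = state.get(key, 0) + value
    return state[key]


def h_read(state, key):
    """Sample handler: read a value at the destination."""
    return state.get(key)


HANDLERS = {0x10: h_add, 0x20: h_read}  # hypothetical handler addresses

node_state = {}
first = dispatch((0x10, "x", 5), HANDLERS, node_state)
second = dispatch((0x10, "x", 2), HANDLERS, node_state)
```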




J-Machine: Msg-Driven Processor




• Each node a small msg driven processor
• HW support to queue msgs and dispatch to msg handler task

Monsoon Explicit Token-Store




*T: Network Co-Processor




iWARP: Systolic Computation
[Figure: iWARP node with host and interface unit.]

• Nodes integrate communication with computation on systolic basis
• Msg data direct to register
• Stream into memory

Dedicated processing without dedicated hardware design




Dedicated Message Processor
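[Figure: two nodes on a network, each with memory (Mem), network interface (NI), a user-level processor (P), and a system-level message processor (MP); the message carries dest.]
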
 • General Purpose processor performs arbitrary output
     processing (at system level)
 • General Purpose processor interprets incoming network
     transactions (at system level)
 • User Processor <–> Msg Processor share memory
 • Msg Processor <–> Msg Processor via system network
     transaction
Levels of Network Transaction
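[Figure: two nodes on a network, each with memory (Mem), network interface (NI), a user-level processor (P), and a system-level message processor (MP); the message carries dest.]
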
• User Processor stores cmd / msg / data into shared output
  queue
    – must still check for output queue full (or make elastic)
• Communication assists make transaction happen
    – checking, translation, scheduling, transport, interpretation
 • Effect observed on destination address space and/or events
 • Protocol divided between two layers
 Example: Intel Paragon
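[Figure: Intel Paragon. The network connects compute nodes, I/O nodes with devices, and service nodes. Each node has a compute processor (P) and a message processor (MP), each with its own cache ($), plus memory (Mem), send and receive DMA engines (sDMA, rDMA), and an NI.]
• Processor: i860xp, 50 MHz, 16 KB $, 4-way, 32B block, MESI
• Figure labels on the node: 64 and 400 MB/s; NI buffer 2048 B
• Figure labels on the network link: 16 and 175 MB/s duplex
• Packet fields as drawn: rte, MP handler, var data, EOP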
User Level Abstraction (Lok Liu)
[Figure: four processes, each with a processor (Proc), an input queue (IQ), an output queue (OQ), and a virtual address space (VAS).]

• Any user process can post a transaction for any
  other in protection domain
      – communication layer moves OQsrc –> IQdest
      – may involve indirection: VASsrc –> VASdest

Msg Processor Events
[Figure: the message processor's dispatcher takes events from the user output queues, the compute processor kernel (system event), DMA done, Send DMA, Rcv DMA, Rcv FIFO ~Full, and Send FIFO ~Empty.]
  Basic Implementation Costs: Scalar
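[Figure: pipeline CP → MP → Net → MP → CP moving 7 wds from registers through the cache, user OQ, net FIFO, and user IQ. Labels: 10.5 µs overall; 4.4 µs and 5.4 µs under the sending and receiving halves; 250ns + H*40ns for the network; per-stage labels 2, 1.5, 2, 2, 2, 2.]
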
• Cache-to-cache transfer (two 32B lines, quad word ops)
    – producer: read(miss,S), chk, write(S,WT), write(I,WT),write(S,WT)
    – consumer: read(miss,S), chk, read(H), read(miss,S), read(H),write(S,WT)
• to NI FIFO: read status, chk, write, . . .
• from NI FIFO: read status, chk, dispatch, read, read, . . .
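
The network term in the figure, 250ns + H*40ns, can be evaluated directly. Reading H as the number of hops is an assumption of this sketch; the slide gives only the expression.

```python
def network_time_ns(hops, base_ns=250.0, per_hop_ns=40.0):
    """Network time from the slide's expression 250ns + H*40ns."""
    return base_ns + hops * per_hop_ns


# Network time for a few hop counts, in nanoseconds.
table = {h: network_time_ns(h) for h in (0, 1, 10, 20)}
```

Even at 20 hops the network term stays near 1 µs, small next to the 10.5 µs end-to-end figure, which is consistent with the slide's focus on processor and message-processor costs.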
   Virtual DMA -> Virtual DMA
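[Figure: pipeline CP → MP → Net → MP → CP with sDMA and rDMA engines moving data between memory and the net FIFO. Labels: 7 wds from registers; hdr through the user OQ; 2048 at each net FIFO; 400 MB/s at each DMA engine and 175 MB/s on the network; per-stage labels 2, 1.5, 2, 2, 2, 2; delivery through the user IQ.]
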
   • Send MP segments into 8K pages and does VA –> PA
   • Recv MP reassembles, does dispatch and VA –> PA per
     page
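
The send-side segmentation and per-page translation can be sketched as below. The 8K page size is from the slide; the function names and the dictionary used as a page table are assumptions for illustration.

```python
PAGE = 8 * 1024  # 8K pages, as on the slide


def segment(vaddr, length, page=PAGE):
    """Split [vaddr, vaddr + length) into pieces that never cross a page."""
    pieces = []
    end = vaddr + length
    while vaddr < end:
        page_end = (vaddr // page + 1) * page
        n = min(end, page_end) - vaddr
        pieces.append((vaddr, n))
        vaddr += n
    return pieces


def translate(pieces, page_table, page=PAGE):
    """VA -> PA per piece; page_table maps virtual to physical page number."""
    return [(page_table[va // page] * page + va % page, n) for va, n in pieces]


pieces = segment(8000, 10000)
physical = translate(pieces, {0: 5, 1: 2, 2: 9})
```

A 10000-byte transfer starting at virtual address 8000 touches three pages, so it becomes three DMA pieces, each translated independently.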
Single Page Transfer Rate
[Figure: Total MB/s and Burst MB/s (0 to 400 MB/s) versus Transfer Size (0 to 8000 B). Effective Buffer Size: 3232; Actual Buffer Size: 2048.]
Msg Processor Assessment

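[Figure: the message processor's dispatcher takes events from the user output queues, the compute processor kernel (system event), DMA done, Send DMA, Rcv DMA, Rcv FIFO ~Full, and Send FIFO ~Empty; user input queues and user output queues live in the VAS.]
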
• Concurrency Intensive
     – Need to keep inbound flows moving while outbound flows stalled
     – Large transfers segmented
• Reduces overhead but adds latency
Case Study: Meiko CS2 Concept
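[Figure: two nodes on the network, each with memory (Mem), a processor (P), and communication processors Pcmd, Pout, Pin, Preply, and Pevent, plus V-to-P address translation; the transaction carries Dest.]
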
• Circuit-switched Network Transaction
   – source-dest circuit held open for request response
   – limited cmd set executed directly on NI
 • Dedicated communication processor for each
     step in flow
Case Study: Meiko CS2 Organization
[Figure: Meiko CS-2 organization. Processor P and memory (threads, user data, DMA descriptors) connect through the mem interface to communication processors Pcmd, Preply, Pthread, PDMA, and Pinput, with output control toward the network. Signals at Pcmd: SWAP: CMD, Addr; Accept; Interrupt; Set-event; Run-thread; Start-DMA.]
Annotations in the figure:
• Generates set-event, 3 x write_word
• Execute net transactions
      – requests from Pthread
      – write_blocks from PDMA
      – set-event and write_word from Preply
• DMA from memory
      – Issue write_block transactions (50-µs limit)
• Threads
      – RISC instruction set
      – 64-K nonpreemptive threads
      – Construct arbitrary net transactions
      – Output protocol
Shared Physical Address Space
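[Figure: a load (Ld R, Addr) at the source passes through the memory management unit (MMU) to the communication assist; each communication assist presents a pseudo-memory and a pseudo-processor to its node (P, $, Mem), and the returned Data completes the load.]
• Request fields as drawn: Tag, Src, Addr, Read, Dest
• Response fields as drawn: Src, Rrsp, Tag, Data
• Output processing
      – Mem access
      – Response
• Input processing
      – Parse
      – Complete read
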
• NI emulates memory controller at source
• NI emulates processor at dest
      – must be deadlock free
Case Study: Cray T3D


[Figure: Cray T3D node: Alpha processor (P) with cache ($) and MMU, DTB holding PE# + FC, DRAM with 32-bit physical address, DMA block transfer engine, and network ports Req in, Req out, Resp in, Resp out.]
• 3D torus of pairs of PEs
      – share net and BLT
      – up to 2,048
      – 64 MB each
• 150-MHz DEC Alpha (64 bit)
      – 8-KB instruction + 8-KB data
      – 43-bit virtual address
      – 32- and 64-bit memory and byte operations
      – Nonblocking stores and memory barrier
      – Prefetch
      – Load-lock, store-conditional
• Message queue
      – 4,080 × 4 × 64
• Prefetch queue
      – 16 × 64
• Special registers
      – swaperand
      – fetch&add
      – barrier

• Build up info in ‘shell’
• Remote memory operations encoded in address
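
One way to picture "remote memory operations encoded in address" is to let the high bits of the physical address index a small table whose entries hold a PE number and a function code (the PE# + FC shown in the figure), with the low bits giving the offset at the target. The bit widths, the table contents, and the function-code names below are hypothetical and are not taken from T3D documentation.

```python
INDEX_BITS = 5    # assumed width of the table index
OFFSET_BITS = 27  # assumed: remainder of a 32-bit physical address


def decode(paddr, table):
    """Split a physical address into (PE#, function code, offset)."""
    index = paddr >> OFFSET_BITS
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    pe, fc = table[index]
    return pe, fc, offset


# Hypothetical table set up in the shell before issuing the access.
table = {0: (None, "local"), 3: (17, "remote-read")}
target = decode((3 << OFFSET_BITS) | 0x1000, table)
```

In this picture, building up info in the shell corresponds to filling a table entry, after which an ordinary load or store to an address in that range becomes a remote operation.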
Case Study: NOW
[Figure: NOW node and network. Myrinet: 160-MB/s bidirectional links and eight-port wormhole switches. Myricom Lanai NIC (37.5-MHz processor, 256-MB SRAM, 3 DMA units): link interface with r DMA and s DMA, SRAM, main processor, host DMA, and bus interface on the SBUS (25 MHz). Host: UltraSparc with L2 $, X-bar, bus adapter, and Mem.]

• General purpose processor embedded in NIC
Message Time Breakdown
[Figure: machine resource (source processor, communication assist, network, communication assist, destination processor) versus time of the message. The total communication latency spans Os, L, and Or; L is the observed network latency.]

• Communication pipeline
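
The breakdown in the figure suggests a simple cost model: one message costs Os + L + Or end to end, and in a pipeline a new message can be issued every gap g. The sketch below encodes that model; the function names are assumptions, the pipelined formula assumes the gap is at least as large as either overhead, and the value used for L in the example is made up. The Os and Or values are the CM-5 figures quoted earlier in the deck.

```python
def total_latency_us(o_s, latency, o_r):
    """Total communication latency of one message: Os + L + Or."""
    return o_s + latency + o_r


def pipelined_time_us(n_messages, o_s, latency, o_r, gap):
    """Time for n back-to-back messages when one can start every `gap`."""
    if n_messages == 0:
        return 0.0
    return (n_messages - 1) * gap + total_latency_us(o_s, latency, o_r)


# CM-5 overheads from the deck; L = 4.0 us is a placeholder value.
one = total_latency_us(1.5, 4.0, 1.6)
many = pipelined_time_us(3, 1.5, 4.0, 1.6, gap=2.0)
```

For long sequences the per-message cost approaches g, which is why the comparison charts report the gap separately from L, Os, and Or.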

Message Time Comparison

[Figure: bar chart in microseconds (0 to 14) for CM-5, Paragon, Meiko CS-2, NOW Ultra, and T3D. One group stacks communication latency (L), processing overhead on the sending side (Os), and processing overhead on the receiving side (Or); the other shows time per message in a pipelined sequence of request-response operations (g).]
SAS Time Comparison
[Figure: bar chart in microseconds (0 to 25) showing gap, issue, and latency for CM-5, Paragon, Meiko CS-2, NOW Ultra, and T3D.]
Message-Passing Time vs Size
[Figure: time (µs, logarithmic axis with ticks from 100 to 1,000,000) versus message size for iPSC/860, IBM SP-2, Meiko CS-2, Paragon/Sunmos*, Cray T3D, NOW, SGI Challenge, and Sun E5000.]
                                                                  
                                                                              
                                                           
                                                                
                                
                             10                       




                              1
                                   1       10                 100            1,000            10,000   100,000          1,000,000
                                                                                                         *Sunmos operating system
                                                                          Message size                   is used for the benchmark.




Message-Passing Bandwidth vs Size
[Figure: delivered bandwidth (MB/s, 0 to 180) vs. message size (1 to 1,000,000, log scale). Systems plotted: iPSC/860, IBM SP-2, Meiko CS-2, Paragon/Sunmos, Cray T3D, NOW, SGI Challenge, Sun E6000.]
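The slide shows only measured curves. As a rough aid for reading them, the sketch below uses a common first-order model in which transfer time is a fixed startup cost plus size divided by peak bandwidth. The model, the function names, and the sample numbers are assumptions for illustration, not values taken from the plotted machines.

```python
# First-order message cost model (an assumption, not from the slide):
#   T(n) = T0 + n / B        startup cost plus streaming time
#   BW(n) = n / T(n)         delivered bandwidth at message size n
# Units: bytes, microseconds, MB/s (1 MB/s == 1 byte per microsecond).

def transfer_time_us(n_bytes, t0_us, peak_mb_per_s):
    """Time to move n_bytes given startup cost t0_us and peak bandwidth."""
    return t0_us + n_bytes / peak_mb_per_s


def delivered_bandwidth(n_bytes, t0_us, peak_mb_per_s):
    """Delivered bandwidth in MB/s for a message of n_bytes."""
    return n_bytes / transfer_time_us(n_bytes, t0_us, peak_mb_per_s)


def half_power_point(t0_us, peak_mb_per_s):
    """Message size (bytes) at which delivered bandwidth is half of peak."""
    return t0_us * peak_mb_per_s


if __name__ == "__main__":
    T0_US, PEAK = 50.0, 40.0  # hypothetical machine parameters
    for n in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
        bw = delivered_bandwidth(n, T0_US, PEAK)
        print(f"{n:>9} bytes: {bw:6.2f} MB/s")
```

Under this model small messages are dominated by the startup cost, and delivered bandwidth only approaches the peak once the message is much larger than the half-power point.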




Application Performance on LU
[Figure, left panel: speedup on LU-A (0 to 125) vs. number of processors (0 to 125) for T3D, SP-2, and NOW, with an Ideal reference line.]
[Figure, right panel: MFLOPS on LU-A using four processors (0 to 250) for T3D, SP-2, and NOW.]

Application Performance on BT
[Figure, left panel: speedup on BT-A (0 to 100) vs. number of processors (0 to 100) for T3D, SP-2, and NOW, with an Ideal reference line.]
[Figure, right panel: BT MFLOPS using 25 processors (0 to 1,400) for T3D, SP-2, and NOW.]

Message Profile on BT

[Figure: message size (KB, 0 to 40) vs. time (ms, 0 to 2,500) for BT.]

Reflective Memory
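[Figure: reflective memory mapping. On Node i, virtual addresses (VA0, VA2) map to physical addresses, including I/O space, with transmit regions (T1, T2, T3) and receive regions (R1, R2, R3). Node j's virtual address space holds T0, T2, R0, R2; Node k's holds T0, T1, R1, R0.]
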
• Writes to local region reflected to remote
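A minimal software sketch of the idea in this bullet, assuming a simple mapping table from a local transmit region to receive regions on other nodes. The class and method names (`Node`, `map_transmit`, `write`) are hypothetical, and real reflective-memory hardware does this propagation in the network interface rather than in software.

```python
class Node:
    """Toy node with a flat memory and a table of transmit mappings."""

    def __init__(self, name, size):
        self.name = name
        self.mem = bytearray(size)
        # Each entry: (local_base, length, remote_node, remote_base).
        self.tx_map = []

    def map_transmit(self, local_base, length, remote, remote_base):
        """Declare that writes to a local region are reflected to a remote region."""
        self.tx_map.append((local_base, length, remote, remote_base))

    def write(self, addr, data):
        """Write locally (an assumption), then reflect any overlap with transmit regions."""
        self.mem[addr:addr + len(data)] = data
        for base, length, remote, rbase in self.tx_map:
            lo = max(addr, base)                      # start of overlap
            hi = min(addr + len(data), base + length)  # end of overlap
            if lo < hi:
                # One-way propagation: the receiver's copy is updated,
                # and nothing is reflected back to the sender.
                start = rbase + (lo - base)
                remote.mem[start:start + (hi - lo)] = data[lo - addr:hi - addr]


if __name__ == "__main__":
    node_i, node_j, node_k = Node("i", 64), Node("j", 64), Node("k", 64)
    node_i.map_transmit(0, 16, node_j, 32)  # region on i reflected into j
    node_i.map_transmit(0, 16, node_k, 8)   # same region also reflected into k
    node_i.write(4, b"ping")
    print(bytes(node_j.mem[36:40]), bytes(node_k.mem[12:16]))
```

Mapping one transmit region to several receivers, as in the usage above, gives a simple form of multicast; writes outside any mapped region stay local.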
Case Study: DEC Memory Channel

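[Figure: an AlphaServer SMP node (Alpha P-$, Mem) connects through a bus adapter and 33 MHz PCI to the Memory Channel adapter, which contains a bus interface, tx ctrl, PCT, rx ctrl, receive DMA, and a link interface. The link interface attaches at 100 MB/s to the Memory Channel interconnect.]
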
• See also Shrimp