					                         Interconnection Networks




ECE 6101: Yalamanchili                       Spring 2004
                                  Overview


           Physical Layer and Message Switching

           Network Topologies

           Metrics

           Deadlock & Livelock

           Routing Layer

           The Messaging Layer




                         Interconnection Networks




            Fabric for scalable, multiprocessor architectures
            Distinct from traditional networking architectures such as
             Internet Protocol (IP) based systems


                    Resource View of Parallel Architectures


                 [Figure: a pool of processor (P) and memory (M) resources to be
                  connected by an interconnection network]




           How do we present these resources?
           What are the costs of different interconnection networks?
           What are the design considerations?
           What are the applications?

              Example: Clusters & Google Hardware Infrastructure


                          VME rack: 19 in. wide, 6 feet tall, 30 inches deep
                          Per side: 40 1-Rack-Unit (RU) PCs + 1 HP Ethernet
                           switch (4 RU); each blade can contain 8 100-Mbit/s
                           Ethernet ports or a single 1-Gbit Ethernet interface
                          Front + back => 80 PCs + 2 Ethernet switches per rack
                          Each rack connects to two 128-port 1-Gbit/s Ethernet
                           switches
                          Dec 2000: 40 racks at the most recent site
                          6000 PCs, 12000 disks: almost 1 petabyte!
                          Each PC operates at about 55 Watts
                          Rack => 4500 Watts, 60 amps

  From Patterson, CS252, UCB
                                     Reliability

    For 6000 PCs, 12000 disks, 200 Ethernet switches
    ~20 PCs will need to be rebooted per day
    ~2 PCs/day hardware failure, or 2%-3% per year
          –   5% due to problems with motherboard, power supply, and connectors
          –   30% DRAM: bits change + errors in transmission (100 MHz)
          –   30% disks fail
          –   30% disks go very slow (10%-30% of expected BW)
    200 Ethernet switches: 2-3 failed in 2 years
    6 Foundry switches: none failed, but 2-3 of the 96 switch blades
     have failed (16 blades/switch)
    Collocation site reliability:
          – 1 power failure, 1 network outage per year per site




  From Patterson, CS252, UCB
                                   The Practical Problem




 From: Ambuj Goyal, “Computer Science Grand Challenge – Simplicity of Design,” Computing Research Association
 Conference on "Grand Research Challenges" in Computer Science and Engineering, June 2002




                           Example: Embedded Devices




  picoChip: http://www.picochip.com/
  PACT XPP Technologies: http://www.pactcorp.com/

                                                   Issues
                                                      Execution performance
                                                      Power dissipation
                                                      Number of chip types
                                                      Size and form factor

Physical Layer and Message Switching




                                      Messaging Hierarchy


                    Routing Layer      Where?: Destination decisions, i.e., which output port




                    Switching Layer   When?: When is data forwarded




                     Physical Layer    How?: synchronization of data transfer




      This organization is distinct from traditional networking
       implementations
      Emphasis is on low latency communication
            – Only recently have standards been evolving
                   – Infiniband: http://www.infinibandta.org/home


                                    The Physical Layer

 [Figure: data is segmented into packets, each carrying a header and checksum;
  packets are divided into flits (flow control digits), and flits into phits
  (physical flow control digits)]




            Data is transmitted based on a hierarchical data structuring
             mechanism
              – Messages → packets → flits → phits
              – While flits and phits are fixed size, packets and messages
                may be variable sized
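As a rough sketch of this hierarchy (the sizes below are made up for illustration, not taken from any real network), a message can be split into packets, fixed-size flits, and single-byte phits:

```python
def packetize(message, packet_payload=8, flit_size=4, phit_size=1):
    """Split a variable-sized message into packets, then fixed-size
    flits, then phits (all sizes in bytes, chosen for illustration)."""
    packets = [message[i:i + packet_payload]
               for i in range(0, len(message), packet_payload)]
    flits = []
    for pkt in packets:
        # pad the tail of each packet so every flit stays fixed size
        padded = pkt.ljust(-(-len(pkt) // flit_size) * flit_size, b"\x00")
        flits += [padded[i:i + flit_size]
                  for i in range(0, len(padded), flit_size)]
    # a phit is the unit physically transferred across the link per cycle
    phits = [f[i:i + phit_size] for f in flits
             for i in range(0, len(f), phit_size)]
    return packets, flits, phits

packets, flits, phits = packetize(b"hello interconnect")  # 18-byte message
```

Here the 18-byte message yields three packets; only flits and phits are uniform in size, matching the bullet above.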

                                       Flow Control

           Flow control digit:
            synchronized transfer of a unit
            of information
             – Based on buffer management
           Asynchronous vs.
            synchronous flow control
           Flow control occurs at multiple
            levels
             – message flow control
             – physical flow control
           Mechanisms
             – Credit based flow control
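A minimal sketch of credit-based flow control (the class and buffer size are illustrative): the sender holds one credit per free buffer slot at the receiver, spends a credit per flit sent, and stalls at zero; the receiver returns a credit whenever it frees a slot.

```python
class CreditLink:
    """One link under credit-based flow control."""
    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # receiver buffer starts empty
        self.buffer = []

    def try_send(self, flit):
        if self.credits == 0:
            return False              # would overflow the receiver: stall
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def drain_one(self):
        """Receiver forwards one flit downstream and returns a credit."""
        if self.buffer:
            self.buffer.pop(0)
            self.credits += 1

link = CreditLink(buffer_slots=2)
sent = [link.try_send(f) for f in ("f0", "f1", "f2")]  # third send stalls
link.drain_one()                                       # a credit comes back
sent.append(link.try_send("f2"))                       # retry succeeds
```

The buffer can never overflow by construction, which is exactly the synchronization the slide describes.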




                                Switching Layer


           Comprised of three sets of techniques
             – switching techniques
             – flow control
             – buffer management


           Organization and operation of routers are largely determined
            by the switching layer

           Connection Oriented vs. Connectionless communication




                         Generic Router Architecture



 [Figure: generic router with input/output channels, buffers, and a crossbar;
  the latency components are wire delay, switching delay, and routing delay]
                                         Virtual Channels

           Each virtual channel is a pair of
            unidirectional channels
           Independently managed buffers
            multiplexed over the physical
            channel
           De-couples buffers from physical
            channels
           Originally introduced to break
            cyclic dependencies
           Improves performance through
            reduction of blocking delay
           Virtual lanes vs. virtual channels
           As the number of virtual channels
            increase, the increased channel
            multiplexing has two effects
             –    decrease in header delay
              –    increase in average data flit delay
           Impact on router performance
             –    switch complexity
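Both effects can be seen in a toy multiplexing model (a sketch, not a real router simulation): virtual channels share the physical channel flit by flit in round-robin order, so a second message's header crosses sooner, while the first message's last data flit arrives later.

```python
def channel_busy_time(messages, vcs):
    """Toy model: one physical channel time-multiplexed among up to
    `vcs` virtual channels, one flit per channel cycle, round-robin.
    `messages` lists message lengths in flits; returns the cycle on
    which each message's last flit crosses the channel."""
    queues = [list(range(n)) for n in messages]
    finish = [0] * len(messages)
    cycle = 0
    while any(queues):
        # serve the head-of-line flit of up to `vcs` pending messages
        for i in [j for j, q in enumerate(queues) if q][:vcs]:
            queues[i].pop(0)
            cycle += 1
            if not queues[i]:
                finish[i] = cycle
    return finish
```

With two 4-flit messages and one VC, the second message waits for the first entirely; with two VCs its header gets through on cycle 2, while the first message's tail slips later — less header delay, more average data flit delay.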

                               Circuit Switching




           Hardware path setup by a routing header or probe
           End-to-end acknowledgment initiates transfer at full hardware
            bandwidth
           Source routing vs. distributed routing
           System is limited by signaling rate along the circuits --> wave
            pipelining
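A textbook-style base latency model in the spirit of Duato, Yalamanchili & Ni (the parameter names are assumptions, not from this deck): the probe pays routing, switch, and wire delay at each of D hops, the acknowledgment returns over the reserved path, and data then streams at the wire rate.

```python
import math

def circuit_switching_latency(D, L, W, tr, ts, tw):
    """D hops, L message bits, W channel width in bits per cycle,
    tr/ts/tw = per-hop routing/switch/wire delays in cycles."""
    t_setup = D * (tr + 2 * (ts + tw))   # probe out + acknowledgment back
    t_data = math.ceil(L / W) * tw       # one W-bit phit per wire cycle
    return t_setup + t_data
```

Note that setup cost is paid once per circuit, so long data streams amortize it, while short messages are dominated by it.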



                               Packet Switching




            Blocking delays of circuit switching are avoided in packet-switched
             networks --> full link utilization in the presence of data
           Increased storage requirements at the nodes
           Packetization and in-order delivery requirements
           Buffering
             – use of local processor memory
             – central queues
                               Virtual Cut-Through




           Messages cut-through to the next router when feasible
           In the absence of blocking, messages are pipelined
             – pipeline cycle time is the larger of intra-router and inter-router
               flow control delays
           When the header is blocked, the complete message is
            buffered
           High load behavior approaches that of packet switching

                          Wormhole Switching




        Messages are pipelined, but buffer space is on the order of a
         few flits
         Small buffers + message pipelining --> small, compact routers
        Supports variable sized messages
        Messages cannot be interleaved over a channel: routing
         information is only associated with the header
        Base Latency is equivalent to that of virtual cut-through

                         Comparison of Switching Techniques


           Packet switching and virtual cut-through
             – consume network bandwidth proportional to network load
             – predictable demands
             – VCT behaves like wormhole at low loads and like packet
               switching at high loads
             – link level error control for packet switching
           Wormhole switching
             – provides low latency
             – lower saturation point
             – higher variance of message latency than packet or VCT switching
           Virtual channels
             – blocking delay vs. data delay
             – router flow control latency
           Optimistic vs. conservative flow control
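These trade-offs can be made concrete with the standard no-contention latency models (hedged textbook forms; D hops, L message bits, W channel width, and the per-hop routing/switch/wire delays tr/ts/tw are illustrative parameters): store-and-forward cost grows with D times the message length, while cut-through and wormhole pay the per-hop cost only for the header, giving identical base latencies.

```python
import math

def saf_latency(D, L, W, tr, ts, tw):
    # store-and-forward: the full packet is received before each forward
    return D * (tr + (ts + tw) * math.ceil(L / W))

def vct_latency(D, L, W, tr, ts, tw):
    # cut-through: only the header pays the per-hop cost; the body pipelines
    return D * (tr + ts + tw) + max(ts, tw) * math.ceil(L / W)

wormhole_latency = vct_latency  # identical base (contention-free) latency
```

For a 256-bit message over 4 hops of 8-bit channels with unit delays, store-and-forward takes several times longer than the pipelined schemes; the difference between wormhole and VCT appears only under contention, not in this base model.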

                         Saturation




                         Network Topologies




                                    Direct Networks




          Generally fixed degree

          Modular

          Topologies
             – Meshes
             – Multidimensional tori
             – Special case of tori – the binary hypercube
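The standard static properties of these topologies can be computed directly (a sketch; "k-ary n-cube" here means an n-dimensional torus with k nodes per dimension, and the binary hypercube is the k = 2 special case):

```python
def kary_ncube(k, n):
    """(nodes, degree, diameter, bisection channel count) of a k-ary
    n-cube. For k = 2 (the binary hypercube) each dimension has a
    single link, halving degree and bisection relative to a k > 2
    torus with wraparound channels in both directions."""
    if k == 2:
        return 2 ** n, n, n, 2 ** (n - 1)
    return k ** n, 2 * n, n * (k // 2), 2 * k ** (n - 1)
```

Note the fixed degree for a given dimension n, which is the modularity property above: the router design does not change as the machine scales in k.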



                         Indirect Networks

    Multistage Network
              – indirect networks
              – uniform base latency
              – centralized or distributed control
              – engineering approximations to direct networks

    Fat Tree Network
              – bandwidth increases as you go up the tree




                               Generalized MINs




           Columns of k x k switches and connections between switches
           All switches are identical
           Directionality and control
           May concentrate or expand or just connect



                         Specific MINs




       Switch sizes and interstage interconnect establish
        distinct MINs
       The majority of interesting MINs have been shown to be
        topologically equivalent
                         Metrics




                                  Evaluation Metrics





            Bisection bandwidth
              – The minimum bandwidth across any bisection of the network
              – Bisection bandwidth is a limiting attribute of performance
            Latency
              – Message transit time
            Node degree
              – Related to pin/wiring constraints
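Why bisection bandwidth is limiting: under uniform random traffic roughly half of all injected traffic must cross the bisection, so it caps the sustainable per-node injection rate. A sketch (the uniform-traffic assumption and unit channel bandwidth are illustrative):

```python
def per_node_throughput_bound(num_nodes, bisection_bw):
    """Uniform random traffic sends half of all injected data across
    the bisection on average, so the equivalent of N/2 nodes' traffic
    must share bisection_bw."""
    return bisection_bw / (num_nodes / 2)

# e.g. an 8x8 mesh: bisection of 8 unit-bandwidth channels, 64 nodes
bound = per_node_throughput_bound(64, 8.0)
```

No routing or switching cleverness can push sustained uniform throughput past this bound; only widening the bisection can.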

             Constant Resource Analysis: Bisection Width




                         Constant Resource Analysis: Pin out




                           Latency Under Contention




 32-ary 2-cube
      vs.
 10-ary 3-cube
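The comparison above pits two tori of nearly equal size against each other; their static metrics (a sketch using standard torus formulas) show the higher-dimensional cube trading more wiring and node degree for a shorter diameter and a wider bisection:

```python
def torus_stats(k, n):
    """Static metrics of a k-ary n-cube (torus with wraparound links)."""
    return {"nodes": k ** n,
            "diameter": n * (k // 2),
            "bisection_links": 2 * k ** (n - 1)}

flat = torus_stats(32, 2)   # 1024 nodes, wide and flat
cube = torus_stats(10, 3)   # 1000 nodes, one more dimension
```

Under a constant-bisection-width or constant-pin-out budget, however, the flatter network can afford wider channels per link, which is exactly the trade the constant-resource analyses above explore.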




                         Deadlock and Livelock




                          Deadlock and Livelock




       Deadlock freedom can be ensured by enforcing routing
        constraints
              – For example, following dimension-order routing in 2D
                meshes
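Dimension-order (XY) routing for a 2D mesh can be sketched as below: all X hops are taken before any Y hop, which forbids the turns needed to close a cycle of channel dependencies.

```python
def dimension_order_route(src, dst):
    """XY routing in a 2D mesh: fully correct the X coordinate, then Y.
    Banning Y-to-X turns breaks every cycle in the channel dependence
    graph, making the routing deadlock-free."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

The price of this guarantee is that the route is fixed by the source-destination pair, so congested links cannot be avoided at run time.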
                         Occurrence of Deadlock

 [Figure: buffer dependencies that produce deadlock - under VCT and SAF the
  dependencies form between packet buffers; under wormhole switching they form
  between flit-level channel buffers]




      Deadlock is caused by dependencies between buffers




                         Deadlock in a Ring Network




                         Deadlock Avoidance: Principle




      Deadlock is caused by dependencies between buffers




                   Routing Constraints on Virtual Channels




      Add multiple virtual channels to each physical
       channel
      Place routing restrictions between virtual channels
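Such restrictions are typically justified by checking that the resulting channel dependence graph is acyclic (the Dally & Seitz condition). A sketch (the four-channel ring and the high/low "dateline" virtual-channel split below are illustrative):

```python
def has_cycle(deps):
    """deps maps each channel to the set of channels a message may
    occupy next under the routing function. Deadlock freedom follows
    if this channel dependence graph has no cycle (DFS coloring)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GREY
        for nxt in deps.get(c, ()):
            if color.get(nxt, WHITE) == GREY:
                return True          # back edge: a cycle exists
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in list(deps))

ring = {0: {1}, 1: {2}, 2: {3}, 3: {0}}              # unrestricted ring
dateline = {("l", 0): {("l", 1)}, ("l", 1): {("l", 2)},
            ("l", 2): {("h", 3)}, ("h", 3): {("h", 0)}}  # switch VC class
```

The unrestricted ring is cyclic (hence deadlock-prone); splitting each physical channel into low/high virtual channels and switching class at one point breaks the cycle.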

                         Break Cycles




                         Channel Dependence Graph




                         Routing Layer




                         Routing Protocols




                            Key Routing Categories


           Deterministic
              – The path is fixed by the source-destination pair
           Source Routing
             – Path is looked up prior to message injection
             – May differ each time the network and NIs are initialized
           Adaptive routing
             – Path is determined by run-time network conditions
           Unicast
             – Single source to single destination
           Multicast
             – Single source to multiple destinations




                         Software Layer




                                  The Message Layer

          Message layer background
            – Cluster computers
            – Myrinet SAN
            – Design properties


          End-to-End communication path
            – Injection
            – Network transmission
            – Ejection


          Overall performance




                                             Cluster Computers


      Cost-effective alternative to supercomputers
            – Number of commodity workstations
            – Specialized network hardware and software

      Result: Large pool of host processors




 [Figure: four hosts, each a CPU and memory with a network interface attached
  to its I/O bus, all connected through a shared network]
Courtesy of C. Ulmer
                             Clusters & Networks



 Beowulf clusters
       – Use Ethernet & TCP/IP
       – Cheap, but poor Host-to-Host performance
             – Latencies:    ~70-100 μs
             – Bandwidths:   ~80-800 Mbps


 System Area Network (SAN) clusters
       – Custom hardware/software
       – Examples: Myrinet, SCI, InfiniBand, QsNet
       – Expensive, but good Host-to-Host performance
             – Latencies:    as low as 3 μs
             – Bandwidths:   up to 3 Gbps
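A first-order model using one point from each range above (the exact numbers are illustrative) shows why small messages are dominated by latency rather than bandwidth:

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mbps):
    """First-order model: fixed per-message latency plus serialization
    time. Mbps is megabits per second, i.e. bits per microsecond."""
    return latency_us + (nbytes * 8) / bandwidth_mbps

# a 64-byte message on each class of network
eth = transfer_time_us(64, latency_us=75.0, bandwidth_mbps=100.0)
san = transfer_time_us(64, latency_us=3.0, bandwidth_mbps=2000.0)
```

For this message size the SAN wins by more than an order of magnitude, and almost all of the gap comes from the latency term, not the bandwidth term.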




                                           Myrinet


      Descendant of Caltech Mosaic project
           –    Wormhole network
           –    Source routing
           –    High-speed, Ultra-reliable network
           –    Configurable topology: Switches, NICs, and cables



 [Figure: an example Myrinet fabric - host NICs cabled to crossbar switches
  (X) in a configurable topology]
                           Myrinet Switches & Links

          16-port crossbar chip
            – 2.0+2.0 Gbps per port
            – ~300 ns latency
          Line card
            – 8 network (fiber) ports
            – 8 backplane ports
            – one 16-port crossbar serves the 8 hosts on each line card
          Backplane cabinet
            – 17 line card slots
            – 128 hosts
 [Figure: fiber links feed each line card's 16-port crossbar; the backplane
  ports connect the line cards to one another through the backplane]




                            Myrinet NI Architecture

          Custom RISC CPU
            – 33-200 MHz
            – Big endian
            – gcc is available
          SRAM
            – 1-9 MB
            – No CPU cache
          DMA Engines
            – PCI / SRAM
            – SRAM / Tx
            – Rx / SRAM
 [Figure: the LANai processor on the network interface card - a RISC CPU and
  SRAM, with DMA engines linking the PCI bus to the SAN Tx/Rx ports]



                           Message Layers




                      “Message Layer” Communication Software


         Message layers are enabling technology for clusters
           – Enable cluster to function as single image multiprocessor system
           – Responsible for transferring messages between resources
           – Hide hardware details from end users




 [Figure: CPUs spread across the cluster presented as a single system through
  the cluster message layer]
                           Message Layer Design Issues


     Performance is critical
         – Competing with SMPs, where overhead is <1 µs


     Use every trick to get performance
        –   Single cluster user   --   remove device sharing overhead
        –   Little protection     --   co-operative environment
        –   Reliable hardware     --   optimize for common case of few errors
        –   Smart hardware        --   offload host communication
        –   Arch hacks            --   x86 is a turkey, use MMX, SSE, WC..




                            Message Layer Organization


   User-space Application
        |
   Message Layer Library (user space)
     - maintains cluster info
     - message passing API
     - device interface
        |
   NI Device Driver (kernel)
     - physical access
     - DMA transfers
     - ISR
        |
   NI Firmware
     - monitors the network wire
     - sends/receives messages



                           End User’s Perspective


        Processor A                                   Processor B

 send( dest, data, size )                             Msg = extract();

 [Figure: the message layer carries the message from Processor A's send()
  across the network to Processor B, where extract() retrieves it]


                           End-to-End Communication Path


      Three phases of data transfer
           – Injection
           – Network
           – Ejection


 [Figure: (1) injection at the source NI, (2) transmission across the SAN,
  (3) ejection at the destination NI - supporting both message passing and
  remote memory operations]




                           Injecting Data




                            Injecting Data into the NI

 send( dest, data, size )

 [Figure: the host splits each message into queue entries (header + data) and
  writes them across the PCI bus into the NI's outgoing message queue for
  transmission]

           Fragmentation
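The fragmentation step can be sketched as follows (the field names and MTU value are assumptions for illustration): the library splits the user buffer into queue entries, each carrying its own header so the receiver can reassemble the message in order.

```python
def fragment(dest, data, mtu_payload=1024):
    """Split one user message into NI-queue entries. Each entry gets
    its own header (dest, seq, total) so fragments can be matched up
    and reassembled in order at the destination."""
    frags = [data[i:i + mtu_payload]
             for i in range(0, len(data), mtu_payload)]
    return [{"dest": dest, "seq": i, "total": len(frags), "payload": p}
            for i, p in enumerate(frags)]

queue = fragment(dest=5, data=b"x" * 2500)   # 2500 bytes -> 3 fragments
```

Concatenating the payloads in sequence order recovers the original buffer, which is the in-order delivery requirement mentioned under packet switching.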




                           Host-NI: Data Injections



  Host-NI transfers are challenging
       – Host lacks a DMA engine

  Multiple transfer methods
       – Programmed I/O
       – DMA

  What about virtual/physical addresses?

 [Figure: the host CPU, cache, and main memory sit behind the memory
  controller on one side of the PCI bus; the network interface, with its own
  DMA engine and memory, sits on the other]




                           Virtual and Physical Addresses



 Virtual address space
    – Application’s view
    – Contiguous
 Physical address space
    – Used to manage physical memory
    – Paged, non-contiguous
    – PCI devices are part of the physical address space
    – PCI devices only use physical addresses
 Viewing PCI device memory
    – Memory map (mmap) the device into the application’s virtual
      address space

[Figure: a user-space application’s virtual address space mapping pages to
host memory and, via mmap, to PCI device memory in the physical address
space.]
                              Addresses and Injections

         Programmed I/O (user-space)
           – Translation automatic by host CPU
           – Example: memcpy( ni_mem, source, size )
           – Can be enhanced by use of MMX, SSE registers


         DMA (kernel space)
           – One-copy:
                  – Copy data into pinned, contiguous block
                  – DMA out of block


           – Zero-copy:
                  – Transfer data right out of VA pages
                  – Translate address and pin each page
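The two injection styles can be contrasted in a small sketch; `ni_mem` (memory-mapped NI memory), `PIN_BLOCK_SIZE`, and the commented-out `start_dma()` driver call are assumptions for illustration, not a real driver API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PIN_BLOCK_SIZE 4096  /* assumed size of the pinned staging block */

/* Programmed I/O: the host CPU itself copies the message into
 * memory-mapped NI memory; the MMU translates addresses automatically. */
static void pio_inject(volatile uint8_t *ni_mem, const void *src, size_t size)
{
    memcpy((void *)ni_mem, src, size);
}

/* One-copy DMA: stage the data in a pinned, physically contiguous block,
 * then hand that block to the NI's DMA engine. start_dma() stands in for
 * a driver-specific call and is shown only as a comment. Returns the
 * number of bytes staged. */
static size_t dma_one_copy(uint8_t *pinned_block, const uint8_t *src,
                           size_t size)
{
    size_t n = size < PIN_BLOCK_SIZE ? size : PIN_BLOCK_SIZE;
    memcpy(pinned_block, src, n);   /* the "one copy" */
    /* start_dma(virt_to_phys(pinned_block), n); */
    return n;
}
```

Zero-copy DMA would skip the staging memcpy entirely, at the cost of translating and pinning each source page.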




                                                     TPIL Performance:
                                           LANai 9 NI with Pentium III-550 MHz Host



[Figure: bandwidth (MBytes/s, 0–140) vs. injection size (10 B to 1,000,000 B)
for six injection methods: DMA 0-Copy, DMA 1-Copy DB, DMA 1-Copy, PIO SSE,
PIO MMX, and PIO Memcpy.]
                      Network Delivery (NI-NI)




                                  Network Delivery (NI-NI)


      Reliably transfer message between pairs of NIs
           – Each NI basically has two threads: Send and Receive


      Reliability
           – SANs are usually error free
           – Worried about buffer overflows in NI cards
           – Two approaches to flow control: host-level, NI-level




[Figure: a sending network interface and a receiving network interface
connected by the SAN.]
                               Host-managed Flow Control


      Reliability managed by the host
           – Host-level credit system
           – NI just transfers messages between host and wire




    Good points                                         Bad points
       – Easier to implement                               – Poor NI buffer utilization
       – Host CPU faster than NI                           – Retransmission overhead



[Figure: a Send crosses from the sending endpoint over PCI, through its
network interface and the SAN, to the receiving endpoint; the receiving
host returns a credit-carrying Reply.]
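A minimal sketch of the host-level credit scheme, assuming one credit per receiver buffer; the names and structure are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Host-level credit flow control: the sender starts with one credit per
 * message buffer at the receiver; a send consumes a credit, and each
 * Reply from the receiving host returns one. */
typedef struct {
    int credits;  /* receiver buffers known to be free */
} credit_state_t;

static bool try_send(credit_state_t *s)
{
    if (s->credits == 0)
        return false;   /* sending now could overflow the receiver's NI */
    s->credits--;       /* one receiver buffer is now committed */
    return true;
}

static void on_reply(credit_state_t *s)
{
    s->credits++;       /* the receiver drained one buffer */
}
```

The NI stays simple under this scheme, but credits pessimistically reserve receiver buffers, which is the poor buffer utilization noted above.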
                                 NI-Managed Flow Control


      NI manages reliable transmission of message
           – NIs use control messages (ACK/NACK)




    Good points                                         Bad points
       – Better dynamic buffer use                         – Harder to implement
       – Offloads host CPU                                 – Added overhead for NI




[Figure: DATA flows from the sending endpoint over PCI, through the sending
NI, across the SAN, to the receiving NI and endpoint; the receiving NI
returns an ACK to the sending NI.]
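A minimal sketch of the NI-level ACK bookkeeping, assuming cumulative ACKs by sequence number; the field and function names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* NI-level flow control: the sending NI keeps each message buffered until
 * the receiving NI ACKs it; on a NACK (receiver buffers full) the NI
 * would retransmit from its own copy. */
typedef struct {
    int next_seq;  /* sequence number for the next new message */
    int acked;     /* highest sequence number ACKed so far     */
} ni_tx_t;

static int ni_send(ni_tx_t *s)
{
    return s->next_seq++;   /* message stays buffered on the NI */
}

static void ni_on_ack(ni_tx_t *s, int seq)
{
    if (seq > s->acked)
        s->acked = seq;     /* NI buffers up to 'seq' can be freed */
}

static bool ni_all_delivered(const ni_tx_t *s)
{
    return s->acked == s->next_seq - 1;
}
```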

                           Ejection (NI-Host)




                           Message Ejection (NI-Host)



 Move the message to the host
    – Store it close to the host CPU

 Incoming message queue
    – Pinned, contiguous memory
    – NI can write into it directly
    – Host extracts messages
    – Host reassembles fragments

[Figure: the network interface writes arriving messages into an incoming
queue in host memory, close to the CPU.]
 How does host see new messages?



                             Notification: Polling


      Applications explicitly call extract()
           – Call examines queue front & back pointers
           – Processes message if available




    Good points                             Bad points
       – Good performance                      – Waste time if no messages
       – Can tuck away in a thread             – Queue can backup
       – User has more control                 – Code can be messy




                                  Notification: Interrupts

          NI invokes interrupt after putting message in queue
            –   Host stops whatever it was doing
            –   Device driver’s Interrupt service routine (ISR) catches
            –   ISR uses UNIX signal infrastructure to pass to application
            –   Application catches the signal and executes extract()




    Good points                                      Bad points
         – No wasted polling time                       – High overhead
                                                             – Interrupts: 10 us
                                                        – Host is constantly interrupted




[Figure: notification path: NI interrupt → device driver ISR → UNIX signal →
application signal handler → extract().]

                           Other APIs: Remote Memory Ops

         Often just passing data
           – Don’t disturb receiving application

         Remote memory operations
           – Fetch, store remote memory
           – NI executes transfer directly (no need for notification)
           – Virtual addresses translated by the NI (and cached)




[Figure: two hosts, each with a CPU, memory, and NI; the NIs transfer data
directly across the SAN without involving the remote CPU.]
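The kind of descriptor a host might post for a remote memory operation can be sketched as below; every field and function name here is hypothetical, not a real NI's API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A descriptor the host posts to the NI for a remote memory operation;
 * the NI performs the transfer and the VA translation itself, so the
 * remote CPU is never notified. */
typedef enum { RM_PUT, RM_GET } rm_op_t;

typedef struct {
    rm_op_t  op;           /* store (put) or fetch (get)            */
    int      remote_node;  /* target NI                             */
    uint64_t remote_va;    /* remote virtual address, NI-translated */
    uint64_t local_va;     /* local buffer address                  */
    size_t   len;          /* bytes to transfer                     */
} rm_desc_t;

static rm_desc_t make_put(int node, uint64_t remote_va, uint64_t local_va,
                          size_t len)
{
    rm_desc_t d = { RM_PUT, node, remote_va, local_va, len };
    return d;
}
```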
                                 The Message Path


[Figure: the full message path: sending CPU → OS → memory → PCI bus → NI →
network → NI → PCI bus → memory → OS → receiving CPU.]

           Wire bandwidth is not the bottleneck!
            Operating system and/or user-level software limits performance


                        Universal Performance Metrics


[Figure: message timeline: Sender Overhead (processor busy), Transmission
time (size ÷ bandwidth), Time of Flight, and Receiver Overhead (processor
busy); Transport Latency covers the network portion, and Total Latency
spans the entire sequence.]

Total Latency = Sender Overhead + Time of Flight +
                Message Size ÷ BW + Receiver Overhead
  Does the BW term include header/trailer bytes?

  From Patterson, CS252, UCB
                               Simplified Latency Model


     Total Latency = Overhead + Message Size / BW


     Overhead = Sender Overhead + Time of Flight +
                         Receiver Overhead


     Can relate overhead to network bandwidth utilization
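The simplified model can be turned into a small calculator; the numbers in the example are illustrative, not measured values:

```c
#include <assert.h>

/* Simplified latency model from the slide:
 *   TotalLatency = Overhead + MessageSize / BW
 * where Overhead = SenderOverhead + TimeOfFlight + ReceiverOverhead. */
static double total_latency_us(double overhead_us, double size_bytes,
                               double bw_bytes_per_us)
{
    return overhead_us + size_bytes / bw_bytes_per_us;
}

/* Delivered bandwidth as the application sees it: overhead eats into the
 * raw wire rate, especially for small messages. */
static double effective_bw(double overhead_us, double size_bytes,
                           double bw_bytes_per_us)
{
    return size_bytes /
           total_latency_us(overhead_us, size_bytes, bw_bytes_per_us);
}
```

With an assumed 10 µs overhead and a 100 bytes/µs wire, a 1000-byte message takes 20 µs, so the application sees only half the wire bandwidth.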




  From Patterson, CS252, UCB
                         Commercial Example




                    Scalable Switching Fabrics for Internet Routers




                                          Router




           Internet bandwidth growth  routers with
             – large numbers of ports
             – high bisection bandwidth
           Historically these solutions have used
             – Backplanes
             – Crossbar switches
           White paper: Scalable Switching Fabrics for Internet Routers,
             by W. J. Dally, http://www.avici.com/technology/whitepapers/

                                      Requirements


           Scalable
             – Incremental
             – Economical  cost linear in the number of nodes
           Robust
             – Fault tolerant  path diversity + reconfiguration
             – Non-blocking features
           Performance
             – High bisection bandwidth
             – Quality of Service (QoS)
                    – Bounded delay




                                   Switching Fabric




           Three components
             – Topology  3D torus
             – Routing  source routing with randomization
             – Flow control  virtual channels and virtual networks
           Maximum configuration: 14 x 8 x 5 = 560
           Channel speed is 10 Gbps

                                                         Packaging

           Uniformly short wires between
            adjacent nodes
             – Can be built in passive
               backplanes
             – Run at high speed
                     – Bandwidth inversely proportional
                       to square of wire length
             – Cabling costs
             – Power costs




               Figures are from Scalable Switching Fabrics for Internet Routers, by W. J. Dally (can be found at www.avici.com)
                                            Available Bandwidth




           Distinguish between capacity and I/O bandwidth
             – Capacity: Traffic that will load a link to 100%
             – I/O bandwidth: bit rate in or out
            Discontinuities



                                                          Properties




           Path diversity
             – Avoids tree saturation
             – Edge disjoint paths for fault tolerance
                      – Heartbeat checks (100 microseconds) + deflection while tables are updated




                             Use of Virtual Channels


           Virtual channels aggregated into virtual networks
             – Two networks for each output port


           Distinct networks prevent undesirable coupling
             – Only bandwidth on a link is shared
             – Fair arbitration mechanisms


           Distinct networks enable QoS constraints to be met
             – Separate best effort and constant bit rate traffic




                                       Summary


            Distinguish between traditional networking and high-performance
             multiprocessor communication

           Hierarchy of implementations
             – Physical, switching and routing
             – Protocol families and protocol layers (the protocol stack)


           Datapath and architecture of the switches

           Metrics
             – Bisection bandwidth
             – Reliability
             – Traditional latency and bandwidth



                                  Study Guide


           Given a topology and relevant characteristics such as channel
            widths and link bandwidths, compute the bisection bandwidth
           Distinguish between switching mechanisms based on how
            channel buffers are reserved/used during message
            transmission
           Latency expressions for different switching mechanisms
           Compute the network bisection bandwidth when the software
            overheads of message transmission are included
            Identify the major delay elements in the message transmission
             path, starting at the send() call and ending with the receive()
             call
           How do costs scale in different topologies
             – Latency scaling
             – Unit of upgrade  cost of upgrade
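The first study-guide item can be illustrated for a k × k mesh and torus, assuming uniform bidirectional links of equal bandwidth: a cut halving the mesh crosses k links, while the torus's wraparound channels double that.

```c
#include <assert.h>

/* Bisection bandwidth of k x k 2D networks with uniform links:
 * halving a mesh cuts k links; a torus's wraparound channels
 * double the count to 2k. Multiply by the per-link bandwidth. */
static double mesh_bisection_bw(int k, double link_bw)
{
    return k * link_bw;
}

static double torus_bisection_bw(int k, double link_bw)
{
    return 2.0 * k * link_bw;
}
```

For example, an 8 × 8 torus with 10 Gbps links has a bisection bandwidth of 160 Gbps, twice that of the equivalent mesh.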

