

College of Engineering & Computer Science
Department of Electrical & Computer Engineering

Off-chip Communication Architectures for
  High Throughput Network Processors

                 Short version of
                  Final Defense

                   Jacob Engel
   Line card design challenges
     – Rapidly growing line rates
     – Additional deep packet processing operations
     – Increase in memory capacity requirements
   The “memory wall”
     – Nature of packet transmission
     – Variable packet size
     – Out-of-order transmission
   Off-chip vs. on-chip
     – Off-chip interconnects lack innovative methods to improve their
       integration into large, scalable network components
   PCB physical limitations
     – Area
     – I/O pins
     – Switching elements
     – Scalability

   [Figure: line card block diagram. Labels: OS, UART, Routing, Memory, SONET, Backplane, Forwarding, ATM, Utopia]
              Interconnect Topologies
 Bus
 k-ary n-cubes
   Mesh
       N nodes arranged in a rectangle or
       Each node connected to 4 neighbors
       Cost: N switches (one per node)
   Torus
       Mesh with wrap-around
       Longer cables than simple mesh
       Harder to partition
   Hypercubes
       Multi-dimensional cubes
       Nodes identified by n-bit numbers
       Each node connected to other nodes that differ in a single bit
       Cost: N switches (one per node)

 [Figures: Mesh (4-ary 2-cube network); Torus network (k-ary n-cube with wraparound); 3D Hypercube (2-ary n-cube)]
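
The neighbor rules above can be sketched in a few lines. This is a minimal illustration, not code from the thesis: the function names and the coordinate-tuple node representation are assumptions made for the example.

```python
def mesh_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-dimensional mesh (no wrap-around).

    `node` is a tuple of n coordinates, each in range(k). Edge nodes
    have fewer neighbors than interior nodes.
    """
    result = []
    for dim in range(len(node)):
        for step in (-1, 1):
            c = node[dim] + step
            if 0 <= c < k:
                result.append(node[:dim] + (c,) + node[dim + 1:])
    return sorted(result)


def torus_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-cube with wrap-around links."""
    neighbors = set()
    for dim in range(len(node)):
        for step in (-1, 1):
            c = (node[dim] + step) % k  # modulo implements the wrap-around
            neighbors.add(node[:dim] + (c,) + node[dim + 1:])
    neighbors.discard(node)  # only matters for the degenerate case k == 1
    return sorted(neighbors)


def hypercube_neighbors(node_id, n):
    """Neighbors of an n-bit node number: flip exactly one bit."""
    return [node_id ^ (1 << bit) for bit in range(n)]
```

For a corner node of a 4-ary 2-cube, the mesh version returns two neighbors and the torus version returns four, which is the difference the wrap-around links make.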
             Switching Mechanism in Interconnects

   Bus-based network
     – Drawback
           Does not scale up with number of processors
   Crossbar switching network
     – Allows connection of any of p processors to any of b memory banks
     – Number of switches: n²
     – Wiring complexity: O(n²w)
     – Disadvantages
           Complexity grows as a function of p²
           Expensive for large number of processors
           Physical dimensions (switches + wiring)

   [Figures: Bus-based network; Crossbar network]
             Switching Mechanism in Interconnects

   Omega switching network
     – More scalable in terms of cost than crossbar,
       more scalable in terms of performance than bus
     – Number of switches: (n/2)log_k(n)
          n = # channels; k = # switches/box
     – Wiring complexity: O(n w log_k(n))
     – Drawbacks
          Limited switching flexibility
          Blocking

   [Figure: Omega switch]
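
To make the cost gap concrete, the two switch-count formulas from the slides can be evaluated side by side. This is a sketch under assumptions: the function names are hypothetical, n is taken to be an exact power of k, and k defaults to 2.

```python
def stages(n, k):
    """Return log_k(n) for n an exact power of k, using integer arithmetic."""
    count, size = 0, 1
    while size < n:
        size *= k
        count += 1
    if size != n:
        raise ValueError("n must be an exact power of k")
    return count


def crossbar_switches(n):
    """Crossbar switch count from the slides: n squared."""
    return n * n


def omega_switches(n, k=2):
    """Omega network switch count from the slides: (n/2) * log_k(n)."""
    return (n / 2) * stages(n, k)


for n in (8, 64, 512):
    print(n, crossbar_switches(n), omega_switches(n))
```

With n = 64 and k = 2 the crossbar needs 4096 switches while the omega network needs 192.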
               K-ary n-cube based architectures

   Packet-based, with multiple paths
   Packets are shared among PEs and memories (M)
   PEs = TM, QoS, Classification…
   An oversubscribed or faulty link does not cut off
    connectivity to its nodes
   Uses wormhole routing
   Packets are routed adaptively based on
    traffic load and connectivity

   [Figures: 8-ary 2-cube interconnect; 4-ary 3-cube interconnect; 3D-mesh interconnect; Shared-bus]
                              Wormhole routing

   Wormhole routing operates on flits
    –  Typically what can be transmitted in a single cycle
    –  Flit = channel width
   Packet header is typically one or two flits
   The rest of the flits do not contain routing information
   Known for its improved latency and throughput


   [Figure: a worm of four flits (4, 3, 2, 1) advancing through Node 1, Node 2, Node 3 and Node 4]
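
The flit mechanics described above can be sketched as follows. This is an illustrative model, not the simulator from this work. It assumes one header flit, a payload given as a bit string, one hop per cycle per flit, and no contention. The names `to_flits` and `arrival_cycles` are hypothetical.

```python
def to_flits(payload_bits, channel_width, dest):
    """Split a packet into one header flit plus body flits.

    Only the header carries routing information (the destination).
    Each body flit holds at most `channel_width` bits, so one flit can
    cross a channel per cycle.
    """
    flits = [("HEAD", dest)]
    for start in range(0, len(payload_bits), channel_width):
        flits.append(("BODY", payload_bits[start:start + channel_width]))
    return flits


def arrival_cycles(num_flits, hops):
    """Cycle at which each flit reaches the destination.

    Assumes an unloaded path: the header needs `hops` cycles, and each
    following flit trails the previous one by a single cycle.
    """
    return [hops + i for i in range(num_flits)]


worm = to_flits("1" * 96, channel_width=32, dest=4)
print(len(worm), arrival_cycles(len(worm), hops=3))
```

Because body flits follow the header in a pipeline, the last flit of a four-flit worm arrives only three cycles after the header, regardless of how many hops the path has.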

K-ary n-cube based architectures
                         Performance measures

                           – Latency = T_w + T_s + T_r (wire propagation,
                             switching and routing delays)
                           – Latency does not consider
                             queuing delays at this point
                           – Latency is measured per flit
                             per flow

                           – Throughput=ch_width/L
                           – Bi-directional links will
                             increase the throughput
                           – Aggregate throughput is a
                             function of #PE as well
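
The two measures can be evaluated as a small worked example. The delay values below are made up for illustration, and reading L in the throughput formula as the per-flit latency is an assumption.

```python
def flit_latency(t_wire, t_switch, t_route):
    """Per-flit latency = T_w + T_s + T_r (queuing delays not included)."""
    return t_wire + t_switch + t_route


def channel_throughput(channel_width_bits, latency):
    """Throughput = ch_width / L, with L assumed to be the latency above."""
    return channel_width_bits / latency


# Hypothetical delays in nanoseconds, chosen only for the example.
latency_ns = flit_latency(t_wire=1.0, t_switch=2.0, t_route=1.0)
print(latency_ns, channel_throughput(32, latency_ns))  # bits per ns = Gbit/s
```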
                Flow control mechanisms
[Figure: Virtual Channel Effects. Panels: Virtual Channel Inactive; Virtual Channel Active (VC)]
[Figure: Sub-Channel Variation Effects. Panels: Channel/Packet Size 32 Bits, 1 channel; Channel/Packet Size 16 Bits, 2 sub-channels; Channel/Packet Size 8 Bits, 4 sub-channels]
                     The traffic controller (TC)

   Switch module
     – Receives status from all TC modules
     – Switches I/O ports
   Channel Sampler
     – Ports status
     – “busy” / “not-busy”
   Routing algorithm
     – Shortest path
     – Ports status
   Virtual channels
     – VC on/off
     – # of available VC
     – VC occupancy status
   Channels partitioning
     – Sub-channeling on/off
     – # of available SC
     – SC occupancy status

   [Figure: TC architecture. Blocks: SW, Channel Sampler, Virtual Channels, Partition; Ports A, B, C and D]
   [Figure: PE/Memory connectivity. Labels: M, TC, PE]
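
One way the routing decision could combine shortest path, port status and virtual channels is sketched below. This is a hypothetical illustration and not the TC logic from this work. Ports are modeled as (dimension, direction) pairs on a mesh without wrap-around, and `select_output` is an invented name.

```python
def select_output(current, dest, busy, free_vcs):
    """Pick an output port for a header flit.

    current, dest: coordinate tuples of the current and destination nodes.
    busy: set of (dim, step) ports the channel sampler reports as busy.
    free_vcs: dict mapping a port to its number of free virtual channels.
    Returns (port, how), where how says which resource was granted.
    """
    # Ports that move the worm closer to the destination (shortest path).
    productive = [
        (dim, 1 if d > c else -1)
        for dim, (c, d) in enumerate(zip(current, dest))
        if c != d
    ]
    if not productive:
        return None, "arrived"
    # First choice: a shortest-path port whose channel is not busy.
    for port in productive:
        if port not in busy:
            return port, "channel"
    # Second choice: wait in a free virtual channel of a busy port.
    for port in productive:
        if free_vcs.get(port, 0) > 0:
            return port, "virtual channel"
    return None, "blocked"
```

When every productive port is busy and no virtual channel is free, the sketch reports the worm as blocked.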
                       The Network simulator

[Figures: 3D-mesh interconnect; 4-ary 3-cube interconnect; 8-ary 2-cube interconnect]
[Figures: Interconnect simulation; The interconnect; Simulation speed; Runtime performance data]
                        The Network simulator architecture
[Figure: simulator block diagram. Modules: Worms Jar (shown twice), Interconnect Configuration Manager, The Interconnect (a grid of M and PE nodes), Traffic Sampler, Interconnect Properties, Algorithm, Worm Manager, Data, User Interface]

Annotations on the diagram:

• A worm container
• Contains worms to model and worms done modeling

• Interconnect type, configuration & properties
• Data is updated from the user interface

• The interconnect type & configuration to simulate

• Records performance

• Timing for entering worms
• Traffic load feedback

• Calculates the shortest path route for each worm

• Output files which contain all simulated data (worm properties and performance)

• Orchestrates the simulation process
• Provides control signals to all other modules participating in simulation

• Command line or GUI
• User system parameters
        Performance results: Latency

                                                • Latency denotes the time it takes
                                                  a message to reach destination

                                                • Latency includes wire propagation,
                                                  switching and routing delays

                                                • Latency of 3D-mesh for both short
                                                  and long messages was the smallest
                                                  of all three interconnects

                                                • The results shown represent the
                                                  average latency measured for both
                                                  short and long messages

K-ary n-cube interconnects latency comparison
               Performance results: Latency

                                                      • Offered load determines the probability
                                                        that each node comprising the interconnect
                                                        will generate a message within each cycle

                                                      • If offered load=0.1, each node has a 10% chance of
                                                        generating a message in a given cycle, so on average
                                                        10% of the nodes in the interconnect generate a
                                                        message at each cycle

                                                      • As the offered load increases, the latency
                                                        increases exponentially for all the interconnects

                                                      • 3D-mesh has the lowest latency and is able
                                                        to sustain higher traffic load
K-ary n-cube interconnects latency vs. offered load
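
The offered-load definition corresponds to an independent per-node, per-cycle random draw. A minimal sketch, assuming a seeded pseudo-random generator and a hypothetical function name:

```python
import random


def messages_per_cycle(num_nodes, offered_load, cycles, seed=0):
    """Count how many nodes generate a message in each cycle.

    Every node independently generates a message with probability
    `offered_load` in every cycle.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    counts = []
    for _ in range(cycles):
        counts.append(sum(rng.random() < offered_load for _ in range(num_nodes)))
    return counts


counts = messages_per_cycle(num_nodes=64, offered_load=0.1, cycles=1000)
print(sum(counts) / len(counts))  # close to 64 * 0.1 = 6.4 on average
```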
                  Performance results: Throughput

• 3D-mesh reached the highest peak
  throughput for both short and
  long messages

• 4-ary 3-cube outperforms 8-ary 2-cube
  in all measurements

• Higher throughput when long worms
  are modeled as a result of the routing
  algorithm (wormhole routing)

                                           K-ary n-cube interconnects throughput comparison
                 Routing accuracy comparison

• VCs and SC significantly increased routing accuracy
• 3D-mesh RA=96% vs. 85% (8-ary 2-cube) and
  84% (4-ary 3-cube)

[Figures: 3D-mesh, 8-ary 2-cube and 4-ary 3-cube routing accuracy with VC & SC enabled]
Bandwidth utilization rate

                 • Bandwidth utilization = (# of occupied
                   channels) / (total # of channels)

                 • VCs as well as SC increase bandwidth
                   utilization rate

                 • Larger gap from no-VC/SC to VC+2SC
                   than from VC+2SC to VC+4SC
                      Failure rate vs. VC size

• As VC size increases, failure rate decreases

• Tradeoff: failure rate vs. VCs
  size (area and cost)

                                       3D-mesh failure rate as a function of VC size
                       Throughput comparison

                                                                     • Throughput of common
                                                                       interconnects is based on results
                                                                       provided by their vendors

                                                                     • All k-ary n-cube interconnects
                                                                       utilize both VCs and SC

                                                                     • 3D-mesh has the highest
                                                                       throughput of all the
                                                                       interconnects compared

Average throughput comparison among high-performance interconnects

• We presented k-ary n-cube based interconnects as off-chip
  communications architectures for line cards to increase the
  throughput of the currently used memory system

• We designed a new mixed-radix, non-symmetrical k-ary n-cube based
  interconnect called the 3D-mesh (a variation of 2-ary 3-cube)

• We include multiple, highly efficient techniques to route, switch and
  control packet flows in order to increase throughput and interconnect
  utilization, while minimizing traffic congestion and packet loss

• We identify the processor-memory configuration that performs best
  among the multiple configurations evaluated

• We developed a custom-designed, event-driven simulator to evaluate
  the performance of packet-based, off-chip, k-ary n-cube interconnects

• Our results show that k-ary n-cube interconnect architectures
  provide higher throughput and can sustain higher traffic loads

• 3D-mesh achieved the best performance of all the
  interconnects tested

• 3D-mesh is a scalable, cost-effective solution which complies with
  all the functional as well as the physical constraints on the line card
                   Future work

• The results of this work can also be used for

     • PCs
     • On-chip communication architectures

• Future directions for this work include

     • Board implementation of the interconnect
     • Testing the interconnect with off-the-shelf components
     • Expanding our simulation framework to include
           • Higher-dimension k-ary n-cube networks
           • Internet traffic workloads
