College of Engineering & Computer Science
Department of Electrical & Computer Engineering
Off-chip Communication Architectures for
High Throughput Network Processors
Short version of
Final Defense
Jacob Engel
Motivation
Line card design challenges
– Rapidly growing line rates
– Additional deep packet processing operations
– Increase in memory capacity requirements
The “memory wall”
– Nature of packet transmission
– Variable packet size
– Out-of-order transmission
Off-chip vs. on-chip
– Off-chip interconnects lack innovative methods to improve their integration into large scalable network components
PCB physical limitations
– Area
– I/O pins
– Switching elements
– Scalability
[Figure: line card block diagram. Labels: NPU, OS, application, buffer memory, routing table memory, packet data memory, forwarding engine, backplane connector, UART, SONET, Ethernet, ATM, Utopia]
Interconnect Topologies
Bus
k-ary n-cubes
Mesh
– N nodes arranged in a rectangle or square
– Each interior node connected to 4 neighbors
– Cost: N switches (one per node)
Torus
– Mesh with wrap-around
– Longer cables than simple mesh
– Harder to partition
Hypercubes
– Multi-dimensional cubes
– Nodes identified by n-bit numbers
– Each node connected to other nodes that differ in a single bit
– Cost: N switches (one per node)
[Figures: mesh (4-ary 2-cube network); torus network (k-ary n-cube with wraparound); 3D hypercube (2-ary n-cube)]
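The neighbor rules above can be sketched in a few lines. This is a minimal illustration, not code from the thesis: the function names `torus_neighbors` and `hypercube_neighbors` and the tuple/integer node encodings are assumptions made for the example.

```python
def torus_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-cube with wraparound.

    `node` is a tuple of n digits, each in 0..k-1 (hypothetical encoding).
    """
    neighbors = set()
    for dim, digit in enumerate(node):
        for step in (-1, 1):
            moved = list(node)
            moved[dim] = (digit + step) % k  # wrap-around link
            neighbors.add(tuple(moved))
    neighbors.discard(node)  # only matters for the degenerate case k == 1
    return neighbors


def hypercube_neighbors(node_id, n):
    """Neighbors of an n-bit node id: flip exactly one bit."""
    return {node_id ^ (1 << bit) for bit in range(n)}


# 4-ary 2-cube torus: a corner node still has 4 neighbors thanks to wraparound.
print(sorted(torus_neighbors((0, 0), k=4)))     # [(0, 1), (0, 3), (1, 0), (3, 0)]
# 3D hypercube: node 000 connects to 001, 010 and 100.
print(sorted(hypercube_neighbors(0b000, n=3)))  # [1, 2, 4]
```

With k = 2 the +1 and -1 steps reach the same node, so a 2-ary n-cube node has n neighbors, matching the hypercube rule.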
Switching Mechanism in Interconnects
Bus-based network
– Drawback
Does not scale up with number of processors
Crossbar switching network
– Allows connection of any of p processors to any of b memory banks
– Number of switches: n²
– Wiring complexity: O(n²w)
– Disadvantage
Its complexity grows as a function of p²
Expensive for large number of processors
Physical dimensions (switches + wiring)
[Figures: bus-based network; crossbar network]
Switching Mechanism in Interconnects
Omega switching network
– More scalable in terms of cost than crossbar, more scalable in terms of performance than bus
– Number of switches: (n/2)log_k(n), where n = # channels and k = # switches/box
– Wiring complexity: O(n w log_k(n))
– Drawbacks
Limited switching flexibility
Blocking
[Figure: Omega switch]
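To make the cost gap concrete, the two switch-count formulas from the slides can be evaluated side by side. This is a sketch under stated assumptions: it uses k = 2 only (2x2 switch boxes, n a power of two), and the function names are hypothetical.

```python
def crossbar_switches(n):
    """Crossbar: n^2 switching elements for n channels."""
    return n * n


def omega_switches(n):
    """Omega network with k = 2: (n/2) * log2(n) switch boxes.

    Assumes n is a power of two; other values are rejected.
    """
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two >= 2")
    return (n // 2) * stages


for n in (8, 64, 1024):
    print(n, crossbar_switches(n), omega_switches(n))
# 8 64 12
# 64 4096 192
# 1024 1048576 5120
```

The counts show why the crossbar becomes expensive for many processors, while the Omega network pays for its lower cost with blocking.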
K-ary n-cube based architectures
Packet-based, multiple-path
Packets are shared among processing elements (PE) and memory modules (M)
PEs = TM, QoS, Classification…
An oversubscribed or faulty link does not prevent connectivity to its nodes
Uses wormhole routing
Packets are routed adaptively based on traffic load and connectivity
[Figures: 8-ary 2-cube interconnect; 4-ary 3-cube interconnect; 3D-mesh interconnect; shared bus]
Wormhole routing
Wormhole routing operates on flits
– Typically what can be transmitted in a single cycle
– Flit = channel width
Packet header is typically one or two flits
The rest of the flits do not contain routing information
Known for its improved latency and throughput
[Figure: a message with its header, divided into flits 1 to 4, routed through Node 1, Node 2, Node 3 and Node 4]
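A rough cycle count shows where the latency advantage of wormhole routing comes from. The sketch below assumes a contention-free path, one cycle per flit per hop, and no extra routing delay; the function names and the 512-bit packet are illustrative assumptions, and store-and-forward is used only as a familiar baseline for comparison.

```python
import math


def flits_per_packet(packet_bits, channel_width_bits):
    """Flit = channel width, so a packet splits into ceil(bits / width) flits."""
    return math.ceil(packet_bits / channel_width_bits)


def wormhole_cycles(hops, flits):
    """Header takes `hops` cycles; the remaining flits follow it in a pipeline."""
    return hops + flits - 1


def store_and_forward_cycles(hops, flits):
    """Baseline: every node receives the whole packet before forwarding it."""
    return hops * flits


flits = flits_per_packet(packet_bits=512, channel_width_bits=32)
print(flits)                               # 16
print(wormhole_cycles(3, flits))           # 18
print(store_and_forward_cycles(3, flits))  # 48
```

Three hops corresponds to the Node 1 to Node 4 path in the figure. Under these assumptions the wormhole latency grows with hops plus flits, not hops times flits.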
K-ary n-cube based architectures
Performance measures
– Latency = T_w + T_s + T_r (wire propagation, switching and routing delays)
– Latency does not consider queuing delays at this point
– Latency is measured per flit per flow
– Throughput = ch_width / L, where L is the latency
– Bi-directional links will increase the throughput
– Aggregate throughput is a function of the number of PEs as well
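A small worked example of the two performance measures follows. The delay values are made up for illustration and are not results from this work. Reading L as the latency and counting a bi-directional link as two channels are assumptions of this sketch.

```python
def flit_latency_ns(t_wire_ns, t_switch_ns, t_route_ns):
    """Latency = T_w + T_s + T_r (queuing delays are not included)."""
    return t_wire_ns + t_switch_ns + t_route_ns


def link_throughput_gbps(channel_width_bits, latency_ns, bidirectional=False):
    """Throughput = ch_width / L; bits per nanosecond equals Gbit/s."""
    throughput = channel_width_bits / latency_ns
    # Assumption: a bi-directional link is counted as two independent channels.
    return 2 * throughput if bidirectional else throughput


latency = flit_latency_ns(t_wire_ns=1.0, t_switch_ns=2.0, t_route_ns=1.0)
print(latency)                                                  # 4.0
print(link_throughput_gbps(32, latency))                        # 8.0
print(link_throughput_gbps(32, latency, bidirectional=True))    # 16.0
```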
Flow control mechanisms
[Figure, panel 1: Virtual Channel Effects (virtual channel inactive vs. virtual channel active)]
[Figure, panel 2: Sub-Channel Variation Effects (channel/packet size 32 bits, 1 channel; 16 bits, 2 sub-channels; 8 bits, 4 sub-channels)]
The traffic controller (TC)
Switch module
– Receives status from all TC modules
– It switches I/O ports
Channel Sampler
– Ports status
– “busy” / “not-busy”
Routing algorithm
– Shortest path
– Ports status
Virtual channels
– VC on/off
– # of available VC
– VC occupancy status
Channels partitioning
– Sub-channeling on/off
– # of available SC
– SC occupancy status
[Figure: TC architecture. Ports A, B, C and D around a switch (SW) module, with Channel Sampler, Routing Algorithm, Virtual Channels and Channels Partition blocks]
[Figure: PE/Memory connectivity. M and PE nodes, each attached through TCs]
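One way to combine "shortest path" with "busy / not-busy" port status is a breadth-first search that skips busy links. This is only a sketch of the idea, not the TC's actual algorithm: the function names, the tuple node encoding and the `busy_links` set of directed (from, to) pairs are all assumptions.

```python
from collections import deque


def neighbors(node, k):
    """Yield the neighbors of `node` in a k-ary n-cube with wraparound."""
    for dim in range(len(node)):
        for step in (-1, 1):
            nxt = list(node)
            nxt[dim] = (nxt[dim] + step) % k
            yield tuple(nxt)


def shortest_path(src, dst, k, busy_links=frozenset()):
    """Return the shortest list of nodes from src to dst, or None.

    `busy_links` holds directed (from_node, to_node) pairs reported busy.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors(node, k):
            if nxt not in parent and (node, nxt) not in busy_links:
                parent[nxt] = node
                queue.append(nxt)
    return None  # every route is blocked


# 4-ary 2-cube: the wrap-around link gives a one-hop route.
print(shortest_path((0, 0), (0, 3), k=4))  # [(0, 0), (0, 3)]
# If that link is busy, the search falls back to a three-hop detour.
detour = shortest_path((0, 0), (0, 3), k=4, busy_links={((0, 0), (0, 3))})
print(len(detour) - 1)  # 3
```

Because the multiple-path topology offers alternative routes, a single busy link lengthens the path but does not cut off the destination.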
The Network simulator
[Figure: simulator screenshots showing the 3D-mesh, 4-ary 3-cube and 8-ary 2-cube interconnects, the interconnect simulation view, simulation speed and runtime performance data]
The Network simulator architecture
[Figure: simulator block diagram. Blocks: User Interface, Manager, Interconnect Configuration, Interconnect Properties, The Interconnect (grid of M and PE nodes), Worms Jar (two instances), Scheduler, Traffic Sampler, Routing Algorithm, Worm Data, Performance Results]
Annotations on the diagram:
• The Interconnect: the interconnect type & configuration to simulate
• Worms Jar: a worm container; contains worms to model and worms done modeling
• Records performance data
• Interconnect type, configuration & properties; data is updated from the user interface
• Timing for entering worms
• Traffic load feedback
• Routing Algorithm: calculates the shortest path route for each worm
• Output files which contain all simulated data (worms properties and performance)
• Manager: orchestrates the simulation process; provides control signals to all other modules participating in simulation
• User Interface: command line or GUI; user system parameters
Performance results: Latency
• Latency denotes the time it takes a message to reach its destination
• Latency includes wire propagation, switching and routing delays
• Latency of 3D-mesh for both short and long messages was the smallest of all three interconnects
• The results shown represent the average latency measured for both short and long messages
[Figure: k-ary n-cube interconnects latency comparison]
Performance results: Latency
• Offered load determines the probability that each node in the interconnect will generate a message within each cycle
• If offered load = 0.1, each node has a 10% chance of generating a message in each cycle, so on average 10% of the nodes generate a message per cycle
• As the offered load increases, the latency increases exponentially for all the interconnects
• 3D-mesh has the lowest latency and is able to sustain higher traffic load
[Figure: k-ary n-cube interconnects latency vs. offered load]
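The offered-load definition above can be modeled as one independent random draw per node per cycle. The sketch below is an assumption about how such a traffic source might look, not the simulator's actual generator; `generate_messages`, the 64-node size and the seed are hypothetical.

```python
import random


def generate_messages(num_nodes, offered_load, cycles, seed=0):
    """Return, for each cycle, how many nodes generated a message.

    Each node generates a message with probability `offered_load` per cycle.
    """
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    counts = []
    for _ in range(cycles):
        generated = sum(1 for _ in range(num_nodes) if rng.random() < offered_load)
        counts.append(generated)
    return counts


counts = generate_messages(num_nodes=64, offered_load=0.1, cycles=2000)
# The mean should sit close to 64 * 0.1 = 6.4 messages per cycle.
print(sum(counts) / len(counts))
```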
Performance results: Throughput
• 3D-mesh reached the highest peak throughput for both short and long messages
• 4-ary 3-cube outperforms 8-ary 2-cube in all measurements
• Higher throughput when long worms are modeled as a result of the routing algorithm (wormhole routing)
[Figure: k-ary n-cube interconnects throughput comparison]
Routing accuracy comparison
• VCs and SC significantly increased routing accuracy
• 3D-mesh RA = 96% vs. 85% (8-ary 2-cube) and 84% (4-ary 3-cube)
[Figures: routing accuracy with VC & SC enabled for the 8-ary 2-cube, the 3D-mesh and the 4-ary 3-cube]
Bandwidth utilization rate
• Bandwidth utilization = (# of occupied channels) / (total # of channels)
• VCs as well as SC increase bandwidth utilization rate
• Larger gap from no-VC/SC to VC+2SC than from VC+2SC to VC+4SC
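The utilization ratio is simple to compute from a snapshot of channel states. A minimal sketch, assuming each channel is reported as a boolean "occupied" flag; the function name and the sample snapshot are made up for illustration.

```python
def bandwidth_utilization(channel_occupied):
    """(# of occupied channels) / (total # of channels) for one snapshot."""
    flags = list(channel_occupied)
    if not flags:
        raise ValueError("at least one channel is required")
    return sum(flags) / len(flags)


# Hypothetical snapshot: 6 of 8 channels are carrying flits.
snapshot = [True, True, False, True, True, False, True, True]
print(bandwidth_utilization(snapshot))  # 0.75
```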
Failure rate vs. VC size
• As VC size increases, the failure rate decreases
• Tradeoff: failure rate vs. VC size (area and cost)
[Figure: 3D-mesh failure rate as a function of VC size]
Throughput comparison
• Throughput of common interconnects is based on results provided by their vendors
• All k-ary n-cube interconnects utilize both VCs and SC
• 3D-mesh has the highest throughput of all the interconnects compared
[Figure: average throughput comparison among high-performance interconnects]
Conclusion
• We presented k-ary n-cube based interconnects as off-chip communication architectures for line cards to increase the throughput of the currently used memory system
• We designed a new mixed-radix, non-symmetrical k-ary n-cube based interconnect called the 3D-mesh (a variation of 2-ary 3-cube)
• We included multiple, highly efficient techniques to route, switch and control packet flows in order to increase throughput and interconnect utilization, while minimizing traffic congestion and packet loss
• We identified the processor-memory configuration, out of multiple configurations, that achieves the best performance
• We developed a custom-designed, event-driven simulator to evaluate the performance of packet-based, off-chip, k-ary n-cube interconnect architectures
Conclusion
• Our results show that k-ary n-cube interconnect architectures provide higher throughput and can sustain higher traffic loads
• 3D-mesh reached the highest performance results of all the interconnects tested
• 3D-mesh is a scalable, cost-effective solution which complies with all the functional as well as the physical constraints on the line card
Future work
• The results of this work can also be used for
– PCs
– On-chip communication architectures
• Future directions for this work include
– Board implementation of the interconnect
– Testing the interconnect with off-the-shelf components
– Expanding our simulation framework to include higher-dimension k-ary n-cube networks and Internet traffic workloads