College of Engineering & Computer Science
Department of Electrical & Computer Engineering

Off-chip Communication Architectures for High Throughput Network Processors

Short version of Final Defense

Jacob Engel

Motivation

• Line card design challenges
  – Rapidly growing line rates
  – Additional deep packet processing operations
  – Increase in memory capacity requirements
• The “memory wall”
  – Nature of packet transmission
  – Variable packet size
  – Out-of-order transmission
• Off-chip vs. on-chip
  – Off-chip interconnects lack innovative methods to improve their integration into large, scalable network components
• PCB physical limitations
  – Area
  – I/O pins
  – Switching elements
  – Scalability

[Figure: line card block diagram; labels: OS application, Buffer, Memory, UART, Routing Table, Packet Data, SONET, Ethernet, ATM, Utopia, Backplane Connector, Forwarding Engine, NPU]

Interconnect Topologies

• Bus
• k-ary n-cubes
• Mesh
  – N nodes arranged in a rectangle or square
  – Each interior node connected to 4 neighbors
  – Cost: N switches (one per node)
• Torus (k-ary n-cube with wraparound)
  – Mesh with wrap-around links
  – Longer cables than simple mesh
  – Harder to partition
• Hypercubes
  – Multi-dimensional cubes
  – Nodes identified by n-bit numbers
  – Each node connected to the other nodes that differ from it in a single bit
  – Cost: N switches (one per node)

[Figures: mesh; torus network (4-ary 2-cube network); 3D hypercube (2-ary n-cube)]
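
As a rough illustration of the topologies listed above, the following Python sketch enumerates a node's neighbors in a mesh, a torus and a hypercube. The function names and the coordinate-tuple representation of a node are assumptions made for this example; they do not come from the slides.

```python
def mesh_neighbors(node, k):
    """Neighbors of `node` (a tuple of coordinates) in a mesh with k nodes per side."""
    neighbors = []
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            if 0 <= coord + step < k:  # no wrap-around: edge nodes have fewer neighbors
                neighbors.append(node[:dim] + (coord + step,) + node[dim + 1:])
    return sorted(neighbors)


def torus_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-cube with wrap-around links."""
    neighbors = set()  # a set, because for k = 2 both steps reach the same node
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            neighbors.add(node[:dim] + ((coord + step) % k,) + node[dim + 1:])
    neighbors.discard(node)  # only matters for the degenerate k = 1 case
    return sorted(neighbors)


def hypercube_neighbors(node_id, n):
    """Neighbors of an n-bit node id: flip each of the n bits in turn."""
    return [node_id ^ (1 << bit) for bit in range(n)]
```

With k = 2 the torus reduces to the hypercube: `torus_neighbors((0, 0, 0), 2)` returns the three nodes that differ from `(0, 0, 0)` in exactly one coordinate.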

Switching Mechanism in Interconnects

• Bus-based network
  – Drawback
    • Does not scale up with number of processors
• Crossbar switching network
  – Allows connection of any of p processors to any of b memory banks
  – Number of switches: n²
  – Wiring complexity: O(n²w)
  – Disadvantages
    • Its complexity grows as a function of p²
    • Expensive for a large number of processors
    • Physical dimensions (switches + wiring)

[Figures: bus-based network; crossbar network]

Switching Mechanism in Interconnects

• Omega switching network
  – More scalable in terms of cost than a crossbar, more scalable in terms of performance than a bus
  – Number of switches: (n/2)log_k(n)
    • n = # channels; k = # switches/box
  – Wiring complexity: O(n w log_k(n))
  – Drawbacks
    • Limited switching flexibility
    • Blocking

[Figure: Omega switch]
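
To make the cost comparison concrete, the sketch below evaluates the two switch-count formulas quoted on these slides (n² for the crossbar, (n/2)log_k(n) for the Omega network). The function names are hypothetical, and restricting n to an exact power of k is an assumption made here so that the logarithm is an integer.

```python
def crossbar_switches(n):
    """Switch count of an n x n crossbar, as quoted on the slide: n^2."""
    return n * n


def omega_stages(n, k):
    """log_k(n), computed exactly; n must be a power of k (assumption of this sketch)."""
    if k < 2 or n < 1:
        raise ValueError("need k >= 2 and n >= 1")
    stages = 0
    while n > 1:
        if n % k:
            raise ValueError("n must be a power of k")
        n //= k
        stages += 1
    return stages


def omega_switches(n, k=2):
    """Switch count of an Omega network, as quoted on the slide: (n/2) * log_k(n)."""
    return n * omega_stages(n, k) / 2
```

For n = 64 channels and k = 2, the crossbar formula gives 4096 switches and the Omega formula gives 192, which is the cost gap the slide refers to.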

K-ary n-cube based architectures

• Packet-based, with multiple paths
• Packets are shared among PEs and memories (M)
• PEs = TM, QoS, Classification…
• An oversubscribed or faulty link does not prevent connectivity to its nodes
• Uses wormhole routing
• Packets are routed adaptively based on traffic load and connectivity

[Figures: 8-ary 2-cube interconnect; 4-ary 3-cube interconnect; 3D-mesh interconnect; shared-bus]

Wormhole routing

• Wormhole routing operates on flits
  – A flit is typically what can be transmitted in a single cycle
  – Flit size = channel width
• Packet header is typically one or two flits
• The rest of the flits do not contain routing information
• Known for its improved latency and throughput

[Figure: a message split into packets 4 3 2 1, led by the header, traversing Node 1, Node 2, Node 3 and Node 4]
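
The flit structure described on this slide can be sketched as follows. This is a minimal illustration, not the simulator's data model: the `Flit` class, the `make_worm` function and the choice of a single header flit that carries only the destination are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Flit:
    seq: int                         # position of the flit within the worm
    dest: Optional[Tuple[int, ...]]  # routing information, present on the header only
    payload: bytes                   # at most one channel width of data


def make_worm(message: bytes, dest: Tuple[int, ...], channel_width_bytes: int) -> List[Flit]:
    """Split a message into one header flit followed by channel-width body flits."""
    if channel_width_bytes <= 0:
        raise ValueError("channel width must be positive")
    flits = [Flit(seq=0, dest=dest, payload=b"")]  # header flit (hypothetical layout)
    for offset in range(0, len(message), channel_width_bytes):
        chunk = message[offset:offset + channel_width_bytes]
        flits.append(Flit(seq=len(flits), dest=None, payload=chunk))  # no routing info
    return flits
```

Only the first flit carries the destination, so intermediate nodes make one routing decision per worm and the body flits follow the path the header has opened.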

K-ary n-cube based architectures

• Performance measures
  – Latency = T_w + T_s + T_r
  – Latency does not consider queuing delays at this point
  – Latency is measured per flit per flow
  – Throughput = ch_width / L
  – Bi-directional links will increase the throughput
  – Aggregate throughput is a function of #PE as well
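
A small worked example of the two formulas above, written as a sketch. It assumes that T_w, T_s and T_r are the wire propagation, switching and routing delays mentioned on the latency results slide, and that L in the throughput formula is the latency; the numeric values in the usage note are made up for illustration.

```python
def flit_latency(t_wire, t_switch, t_route):
    """Latency = T_w + T_s + T_r (queuing delays are not included)."""
    return t_wire + t_switch + t_route


def flit_throughput(channel_width_bits, latency):
    """Throughput = ch_width / L, with L assumed to be the per-flit latency."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return channel_width_bits / latency
```

With hypothetical delays of 1, 2 and 1 time units, `flit_latency(1, 2, 1)` is 4, and a 32-bit channel then gives `flit_throughput(32, 4)` = 8.0 bits per time unit.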

Flow control mechanisms

[Figure: Virtual Channel Effects; panels: Virtual Channel Inactive, Virtual Channel Active (label: VC)]
[Figure: Sub-Channel Variation Effects; panels: Channel/Packet Size 32 Bits, 1 channel; Channel/Packet Size 16 Bits, 2 sub-channels; Channel/Packet Size 8 Bits, 4 sub-channels]
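
The three sub-channel cases named in the figure (32 bits as 1 channel, 16 bits as 2 sub-channels, 8 bits as 4 sub-channels) amount to splitting one channel into equal parts. The sketch below shows that arithmetic; the function name and the requirement that the width divide evenly are assumptions.

```python
def partition_channel(channel_width_bits, sub_channels):
    """Split a channel into equal-width sub-channels and return their widths."""
    if sub_channels < 1 or channel_width_bits % sub_channels:
        raise ValueError("width must divide evenly into a positive number of sub-channels")
    return [channel_width_bits // sub_channels] * sub_channels
```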

The traffic controller (TC)

• Switch module
  – Receives status from all TC modules
  – It switches I/O ports
• Channel Sampler
  – Ports status
  – “busy” / “not-busy”
• Routing algorithm
  – Shortest path
  – Ports status
• Virtual channels
  – VC on/off
  – # of available VC
  – VC occupancy status
• Channels partitioning
  – Sub-channeling on/off
  – # of available SC
  – SC occupancy status

[Figure: TC architecture; labels: Port A, Port B, Port C, Port D, SW module, Channel Sampler, Routing Algorithm, Virtual Channels, Channel Partition]
[Figure: PE/Memory connectivity; each row reads M, TC, TC, PE]
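
The slide names only "shortest path" and "ports status" as inputs to the routing algorithm. One way to combine the two, shown here purely as an assumed illustration, is a breadth-first search that skips links whose ports are reported busy. The function names and the 2D mesh used for the example are hypothetical.

```python
from collections import deque


def shortest_path(start, goal, neighbors, busy_links=frozenset()):
    """Breadth-first shortest path from start to goal, skipping busy (src, dst) links."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back through the parents to rebuild the path
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt in parents or (node, nxt) in busy_links:
                continue
            parents[nxt] = node
            queue.append(nxt)
    return None  # no route avoids the busy links


def mesh2d_neighbors(size):
    """Neighbor function for a size x size mesh without wrap-around."""
    def neighbors(node):
        x, y = node
        candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return [(a, b) for a, b in candidates if 0 <= a < size and 0 <= b < size]
    return neighbors
```

In a 3 x 3 mesh the route from `(0, 0)` to `(2, 0)` normally takes two hops; marking the link `((0, 0), (1, 0))` as busy forces a four-hop detour through `(0, 1)`.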

The Network simulator

[Screenshots: 3D-mesh interconnect; 4-ary 3-cube interconnect; 8-ary 2-cube interconnect]

Interconnect simulation

[Screenshots: the interconnect; simulation speed; runtime performance data]

The Network simulator architecture

• The Interconnect
  – The interconnect type & configuration to simulate
• Worms Jar
  – A worm container
  – Contains worms to model and worms done modeling
• Traffic Sampler
  – Records performance data
• Interconnect Configuration / Interconnect Properties
  – Interconnect type, configuration & properties
  – Data is updated from the user interface
• Scheduler
  – Timing for entering worms
  – Traffic load feedback
• Routing Algorithm
  – Calculates the shortest path route for each worm
• Performance Results
  – Output files which contain all simulated data (worm properties and performance)
• Manager
  – Orchestrates the simulation process
  – Provides control signals to all other modules participating in simulation
• User Interface
  – Command line or GUI
  – User system parameters

[Figure: simulator block diagram connecting the modules above; additional labels: Worms, Worm Data; the interconnect is drawn as a grid of M and PE nodes]

Performance results: Latency

• Latency denotes the time it takes a message to reach its destination
• Latency includes wire propagation, switching and routing delays
• Latency of the 3D-mesh for both short and long messages was the smallest of all three interconnects
• The results shown represent the average latency measured for both short and long messages

[Figure: K-ary n-cube interconnects latency comparison]

Performance results: Latency

• Offered load determines the probability that each node in the interconnect will generate a message within each cycle
• If offered load = 0.1, each node has a 10% chance of generating a message in each cycle, so on average 10% of the nodes generate a message per cycle
• As the offered load increases, the latency increases exponentially for all the interconnects
• 3D-mesh has the lowest latency and is able to sustain a higher traffic load

[Figure: K-ary n-cube interconnects latency vs. offered load]
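
The offered-load definition above corresponds to an independent random draw per node per cycle. A minimal sketch of that traffic model follows; the function name, the use of Python's `random.Random` and the 64-node example are assumptions, not details taken from the simulator.

```python
import random


def generate_messages(nodes, offered_load, rng):
    """Return the nodes that generate a message in this cycle.

    Each node generates a message independently with probability `offered_load`.
    """
    if not 0.0 <= offered_load <= 1.0:
        raise ValueError("offered load must be between 0 and 1")
    return [node for node in nodes if rng.random() < offered_load]


# Example: 64 nodes, offered load 0.1, 1000 simulated cycles.
rng = random.Random(0)  # fixed seed so the run is repeatable
nodes = list(range(64))
generated = sum(len(generate_messages(nodes, 0.1, rng)) for _ in range(1000))
average_fraction = generated / (len(nodes) * 1000)
```

Over many cycles `average_fraction` settles close to the offered load, which is the sense in which roughly 10% of the nodes generate a message per cycle.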

Performance results: Throughput

• 3D-mesh reached the highest peak throughput for both short and long messages
• 4-ary 3-cube outperforms 8-ary 2-cube in all measurements
• Higher throughput when long worms are modeled, as a result of the routing algorithm (wormhole routing)

[Figure: K-ary n-cube interconnects throughput comparison]

Routing accuracy comparison

• VCs and SC significantly increased routing accuracy
• 3D-mesh RA = 96% vs. 85% (8-ary 2-cube) and 84% (4-ary 3-cube)

[Figures: 8-ary 2-cube, 3D-mesh and 4-ary 3-cube routing accuracy with VC & SC enabled]

Bandwidth utilization rate

• Bandwidth utilization = (# of occupied channels) / (total # of channels)
• VCs as well as SC increase the bandwidth utilization rate
• The gain from no-VC/SC to VC+2SC is larger than the gain from VC+2SC to VC+4SC
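
The bandwidth utilization ratio defined on this slide can be computed from a snapshot of channel occupancy, as in the sketch below. Representing the snapshot as a list of booleans (one per channel, `True` meaning occupied) is an assumption made for the example.

```python
def bandwidth_utilization(channel_occupied):
    """(# of occupied channels) / (total # of channels) for one snapshot."""
    if not channel_occupied:
        raise ValueError("at least one channel is required")
    return sum(1 for busy in channel_occupied if busy) / len(channel_occupied)
```

For a hypothetical snapshot of four channels with three occupied, `bandwidth_utilization([True, False, True, True])` is 0.75.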

Failure rate vs. VC size

• As VC size increases, the failure rate decreases
• Tradeoff: failure rate vs. VC size (area and cost)

[Figure: 3D-mesh failure rate as a function of VC size]

Throughput comparison

• Throughput of common interconnects is based on results provided by their vendors
• All k-ary n-cube interconnects utilize both VCs and SC
• 3D-mesh has the highest throughput of all the interconnects compared

[Figure: Average throughput comparison among high-performance interconnects]

Conclusion

• We presented k-ary n-cube based interconnects as off-chip communication architectures for line cards to increase the throughput of the currently used memory system
• We designed a new mixed-radix, non-symmetrical k-ary n-cube based interconnect called the 3D-mesh (a variation of 2-ary 3-cube)
• We included multiple, highly efficient techniques to route, switch and control packet flows in order to increase throughput and interconnect utilization, while minimizing traffic congestion and packet loss
• We revealed the best processor-memory configuration, out of multiple configurations, that achieves optimal performance
• We developed a custom-designed, event-driven simulator to evaluate the performance of packet-based, off-chip, k-ary n-cube interconnect architectures

Conclusion

• Our results show that k-ary n-cube interconnect architectures provide higher throughput and can sustain higher traffic loads
• 3D-mesh achieved the best performance results of all the interconnects tested
• 3D-mesh is a scalable, cost-effective solution which complies with all the functional as well as the physical constraints on the line card

Future work

• The results of this work can also be used for
  – PCs
  – On-chip communication architectures
• Future directions for this work include
  – Board implementation of the interconnect
  – Testing the interconnect with off-the-shelf components
  – Expanding our simulation framework to include
    • Higher-dimension k-ary n-cube networks
    • Internet traffic workloads


