					Oct. 31                          IJASCSE, Vol 1, Issue 3, 2012

                 Delay optimization: use of dynamic routing
                                  Andrews K, Bharti Maitreye

ABSTRACT - Systems on Chips (SoCs) are becoming increasingly complex as large numbers of cores are integrated into single-chip platforms. These systems typically exhibit stringent processing, communication, and power constraints that must be carefully addressed during system design. As the size and diversity of SoC use increase, so does the importance of run-time monitoring of correct functionality and system performance. Real-time system monitoring is crucial to determine whether a system is operating as designed and executing within its designed parameters. To address the need for high-speed, coordinated transport of monitor data in an SoC, we develop a new interconnection network for monitors, the Monitor Network on Chip (MNoC). Data gathered from the monitors via MNoC is collected by a processor that controls the operation of the SoC in response to the monitor data. The existing system uses static routing, which shows significant limitations in latency because the data transfer rate is low. In contrast, this work uses a dynamic routing algorithm. With dynamic routing the available bandwidth is higher, so latency can be reduced. The network is also kept free from congestion, which in turn improves throughput.

Keywords: Monitor Network on Chip, dynamic routing, latency.

I. INTRODUCTION

A system-on-a-chip or system on chip (SoC) integrates all components of a computer or other electronic system into a single integrated circuit chip. It may contain digital, analog, mixed-signal, and often radio frequency functions, all on a single chip substrate. Network-on-Chip (NoC) is an approach to designing the communication subsystem between IP cores in an SoC. A NoC can span synchronous and asynchronous clock domains or use unclocked asynchronous logic. NoC applies networking theory and methods to on-chip communication and brings notable improvements over conventional bus and crossbar interconnections. NoC improves the scalability of SoCs, and the power efficiency of complex SoCs, compared to other designs.

Conventional system-on-a-chip hardware is augmented with additional components for monitoring, verification, and response. Multiple monitors are added to each major component of the SoC. The monitors are linked by a monitor network on-chip, a heterogeneous communication substrate containing low-overhead routers, buses, and multiplexed connections [1].

The paper is organized as follows. Section II reviews the background and the existing systems for on-chip communication. Section III presents the target system-on-chip multiprocessor architecture, focusing on the critical path monitor and dynamic routing. Finally, the summary and results are described in Section IV.

II. BACKGROUND AND RELATED WORK

A. Existing Approaches to On-Chip Communication

Increasing SoC integration raises chip reliability and power concerns, making monitors for temperature, power, clock jitter, supply noise and performance behavior an integral part of present-day SoCs. Buses are the most straightforward form of SoC communication and are widely used in contemporary SoCs [2]. In a bus-based interconnection, several communicating modules are connected to a set of shared wires, and an arbiter controls data transfer on the bus. The arbiter evaluates requests from the various peripherals and grants one requester access to the bus according to the arbitration mechanism it employs. Buses are simple and easy to build. However, they suffer from several disadvantages, such as poor scalability. These limitations are causing a shift towards alternative, more scalable communication models.

An FPGA thermal monitoring system connects temperature sensors to a controller using the On-Chip Peripheral Bus (OPB). Sensor information read by the controller through the OPB can be used to implement various dynamic thermal management schemes [3]. The number of monitors that can be connected to a bus is limited by its scalability. If numerous SoC monitors need to communicate with a centralized destination, as in MNoC, a bus by itself is likely unsuitable. As the need for monitor integration grows, more monitors must be attached to the bus and its performance begins to degrade [4]. Bus arbitration delays increase with the number of peripherals. Also, bandwidth is shared among multiple monitors, and the shared bandwidth might not suffice for some higher-data-rate monitors such as critical path delay monitors [5].

The thermal sensors on Intel's Montecito processor are directly connected to the micro-controller via analog-to-digital converters using point-to-point connections. Point-to-point connections can be a good choice of interconnection for a small number of closely located monitors [6]. Since we anticipate that MNoC will cater to a number of different kinds of monitors, it is not realistic to assume a point-to-point connection from every monitor to the MEP. Such point-to-point connections for MNoC would result in a significant resource overhead.

B. Metrics of Networks on Chip

NoC is an approach to communication within large VLSI systems implemented on a single silicon chip. In a NoC system, modules such as processor cores, memories and specialized IP blocks exchange data using a network [7]. Two important metrics for evaluating the quality of a network-on-chip are bandwidth and latency. Bandwidth is the amount of data that can be put on the network in a given amount of time. Latency is the delay experienced by data in traveling along the network from the source to the destination.

C. Existing Method

The main functions of a multi-core processor are efficient task allocation, priority scheduling, task synchronization, task parallelism, overload-aware (power and area) handling, and reduced energy consumption. In a multi-core processor, performance (bandwidth and latency) and reliability can be improved. MNoC requires different bandwidth
monitors. High-bandwidth monitors such as delay monitors are connected through routers, and low-bandwidth monitors such as thermal monitors are connected via multiplexers [8]. In MNoC, static distributed routing protocols are used when data traffic is low. With static routing, if there is a fault in the transmitting path, an alternate path cannot be chosen. For high data traffic, dynamic routing is used [9].

Every input channel in a router is multiplexed into two separate virtual channels: a priority channel and a regular channel. Higher-priority data, such as error data and critical monitor data, are sent through the priority channel [10]. All other data are sent through the regular channel.

III. PROPOSED SYSTEM

The data from core1 is transferred to core2 through a router, as shown in Fig 1.

Fig 1. Block diagram of proposed system

Core1 and core2 can be any kind of module, such as a processor or a memory. The data stored in core1 is transferred to core2 through the router. A dynamic routing technique is used to transfer the data from one core to the other. On-chip monitors are typically distributed in an unorganized fashion, necessitating an irregular interconnect topology. MNoC therefore needs an irregular mesh topology of routers, whose placement is dictated by the distribution of monitors. Although other topologies could possibly be used with MNoC, a mesh-like topology represents a simple, extensible solution for initial exploration. The critical path monitor, also known as a delay monitor, is used to determine the process variation while data is transferred from one core to another. Dynamic routing is used to transfer the data at high traffic rates. It is a fairly simple routing mechanism in which each router maintains a table of the destinations known to it, along with the link used to reach each one, and sends this information to its neighbors.

A. Critical path monitor

As SoC design has migrated towards the use of multiple cores, the deployment and use of on-chip monitors has become more widespread. Monitors are important components of an SoC, and delay monitors are one kind. Critical path monitors are used to identify the effects of process variation and supply noise on circuit performance. Typical path delay monitors include multi-inverter delay lines with capture latches at each inverter output. The output of a critical path monitor is often a digital code, which can require high bandwidth.

The delay line and sampling latches form an edge-capture circuit. If the inverter delays are constant and there is no clock jitter, the clock edges are captured in the same latches every cycle. When there is clock jitter or power supply noise, the location of the captured edges changes.

A signal generation circuit launches a signal down a delay path which is then
captured by a time-to-digital converter, as shown in Fig 2.

Fig 2. Generic Critical Path Monitor (CPM) diagram

The signal is usually a propagating edge generated from the system clock. The time-to-digital converter can take many forms, but it is typically a latch or bank of latches that provides from 1 to n bits of resolution. The system clock, the most accurate signal on a microprocessor, provides the time base for the time-to-digital conversion. The control block uses calibration information to configure the critical path.

A comparison between the monitor output and the calibration data determines the system parameters that must be actuated to provide appropriate timing margin. There are two different types of path: parallel and serial. A serial path has RC delay and takes longer to propagate the signal, so parallel paths are used to obtain less delay. The synthesis paths consist of a combination of parallel delay paths with differing timing characteristics, as shown in Fig 3.

Fig 3. Parallel Paths

The CPM described in this work is shown in Fig 4. The synchronizer is a flip-flop acting as a clock divider: the CPM samples a rising edge in one cycle and a falling edge in the following cycle. The delay synthesis consists of five parallel paths that can be monitored individually or combined in the data analysis block to emulate a hybrid path, providing up to 14 unique path combinations.

Fig 4. Critical path monitor

The delay lines synthesize the operating-point effects on the timing of wire delay, adder delay, pass-gate delay, 4-NAND gate delay, and 3-NOR gate delay. Programming overhead for this CPM is spent in a variable delay line, which adjusts edge placement in the edge detector, and in selecting the paths that will be detected.
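The edge-capture behaviour described in this section can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions, not the authors' circuit: every stage of the delay line is assumed to have the same delay, a single edge is launched at time zero, and all latches sample at the end of the clock period. The function names, the picosecond units and the example numbers are hypothetical. The default of 12 stages matches the 12-bit CPM output reported in the paper.

```python
def capture_edge(clock_period_ps, stage_delay_ps, num_stages=12):
    """Return the bits latched along the delay line at the end of one clock period.

    Bit i is 1 if the launched edge has already passed stage i (0-based)
    when the latches sample, and 0 otherwise.
    ASSUMPTION: all stages have the same delay (stage_delay_ps).
    """
    code = []
    for stage in range(num_stages):
        arrival_ps = (stage + 1) * stage_delay_ps  # time at which the edge leaves this stage
        code.append(1 if arrival_ps <= clock_period_ps else 0)
    return code


def edge_position(code):
    """Number of stages the edge has passed (count of 1s in the captured code)."""
    return sum(code)


# Constant stage delay and no jitter: the edge is captured in the same latch every cycle.
nominal = capture_edge(clock_period_ps=100, stage_delay_ps=10)

# Slower stages (for example under supply noise): the captured edge location changes.
slow = capture_edge(clock_period_ps=100, stage_delay_ps=12)

print(edge_position(nominal), edge_position(slow))
```

Comparing the two captured codes shows how a change in stage delay, with the clock period held fixed, shifts the position at which the edge is captured.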
B. Time-to-Digital Conversion

The time-to-digital converter, also called the edge detector, is shown in Fig 5. The system clock captures the previously launched edge within the twelve-bit field and compares the value to the level that would have occurred if the edge had traveled all the way through the edge detector. If the clock frequency remains the same but the path delay increases, or if the path delay remains the same while the clock frequency increases, the edge will move to the left in the edge detector.

Fig 5. Time-to-digital converter

Timing margin is obtained by subtracting the bit position at maximum frequency from the current bit position. The CPM has a 12-bit thermometer code output, with each bit separated by a delay of 1 fanout-of-2 (FO2). A multiple-bit thermometer code was chosen because it provides more information about the timing margin than a simple PASS/FAIL reading.

The system can average the CPM output to increase its accuracy, and calibration can be done in the field to adjust to changes in operating point due to aging and other factors. The edge detector is as sensitive to operating point changes as the synthesis paths are, but this does not reduce the CPM's accuracy. In low fanout-of-4 (FO4) designs, where 16 FO4 is common, there is an overhead of 4 FO4 in the latch setup and hold, and another 2 FO4 in driving the first stage of logic and receiving the data from the logic. The CPM adds another 3-4 FO4 of overhead to provide the ability to program the synthesis paths. That is a total overhead of 9-10 FO4. If the edge detector is made insensitive to operating point, the capture latch timing in the edge detector will not correlate to the capture latch of the actual critical paths. An additional delay of 2 FO4 within the edge detector is also lost. This leaves the CPM's cycle time sensitivity about 4 FO4 less than the actual cycle time, decreasing the CPM's correlation to the critical paths.

C. Dynamic Network

Switching defines how packets move through the routers. The most important modes are store-and-forward, virtual cut-through and wormhole. In store-and-forward mode, a router cannot forward a packet until it has been completely received. This leads to high latencies because the header and the body flits have to wait until the last flit of the packet arrives. Only then can the packet be forwarded to the next switch. The buffer sizes also need to be large in this mode. In virtual cut-through mode, a router can forward a packet as soon as the next switch guarantees that the packet will be accepted completely. The buffer must therefore still be able to store a complete packet, as in store-and-forward, but communication latency is lower.

The wormhole switching mode is a variant of the virtual cut-through mode that avoids the need for large buffer spaces. A packet is transmitted between routers on a flit-by-flit basis. Only the header flit carries the routing information. Thus, the rest of the flits that compose a packet must follow the same path
reserved for the header [8, 9]. Wormhole routing is typical of low-latency, low-overhead implementations and is the one to choose for a low-overhead MNoC. Congestion control, reliability and deadlock avoidance should be addressed by the network and the implemented protocol [10].

The choice of each of these parameters directly influences the area and performance of the network. The final design is a balanced trade-off based on the cost requirements of the communicating cores. For MNoC, the communicating modules are monitors. Unlike a statically scheduled network, a dynamic network allocates resources and schedules communication at runtime. Dynamic routing describes the capability of a system, in which routes are characterized by their destination, to alter the path that a route takes through the system in response to a change in conditions. Routers handle communication by implementing routing protocols that forward data from the source to the destination. The cores, a fundamental component of a dynamic network, are the actual communicating modules, which are monitors in the case of MNoC.

Dynamic routing protocols distribute each router's 'best route' information to other routers running the same routing protocol, thereby extending the information on what networks exist and can be reached. This gives dynamic routing protocols the ability to adapt to logical network topology changes, equipment failures or network outages. Eventually all the nodes in the network receive the updated information and then discover new paths to all the destinations they can still reach.

IV. SUMMARY AND RESULTS

The data transfer rate can be increased by reducing the latency of the processor using dynamic routing, and the network can be kept free from congestion, whereas static routing cannot be used to decrease the latency. In this work, core1, core2 and the critical path monitor have been completed and the simulation results analyzed.

In future, the dynamic routing algorithm will be implemented using core1 and core2. Using this routing, the delay of the processor can be optimized. The design is implemented in Verilog.

A. Simulation results

The final output shows the simulation results of core1, core2, the critical path monitor and the interfacing of all modules.

Fig 6. Output of core1

In Fig 6, core1 has two 8-bit inputs and produces an 8-bit output. When logic_unit is 1, core1 produces the corresponding output. When logic_unit is 0, core1 does not produce any output.
Fig 7. Output of core2

In Fig 7, core2 stores the data at a particular address when wr_en is 1 and reads the data when rd_en is 1.

Fig 8. Output of critical path monitor

In Fig 8, the Synin input is given to the CPM. The output is obtained at the comparator side and may be equal, less or great.

Fig 9. Simulation result of core1 and CPM

In Fig 9, the CPM analyzes the critical path of core1. The inputs of core1 are aluin1 and aluin2, and it produces an output at alu_out. The other outputs are equal, less and great.

[1]. Jia Zhao, Ramakrishna Vadlamani, Russell Tessier, Sailaja Madduri and Wayne Burleson (2011), ‘A Dedicated Monitoring Infrastructure for Multicore Processors’, IEEE. VLSI J., vol. 19, no. 6, pp. 1011-1022.
[2]. Burleson, W., Madduri, S., Tessier, R. and Vadlamani, R. (2009), ‘A monitor interconnect and support subsystem for multicore processors’, in Proc. IEEE/ACM Design Autom. Test Eur. Conf., pp. 761–766.
[3]. Jayaseelan, R. and Mitra, T. (2009), ‘A hybrid local-global approach for multicore thermal management’, in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., pp. 314–320.
[4]. Haitham Akkary, Komal Jothi, Renjith Retnamma, Satyanarayana Nekkalapu and Xiaoyu Song (2008), ‘A Simple Latency Tolerant Processor’, in Proc. IEEE Int. Conf. Embedded Comput. Syst.: Arch., Model. Simulation, pp. 384–389.

[5]. Carpenter, G., Deogun, H., Drake, A., Floyd, M., Ghiasi, S., James, N., Nguyen, T., Pokala, V. and Senger, R. (2007), ‘A Distributed Critical-Path Timing Monitor for a 65nm High-Performance Microprocessor’, in Proceedings of the IEEE International Solid-State Circuits Conference.
[6]. Monchiero, M., Palermo, G., Silvano, C. and Villa, O. (2006), ‘Exploration of distributed shared memory architectures for NoC-based multiprocessors’, in Proc. IEEE Int. Conf. Embedded Comput. Syst.: Arch., Model. Simulation, pp. 144–151.
[7]. De Micheli, G., Grecu, C., Ivanov, A., Pande, P. and Saleh, R. (2005), ‘Design, synthesis, and test of networks on chips’, IEEE Des. Test Comput., vol. 22, no. 5, pp. 404–413.
[8]. Bannon, P., Lang, S., Mukherjee, S., Spink, A. and Webb, D. (2002), ‘The Alpha 21364 network architecture’, IEEE Micro, vol. 22, no. 1, pp. 26–35.
[9]. Alvin R. Lebeck Durham and Srikanth, T. Srinivasan (2002), ‘Load Latency Tolerance In Dynamically Scheduled Processors’, IEEE Micro, vol. 22, no. 1, pp. 120–132.
[10]. Kazuyuki MIYASAKA, Hiroaki HARAMI ISHI, Naohiko SHIMIZU (2002), ‘Design of A Memory Latency Tolerant Processor’, in Proc. IEEE Int. Conf. Embedded Comput. Syst.: Arch., Model. Simulation, pp. 144–151.
