L4: Architectural Level Design
Prof. Jun-Dong Cho, Sungkyunkwan University
http://vlsicad.skku.ac.kr
System-Level Solutions
• Spatial locality: an algorithm can be partitioned into natural clusters
based on connectivity
• Temporal locality: average lifetimes of variables (shorter lifetimes need
less temporary storage; data referenced in the recent past is likely to be
accessed again)
• Precompute the physical capacitance of the interconnect and the switching
activity (number of bus accesses)
• Architecture-driven voltage scaling: choose a more parallel
architecture
• Supply voltage scaling: lowering Vdd reduces energy, but
increases delay
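The trade-off in the last two bullets can be sketched numerically. This is a minimal sketch under a first-order CMOS model that is an assumption here, not something stated on the slide: dynamic power P = C * Vdd^2 * f and gate delay proportional to Vdd / (Vdd - Vt)^2. The threshold voltage `VT`, the 5 V reference and the helper names are hypothetical. It duplicates the datapath, lets each copy run at half speed, and looks for the lowest Vdd that still meets the original throughput.

```python
# First-order CMOS model (an assumption, not from the slides):
#   dynamic power  P = C * Vdd^2 * f
#   gate delay     t ~ Vdd / (Vdd - Vt)^2
VT = 0.8  # hypothetical threshold voltage, volts


def gate_delay(vdd, vt=VT):
    """Relative gate delay at supply voltage vdd (arbitrary units)."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return vdd / (vdd - vt) ** 2


def dynamic_power(c_eff, vdd, freq):
    """Relative dynamic power C * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq


def lowest_vdd(v_ref, slowdown, vt=VT, step=0.001):
    """Lowest Vdd (to within step) whose delay is <= slowdown * delay(v_ref)."""
    target = slowdown * gate_delay(v_ref, vt)
    vdd = v_ref
    while vdd - step > vt and gate_delay(vdd - step, vt) <= target:
        vdd -= step
    return vdd


V_REF = 5.0
# Two parallel units may each run twice as slowly at the same throughput.
v_par = lowest_vdd(V_REF, slowdown=2.0)
p_ref = dynamic_power(c_eff=1.0, vdd=V_REF, freq=1.0)
# Duplication doubles the switched capacitance (routing overhead ignored),
# while each unit is clocked at half the rate.
p_par = dynamic_power(c_eff=2.0, vdd=v_par, freq=0.5)

print(f"Vdd: {V_REF:.2f} V -> {v_par:.2f} V, power ratio {p_par / p_ref:.2f}")
```

Under this model the power ratio reduces to (v_par / V_REF)^2, so the saving comes entirely from the lower supply voltage; the price is the duplicated area.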
Software Power Issues
Up to 40% of the on-chip power is dissipated on the buses!
• System software: OS, BIOS, compilers
• Software can affect energy consumption at various levels
Inter-Instruction Effects
• The energy cost of an instruction varies depending on the previous instruction
• For example, XOR BX,1; ADD AX,DX
• I_est = (319.2 + 313.6) / 2 = 316.4 mA; I_obs = 323.2 mA
• The difference (323.2 - 316.4 = 6.8 mA) is defined as the circuit state overhead
• Need to specify the overhead as a function of pairs of instructions
• Other inter-instruction effects are due to pipeline stalls and cache misses
• Instruction reordering can improve the cache hit ratio
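The arithmetic behind the circuit state overhead can be written out as a short sketch. The function names are hypothetical, and pairing 319.2 mA with XOR and 313.6 mA with ADD assumes the slide lists the base currents in instruction order.

```python
def estimated_pair_current(base_a_ma, base_b_ma):
    """Estimate for an alternating pair: the mean of the two base currents."""
    return (base_a_ma + base_b_ma) / 2


def circuit_state_overhead(observed_ma, base_a_ma, base_b_ma):
    """Measured current minus the estimate built from base costs alone."""
    return observed_ma - estimated_pair_current(base_a_ma, base_b_ma)


# Values from the slide (assumed order: XOR BX,1 then ADD AX,DX).
XOR_BASE_MA = 319.2
ADD_BASE_MA = 313.6
OBSERVED_MA = 323.2

i_est = estimated_pair_current(XOR_BASE_MA, ADD_BASE_MA)
overhead = circuit_state_overhead(OBSERVED_MA, XOR_BASE_MA, ADD_BASE_MA)
print(f"I_est = {i_est:.1f} mA, overhead = {overhead:.1f} mA")
```

A table of such overheads, one entry per instruction pair, is what the "function of pairs of instructions" bullet calls for.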
Software Power Optimization
• Instruction packing
– reduces cache misses, which carry a high power penalty
– example
• Fujitsu DSP
• permits an ALU operation and a memory data transfer to be packed
• Instruction ordering
– attempts to minimize the energy associated with the circuit state
effect
– reorders instructions to minimize the total power for a given table
• Operand swapping
– minimizes activity associated with the operand
– attempts to swap operands to the ALU or FPU
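Instruction ordering against a cost table can be sketched as a search over permutations. This is a minimal sketch, not a compiler pass: it assumes the instructions are mutually independent (no data dependencies), and every number in `BASE_COST` and `PAIR_OVERHEAD` is hypothetical.

```python
from itertools import permutations

# Hypothetical base costs and pairwise circuit-state overheads (arbitrary units).
BASE_COST = {"ADD": 300, "XOR": 310, "MOV": 320}
PAIR_OVERHEAD = {
    ("ADD", "XOR"): 7, ("XOR", "ADD"): 7,
    ("ADD", "MOV"): 2, ("MOV", "ADD"): 2,
    ("XOR", "MOV"): 3, ("MOV", "XOR"): 3,
}


def sequence_cost(order, base_cost, pair_overhead):
    """Sum of base costs plus the overhead of every adjacent pair."""
    cost = sum(base_cost[op] for op in order)
    for prev, nxt in zip(order, order[1:]):
        cost += pair_overhead.get((prev, nxt), 0)
    return cost


def best_order(ops, base_cost, pair_overhead):
    """Exhaustive search; only practical for short, independent sequences."""
    return min(
        permutations(ops),
        key=lambda order: sequence_cost(order, base_cost, pair_overhead),
    )


order = best_order(["ADD", "XOR", "MOV"], BASE_COST, PAIR_OVERHEAD)
print(order, sequence_cost(order, BASE_COST, PAIR_OVERHEAD))
```

The base costs are the same for every ordering, so the search only changes the overhead term; here it keeps the expensive ADD/XOR pair apart. A real scheduler would restrict the search to orders that respect data dependencies.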
Software Power Optimization
• Minimizing memory access costs
– minimizes the number of memory accesses required
– example
Before:
  FOR i := 1 TO N DO
    B[i] = f(A[i]);
  END_FOR;
  FOR i := 1 TO N DO
    C[i] = g(B[i]);
  END_FOR;
After:
  FOR i := 1 TO N DO
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  END_FOR;
• Memory bank assignment
– formulated as a graph partitioning problem
– each group corresponds to a memory bank
– the optimum code sequence can vary for an algorithm using dual loads
[Figure: access graph for a code fragment (nodes a, b, c, d, e) and the partitioned access graph, with the nodes assigned to Bank A and Bank B]
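The effect of merging the two loops can be checked by counting array accesses. This is a minimal sketch: `CountingArray`, `f` and `g` are hypothetical stand-ins, and the fused version assumes the freshly computed B[i] can stay in a register instead of being read back from memory.

```python
class CountingArray:
    """List wrapper that counts element reads and writes."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def __getitem__(self, i):
        self.reads += 1
        return self.data[i]

    def __setitem__(self, i, value):
        self.writes += 1
        self.data[i] = value


def total_accesses(*arrays):
    return sum(arr.reads + arr.writes for arr in arrays)


def run_unfused(values, f, g):
    n = len(values)
    a, b, c = CountingArray(values), CountingArray([0] * n), CountingArray([0] * n)
    for i in range(n):
        b[i] = f(a[i])
    for i in range(n):
        c[i] = g(b[i])  # B is read back from memory
    return c.data, total_accesses(a, b, c)


def run_fused(values, f, g):
    n = len(values)
    a, b, c = CountingArray(values), CountingArray([0] * n), CountingArray([0] * n)
    for i in range(n):
        t = f(a[i])  # kept in a local, standing in for a register
        b[i] = t
        c[i] = g(t)
    return c.data, total_accesses(a, b, c)


def f(x):
    return x + 1  # hypothetical


def g(x):
    return 2 * x  # hypothetical


N = 8
c_before, accesses_before = run_unfused(range(N), f, g)
c_after, accesses_after = run_fused(range(N), f, g)
print(accesses_before, accesses_after)
```

With these assumptions the unfused code makes 4N array accesses and the fused code 3N, while producing the same C.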
Power Management Mode
• Support power management
– easy control for applications and OS
• APM: Advanced Power Management
– power states
• Full On
• APM Enabled
• APM Standby
• APM Suspend
• Off
[Figure: APM system layers, top to bottom: APM-aware applications and APM-aware device drivers; operating system with APM driver (OS dependent); BIOS with APM BIOS (OS independent); APM BIOS-controlled hardware and add-in devices]
Power Management Mode
• APM state transitions
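[Figure: APM state transition diagram. States from top to bottom: Full On, APM Enabled, APM Standby, APM Suspend, Off; a "Hibernation" label appears between APM Suspend and Off, and a "Power Managed" bracket groups the APM states. Device responsiveness decreases toward the bottom; power usage increases toward the top. Transition labels: APM Enable / Enable Call; APM Disable / Disable Call; Short Inactivity / Standby Call (into APM Standby); Long Inactivity / Suspend Interrupt / Suspend Call (into APM Suspend); Off Switch / Off Call (into Off); On Switch (out of Off).]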
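The state transitions can be sketched as a small table-driven state machine. The state and event names come from the diagram labels; which source state each event applies to is my reading of the diagram, and wake-up transitions back toward Full On are left out because their labels do not survive in this copy. `TRANSITIONS`, `next_state` and `run` are hypothetical names.

```python
OFF_EVENTS = {"Off Switch", "Off Call"}

# (current state, event) -> next state; pairings are an assumed reading.
TRANSITIONS = {
    ("Off", "On Switch"): "Full On",
    ("Full On", "Enable Call"): "APM Enabled",
    ("APM Enabled", "Disable Call"): "Full On",
    ("APM Enabled", "Short Inactivity"): "APM Standby",
    ("APM Enabled", "Standby Call"): "APM Standby",
    ("APM Enabled", "Long Inactivity"): "APM Suspend",
    ("APM Enabled", "Suspend Call"): "APM Suspend",
    ("APM Standby", "Long Inactivity"): "APM Suspend",
    ("APM Standby", "Suspend Call"): "APM Suspend",
}


def next_state(state, event):
    """Return the state after an event; unknown events leave the state unchanged."""
    if state != "Off" and event in OFF_EVENTS:
        return "Off"
    return TRANSITIONS.get((state, event), state)


def run(events, state="Off"):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = next_state(state, event)
    return state


final = run(["On Switch", "Enable Call", "Short Inactivity", "Long Inactivity"])
print(final)
```

Keeping the transitions in one table makes it easy to compare the sketch against the specification and to add the missing resume events.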
Power Management Mode
• PowerPC 603
– Doze
• clocks running to data cache, snooping logic, and
time base/decrementer only
– Nap
• clocks running to time base/decrementer only
– Sleep
• all clocks stopped, no external input clock
• MIPS 4200
– Reduced power
• clocks at 1/4 of the bus clock frequency
• Hitachi SH7032
– Sleep
• CPU clocks stopped, peripherals remain clocked
– Standby
• all clocks stopped, peripherals initialized
Power Optimization
• Modeling and Technology
• Circuit Design Level
• Logic and Module Design Level
• Architecture and System Design Level
• Some Design Examples
– ARM7TDMI
Some Design Examples
• ARM7TDMI core
– size: 1 mm² @ 0.25 µm
– power:
• 0.181 W @ 33 MHz, 5 V
• 143 MIPS/W
– features
• 32-bit addressing
• 32x8 DSP multiplier
• 32-bit register bank and ALU
• 32-bit barrel shifter
– Thumb instruction set
• compressed 32-bit ARM instructions
• high code density

Processor   System (clock, supply)   Power (W)   MIPS/W
ARM7D       33 MHz, 5 V              0.165       185
ARM7TDMI    33 MHz, 5 V              0.181       143
PC403GA     40 MHz, 5 V              1           39
V810        25 MHz, 5 V              0.5         36
68349       25 MHz, 5 V              0.96        9
29200       16 MHz, 5 V              1.1         7
486DX       33 MHz, 5 V              4.5         6
i960SA      16 MHz, 5 V              1.25        4
Processor with Power Management
• Clock power management
– basic logical method
• gated clocking
– hardware method
• external pin + control register bit
– software method
• specific instructions + control register bit
Avoiding Wasteful Computation
• Preservation of data correlation
• Distributed computing / locality of reference
• Application-specific processing
• Demand-driven operation
• Transformation for memory size reduction
• Assume arrays A and C are already available in memory
• When A is consumed (i loop), another array B is generated; when C is
consumed (j loop), a scalar value D is produced.
• Memory size can be reduced by executing the j loop before the i loop,
so that C is consumed before B is generated and the same memory
space can be used for both arrays.
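The saving from reordering the two loops can be estimated with a simple liveness count. This is a minimal sketch under assumptions not stated on the slide: A, B and C each hold N words, D holds one word, and a loop's output is allocated while its input is still live (no in-place update). The names `peak_memory`, `I_LOOP` and `J_LOOP` are hypothetical.

```python
N = 1024  # hypothetical array length, in words
SIZES = {"A": N, "B": N, "C": N, "D": 1}

# Each loop is described as (arrays it produces, arrays it consumes).
I_LOOP = (("B",), ("A",))  # consumes A, generates B
J_LOOP = (("D",), ("C",))  # consumes C, produces scalar D


def peak_memory(schedule, sizes, initially_live):
    """Peak number of live words when the loops run in the given order."""
    live = set(initially_live)
    peak = sum(sizes[name] for name in live)
    for produces, consumes in schedule:
        live |= set(produces)  # outputs exist while inputs are still live
        peak = max(peak, sum(sizes[name] for name in live))
        live -= set(consumes)  # inputs are dead once the loop finishes
    return peak


peak_i_first = peak_memory([I_LOOP, J_LOOP], SIZES, {"A", "C"})
peak_j_first = peak_memory([J_LOOP, I_LOOP], SIZES, {"A", "C"})
print(peak_i_first, peak_j_first)
```

Running the j loop first frees C before B is allocated, so B can take over C's space and the peak drops from 3N to 2N + 1 words in this model.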
Architecture Lower Power Design
• Optimum supply voltage architecture through hardware
duplication (trading area for lower power) and/or pipelining
– fewer, more complex instructions require less encoding, but larger
decode logic!
• use a small set of complex instructions with a shorter instruction length
(e.g., Hitachi SH: 16-bit fixed-length instructions, arithmetic instructions
use only two operands; NEC V800: variable-length instructions, with
decoding overhead)
• Superscalar: CPI max->min
[Figure: plot against bit position (0 to 12) for the input sequence (-7, +7, -7, +7, …)]
Architecture Optimization
• Ordering of input signals
– the ordering of operations can result in reduced switching activity
– example
• multiplication with a constant: IN + (IN >> 7) + (IN >> 8)
– topology I: SUM1 = IN + (IN >> 7), SUM2 = SUM1 + (IN >> 8)
– topology II: SUM1 = (IN >> 7) + (IN >> 8), SUM2 = SUM1 + IN
• the output of the first adder has a small amplitude
-> lower switching activity
• switched 30% less
[Figure: transition activity (0.0 to 0.4) versus bit position (0 to 12) at SUM1 and SUM2, one plot per topology]
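The effect of operand ordering can be explored with a small bit-toggle count. This is a minimal sketch: the input is a hypothetical slowly varying sine, the word width is assumed to be 16 bits, and glitching inside the adders is ignored, so the numbers it prints are not the slide's measured 30%. Topology I adds IN and IN >> 7 first; topology II adds the two shifted terms first.

```python
import math

WIDTH = 16  # assumed word width in bits


def toggles(values, width=WIDTH):
    """Total number of bit flips between consecutive words."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))


def topology_1(x):
    """SUM1 = IN + (IN >> 7); SUM2 = SUM1 + (IN >> 8)."""
    sum1 = x + (x >> 7)
    return sum1, sum1 + (x >> 8)


def topology_2(x):
    """SUM1 = (IN >> 7) + (IN >> 8); SUM2 = SUM1 + IN."""
    sum1 = (x >> 7) + (x >> 8)
    return sum1, sum1 + x


# Hypothetical correlated input: a slowly varying, positive sine wave.
samples = [round(2048 + 1500 * math.sin(2 * math.pi * k / 200)) for k in range(1000)]

t1_sum1, t1_sum2 = zip(*(topology_1(x) for x in samples))
t2_sum1, t2_sum2 = zip(*(topology_2(x) for x in samples))

print("topology I :", toggles(t1_sum1), toggles(t1_sum2))
print("topology II:", toggles(t2_sum1), toggles(t2_sum2))
```

Both topologies produce the same SUM2; only the intermediate node SUM1 differs, and in topology II it carries a much smaller value that changes less often. The size of the saving depends on the statistics of the input data.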
Architecture Optimization
• Reducing glitching activity
– static designs can exhibit spurious transitions
• caused by the finite propagation delay from one logic block to the next
– important to balance all signal paths and reduce the logic depth
– multiple-input addition
• 4-input case: the chained implementation has 1.5 times the switching activity of the tree implementation
• 8-input case: the chained implementation has 2.5 times the switching activity of the tree implementation
[Figure: chained implementation ((A + B) + C) + D versus tree implementation (A + B) + (C + D)]
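Why the chained adder glitches more can be illustrated by comparing logic depth and input arrival skew. This is a minimal sketch under the assumption that every adder has one unit of delay and all primary inputs arrive at time 0; the skew total is only a rough proxy for glitching and does not reproduce the 1.5 and 2.5 factors quoted above.

```python
def chained(n_inputs):
    """Return (logic depth, total input skew) for ((x0 + x1) + x2) + ..."""
    arrival = 0  # arrival time of the running sum
    skew = 0
    for _ in range(n_inputs - 1):
        skew += arrival  # the other operand is a primary input at time 0
        arrival += 1
    return arrival, skew


def tree(n_inputs):
    """Return (logic depth, total input skew) for a balanced adder tree."""
    level = [0] * n_inputs  # arrival times of the signals at this level
    skew = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            skew += abs(a - b)
            nxt.append(max(a, b) + 1)
        if len(level) % 2:
            nxt.append(level[-1])  # odd signal passes to the next level
        level = nxt
    return level[0], skew


for n in (4, 8):
    print(n, "chained:", chained(n), "tree:", tree(n))
```

In the chain, each later adder sees one operand immediately and the other only after all earlier adders have settled, which is what produces spurious transitions; in the balanced tree both operands of every adder arrive together.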
Synchronous VS. Asynchronous SYSTEMS
• Synchronous system: a signal path starts from a clocked flip-flop,
passes through combinational gates, and ends at another clocked flip-flop.
The clock signals do not participate in computation but are required for
synchronization. As technology advances, systems tend to get bigger and
bigger, and as a result the delay on the clock wires can no longer be
ignored. Clock skew is thus becoming a bottleneck for many system
designers. Many gates switch unnecessarily just because they are
connected to the clock, and not because they have to process new inputs.
The biggest gate is the clock driver itself, which must switch on every
cycle.
• Asynchronous system (self-timed): an input signal (request) starts the
computation on a module, and an output signal (acknowledge) signifies
the completion of the computation and the availability of the requested
data. Asynchronous systems are potentially responsive to transitions on
any of their inputs at any time, since they have no clock with which to
sample their inputs.
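The request/acknowledge handshake can be sketched in software. This is a minimal sketch that assumes a four-phase (return-to-zero) protocol, which the slide does not specify; `SelfTimedModule`, `drive_request` and `transfer` are hypothetical names, and the "computation" is an ordinary function call.

```python
class SelfTimedModule:
    """Module that computes when request rises and signals completion with ack."""

    def __init__(self, func):
        self.func = func
        self.ack = False
        self.data_out = None

    def drive_request(self, req, data_in=None):
        """React to the request line and return the acknowledge line."""
        if req and not self.ack:
            # Rising request: compute, then raise acknowledge (data is valid).
            self.data_out = self.func(data_in)
            self.ack = True
        elif not req and self.ack:
            # Falling request: lower acknowledge, ready for the next transfer.
            self.ack = False
        return self.ack


def transfer(module, data_in):
    """One complete four-phase handshake; returns the computed result."""
    if not module.drive_request(True, data_in):
        raise RuntimeError("module did not acknowledge the request")
    result = module.data_out
    if module.drive_request(False):
        raise RuntimeError("module did not release acknowledge")
    return result


squarer = SelfTimedModule(lambda x: x * x)
result = transfer(squarer, 7)
print(result)
```

No clock appears anywhere: the module works only when a request arrives, which is the property the self-timed approach exploits to avoid unnecessary switching.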
Synchronous VS. Asynchronous SYSTEMS
• Asynchronous systems are more difficult to implement: they require
explicit synchronization between communicating blocks without clocks.
• If an asynchronous signal feeds directly into conventional gate-level
circuitry, invalid logic levels could propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs,
may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used because designers cannot find
the supporting design tools and methodologies they need.
• The error corrector of a DCC (Digital Compact Cassette) player uses 80%
less power than its synchronous counterpart.
• Asynchronous design offers more architectural options and freedom,
encourages distributed, localized control, and offers more freedom to
adapt the supply voltage.
Asynchronous Modules
Example: ABCS protocol
6% more logic
Control Synthesis Flow
Pipelined Self-Timed Microprocessor
Programming Style
Speed vs. Power Optimization