
Architectural Level Design

Shared by: qinmei liao; posted: 11/1/2011; pages: 38
L4: Architectural Level Design









Prof. Jun-Dong Cho, Sungkyunkwan University

http://vlsicad.skku.ac.kr

System-Level Solutions

• Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
• Temporal locality: shorter average variable lifetimes require less temporary storage, and data referenced in the recent past is likely to be accessed again.
• Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses).
• Architecture-driven voltage scaling: choose a more parallel architecture, so that the supply voltage can be lowered while throughput is maintained.
• Supply voltage scaling: lowering Vdd reduces energy but increases delay.
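The energy/delay trade-off in the last bullet can be sketched with the usual first-order CMOS models: switching energy proportional to C·Vdd², and gate delay proportional to Vdd/(Vdd − Vt)². These models, the threshold voltage of 0.7 V, and the function names are assumptions for illustration; none of them come from the slides.

```python
def switching_energy(c_eff, vdd):
    """Energy per switching event, first-order model E = C * Vdd^2."""
    return c_eff * vdd ** 2


def relative_delay(vdd, vt=0.7):
    """Relative gate delay, first-order model Vdd / (Vdd - Vt)^2.

    vt = 0.7 V is an assumed threshold voltage, not a value from the slides.
    """
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return vdd / (vdd - vt) ** 2


if __name__ == "__main__":
    for vdd in (5.0, 3.3, 2.5):
        e = switching_energy(1.0, vdd) / switching_energy(1.0, 5.0)
        d = relative_delay(vdd) / relative_delay(5.0)
        print(f"Vdd={vdd:.1f} V  energy x{e:.2f}  delay x{d:.2f}")
```

Under these assumptions, halving Vdd cuts the switching energy to a quarter while the delay grows, which is the gap a more parallel architecture is meant to absorb.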

Software Power Issues

Up to 40% of the on-chip power is dissipated on the buses!

• System software: OS, BIOS, compilers
• Software can affect energy consumption at various levels

Inter-Instruction Effects

• The energy cost of an instruction varies depending on the previous instruction
• For example, for the sequence XOR BX, 1; ADD AX, DX:
  – I_est = (319.2 + 313.6) / 2 = 316.4 mA, I_obs = 323.2 mA
• The difference (6.8 mA) is defined as the circuit state overhead
• The overhead needs to be specified as a function of pairs of instructions
• Other inter-instruction effects are due to pipeline stalls and cache misses
• Instruction reordering can improve the cache hit ratio
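The arithmetic of the XOR/ADD example can be checked with a small sketch. It assumes each instruction takes one cycle and that the two instructions alternate in a loop, so every instruction is preceded by the other one; the function names and the pair-overhead table are hypothetical.

```python
BASE_COST_MA = {"XOR BX,1": 319.2, "ADD AX,DX": 313.6}  # base currents from the slide


def estimate_base(loop, base_cost):
    """Average current of a loop from per-instruction base costs only."""
    return sum(base_cost[ins] for ins in loop) / len(loop)


def estimate_with_overhead(loop, base_cost, pair_overhead):
    """Add a circuit state overhead for each adjacent pair (loop wraps around)."""
    total = 0.0
    for k, ins in enumerate(loop):
        prev = loop[k - 1]  # k = 0 pairs with the last instruction of the loop
        total += base_cost[ins] + pair_overhead.get(frozenset((prev, ins)), 0.0)
    return total / len(loop)


loop = ["XOR BX,1", "ADD AX,DX"]
i_est = estimate_base(loop, BASE_COST_MA)
overhead = 323.2 - i_est  # observed minus estimated current
pairs = {frozenset(loop): overhead}
i_refined = estimate_with_overhead(loop, BASE_COST_MA, pairs)
```

With the pair overhead included, the refined estimate reproduces the observed current for this loop, which is why the overhead has to be tabulated per instruction pair.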

Software Power Optimization

• Instruction packing
  – reduces cache misses, which carry a high power penalty
  – example
    • Fujitsu DSP
    • permits an ALU operation and a memory data transfer to be packed
• Instruction ordering
  – attempts to minimize the energy associated with the circuit state effect
  – reorders instructions to minimize the total power for a given cost table
• Operand swapping
  – minimizes the activity associated with the operand
  – attempts to swap the operands to the ALU or FPU

Software Power Optimization

• Minimizing memory access costs
  – minimizes the number of memory accesses required
  – example: before and after for a code fragment

    Before:
      FOR i := 1 TO N DO
        B[i] = f(A[i]);
      END_FOR;
      FOR i := 1 TO N DO
        C[i] = g(B[i]);
      END_FOR;

    After:
      FOR i := 1 TO N DO
        B[i] = f(A[i]);
        C[i] = g(B[i]);
      END_FOR;

• Memory bank assignment
  – formulated as a graph partitioning problem
  – each group corresponds to a memory bank
  – the optimum code sequence can vary by an algorithm using dual loads

[Figure: access graph over the variables a, b, c, d, e, and the partitioned access graph that assigns them to Bank A and Bank B]
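A minimal sketch of bank assignment as graph partitioning follows. It assumes that an edge between two variables means they could be fetched together by a dual load, so the goal is to place as much edge weight as possible across the two banks. That interpretation, the edge weights, and the function name are assumptions; the slide's actual access graph is not recoverable. The search is brute force, which is only reasonable for a handful of variables.

```python
from itertools import product


def assign_banks(variables, edges):
    """Return (bank_of, cut_weight) maximizing edge weight between two banks.

    edges maps (u, v) to a hypothetical count of accesses that could pair up.
    """
    best_bank, best_cut = None, -1
    for bits in product((0, 1), repeat=len(variables)):
        bank_of = dict(zip(variables, bits))
        cut = sum(w for (u, v), w in edges.items() if bank_of[u] != bank_of[v])
        if cut > best_cut:
            best_bank, best_cut = bank_of, cut
    return best_bank, best_cut


variables = ["a", "b", "c", "d", "e"]
edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 1, ("c", "e"): 1, ("d", "e"): 1}
bank_of, cut = assign_banks(variables, edges)
```

Because the example graph contains an odd cycle, one of the light edges necessarily stays inside a bank; the heavy a-b edge ends up split across banks.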

Power Management Mode

• APM system
• Supports power management
  – easy control for applications and OS
• APM: Advanced Power Management
  – power states
    • Full On
    • APM Enabled
    • APM Standby
    • APM Suspend
    • Off

[Figure: APM system layers. APM-aware applications and APM-aware device drivers sit above the operating system and its APM driver (OS dependent); below them are the BIOS and APM BIOS (OS independent), which control the APM BIOS controlled hardware and the add-in devices.]

Power Management Mode

• APM state transitions

[Figure: APM state transition diagram. States, from top to bottom: Full On, APM Enabled, APM Standby, APM Suspend (hibernation), Off. Transition labels: APM Enable / Enable Call; APM Disable / Disable Call; Short Inactivity / Standby Call; Long Inactivity / Suspend Interrupt / Suspend Call; Off Switch / Off Call; On Switch. Axes: device responsiveness decreases toward Off, power usage increases toward Full On; the APM Enabled, APM Standby, and APM Suspend states are power managed.]
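The state diagram can be sketched as a table-driven state machine. The event names are the transition labels from the slide, but the source and destination of each transition are an assumed reading of the flattened figure, and wake-up transitions are omitted because their labels did not survive. Unknown events leave the state unchanged, which is also an assumption.

```python
# Assumed reading of the diagram: (state, event) -> next state.
TRANSITIONS = {
    ("Off", "On Switch"): "Full On",
    ("Full On", "APM Enable"): "APM Enabled",
    ("Full On", "Enable Call"): "APM Enabled",
    ("APM Enabled", "APM Disable"): "Full On",
    ("APM Enabled", "Disable Call"): "Full On",
    ("APM Enabled", "Short Inactivity"): "APM Standby",
    ("APM Enabled", "Standby Call"): "APM Standby",
}
for state in ("APM Enabled", "APM Standby"):
    for event in ("Long Inactivity", "Suspend Interrupt", "Suspend Call"):
        TRANSITIONS[(state, event)] = "APM Suspend"
for state in ("Full On", "APM Enabled", "APM Standby", "APM Suspend"):
    for event in ("Off Switch", "Off Call"):
        TRANSITIONS[(state, event)] = "Off"


def step(state, event):
    """Return the next state; events with no entry are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="Off"):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["On Switch", "Enable Call", "Standby Call"])` walks from Off through Full On and APM Enabled into APM Standby under this reading.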

Power Management Mode

• PowerPC 603
  – Doze
    • clocks running to data cache, snooping logic, and time base/decrementer only
  – Nap
    • clocks running to time base/decrementer only
  – Sleep
    • all clocks stopped, no external input clock
• MIPS 4200
  – Reduced power
    • clocks at 1/4 bus clock frequency
• Hitachi SH7032
  – Sleep
    • CPU clocks stopped, peripherals remain clocked
  – Standby
    • all clocks stopped
    • peripherals initialized

Power Optimization

• Modeling and Technology

• Circuit Design Level

• Logic and Module Design Level

• Architecture and System Design Level

• Some Design Examples

– ARM7TDMI

Some Design Examples

• ARM7TDMI core
  – size: 1 mm2 @ 0.25 um
  – power:
    • 0.181 W @ 33 MHz, 5 V
    • 143 MIPS/W
  – features
    • 32-bit addressing
    • 32x8 DSP multiplier
    • 32-bit register bank and ALU
    • 32-bit barrel shifter
  – Thumb instruction set
    • compressed 32-bit ARM instructions
    • high code density

Processor   System        Power (W)   MIPS/W
ARM7D       33 MHz, 5 V   0.165       185
ARM7TDMI    33 MHz, 5 V   0.181       143
PC403GA     40 MHz, 5 V   1           39
V810        25 MHz, 5 V   0.5         36
68349       25 MHz, 5 V   0.96        9
29200       16 MHz, 5 V   1.1         7
486DX       33 MHz, 5 V   4.5         6
i960SA      16 MHz, 5 V   1.25        4

Processor with Power

Management

• Clock power management

– basic logical method

• gated clocking

– hardware method

• external pin + control register bit

– software method

• specific instructions + control register bit
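Gated clocking with a control register bit can be modeled behaviorally as an AND of the clock with an enable bit. The class and method names below are hypothetical; the sketch only counts how many clock edges reach the gated module, as a rough proxy for its clock-related switching.

```python
class GatedClock:
    """Behavioral model: gated_clk = clk AND enable_bit."""

    def __init__(self):
        self.enable_bit = True      # control register bit (assumed to reset to enabled)
        self.edges_delivered = 0    # clock edges that reached the module

    def write_control(self, enable):
        """Software method: an instruction writes the control register bit."""
        self.enable_bit = bool(enable)

    def tick(self):
        """One source clock edge; it propagates only while enabled."""
        if self.enable_bit:
            self.edges_delivered += 1


gate = GatedClock()
for _ in range(10):
    gate.tick()
gate.write_control(False)   # module idle: stop its clock
for _ in range(5):
    gate.tick()
gate.write_control(True)
for _ in range(3):
    gate.tick()
```

Of the 18 source clock edges in this run, only the 13 issued while the bit was set reach the module.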

Avoiding Wasteful Computation



• Preservation of data correlation

• Distributed computing / locality of reference

• Application-specific processing

• Demand-driven operation

• Transformation for memory size reduction

  – Consider arrays A and C that are already available in memory.
  – When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced.
  – Memory size can be reduced by executing the j loop (which consumes C) before the i loop (which generates B), so that C is consumed before B is generated and the same memory space can be used for both arrays.
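The effect of the loop order on peak storage can be sketched by tracking which arrays are live. The sketch assumes each array holds n words, that B cannot overwrite A in place, and that an array's space is released as soon as its consuming loop finishes; the function name and the bookkeeping are illustrative only.

```python
def peak_words(order, n):
    """Peak number of live words for a given loop order.

    The 'i' loop consumes A and generates B; the 'j' loop consumes C and
    produces the scalar D. A and C are live at the start.
    """
    live = {"A": n, "C": n}
    peak = sum(live.values())
    for loop in order:
        if loop == "i":
            live["B"] = n                       # B is written while A is still read
            peak = max(peak, sum(live.values()))
            del live["A"]                       # A fully consumed
        elif loop == "j":
            live["D"] = 1                       # scalar result
            peak = max(peak, sum(live.values()))
            del live["C"]                       # C fully consumed, space reusable
        else:
            raise ValueError(f"unknown loop {loop!r}")
    return peak
```

With n = 100, running the i loop first needs room for A, B, and C at once (300 words), while running the j loop first frees C before B is generated (201 words).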


Architecture Lower Power Design



• Optimum supply voltage architecture through hardware duplication (trading area for lower power) and/or pipelining
  – Fewer, more complex instructions require less encoding but larger decode logic!
    • Use compact complex instructions with a shorter instruction length (e.g., Hitachi SH: 16-bit fixed-length instructions, arithmetic instructions use only two operands; NEC V800: variable-length instructions, with decoding overhead)
• Superscalar: CPI max -> min

[Figure: plot over bit position (0 to 12) for the input sequence (-7, +7, -7, +7, …); the values 5.28, 2.46, and 0.0 appear as labels.]

Architecture Optimization

• Ordering of input signals
  – the ordering of operations can result in reduced switching activity
  – example
    • multiplication with a constant: IN + (IN >> 7) + (IN >> 8)
  – topology II
    • the output of the first adder has a small amplitude -> lower switching activity
    • switched 30% less

[Figure: topology I. SUM1 = IN + (IN >> 7), SUM2 = SUM1 + (IN >> 8). Plot of transition activity (0.0 to 0.4) versus bit position (0 to 12) for SUM1 and SUM2.]

[Figure: topology II. SUM1 = (IN >> 7) + (IN >> 8), SUM2 = SUM1 + IN. Plot of transition activity (0.0 to 0.4) versus bit position (0 to 12) for SUM1 and SUM2.]
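The two adder orderings can be compared by counting bit toggles on the adder outputs. This is a rough sketch, not the experiment behind the slide: it assumes unsigned 12-bit pseudo-random inputs, 13-bit adder outputs (bit positions 0 to 12), and toggle counts as a stand-in for switching activity, so it is not expected to reproduce the 30% figure.

```python
import random

WIDTH = 13  # bit positions 0..12


def toggles(values, width=WIDTH):
    """Total number of bit flips between consecutive values."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))


def topology_1(x):
    sum1 = x + (x >> 7)           # large-amplitude intermediate result
    return sum1, sum1 + (x >> 8)


def topology_2(x):
    sum1 = (x >> 7) + (x >> 8)    # small-amplitude intermediate result
    return sum1, sum1 + x


def total_activity(topology, samples):
    """Toggles on SUM1 plus toggles on SUM2 over the sample stream."""
    outs = [topology(x) for x in samples]
    return toggles([o[0] for o in outs]) + toggles([o[1] for o in outs])


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.randrange(1 << 12) for _ in range(2000)]
    print(total_activity(topology_1, samples), total_activity(topology_2, samples))
```

Both orderings produce the same SUM2; only the intermediate SUM1 differs, and in topology II its upper bits stay at zero for these inputs, so fewer bits toggle.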



Architecture Optimization

• Reducing glitching activity
  – static designs can exhibit spurious transitions
    • caused by the finite propagation delay from one logic block to the next
  – it is important to balance all signal paths and reduce the logic depth
  – multiple-input addition: activity of the chained implementation
    • 4-input case: 1.5 times larger than the tree implementation
    • 8-input case: 2.5 times larger than the tree implementation

[Figure: chained implementation, ((A + B) + C) + D, and tree implementation, (A + B) + (C + D), of a four-input addition]
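The path imbalance behind the glitching can be illustrated by computing the logic depth and the worst input arrival skew of each structure. The sketch assumes a unit delay per adder and all primary inputs arriving at time 0; it does not model glitches themselves, and the function names are hypothetical.

```python
def chained(n):
    """Return (depth, max_skew) for ((x1 + x2) + x3) + ... with n inputs."""
    t, max_skew = 0, 0
    for _ in range(n - 1):
        max_skew = max(max_skew, t)  # running sum arrives at t, fresh input at 0
        t += 1
    return t, max_skew


def tree(n):
    """Return (depth, max_skew) for a balanced pairwise adder tree."""
    times, max_skew = [0] * n, 0
    while len(times) > 1:
        nxt = []
        for k in range(0, len(times) - 1, 2):
            a, b = times[k], times[k + 1]
            max_skew = max(max_skew, abs(a - b))
            nxt.append(max(a, b) + 1)
        if len(times) % 2:           # odd element passes through to the next level
            nxt.append(times[-1])
        times = nxt
    return times[0], max_skew
```

For four inputs the chain has depth 3 and a worst skew of 2 adder delays, while the tree has depth 2 with balanced inputs; the gap widens for eight inputs.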

Synchronous vs. Asynchronous Systems



• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in the computation but are required for synchronization. As technology advances, systems grow larger, and the delay on the clock wires can no longer be ignored; clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs. The biggest gate is the clock driver itself, which must switch every cycle.

• Asynchronous system (self-timed): an input signal (request) starts the computation on a module, and an output signal (acknowledge) signals the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.

Synchronous vs. Asynchronous Systems



• Asynchronous designs are more difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
• If a signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used because designers cannot find the supporting design tools and methodologies they need.
• The error corrector of a DCC (Digital Compact Cassette) player consumes 80% less power than its synchronous counterpart.
• Asynchronous design offers more architectural options and freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage.

Asynchronous Modules

Example: ABCS protocol



6% more logic

Control Synthesis Flow

Pipelined Self-Timed Microprocessor

Programming Style

Speed vs. Power Optimization


