```
Random Number Generator
May 1, 2006

Dmitriy Solmonov W1-1
David Levitt W1-2
Jesse Guss W1-3
Sirisha Pillalamarri W1-4
Matt Russo W1-5

Design Manager – Thiago Hersan

Why Random Numbers?
• Real-Time Simulations
• Encryption
• Gambling

Encryption
• Need random numbers for authentication
• Key generation
• Software vs. Hardware
  – Less power/time per number
  – Portable

Gambling
• ePoker Rooms
• SoC Deck Generation
• Other future casino games

• Potential markets
  – Defense and Intelligence Organizations
  – E-Gambling / Casinos
  – Game Consoles
  – Mobile Communication
• Our design will be part of a larger ASIC or GPP design

IBAA Algorithm
• Similar in design to the RC4 stream cipher
  – Cryptographically secure
  – Deterministic
• Generates a 1024-bit number per run (32 words × 32 bits)
• Internally updated seed
  – Not user-visible, which keeps the internal state secret

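The IBAA Algorithm
#define ALPHA (8)
#define SIZE (1<<ALPHA)                   /* 256 loop iterations per run */
#define ind(x) ((x)&(0x1F))               /* wrap an index into the 32-word memories */
#define barrel(a) (((a)<<19)^((a)>>13))   /* barrel shift of the accumulator */
uint32 A, B, Y, X;                        /* accumulator, previous result, temporaries */
uint32 M[32], R[32];                      /* M: internal state, R: results */
…
for ( i=0; i<SIZE; i++ ) {
    X = M[ind(i)];
    A = barrel(A) + M[ind(i+16)];
    M[ind(i)] = Y = M[ind(X)] + A + B;
    R[ind(i)] = B = M[ind(Y>>ALPHA)] + X;
}
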
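/*
 * Sketch only, not the team's C reference model: the loop above wrapped in
 * compilable C so that software output could be compared against hardware.
 * Assumptions (not in the slides): the names ibaa_state, ibaa_step and
 * ibaa_run, uint32_t from <stdint.h> standing in for uint32, and the caller
 * seeding M, A and B before the first run.
 */
#include <stdint.h>

#define ALPHA 8
#define SIZE (1u << ALPHA)                      /* 256 iterations per run */
#define ind(x) ((x) & 0x1Fu)                    /* index into the 32-word memories */
#define barrel(a) (((a) << 19) ^ ((a) >> 13))   /* barrel shift from the slide */

typedef struct {
    uint32_t A, B;     /* accumulator and previous result */
    uint32_t M[32];    /* internal state, seeded by the caller */
    uint32_t R[32];    /* results: 32 words = 1024 bits */
} ibaa_state;

/* One loop iteration, statement for statement as on the slide. */
static void ibaa_step(ibaa_state *s, uint32_t i)
{
    uint32_t X = s->M[ind(i)];
    s->A = barrel(s->A) + s->M[ind(i + 16)];
    uint32_t Y = s->M[ind(X)] + s->A + s->B;
    s->M[ind(i)] = Y;
    s->B = s->M[ind(Y >> ALPHA)] + X;
    s->R[ind(i)] = s->B;
}

/* One full run: SIZE iterations, leaving the newest 32 results in R. */
static void ibaa_run(ibaa_state *s)
{
    for (uint32_t i = 0; i < SIZE; i++)
        ibaa_step(s, i);
}
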
Architecture

IBAA Algorithm to Architecture
Per iteration: 4 reads from M, 1 write to M, 1 write to R
for ( i=0; i<SIZE; i++ ) {
    X = M[ind(i)];
    A = barrel(A) + M[ind(i+16)];
    M[ind(i)] = Y = M[ind(X)] + A + B;
    R[ind(i)] = B = M[ind(Y>>ALPHA)] + X;
}
Dependencies, feedback, and RAW hazards

Algorithm to Architecture
• Hardware Limits
  – Max. of 2 simultaneous reads from memory
• Can’t do better than two stages
• Each stage must take multiple cycles to complete

Algorithm to Architecture
• Chosen Timing
  – Memory Read = 0.5 cycles
  – Memory is clocked ½ period off phase
• When forwarding is applied, need 4 cycles per stage

Stage 1
--------------------------------------
M1 = M[i+16]
--------------------------------------
X = M[i] | A = M1 + barrel(A)
--------------------------------------
M3 = M[X] | C1 = (X==i-1)
--------------------------------------
Y1 = A + ((C1) ? Y : M3)

Stage 2
------------------------------------
Y = B + Y1
------------------------------------
------------------------------------
B = X + ((C2) ? Y : M4)
------------------------------------
M[i] = Y | R[i] = B

[Figure: datapath with SRAM (M), SRAM (R), an adder, registers X, M1, M2, M3, M4, A, B, Y, Y1, and control logic (FSM, two counters, register)]

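/*
 * Sketch of the forwarding step in Stage 1 above, under assumptions: when
 * the address X matches the word that iteration i-1 is still writing, the
 * SRAM read M3 would be stale, so the pending Y is used instead. The
 * function names and the 5-bit wrap of both sides of the comparison are
 * illustrative and not taken from the slides.
 */
#include <stdint.h>

/* C1 from the slide: does X address the word written by iteration i-1? */
static int forward_c1(uint32_t X, uint32_t i)
{
    return (X & 0x1Fu) == ((i - 1u) & 0x1Fu);
}

/* Y1 = A + (C1 ? Y : M3): use the forwarded Y or the value read from SRAM. */
static uint32_t stage1_y1(uint32_t A, uint32_t Y, uint32_t M3, int c1)
{
    return A + (c1 ? Y : M3);
}
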
Design For Manufacture
Regular Fabrics

Why DFM?
• Ability to print on smaller processes
• Robust Manufacturability
• Sacrifice area, speed and metal layers for a regular design

Regular Fabrics
Sample Layout: [figure]

Lithography Simulations
[figure]

Hardware
• Four adders execute 256 times.
• Fast and low power.

[Figure: 32-bit adder built from four blocks, CS4 (bits 31:28), CS18 (bits 27:10), CS6 (bits 9:4), CS4 (bits 3:0); inputs A and B, sum S, carries C’[4], C[10], C’[28], C[32]]

• Delay: 1.56 ns
• Energy consumption (worst-case switching): 12.4 pJ
• Power dissipation (estimated with our switching factor): 148 μW

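/*
 * Behavioral sketch of a 32-bit adder split into the four blocks shown
 * above (bits 3:0, 9:4, 27:10, 31:28). Assumptions: "CS" is read here as
 * carry-skip, which the slides do not spell out, and the names cs_block
 * and cs_add32 are illustrative. This models function only, not the
 * team's circuit or its timing.
 */
#include <stdint.h>

/* Add one block. When every bit position propagates, the carry-in bypasses
 * the block (skip path); otherwise the rippled carry-out is used. */
static uint32_t cs_block(uint32_t a, uint32_t b, unsigned lo, unsigned width,
                         unsigned *carry)
{
    uint32_t mask = (1u << width) - 1u;
    uint32_t x = (a >> lo) & mask;
    uint32_t y = (b >> lo) & mask;
    uint32_t sum = x + y + *carry;
    int all_propagate = ((x ^ y) == mask);
    unsigned ripple_out = (sum >> width) & 1u;
    *carry = all_propagate ? *carry : ripple_out;
    return (sum & mask) << lo;
}

/* 32-bit sum from blocks of 4, 6, 18 and 4 bits; *carry_out receives C[32]. */
static uint32_t cs_add32(uint32_t a, uint32_t b, unsigned *carry_out)
{
    static const unsigned lo[4]    = { 0, 4, 10, 28 };
    static const unsigned width[4] = { 4, 6, 18, 4 };
    unsigned carry = 0;
    uint32_t s = 0;
    for (int k = 0; k < 4; k++)
        s |= cs_block(a, b, lo[k], width[k], &carry);
    *carry_out = carry;
    return s;
}
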
SRAM
• Single Bus Cell
• Double Bus Cell

Functional Verification
• Structural Verilog vs. C Code:
  – Generate numbers under equal load conditions
  – Compare numbers
• Schematic vs. Structural Verilog
  – Under equal inputs, check if port outputs match
• LVS

Verification
• Schematic and extracted-parasitic SPICE simulations of major blocks
  – Check for clean signals
  – Check delays and rise/fall times
• Extracted-parasitic simulation of critical register-to-register path
  – Signals are clean
  – Delay = 2.1 ns
• Extracted-parasitic simulation of chip clock distribution

Critical Delay
[figure]

Final Layout
[figure]

Poly Density:   7.52%
Metal1 Density: 20.85%
Metal2 Density: 19.89%
Metal3 Density: 18.76%
Metal4 Density: 9.36%
Metal5 Density: 6.8%

Analysis

Specifications
• Pins
  – 36 input pins
    • 32-bit seed input, gen, read, rst, clk
  – 34 output pins
    • 32-bit random output, rdy, done
  – 2 input/output pins
    • vdd, gnd
• 475 MHz chip speed
• 436 kHz throughput

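/*
 * Worked arithmetic for the two speed figures above, as a sketch.
 * Assumptions: one throughput event is one full run of the generator, each
 * run delivers the 1024-bit result, and one loop iteration completes every
 * 4 cycles. The function names are illustrative. The gap between the two
 * cycle counts (about 65 cycles) is not broken down in the slides.
 */
#include <stdint.h>

/* Clock cycles available per run: 475 MHz / 436 kHz is about 1089. */
static uint64_t cycles_per_run(uint64_t clock_hz, uint64_t runs_per_second)
{
    return clock_hz / runs_per_second;
}

/* Cycles spent in the loop itself: 256 iterations at 4 cycles each = 1024. */
static uint64_t loop_cycles(uint64_t iterations, uint64_t cycles_per_iteration)
{
    return iterations * cycles_per_iteration;
}

/* Random bits per second if every run delivers bits_per_run bits. */
static uint64_t bits_per_second(uint64_t runs_per_second, uint64_t bits_per_run)
{
    return runs_per_second * bits_per_run;
}
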
Putting it All Together

Part         Trans Count   Area (μm²)   Density   Prop Delay (ns)   Power (1x) (mW)   Power (Avg) (mW)
                                                                    at 500 MHz        at 475 MHz
Adders (4)   5,856         25,200       0.232     1.45 / 1.56       0.60 / 0.62       0.14 / 0.148
  each       1,464         6,300
SRAM (M&R)   17,736        51,000       0.348     0.735 / 0.845     W: 0.51 / 3.25    0.27 / 1.86
                                                                    R: 0.19 / 1.40
  M          10,458        35,000       0.293
  R          7,278         16,000       0.456
Regs (10)    6,400         38,400       0.167     0.220 / 0.275     0.53 / 0.59       0.13 / 0.145
  each       640           3,840
Total        33,371        182,000      0.194     2.1 (475 MHz)     -----             4.1

Paired values are listed as Schematic / ExtractRC.

Performance Comparison

Operation (~4,000,000 runs)          Time (ms)
Intel P4 3.20 GHz (90 nm)            5000
W1-2006 475 MHz (180 nm)             9000
AMD Opteron Blade 1.005 GHz ()       14000
ARM Intel XScale 700 MHz ()          125000

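/*
 * Sketch: cross-check of the W1-2006 row against the 436 kHz throughput
 * quoted in the specifications, assuming one "run" in the table is one full
 * generator run. The function name is illustrative.
 */
#include <stdint.h>

/* Milliseconds needed for `runs` runs at `runs_per_second`. */
static uint64_t time_ms_for_runs(uint64_t runs, uint64_t runs_per_second)
{
    return runs * 1000u / runs_per_second;
}

/* time_ms_for_runs(4000000, 436000) gives 9174, in line with the listed 9000 ms. */
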
Where to Now?
• ERC, tapeout, etc.
• Thermal noise unit to use as input seed
• On-Chip Bus Interface
• HyperTransport™ Interface

References
• Jenkins, Robert J. “ISAAC”. http://burtleburtle.net/bob/rand/isaac.html
• Chirca, Schulte, Glossner, et al. “A Static Low-Power, High-Performance 32-bit Carry
  http://mesa.ece.wisc.edu/publications/cp_2004-12.pdf