ECE8405 Class Notes CHAPTER 5 - CPU IMPLEMENTATION
The CPU consists of the datapath, control, cache, and I/O peripherals
and interfaces. This week, we will consider the first two, which form the
backbone of the CPU. Today we will consider a non-pipelined
implementation of the datapath.
R-FORMAT: op(6) rs(5) rt(5) rd(5) shift(5) arith_op(6).
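The field layout above can be checked with a short sketch. This is only an illustration in Python; the function name `decode_r_format` and the dictionary keys are made up for these notes, and the bit positions assume the standard MIPS ordering (op in bits 31-26 down to arith_op in bits 5-0).

```python
def decode_r_format(instr: int) -> dict:
    """Split a 32-bit R-format word into the six fields listed above."""
    return {
        "op":       (instr >> 26) & 0x3F,  # bits 31-26
        "rs":       (instr >> 21) & 0x1F,  # bits 25-21
        "rt":       (instr >> 16) & 0x1F,  # bits 20-16
        "rd":       (instr >> 11) & 0x1F,  # bits 15-11
        "shift":    (instr >> 6) & 0x1F,   # bits 10-6
        "arith_op": instr & 0x3F,          # bits 5-0
    }

# add $t0, $t1, $t2 encodes as 0x012A4020
fields = decode_r_format(0x012A4020)
print(fields)
```

In hardware this "decode" is just wiring: each field is a fixed slice of the IR outputs.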
Let’s do it from the perspective of the instruction types. Starting with a
register-type instruction, what elements do we need in our datapath?
[Figure: R-format datapath with IR, Registers, and ALU; control signals regWrite and ALU_op.]
Instruction Register (IR) stores the instruction currently being executed. The
operation can be almost completely combinatorial- updating the IR starts
executing the instruction, the addresses flow to the register, data to the ALU,
the ALU’s result flows back to the registers and is written.
The only problem: if dest_write_enable is asserted before the address of the
destination has settled, the wrong location can be written spuriously. Usually
this is taken care of by enabling the write only during the latter half of the
clock cycle during which the operation is executed.
Where does the instruction come from? Need circuitry to update addresses
and read instruction (see next page).
The driving force here is the clock, which allows the PC to update, which
causes the next instruction to be read and placed into the IR. We can assume
for now that the clock period is long enough to fetch the instruction AND
execute it.
[Figure: instruction fetch circuit with ADDER, PC, Program memory, and IR.]
Missing is the control circuitry, which decodes the op bits (IR26-IR31) and,
combined with the low ALUop bits (IR0-IR5), derives control bits for the
ALU and registers. Note that the shift bits are routed to the ALU also, in
case the op is a shift.
I-Format instructions require us to make quite a few changes. First, note that
each I-format instruction type (immediate, load/store, and branch) requires
different hardware features. Let’s treat them separately.
Immediate (op6, rs5, rd5, immed16)
To treat an immediate arithmetic operation, we have two problems:
1) The destination reg address rd is in different bits from R-format
2) The ALU needs to operate on the immediate data from the IR
To solve the first, we need to use a mux to select whether rd comes from bits
11-15 or 16-20 of the IR. In any case, bits 16-20 can go to rt, because a read
can be done whether or not the data is used!
To solve the second, another mux can be used on the lower ALU data input
to select whether the data source is the register read rt or the instruction. If
the data is in the instruction, a sign extension operation is required (copy MS
bit) to make the data 32 bit with the correct sign.
[Figure: immediate datapath with sign extension of IR bits 0-15; control signals regDst, regWrite, ALUsrc.]
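The two muxes and the sign extension can be sketched in a few lines of Python. The function names are hypothetical, and the select polarity is an assumption taken from the exercise solutions later in these notes (for addi: RegDst=0, ALUSrc=1).

```python
def sign_extend16(imm: int) -> int:
    """Copy bit 15 into bits 16-31 to widen a 16-bit immediate to 32 bits."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if (imm & 0x8000) else imm

def mux(select: int, in0: int, in1: int) -> int:
    """Two-input multiplexer: select=0 passes in0, select=1 passes in1."""
    return in1 if select else in0

def alu_operand_b(instr: int, reg_rt_value: int, alu_src: int) -> int:
    # Assumed polarity: alu_src=1 picks the sign-extended immediate
    return mux(alu_src, reg_rt_value, sign_extend16(instr & 0xFFFF))

def dest_register(instr: int, reg_dst: int) -> int:
    # Assumed polarity: reg_dst=1 picks bits 15-11, reg_dst=0 picks bits 20-16
    return mux(reg_dst, (instr >> 16) & 0x1F, (instr >> 11) & 0x1F)

# addi $t0, $t1, -4 encodes as 0x2128FFFC
print(hex(alu_operand_b(0x2128FFFC, 123, alu_src=1)))
print(dest_register(0x2128FFFC, reg_dst=0))
```

Note that the register read on bits 16-20 still happens for the immediate instruction; the mux simply ignores the value.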
To perform load/store, a few more blocks need to be added. The ALU is
used to perform address calculations, using exactly the circuit above. The
only difference is that data memory must be included in our system. Thus,
we need another mux to select whether the data returned to the register is
from memory or from ALU:
Note that two memory accesses are required- instruction fetch and data read
or write. If all of this is done in one clock cycle, then the clock rate will
have to be very low (1-10MHz). This is truly the limiting factor on clock rate!
To do a branch, consider what is required: two registers are compared, and
the resulting “flag” is used to make the decision whether or not to branch.
The ALU is used to make the comparison; it can’t be used for adding the
offset to the PC. A second adder is needed. A mux serves to decide whether
the PC is updated with the branch or no-branch address:
[Figure: branch datapath with ADDER, PC, MEMORY, and IR; offset path through shift left 2; control signal Branch.]
J-FORMAT (op6, addr26)
The last instruction type is the jump (j, jr, jal). J is easy: we just need to
shift the 26-bit address left by 2, concatenate it with the upper four bits of
the PC, and add another input to the MUX. The others require a little more
thought, and will be left as an exercise.
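A small sketch of the next-PC logic for branches and for j. The function names are illustrative only; the branch is assumed to be taken when Branch AND Zero are both set, and the jump target is formed as in the multicycle step list later in these notes (upper 4 PC bits joined with the shifted 26-bit address).

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm: int) -> int:
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if (imm & 0x8000) else imm

def branch_target(pc: int, instr: int) -> int:
    """Second adder: PC+4 plus the sign-extended offset shifted left by 2."""
    offset = sign_extend16(instr & 0xFFFF) << 2
    return (pc + 4 + offset) & MASK32

def next_pc(pc: int, instr: int, branch: int, zero: int) -> int:
    """PC mux: take the branch address only when Branch AND Zero are set."""
    if branch and zero:
        return branch_target(pc, instr)
    return (pc + 4) & MASK32

def jump_target(pc: int, instr: int) -> int:
    """Upper 4 bits of PC+4 joined with the 26-bit address shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((instr & 0x03FFFFFF) << 2)

# beq $0, $0, -1 at address 0x1000 branches back to itself
print(hex(next_pc(0x1000, 0x1000FFFF, branch=1, zero=1)))
```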
CONTROL (p 360 in text)
Note that there are remarkably few control lines in this implementation. Fig
5.19 on page 360 shows the control lines for the jump-less MIPS:
RegDst – allows mux to select register address from IR 11-15 or 16-20.
Branch – indicates that the instruction is a branch (enables offset add)
MemRead – allows the data memory to perform a read operation
MemToReg – controls the mux selecting readData or ALUdata
ALUop – multiple bits that may specify the ALU operation
MemWrite – specifies a data memory write
ALUSrc – specifies what is input to the lower ALU input
RegWrite – enables writing a result to register
In MIPS, the value 000000 was selected as the op (highest 6 bits) to indicate
an R-type instruction, which defers ALU control to the lowest 6 bits. Let’s
continue with that, since it’s easy to detect a zero word using just an OR
gate.
The book’s simplified ALU instruction set includes only 5 functions:
000=AND, 001=OR, 010=add, 110=subtract and 111=set on less than. Note
that in this case only 3 control inputs are needed. Let’s expand on that to
implement the instruction set given on the inside of the back cover, but:
We’ll skip the jump instructions for now.
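The five-function ALU can be modelled directly from the 3-bit codes listed above. This is a behavioural sketch only; the names `alu` and `to_signed` are made up here, and the Zero output is assumed to be 1 exactly when the 32-bit result is zero.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control: int, a: int, b: int) -> tuple:
    """Return (result, zero) for the five control codes in the notes."""
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        result = a + b
    elif control == 0b110:
        result = a - b
    elif control == 0b111:  # set on less than (signed compare)
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError("control code not in the simplified ALU")
    result &= MASK32
    return result, int(result == 0)

print(alu(0b110, 7, 7))  # subtract equal values, as beq does
```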
INSTRUCTIONS ARE: add, sub, addi, addu, subu, addiu, mfc0, mult,
multu, div, divu, mfhi, mflo / and, or, andi, ori, sll, srl / lw, sw, lbu, sb,
lui / beq, bne, slt, slti, sltu, sltiu/ j, jr, jal.
So, how do we implement the control bits for these instructions? We
need to consider which datapaths are needed for each instruction! Let’s start
with the easy cases:
RegDst => high only for R-format instructions (IR op = 0). NOR the bits of op.
Branch => high only for branch instructions (IR = 4 or 5).
MemRead => high for lw, lb, etc (IR = 35) 100011
MemWrite => high for sw etc (IR = 43) 101011
MemToReg => same as MemRead
The rest are more complex, because they are activated by a variety of
instructions:
ALUSrc => Low for R-format (IR=0) instructions and beq, bne, slt, sltu (second ALU operand is a register); high for immediate and load/store instructions.
RegWrite => all BUT sw, sb, beq, bne, j, jr, jal.
NOTE that in some cases the control signal can be either since, for example,
the instruction dataflow does not flow through the mux (e.g. MemToReg for
a sw operation). As the text points out, this can be treated as a don’t-care to
simplify the select circuitry.
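As a sketch, the main control for the four basic instruction classes can be written as a lookup table. The values below are an assumption based on the textbook's jump-less subset and on the polarity used in the exercise solutions in these notes (ALUSrc=1 selects the immediate, ALUop=00 means add); `None` marks a don't-care, and the names `CONTROL`, `OPCODES`, and `main_control` are illustrative.

```python
# Control word per instruction class; None marks a don't-care.
CONTROL = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemToReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUop=0b10),
    "lw":       dict(RegDst=0, ALUSrc=1, MemToReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUop=0b00),
    "sw":       dict(RegDst=None, ALUSrc=1, MemToReg=None, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUop=0b00),
    "beq":      dict(RegDst=None, ALUSrc=0, MemToReg=None, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUop=0b01),
}

# op field values (decimal) as given in the notes
OPCODES = {0: "R-format", 35: "lw", 43: "sw", 4: "beq"}

def main_control(op: int) -> dict:
    """Look up the control word for a 6-bit op field."""
    return CONTROL[OPCODES[op]]

print(main_control(35))
```

A ROM indexed by the op field is the hardware equivalent of this table; the don't-cares are what let a gate-level version be simplified.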
Which leaves the worst: ALUop!
Which of these instructions require individual ALU_op patterns?
Add, addu, sub, subu, mfc0, mult, multu, div, divu, mfhi, mflo, and, or, srl,
sll, beq, slt, sltu, lui. How many bits to represent? 5 (plus shift bits).
We are assuming here that the coprocessor is “close” to the ALU on the
layout, so that mfc0 can be easily implemented in the ALU, rather than
externally with yet another mux.
IF the instruction is R-format, need to take bits from the lowest 6 bits of the
IR, otherwise must decode from IR op bits (26-31).
So, how best to code these? Just pass IR0-5 through to ALU directly, and
use a decoder to extract ALUop from the op field for non R-format
instructions? That may work. Smart selection of op patterns may minimize
control logic, too.
MULTICYCLE IMPLEMENTATION (Fig 5.33/5.34 p383/384)
The single-cycle machine has a low clock speed that is limited by the
instruction that takes the longest. That would probably be a load- requiring
all functional units and including two memory accesses. While it has a CPI
of 1.0, the clock rate would be severely limited.
With a faster clock, we could then use as many clock cycles as needed for
each instruction: an R-format instruction might execute in one cycle (two for
mul and div), branches in two, and loads in 5. This would allow a much
higher clock rate, and substantially improved throughput:
Problem: if 10% of instructions are loads, 80% are simple R-format
instructions (not mul/div), and the remaining 10% are branches, what is the
speedup of the multicycle (MC) CPU?
Since the single-cycle (SC) system is limited by the load, we can say that it
has a CPI of 5 for all instructions (or, that the new clock rate is 5x the old):
CPI(SC)/CPI(MC)=5/(0.1*5 + 0.8*1 + 0.1*2)=3.33
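The same arithmetic as a short check. The instruction mix is the one in the problem, with the last 10% assumed to be two-cycle branches (that is what the 0.1*2 term in the formula implies); the variable names are arbitrary.

```python
# (fraction of instructions, cycles per instruction) for each class
mix = {
    "load":     (0.10, 5),
    "r_format": (0.80, 1),
    "branch":   (0.10, 2),  # assumed: the remaining 10% are branches
}

cpi_mc = sum(frac * cycles for frac, cycles in mix.values())
cpi_sc = 5  # single-cycle clock is 5x slower, so count it as CPI = 5
speedup = cpi_sc / cpi_mc

print(f"CPI(MC) = {cpi_mc:.2f}, speedup = {speedup:.2f}")
```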
It should be noted that this topic is more theoretical/historical than practical,
since today all high-performance CPUs are pipelined. Pipelining is more
efficient than multicycle implementations. Note that most older/simple
microcontrollers are multicycle, however.
PRINCIPLE: execution of each instruction proceeds in a number of steps,
where each step takes one clock cycle. Because a functional unit may be
reused during different steps of an instruction, this allows us to:
1. Use a unified memory for program and data
2. Reuse the ALU for math/logic, updating PC, and branch calculations
3. Can wait more than one clock if necessary for slow memory… 68000
In order to reuse functional units, we need to save the older results in
registers while the units are being reused. This leads to:
1. Instruction Register
2. Memory data register
3. Register set output registers – used also because one cycle used to read
register values, another to do ALU operation.
4. ALU output register(s)
To implement the multicycle datapath, we start from the single-cycle
system, and make the following modifications:
1. Add the intermediate registers
2. Add multiplexers/MUX inputs to allow reusing the ALU for calculating the PC update and branch addresses
3. Relocate the memory, and add address mux to allow unified program and data memory
4. Add control lines for all the new registers.
What is the cost of this “upgrade”? Registers are reasonably cheap, VLSI-
wise. But the control circuitry is now very complex. Instead of a few
combinatorial gates, we need a sophisticated state machine that “knows” the
steps required for each instruction.
How should we “allocate” clock cycles? The slowest step determines the
max clock rate, so no step should be much longer than any other. Mem
access may require more than one step, as may div or FP ops. Register
access, simple ALU ops should each take one clock- these are our rate-limiting steps.
What is a consistent series of steps that serves to execute an instruction on
this multicycle machine?
1. Instruction Fetch – read the instruction and update the address at the
input of the PC:
IR = Memory[PC]; -- latch instruction in IR, uses memory
PC = PC + 4; -- uses ALU
2. Instruction decode and register fetch, calculate branch offset just in case:
A = Reg[IR[25-21]]; -- uses register
B = Reg[IR[20-16]];
ALUout = PC + (sign-extend (IR[15-0]) <<2); -- uses ALU
3. Execute instruction by calculating result, memory address, or doing
branch conditional (and PC update if true)
Mem ref: ALUout = A + sign-extend(IR[15-0]);
R-format: ALUout = A op B;
Branch if(A==B) PC=ALUout;
Jump: PC = PC[31-28] || (IR[25-0]<<2) -- || is concatenation
4. Memory access or write-back (completion)
MemLoad: MDR = Memory[ALUout];
MemStore: Memory[ALUout] = B;
R-format: Reg[IR[15-11]] = ALUout;
I-type: Reg[IR[20-16]] = ALUout; -- immediate ALU’s, slt
5. Memory read completion ONLY
Reg[IR[20-16]] = MDR;
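The five steps can be walked through in a small behavioural model. This is only a sketch under several assumptions: it handles just lw, sw, beq, j, and R-format add/sub; memory is one Python dict keyed by byte address (standing in for the unified memory); and the names `run_instruction`, `sext16`, and `state` are invented for the illustration. The return value is the number of steps (clock cycles) the instruction used.

```python
MASK32 = 0xFFFFFFFF

def sext16(word: int) -> int:
    """Sign-extend the low 16 bits of word to a Python int."""
    imm = word & 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def run_instruction(state: dict) -> int:
    reg, mem = state["reg"], state["mem"]
    # Step 1: instruction fetch
    ir = mem[state["pc"]]
    state["pc"] = (state["pc"] + 4) & MASK32
    # Step 2: decode, register fetch, branch target "just in case"
    op = (ir >> 26) & 0x3F
    a = reg[(ir >> 21) & 0x1F]
    b = reg[(ir >> 16) & 0x1F]
    alu_out = (state["pc"] + (sext16(ir) << 2)) & MASK32
    # Step 3: execute
    if op == 4:  # beq: branch completes here
        if a == b:
            state["pc"] = alu_out
        return 3
    if op == 2:  # j: jump completes here
        state["pc"] = (state["pc"] & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
        return 3
    if op == 0:  # R-format (only add and sub are modelled)
        funct = ir & 0x3F
        if funct == 0x20:
            alu_out = (a + b) & MASK32
        elif funct == 0x22:
            alu_out = (a - b) & MASK32
        else:
            raise ValueError("funct not modelled in this sketch")
        reg[(ir >> 11) & 0x1F] = alu_out  # Step 4: write-back
        return 4
    if op in (35, 43):  # lw / sw: address calculation
        alu_out = (a + sext16(ir)) & MASK32
        if op == 43:
            mem[alu_out] = b              # Step 4: store
            return 4
        mdr = mem[alu_out]                # Step 4: load into MDR
        reg[(ir >> 16) & 0x1F] = mdr      # Step 5: read completion
        return 5
    raise ValueError("op not modelled in this sketch")

# lw $t0,0x100($0); add $t1,$t0,$t0; beq $t1,$t1,-3 (back to address 0)
state = {"pc": 0, "reg": [0] * 32,
         "mem": {0x000: 0x8C080100, 0x004: 0x01084820,
                 0x008: 0x1129FFFD, 0x100: 21}}
cycles = [run_instruction(state) for _ in range(3)]
print(cycles, state["reg"][9], state["pc"])
```

The cycle counts per class (5 for loads, 4 for stores and R-format, 3 for branches and jumps) fall straight out of where each instruction finishes in the step list.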
MULTICYCLE CONTROL (Fig 5.42 p 396)
The difficulty with control is that each TYPE of instruction requires an
individual state machine!
Last semester, we looked at several ways to implement logic, which
included synthesis by gates, and table-lookup (e.g. by ROM). These are the
two ways that multicycle control may be implemented.
Let’s start by considering the initial steps that ALL instructions require:
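[Figure: top-level control FSM. The common fetch/decode states (PCSource=00) dispatch to individual state machines for Mem, R-type, Branch, Jump, and I-type instructions, each of which returns to fetch.]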
Go over FSM’s with a few examples.
The FSM discussed previously is not all that complex, and so can easily
be implemented with the techniques we covered last semester. When adding
the MIPS floating point instructions, some instructions take up to 20 clock
cycles. In cases where the control is horrendously complex- hundreds or
thousands of states as is often the case with CISC processors- a technique
called microprogramming is often used.
Microprogramming is a processor within a processor. Each instruction is
implemented as a series of MICROINSTRUCTIONS that specify the control
signals needed for one state of that instruction’s FSM. Also, the
microinstruction indicates which microinstruction must be executed next, if
it is not the next one in the sequence (i.e. indicates a branch).
As in programming, the key is to use an easily-understood “assembly
language” that can be “assembled” into the control hardware. The hardware
looks something like:
[Figure: microprogram storage driving the control signals.]
The microprogram vectors give the control signals as well as sequencing bits
that indicate which vector should be executed next.
Let’s look at the MIPS microprogramming “assembly language”. The
microinstruction contains eight fields:
1. ID string – a label that identifies this instruction (for jump-to’s)
2. ALU control – ALU function (Add, Sub, Fn)
3. SRC1 – source for first ALU operand (PC,A)
4. SRC2 – source for second ALU operand (B,4,ext,ext_shift)
5. Register control – source for write (read, write ALU, write MDR)
6. Memory – (Read Addr=PC, Read ALU, Write ALU)
7. PCWrite control – (ALU, ALUout-cond, Jump_addr)
8. Sequencing – give next microinstruction label or “seq”
So, for these 8 fields, the microprogram for the first two cycles is:
1      2     3     4        5         6       7        8
Label  ALU   SRC1  SRC2     Register  Memory  PCWrite  Sequencing
Fetch  Add   PC    4        -         ReadPC  ALU      Seq
-      Add   PC    ExtShft  Read      -       -        Dispatch 1
Let’s examine these:
The first cycle specifies that the ALU adds, with ALU inputs PC and 4, the
memory reads the instruction pointed to by the PC, and the PC is updated
from the ALU output.
The second cycle indicates that the ALU adds the PC (now updated) to the
sign-extended and shifted immediate field (branch offset calculation). ALU
write is deferred until instruction is decoded!
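One way to picture the microprogram is as a table of records plus a tiny sequencer. The sketch below is an assumption-laden illustration: the field names are invented, the dispatch table contents are not given in these notes, so `dispatch_lookup` is a hypothetical callback supplied by the caller.

```python
from typing import Callable, List, NamedTuple, Optional

class MicroInstr(NamedTuple):
    label: Optional[str]     # field 1: ID string
    alu: Optional[str]       # field 2: ALU control
    src1: Optional[str]      # field 3
    src2: Optional[str]      # field 4
    reg: Optional[str]       # field 5: register control
    mem: Optional[str]       # field 6: memory
    pc_write: Optional[str]  # field 7: PCWrite control
    seq: str                 # field 8: sequencing

# The first two microinstructions, as given in the notes
MICROPROGRAM = [
    MicroInstr("Fetch", "Add", "PC", "4", None, "ReadPC", "ALU", "Seq"),
    MicroInstr(None, "Add", "PC", "ExtShft", "Read", None, None, "Dispatch 1"),
]

def next_address(index: int, program: List[MicroInstr],
                 dispatch_lookup: Callable[[str], int]) -> int:
    """Pick the next microinstruction from the sequencing field."""
    seq = program[index].seq
    if seq == "Seq":                 # fall through to the next one
        return index + 1
    if seq.startswith("Dispatch"):   # hypothetical opcode-based lookup
        return dispatch_lookup(seq)
    # otherwise the field names a label to jump to
    return next(i for i, m in enumerate(program) if m.label == seq)

print(next_address(0, MICROPROGRAM, lambda table: 0))
```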
EXCEPTIONS AND INTERRUPTS (Handout: Figs 5.48, 5.49)
What are exceptions and interrupts? Terminology differs between platforms!
For MIPS, an interrupt is an external hardware request for action that is
independent of instructions. An exception is caused by instructions (or data):
  Math errors (overflow, divide by zero)
  Software “interrupt”!!! (e.g. syscall)
  Memory access violation
When one occurs, the processor must:
  Save program counter into special register (EPC)
  Update PC to special location (exception processing routine)
Note that power-on reset/reset is handled differently: the processor initializes
all internal registers and starts executing at standard address (often 0000).
How are other exceptions/interrupts handled? One of three ways:
1) Cause of exception/interrupt is saved in special cause register, and all
exceptions and interrupts go to same (service routine) vector address.
PC = serviceAddress
2) Vectored- each exception/source has its own standard address for the
service routine, usually spaced 4-32 bytes apart (jumps used to go to
individual service routines).
PC = serviceAddress[i]
3) Indirect- adjacent vectors contain ADDRESSES of service routines
(popular Motorola ploy). Efficient mechanism.
PC = mem[vector[i]]
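The three dispatch schemes differ only in how the new PC is formed, which a few lines of Python make explicit. All names and the example addresses are hypothetical; `mem` is a dict standing in for memory, and a 4-byte table entry is assumed for the indirect case.

```python
def pc_single_entry(service_address: int, cause_reg: dict, i: int) -> int:
    """Method 1: record the cause, then go to the one shared routine."""
    cause_reg["value"] = i
    return service_address

def pc_vectored(base: int, spacing: int, i: int) -> int:
    """Method 2: source i has its own entry point, spacing bytes apart."""
    return base + spacing * i

def pc_indirect(mem: dict, table_base: int, i: int) -> int:
    """Method 3: the table holds routine addresses (one extra memory read)."""
    return mem[table_base + 4 * i]

cause = {"value": None}
print(hex(pc_single_entry(0xC0000000, cause, 1)), cause)
print(hex(pc_vectored(0x100, 8, 3)))
print(hex(pc_indirect({0x20: 0x4000, 0x24: 0x5000}, 0x20, 1)))
```

Method 1 pushes the decoding into software (the routine must read the cause register), while methods 2 and 3 spend hardware or a memory access to save that step.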
MIPS implementation simple, just arithmetic overflow (cause = 1) and
illegal instruction (cause = 0). Uses Method 1 above, where service routine
is located at address 0xc0000000.
How can we extend model to incorporate these exceptions?
Add cause and EPC registers
Add Boolean control signal to update cause register bit 0
Add path to update EPC from PC upon exception
Add means to load PC with vector address 0xC0000000
Add logic to detect exceptions in ALU and control
Add instructions that allow accessing cause and EPC
Hardware modifications for the first 4 are shown in Fig. 5.48.
Talk through these changes (note ALU subtracts 4 from PC!).
The modifications to the state machine require just the addition of two
states, one for each exception. One happens when a signed “R-type”
instruction causes an overflow, and the other when the decoded instruction
out of state 1 is “other” than the implemented types.
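A sketch of what the two new states do, using the cause codes and handler address given above. The names `take_exception` and `add_overflows` are made up; the overflow test shown is the usual sign-comparison rule for 32-bit two's-complement addition, offered here as one possible detection method rather than the circuit in the figure.

```python
HANDLER = 0xC0000000
CAUSE_ILLEGAL, CAUSE_OVERFLOW = 0, 1
MASK32 = 0xFFFFFFFF

def add_overflows(a: int, b: int) -> bool:
    """Signed overflow: both operands differ in sign from the 32-bit sum."""
    s = (a + b) & MASK32
    return ((a ^ s) & (b ^ s) & 0x80000000) != 0

def take_exception(state: dict, cause: int) -> None:
    # PC was already incremented in the fetch step, so back it up by 4
    state["epc"] = (state["pc"] - 4) & MASK32
    state["cause"] = cause
    state["pc"] = HANDLER

state = {"pc": 0x00400008, "epc": 0, "cause": 0}
if add_overflows(0x7FFFFFFF, 1):
    take_exception(state, CAUSE_OVERFLOW)
print({k: hex(v) for k, v in state.items()})
```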
1) (5.1) Describe the effect that a single stuck-at-0 fault would have on the
multiplexors in the single-cycle datapath of Figure 5.19. Which instructions
if any, would still work? Consider each of the following faults separately:
RegDst=0, ALUSrc=0, MemtoReg=0, Zero=0.
If RegDst =0, all R format instructions would not work properly since we
will specify the wrong address to write to. If ALUSrc = 0, then all I format
instructions except branch will not work because we will not be able to get
the sign-extended 16-bits into the ALU. If MemtoReg=0, then loads will not
work. If Zero=0, the branch instructions will never branch, even when they
should.
2) (5.5) We wish to add the instruction addi to the single cycle datapath.
Add any necessary datapaths and control signals to the single cycle datapath
of Figure 5.19
No new additions are required. The new control is similar to the load word
because we want to use the ALU to add the immediate to a register. So,
RegDst=0, ALUSrc=1, ALUOp=00. The new control is also similar to an
R-format instruction because we want to write the result of the ALU to a
register so MemtoReg=0, RegWrite=1 and since we aren’t using branches or
memory, Branch=0, MemRead=0, MemWrite=0.
3) (5.7) Same as 2 but we want to add branch not equal (bne).
One possible solution is to add a new control signal called Invzero that
selects whether Zero or inverted Zero is an input to the AND gate used for
choosing what the new PC should be (so this means a new mux). The new
control signal Invzero would be a don’t care whenever Branch is not asserted.
Many other solutions are possible.
4) (5.10). A friend is proposing that the control signal MemtoReg be
eliminated. The multiplexer that has MemtoReg as an input will instead use
the control signal MemRead. Will this work? Consider both datapaths.
MemtoReg and MemRead are identical except for sw and beq, for which
MemtoReg is a don’t care. Thus, the modification will work for single-
cycle. For multi-cycle it will also work, assuming that the finite state
machine is changed so that MemRead is asserted whenever MemtoReg is.
5) (5.17) We wish to add the instruction jump and link (jal) to the
multicycle data path. Add any necessary datapath and control signals.
We need to expand the two multiplexors controlled by RegDst and
MemtoReg. The execution steps would be:
Instruction fetch (unchanged)
Instruction decode and register fetch (unchanged)
Jal: Reg[31] = PC; PC = PC[31-28] || (IR[25-0]<<2);
So we are writing the PC after it has already been incremented by 4 (in the
instruction fetch step) into register $ra so we need PC to be an input to the
MemtoReg multiplexer and 31 needs to be an input to the RegDst
multiplexer. We need to modify existing states to show proper values of
RegDst and MemtoReg and add a new state that performs the jal instruction
(and then returns to state 0). That state (say state 10) would have: PCWrite,
PCSource =10, RegWrite, and the appropriate values for MemtoReg and
1) (5.2) Describe the effect that a stuck at 1 fault would have on the
multiplexors in the single-cycle datapath of Figure 5.19. Which instructions,
if any would still work. Consider RegDst=1, ALUSrc=1, MemtoReg=1, and
2) (5.6) We wish to add the jump and link (jal) instruction to the single-
cycle datapath. Add any necessary datapaths and control signals.
You can just add these to the Figure 5.19.
3) (5.15) We wish to add the instruction addi to the multicycle datapath.
Add any necessary datapaths and update the finite state machine. (Figures
5.33 and 5.42)