CDA 5155 Computer Architecture
Professor : Dr. Frank
Introduction to TMS320C6000
(VLIW VelociTI Architecture)
Jaeseok Kim (firstname.lastname@example.org, CourseWorX ID - jseokkim)
The CPU core of the TMS320C6000, the latest DSP family from Texas Instruments, is described in detail,
with emphasis on the fixed-point pipeline structure and operation. The CPU core exploits
the VelociTI architecture, an advanced VLIW (Very Long Instruction Word) architecture
developed for TI's new DSP processors.
VelociTI can issue up to eight RISC-like instructions at a time, which are fed into
eight functional units. VelociTI is basically a load-store architecture with 32 or 64 32-bit
general-purpose registers. Two data paths act as communication channels between the
two register files and the eight functional units; their structure and operation are also described.
Although unused instruction slots in a traditional VLIW waste bus
bandwidth, the advanced VLIW of VelociTI offers instruction packing by means of a p-bit in each
fetch packet. The pipeline operation of each instruction type is described in detail, and
examples that result in pipeline stalls are given to aid understanding of the pipeline.
I. Introduction
The VelociTI architecture and its implementations are described briefly. Although there
are three models in the TMS320C6000 family, only the basic fixed-point processor, the C62x,
is used for this review of VelociTI.
a. Overview of VelociTI Architecture
The VelociTI architecture is the latest DSP architecture, developed in 1997 by
Texas Instruments. It is basically a load-store RISC architecture with
32 or 64 32-bit general-purpose registers and eight
functional units, six ALUs and two multipliers, which allow up to eight
RISC-like instructions to be issued in the same cycle. The advanced VLIW (Very Long
Instruction Word) architecture supports not only ILP (Instruction-Level
Parallelism) through static scheduling by the compiler, but also instruction packing, which
compresses the large code size of traditional VLIW. In addition, every instruction can be
conditional, which gives a programmer or a compiler extra flexibility.
b. TMS320C6000 DSP Family
There are three models in the TMS320C6000 family. The C62x and C64x are fixed-point
DSPs with 32 and 64 32-bit general-purpose registers, respectively. The C67x
is a floating-point DSP with 32 32-bit general-purpose registers. The C62x and C67x
share the same fixed-point structure, but the C64x adds further features,
such as Galois-field operations to support error-correcting codes.
                                  TMS320C62x   TMS320C64x   TMS320C67x
Fixed-point operation             Yes          Yes          Yes
Floating-point operation          No           No           Yes
32-bit general-purpose registers  32           64           32
Galois-field operation            No           Yes          No
Bit operations                    No           Yes          No
Figure 1. TMS320C6000 Family Specification
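The Galois-field support mentioned above can be illustrated in software. The sketch below is a plain C model of a multiply in GF(2^8), the field used by Reed-Solomon error-correcting codes; the reduction polynomial 0x11D is an illustrative assumption for this sketch, not a statement of what the C64x hardware fixes.

```c
#include <stdint.h>

/* Software sketch of an 8-bit Galois-field multiply: GF(2^8) with the
 * reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), a common
 * choice for Reed-Solomon codes.  The polynomial here is an
 * illustrative assumption; dedicated hardware accelerates exactly this
 * kind of shift-and-XOR loop. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;                   /* add (XOR) current partial product */
        uint8_t carry = a & 0x80;     /* would the shift leave GF(2^8)? */
        a <<= 1;
        if (carry)
            a ^= 0x1D;                /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return p;
}
```

On a fixed-point DSP without such hardware, every iteration of this loop costs shifts, tests, and XORs, which is why dedicated Galois-field units matter for coding workloads.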
c. TMS320C6000 Block Diagram
The TMS320C6000 CPU core has two register files, two data paths, and eight
functional units. Each register file consists of 16 32-bit registers. The eight
functional units are split into two groups, each group supporting one register
file. For example, .L1, .S1, .M1, and .D1 are connected to register file A. Each
register has four read ports, capable of transferring data simultaneously to any
functional unit in the same group.
Two cross paths are used to access a register in the other group.
Because there are only two cross paths, one per group, accessing two or
more registers in the other group in the same cycle is prohibited. As discussed later, this is
one of the resource constraints.
Figure 2. CPU Block Diagram
Data longer than 32 bits, such as 40-bit or 64-bit values, is held by pairing two
adjacent registers (A0 and A1, A2 and A3, etc.). The LSBs are stored in the even-
numbered register and the MSBs in the odd-numbered register.
Figure 3. How to store larger data than 32-bit
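The even/odd pairing described above can be sketched in C. The struct and function names below are illustrative, not part of any TI API:

```c
#include <stdint.h>

/* Sketch of how a 64-bit value maps onto a C6000 register pair: the low
 * 32 bits go to the even register (e.g. A0) and the high 32 bits to the
 * odd register (e.g. A1).  Field names are illustrative. */
typedef struct {
    uint32_t even_lo;   /* even-numbered register: LSBs */
    uint32_t odd_hi;    /* odd-numbered register: MSBs  */
} reg_pair;

reg_pair split64(uint64_t v)
{
    reg_pair r;
    r.even_lo = (uint32_t)(v & 0xFFFFFFFFu);
    r.odd_hi  = (uint32_t)(v >> 32);
    return r;
}

uint64_t join64(reg_pair r)
{
    return ((uint64_t)r.odd_hi << 32) | r.even_lo;
}
```

A 40-bit value uses the same pairing, with only the low 8 bits of the odd register significant.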
II. Data Paths and Control
a. Data Paths
Figure 4. Data Paths Block Diagram (C62x)
Figure 4 shows the structure of the data paths in the C62x. As shown in Figure 4, one
data path serves each register file, and there are two cross paths between the two register files.
b. Functional Units
The following table shows the fixed-point operations of each functional unit.
Although .L is a general logical unit, the other three units each have their own
special operations: shift operations for .S, multiply operations for .M, and data
(load/store) operations for .D.
Functional Unit      Fixed-Point Operations
.L unit (.L1, .L2)   32/40-bit arithmetic and compare operations
(logical unit)       32-bit logical operations
                     Leftmost 1 or 0 counting for 32 bits
                     Normalization count for 32 and 40 bits
.S unit (.S1, .S2)   32-bit arithmetic operations
(shifter unit)       32/40-bit shifts and 32-bit bit-field operations
                     32-bit logical operations
                     Register transfers to/from the control register file (.S2 only)
.M unit (.M1, .M2)   16 x 16 multiply operations
.D unit (.D1, .D2)   32-bit add, subtract, linear and circular address calculation
(data unit)          Loads and stores with 5-bit constant offset
                     Loads and stores with 15-bit constant offset (.D2 only)
Figure 5. Functional Units and Operations
III. Instruction Set Architecture
The ISA (Instruction Set Architecture) is reviewed here: the opcode map, the latencies of
the six instruction types, fetch and execute packets, branches, and resource constraints.
a. Opcode Map
Figure 6. Instruction Set
The figure above shows the opcode map for all instruction types. Except for IDLE and
NOP, every instruction has creg and z fields, the four most significant bits. As discussed
later, these bits make instructions conditional. In addition, the LSB (the p-bit)
is used to form execute packets within a fetch packet.
b. Delay Slots
In terms of delay slots, six types of instructions exist in the C6000 architecture.
Single-cycle and store instructions finish their read/write operations in the first
execute stage and therefore need no delay slots. A multiply
instruction performs its read and write operations in the first and second stages,
respectively, so it needs one delay slot. In the same way, a load instruction needs
four delay slots.
Figure 7. Delay Slot and Functional Latency
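The delay-slot counts from Figure 7 can be summarized in a small C lookup; the enum names are illustrative:

```c
/* Sketch: delay slots per C62x instruction class, per Figure 7.  A result
 * written N delay slots later cannot be read by the N execute packets
 * that follow the instruction. */
enum instr_class { SINGLE_CYCLE, MULTIPLY, STORE, EXT_MULTIPLY, LOAD, BRANCH };

int delay_slots(enum instr_class c)
{
    switch (c) {
    case SINGLE_CYCLE: return 0;  /* read and write both in E1 */
    case MULTIPLY:     return 1;  /* write happens in E2 */
    case STORE:        return 0;  /* E2/E3 touch memory, not registers */
    case EXT_MULTIPLY: return 3;  /* write happens in E4 */
    case LOAD:         return 4;  /* data arrives in E5 */
    case BRANCH:       return 5;  /* target takes effect five cycles later */
    }
    return -1;                    /* unreachable for valid input */
}
```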
c. Parallel Operations
A series of eight instructions forms a fetch packet, which is fetched from memory at a
time. At the end of each instruction, the p-bit determines whether the instruction
executes in parallel with the next one. In Figure 8, if the p-bit of
instruction A is 1, instruction A executes together with instruction B.
Notice that the p-bit of the last instruction in a fetch packet must be 0, because
instructions in a fetch packet cannot execute in parallel with any
instruction in another fetch packet.
For example, when the p-bit pattern is (1 1 1 0 1 1 0 0), the first four
instructions execute at a time, then the next three, and finally the last
instruction by itself.
An execute packet is defined as a group of instructions executed at a
time, so one to eight execute packets can exist in one fetch packet. As a result,
one instruction word of a traditional VLIW is compressed into one execute packet
in the VelociTI VLIW; one fetch packet can hold up to eight execute packets,
each the equivalent of one traditional VLIW instruction word.
Figure 8. Basic Format of a Fetch Packet
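The p-bit rule above can be sketched as a small C routine that splits a fetch packet into execute packets (the encoding as an int array is illustrative):

```c
/* Sketch: splitting one 8-instruction fetch packet into execute packets
 * using the p-bits.  p[i] == 1 means instruction i executes in parallel
 * with instruction i+1; the last p-bit must be 0.  Returns the number of
 * execute packets and writes the size of each into sizes[]. */
int split_execute_packets(const int p[8], int sizes[8])
{
    int n = 0, len = 0;
    for (int i = 0; i < 8; i++) {
        len++;
        if (p[i] == 0) {        /* instruction i ends an execute packet */
            sizes[n++] = len;
            len = 0;
        }
    }
    return n;
}
```

For the pattern (1 1 1 0 1 1 0 0) discussed above, this yields three execute packets of four, three, and one instructions.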
d. Conditional Operations
One of the features of VelociTI is that every instruction can be conditional,
without dedicated condition-code flags. Figure 9 describes the conditional
operation of instructions, based on the condition registers.
The creg field specifies which condition register is evaluated, and
the z field determines how the value of that register is tested,
as zero or as non-zero.
Figure 9. Condition-testable Registers
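The creg/z semantics can be modeled in C. In C6000 assembly a predicated instruction is written like [B0] ADD .L1 A1,A2,A3, executing only when B0 is non-zero, with z = 1 inverting the test; the function below is a minimal model under those assumptions, not TI code:

```c
#include <stdint.h>

/* Sketch of predicated execution: the instruction's condition register
 * value (creg_val) and the z bit decide whether the operation takes
 * effect.  If the condition fails, the destination keeps its old value,
 * exactly as if the instruction had not executed. */
uint32_t cond_add(uint32_t creg_val, int z,
                  uint32_t src1, uint32_t src2, uint32_t old_dst)
{
    int take = z ? (creg_val == 0)    /* z = 1: execute if register is zero */
                 : (creg_val != 0);   /* z = 0: execute if non-zero */
    return take ? (src1 + src2) : old_dst;
}
```

Because the untaken case simply preserves the destination, compilers can use predication to remove short branches entirely, avoiding the five branch delay slots discussed later.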
e. Resource Constraints
There are several types of resource constraints: functional units, cross
paths, register reads, and register writes. In other words, no execute packet
may use more resources than are available.
The available resources are eight functional units, two cross paths, four read
ports per register, and one write port per register per cycle.
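Two of these constraints, that no functional unit is used twice and that each cross path is used at most once per cycle, can be sketched as a checker; the integer encoding of units and cross paths is an illustrative assumption:

```c
enum { NUM_UNITS = 8 };   /* .L1 .S1 .M1 .D1 .L2 .S2 .M2 .D2 */

/* Sketch: validate one execute packet of n instructions.  unit[i] is the
 * functional unit (0..7) used by instruction i; xpath[i] is 0 for no
 * cross path, 1 for the 1X cross path, 2 for the 2X cross path.
 * Returns 1 if the packet respects the unit and cross-path constraints. */
int packet_ok(int n, const int unit[], const int xpath[])
{
    int used[NUM_UNITS] = {0};
    int x1 = 0, x2 = 0;
    for (int i = 0; i < n; i++) {
        if (used[unit[i]]++)               /* same unit used twice */
            return 0;
        if (xpath[i] == 1 && ++x1 > 1)     /* 1X cross path used twice */
            return 0;
        if (xpath[i] == 2 && ++x2 > 1)     /* 2X cross path used twice */
            return 0;
    }
    return 1;
}
```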
IV. Pipeline Structure
The pipeline stages are split into three groups: fetch, decode, and execute.
Each group includes several specific stages, responsible for reading, writing,
executing, and so on. The operations in each stage are described in detail with block diagrams.
a. Pipeline Overview
PG Program address Generate
PS Program address Send
PW Program access ready Wait
PR Program fetch packet Receive
DP instruction DisPatch
DC instruction DeCode
Figure 10. Fixed-Point Pipeline Stages
As shown in Figure 11, the CPU generates the instruction fetch address during the PG
phase; the generated address is sent to memory in the PS phase; during the PW
phase the CPU waits until the memory data is ready; and finally a fetch packet is
transferred to the CPU at the end of the PR phase.
Figure 11. CPU Block Diagram in Fetch Phases
While the CPU is in the DP phase, a fetch packet is split into one or more execute
packets, and the proper functional unit is assigned to each instruction
in an execute packet. As shown later, multiple execute packets in a fetch
packet result in pipeline stalls.
During the DC phase, each instruction is decoded to determine the source
registers, the destination register, and the data path used to access them.
The execution type is also determined in this phase.
Figure 12. CPU Block Diagram in Decode Phases
The following figure depicts the block diagram during the execute phases.
Since the operations during the execute phases depend on the instruction type,
they are discussed in detail later.
Figure 13. CPU Block Diagram in Execute Phases
b. Pipeline Operation Example
Figure 14 shows the pipeline state in each cycle when every fetch packet contains
exactly one execute packet. In this case, the CPU completes up to eight times
as much work per cycle as a single-issue pipelined processor.
Figure 14. One Execute Packet per Fetch Packet
V. Pipeline Execution of Instruction Types
With regard to pipeline execution and delay slots, six types of instructions exist.
As shown below, they differ only in the operations performed during the execute
phases. The six types are described with block diagrams that show the interaction
between a functional unit and its surrounding blocks.
a. Single-Cycle Instructions
All read and write operations finish within the E1 phase, so single-cycle
instructions need no delay slots.
Figure 15. Single-Cycle Instruction Phases
Figure 16. Single-Cycle Execution Block Diagram
b. Two-Cycle Instructions
Although the operands are read in the E1 phase, the write is delayed until the E2
phase. This type of instruction therefore needs one delay slot to cover the delayed write.
Figure 17. Two-Cycle Instruction Phases
Figure 18. Two-Cycle Execution Block Diagram
c. Three-Cycle Instructions (STORE)
During the E1 phase, the address of the data to be stored is calculated; in the E2
phase the calculated address is sent to memory, and the data is written in the next
phase, E3. Although three execute phases are required to finish the instruction,
E2 and E3 do not affect the functional units or registers, so no delay slot is needed.
Figure 19. Three-Cycle Instruction Phases
Figure 20. Three-Cycle Execution Block Diagram
d. Four-Cycle Instructions (Extended Multiply)
Four-cycle instructions work the same way as two-cycle instructions,
except that two more execute phases are necessary; three delay slots are
needed to finish the instruction.
Figure 21. Four-Cycle Instruction Phases
Figure 22. Four-Cycle Execution Block Diagram
e. Five-Cycle Instructions (LOAD)
LOAD instructions operate similarly to STORE instructions, but reading data
takes two more execute phases than writing. Whereas a store finishes writing
its data in the E3 phase, a LOAD spends three more phases, E3, E4, and E5,
to bring the data back.
Figure 23. Five-Cycle Instruction Phases
Figure 24. Five-Cycle Execution Block Diagram
f. Branch Instructions
Although a branch instruction occupies only one execute phase, the branch
target cannot take effect until its E1 phase, which results in five delay slots.
Figure 25. Branch Instruction Phases
g. Multiple Execute Packets in a Fetch Packet
Figure 26 shows the pipeline operation when a fetch packet contains three
execute packets. Although the fetch packets are fed into the pipeline one by one,
the three execute packets in the first fetch packet cause two stalls in the pipeline.
In other words, a fetch packet containing n execute packets causes n - 1 stall cycles.
Figure 26. Pipeline Operation: Multiple execute packets in a fetch packet
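The n - 1 rule can be expressed directly; the function below, with illustrative names, totals the dispatch stalls over a sequence of fetch packets:

```c
/* Sketch: total dispatch stalls over a program of fetch packets.  A fetch
 * packet holding n execute packets (1 <= n <= 8) dispatches over n cycles,
 * stalling the stages behind it for n - 1 cycles (Figure 26: three execute
 * packets -> two stalls). */
int total_dispatch_stalls(int num_fp, const int ep_per_fp[])
{
    int stalls = 0;
    for (int i = 0; i < num_fp; i++)
        stalls += ep_per_fp[i] - 1;
    return stalls;
}
```

This is why compiler scheduling that fills execute packets with genuinely parallel instructions, rather than splitting a fetch packet into many serial execute packets, directly reduces cycle count.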
h. Memory Configurations
As reviewed earlier, the pipeline phases for program and data memory
accesses are very similar. Figure 27 shows the equivalent
pipeline phases for program and data memory accesses.
During PG and E1, the address is calculated; it is then sent to memory in PS
and E2; finally, the last three phases complete the read.
Figure 27. Pipeline Phases used During Memory Accesses
Figure 28. Program Memory Access versus Data Memory Access
i. Memory Stalls
Figure 29 shows the two kinds of memory stalls, program stalls and data stalls.
The program memory stall is caused by the fourth fetch packet, because fetching
a program that is not in the cache takes two extra cycles. The data memory stall
results from the first fetch packet, because the data memory is not ready to be accessed.
Figure 29. Program and Data Memory Stalls
VI. Conclusion
The VelociTI architecture of the TMS320C6000 has been reviewed, with emphasis on the
fixed-point pipeline structure and operation. The advanced VLIW of VelociTI avoids the
sparse instruction words caused by unused slots in traditional VLIW by
exploiting instruction packing.
Eight functional units enable multiple issue of up to eight instructions, and the 32
32-bit general-purpose registers are enough to make full use of multiple issue without
causing resource conflicts.
The fact that every instruction can be conditional gives considerable flexibility,
especially to a programmer or a compiler.
Although the VelociTI architecture is powerful, care must be taken not to cause too
many stalls from multiple execute packets or cache misses.
Compiler optimization is critical for reducing the number of execute packets per
fetch packet, and the memory hierarchy, which plays an important role in reducing
cache misses, is another consideration for performance.
VII. References
- Texas Instruments, TMS320C6000 CPU and Instruction Set Reference Guide.
- Texas Instruments, Tutorial on TMS320C6000 VelociTI Architecture.