

									          CDA 5155 Computer Architecture
                                Professor : Dr. Frank

              Introduction to TMS320C6000
              (VLIW VelociTI Architecture)

Team member:
       Jaeseok Kim (CourseWorX ID - jseokkim)


The CPU core of the TMS320C6000, the latest DSP from Texas Instruments, is described in
detail, with emphasis on the structure and operation of the fixed-point pipeline. The CPU
core is built on the VelociTI architecture, an advanced VLIW (Very Long Instruction Word)
design developed for TI’s new DSP processors.
       VelociTI can issue up to eight RISC instructions per cycle, one to each of its
eight functional units. It is a load-store architecture with 32 or 64 32-bit
general-purpose registers. Two data paths serve as the communication channels between
the two register files and the eight functional units. The structure and operation of the
data paths are also described.
       In a traditional VLIW, unused instruction slots still consume bus bandwidth. The
advanced VLIW of VelociTI avoids this by packing instructions, using the p-bit of each
instruction in a fetch packet. The pipeline operation of the six instruction types is
described in detail, and examples that cause pipeline stalls are given to illustrate how
the pipeline behaves.

I.     Introduction

This section briefly describes the VelociTI architecture and its implementations.
Although the TMS320C6000 family has three models, only the basic fixed-point
processor, the C62x, is used here to review VelociTI.

       a. Overview of VelociTI Architecture

       The VelociTI architecture is the latest DSP architecture from Texas
       Instruments Inc., developed in 1997. It is a load-store RISC architecture
       with 32 or 64 32-bit general-purpose registers and eight functional units
       (six ALUs and two multipliers), so up to eight RISC instructions can be
       issued in the same cycle. Its advanced VLIW (Very Long Instruction Word)
       design supports ILP (Instruction Level Parallelism) through static
       scheduling by the compiler, and it also supports instruction packing, which
       reduces the large code size typical of VLIW. In addition, every instruction
       can be conditional, which gives the programmer or compiler more flexibility.

       b. TMS320C6000 DSP Family

       The TMS320C6000 family has three models. The C62x and C64x are fixed-point
       DSPs with 32 and 64 32-bit general-purpose registers, respectively. The C67x
       is a floating-point DSP with 32 32-bit general-purpose registers. The C62x
       and C67x share the same fixed-point structure, while the C64x adds further
       features, such as Galois Field operations that support error-correcting
       codes in communication applications.

                                             TMS320C62x   TMS320C64x   TMS320C67x
       Fixed-point operation                    Yes          Yes          Yes
       Floating-point operation                 No           No           Yes
       32-bit general-purpose registers         32           64           32
       Galois Field operation                   No           Yes          No
       Bit operations                           No           Yes          No

           Figure 1. TMS320C6000 Family Specification

c. TMS320C6000 Block Diagram

The TMS320C6000 CPU core has two register files, two data paths, and eight
functional units. Each register file consists of 16 32-bit registers. The eight
functional units are split into two groups, and each group serves one register
file. For example, .L1, .S1, .M1, and .D1 are connected to register file A. Each
register has four read ports, so it can transfer data simultaneously to any
functional unit in the same group.
   Two cross paths give a functional unit access to a register in the other group.
Because there are only two cross paths, one for each group, a group cannot read
two or more registers from the other group at the same time. As discussed later,
this is one of the resource constraints.

                   Figure 2. CPU Block Diagram

   Long data, such as 40-bit or 64-bit values, are held in a pair of adjacent
registers (A0 and A1, A2 and A3, etc.). The least significant part is stored in the
even-numbered register and the most significant part in the odd-numbered register.
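
   The register-pair layout above can be sketched in a few lines of Python. This is
only an illustration, not TI code: the function names are hypothetical, and it
assumes that a 40-bit value keeps its upper 8 bits in the low 8 bits of the
odd-numbered register.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def combine_pair(odd_reg: int, even_reg: int, width: int = 40) -> int:
    """Combine an odd:even register pair into one long value.

    even_reg supplies the 32 least significant bits. The low (width - 32)
    bits of odd_reg supply the most significant part (assumed layout).
    """
    if width not in (40, 64):
        raise ValueError("width must be 40 or 64")
    high_bits = width - 32
    high = odd_reg & ((1 << high_bits) - 1)
    return (high << 32) | (even_reg & MASK32)


def split_long(value: int, width: int = 40) -> Tuple[int, int]:
    """Split a long value into (odd_reg, even_reg) register contents."""
    if width not in (40, 64):
        raise ValueError("width must be 40 or 64")
    value &= (1 << width) - 1
    return (value >> 32) & MASK32, value & MASK32
```

   For example, a 40-bit value held in A1:A0 would be rebuilt with
combine_pair(a1, a0), where a1 and a0 are the contents of the two registers.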

          Figure 3. How to store larger data than 32-bit

II.   Data Paths and Control

      a. Data Paths

                Figure 4. Data Paths Block Diagram (C62x)

Figure 4 shows the structure of the data paths in the C62x. Each register file has
its own data path, and two cross paths run between the two register files.

b. Functional Units

The following table shows the fixed-point operations of each functional unit.
The .L units are general arithmetic and logical units, while each of the other
three unit types has its own specialty: shift operations on .S, multiply
operations on .M, and data (load and store) operations on .D.

Functional Unit        Fixed-Point Operations

.L unit (.L1, .L2)     32/40-bit arithmetic and compare operations
Logical unit           32-bit logical operations
                       Leftmost 1 or 0 counting for 32 bits
                       Normalization count for 32 and 40 bits

.S unit (.S1, .S2)     32-bit arithmetic operations
Shifter unit           32/40-bit shifts and 32-bit bit-field operations
                       32-bit logical operations
                       Constant generation
                       Register transfers to/from control register file (.S2 only)

.M unit (.M1, .M2)     16 x 16 multiply operations
Multiplier unit

.D unit (.D1, .D2)     32-bit add, subtract, linear and circular address calculation
Data unit              Loads and stores with 5-bit constant offset
                       Loads and stores with 15-bit constant offset (.D2 only)

               Figure 5. Functional Units and Operations

III.   Instruction Set Architecture

This section gives an overview of the ISA (Instruction Set Architecture): the opcode
map, the latency of the six instruction types, fetch and execute packets, conditional
operations, and resource constraints.

       a. Opcode Map

                       Figure 6. Instruction Set

The figure above shows the opcode map of all instruction types. Except for IDLE
and NOP, every instruction has creg and z fields, which occupy the four most
significant bits. As discussed later, these bits make an instruction conditional.
In addition, the least significant bit (the p-bit) is used to group the instructions
of a fetch packet into execute packets.
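
As an illustration, the fields common to most opcodes can be extracted from a 32-bit
instruction word as follows. This is a minimal sketch, not TI code. The text only
states that creg and z together occupy the four most significant bits and that the
p-bit is the least significant bit; the split into a 3-bit creg (bits 31 to 29) and
a 1-bit z (bit 28) is an assumption here, and the function name is hypothetical.

```python
def decode_common_fields(word: int) -> dict:
    """Extract creg, z and p from a 32-bit instruction word.

    Assumed layout: creg = bits 31..29, z = bit 28, p = bit 0.
    """
    word &= 0xFFFFFFFF
    return {
        "creg": (word >> 29) & 0x7,  # selects the condition register
        "z": (word >> 28) & 0x1,     # selects the zero / non-zero test
        "p": word & 0x1,             # 1 = runs in parallel with the next instruction
    }
```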

b. Delay Slots

In terms of delay slots, the C6000 architecture has six types of instructions.
Single-cycle and store instructions finish their register reads and writes in the
first execution stage, so they need no delay slot. A multiply instruction reads its
operands in the first stage and writes its result in the second, so it needs one
delay slot. In the same way, a load instruction needs four delay slots.
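
The delay-slot counts quoted in this paper can be collected into a small lookup table.
The sketch below is illustrative only: the names are hypothetical, and "ready cycle"
is taken to mean the first cycle in which a later instruction can use the result,
counting from the cycle in which the producing instruction is in E1.

```python
# Delay slots per instruction type, as stated in this paper.
DELAY_SLOTS = {
    "single_cycle": 0,
    "multiply": 1,
    "store": 0,
    "extended_multiply": 3,
    "load": 4,
    "branch": 5,
}


def result_ready_cycle(e1_cycle: int, kind: str) -> int:
    """First cycle in which a following instruction can use the result."""
    return e1_cycle + DELAY_SLOTS[kind] + 1
```

With this model, a load that is in E1 at cycle 10 delivers its data to an instruction
that executes at cycle 15 or later, so the compiler must find four cycles of other
work (or NOPs) in between.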

            Figure 7. Delay Slot and Functional Latency

c. Parallel Operations

A fetch packet consists of eight instructions, which are fetched together. The
p-bit at the end of each instruction determines whether that instruction is
executed in parallel with the next one. In Figure 8, if the p-bit of instruction A
is “1”, instruction A is executed together with instruction B. Note that the p-bit
of the last instruction in a fetch packet should be “0”, because instructions in
one fetch packet cannot be executed in parallel with instructions in another
fetch packet.
    For example, when the p-bit pattern is (1 1 1 0 1 1 0 0), the first four
instructions are executed together, then the next three instructions, and finally
the last instruction.
    An execute packet is defined as a group of instructions that are executed in
the same cycle. One fetch packet can therefore contain from one to eight execute
packets. In effect, one instruction word of a traditional VLIW is compressed into
one execute packet in the VelociTI VLIW, without padding for unused slots.
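
The p-bit rule described above can be sketched as a short routine that splits one
fetch packet into execute packets. It is a minimal illustration with a hypothetical
function name; instructions are identified only by their position (0 to 7) in the
fetch packet.

```python
from typing import List


def split_execute_packets(p_bits: List[int]) -> List[List[int]]:
    """Group the eight instruction slots of a fetch packet into execute packets.

    p_bits[i] == 1 means instruction i runs in parallel with instruction i + 1.
    """
    if len(p_bits) != 8:
        raise ValueError("a fetch packet holds exactly eight instructions")
    if p_bits[-1] != 0:
        raise ValueError("the p-bit of the last instruction must be 0")

    packets: List[List[int]] = []
    current: List[int] = []
    for index, p in enumerate(p_bits):
        current.append(index)
        if p == 0:  # this instruction closes the current execute packet
            packets.append(current)
            current = []
    return packets
```

For the pattern (1 1 1 0 1 1 0 0) this returns [[0, 1, 2, 3], [4, 5, 6], [7]], which
matches the three execute packets of the example above.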

             Figure 8. Basic Format of a Fetch Packet

d. Conditional Operations

One feature of VelociTI is that every instruction can be conditional, so no
dedicated conditional-branch instruction is needed. Figure 9 describes the
conditional operation of instructions, which is based on condition registers.
   The “creg” field specifies which condition register is evaluated. The “z” field
determines how the value of the specified register is tested, that is, for zero or
for non-zero.
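
The test performed by the creg and z fields can be sketched as follows. This is an
illustration under stated assumptions, not TI code: it assumes that z = 1 means
"execute if the register is zero" and z = 0 means "execute if the register is
non-zero", it represents an unconditional instruction by passing None instead of a
register value, and it does not model the encoding of creg itself.

```python
from typing import Optional


def should_execute(reg_value: Optional[int], z: int) -> bool:
    """Decide whether a conditional instruction takes effect.

    reg_value: contents of the register selected by creg, or None when the
               instruction is unconditional.
    z:         1 tests for zero, 0 tests for non-zero (assumed convention).
    """
    if reg_value is None:
        return True
    if z == 1:
        return reg_value == 0
    return reg_value != 0
```

A loop-closing branch, for instance, would be written as a branch conditioned on a
counter register being non-zero, which corresponds to should_execute(counter, 0).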

               Figure 9. Condition-testable Registers

e. Resource Constraints

There are several types of resource constraints, covering functional units, cross
paths, register reads, and register writes. In other words, the instructions that
execute in the same cycle cannot together use more resources than are available.
   The available resources are eight functional units, two cross paths, and, for
each register, four read ports and one write port.
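
A simplified checker for these constraints might look like the sketch below. The
class and function names are hypothetical, and the model is deliberately coarse: it
treats every write as landing in the same cycle, so it ignores the different
latencies of the instruction types, and it does not verify that the cross_path flag
is consistent with the registers listed.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instr:
    unit: str                                         # e.g. ".L1", ".M2"
    reads: List[str] = field(default_factory=list)    # source registers
    writes: List[str] = field(default_factory=list)   # destination registers
    cross_path: bool = False                          # reads from the other register file


def check_execute_packet(instrs: List[Instr]) -> List[str]:
    """Return a list of resource-constraint violations (empty if none)."""
    problems = []
    if len(instrs) > 8:
        problems.append("more than eight instructions in one execute packet")
    # Each functional unit can run one instruction per cycle.
    for unit, n in Counter(i.unit for i in instrs).items():
        if n > 1:
            problems.append(f"functional unit {unit} used {n} times")
    # One cross path per side (the last character of the unit name is the side).
    for side, n in Counter(i.unit[-1] for i in instrs if i.cross_path).items():
        if n > 1:
            problems.append(f"cross path {side}X used {n} times")
    # At most four reads of the same register.
    for reg, n in Counter(r for i in instrs for r in i.reads).items():
        if n > 4:
            problems.append(f"register {reg} read {n} times")
    # At most one write to the same register.
    for reg, n in Counter(w for i in instrs for w in i.writes).items():
        if n > 1:
            problems.append(f"register {reg} written {n} times")
    return problems
```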

IV.    Pipeline Structure

The pipeline stages fall into three groups: fetch, decode, and execute. Each group
contains several specific stages, which are responsible for reading, writing,
executing, and so on. The operations in each stage are described in detail with a
block diagram.

       a. Pipeline Overview

            PG        Program address Generate
            PS        Program address Send
            PW        Program access ready Wait
            PR        Program fetch packet Receive
            DP        instruction DisPatch
            DC        instruction DeCode

                      Figure 10. Fixed-Point Pipeline Stages

       b. Fetch

       As shown in Figure 11, the CPU generates the address of the next fetch
       packet during the PG phase and sends that address to memory in the PS phase.
       In the PW phase the CPU waits until the memory access is ready. Finally, the
       fetch packet is transferred to the CPU at the end of the PR phase.

          Figure 11. CPU Block Diagram in Fetch Phases

c. Decode

In the DP phase, a fetch packet is split into one or more execute packets, and
each instruction in an execute packet is assigned to the appropriate functional
unit. As shown later, multiple execute packets in a fetch packet result in
pipeline stalls.
    In the DC phase, each instruction is decoded to determine the source
registers, the destination register, and the data path used to access these
registers. The execution type is also determined during the DC phase.

         Figure 12. CPU Block Diagram in Decode Phases

d. Execute

The following figure depicts the block diagram during the execution phases.
Since the operations in the execution phases depend on the type of instruction,
they are discussed in detail later.

         Figure 13. CPU Block Diagram in Execute Phases

e. Pipeline Operation Example

Figure 14 shows the pipeline state in each cycle when every fetch packet contains
only one execute packet. In this case, with all eight instructions of each fetch
packet executing in parallel, the processor runs up to eight times as fast as a
single-issue pipelined processor.

Figure 14. One Execute Packet per Fetch Packet

V.     Pipeline Execution of Instruction Types

There are six types of instructions with regard to pipeline execution and delay
slots. As shown below, they differ only in their execution phases. Each type is
described with a block diagram that shows the interaction between a functional unit
and its peripherals.

       a. Single-Cycle Instructions

       All read and write operations finish within the E1 phase, so single-cycle
       instructions need no delay slot and do not cause any pipeline stalls.

                    Figure 15. Single-Cycle Instruction Phases

                Figure 16. Single-Cycle Execution Block Diagram

       b. Two-Cycle Instructions

       Although the operands are read in the E1 phase, the result is not written
       until the E2 phase. This type of instruction needs one delay slot to cover
       the delayed write.

             Figure 17. Two-Cycle Instruction Phases

          Figure 18. Two-Cycle Execution Block Diagram

c. Three-Cycle Instructions (STORE)

During the E1 phase, the address of the data to be stored is calculated. In the E2
phase the calculated address is sent to memory, and the data is written in the next
phase, E3. Although three execution phases are required to finish the instruction,
E2 and E3 do not affect the functional units or registers. Therefore, no delay slot
is needed.

             Figure 19. Three-Cycle Instruction Phases

         Figure 20. Three-Cycle Execution Block Diagram

d. Four-Cycle Instructions (Extended Multiply)

Four-cycle instructions work the same way as two-cycle instructions, except that
two more execution phases are necessary. Three delay slots are needed before the
result of this instruction is available.

             Figure 21. Four-Cycle Instruction Phases

         Figure 22. Four-Cycle Execution Block Diagram

e. Five-Cycle Instructions (LOAD)

LOAD instructions operate in a similar way to STORE instructions, but reading
data takes two more execution phases than writing. Where a STORE finishes
writing its data in the E3 phase, a LOAD spends three phases, E3, E4, and E5, to
bring the data back into a register.

             Figure 23. Five-Cycle Instruction Phases

         Figure 24. Five-Cycle Execution Block Diagram

f. Branch Instructions

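Although a branch instruction occupies only one execution phase, the branch
target is not known until the branch reaches the E1 phase. The fetch of the target
can only start then, and the target must pass through the remaining fetch and
decode phases before it executes. This results in five delay slots.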

                Figure 25. Branch Instruction Phases

  g. Multiple Execute Packets in a Fetch Packet

  Figure 26 shows the pipeline operation when a fetch packet contains three
  execute packets. Although the fetch packets are fed into the pipeline one by
  one, the three execute packets in the first fetch packet cause two stalls in the
  pipeline.
      In other words, a fetch packet that contains n execute packets causes n - 1
  pipeline stalls.
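
  The stall count stated above can be written as a one-line calculation. The sketch
  below uses a hypothetical function name and models only the dispatch stalls caused
  by multiple execute packets, not memory stalls.

```python
from typing import List


def dispatch_stalls(execute_packets_per_fetch: List[int]) -> int:
    """Total stall cycles caused by fetch packets with several execute packets.

    Each entry is the number of execute packets (1 to 8) in one fetch packet.
    """
    for n in execute_packets_per_fetch:
        if not 1 <= n <= 8:
            raise ValueError("a fetch packet holds one to eight execute packets")
    return sum(n - 1 for n in execute_packets_per_fetch)
```

  For a sequence in which the first fetch packet holds three execute packets and the
  following ones hold a single execute packet each, dispatch_stalls([3, 1, 1])
  gives 2, matching the two stalls described above.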

Figure 26. Pipeline Operation: Multiple execute packets in a fetch packet

  h. Memory Configurations

  As reviewed above, the pipeline phases used for program memory accesses and
  for data memory accesses are very similar. Figure 27 shows the equivalent
  pipeline phases for the two kinds of access.
      The address is calculated during PG (program) or E1 (data) and sent to
  memory in PS or E2. The remaining three phases of each access are then needed
  to finish reading.

       Figure 27. Pipeline Phases used During Memory Accesses

Figure 28. Program Memory Access versus Data Memory Access

i. Memory Stalls

Figure 29 shows the two kinds of memory stalls: program memory stalls and data
memory stalls. In the figure, the program memory stall is caused by the fourth
fetch packet, which takes two more cycles to fetch because it is not in the cache.
The data memory stall is caused by the first fetch packet, whose data access finds
that data memory is not yet ready.

           Figure 29. Program and Data Memory Stalls

VI.     Conclusion

The VelociTI architecture of the TMS320C6000 has been reviewed with emphasis on
the structure and operation of the fixed-point pipeline. By using instruction
packing, the advanced VLIW of VelociTI avoids the sparse instruction words that
unused slots cause in a traditional VLIW.
   Eight functional units allow up to eight instructions to be issued at once. The
32 32-bit general-purpose registers are large enough to make full use of multiple
issue without causing resource conflicts.
   The fact that every instruction can be conditional makes the work of a
programmer or a compiler considerably easier.
   Although the VelociTI architecture is powerful, care must be taken to avoid the
stalls caused by multiple execute packets or cache misses. Compiler optimization is
critical for reducing the number of execute packets in a fetch packet. The memory
hierarchy, which plays an important role in reducing cache misses, is another
consideration for performance improvement.
