Intro to C6000 and CCS


Workshop Outline
0. Welcome: TI DSP Overview
1. Intro: C6000 and CCS
2. CSL/BSL: Using Peripherals
3. eXpressDSP: TI's System Software
4. Optimize: Enhancing Performance
5. Wrap Up: Where To Go From Here
Technical Training
Organization
Outline
C6000 Overview
C6000 Parallelism
CCS Overview
Lab: Build and Graph a Sinewave
Example C6000 System
[Block diagram: a C6000 CPU at the center, surrounded by on-chip peripherals and the external devices they connect to]
Timer / Counters
PLL: Clockin, Clockout, Clockoutx
VCP, TCP
GPIO (0-16+ pins): Switches, Lamps, Latches, FPGA, etc.
Utopia 2: 8-bit ATM
Reset, NMI and Ext Interrupts (HWI)
McASP: Audio Codec
McBSP: Serial Codec
PCI: 32-bit PCI
HPI: Host P, 16 or 32 bits
EDMA
Boot Loader
EMIF (16, 32, or 64-bits): Sync, EPROM, SDRAM, SRAM
EMAC: Ethernet (TCP/IP stack avail)
Note: Not all ‘C6000 devices have all the various peripherals shown above.
Please refer to the C6000 Product Update for a device-by-device listing.
C6415 DSP (720MHz)
[Block diagram: peripherals on the left feed the Enhanced DMA Controller (64 channels), which connects to L2 Memory; L2 connects to the L1P Cache and L1D Cache around the C64x CPU Core (5760 MIPS)]
Peripheral bandwidths:
1064 MB/s   EMIF 64
266 MB/s    EMIF 16
12.5 MB/s   McBSP 0
12.5 MB/s   McBSP 1, or 100 MB/s Utopia 2
12.5 MB/s   McBSP 2
133 MB/s    HPI / PCI
Internal bus bandwidths shown: 2.9 GB/s, 11.5 GB/s, 23 GB/s
Also on chip: JTAG RTDX, Power Down Logic, PLL, Timer 0, Timer 1, Timer 2
Before looking into the CPU, what does a DSP do anyway?
What is DSP?
x -> ADC -> DSP -> DAC -> Y

Digital sampling of an analog signal:
[Plot: amplitude A versus time t]

Most DSP algorithms can be expressed with MAC:

    $$Y = \sum_{i=1}^{count} coeff_i \cdot x_i$$

    for (i = 0; i < count; i++) {
        sum += c[i] * x[i];
    }
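As a minimal sketch, the MAC loop can be written as a complete function. The name mac_sum and the 32-bit accumulator are assumptions for illustration; the workshop only shows the loop body.

```c
#include <stdint.h>

/* Sum of products: c[0]*x[0] + ... + c[count-1]*x[count-1].
 * mac_sum is a hypothetical name; the 32-bit accumulator is an assumption. */
int32_t mac_sum(const int16_t *c, const int16_t *x, int count)
{
    int32_t sum = 0;
    int i;

    for (i = 0; i < count; i++) {
        sum += (int32_t)c[i] * x[i];   /* one multiply-accumulate per sample */
    }
    return sum;
}
```

Calling mac_sum(c, x, 4) with c = {1, 2, 3, 4} and x = {5, 6, 7, 8} adds 5 + 12 + 21 + 32.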
How has the C6000 CPU been designed to handle this?
'C6000 CPU Architecture
[Block diagram: Memory feeds two data paths, A and B. Each side has a register file (A0 ... A15 ... A31 and B0 ... B15 ... B31) and four functional units (.D1, .S1, .M1, .L1 and .D2, .S2, .M2, .L2), all under a common Controller/Decoder. The two .M units are the Dual MACs.]

'C6000 Compiler excels at Natural C
While dual-MAC speeds math intensive algorithms, flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
All 'C6000 instructions are conditional allowing efficient hardware pipelining
'C6000 CPU can dispatch up to eight parallel instructions each cycle

What are some differences between C6000 devices?
The C62x/C67x CPU
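[Block diagram: Instruction Fetch, Instruction Dispatch and Instruction Decode sit above two data paths, with Control Registers, Interrupt Control and Emulation alongside. Registers (A0 - A15) serve units L1, S1, M1, D1; Registers (B0 - B15) serve units D2, M2, S2, L2. The L, S and D units are marked with + (add) and the M units with X (multiply).]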
The C64x CPU adds ...
[Block diagram: the same structure as the C62x/C67x CPU, with these additions: Advanced Instruction Packing, Advanced Emulation, Registers (A16 - A31) and Registers (B16 - B31), and several more + and x operations marked under each of the functional units L1, S1, M1, D1, D2, M2, S2, L2.]
How can we best make use of the functional unit parallelism?
Given this simple loop …

    $$y = \sum_{n=1}^{40} c_n \cdot x_n$$

    short mac(short *c, short *x, int count) {
        for (i=0; i < count; i++) {
            sum += c[i] * x[i]; } …

[Diagram: registers c, x, prod, y, cnt and pointers *cp, *xp, *yp next to functional units .S1, .M1, .L1, .D1]

            MVK  .S1  40, cnt
    loop:
            LDH  .D1  *cp++, c
            LDH  .D1  *xp++, x
            MPY  .M1  c, x, prod
            ADD  .L1  y, prod, y
            SUB  .L1  cnt, 1, cnt
      [cnt] B    .S1  loop
            STW  .D   y, *yp
How many of these instructions can we get in parallel?
C62x Intense Parallelism
Given this C code

    short mac(short *c, short *x, int count) {
        for (i=0; i < count; i++) {
            sum += c[i] * x[i]; } …

The C62x compiler can achieve Two Sum-of-Products per cycle

    L2:     ; PIPED LOOP PROLOG
            LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7

            LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7

       [B0] B     .S1  L3
    ||      LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7

       [B0] B     .S1  L3
    ||      LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7

       [B0] B     .S1  L3
    ||      LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7

            MPY   .M2  B7,A3,B4
    ||      MPYH  .M1  B7,A3,A5
    || [B0] B     .S1  L3
    ||      LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7

            MPY   .M2  B7,A3,B4
    ||      MPYH  .M1  B7,A3,A5
    || [B0] B     .S1  L3
    ||      LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7
    ;** -----------------------*
    L3:     ; PIPED LOOP KERNEL
            ADD   .L2  B4,B5,B5
    ||      ADD   .L1  A5,A0,A0
    ||      MPY   .M2  B7,A3,B4
    ||      MPYH  .M1  B7,A3,A5
    || [B0] B     .S1  L3
    || [B0] SUB   .S2  B0,1,B0
    ||      LDW   .D1  *A4++,A3
    ||      LDW   .D2  *B6++,B7
    ;** -----------------------*
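The kernel loads 32-bit words with LDW and uses both MPY (low 16-bit halves) and MPYH (high 16-bit halves), so each pass handles two array elements and keeps two running sums. A portable C sketch of that data flow follows. The name mac_two_per_pass is hypothetical, and the code illustrates the idea rather than reproducing the compiler's output.

```c
#include <stdint.h>

/* Two products per loop pass, accumulated separately and combined at the end.
 * Hypothetical helper: mirrors the MPY/MPYH split, not actual TI tool output. */
int32_t mac_two_per_pass(const int16_t *c, const int16_t *x, int count)
{
    int32_t sum_a = 0;   /* plays the role of one ADD accumulator */
    int32_t sum_b = 0;   /* plays the role of the other ADD accumulator */
    int i;

    for (i = 0; i + 1 < count; i += 2) {
        sum_a += (int32_t)c[i]     * x[i];       /* first element of the pair */
        sum_b += (int32_t)c[i + 1] * x[i + 1];   /* second element of the pair */
    }
    if (i < count) {
        sum_a += (int32_t)c[i] * x[i];           /* leftover when count is odd */
    }
    return sum_a + sum_b;
}
```

Keeping two independent accumulators is what lets the two multiply-add chains run side by side; the final addition happens once, after the loop.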
What about the ‘C67x?
C67x MAC using Natural C
[Block diagram: Memory, register files A0 ... A15 and B0 ... B15, functional units .D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2, Controller/Decoder]

    float mac(float *c, float *x, int count)
    { int i; float sum = 0;
      for (i=0; i < count; i++) {
        sum += c[i] * x[i]; } …

The C67x compiler gets two 32-bit floating-point Sum-of-Products per iteration

    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
            LDDW   .D1   *A4++,A7:A6
    ||      LDDW   .D2   *B4++,B7:B6
    ||      MPYSP  .M1X  A6,B6,A5
    ||      MPYSP  .M2X  A7,B7,B5
    ||      ADDSP  .L1   A5,A8,A8
    ||      ADDSP  .L2   B5,B8,B8
    || [A1] B      .S2   LOOP
    || [A1] SUB    .S1   A1,1,A1
    ;** --------------------------------------------------*
Can the 'C64x do better?
C64x gets four MACs using DOTP2

[Diagram of DOTP2: A5 holds the 16-bit pair m1, m0 and B5 holds the pair n1, n0. DOTP2 produces m1*n1 + m0*n0 in A6, which is then added (+) to the running sum in A7.]

    short mac(short *c, short *x, int count)
    { int i; short sum = 0;
      for (i=0; i < count; i++) {
        sum += c[i] * x[i]; } …

    ;** --------------------------------------------------*
            ; PIPED LOOP KERNEL
    LOOP:   ADD    .L2    B8,B6,B6
    ||      ADD    .L1    A6,A7,A7
    ||      DOTP2  .M2X   B4,A4,B8
    ||      DOTP2  .M1X   B5,A5,A6
    || [B0] B      .S1    LOOP
    || [B0] SUB    .S2    B0,-1,B0
    ||      LDDW   .D2T2  *B7++,B5:B4
    ||      LDDW   .D1T1  *A3++,A5:A4
    ;** --------------------------------------------------*
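The arithmetic in the DOTP2 diagram (m1*n1 + m0*n0 on two packed 16-bit pairs) can be emulated in portable C. This is only a sketch of the data flow: pack2 and dotp2_emulated are hypothetical helper names, not TI library calls, and the choice of which half holds which element is an assumption.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in the upper half).
 * Hypothetical helper for this illustration. */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Emulates the diagram: result = m1*n1 + m0*n0.
 * Not handled: the single overflow case where all four halves are -32768. */
int32_t dotp2_emulated(uint32_t m, uint32_t n)
{
    int16_t m1 = (int16_t)(uint16_t)(m >> 16);
    int16_t m0 = (int16_t)(uint16_t)(m & 0xFFFFu);
    int16_t n1 = (int16_t)(uint16_t)(n >> 16);
    int16_t n0 = (int16_t)(uint16_t)(n & 0xFFFFu);

    return (int32_t)m1 * n1 + (int32_t)m0 * n0;
}
```

With two such operations per cycle (one on each .M unit), four 16-bit products are formed per cycle, which is where the "four MACs" in the slide title comes from.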
How many multiplies can the ‘C6x perform?
MMACs
How many 16-bit MMACs (millions of MACs per second) can the 'C6201 perform?
    400 MMACs (two .M units x 200 MHz)
How about 16x16 MMACs on the 'C64x devices?
        2 .M units
    x   2 16-bit MACs (per .M unit / per cycle)
    x 720 MHz
    ----------------
     2880 MMACs
How many 8-bit MMACs on the 'C64x?
    5760 MMACs (on 8-bit data)
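The three figures above follow one formula: .M units x MACs per unit per cycle x clock in MHz. A small sketch makes the arithmetic explicit. The function name mmacs is hypothetical, and the factor of four 8-bit MACs per .M unit is inferred from 5760 = 2 x 4 x 720 rather than stated on the slide.

```c
/* MMACs = number of .M units x MACs per unit per cycle x clock (MHz).
 * mmacs is a hypothetical helper name used only for this illustration. */
long mmacs(int m_units, int macs_per_unit, int clock_mhz)
{
    return (long)m_units * macs_per_unit * clock_mhz;
}
```

For example, mmacs(2, 1, 200) corresponds to the 'C6201 line, mmacs(2, 2, 720) to the 16-bit 'C64x line, and mmacs(2, 4, 720) to the 8-bit 'C64x line.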
How Do We Get Such High Parallelism?
Compiler and Assembly Optimizer use a technique
called Software Pipelining
Software pipelining enables high performance
(esp. on DSP-type loops)
Key point: Tools do all the work!
What is software pipelining?
Let's look at a simple example ...
Tools Use Software Pipelining
Here's a simple example to demonstrate ...

       LDH
    || LDH
       MPY
       ADD

How many cycles would it take to perform this loop 5 times?
    5 x 3 = 15 cycles
Our functional units could be used like ...
Without Software Pipelining
Cycle  .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
  1    ldh  ldh
  2              mpy
  3                        add
  4    ldh  ldh
  5              mpy
  6                        add
  7    ldh  ldh
In seven cycles, we’re almost half-way done ...
With Software Pipelining
Cycle  .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
  1    ldh  ldh
  2    ldh  ldh  mpy
  3    ldh  ldh  mpy       add
  4    ldh  ldh  mpy       add
  5    ldh  ldh  mpy       add
  6              mpy       add
  7                        add
Completes in only 7 cycles
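The two cycle counts can be checked with a short calculation. This sketch assumes the simplified model used on these slides: every stage (load, multiply, add) takes one cycle, and with software pipelining a new iteration starts every cycle. The function names are hypothetical.

```c
/* Without pipelining, each iteration runs all of its stages before the next starts. */
int cycles_unpipelined(int iterations, int stages)
{
    return iterations * stages;
}

/* With pipelining, the first result needs `stages` cycles and every
 * further iteration completes one cycle later. */
int cycles_pipelined(int iterations, int stages)
{
    return (iterations > 0) ? iterations + stages - 1 : 0;
}
```

For 5 iterations of the 3-stage loop this gives 5 x 3 = 15 cycles without pipelining and 5 + 3 - 1 = 7 cycles with it.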
It takes 1/2 the time! How does this translate to code?
S/W Pipelining Translated to Code
[Table repeated from the previous slide: cycles 1 to 7 of the software-pipelined schedule]

    c1:    LDH
        || LDH

    c2:    MPY
        || LDH
        || LDH

    c3:    ADD
        || MPY
        || LDH
        || LDH
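The same schedule can be imitated in plain C by letting each loop step do the add for iteration i-2, the multiply for iteration i-1 and the loads for iteration i. This is a hand-written illustration of the idea under that assumption, not what the TI tools emit; mac_hand_pipelined is a hypothetical name.

```c
/* Dot product arranged like the pipelined schedule: per step,
 * ADD (oldest iteration), MPY (middle), LDH pair (newest). */
int mac_hand_pipelined(const short *c, const short *x, int count)
{
    int sum = 0;
    int prod = 0;            /* product formed in the previous step */
    short c_ld = 0, x_ld = 0; /* values loaded in the previous step */
    int step;

    /* count + 2 steps: the two extra steps drain the pipeline (epilog). */
    for (step = 0; step < count + 2; step++) {
        if (step >= 2) {
            sum += prod;                 /* add stage */
        }
        if (step >= 1 && step <= count) {
            prod = c_ld * x_ld;          /* multiply stage */
        }
        if (step < count) {
            c_ld = c[step];              /* load stage */
            x_ld = x[step];
        }
    }
    return sum;
}
```

For count = 5 the loop runs 7 steps, matching the 7-cycle table, and the first two and last two steps correspond to the prolog and epilog where not every stage is active.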
Outline
What is DSP
C6000 CPU Architecture
Making use of Parallelism
Tool Overview
C6416 DSK
Code Composer Studio (CCS)
CCS Projects
Build Options
CDB Files
C6416 DSK
Diagnostic Utility included with DSK ...
DSK's Diagnostic Utility
Test/Diagnose DSK hardware
Verify USB emulation link
Use Advanced tests to facilitate debugging
Reset DSK hardware
CCS Overview ...
Code Composer Studio
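[Tool-flow diagram: Edit -> Asm -> Link -> Debug. The Compiler and Asm Opto feed the assembler; the linker combines the result with the Standard Runtime Libraries and DSP/BIOS Libraries to produce the .out file; the DSP/BIOS Config Tool feeds the build. Debug targets shown: SIM, DSK, EVM, Third Party, and XDS with a DSP Board.]

DSK's Code Composer Studio Includes:
Integrated Edit / Debug GUI
Simulator
Code Generation Tools
BIOS: Real-time kernel, Real-time analysis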
CCS is Project centric ...
What is a Project?
Project (.PJT) files contain:
References to files:
Source
Libraries
Linker, etc …
Project settings:
Compiler Options
DSP/BIOS
Linking, etc …
The project menu ...
Project Menu
Access open projects via pull-down menu or by right-clicking .pjt file in project explorer window
Hint: Create and open projects from the Project menu, not the File menu.
Build Options... Next slide
Build Options
-g -q -fr"c:\modem\Debug" -mv6700
Eight Categories of
Compiler options
The most common Compiler Options are ...
Compiler's Build Options
Nearly one-hundred compiler options available to tune your code's performance, size, etc.
Following table lists the most common options:

Options      Description
-mv6700      Generate 'C67x code ('C62x is default)
-mv6400      Generate 'C64x code
-fr <dir>    Directory for object/output files
-fs <dir>    Directory for assembly files
-q           Quiet mode (display less info while compiling)
-g           Debug option: enables src-level symbolic debugging
-s           Debug option: interlist C statements into assembly listing

In Chapter 4 we will examine the options which enable the compiler's optimizer
And, the Config Tool ...
DSP/BIOS Configuration Tool
Simplifies system design by:
Automatically including the appropriate runtime support libraries
Automatically handling interrupt vectors and system reset
Handling system memory configuration (builds CMD file)
Generates 5 files when CDB file is saved:
C file, Asm file, 2 header files and a linker command (.cmd) file
More to be discussed later …
Lab Exercises – C67x vs. C64x
Which DSK are you using?
We provide instructions and solutions for both
C67x and C64x.
We have tried to call out the few differences in lab steps as explicitly as possible.
Lab 1 – Create & Graph a Sine Wave
[Diagram: on the CPU, sineGen() writes its output samples into buffer]
Introduction to Code Composer Studio (CCS)
Hook up DSK hardware
Create and build a project
Examine variables, memory, code
Run, halt, step, multi-step, use breakpoints
Graph results in memory (to see the sine wave)
Creating a Sine Wave
Sine_float.c
Generates a value for each output sample
[Plot: sine wave, amplitude A versus time t]

    float y[3] = {0, 0.0654031, 0};
    float A = 1.9957178;

    short sineGen() {
        y[0] = y[1] * A - y[2];   /* next sample from the previous two */
        y[2] = y[1];              /* shift the history */
        y[1] = y[0];
        return((short)(32000*y[0]));
    }
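The routine above is a second-order recursive oscillator: y[n] = A*y[n-1] - y[n-2], where A = 2*cos(w) and the history is seeded with sin(w). The constants work out to w = 2*pi/96, that is 96 samples per period; this is inferred from the numbers, not stated in the lab. The sketch below restates the same recurrence in a form that runs on a host PC. SineState, sine_init and sine_next are hypothetical names.

```c
/* Same recurrence as sineGen(), with the history kept in a struct
 * so it can be reset. All names here are hypothetical. */
typedef struct {
    float y1;   /* previous output, y[n-1] */
    float y2;   /* output before that, y[n-2] */
    float a;    /* coefficient A = 2*cos(w) */
} SineState;

void sine_init(SineState *s)
{
    s->y1 = 0.0654031f;   /* sin(w), constant taken from the lab source */
    s->y2 = 0.0f;
    s->a  = 1.9957178f;   /* 2*cos(w), constant taken from the lab source */
}

short sine_next(SineState *s)
{
    float y0 = s->y1 * s->a - s->y2;   /* y[n] = A*y[n-1] - y[n-2] */

    s->y2 = s->y1;                     /* shift the history */
    s->y1 = y0;
    return (short)(32000.0f * y0);     /* scale to a 16-bit sample */
}
```

Under that assumption, the k-th call returns roughly 32000*sin((k+1)*w), so the output should approach full scale around the 23rd call and repeat every 96 calls.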
Lab 1 Debrief
1. What differences are there in Lab1 between
the C6713 and C6416 solutions?
2. What do we need CCS Setup for?
3. Why did we return from main?
4. What did you have to add to LAB1.C to get
printf to work?
5. Did you find the “clearArrays” GEL menu
command useful?
Take Home Exercises (Optional)
Lab1a - Customize CCS
Lab1b - Using GEL Scripts
Lab1c - Using Printf
Lab1d - Float vs Fixed Point
Lab1e - Explore CCS Scripting
(scripting)
Optional Topics
CCS Automation
Command Window
GEL Scripting
CCS Scripting
TCONF Scripting
CPU Architecture Detail
C6000 Instruction Set
Benchmarks