Intro to C6000 and CCS by wulinqing


									                     Workshop Outline
            0. Welcome    TI DSP Overview
            1. Intro      C6000 and CCS
            2. CSL/BSL    Using Peripherals
            3. eXpressDSP TI's System Software
            4. Optimize   Enhancing Performance
            5. Wrap Up    Where To Go From Here

Technical Training
   C6000 Overview
   C6000 Parallelism
   CCS Overview
   Lab: Build and Graph a Sinewave
                      Example C6000 System
   [Block diagram: C6000 CPU and EDMA at the center, surrounded by peripherals]
      Timer / Counters (/2)
      GPIO
      VCP, TCP
      PLL: Clockin, Clockout, Clockoutx
      Utopia 2 (/8): ATM
      McASP (/3): Audio Codec
      McBSP (/3): Serial Codec
      EMAC: Ethernet (TCP/IP stack avail)
      HWI: Reset, NMI, Ext Interrupts, Etc.
      PCI: PCI
      HPI (16 or 32): Host P
      Loader
      EMIF (16, 32, or 64-bits): SDRAM, SRAM

              Note: Not all 'C6000 devices have all the various peripherals shown above.
                    Please refer to the C6000 Product Update for a device-by-device listing.
               C6415 DSP (720MHz)
   [Block diagram: peripherals connect through the Enhanced DMA Controller
    (64 channels) to L2 Memory, the L1P and L1D caches, and the CPU Core (5760 MIPS)]

   Peripheral     Bandwidth
   EMIF 64        1064 MB/s
   EMIF 16         266 MB/s
   McBSP 0        12.5 MB/s
   McBSP 1        12.5 MB/s
   McBSP 2        12.5 MB/s
   Utopia 2        100 MB/s
   HPI / PCI       133 MB/s

   Internal bandwidth labels in the diagram: 2.9 GB/s, 11.5 GB/s, 23 GB/s
   Also shown: JTAG, RTDX, Power Down Logic, PLL, Timer 0, Timer 1, Timer 2


                            Before looking into the CPU, what does a DSP do anyway?
                 What is DSP?

   Signal path:  ADC -> x -> DSP -> Y -> DAC

   Digital sampling of an analog signal:
      [Plot: sampled signal versus time t]

   Most DSP algorithms can be expressed with MAC:

      $Y = \sum_{i=1}^{count} coeff_i \cdot x_i$

      for (i = 0; i < count; i++){
        sum += c[i] * x[i]; }
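
The loop fragment above leaves out its declarations. A minimal self-contained sketch of the same multiply-accumulate follows; the function name `mac_sum` and the 32-bit accumulator are illustrative assumptions, not something the slides specify.

```c
#include <stddef.h>

/* Sum of products over count samples.
 * mac_sum is a hypothetical name; the int accumulator is an assumption
 * chosen so that 16-bit products do not overflow immediately. */
int mac_sum(const short *c, const short *x, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += c[i] * x[i];   /* one multiply-accumulate per sample */
    }
    return sum;
}
```

For example, coefficients {1, 2, 3, 4} applied to samples {5, 6, 7, 8} accumulate 5 + 12 + 21 + 32.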

                         How has the C6000 CPU been designed to handle this?
         'C6000 CPU Architecture
   [Block diagram: register files A0 .. A15 .. A31 and B0 .. B15 .. B31;
    functional units .D1 .D2, .S1 .S2, .M1 .M2 (Dual MACs), .L1 .L2;
    Controller/Decoder]

    'C6000 Compiler excels at Natural C
    While the dual MACs speed up math-intensive algorithms, the
     flexibility of 8 independent functional units allows the
     compiler to quickly perform other types of processing
    All 'C6000 instructions are conditional, allowing efficient
     hardware pipelining
    'C6000 CPU can dispatch up to eight parallel instructions
     each cycle

                                              What are some differences
                                               between C6000 devices?
              The C62x/C67x CPU
   [Block diagram]
      Instruction Fetch, Instruction Dispatch, Instruction Decode
      Control Registers, Emulation
      Registers (A0 - A15), Registers (B0 - B15)
      Functional units:  L1   S1   M1   D1   D2   M2   S2   L2
      Symbols shown:     +    +    X    +    +    X    +    +
             The C64x CPU adds ...
   [Block diagram: same structure as the C62x/C67x CPU, plus]
      Advanced Instruction Packing
      Advanced Emulation
      Registers (A16 - A31), Registers (B16 - B31)
      Additional add (+) and multiply (x, X) symbols on the
       functional units L1 S1 M1 D1 D2 M2 S2 L2

                             How can we best make use of the functional unit parallelism?
Given this simple loop …

   $y = \sum_{n=1}^{40} c_n \cdot x_n$

   short mac(short *c, short *x, int count) {
    for (i=0; i < count; i++) {
     sum += c[i] * x[i]; } …

   [Diagram: registers c, x, prod, y, *cp, *xp, *yp; units .S1 .M1 .L1 .D1]

                MVK       .S1       40, cnt
   loop:
                LDH       .D1       *cp++, c
                LDH       .D1       *xp++, x
                MPY       .M1       c, x, prod
                ADD       .L1       y, prod, y
                SUB       .L1       cnt, 1, cnt
       [cnt]    B         .S1       loop
                STW       .D        y, *yp

                   How many of these instructions can we get in parallel?
                C62x Intense Parallelism
Given this C code

   short mac(short *c, short *x, int count) {
    for (i=0; i < count; i++) {
     sum += c[i] * x[i]; } …

The C62x compiler can achieve
Two Sum-of-Products per cycle

   L2: ; PIPED LOOP PROLOG
                LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7

                LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7

        [B0]    B    .S1 L3
   ||           LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7

        [B0]    B    .S1 L3
   ||           LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7

        [B0]    B    .S1 L3
   ||           LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7

                MPY  .M2 B7,A3,B4
   ||           MPYH .M1 B7,A3,A5
   ||   [B0]    B    .S1 L3
   ||           LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7

                MPY  .M2 B7,A3,B4
   ||           MPYH .M1 B7,A3,A5
   ||   [B0]    B    .S1 L3
   ||           LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7
   ;** -----------------------*
   L3:    ; PIPED LOOP KERNEL
                ADD  .L2 B4,B5,B5
   ||           ADD  .L1 A5,A0,A0
   ||           MPY  .M2 B7,A3,B4
   ||           MPYH .M1 B7,A3,A5
   ||   [B0]    B    .S1 L3
   ||   [B0]    SUB  .S2 B0,1,B0
   ||           LDW  .D1 *A4++,A3
   ||           LDW  .D2 *B6++,B7
   ;** -----------------------*
                                                             What about the 'C67x?
         C67x MAC using Natural C
The C67x compiler gets two 32-bit
  Sum-of-Products per iteration

   [Block diagram: Memory; register files A0 .. A15 and B0 .. B15;
    functional units .D1 .D2, .M1 .M2, .L1 .L2, .S1 .S2; Controller/Decoder]

   float mac(float *c, float *x, int count)
   { int i; float sum = 0;

    for (i=0; i < count; i++) {
     sum += c[i] * x[i]; } …

   ;** --------------------------------------------------*
   LOOP: ; PIPED LOOP KERNEL
                LDDW  .D1    *A4++,A7:A6
   ||           LDDW  .D2    *B4++,B7:B6
   ||           MPYSP .M1X   A6,B6,A5
   ||           MPYSP .M2X   A7,B7,B5
   ||           ADDSP .L1    A5,A8,A8
   ||           ADDSP .L2    B5,B8,B8
   || [A1]      B     .S2    LOOP
   || [A1]      SUB   .S1    A1,1,A1
   ;** --------------------------------------------------*
                                                        Can the 'C64x do better?
C64x gets four MAC's using DOTP2

   DOTP2
      A5 holds  m1 | m0
      B5 holds  n1 | n0
      A6 = m1*n1 + m0*n0
      A7 = running sum (A7 + A6)

   short mac(short *c, short *x, int count)
   { int i; short sum = 0;

    for (i=0; i < count; i++) {
     sum += c[i] * x[i]; } …

   ;** --------------------------------------------------*
   ; PIPED LOOP KERNEL
   LOOP:        ADD   .L2    B8,B6,B6
   ||           ADD   .L1    A6,A7,A7
   ||           DOTP2 .M2X   B4,A4,B8
   ||           DOTP2 .M1X   B5,A5,A6
   || [ B0]     B     .S1    LOOP
   || [ B0]     SUB   .S2    B0,-1,B0
   ||           LDDW  .D2T2  *B7++,B5:B4
   ||           LDDW  .D1T1  *A3++,A5:A4
   ;** --------------------------------------------------*
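
To make the DOTP2 data flow concrete, here is a portable C sketch of the arithmetic the slide describes: each 32-bit word is treated as two signed 16-bit halves, and the result is the sum of the two products. The names `dotp2_emulated`, `pack2`, and `mac_dotp2` are hypothetical helpers for this illustration only; this is an emulation for readers without the TI tools, not the compiler's own mechanism, and it assumes an even sample count.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulates the DOTP2 arithmetic: hi(a)*hi(b) + lo(a)*lo(b),
 * with each half interpreted as a signed 16-bit value. */
static int32_t dotp2_emulated(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}

/* Packs two 16-bit samples into one 32-bit word (hi in the upper half). */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Sum of products, two samples per step. Assumes count is even;
 * a trailing odd sample would be ignored by this sketch. */
int32_t mac_dotp2(const int16_t *c, const int16_t *x, size_t count)
{
    int32_t sum = 0;
    for (size_t i = 0; i + 1 < count; i += 2) {
        sum += dotp2_emulated(pack2(c[i + 1], c[i]), pack2(x[i + 1], x[i]));
    }
    return sum;
}
```

The kernel on the slide issues two DOTP2 instructions per cycle, one on each .M unit, which is where the four MACs per cycle come from.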
                     How many multiplies can the 'C6x perform?
   How many 16-bit MMACs (millions of MACs per second)
    can the 'C6201 perform?
             400 MMACs         (two .M units x 200 MHz)

   How about 16x16 MMACs on the 'C64x devices?

                     2 .M units
                x    2 16-bit MACs (per .M unit / per cycle)
                x 720 MHz
                = 2880 MMACs

   How many 8-bit MMACs on the 'C64x?

              5760 MMACs (on 8-bit data)
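
The three figures above follow one formula: multiplier units times MACs per unit per cycle times clock rate in MHz. A small sketch of that arithmetic is below; `peak_mmacs` is a hypothetical helper name, and the value of four 8-bit MACs per .M unit is inferred from the 5760 figure (5760 / (2 x 720)) rather than stated on the slide.

```c
/* Peak MMACs = .M units x MACs per unit per cycle x clock (MHz).
 * peak_mmacs is a hypothetical name used only for this illustration. */
unsigned peak_mmacs(unsigned m_units, unsigned macs_per_unit, unsigned clock_mhz)
{
    return m_units * macs_per_unit * clock_mhz;
}
```

With these inputs, `peak_mmacs(2, 1, 200)` corresponds to the 'C6201 case, `peak_mmacs(2, 2, 720)` to the 16-bit 'C64x case, and `peak_mmacs(2, 4, 720)` to the 8-bit 'C64x case.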
How Do We Get Such High Parallelism?
    Compiler and Assembly Optimizer use a technique
     called Software Pipelining
    Software pipelining enables high performance
     (esp. on DSP-type loops)
    Key point: Tools do all the work!

                                          What is software pipelining?
                                     Let's look at a simple example ...
  Tools Use Software Pipelining
Here's a simple example to demonstrate ...

      LDH
 ||   LDH
      MPY
      ADD

 How many cycles would it take to perform this loop 5 times?

      5 x 3 = 15 cycles

                        Our functional units could be used like ...
        Without Software Pipelining
Cycle    .D1   .D2   .M1   .M2       .L1       .L2       .S1       .S2
 1      ldh    ldh
 2                   mpy
 3                                  add
 4      ldh    ldh
 5                   mpy
 6                                  add
 7      ldh    ldh

                           In seven cycles, we’re almost half-way done ...
        With Software Pipelining
Cycle   .D1   .D2   .M1     .M2       .L1       .L2       .S1        .S2
 1      ldh   ldh
 2      ldh   ldh   mpy
 3      ldh   ldh   mpy              add
 4      ldh   ldh   mpy              add
 5      ldh   ldh   mpy              add
 6                  mpy              add
 7                                   add

                                     Completes in only 7 cycles
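
The two tables can be summarized with simple cycle counts. The sketch below assumes the slide's simplified model, in which every stage (load, multiply, add) takes one cycle and a new iteration can start every cycle; real C6000 load and multiply delay slots are ignored. The function names are hypothetical.

```c
/* Cycles when each iteration runs all of its stages before the next begins. */
unsigned cycles_unpipelined(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Cycles when a new iteration starts every cycle: the first iteration needs
 * `stages` cycles, and each further iteration finishes one cycle later. */
unsigned cycles_pipelined(unsigned iterations, unsigned stages)
{
    if (iterations == 0 || stages == 0) {
        return 0;
    }
    return iterations + stages - 1;
}
```

For 5 iterations of a 3-stage loop these give the 15 and 7 cycles shown in the tables.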

                     It takes 1/2 the time! How does this translate to code?
 S/W Pipelining Translated to Code
Cycle   .D1   .D2   .M1   .L1
 1      ldh   ldh
 2      ldh   ldh   mpy
 3      ldh   ldh   mpy   add
 4      ldh   ldh   mpy   add
 5      ldh   ldh   mpy   add
 6                  mpy   add
 7                        add

   c1:         LDH
         ||    LDH
   c2:         MPY
         ||    LDH
         ||    LDH
   c3:         ADD
         ||    MPY
         ||    LDH
         ||    LDH
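
The same prolog / kernel / epilog structure can be written by hand in C, which may help when reading the compiler's output. This is only an illustrative sketch: `mac_pipelined` is a hypothetical name, the stages are modeled as plain C statements, and on a real target the tools generate this schedule for you.

```c
#include <stddef.h>

/* Manually software-pipelined sum of products.
 * In the kernel, one pass performs the add for iteration i-2, the multiply
 * for iteration i-1, and the loads for iteration i. */
int mac_pipelined(const short *c, const short *x, size_t count)
{
    int sum = 0;
    if (count == 0) {
        return 0;
    }
    if (count == 1) {
        return c[0] * x[0];
    }

    /* Prolog: fill the pipeline (cycles c1 and c2 in the listing). */
    short c_ld = c[0], x_ld = x[0];   /* load iteration 0     */
    int prod = c_ld * x_ld;           /* multiply iteration 0 */
    c_ld = c[1];                      /* load iteration 1     */
    x_ld = x[1];

    /* Kernel: all three stages active (cycle c3 repeated). */
    for (size_t i = 2; i < count; i++) {
        sum += prod;                  /* add for iteration i-2      */
        prod = c_ld * x_ld;           /* multiply for iteration i-1 */
        c_ld = c[i];                  /* loads for iteration i      */
        x_ld = x[i];
    }

    /* Epilog: drain the pipeline. */
    sum += prod;                      /* add for iteration count-2      */
    prod = c_ld * x_ld;               /* multiply for iteration count-1 */
    sum += prod;                      /* add for iteration count-1      */
    return sum;
}
```

The result matches a plain loop; only the order in which the loads, multiplies, and adds are issued changes.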

 What is DSP
 C6000 CPU Architecture
 Making use of Parallelism
   Tool Overview
       C6416 DSK
       Code Composer Studio (CCS)
       CCS Projects
       Build Options
       CDB Files
C6416 DSK

Diagnostic Utility included with DSK ...
DSK's Diagnostic Utility

                          Test/Diagnose DSK hardware
                          Verify USB emulation link
                          Use Advanced tests to facilitate
                          Reset DSK

                             CCS Overview ...
           Code Composer Studio
   [Tool-flow diagram]
      Build flow:      Edit, Compiler, Asm Opto, Asm, Link, Debug
      Libraries:       Standard Runtime Libraries, DSP/BIOS Libraries
      Configuration:   DSP/BIOS Config Tool
      Other blocks:    SIM, XDS, DSP Board, Third Party

         DSK's Code Composer Studio Includes:
          Integrated Edit / Debug GUI
          Simulator
          Code Generation Tools
          BIOS: Real-time kernel
                Real-time analysis
                                                     CCS is Project centric ...
What is a Project?

          Project (.PJT) file contains:
          References to files:
               Source
               Libraries
               Linker, etc …

          Project settings:
               Compiler Options
               DSP/BIOS
               Linking, etc …

                            The project menu ...
                   Project Menu
                         Create and open projects
                         from the Project menu,
                         not the File menu.

                         Access open projects
                         via pull-down menu
                         or by right-clicking .pjt file
                         in project explorer window

Build Options
                           -g -q -fr"c:\modem\Debug" -mv6700

    Eight Categories of
     Compiler options

                                    The most common Compiler Options are ...
               Compiler's Build Options
         Nearly one-hundred compiler options available to
          tune your code's performance, size, etc.
         Following table lists the most common options:

           Options      Description
           -mv6700      Generate 'C67x code ('C62x is default)
           -mv6400      Generate 'C64x code
           -fr <dir>    Directory for object/output files
           -fs <dir>    Directory for assembly files
           -q           Quiet mode (display less info while compiling)
           -g           Debug option: enables src-level symbolic debugging
           -s           Debug option: interlist C statements into assembly listing

         In Chapter 4 we will examine the options which
          enable the compiler's optimizer

                                                        And, the Config Tool ...
DSP/BIOS Configuration Tool

            Simplifies system design by:
               Automatically including the appropriate
                runtime support libraries
               Automatically handling interrupt vectors
                and system reset
               Handling system memory configuration
                (builds CMD file)
               Generating 5 files when the CDB file is saved:
                  C file, Asm file, 2 header files and a
                     linker command (.cmd) file
               More to be discussed later …
    Lab Exercises – C67x vs. C64x
   Which DSK are you using?
   We provide instructions and solutions for both
    C67x and C64x.
   We have tried to call out the few differences in
    lab steps as explicitly as possible:
Lab 1 – Create & Graph a Sine Wave

                sineGen()     buffer

 Introduction to Code Composer Studio (CCS)
     Hook up DSK hardware
     Create and build a project
     Examine variables, memory, code
     Run, halt, step, multi-step, use breakpoints
     Graph results in memory (to see the sine wave)
                  Creating a Sine Wave
    Sine_float.c
    Generates a value for
    each output sample

float y[3] = {0, 0.0654031, 0};
float A = 1.9957178;

short sineGen() {
 y[0] = y[1] * A - y[2];
 y[2] = y[1];
 y[1] = y[0];
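
The listing on the slide stops before the function returns. A self-contained sketch of the same recursion follows. It is a second-order recursive oscillator, y[n] = A*y[n-1] - y[n-2]; with A = 2*cos(w) and y[1] seeded with sin(w), the constants on the slide correspond to w of about 0.06545 radians per sample, roughly 96 samples per period. The name `sineGenSketch` and the scale factor `SINE_SCALE` used to convert the float result to a `short` are assumptions for illustration; the original return statement is not shown.

```c
/* State of the recursion: y[0] newest output, y[1] and y[2] the two
 * previous outputs. The seed 0.0654031 is sin(w); A is 2*cos(w). */
static float y[3] = {0.0f, 0.0654031f, 0.0f};
static const float A = 1.9957178f;

/* Assumed scale factor for the float-to-short conversion; the value used
 * in the original lab source is not visible on the slide. */
#define SINE_SCALE 32000.0f

/* Returns the next sine sample, scaled to a 16-bit integer. */
short sineGenSketch(void)
{
    y[0] = y[1] * A - y[2];   /* y[n] = A*y[n-1] - y[n-2] */
    y[2] = y[1];              /* shift history            */
    y[1] = y[0];
    return (short)(SINE_SCALE * y[0]);
}
```

Calling the function repeatedly and storing each result in a buffer yields the waveform that the lab then graphs in CCS.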

               Lab 1 Debrief
1.   What differences are there in Lab1 between
     the C6713 and C6416 solutions?
2.   What do we need CCS Setup for?
3.   Why did we return from main?
4.   What did you have to add to LAB1.C to get
     printf to work?
5.   Did you find the “clearArrays” GEL menu
     command useful?
Take Home Exercises (Optional)
   Lab1a - Customize CCS
   Lab1b - Using GEL Scripts
   Lab1c - Using Printf
   Lab1d - Float vs Fixed Point
   Lab1e - Explore CCS Scripting

Optional Topics
   CCS Automation
       Command Window
       GEL Scripting
       CCS Scripting
       TCONF Scripting
   CPU Architecture Detail
   C6000 Instruction Set
   Benchmarks
