PAGES: 60 POSTED ON: 8/16/2011
Introduction to DSP
Maurizio Palesi

What is a DSP?
- Digital: operating by the use of discrete signals to represent data in the form of numbers
- Signal: a variable parameter by which information is conveyed through an electronic circuit
- Processing: performing operations on data according to programmed instructions
Digital Signal Processing: changing or analysing information which is measured as discrete sequences of numbers.

Main Characteristics
Compared to other embedded computing applications, DSP applications are differentiated by the following:
- Computationally demanding iterative numeric algorithms
- Sensitivity to small numeric errors (audible noise)
- Stringent real-time requirements
- Streaming data
- High data bandwidth
- Predictable (though often eccentric) memory access patterns
- Predictable program flow (nested loops)

DSP Processors
[Diagram: around 1970, DSP techniques in telecommunication equipment could be implemented either as custom fixed-function hardware (not adequate in flexibility and reusability) or on microprocessors (not adequate in performance); DSP processors emerged to fill the gap.]

DSP vs. General Purpose
While general-purpose processors pursue performance through VLIW, superscalar, SIMD, multiprocessing, and similar techniques, DSPs adopt a range of specialized features:
- Single-cycle multiplier
- Multiply-accumulate operations
- Saturation arithmetic
- Separate program and data memories
- Dedicated, specialized addressing hardware
- Complex, specialized instruction sets
Today, virtually every commercial 32-bit microprocessor architecture (from ARM to 80x86) has been subject to some kind of DSP-oriented enhancement.

Converting Analogue Signals
A continuous signal is measured against a clock: it is first held at each clock tick, then the signal is measured and the measurement converted to a digital value.

Aliasing
Some higher frequencies can be incorrectly interpreted. The aliasing problem: one frequency looks like another. A high-frequency signal sampled at too low a rate looks like a lower-frequency signal.

Aliasing
We must sample faster than twice the frequency of the highest frequency component [Nyquist's theorem]; this avoids aliasing. Actually, Nyquist says that we have to sample faster than the signal bandwidth. A signal sampled twice per cycle has enough information to be reconstructed.

Frequency Resolution
We cannot see slow changes in the signal if we don't wait long enough: we must sample for at least one complete cycle of the lowest frequency we want to resolve.
Compromise: we must sample fast to avoid aliasing, and for a long time to achieve good frequency resolution. Sampling fast for a long time means a lot of samples, and lots of samples means lots of computation. So we have to compromise between resolving frequency components of the signal and being able to see high frequencies.

Quantisation
When the signal is converted to digital form, the precision is limited by the number of bits available. The errors introduced by digitisation are both:
- Non-linear: we cannot calculate their effects using normal maths
- Signal dependent: the errors are coherent and so cannot be reduced by simple means
Limited precision leads to errors...
which are signal dependent.

Time Domain Processing
- Correlation: autocorrelation to extract a signal from noise; cross-correlation to locate a known signal; cross-correlation to identify a signal
- Convolution

Correlation
Correlation is a weighted moving average:

    r(n) = Σ_k x(k) × y(k + n)

Shift y by n, multiply the two signals together, and integrate. This requires a lot of calculation: if one signal is of length M and the other is of length N, we need N × M multiplications to calculate the whole correlation function. Note that really we want to multiply and then accumulate the result; this is typical of DSP operations and is called a multiply & accumulate operation.

Correlation
Correlation is a maximum when two signals are similar in shape; it is a measure of the similarity between two signals as a function of the time shift between them. If two signals are similar and unshifted, their product is all positive. But as the shift increases, parts of it become negative, and the correlation function shows where the signals are similar and unshifted.

Detecting Periodicity
[Figure: an EEG signal and its autocorrelation.] Autocorrelation as a way to detect periodicity in signals.

Detecting Periodicity
[Figure: an EEG signal with noise and its autocorrelation.] Although the rhythm is not even visible (upper trace), it is detected by autocorrelation (lower trace).

Align Signals
[Figures: signals x and y, and corr(x, y); the correlation is used to align the two signals.]

Cross Correlation
Cross-correlation (correlating a signal with another) can be used to detect and locate a known reference signal in noise. A radar or sonar 'chirp' signal bounced off a target may be buried in noise...
but correlating with the 'chirp' reference clearly reveals when the echo comes.

Cross Correlation to Identify a Signal
Cross-correlation (correlating a signal with another) can be used to identify a signal by comparison with a library of known reference signals. The chirp of a nightingale correlates strongly with another nightingale, but weakly with a dove or a heron.

Cross Correlation to Identify a Signal
Cross-correlation is one way in which sonar can identify different types of vessel. Each vessel has a unique sonar signature. The sonar system has a library of pre-recorded echoes from different vessels; an unknown sonar echo is correlated with the library of reference echoes, and the largest correlation is the most likely match.

Convolution
Convolution is a weighted moving average with one signal flipped back to front:

    r(n) = Σ_k x(k) × y(k − n)

To convolve one signal with another, first flip the second signal, then shift it, then multiply the two together and integrate under the curve. This requires a lot of calculation: if one signal is of length M and the other is of length N, we need N × M multiplications to calculate the whole convolution function. We need to multiply and then accumulate the result; this is typical of DSP operations and is called a multiply & accumulate operation.

Convolution vs. Correlation
Convolution is used for digital filtering. Convolving two signals is equivalent to multiplying the frequency spectra of the two signals together; this is easily understood, and is what we mean by filtering. Correlation is equivalent to multiplying the complex conjugate of the frequency spectrum of one signal by the frequency spectrum of the other; it is not so easily understood, and so convolution is used for digital filtering. Convolving by multiplying frequency spectra is called fast convolution.

Fourier Transform
The Fourier Transform is a mathematical procedure that converts a signal from the time domain to the frequency domain. Any signal or waveform can be made up just by adding together a series of sine waves with appropriate amplitude and phase. A square wave can be made by adding the fundamental, minus 1/3 of the third harmonic, plus 1/5 of the fifth harmonic, minus 1/7 of the 7th harmonic...

Fourier Transform
The Fourier transform is an equation to calculate the frequency, amplitude and phase of each sine wave needed to make up any given signal.
- The Fourier Transform (FT) is a mathematical formula using integrals
- The Discrete Fourier Transform (DFT) is a discrete numerical equivalent using sums instead of integrals
- The Fast Fourier Transform (FFT) is just a computationally fast way to calculate the DFT
The DFT involves a summation:

    H(f) = Σ_k c[k] × e^(−2πjk(fΔ))

The DFT and the FFT involve a lot of multiply-and-accumulate work; this is typical of DSP operations and is called a multiply & accumulate operation.

Filtering
[Diagram: raw signal -> filter -> filtered signal.]
The function of a filter is to remove unwanted parts of the signal (e.g. random noise) and to extract useful parts of the signal (e.g. components lying within a certain frequency range). Filters may be analog or digital.

Analog Filters
An analog filter uses analog electronic circuits, built from components such as resistors,
capacitors and op amps. Analog filters are widely used in applications such as noise reduction, video signal enhancement, graphic equalisers in hi-fi systems, and many other areas.

Digital Filters
A digital filter uses a digital processor (e.g. a specialised DSP chip) to perform numerical calculations on sampled values of the signal.
[Diagram: unfiltered analog signal -> A/D -> sampled digitised signal -> DSP -> digitally filtered signal -> D/A -> filtered analog signal.]

Advantages of Digital Filters
- Programmability: the digital filter can easily be changed without affecting the circuitry, whereas analog filter circuits are subject to drift and are dependent on temperature
- Digital filters can handle low-frequency signals accurately; as the speed of DSP technology continues to increase, digital filters are being applied to high-frequency signals in the RF domain
- Versatility: they can adapt to changes in the characteristics of the signal

DSP Processors
- Characteristic features of DSP processors: special features for arithmetic, I/O interfaces, memory architectures, data formats
- Some basic DSP chip designs
- Brief overview of some major DSP processors

Characteristics of DSP Processors
DSP processors are mostly designed with the same few basic operations in mind, so they share the same set of basic characteristics:
- Specialised high-speed arithmetic
- Data transfer to and from the real world
- Multiple-access memory architectures

Characteristics of DSP Processors
The basic DSP operations:
[Diagram: FIR filter structure with input x, output y, coefficients c[0], c[1], c[2] and delay elements Z^-1, Z^-2.]
- Additions and multiplications: fetch two operands, perform the addition or multiplication (usually both), store the result or hold it for a repetition
- Delays: hold a value for later use
- Array handling: fetch values from consecutive memory locations; copy data from memory to memory

Characteristics of DSP Processors
To suit these fundamental operations, DSP processors often have:
- Parallel multiply and add
- Multiple memory accesses (to
fetch two operands and store the result)
- Lots of registers to hold data temporarily
- Efficient address generation for array handling
- Special features such as delays or circular addressing

Address Generation
The ability to generate new addresses efficiently is a characteristic feature of DSP processors. Usually, the next needed address can be generated during the data fetch or store operation, with no overhead. DSP processors have rich sets of address-generation operations:
- *rP : register indirect; read the data pointed to by the address in register rP
- *rP++ : postincrement; having read the data, postincrement the address pointer to point to the next value in the array
- *rP-- : postdecrement; having read the data, postdecrement the address pointer to point to the previous value in the array
- *rP++rI : postincrement by register; having read the data, postincrement the address pointer by the amount held in register rI, to point to a value rI further down the array
- *rP++rIr : bit reversed; having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit-reversed order

Bit Reversed Addressing
DSPs are tightly targeted at a small number of algorithms. It is surprising that an addressing mode has been specifically defined for just one application (the FFT).
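In software, the bit-reversed index sequence that this addressing mode produces in hardware can be computed explicitly. A minimal C sketch (the helper name bit_reverse is illustrative, not part of any DSP toolchain):

```c
#include <assert.h>

/* Reverse the low `bits` bits of index n, e.g. bits = 3 for an
   8-point radix-2 FFT. Illustrative only: on a DSP, the *rP++rIr
   addressing mode performs this permutation with zero overhead. */
unsigned bit_reverse(unsigned n, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (n & 1u);  /* shift the LSB of n into r */
        n >>= 1;
    }
    return r;
}
```

For an 8-point FFT (bits = 3) this reproduces the address table that follows: indices 0..7 map to 0, 4, 2, 6, 1, 5, 3, 7.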
Addresses generated by a radix-2 FFT come out in bit-reversed order. Without special support, such address transformations would take an extra memory access to get the new address, or involve a fair amount of logical instructions.

    Index       Bit-reversed
    0 (000₂)    0 (000₂)
    1 (001₂)    4 (100₂)
    2 (010₂)    2 (010₂)
    3 (011₂)    6 (110₂)
    4 (100₂)    1 (001₂)
    5 (101₂)    5 (101₂)
    6 (110₂)    3 (011₂)
    7 (111₂)    7 (111₂)

Memory Addressing
As DSP programmers migrate toward larger programs, they are more attracted to compilers. Such compilers are not able to fully exploit these specific addressing modes, but the DSP community routinely uses library routines, so programmers may benefit even if they write at a high level.

    Addressing mode                                                  Percent
    Immediate                                                        30.02%
    Displacement                                                     10.82%
    Register indirect                                                17.42%
    Direct                                                           11.99%
    Autoincrement, postincrement                                     18.84%
    Autoincrement, preincrement with 16-bit immediate                 0.77%
    Autoincrement, preincrement with circular addressing              0.08%
    Autoincrement, postincrement by contents of AR0                   1.54%
    Autoincrement, postincrement by contents of AR0, circular         2.15%
    Autodecrement, postdecrement                                      6.08%

(The first five modes together account for ~90% of uses.)

DSP Processors: Input/Output
DSP is mostly dealing with the real world:
- Communication with an overall system controller
- Signals coming in and going out
- Communication with other DSP processors
[Diagram: DSP connected to a system controller, to signal in/out, and to other DSPs.]

DSP Evolution
When DSP processors first came out, they were rather fast processors: the first floating-point DSP, the AT&T DSP32, ran at 16 MHz at a time when PC clocks were 5 MHz. A
fashionable demonstration at the time was to plug a DSP board into a PC and run a fractal (Mandelbrot) calculation on the DSP and on the PC side by side; the DSP fractal was of course faster. Today, the fastest DSP processor is the Texas TMS320C6201, which runs at 200 MHz. This is no longer very fast compared with an entry-level PC, and the same fractal today will actually run faster on the PC than on the DSP! But try feeding eight channels of high-quality audio data in and out of a Pentium simultaneously in real time, without impacting processor performance.

Signals
Signals are usually handled by high-speed synchronous serial ports. Serial ports are inexpensive, having only two or three wires, and are well suited to audio or telecommunications data rates up to 10 Mbit/s. They usually operate under DMA: data presented at the port is automatically written into DSP memory without stopping the DSP.

Host Communications
Many systems have another, general-purpose processor to supervise the DSP; for example, the DSP might be on a PC plug-in card. Whereas signals tend to be continuous, host communication tends to require data transfer in batches, for instance to download a new program or to update filter coefficients. Some DSP processors have dedicated host ports: the Lucent DSP32C has a host port which is effectively an 8-bit or 16-bit ISA bus; the Motorola DSP56301 and the Analog Devices ADSP21060 have host ports which implement the PCI bus.

Interprocessor Communications
Interprocessor communication is needed when a DSP application is too much for a single processor. The Texas TMS320C40 and the Analog Devices ADSP21060 both have six link ports. These would ideally be parallel ports at the word length of the processor, but that would use up too many pins, so a hybrid called serial/parallel is used: on the 'C40, comm ports are 8 bits wide and it takes four transfers to move one 32-bit word; on the 21060, link ports are 4 bits wide and it takes 8 transfers to
move one 32-bit word.

Memory Architectures
Additions and multiplications require us to fetch two operands, perform the addition or multiplication (usually both), and store the result or hold it for a repetition. To fetch the two operands in a single instruction cycle, we need to be able to make two memory accesses simultaneously; plus one access to write back the result, plus one access to fetch the instruction itself.

Memory Architectures
There are two common methods to achieve multiple memory accesses per instruction cycle: the Harvard architecture and the modified von Neumann architecture.

Harvard Architecture
[Diagram: DSP with separate program and data memories on separate buses.]
DSP operations usually involve at least two operands, so DSP Harvard architectures usually permit the program bus to be used also for operand access. But it is often necessary to fetch the instruction too, and the plain Harvard architecture is inadequate to support this. Hence the Super Harvard architecture (SHARC): DSP Harvard architectures often also include a cache memory, leaving both Harvard buses free for fetching operands.

Modified von Neumann Architectures
The Harvard architecture requires two memory buses, which makes it expensive to bring off the chip. Moreover, even the simplest DSP operation requires four memory accesses (three to fetch the two operands and the instruction, plus a fourth to write the result), which exceeds the capabilities of a Harvard architecture. Some processors get around this by using a modified von Neumann architecture.

Modified von Neumann Architectures
[Diagram: DSP with a single program-and-data memory.]
The modified von Neumann architecture allows multiple memory accesses per instruction by running the memory clock faster than the instruction cycle. The Lucent DSP32C runs with an 80 MHz clock, divided by four to give 20 MIPS, while the memory clock runs at the full 80 MHz: each instruction cycle is divided into four 'machine states', and a memory access can be made in each machine
state.

Example Processor
[Diagram: a generic DSP showing address generation, lots of registers, parallel multiply/add, multiple memories, and efficient I/O, connected to signal in/out, a system controller, and other DSPs.]

Example Processor: Lucent DSP32C
Modified von Neumann architecture; 22 24-bit registers, which also serve for integer arithmetic; address bus 24 bits; data 32 bits (40 bits internally).

Example Processor: ADSP21060
Super Harvard architecture; six link ports; two serial ports; program address 24 bits, program data 48 bits; data address 32 bits, data data 40 bits.

Data Formats
DSP processors store data in fixed-point or floating-point formats.
Integer (bit weights −2⁷, 2⁶, 2⁵, 2⁴, 2³, 2², 2¹, 2⁰):
    0 1 0 1 0 0 1 1 = 2⁶ + 2⁴ + 2¹ + 2⁰ = 83
Fixed point (bit weights −2⁰, 2⁻¹, 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵, 2⁻⁶, 2⁻⁷):
    0 1 0 1 0 0 0 0 = 2⁻¹ + 2⁻³ = 0.5 + 0.125 = 0.625
The programmer has to make some decisions: if a fixed-point number becomes too large for the available word length, it has to be scaled down by shifting it to the right; if a fixed-point number is small, it has to be scaled up in order to use more of the available word length.

Fixed Point
Fixed point can be thought of as just low-cost floating point: it does not include an exponent in every word, and there is no hardware that automatically aligns and normalizes operands. The DSP programmer takes care to keep the exponent in a separate variable; often this variable is shared by a set of fixed-point variables (block floating point).

Floating Point
Floating-point format has the remarkable property of automatically scaling all numbers by moving, and keeping track of, the binary point, so that all numbers use the full word length available but never overflow.
    Mantissa bits 0 1 1 0 1 0 0 0 0 = 2⁰ + 2⁻¹ + 2⁻³ = 1 + 0.5 + 0.125 = 1.625
    Exponent bits 0 1 1 0 = 2² + 2¹ = 6
    Decimal value = 1.625 × 2⁶

Data Formats
In floating point, the hardware automatically scales and normalises every number. Errors due to truncation and
rounding depend on the size of the number. These errors can be seen as a source of quantisation noise, and the noise is modulated by the size of the signal. This signal-dependent modulation of the noise is undesirable because it is audible; for this reason the audio industry prefers fixed-point DSP processors over floating point.

Saturating Arithmetic
DSPs are often used in real-time applications where an exception on arithmetic overflow is unacceptable: the system could miss an event. To support such an environment, DSP architectures use saturating arithmetic: if the result is too large to be represented, it is set to the largest representable number.
[Figure: normal two's-complement arithmetic wraps around on overflow, while saturating arithmetic clamps at the largest representable value.]

Programming a DSP Processor
- A simple FIR filter program
- Using pointers
- Avoiding memory bottlenecks
- Assembler programming

A Simple FIR Filter
The simple FIR filter equation is

    y[n] = Σ_k c[k] × x[n − k]

which can be implemented quite directly in C:

    y[n] = 0.0;
    for (k = 0; k < N; k++)
        y[n] = y[n] + c[k] * x[n-k];

The arrays are accessed repeatedly, but accessing by array index is inefficient: arithmetic is needed to calculate each index.

Problem in Addressing
Five operations are needed to calculate the address of the element x[n-k]:
1. Load the start address of the table in memory
2. Load the value of the index n
3. Load the value of the index k
4. Calculate the offset [n − k]
5. Add the offset to the start address of the array
Only after all five operations can the compiler actually read the array element.

Using Pointers

    y[n] = 0.0;
    for (k = 0; k < N; k++)
        y[n] = y[n] + c[k] * x[n-k];

becomes

    float *y_ptr, *c_ptr, *x_ptr;
    y_ptr = &y[n];
    c_ptr = &c[0];   /* walks forwards through the coefficients */
    x_ptr = &x[n];   /* walks backwards through the samples */
    for (k = 0; k < N; k++)
        *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;

Using Pointers
Each pointer still has to be initialised, but only once, before the loop, without requiring any arithmetic to
calculate offsets. Using pointers is more efficient than array indices on any processor, but it is especially efficient for DSP processors, where address increments often come for free:
- *rP : register indirect; read the data pointed to by the address in register rP
- *rP++ : postincrement; having read the data, postincrement the address pointer to point to the next value in the array
- *rP-- : postdecrement; having read the data, postdecrement the address pointer to point to the previous value in the array
- *rP++rI : postincrement by register; having read the data, postincrement the address pointer by the amount held in register rI, to point to a value rI further down the array
The address increments are performed in the same instruction as the data access to which they refer, so they incur no overhead at all. Most DSP processors can perform two or three address increments for free in each instruction, so the use of pointers is crucially important for DSP processors.

Limiting Memory Accesses

    float *y_ptr, *c_ptr, *x_ptr;
    y_ptr = &y[n];
    for (k = 0; k < N; k++)
        *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;   /* one store, three loads */

Each iteration makes four memory accesses. Even without counting the need to load the instruction, this exceeds the capacity of a DSP processor. Fortunately, DSP processors have lots of registers:

    register float temp;
    temp = 0.0;                  /* this initialization is wasted! */
    for (k = 0; k < N; k++)
        temp = temp + *c_ptr++ * *x_ptr--;

    register float temp;
    temp = *c_ptr++ * *x_ptr--;  /* the first tap replaces the initialization */
    for (k = 1; k < N; k++)
        temp = temp + *c_ptr++ * *x_ptr--;

Compilers for DSPs
Despite the well-documented advantages of compilers in programmer productivity and software maintenance...
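Gathering the pointer walk and the register accumulator from the preceding slides into one routine, a minimal self-contained C sketch of the FIR kernel looks as follows (the function name fir_tap and its signature are illustrative, not from the original code):

```c
#include <stddef.h>

/* Compute y[n] = sum_k c[k] * x[n-k] for k = 0..N-1 using the
   pointer idiom from the slides: c is walked forwards, x backwards,
   and the sum is kept in a register variable so y[n] is not
   stored and reloaded on every iteration. x_n must point at x[n],
   and N must be >= 1. */
float fir_tap(const float *c, const float *x_n, size_t N)
{
    const float *c_ptr = c;
    const float *x_ptr = x_n;
    /* the first tap seeds the accumulator: no wasted 0.0 init */
    float acc = *c_ptr++ * *x_ptr--;
    for (size_t k = 1; k < N; k++)
        acc += *c_ptr++ * *x_ptr--;   /* multiply & accumulate */
    return acc;
}
```

On a DSP, each `*c_ptr++` and `*x_ptr--` maps directly onto the free postincrement/postdecrement addressing modes listed above, so the loop body becomes a single multiply-accumulate instruction.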
Ratio of compiled code to hand-written assembly (execution time > 1 means slower; code space > 1 means bigger):

    TMS320C54 (C54), DSPstone kernels      Execution time   Code space
    Convolution                                 11.8            16.5
    FIR                                         11.5             8.7
    Matrix 1x3                                   7.7             8.1
    FIR2dim                                      5.3             6.5
    Dot product                                  5.2            14.1
    LMS                                          5.1             0.7
    N real update                                4.7            14.1
    IIR n biquad                                 2.4             8.6
    N complex update                             2.4             9.8
    Matrix                                       1.2             5.1
    Complex update                               1.2             8.7
    IIR one biquad                               1.0             6.4
    Real update                                  0.8            15.6
    C54 geometric mean                           3.2             7.8

    TMS320C6203 (C62), EEMBC Telecom       Execution time   Code space
    Convolution encoder                         44.0             0.5
    Fixed-point complex FFT                     13.5             1.0
    Viterbi GSM decoder                         13.0             0.7
    Fixed-point bit allocation                   7.0             1.4
    Autocorrelation                              1.8             0.7
    C62 geometric mean                          10.0             0.8

Introduction
The TMS320C3x generation of DSPs are high-performance 32-bit floating-point devices in the TMS320 family: extensive internal busing, a powerful DSP instruction set, 60 MFLOPS, and a high degree of on-chip parallelism (up to 11 operations in a single instruction).

General Features
- General-purpose register file
- Program cache
- Dedicated auxiliary register arithmetic units (ARAUs)
- Internal dual-access memories
- Direct memory access (DMA)
- Short machine-cycle time
Block Diagram
[Figure: TMS320C3x block diagram.]

Central Processing Unit (CPU)
The 'C3x devices have a register-based CPU architecture. The CPU consists of the following components:
- Floating-point/integer multiplier
- Arithmetic logic unit (ALU)
- 32-bit barrel shifter
- Internal buses (CPU1/CPU2 and REG1/REG2)
- Auxiliary register arithmetic units (ARAUs)
- CPU register file
[Figure: block diagram of the CPU.]

Multiplier
The multiplier performs single-cycle multiplications: 24-bit integer operands give a 32-bit result; 32-bit floating-point operands give a 40-bit result.

ALU and Barrel Shifter
The ALU performs single-cycle operations on 32-bit integer, 32-bit logical, and 40-bit floating-point data, including single-cycle integer and floating-point conversions. The barrel shifter is used to shift up to 32 bits left or right in a single cycle.

Internal Buses
Four internal buses (CPU1, CPU2, REG1, and REG2) carry two operands from memory and two operands from the register file, allowing parallel multiplies and adds/subtracts on four integer or floating-point operands in a single cycle.

ARAUs
Two auxiliary register arithmetic units (ARAU0 and ARAU1) can generate two addresses in a single cycle. The ARAUs operate in parallel with the multiplier and ALU, and support addressing with displacements, index registers (IR0 and IR1), circular addressing, and bit-reversed addressing.

Register File
There are 28 registers in a multiport register file. All of the primary registers can be operated upon by the multiplier and ALU, and used as general-purpose registers; the registers also have some special functions:
- The eight extended-precision registers are especially suited for maintaining extended-precision floating-point results
- The eight auxiliary registers support a variety of indirect addressing modes and can be used as general-purpose 32-bit integer and logical registers
- The remaining registers provide such system functions as addressing, stack management, processor status, interrupts, and block repeat

Peripherals
Timers: the two timer
modules are general-purpose 32-bit timer/event counters.
Serial ports: the bidirectional serial ports are totally independent; each serial port can be configured to transfer 8, 16, 24, or 32 bits of data per word.

Direct Memory Access (DMA)
The DMA controller can read/write any location in the memory map without interfering with CPU operation. Dedicated DMA address and data buses minimize conflicts between the CPU and the DMA controller. When the CPU and DMA access the same resources, priorities must be established (CPU priority, DMA priority, or rotating priority).

Extended Precision Registers (R7-R0)
These can store and support operations on 32-bit integer and 40-bit floating-point numbers.

Auxiliary Registers (AR7-AR0)
The CPU can access the eight 32-bit auxiliary registers (AR7-AR0) and the two auxiliary register arithmetic units (ARAUs). The primary function of the auxiliary registers is the generation of 24-bit addresses; they can also operate as loop counters in indirect addressing, or as general-purpose 32-bit integer and logical registers that can be modified by the multiplier and ALU.

Other Registers
- Index registers (IR1, IR0): used by the ARAU for indexing the address
- Block size register (BK): used by the ARAU in circular addressing to specify the data block size
- System-stack pointer (SP): contains the address of the top of the system stack; SP always points to the last element pushed onto the stack, and is manipulated by interrupts, traps, calls, returns, and the PUSH, PUSHF, POP, and POPF instructions

Status Register (ST)
Contains global information about the state of the CPU. Operations usually set the condition flags of the status register according to whether the result is 0, negative, etc.
[Figure: ST bit fields, including global interrupt enable, cache enable, repeat mode, latched floating-point underflow, latched overflow, floating-point underflow, zero, carry, overflow, negative, cache clear, cache freeze, and overflow mode.]

Repeat Counter (RC) and Block Repeat (RS, RE)
RC is a 32-bit register that specifies the number of times a block of code is to be repeated when a block repeat is performed; if RC = n, the loop is executed n+1 times. The RS register contains the starting address, and the RE register the ending address, of the program-memory block to be repeated when the CPU is operating in repeat mode.

Instruction Cache
A 64 × 32-bit instruction cache, 2-way set associative, with an LRU replacement policy. It allows the use of slow external memories while still achieving single-cycle access performance. The cache also frees the external buses from program fetches, so they can be used by the DMA or other system elements.

Addressing Modes
Five types of addressing:
- Register addressing
- Direct addressing
- Indirect addressing
- Immediate addressing
- PC-relative addressing
Plus two specialized addressing modes: circular addressing and bit-reversed addressing.

Register Addressing
A CPU register contains the operand:

    ABSF R1    ; R1 = |R1|

Any CPU register can be used (R0-R7, AR0-AR7, DP, IR0, IR1, ...).

Direct Addressing
The data address is formed by the concatenation of the eight LSBs of the data-page pointer (DP) with the 16 LSBs of the instruction word (expr). This results in 256 pages of 64K words each.

Direct Addressing

    ADDI @0BCDEh, R7

    Before instruction:  R7 = 00 0000 0000, DP = 8A, data memory at 8ABCDEh = 1234 5678
    After instruction:   R7 = 00 1234 5678, DP = 8A, data memory at 8ABCDEh = 1234 5678

Indirect Addressing
Specifies the address of an operand in memory through the contents of an auxiliary register, optional
displacements, and index registers. The auxiliary register arithmetic units (ARAUs) perform the unsigned arithmetic.

Indirect Addressing
[Figures: indirect addressing with displacement; indirect addressing with index register IR0; special cases.]

Immediate Addressing
The operand is a 16-bit (short) or 24-bit (long) immediate value contained in the instruction word. Depending on the data types assumed for the instruction, the immediate operand can be a 2s-complement integer, an unsigned integer, or a floating-point number.

    SUB 1, R0

    Before instruction:  R0 = 00 0000 0000
    After instruction:   R0 = 00 FFFF FFFF

PC-relative Addressing
Adds the contents of the 16 or 24 LSBs of the instruction word to the PC register. The assembler takes the src (a label or address) specified by the user and generates a displacement equal to [src − (instruction address + 1)].

    ; pc = 1001h, Label = 1005h --> displacement = 3
    BU Label

    Before instruction (decode phase):    PC = 1002
    After instruction (execution phase):  PC = 1005

Circular Addressing
Many DSP algorithms, such as convolution and correlation, require a circular buffer in memory. In convolution and correlation, the circular buffer acts as a sliding window that contains the most recent data to process; as new data is brought in, it overwrites the oldest data.
[Figures: logical vs. physical representation of a circular buffer, before and after wraparound.]

Implementation
BK holds the length of the circular buffer (16 bits, < 64K). The K LSBs of the start address of the buffer must be 0, where K is such that 2^K > buffer length.

    Length of buffer   BK register value   Starting address of buffer
    31                 31                  XXXXXXXXXXXXXXXXXXX00000
    32                 32                  XXXXXXXXXXXXXXXXXX000000
    1024               1024                XXXXXXXXXXXXX00000000000

Algorithm for Circular Addressing

    if (0 <= index + step && index + step < BK)
        index = index + step;
    else if (index + step >= BK)
        index = index + step - BK;
    else
        index = index + step + BK;

Circular Addressing - Example

    *ARn++(disp)%   ; addr = ARn
                    ; ARn = circ(ARn + disp)

    ; AR0 is 0, BK is 6
    *AR0++(5)%      ; now AR0 is circ(0+5) = 5
    *AR0++(2)%      ; now AR0 is circ(5+2) = 1
    *AR0--(3)%      ; now AR0 is circ(1-3) = 4
    *AR0++(6)%      ; now AR0 is circ(4+6) = 4
    *AR0--%         ; now AR0 is circ(4-1) = 3
    *AR0%

ISA Overview
The instruction set contains 113 instructions:
- Load and store
- 2-operand arithmetic/logical
- 3-operand arithmetic/logical
- Program control
- Interlocked operations
- Parallel operations

Load & Store
The 'C3x supports 13 load and store instructions, which load a word from memory into a register, store a word from a register into memory, and manipulate data on the system stack.

2-Operand Instructions
The 'C3x supports 35 2-operand arithmetic and logical instructions. The two operands are the source and destination. The source operand can be a memory word, a register, or part of the instruction word; the destination operand is always a register.

3-Operand Instructions
3-operand instructions have two source operands and a destination operand. A source operand can be a memory word or a register; the destination is always a register.

Program-Control Instructions
The program-control instruction group consists of all of those instructions that affect program flow.

Low-Power Control Instructions
The low-power control instruction group consists of 3 instructions that affect the low-power modes.

Interlocked-Operations Instructions
The five interlocked-operations instructions support multiprocessor communication and the
use of external signals to allow for powerful synchronization mechanisms.
They also ensure the integrity of the communication and result in high-speed operation.

Parallel Operations
The 13 parallel-operations instructions make a high degree of parallelism possible.
Some of the 'C3x instructions can occur in pairs that are executed in parallel:
Parallel loading of registers
Parallel arithmetic operations
Arithmetic/logical instructions used in parallel with a store instruction

Parallel Operations
[Figure: parallel arithmetic with store instructions, and many others]

Parallel Operations
[Figure: parallel load instructions; parallel multiply and add/subtract instructions]

Examples
FIR filter
Matrix-vector multiplication

Data Structure for FIR Filters
Circular addressing is especially useful for the implementation of FIR filters.

    Impulse response          Input samples
    AR0 -> h(N-1)             AR1 -> x(0)
           h(N-2)                    x(1)
           h(N-3)                    x(2)
           ...                       ...
           h(2)                      x(N-3)
           h(1)                      x(N-2)
           h(0)                      x(N-1)

FIR Filter Code

    * Impulse Response
            .sect   "Impulse_Resp"
    H       .float  1.0
            .float  0.99
            .float  0.95
            ...
            .float  0.1
    * Input Buffer
    X       .usect  "Input_Buf",128
            .data
    HADDR   .word   H
    XADDR   .word   X
    N       .word   128
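Before the assembly kernel, the circular update rule given earlier (the index always wraps back into [0, BK)) is what keeps the sample buffer above behaving as a sliding window. A minimal C sketch of that rule, checked against the *AR0 example with BK = 6; the function name is mine, and like the hardware it assumes |step| does not exceed BK:

```c
/* Circular index update, as the ARAU performs it for *ARn++(step)%
   addressing: the new index always lands back in [0, BK). */
int circ(int index, int step, int bk)
{
    int next = index + step;
    if (next >= bk)          /* ran past the end: wrap to the start */
        return next - bk;
    if (next < 0)            /* ran before the start: wrap to the end */
        return next + bk;
    return next;             /* still inside the buffer */
}
```

Feeding it the steps from the example slide (+5, +2, -3, +6, -1 with BK = 6) reproduces the sequence 5, 1, 4, 4, 3.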
FIR Filter Code (cont'd)

    * Initialization
            LDP     HADDR
            LDI     @N,BK            ; Load block size
            LDI     @HADDR,AR0       ; Load pointer to impulse response
            LDI     @XADDR,AR1       ; Load pointer to input samples
    TOP     LDF     IN,R3            ; Read input sample
            STF     R3,*AR1++%       ; Store the sample
            LDF     0,R0             ; Initialize R0
            LDF     0,R2             ; Initialize R2
    * Filter
            RPTS    N-1              ; Repeat next instruction
            MPYF3   *AR0++%,*AR1++%,R0
    ||      ADDF3   R0,R2,R2         ; MAC
            ADDF    R0,R2            ; Last product accumulated
            STF     R2,Y             ; Save result
            B       TOP              ; Repeat

Matrix-Vector Multiplication
[P](K×1) = [M](K×N) × [V](N×1)

    for (i = 0; i < K; i++) {
        p[i] = 0;
        for (j = 0; j < N; j++)
            p[i] = p[i] + m[i][j] * v[j];
    }

Matrix-Vector Multiplication
[Figure: data memory organization]

Matrix-Vector Multiplication

    * AR0 : address of M(0,0)
    * AR1 : address of V(0)
    * AR2 : address of P(0)
    * AR3 : number of rows - 1 (K-1)
    * R1  : number of columns - 2 (N-2)
    MAT     LDI     R1,IR0                  ; Number of columns-2 -> IR0
            ADDI    2,IR0                   ; Number of columns -> IR0
    ROWS    LDF     0.0,R2                  ; Initialize R2
            MPYF3   *AR0++(1),*AR1++(1),R0  ; m(i,0) * v(0) -> R0
            RPTS    R1                      ; Multiply a row by a column
            MPYF3   *AR0++(1),*AR1++(1),R0  ; m(i,j) * v(j) -> R0
    ||      ADDF3   R0,R2,R2                ; m(i,j-1) * v(j-1) + R2 -> R2
            SUBI    1,AR3
            BNZD    ROWS                    ; Count the rows left (delayed branch)
            ADDF    R0,R2                   ; Last accumulate (delay slot)
            STF     R2,*AR2++(1)            ; Result -> p(i) (delay slot)
            NOP     *--AR1(IR0)             ; Set AR1 back to v(0) (delay slot)

C Programming Tips
After writing your application in C, debug the program and determine whether it runs efficiently.
If the program does not run efficiently:
Use the optimizer with the -o2 or -o3 options when compiling
Use registers to pass parameters (-ms compiling option)
Use inlining (-x compiling option)
Remove the -g option when compiling
Follow some of the efficient code generation tips:
Use register variables for often-used variables
Precompute subexpressions
Use *++ to step through arrays
Use structure assignments to copy blocks of data

Use Register Variables
Exchange one object in memory with another:

    register float *src, *dest, temp;
    do {
        temp = *++src;
        *src = *++dest;
        *dest = temp;
    } while (--n);

Precompute Subexpressions and Use *++

    /* 19 cycles */
    main() {
        float a[10], b[10];
        int i;
        for (i = 0; i < 10; ++i)
            a[i] = (a[i] * 20) + b[i];
    }

    /* 12 cycles */
    main() {
        float a[10], b[10];
        int i;
        register float *p = a, *q = b;
        for (i = 0; i < 10; ++i)
            *p++ = (*p * 20) + *q++;
    }

Structure Assignments
The compiler generates very efficient code for structure assignments.
Nest objects within structures and use simple assignments to copy them.

    /* field-by-field copy */
    int x1, y1, c1;
    int x2, y2, c2;
    x1 = x2;
    y1 = y2;
    c1 = c2;

    /* single structure assignment */
    struct Pixel { int x, y, c; };
    struct Pixel p1, p2;
    p1 = 
p2;

Hints for Assembly Coding
Use delayed branches:
Delayed branches execute in a single cycle; regular branches execute in four cycles.
The next three instructions are executed whether the branch is taken or not.
If fewer than three instructions are required, use the delayed branch and append NOPs; a reduction in machine cycles still occurs.

Hints for Assembly Coding
Apply the repeat single/block construct; in this way, loops are achieved with no overhead.
Note that with the RPTS instruction the repeated instruction is not refetched for execution; this frees the buses for operand fetches.

Hints for Assembly Coding
Use parallel instructions
Maximize the use of registers
Use the cache
Use internal memory instead of external memory
Avoid pipeline conflicts
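As a cross-check for the matrix-vector kernel shown earlier, the same computation can be written as plain, runnable C. The flat row-major layout of m mirrors the row-by-row walk AR0 performs in the assembly version; the function name and signature are illustrative choices, not part of the 'C3x toolchain:

```c
/* p = M * v, where M is k x n and stored row-major in m[]. */
void matvec(const float *m, const float *v, float *p, int k, int n)
{
    for (int i = 0; i < k; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += m[i * n + j] * v[j];   /* the MAC done by MPYF3 || ADDF3 */
        p[i] = acc;
    }
}
```

For example, with the 2×3 matrix {1,2,3; 4,5,6} and v = {1, 0, 2}, the result is p = {7, 16}.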