Embed
Email

Implementation of JPEG2000

Document Sample
Implementation of JPEG2000
Shared by: HC111202205210
Categories
Tags
Stats
views:
1
posted:
12/2/2011
language:
English
pages:
12
Implementation of JPEG2000

Using SSE Instruction Set









ECE 734

VLSI Array Structures

For Digital Signal Processing









Ami Mehta

Gilles Muller Spring 2004

Table of Contents





Introduction ..................................................................................................... 2

1 - JPEG2000 Algorithm ................................................................................ 3

1.1 - Architecture ......................................................................................... 3

1.2 – 2D-DWT ............................................................................................. 3

2 – Implementation ......................................................................................... 5

2-1 – Non optimized version ....................................................................... 5

2.2 – Optimizations ..................................................................................... 6

3 – Verification ............................................................................................... 8

4 – Profiler ...................................................................................................... 9

5 – Results ....................................................................................................... 9

6 – References ............................................................................................... 11









1

Introduction

JPEG2000 is a new image compression standard being developed by the Joint

Photographic Experts Group (JPEG), part of the International Organization for

Standardization (ISO). It is designed for different types of still images (bi-level, gray-

level, color, multicomponent) allowing different imaging models (client/server, real-time

transmission, image library archival, limited buffer and bandwidth resources, etc), within

a unified system.



JPEG2000 is intended to provide low bit rate operation with rate-distortion and subjective

image quality performance superior to existing standards, without sacrificing

performance at other points in the rate-distortion spectrum.



JPEG2000 addresses areas where current standards fail to produce the best quality of

performance, such as:

 Low bit rate compression performance (rates below 0.25 bpp for highly-detailed

gray-level images)

 Lossless and lossy compression in a single code stream

 Seamless quality and resolution scalability, without having to download the entire

file. The major benefit is the conservation of bandwidth

 Large images: JPEG is restricted to 64k x 64k images (without tiling). JPEG2000

will handle image sizes up to ( 232-1 )

 Single decompression architecture

 Error resilience for transmission in noisy environments, such as wireless and the

Internet

 Computer generated imagery

 Compound documents

 Region of Interest coding

 Improved compression techniques to accommodate richer content and higher

resolutions

 Metadata mechanisms for incorporating additional non-image data as part of the

file

 JPEG2000 will be able to handle up to 256 channels of information, as compared

to JPEG, which is limited to only RGB data. Thus, JPEG2000 will be capable of

describing complete alternate color models, such as CMYK, and full ICC

(International Color Consortium).









2

1 - JPEG2000 Algorithm



1.1 - Architecture









The encoding procedure is as follows:

The source image is decomposed into components.

The image and its components are decomposed into rectangular tiles. The tile-

component is the basic unit of the original or reconstructed image.

The wavelet transform is applied on each tile. The tile is decomposed in different

resolution levels.

These decomposition levels are made up of subbands of coefficients that describe the

frequency characteristics of local areas (rather than across the entire tile-component) of

the tile component.

Markers are added in the bitstream to allow error resilience.

The codestream has a main header at the beginning that describes the original image

and the various decomposition and coding styles that are used to locate, extract, decode

and reconstruct the image with the desired resolution, fidelity, region of interest and other

characteristics.

The optional file format describes the meaning of the image and its components in the

context of the application.





1.2 – 2D-DWT

This process decomposes the original image into two sub-bands, usually denoted as the

coarse scale approximation (lower band) and the detail signal (higher band). This

transform can be easily extended to multiple dimensions by using separable filters, i.e. by

applying separate 1-D transforms along each dimension. In particular, we have

implemented the most common approach, commonly known as the square

decomposition. This scheme alternates between operations on rows and columns, i.e. one

stage of the 1-D DWT is applied first to the rows of the image and then to the columns.







3

This process is applied recursively to the quadrant containing the coarse scale

approximation in both directions. In this way, the data on which computations are

performed is reduced to a quarter in each step.



From a performance point of view, the main bottleneck of this transformation is caused

by the vertical filtering (the processing of image columns) or the horizontal one (the

processing of image rows), depending on whether we assume a row-major or a column-

major layout for the images.



The lifting filter computes the low frequency and high frequency coefficients in 1-D.









The following filter coefficients are used for JPEG2000 lossy filter:









4

2 – Implementation



2-1 – Non optimized version

We chose to implement the lifting transform using Daubichies 9/7 Analysis. This was

done mainly because lifting requires less working memory and fewer arithmetic

computations. The lossy filter was implemented since our focus is to make JPEG2000

faster on general purpose architectures, where some loss of resolution is tolerable and

lossy compression gives a better compression ratio.



This version was implemented only using C code. The lifting filter described above

computes high and low frequency coefficients in the same time, reusing intermediate

results. This is a good point as only one filter is needed to compute both coefficients.



The wavelet transform poses a major hurdle for the memory hierarchy, due to the

discrepancies between the memory access patterns of the two main components of the 2-

D wavelet transform: the vertical and horizontal filtering. Consequently, the improvement

in the memory hierarchy use represents the most important challenge of this algorithm

from a performance perspective.



As a result, we needed to think about memory use even if this version isn’t optimized for

SSE. The goal of this project was to optimize the algorithm implementation for speed. As

a result, we chose the strategy to store all intermediate results independently preventing

unnecessary false dependences, but thus increasing the amount of memory used.



The Mallat strategy uses an auxiliary matrix to store the results of the horizontal filtering.

In this way, as the following figure shows, the horizontal high and low frequency

components are not interleaved in memory. This method prevents memory scattering

and enable to exploit better SIMD parallelism.



The vertical filtering reads these components and writes the results into the original

matrix following the order expected by the quantization step.



For our implementation we did the down sampling before calculations, to ensure

none of our calculations were wasted and also reused calculations across

computations whenever possible.









5

2.2 – Optimizations

The analysis of this C code using VTune Analyzer showed that the major bottleneck of

our code was the memory access, and particularly a low hit rate in the L1 data cache. To

solve this problem, the following optimizations are used:



 Data alignment

Strictly speaking, data alignment is not required in our codes since the SSE instruction

set includes instructions that allow unaligned data to be copied into and out of the vector

registers. However, such operations are much slower than aligned accesses, which may

cause a significant overhead. To avoid this drawback we have employed 16-byte aligned

data in all our codes, although for the scalar versions this optimization has no significant

effect.









6

access access









Cache layout without alignment Cache layout with alignment



 Matrix juxtaposition

Input and output matrices are juxtaposed in the memory to prevent associativity conflicts.

Indeed, after juxtaposition, the 2 matrices can be loaded entirely in the L1 data cache

without having to switch between one matrix to another.



Here is a part of our code showing these optimizations:

/* Allocate memory to the 2 matrices */

/* The 2 matrices are juxtaposed and aligned on the cache row size:128

bits=16 Bytes */

inline void load_matrix(float **matrix_in, int row, int column, float

**matrix_out){

unsigned long int size;



size=2*row*column*sizeof(float);

if( (*matrix_in=(float*)malloc(size+16)) == NULL ){

printf("Not enough memory available.\n");

exit(1);

}



//align the address at the beginning of a cache line

*matrix_in=(float*)((((int)(*matrix_in))+16)-

((((int)(*matrix_in))+16)%16));

*matrix_out=(float*)((*matrix_in)+row*column);

}



Once these were done we transformed the code using SSE2. This was done to make use

of the inherent parallelism in the code. SSE does four calculations at a time using special

registers allocated for SSE. We used these 128 bit floating point registers to store four 64

bit values. Then these computations were done in parallel. We used intrinsics for this

part. The following piece of code is typical of our use of SSE intrinsics.



m1 =_mm_add_ps(*pscr2,*pscr0);

m2 = _mm_mul_ps(m0_a0, m1);

m3 = _mm_add_ps(*pscr1,m2);

// high1[i]=start1[i]+a0*(start2[i] + start0[i]);



m1, m2, m3, pscr2, m0_a0, etc are all 64 bit values, stored in groups of four as mentioned

before. The following figure shows how the 128 bit registers are used for four

calculations to take advantage of parallelism.









7

3 – Verification

We verified our code using MATLAB. We took an image and converted it to a matrix.

We used this matrix as an input to our DWT code. Then the output matrix of our code

was dumped into a file. We took this output matrix and normalized it by subtracting the

smallest number and dividing by the greatest number and displayed it is an image. The

following results were produced.









Original Image Matlab output for DWT









After row transform using our code After DWT, using our code



The results from our code match those of MATLAB which shows our code works.







8

4 – Profiler

As mentioned before, we used VTune Analyzer to profile our results. The VTune™

Performance Analyzer helps you analyze the performance of your application by locating

hotspots. Hotspots are areas in your code that take a long time to execute. Not only that it

also helps you find out what is causing these hotspots and decide what type of

improvements to make. You can also track critical function calls and monitor specific

processor events, such as cache misses, triggered by sections in your code. You can also

calculate event ratios to determine if processor events are causing the hotspots.



The VTune analyzer collects performance data on your application and system, and

displays it in graphs or tables. From this display, you can analyze the performance of

your application and determine which portions of an application are slowest and why.



To optimize the performance of your application or system, you can do one or more of

the following to find the performance bottlenecks:

 Determine how your system resources, such as memory and processor, are being

utilized to identify system-level bottlenecks.

 Measure the execution time for each module and function in your application.

 Determine how the various modules running on your system affect the

performance of each other.

 Identify the most time-consuming function calls and call sequences within your

application.

 Determine how your application is executing at the processor level to identify

microarchitecture-level performance problems.



The VTune™ Performance Analyzer can find this information by automating the process

of data collection with three types of data collectors, namely, sampling, call graph, and

counter monitor.



Thus we decided to use VTune Analyzer. With the help of profiling we were able to

identify that memory access was the biggest bottle neck for our code.



5 – Results

We conducted a set of simulation using the pre-optimized and post optimized versions of

our code. The simulations were done on Pentium Celeron processor 2.4 GHz, with a 8K

uops L1 trace cache and a 128 Kb L2 cache. We did a number of simulations, since

profiling take samples to get a good estimate we ran a set of three simulations on images

of size 256 x 256, 512 x 512 and 1024 x 1024. The results for the biggest image are

shown here.



A set of three simulations were done on a 1024 x 1024 image. The results for the

improvement in cycles per uops are given below.







9

Cycles per retired uop



4







3.5







3







2.5

Clockticks









Before Optimization

2

After Optimization





1.5







1







0.5







0

1 2 3 4 5 6





As can be seen on an average there is an improvement of 44.61% was seen during

optimization. The results for 1st level cache load misses are shown below.



1st Level Cache Load Misses



6000000









5000000









4000000









Before Optimization

3000000

After Optimization









2000000









1000000









0

1 2 3 4









10

As can be seen we dramatically reduced 1st level cache load misses by the various

optimization techniques. The misses reduced on an average by 74%, from about a little

below 5,000,000 to a little over 1,000,000. The second level cache improvement is as

shown below.

2nd Level Cache Load Misses



350000









300000









250000









200000

Before Optimization

After Optimization

150000









100000









50000









0

1 2 3 4





There is an improvement of 53% on an average.



Thus it can be shown that our cache optimizations and using SSE code to exploit

parallelism worked well causing an improvement on performance and cache access.





6 – References

[1] Taubman, David and Marcellin, Michael, JPEG 2000, Image compression

Fundamentals, Standards and practice.



[2] Michael D. Adams, The JPEG-2000 Still Image Compression Standard.



[3] C. Tenllado, D. Chaver, L. Piñuel, M. Prieto and F. Tirado, Vectorization of the 2D

Wavelet Lifting Transform using SIMD extensions.



[4] D. Chaver, C. Tenllado, L. Piñuel, M. Prieto and F. Tirado, 2-D Wavelet Transform

Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD

Parallelism Exploitation.



[5] IA 32 Intel Software Developers Manual, Volume 2: Instruction Reference.







11


Related docs
Other docs by HC111202205210
Voluntary counseling and Testing in Pakistan
Views: 0  |  Downloads: 0
DFR POPs waste
Views: 0  |  Downloads: 0
Declaration of Conflict of Interest
Views: 0  |  Downloads: 0
??????????
Views: 1  |  Downloads: 0
SELECTBOARD MEETING
Views: 52  |  Downloads: 0
Climatology Lecture 9
Views: 2  |  Downloads: 0
IR DOBLES 5 yco
Views: 1  |  Downloads: 0
No Slide Title
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!