                           A GPU based internet image library

                              CIS 665 Final Report

                                    Edward Kim
                             University of Pennsylvania


With the advent of the GPU, many processing tasks can be offloaded from the CPU to
achieve a dramatic increase in efficiency and performance. One area that can
benefit from this parallel processing boost is image and video processing. A GPU
implementation of many image and video processing tasks can decrease the computing
time substantially. With the new unified architecture in particular, there is an even
greater benefit to sharing these tasks with the GPU.

However, for the majority of computer users, there are issues of compatibility and
resource scarcity. It will be years before the majority of users are able to benefit
from the unified architecture. With the competing technologies of ATI and Nvidia,
greater development effort is also needed to maintain compatibility. But while GPU
technology remains divided, the internet is becoming a ubiquitous tool for more and
more complex processes.

This project will merge the benefits of internet accessibility with the efficiency and power
of a GPU using CUDA to deliver an image and video processing library over the
internet. The UI will be either Flash or AJAX based, and the GPU will run transforms on
the uploaded image or video file and make the results accessible to the end user.

The benefits of this system are threefold. First, using CUDA and the GPU enables
users to process video and images much faster and at much higher volumes than
previous pixel/vertex GPU implementations. Second, using the GPU on the server side
will allow regular users without a unified GPU to benefit from this efficiency. Finally,
using a GPU will allow the server to process significantly more requests much faster
than typical CPU web servers.

The project, named “Texela”, is a GPU based internet image library.
1.1. Significance of Problem or Production/Development Need
There are many image tools and libraries in use right now. There are also many
commercial video editing applications available. However, the high powered tools that
are typically used are cost prohibitive for the regular user. With the explosion of video
on the internet, there is a clear need for a fast, accessible video and image editing tool.
Also, video manipulation and filtering are done primarily on the CPU, which is very slow
compared to the performance that can be had with GPU technology.

However, the efficiency benefits of the GPU introduce compatibility issues across the various
generations of video cards. Delivering a hardware accelerated implementation of an image
library would not guarantee usability for all users. Especially with the new CUDA
and unified architecture, this would be impractical in the upcoming years. But again, while
GPU technology remains divided, the internet is becoming a ubiquitous tool for more
and more complex processes.

Although it is clear that using the internet as a medium will incur an upload/download
overhead, this is acceptable for our application. Once the image/video has been
uploaded, it will reside on the server. In previous applications, the processing of
the clip took the most time, since it involves lots of trial and error and several passes.
Thus the one time cost of upload/download will be negligible compared to the time saved by
doing the processing on the GPU.

1.2. Technology
To address this problem we will utilize the following technology.

The image and video library will be written in CUDA on a Linux system (Red Hat
Enterprise 4). The Linux system will have an AMD Athlon X2 Dual Core processor with an
Nvidia 8800 GTS.
The server will run on the Apache webserver and the UI will be created using a
combination of Flash, AJAX, and PHP.

The following resources will be used.

Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D
Images, Yuri Y. Boykov, Marie-Pierre Jolly, 2001

GrowCut - Interactive Multi-Label N-D Image Segmentation By Cellular Automata,
Vladimir Vezhnevets, Vadim Konouchine, 2005

Video Enhancement Using Per-Pixel Virtual Exposures, Eric P. Bennett, Leonard
McMillan, 2005

Handbook of Image and Video Processing, Al Bovik, 2000
Image and Video Processing -

CUDA FFT library -
CUDA BLAS library -

1.3. Design Goals
      1.3.1 Target Audience.
      Our target audience for this application is anyone with access to a broadband
      internet connection with a need for image or video manipulation.

      1.3.2 User goals and objectives
      The goals of the project are laid out in 3 tiers varying in complexity. The first tier
      of the project will involve the basic linear and nonlinear filtering algorithms on
      images and videos.

      1st Tier -
              Conversion to grayscale
              Image Histogram Transformations
                     Histogram Stretch
                     Bilinear Interpolation

      The second stage of development will use DFT and FFT transformations on

      2nd Tier –
             Edge Detection
             Sequence Stabilization
             Image Segmentation using Graph Cuts

      The third stage will involve setting up the web server and interfacing it with the

      3rd Tier –
            Apache Server
            Flash/AJAX UI

     Finally the fourth stage will be optional enhancements.

     4th Tier (optional) –
             Video Enhancement Using Per-Pixel Virtual Exposures

     1.3.3 Tool features and functionality
     The features of this tool will be a user interface on a web client that can control
     the video and image processing library listed above in real time.


2.1. Algorithm Details
     Many algorithms will be utilized in this image library. Some example algorithms
     that will be used are the following.
     Full-Scale Histogram Stretch
          $g(n) = \mathrm{FSHS}(f)(n) = \frac{K - 1}{B - A}\,[f(n) - A]$

     FFT Magnitude shift
          $(-1)^{x+y} f(x, y) \;\overset{\mathrm{DFT}}{\longleftrightarrow}\; F\left(u - \frac{M}{2},\, v - \frac{N}{2}\right)$

     Butterworth Filters
          $H(u, v) = \frac{1}{1 + (R / R_0)^2}$

     Interactive Graph Cuts
          $E = N \cup \{\{p, S\}, \{p, T\}\}$

     Cellular Automaton (more details in the implementation)
          $A = (S, N, \delta)$
          $S_p = (l_p, \theta_p, C_p)$

     More algorithms are located in the Handbook of Image and Video Processing (Bovik, 2000).
2.2. Target Platforms
     2.2.1 Hardware
          The target server machine will run on an AMD Athlon X2 Dual Core
          processor with the Nvidia 8800 GTS.
          For the clients, any machine running a flash enabled browser should be
          able to take advantage of our system.

     2.2.2 Software
          The software we will use to code the GPU application will be CUDA on a
          linux system. The Apache server will be our webserver.
          For the client, any machine capable of running a web browser will suffice.

2.3. Project Versions
     2.3.1. Project Milestone Report (Alpha Version)
          The alpha version will process images on the GPU using CUDA. The first
          stage of development at this time should be complete. The basic
          framework for the web UI will also have some work done. See Tier 1 for a
          complete description of image filters.

     2.3.2. Project Beta Version
          The Beta Version will have Tier 2 and Tier 3 completed. Basically, all of
          the Tiers should be completed and some work could begin on the final
          optional tier 4. Bug fixes will be done at this time.

     2.3.3. Project Final Deliverables
          The final code will consist of an image and video library that can be
          interfaced to by an apache webserver.


     3.3.1. Project Milestone Report (Alpha Version)
          Task 1 - Setup and Installation of CUDA and 8800GTS
          Task 2 - Conversion to grayscale
          Task 3 - Image Histogram Transformations
          Task 4 - Rotation
          Task 5 - Zoom

     3.3.2. Project Beta Version
          Task 6 - Sharpening/Blurring
          Task 7 - Edge Detection
          Task 8 - Sequence Stabilization
          Task 9 - Image Segmentation
          Task 10 - UI implementation

     3.3.3. Project Final Deliverables
          Task 11 - Debugging
          Task 12 (optional) - Video Enhancement Using Per-Pixel Exposures


The implementation of the project proceeded as planned to the alpha version. All of the
functions were programmed in Linux on a Fedora Core 5 distribution. The Nvidia 8800 GTS
(320MB) was successfully installed along with the CUDA SDK. The basic framework
for the project was taken from the simpleTexture example that came along with the SDK.

Most of the image filters listed, such as brightness, contrast, inversion, and full-scale
histogram stretch, were completed by the alpha due date.
These computations lend themselves to parallel processing and ran in the
CUDA kernel with average times of < 1 ms for 512x512 images. RGB ppm support was
added to the cutil.cpp file and recompiled into the shared library.
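
The full-scale histogram stretch mentioned above maps the darkest gray level A to 0 and the brightest gray level B to K - 1. The following is a minimal CPU reference sketch in Python, not the CUDA kernel used in the project. The function name and the 256-level default are assumptions made for illustration.

```python
def full_scale_histogram_stretch(pixels, levels=256):
    """Map the darkest pixel to 0 and the brightest to levels - 1."""
    a, b = min(pixels), max(pixels)
    if a == b:
        # A flat image has no range to stretch; return it unchanged.
        return list(pixels)
    # g(n) = (K - 1) / (B - A) * (f(n) - A), multiplying before dividing
    # so that exact results stay exact.
    return [round((levels - 1) * (p - a) / (b - a)) for p in pixels]


stretched = full_scale_histogram_stretch([10, 20, 30, 60])
print(stretched)  # [0, 51, 102, 255]
```

Each output pixel depends only on its own input value plus the two global extremes, which is why the stretch parallelizes well once A and B are known.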

Implementation of the basic image manipulations was done using an 8x8x1
block and a grid size equal to the number of pixels / block size. The 8x8 thread
count was the minimum recommended number of threads per block as stated by the
NVIDIA CUDA SDK. The images were first read on the host computer, copied to a
texture and then read by the GPU. Subsequent manipulations in CUDA usually used a
128x1x1 block size and a 1D representation of the pixel data. The brightness and
inversion algorithms are as follows:

Brightness : pixel value + float [0-1]
Inversion: 1.0 – pixel value
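
To make the launch arithmetic concrete, the sketch below emulates a 1D launch with 128-thread blocks on the CPU in Python and applies the two per-pixel rules above. It illustrates the indexing only and is not the project's kernel. The helper names and the clamping of results to [0, 1] are assumptions.

```python
BLOCK_SIZE = 128  # threads per 1D block, as in the text


def grid_size(num_pixels, block_size=BLOCK_SIZE):
    """Number of blocks needed to cover every pixel (ceiling division)."""
    return (num_pixels + block_size - 1) // block_size


def clamp01(value):
    # Assumption: out-of-range results are clamped to [0, 1].
    return max(0.0, min(1.0, value))


def run_kernel(pixels, op):
    """Emulate a 1D launch: one thread per pixel, with a bounds guard."""
    out = list(pixels)
    for block in range(grid_size(len(pixels))):
        for thread in range(BLOCK_SIZE):
            idx = block * BLOCK_SIZE + thread
            if idx < len(pixels):  # threads past the end do nothing
                out[idx] = clamp01(op(pixels[idx]))
    return out


image = [0.0, 0.25, 0.5, 0.95]
brighter = run_kernel(image, lambda p: p + 0.1)   # brightness + 0.1
inverted = run_kernel(image, lambda p: 1.0 - p)   # inversion
print(grid_size(512 * 512))  # 2048 blocks for a 512x512 image
```

The ceiling division matters when the pixel count is not a multiple of the block size: the last block is only partly used, and the bounds guard keeps its spare threads from writing past the end of the image.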

These processes were very simple and thus could not take advantage of fast shared
memory. However, a proposed histogram algorithm called the “tag and test” method, which was
posted on the CUDA discussion board and was also part of the UIUC (University of Illinois
at Urbana-Champaign) lecture, did take advantage of shared memory. This algorithm was implemented
for use in our image library, but had disappointing results. The main problem with this
algorithm is that the naturally sequential nature of bin updates could not be done in
parallel:
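          do {
               // Keep the low 27 bits (the count) and drop the previous writer's tag.
               u32 myVal = histogram[myHist][bin] & 0x07FFFFFF;
               // Tag the incremented count with this thread's lane id (upper 5 bits).
               myVal = ((tid & 0x1F) << 27) | (myVal + 1);
               histogram[myHist][bin] = myVal;
               // Retry if another thread's write landed instead of ours.
          } while (histogram[myHist][bin] != myVal);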

As seen in the code above, tagging the upper bits of the value should prevent the
histogram bins from being updated by more than one thread at a time. But in actual
implementations on the GPU, only 1 or 2 writes to the bins actually get through. As
stated in the SDK, writing to the same place in memory is a bad idea. Thus, a slower
implementation of histograms was used in which each block has its own
histogram. After being updated by only 1 thread per block, the results are then reduced
in parallel by N/2 passes.
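
The per-block approach can be sketched on the CPU as follows. This is an illustrative Python sketch with assumed function names. The pairwise reduction shown is a common tree scheme and may not match the exact pass structure used in the project.

```python
def block_histograms(pixels, num_blocks, bins=256):
    """Each block counts only its own slice, so no two blocks share a bin."""
    per_block = (len(pixels) + num_blocks - 1) // num_blocks
    hists = []
    for b in range(num_blocks):
        hist = [0] * bins
        for value in pixels[b * per_block:(b + 1) * per_block]:
            hist[value] += 1
        hists.append(hist)
    return hists


def reduce_histograms(hists):
    """Pairwise reduction: each pass adds the upper half onto the lower half."""
    hists = [list(h) for h in hists]
    count = len(hists)
    while count > 1:
        half = (count + 1) // 2
        # On a GPU, every iteration of this loop could run in parallel.
        for i in range(count - half):
            hists[i] = [x + y for x, y in zip(hists[i], hists[i + half])]
        count = half
    return hists[0]


pixels = [0, 1, 1, 3, 3, 3, 2, 0]
total = reduce_histograms(block_histograms(pixels, num_blocks=3, bins=4))
print(total)  # [2, 2, 1, 3]
```

Because each partial histogram has a single writer, no two threads ever update the same bin, which avoids the lost updates seen with the tag and test loop. The cost is extra memory and the extra reduction step.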

The cufft library was the next obstacle in the project. The cufft library is modeled on
the fftw libraries used on the CPU. Although buggy at high dimensionalities, the cufft
library was integrated successfully. When first applied to the data set, the resulting
magnitude and phase information is held in a cufftComplex array. The raw
image representation of the fft doesn't tell us much, so it was manipulated into a much
more informative 2D image. The shifted magnitude was accomplished by the equation,

     $(-1)^{x+y} f(x, y) \;\overset{\mathrm{DFT}}{\longleftrightarrow}\; F\left(u - \frac{M}{2},\, v - \frac{N}{2}\right)$

After shifting the magnitude, a log representation of the scaled version can be applied.
This produces the 2D magnitude log representation of the image. At this point the
cublas library was utilized for the max value calculation of an array.
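
The shift property and the log scaling can be checked numerically on a tiny image. The Python sketch below uses a naive DFT in place of cufft and a plain `max` in place of the cublas lookup. The log(1 + |F|) form and the normalization to 1.0 are assumptions about the scaling, and the function names are illustrative.

```python
import cmath
import math


def dft2(image):
    """Naive 2D DFT of a small M x N image (rows indexed by x, columns by y)."""
    m, n = len(image), len(image[0])
    return [[sum(image[x][y] * cmath.exp(-2j * cmath.pi * (u * x / m + v * y / n))
                 for x in range(m) for y in range(n))
             for v in range(n)]
            for u in range(m)]


def log_magnitude(spectrum):
    """Scale log(1 + |F|) so that the largest value maps to 1.0."""
    logs = [[math.log1p(abs(c)) for c in row] for row in spectrum]
    peak = max(max(row) for row in logs) or 1.0  # avoid dividing by zero
    return [[value / peak for value in row] for row in logs]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 17]]
m, n = len(image), len(image[0])

# Pre-multiplying by (-1)^(x+y) moves the zero-frequency term to the centre.
modulated = [[image[x][y] * (-1) ** (x + y) for y in range(n)] for x in range(m)]
shifted = dft2(modulated)
plain = dft2(image)

display = log_magnitude(shifted)
```

Here `shifted[u][v]` equals `plain[(u - M/2) mod M][(v - N/2) mod N]`, so the zero-frequency term lands at the centre of the array, where it is the brightest point of the log display.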

From the properties of the DFT, convolution is equivalent to multiplication in the
frequency domain, so I was able to implement a Butterworth filter to do both low-pass and
high-pass filtering on the images. The equation for low-pass filtering is,

     $H(u, v) = \frac{1}{1 + (R / R_0)^2}$

where R is the Euclidean distance map, $R = \sqrt{u^2 + v^2}$, with u and v running from -0.5 to 0.5,
and $R_0$ is the given frequency cutoff.
In order to transform this into a high-pass filter, the following can be applied.

     $1 - H(u, v)$
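
A minimal sketch of the two transfer functions, assuming a square, centred frequency grid, is shown below in Python. The function names and grid construction are illustrative, and the cutoff must be greater than zero.

```python
import math


def butterworth_lowpass(size, cutoff):
    """H(u, v) = 1 / (1 + (R / R0)^2) on a size x size grid centred on zero."""
    coords = [i / size - 0.5 for i in range(size)]  # spans [-0.5, 0.5)
    return [[1.0 / (1.0 + (math.hypot(u, v) / cutoff) ** 2) for v in coords]
            for u in coords]


def butterworth_highpass(size, cutoff):
    """The high-pass response is the complement of the low-pass one."""
    return [[1.0 - h for h in row] for row in butterworth_lowpass(size, cutoff)]


low = butterworth_lowpass(8, 0.25)
high = butterworth_highpass(8, 0.25)
print(low[4][4], low[6][4])  # 1.0 at zero frequency, 0.5 at the cutoff radius
```

The filter is applied by multiplying the centred spectrum element by element with H and then taking the inverse transform. At R = R0 the response is exactly one half, which is what makes R0 the cutoff.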

Graph cuts were the next step in the implementation plan, and things were going quite
well. I separated the image into its graph components where,
     $V = P \cup \{S, T\}$

P is the set of pixels, and S and T are the source and sink nodes. The edges consisted of,
     $E = N \cup \{e_S, e_T\}$
where N is the set of neighborhood edges between pixels and $e_S$ and $e_T$ are the source and sink edges.
The regional weights were calculated by a negative log-likelihood calculation of the
histogram-biased object and background pixels. The neighborhood weights were
calculated with a pixel intensity function. However, when it came to the min-cut/max-flow
implementation on the GPU, this method fell apart.
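
The two kinds of edge weights can be sketched as follows in Python. The regional term follows the negative log-likelihood description above. The Gaussian form of the neighborhood term, the `sigma` value, and the `eps` floor are assumptions in the style of the Boykov and Jolly paper, not the exact functions used in the project.

```python
import math


def seed_histogram(seed_values, bins=256):
    """Normalised intensity histogram of the user-marked seed pixels."""
    hist = [0] * bins
    for value in seed_values:
        hist[value] += 1
    return [count / len(seed_values) for count in hist]


def regional_weight(intensity, hist, eps=1e-6):
    """Negative log-likelihood of an intensity under a seed histogram."""
    # eps is an assumed floor that keeps log() finite for unseen intensities.
    return -math.log(max(hist[intensity], eps))


def neighbour_weight(ip, iq, sigma=10.0):
    """Assumed boundary term: large for similar pixels, small across edges."""
    return math.exp(-((ip - iq) ** 2) / (2.0 * sigma ** 2))


object_hist = seed_histogram([200, 200, 210, 220])
print(regional_weight(200, object_hist))   # low cost: common among object seeds
print(regional_weight(0, object_hist))     # high cost: never seen among seeds
print(neighbour_weight(100, 100), neighbour_weight(100, 200))
```

Both weights are computed per pixel or per pixel pair and parallelize easily. The difficulty described above lies in the max-flow step that consumes them.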

The basic max-flow algorithm proposed in the graph cuts paper did not lend itself well to parallelism.
Another way to calculate min-cut/max-flow is a push-relabel
algorithm, but it did not have impressive results on the GPU. However, during this
research phase, another method, GrowCut, came into the picture.

Growcut is a segmentation method based upon cellular automata where each cell or
pixel is attacked by its neighboring pixels. Each cell contains a tuple,

                                        $A = (S, N, \delta)$

Where S is the state of the system, in this case it is another tuple.

                                       $S_p = (l_p, \theta_p, C_p)$

$l_p$ = label (object or background)
$\theta_p$ = strength
$C_p$ = pixel intensity

N is the neighborhood system. In the case of this implementation, I chose a von
Neumann neighborhood system (four pixels). The delta is the evolution rule. For this
case, the rule was,

     for $\forall q \in N$
          if $g(|C_p - C_q|) \cdot \theta_q > \theta_p$
               $l_p^{t+1} = l_q$
               $\theta_p^{t+1} = g(|C_p - C_q|) \cdot \theta_q$

     $g(x) = 1 - \frac{x}{\max |C|}$

The grow cut algorithm usually converges in 60-100 passes on the GPU.
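
One pass of the evolution rule can be sketched on the CPU as follows. This is a Python illustration with assumed names, not the project's kernel. It compares each attack against the running strength so that the strongest attacker wins, a small tightening of the rule as written, and it assumes 8-bit intensities so that max|C| = 255.

```python
def growcut_step(labels, strengths, intensity, max_intensity=255.0):
    """One synchronous GrowCut pass over a 2D grid (von Neumann neighbours)."""
    rows, cols = len(labels), len(labels[0])
    new_labels = [row[:] for row in labels]
    new_strengths = [row[:] for row in strengths]
    changed = False
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                qr, qc = r + dr, c + dc
                if not (0 <= qr < rows and 0 <= qc < cols):
                    continue
                # g(x) = 1 - x / max|C| weakens attacks across strong edges.
                g = 1.0 - abs(intensity[r][c] - intensity[qr][qc]) / max_intensity
                attack = g * strengths[qr][qc]
                if attack > new_strengths[r][c]:
                    new_labels[r][c] = labels[qr][qc]
                    new_strengths[r][c] = attack
                    changed = True
    return new_labels, new_strengths, changed


# 1 = object seed, 2 = background seed, 0 = unlabelled
labels = [[1, 0, 2]]
strengths = [[1.0, 0.0, 1.0]]
intensity = [[100, 110, 250]]
passes = 0
changed = True
while changed and passes < 100:
    labels, strengths, changed = growcut_step(labels, strengths, intensity)
    passes += 1
print(labels, passes)  # [[1, 1, 2]] after 2 passes
```

Every cell reads only the previous generation's labels and strengths, so all cells can be updated independently within a pass. The middle pixel joins the object because it is much closer in intensity to the object seed than to the background seed.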

The UI was created in HTML/PHP/AJAX running on the Apache server. The PHP script calls
the CUDA executable with the parameters for which filter to run. Some additional work
was done on the Linux system to support movie handling and image conversion.
For these tasks, I am using the ffmpeg library and the ImageMagick convert utility. All of
these are integrated via a PHP script that the user interacts with through a web client.
The current site, www.texela.net, has some sample PHP and AJAX scripts that interact in
non real time with images. Unfortunately, the live version, located at
http://live.texela.net:8080/texela, is not publicly available; however, screen shots and
movie clips will be available on the site wiki.
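
The hand-off from the web script to the executable can be sketched as a small argument builder. The project used PHP for this step, so the Python below only illustrates the idea. The flag names and value ranges are taken from the sample commands in this report, while the function name, the `./simpleTexture` path, and the validation rules are hypothetical.

```python
import re

# Flag names and ranges are taken from the sample commands in this report.
RANGES = {"b": (-1.0, 1.0), "fftlp": (0.0, 0.6), "ffthp": (0.0, 0.6), "growcut": (0, 100)}
SWITCHES = {"invert", "fftmagnitude"}


def build_command(filename, option, value=None, executable="./simpleTexture"):
    """Return an argument list for the filter executable, or raise ValueError."""
    # Hypothetical rule: accept only plain file names from the upload form.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", filename):
        raise ValueError("unexpected characters in file name")
    if option in SWITCHES:
        return [executable, f"-f={filename}", f"-{option}"]
    if option not in RANGES:
        raise ValueError(f"unknown filter: {option}")
    low, high = RANGES[option]
    if value is None or not low <= value <= high:
        raise ValueError(f"{option} must be between {low} and {high}")
    return [executable, f"-f={filename}", f"-{option}={value}"]


print(build_command("lena_bw.pgm", "b", 0.3))
print(build_command("lena_bw.pgm", "invert"))
```

Passing the arguments as a list instead of a single shell string keeps user input from being interpreted by the shell, which matters when the parameters come from a public web form.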

Finally, the code was ported to run on Windows as well as Linux.
Modifications to cutil.cpp and cutil.h were needed in order to support standard std calls
in the code. The resulting application is a command line utility that works on both Linux
and Windows.

In order to run the code on a Windows machine, you must have the application, the cutil
dll and the input images in the same directory. Then the command to run it is,

simpleTexture.exe -f=lena_bw.pgm -b=0.1 (brightness + 0.1)

Further tests were also done on the load created by multiple processes accessing the GPU.
When multiple processes access the GPU, the processing time does increase, but not
linearly. However, once more than 8 simultaneous processes are running, the system becomes
unstable. To handle this we would need to develop some sort of queuing system later on.
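
One simple form such a queue could take is a counting semaphore that admits at most 8 jobs at a time. This is only a sketch of the idea in Python and not something that was built for the project. The names are hypothetical.

```python
import threading

MAX_GPU_JOBS = 8  # instability was observed beyond 8 simultaneous processes
gpu_slots = threading.BoundedSemaphore(MAX_GPU_JOBS)


def run_with_gpu_slot(job):
    """Block until a slot is free, then run the job while holding the slot."""
    with gpu_slots:
        return job()


print(run_with_gpu_slot(lambda: "filter finished"))
```

Requests beyond the limit simply wait their turn instead of being launched on the GPU, trading a little latency for stability.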
Here are some final result images and statistics.
Brightness and Inversion (0.62 ms) vs (1.45 ms on the CPU)
Command to run:
simpleTexture.exe -f=lena_bw.pgm -b=0.3 (float from -1 to 1)
simpleTexture.exe -f=lena_bw.pgm -invert

Histogram and its comparison to Photoshop's.
Time: 112.6 ms on the GPU
Command to run:
simpleTexture.exe -f=lena_bw.pgm -h=0.0

Shifted magnitude log
Time: 7.34 ms
Command to run:
simpleTexture -f=lena_bw.pgm -fftmagnitude

FFT Blur and Edge detection
Time: 8.054 ms
Command to run:
simpleTexture.exe -f=lena_bw.pgm -fftlp=0.2 [0-0.6]
simpleTexture.exe -f=lena_bw.pgm -ffthp=0.2 [0-0.6]

Grow cut example, seed, 10 passes, 60 passes
Time: 1255.38 ms for 60 passes
Command to run:
simpleTexture.exe -f=image.pgm -growcut=60 (0-100)
This command uses ftest.txt and btest.txt, which hold the x y coordinates of the object and background seeds.
These files can be generated by the online GUI. Videos of this process are available on
the wiki.

Sample UI
Sample screens at www.texela.net
Live version at live.texela.net:8080/texela
Sample Grow Cut selection process.

In conclusion, I have created an online image and video processing library that utilizes
the power of GPU technology and could benefit the average internet user. The
processing tools were built in CUDA and run on the server side 8800 graphics card.
The functions available include brightness, contrast stretching, inversion, FFT blurring,
edge detection and image segmentation via cellular automata. The user interface was
built using PHP, HTML, and AJAX.


