REAL-TIME EMBEDDED ALGORITHMS FOR LOCAL TONE MAPPING OF HIGH DYNAMIC RANGE IMAGES

A Dissertation Presented to The Graduate Faculty of the University of Akron, In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Firas Hassan
December, 2007

Dissertation Approved:
Advisor: Dr. Joan Carletta
Committee Members: Dr. Shiva Sastry, Dr. John Welch, Dr. Daniel B. Sheffer, Dr. Jose A. De Abreu-Garcia, Dr. Dale Mugler
Department Chair: Dr. Jose A. De Abreu-Garcia
Dean of the College: Dr. George K. Haritos
Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

Digital imaging is a widespread and powerful technology, and yet the quality of digital images suffers from a fundamental mismatch in the physical devices used in imaging systems: new-technology camera sensors can capture scenes of much wider dynamic range than standard display devices can display. The result of this mismatch is that printed and displayed images commonly lack clear detail. Cameras are able to capture the detail in, for example, both shaded and sunlit areas of a natural scene, but when the image is displayed either the sunlit area is saturated to become uniformly bright, or the shaded area is saturated to become uniformly dark. The problem is even more pronounced for synthetic images produced by fusing multiple views of a scene. Image processing is required to compress the dynamic range of the image before it is displayed, and to control its brightness and contrast.
Ideally, this processing would be done in real time, and embedded within the camera or display. Published algorithms for processing images of wide dynamic range have been implemented mostly on workstations and applied to stored images, without hardware and timing constraints. The algorithms have been developed without taking into account the difficulties inherent in hardware implementation, and are not suitable for implementation in embedded systems that work in real time. This limits their application. This work proposes to develop algorithms for processing images of high dynamic range that are suitable for implementation on embedded systems and for real-time applications.

One difficulty in developing real-time algorithms for image enhancement is the interdisciplinary nature of the work. Both image processing and digital hardware implementation aspects must be carefully considered, and these two aspects are intertwined so that they must be considered simultaneously. It is not sufficient to simply translate algorithms that perform well in floating-point mathematics on workstations into hardware; the hardware in a real-time application so constrains what can be done that the algorithm and the hardware must be developed hand-in-hand. The quality of the system as a whole depends heavily on both signal processing and hardware performance aspects.

The intellectual merit of the work lies in the following key concepts. Algorithms for local tone mapping of high dynamic range images that are uniquely appropriate for implementation on embedded systems and can do the processing in real time are developed. Both hardware performance and signal processing aspects of the design are simultaneously considered so that system-level trade-offs can be made.
The work also addresses an additional and crucial aspect of computing implementations: the effect of fixed-point quantization and computational structure on the quality with which a digital signal processing algorithm is implemented.

The final outcome of our research is a local tone mapping operator, implemented on an FPGA, that can process a 3 × 32-bit high dynamic range color image of 1024 × 768 pixels. It has a maximum operating frequency of 65.86 MHz, which is compatible with a video rate of 60 frames per second. The quality of the images processed by this operator was compared to that of a floating-point, state-of-the-art local tone mapper considered to be a gold standard. Images processed by this operator had a peak signal-to-noise ratio of about 35 dB on average with respect to the gold standard. The operator requires relatively simple hardware, and is well suited for embedded real-time applications.

DEDICATION

To my special parents,
The parents that inspired me,
The parents that really believed in me,
The parents that I really admire,
My Mother & Father,
May Abou Najem, Hussein Hassan
May God bless you

ACKNOWLEDGEMENTS

I would like to express my sincere and heartfelt thanks to my adviser and thesis committee chair, Dr. Joan Carletta, whose encouragement, guidance and advice brought me successfully through this research project, and to the rest of my committee members, Dr. Shiva Sastry, Dr. John Welch, Dr. Daniel B. Sheffer, Dr. Jose A. De Abreu-Garcia, and Dr. Dale Mugler, who generously gave the gift of their time. Thanks also to Dr. Jose A. De Abreu-Garcia, who served as my Department Head during this project and whose flexibility allowed me to finish it. I deeply appreciated the valuable help in all aspects that was offered by all the faculty and staff in the ECE department, who never hesitated to create time for me in their very busy schedules.
To my family, especially my dearest parents May Abou Najem & Hussein Hassan and my brother Wissam Hassan: thank you for your understanding and support during this time. To my best friends, especially Sami, Bilal, Rami, Ziad, Khalid, Mohand, Abed, Wasil, Madhar, Samer, Ehab, both Mohamads, and all others: thank you for everything you have done for me.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter
I. INTRODUCTION
   1.1 Statement of the Problem
   1.2 Goals of Research
   1.3 Research Approach
   1.4 Contribution to Knowledge
   1.5 Dissertation Outline
II. RELATED WORK
   2.1 Local tone mapping operators
       2.1.1 Local TMOs based on Retinex theory
       2.1.2 Local TMOs based on the zone system
       2.1.3 Local TMOs that use edge-preserving filters
       2.1.4 Local TMOs based on image appearance models
       2.1.5 Local TMOs based on general image processing techniques
   2.2 High dynamic range display devices
   2.3 Evaluation studies
III. DIFFERENT LOCAL ESTIMATES OF ILLUMINATION
   3.1 FIR blocks
       3.1.1 FIR block 1
       3.1.2 FIR block 2
       3.1.3 FIR block 3
   3.2 Average selection criteria
       3.2.1 Average selection criterion 1
       3.2.2 Average selection criterion 2
IV. LOG AVERAGE LUMINANCE AND NORMALIZATION METHODS
   4.1 Convert-to-floating block
   4.2 Reciprocal block
   4.3 Log average luminance block
   4.4 Normalization block
V. RESULTS FOR LOCAL TONE MAPPERS IMPLEMENTED IN HARDWARE
   5.1 Block diagram of the different local tone mappers
   5.2 Simulation results for the different local tone mappers
   5.3 Hardware costs of the different local tone mappers
VI. NINE-SCALE APPROXIMATE GAUSSIAN PYRAMID MAPPER
   6.1 Nine-scale Gaussian pyramid mapper
       6.1.1 Structure of vertical filters
       6.1.2 Structure of horizontal filters
   6.2 Results of Hardware Synthesis
   6.3 Experiments and Results
VII. TONE MAPPING OF COLOR HIGH DYNAMIC RANGE IMAGES
   7.1 Nine-scale approximate Gaussian pyramid mapper for color images
   7.2 Results of Hardware Synthesis
   7.3 Experiments and Results
VIII. CONCLUSIONS
   8.1 Summary of Accomplished Research
   8.2 Results of Hardware Synthesis and Experiments
   8.3 Future Work
REFERENCES

LIST OF TABLES

2.1. The values of the constants used in Reinhard's implementation.
3.1. Different states of the row counters used for the read/write and read address of the memories and the state of the mux select for the vertical filter of FIR block 1.
3.2. Pixel weights for the four windows in the third FIR block.
3.3. Pseudocode for average selection criterion 2.
4.1. Comparison of methods for approximating the fraction part of the logarithm.
4.2. The percentage error of the log average luminance with approximations of the base-2 logarithm and a fixed-point average calculation.
4.3. The percentage error of the log average luminance with a fully hardware-friendly approximation.
4.4. The algorithm to transfer the correct bits of the luminance mantissa to the output in the normalization block.
5.1. The parameters of the MGIP mapper for different images.
5.2. The PSNR of the three different local tone mappers for the set of input images.
5.3. The synthesis report from Quartus for the four different local tone mappers.
6.1. The hardware details for the nine vertical filters, given an input pixel stream of 28 bits.
6.2. Hardware details for the horizontal filters.
6.3. The coefficients for the horizontal filters for the fourth and fifth scales, in canonical signed digit representation.
6.4. Summary of hardware synthesis report.
6.5. PSNR results for the test images.
7.1. Coefficients for the computation of scene luminance using Equation 7.1.
7.2. Summary of hardware synthesis report.
7.3. The percentage of the time that each of the nine scales was chosen for the six different values of ε.

LIST OF FIGURES

1.1. The set of images used for simulation.
2.1. Test image from [1], displayed by mapping white to luminance values of 1.0, 10.0 and 500.0, respectively.
2.2. Basic structure of TMOs based on illumination compensation.
3.1. The structure of the windows used by the first FIR block.
3.2. Memory organization for the vertical filter of FIR block 1.
3.3. The structure of the windows used by the second FIR block.
3.4. Memory organization of the vertical filter of the second and third FIR blocks.
3.5. A block diagram of the hardware used to implement the four horizontal filters of the second FIR block.
3.6. The base-2 logarithm of the pixel weights for the four different windows used by the third FIR block.
3.7. The basic one-dimensional window in the logarithmic scale.
3.8. Basic function of ALU1 when processing the middle of the image.
3.9. Basic function of ALU2 when processing the middle of the image.
3.10. General architecture of horizontal computation block (HCB1 or HCB2).
3.11. General architecture of vertical computation block (VCB1 or VCB2).
3.12. Hardware for the 56 × 56 pixel window.
3.13. Block diagram for computation of the multi-scale average.
3.14. Memory organization to supply data to the four windows, and for the delayed pixel stream.
3.15. The internal hardware for calculating the exponent of ratio1.
3.16. MGIP local illumination estimate.
3.17. Four-scale constant weight local illumination estimate.
3.18. Four-scale approximate Gaussian local illumination estimate.
4.1. The internal hardware of the convert-to-floating block.
4.2. The block diagram of the log average luminance.
4.3. Hardware internal to the normalization block for calculating the display luminance mantissa and exponent.
5.1. Block diagram of the MGIP mapper.
5.2. Block diagram of the four-scale constant weight pyramid mapper.
5.3. Block diagram of the four-scale approximate Gaussian pyramid mapper.
5.4. HDR images from the Debevec library after being processed by the multi gain image processing mapper.
5.5. HDR images from the Debevec library after being processed by the four-scale constant weight pyramid mapper.
5.6. HDR images from the Debevec library after being processed by the four-scale approximate Gaussian mapper.
6.1. Block diagram of the nine-scale approximate Gaussian pyramid mapper.
6.2. The windows for the original Gaussian surround.
6.3. The hardware approximation for the original Gaussian surrounds.
6.4. The ALU for the vertical filter for the fifth scale, given a pixel stream of 28 bits.
6.6. Memory organization for the nine-scale approximate Gaussian pyramid.
6.7. Hardware architecture of the factored form FIR filter used for the horizontal filter for the first scale.
6.8. The ALU for the horizontal filter for the sixth scale, given an input pixel stream of 28 bits.
6.9. The hardware architecture for the horizontal filter for the sixth scale.
6.10. The HDR images from the Debevec library after being processed by the nine-scale approximate Gaussian pyramid mapper for grayscale images.
6.11. The HDR images from the Debevec library after being processed by the floating-point version of the nine-scale Reinhard local tone mapper for grayscale images.
7.1. Block diagram of the nine-scale approximate Gaussian pyramid mapper for color images.
7.2. The set of color HDR images used to test the nine-scale approximate Gaussian pyramid mapper for color images.
7.3. The set of color HDR images after being processed by the nine-scale approximate Gaussian pyramid mapper for color images.
7.4. The effect of the selection parameter ε on the processed Nave image.

CHAPTER I

INTRODUCTION

1.1 Statement of the Problem

We experience a huge range of light energy in the course of a day. The light of the sun is 100 million times more intense than starlight; in fact, the human visual system (HVS) functions over a luminance range of nearly 14 log units. Imaging sensors, on the other hand, have limited dynamic range and hence are sensitive to only a part of the luminance range.

Limited dynamic range is an obvious difficulty for any amateur or professional photographer. A photographer must decide on the range of luminance that is of interest and determine the exposure time accordingly. Sunlit scenes and scenes with shiny materials and artificial light sources often have extreme differences in luminance values that are impossible to capture without either under-exposing or saturating the film. To capture a scene with high dynamic range (HDR), photographers take a number of photographs with different exposures, and then combine these separate images into a composite HDR image. Nowadays, photography of even the most extreme HDR scenes is becoming possible with multiple-exposure image capture methods and novel cameras.

Displaying the textures and details of HDR images is a major problem for two main reasons.
First, available displays have small dynamic ranges that are easily overwhelmed by HDR images. A typical cathode ray tube display offers a dynamic range of no more than five hundred luminance levels. This is insufficient for HDR images that include visible light sources, deep shadows, and specular highlights, which can reach a dynamic range of one hundred thousand or more. Second, simple ways of adjusting HDR image intensities for display damage or destroy important details and textures. The most commonly used adjustments are borrowed from photography, and are given by

Ld = F((m · Ls)^γ)    (1.1)

where Ld and Ls are display and scene luminance in cd/m², m is a scale factor that depends on film exposure, γ is the dynamic range sensitivity of the camera, and F() limits the output to the display range. The simplest F() saturates out-of-range intensities to the display limits, but this discards the fine details and textures in the scene's shadows and highlights. Choosing a better limiting function F(), such as an S-shaped function that compresses the out-of-range intensities to the display limits, can help, but any choice of function forces a trade-off between preserving details at mid-range and preserving them in shadows and highlights.

Tone mapping operators (TMOs) are used to bridge the mismatch between the dynamic range of HDR images and the dynamic range of display devices. TMOs compress the dynamic range of the HDR image to the displayable range while reproducing as much as possible of the visual sensation of the scene. The main goals of TMOs are to:

• Compress dynamic range: Map the input HDR image into a low dynamic range image that can be displayed on limited dynamic range devices.
• Preserve visibility: Preserve visibility of details in the tone mapped image if and only if these details are visible in the natural scene to a human observer.
• Preserve overall contrast: Preserve the impression of overall contrast that a human observer experiences in the real scene.
• Preserve saturation: Preserve the impression of saturation perceivable by a human observer; that is, do not amplify or decrease saturation.
• Recover perceived colors: Recover the color that the human observer would perceive under the same lighting setting.
• Match brightness: Map the luminance input values into display device luminance values while preserving their original brightness.

A good TMO will try to fulfill all these goals, but some goals must be traded off against others. As an example, it is difficult to preserve visibility and contrast simultaneously. Contrast can be preserved by keeping the luminance ratios between pixels unchanged, but this can easily lead to a loss of visibility.

TMOs can be classified according to whether their operators are spatially uniform or non-uniform. Spatially uniform operators, so-called global operators, are independent of the local spatial context; they apply the same transformation function to each pixel of the image. Spatially non-uniform operators, so-called local operators, use a transformation function that varies adaptively with the local characteristics of the image. In general, spatially non-uniform operators produce higher quality tone mapped images than spatially uniform operators, but they have some drawbacks. Visually, the main disadvantage of spatially non-uniform operators is the appearance of halo artifacts; these are dark or white bands that appear at the borders between regions of different brightness in the image. Spatially non-uniform operators also require complex computations, and implementing them in real time is a challenging task. Overcoming these two drawbacks is the main problem addressed in our research.
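As a concrete illustration of the global mapping of Equation 1.1, the following sketch (in Python with NumPy) contrasts a saturating F() with an S-shaped F(). This is a minimal sketch under assumptions not stated in the text: the display range is normalized to [0, 1], and the S-shaped curve is a generic x/(1 + x) compression in the style of Reinhard's global operator; the function and parameter names are illustrative, not the dissertation's operator.

```python
import numpy as np

def tone_map_global(Ls, m=0.18, gamma=0.45, F="clip"):
    """Global photographic mapping Ld = F((m * Ls)**gamma), cf. Eq. 1.1.

    Display range is normalized to [0, 1]; m and gamma are illustrative values.
    """
    x = (m * Ls) ** gamma
    if F == "clip":
        # Saturating F(): out-of-range intensities are clamped, so detail in
        # deep shadows and bright highlights is discarded.
        return np.clip(x, 0.0, 1.0)
    # S-shaped F(): an x/(1+x) curve compresses out-of-range intensities
    # smoothly toward the display limits instead of clamping them.
    return x / (1.0 + x)

scene = np.array([0.01, 1.0, 100.0, 1e4])  # luminances spanning ~6 log units
clipped = tone_map_global(scene, F="clip")
s_curve = tone_map_global(scene, F="s")
```

With the clamping F(), the two brightest scene luminances both map to 1.0 and become indistinguishable; the S-curve keeps them distinct, but at the cost of compressing mid-range contrast, which is exactly the trade-off described above.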
1.2 Goals of Research

Our research does not address human visual system perception and limitations, but focuses only on the task of dynamic range compression that preserves as much scene detail as possible without introducing halo artifacts. The HDR images are processed in real time on an embedded system with high throughput and low cost. The processing speed required depends on the specific application. For example, a defense application for enhanced night vision requires video frame rates [52]; at frame rates slower than 30 frames/second, the person using the system tends to experience headaches.

Various platforms can be considered for embedded systems computing, from general purpose digital signal processors, to field programmable gate arrays (FPGAs), to application specific integrated circuits. Of these platforms, FPGAs are the most compelling for embedded, high throughput image processing. Commercially available FPGA boards and tools permit low cost implementations that are easily updated and are significantly faster than microprocessor-based implementations. It is important to note that despite the importance of local TMOs, there has been no FPGA implementation and evaluation of state-of-the-art local TMOs.

The successful implementation of any image processing algorithm on an embedded platform requires an interdisciplinary approach. Image processing researchers typically evaluate their work using floating-point software on high performance personal computers; they do not consider the ramifications of implementing their algorithms in hardware on a more restrictive platform. Hardware researchers, on the other hand, evaluate their work in terms of the resulting speed and hardware complexity, not often evaluating the quality of the system from an image processing point of view. An end-to-end system analysis is crucial in understanding the combined effects of algorithmic and implementation design choices.
This analysis can only be achieved by simultaneous consideration of both design aspects.

In summary, the goal of our research is to develop algorithms for local tone mapping of grayscale images of high dynamic range such that:

• such images can be displayed with clear detail on standard display devices without halo artifacts;
• the processing can be done on a field-programmable gate array-based platform in real time (images to be displayed on standard LCD monitors require a frame rate of 60 frames/second); and
• the system (algorithm plus hardware implementation) is the result of a careful trade-off of both image processing and hardware performance aspects.

In the course of doing the work, the following outcomes were achieved:

• the development of real-time algorithms for local tone mapping of images of high dynamic range. Previous local tone mapping algorithms do not consider the hardware complexity required for implementation; in contrast, the new algorithms developed are uniquely suited for embedded applications and work in real time. No such algorithms that achieve video frame rates have been previously published.
• the implementation and testing of the algorithms developed using synthesis and simulation tools such as Quartus and ModelSim.
• a system-level, end-to-end analysis of the techniques and implementations developed that includes both image processing and hardware performance measures.

1.3 Research Approach

Local TMOs can be classified into five different categories: TMOs based on Retinex theory, TMOs based on the zone system, TMOs based on illumination compensation, TMOs based on image appearance models, and TMOs based on general image processing techniques. Each category will be discussed in detail in the second chapter. Several local TMOs that are available in the literature were not considered in our research, for a number of reasons:

• TMOs that are iterative by nature.
Iterative methods are difficult to parallelize and therefore hard to implement in real time.
• TMOs that require complex computation. FPGA platforms are limited in hardware resources and embedded memory, and therefore impose a limit on the complexity of the method.
• TMOs that require large latencies, where latency is the difference in time between the moment a pixel enters the system and the moment it exits the system. In general, video protocols have a blanking period of limited length between consecutive rows and consecutive frames; the latency of the system should not exceed the length of this blanking period.

We have developed a number of fixed-point simplifications of local TMOs that are amenable to hardware. Three of these algorithms were implemented in hardware and tested. Implementation details will be given in the following chapters.

To test the visual effects of the implemented algorithms, we used a set of testbench HDR images from the Debevec library [50]. The images in the library are given in an RGBE format, rather than the standard 24-bit RGB format used to represent color images of ordinary dynamic range. Here, R, G, and B represent the mantissas of the three color components and E is the common exponent. For our grayscale systems, we derived a fixed-point grayscale image with sixteen bits of fraction and twelve bits of integer per pixel from the color image in the Debevec library. Each grayscale pixel is set to the luminance value for the original RGB triplet using:

Ls = (0.27·R + 0.67·G + 0.06·B) · exp(E)    (1.2)

The set of grayscale test images is shown in Figure 1.1. Because they are of high dynamic range, they cannot be displayed without some processing. In Figure 1.1 the image pixels have been linearly mapped to the display range:

Ld = 128 · Ls / Lsmean    (1.3)

where Ls is the luminance of the original HDR image, Ld is the luminance of the displayed image, and Lsmean is the mean value of luminance of the original image.
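The conversions of Equations 1.2 and 1.3 can be sketched in floating point as follows. This is a minimal illustration in Python/NumPy: the dissertation's hardware uses a 12.16 fixed-point format instead, the function names are hypothetical, and the exp(E) factor follows the text literally (the Radiance RGBE file format itself uses a power-of-two exponent).

```python
import numpy as np

def rgbe_to_gray(R, G, B, E):
    """Scene luminance of an RGBE pixel, per Eq. 1.2; R, G, B are the color
    mantissas and E is the common exponent."""
    return (0.27 * R + 0.67 * G + 0.06 * B) * np.exp(E)

def linear_display(Ls, target_mean=128.0):
    """Naive linear mapping of Eq. 1.3: scale luminance so that the image
    mean lands at mid-gray (128) of an assumed 8-bit display."""
    Ls = np.asarray(Ls, dtype=np.float64)
    return target_mean * Ls / Ls.mean()
```

Because the scaling in Equation 1.3 is global, a pixel 1000 times brighter than the mean still maps to 128000 and saturates an 8-bit display, which is why the linearly mapped images of Figure 1.1 lose their shadow and highlight detail.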
Memorial, Rosette and Nave are images of halls with windows and domes that are highly exposed to light. GroveC, GroveD, and Vinesunset are images of nature taken at sunset. The six images contain large areas of shadows and hidden details. It is clear from Figure 1.1 that a linear mapping is not sufficient to display high dynamic range images. The goal of this research is to develop more appropriate mappings that are simple enough for real-time embedded implementation.

The output of each algorithm was tested qualitatively. We looked for the following: the absence of halo artifacts, the preservation of details, and the enhancement of edges and contrast. We also judged the quality of the output of the algorithms quantitatively. For this purpose, we used the peak signal-to-noise ratio of the processed images compared to the output of a gold standard. The gold standard is a floating-point implementation of a well-known local TMO, slightly modified to allow direct pixel-by-pixel comparison with our algorithms.

Figure 1.1. The set of images used for simulation: Memorial, Rosette, Nave, Vinesunset, GroveC, GroveD.

1.4 Contribution to Knowledge

The basic contribution of our work is the real-time embedded implementation of local tone mappers using FPGAs. To the best of our knowledge, this is the first published implementation of a local tone mapper on an FPGA, and also the first published hardware implementation on any platform that is compatible with frame rates for video applications. The work makes a convincing case for the use of FPGAs as a computational platform for computer graphics applications. Graphics cards are not particularly well suited to the implementation of local tone mapping; their hardware architecture and memory organization are not as flexible as an FPGA's. It is easy to imagine a future graphics card that contains an FPGA co-processor so that computationally complex algorithms like local tone mapping can be implemented independently.
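The quantitative comparison against the gold standard described in Section 1.3 uses the standard peak signal-to-noise ratio. A generic sketch follows; the 8-bit display peak of 255 is an assumption for illustration, not a value stated in the text.

```python
import numpy as np

def psnr(gold, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) of a processed image against a
    pixel-aligned gold-standard image."""
    gold = np.asarray(gold, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((gold - test) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Under this definition, an average PSNR near 35 dB, as reported in the abstract, corresponds to a root-mean-square pixel error of roughly 4.5 gray levels out of 255.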
The implementations developed in this work could also be included inside future high dynamic range cameras as a processing interface between the HDR image and the LCD screen that is used as a viewfinder. They could also be useful in military night vision applications, where HDR images taken using advanced sensors at night must be processed in real time for use by pilots.

We believe that our work can be generalized beyond the real-time embedded implementation of local tone mapping operators. The Gaussian pyramid, which is central to our algorithms, is used not only for local tone mapping, but also in general for the multiscale representation of images. Computation of a Gaussian kernel of large scale has high computational complexity, and there has been consistent interest in efficient computation of the Gaussian pyramid for a number of applications. We have approximated the Gaussian kernels using coefficients that are easier to implement in hardware, and have developed an accumulator-based architecture to implement filtering using these coefficients (in Section 6.1). The same architecture used to approximate the Gaussian pyramid here is likely to be useful in other applications. Thus, the ideas used to approximate the Gaussian pyramid in a hardware-friendly way contribute to current knowledge in embedded algorithms for multiscale image representation.

The embedded memory available inside FPGAs is always limited in size and in number of blocks. When designing our memory organizations, we tried to reduce the need for memory (both bits and blocks) by reusing a given memory block for multiple tasks. For example, instead of having a separate memory for every window in the Gaussian pyramid, we divided the memory of the largest-scale window into a number of memory blocks that can provide data to all the windows simultaneously.
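The general idea behind hardware-friendly Gaussian approximation can be illustrated in software: quantize the filter taps to short fixed-point values so that each multiply reduces to a few shifts and adds. The specific coefficient choices in this dissertation are made in Section 6.1; the sketch below is an illustrative assumption, not those coefficients.

```python
import numpy as np

def gaussian_taps(sigma, radius):
    """Reference 1-D Gaussian filter taps, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

def quantize_to_fixed_point(taps, frac_bits=8):
    """Round each tap to a multiple of 2^-frac_bits, so a hardware
    multiply becomes a handful of shifts and adds. (Illustrative only.)"""
    scale = 1 << frac_bits
    return np.round(taps * scale) / scale

taps = gaussian_taps(sigma=2.0, radius=5)
approx = quantize_to_fixed_point(taps)
max_err = np.abs(taps - approx).max()  # at most half an LSB, i.e. 2^-9
```

The approximation error per tap is bounded by half of the least significant fractional bit, which lets the designer trade coefficient width against filter accuracy.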
We also implemented the symmetric extension by using the memories as stacks and delay blocks at the same time (in Section 3.1.2). We believe that this memory organization design process can be generalized to contribute to state-of-the-art methods for embedded memory organization.

Our method for computing the base-2 logarithm of a pixel is novel, and gives better accuracy than other published methods of similar complexity. It can be considered a contribution that generalizes beyond the implementation of local tone mapping, since it can be used in other embedded systems that are based on logarithmic arithmetic.

1.5 Dissertation Outline

The rest of this dissertation is organized as follows. Chapter Two presents the related work, including the necessity of local TMOs, the five different categories of local TMOs, HDR devices, and evaluation studies. Chapter Three describes the different local estimates of illumination used by the local TMOs that were implemented in hardware in this work. Chapter Four describes the remaining building blocks of the implemented local TMOs. Chapter Five explains the synthesis and simulation details of our three implemented grayscale local tone mappers: the MGIP mapper, the four-scale constant weight pyramid mapper, and the four-scale approximate Gaussian pyramid mapper. Chapter Six describes a nine-scale version of the approximate Gaussian pyramid mapper, and gives its synthesis and simulation results. Chapter Seven extends the nine-scale approximate Gaussian pyramid mapper to the color domain. Finally, Chapter Eight summarizes the accomplished research and describes possible future work.

CHAPTER II

RELATED WORK

In this chapter, we start with a descriptive example that shows the necessity of local tone mapping operators (TMOs). Then, we divide local TMOs into five different categories and explain each category separately.
HDR devices are presented as an alternative to TMOs, as well as an evaluation tool for the performance of different operators. Finally, a summary of different methods for evaluating the image processing performance of an algorithm is given.

2.1 Local tone mapping operators

Chui et al. [1] were the first to describe the necessity of local TMOs. They developed a test image that shows qualitatively that global operators are not always sufficient. The test image is of a room lit by a single incandescent bulb. The luminance in the image varies from 0.0167 cd/m² at a table leg to 500 cd/m² at the light bulb, for a ratio of approximately 1:30000. The test image is displayed using three different global TMOs in Figure 2.1; all three map a luminance value of 0 to black. Figures 2.1a, 2.1b, and 2.1c map luminance values of 1, 10, and 500 to white, respectively. Because of the wide range of luminance in the image, none of the mappings is satisfactory. Figure 2.1a displays the floor in a way that is qualitatively similar to the way the floor looks in the real scene (with clearly visible wood grain), but pixels near the light bulb are washed out. Figure 2.1b shows both the light and the floor, but the detail in the floor is lost, and the pixels on the edge of the bulb are all mapped to white. Figure 2.1c shows only the light bulb, while the rest of the image pixels go to black or near black.

For proper display of Chui's test image, a pixel covering the floor with an original luminance of 0.08 cd/m² should be mapped to a higher value than a pixel near the bulb edge with an original luminance of 5.0 cd/m². This basic observation implies that the way luminance is mapped to a pixel value on a CRT or printer should depend on the local illumination around the pixel.

Figure 2.1. Test image from [1], displayed by mapping white to luminance values of 1.0, 10.0, and 500.0, respectively.

Figure 2.2. Basic structure of TMOs based on illumination compensation.
To eliminate the effect of local illumination, most local TMOs rely on the same basic principle: the luminance L_s of a scene is the product of the illumination I_s and the reflectance of the object surface R_s. To account for the effects of uneven illumination, illumination and reflectance must be separated. First, the illumination component I_s is extracted. Then, the luminance L_s is divided by I_s to get the reflectance component R_s. Finally, I_s is compressed using a nonlinear function and multiplied by R_s to get the displayed luminance L_d. This basic approach is shown in Figure 2.2.

The success of this approach depends entirely on the design of the illumination extraction filter. If the filter's smoothing is incomplete, the illumination compression may destroy some important scene details. However, if the filter blurs or distorts its illumination output, these distortions will escape compression as part of the reflectance signal. These errors can cause strange halo-like artifacts in the result, especially near strong lights and large strongly shaded regions. Most published methods are unable to completely avoid both types of errors.

The outline of the remaining part of this chapter is as follows. We start by explaining TMOs that are based on Retinex theory and the zone system. Both categories are important for our research because they can be practically implemented in real-time hardware. A digital signal processor-based implementation of the single-scale Jobson Retinex algorithm [10] and a programmable graphics hardware-based implementation of the Reinhard zone system [17, 18, and 53] are found in the literature. However, neither implementation reaches our target frame rate. Next, TMOs that use edge-preserving filters and image appearance models are described. TMOs in both categories are too complex to be practically implemented in real-time hardware.
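The illumination-compensation structure of Figure 2.2 can be sketched in a few lines. The Gaussian blur used here as the illumination estimator and the power-law compression curve are illustrative assumptions; the choice of extraction filter is precisely what distinguishes the operators surveyed in the rest of this chapter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def illumination_compensation(ls, sigma=16.0, gamma=0.4, eps=1e-6):
    """Figure 2.2 pipeline: extract illumination, divide it out,
    compress it nonlinearly, and recombine with the reflectance."""
    illum = gaussian_filter(ls, sigma) + eps  # I_s: low-pass illumination estimate
    refl = ls / illum                         # R_s = L_s / I_s
    illum_c = illum ** gamma                  # nonlinear compression of I_s
    return refl * illum_c                     # L_d = R_s x compressed I_s
```

With a simple linear low-pass filter as shown, halos appear around strong edges, which motivates the edge-preserving filters described in Section 2.1.3.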
They are included in the literature review to show how difficult it is to eliminate halo artifacts completely. Finally, we review additional TMOs that are based on general image processing techniques such as segmentation and wavelet decomposition.

2.1.1 Local TMOs based on Retinex theory

Retinex theory was introduced by Land and McCann [2]. A key part of the Retinex algorithm is to generate an estimate of the local illumination around every pixel. In Land's original algorithm this estimate is produced by exploring the image with a number of arbitrary paths through the pixel. The authors in [3, 4] introduced a Retinex implementation that uses the construction of Brownian random paths in the image to estimate illumination. The authors in [5] showed that the illumination estimation can be formulated as a quadratic programming optimization problem. A multi-resolution algorithm was proposed that exploits the spatial correlation in the reflectance and illumination images. However, the computational complexity of the illumination reconstruction algorithm is relatively high, since in the resulting quadratic programming problem, the whole illumination image is the unknown. The same authors then proposed in [6] a compromise to the full-fledged solution of the theoretical model, presenting a variety of efficient yet limited computational methods for which they develop optimal solutions.

Jobson et al. [7] suggested the use of a Gaussian surround centered on the pixel to estimate local illumination. Though simple, their Retinex suffered from halo artifacts. They then suggested a multi-scale Retinex [8, 9] to decrease halo artifacts. Their multi-scale Retinex algorithm uses three differently-sized neighborhoods to estimate illumination. It then divides each pixel luminance by its estimated illumination in the log domain, so that the transformed displayed luminance more closely represents reflectance.
The Jobson Retinex computes:

L_d(x, y) = (1/3) Σ_{j=1}^{3} [ log(L_s(x, y)) − log(L_s(x, y) ⊗ G(x, y, σ_j)) ]    (2.1)

where ⊗ represents convolution, and G(x, y, σ_j) is a Gaussian filter used to weight the pixels in the neighborhood when computing the jth illumination estimate. The filter G(x, y, σ_j) is given by

G(x, y, σ_j) = K_j exp[ −(x² + y²) / σ_j² ]    (2.2)

where σ_j is the standard deviation of the Gaussian distribution and the scale factor K_j is used to normalize G(x, y, σ_j). They used σ1 = 15, σ2 = 80, and σ3 = 250. While Equation 2.1 is expressed in the space domain, the computations are implemented in the frequency domain using the Fourier transform, because convolution over the large neighborhoods indicated here is computationally complex.

The design and development of a digital signal processor (DSP) implementation of the single-scale Jobson Retinex was presented in [10]. The target processor was a Texas Instruments TMS320C6711 floating-point DSP. NTSC video was captured using a dedicated frame-grabber card, Retinex processed, and displayed on a standard monitor. The best frame rate reached for a 256x256 grayscale image was 20 frames per second. A VLSI-based implementation of the algorithm was discussed in [11], but details about implementation and performance are not given. Finally, a modified version of the Jobson Retinex with an improved global impression of brightness was presented in [12]. The operator was also investigated more deeply in [13].

2.1.2 Local TMOs based on the zone system

The zone system is a traditional method for tone mapping used for photographic printing. A zone is defined as a Roman numeral associated with an approximate luminance range in a scene as well as an approximate reflectance of a print. Zones relate logarithmically to scene luminances. Hence, dynamic range can be expressed as the difference between the highest and lowest distinguishable scene zones.
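Returning briefly to the multi-scale Retinex of Equations 2.1 and 2.2, a minimal software sketch is given below. For clarity it performs the convolution in the space domain with scipy's Gaussian blur; the published implementations [7-9] use the Fourier domain for the large neighborhoods involved.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_retinex(ls, sigmas=(15, 80, 250), eps=1e-6):
    """Equation 2.1: average, over three scales, of the difference
    between the log luminance and the log of its Gaussian-blurred
    illumination estimate."""
    log_l = np.log(ls + eps)
    out = np.zeros_like(log_l)
    for sigma in sigmas:
        illum = gaussian_filter(ls, sigma) + eps  # one illumination estimate
        out += log_l - np.log(illum)              # log-domain division
    return out / len(sigmas)
```

On a uniformly lit region the estimate equals the luminance, so the output is zero: the operator responds only to local deviations from the surround.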
A crucial part of the zone system is its methodology for predicting how scene luminances will map to a set of print zones. The photographer first takes a luminance reading of a surface he perceives as a middle-gray. In a typical situation this will be mapped to zone V, which corresponds to 18% of the display range. Next, the photographer takes luminance readings of both light and dark regions to determine the dynamic range of the scene. If the dynamic range of the scene does not exceed nine zones, an appropriate choice of middle-gray can ensure that all textured detail is captured in the final print. For a dynamic range of more than nine zones, some areas will be mapped to pure black or white with a standard development process. In this situation, the photographer will withhold or add light to a portion of the print during development. This printing technique is called dodging-and-burning. It will lighten or darken that region in the final print relative to what it would be if the same development were used for all portions of the print.

Reinhard et al. [14] used the zone system as a basis for addressing the tone mapping problem. They first show how to find the middle-gray luminance of the scene:

L_s^middle-gray = exp( (1/N) Σ_{x,y} log(δ + L_s(x, y)) )    (2.3)

where L_s(x, y) is the scene luminance for pixel (x, y), N is the total number of pixels in the image, and δ is a small value to avoid the singularity that occurs if black pixels are present in the image. The middle-gray scene luminance is typically mapped to the middle-gray of the displayed image, which again corresponds to 18% of the display range. Hence,

L_d1(x, y) = (0.18 / L_s^middle-gray) × L_s(x, y)    (2.4)

where L_d1(x, y) is the display luminance before dodging-and-burning. Reinhard's method then applies dodging-and-burning for scenes of high dynamic range. Dodging-and-burning is typically applied over an entire region bounded by large contrasts.
The size of a local region is estimated using a measure of local contrast, which is computed at multiple spatial scales. Such contrast measures frequently use a center-surround function at each spatial scale, often implemented by subtracting two Gaussian blurred images [30, 31]. The center-surround function used in [14] is defined by:

V(x, y, σ) = ( V1(x, y, σ) − V2(x, y, σ) ) / ( 2^φ × 0.18 / σ² + V1(x, y, σ) )    (2.5)

where the center V1 and surround V2 responses are derived from Equations 2.6 and 2.7:

V_i(x, y, σ) = L_d1(x, y) ⊗ G_i(x, y, σ)    (2.6)

G_i(x, y, σ) = K_i exp( −(x² + y²) / (α_i σ)² )    (2.7)

This constitutes a standard difference-of-Gaussians approach, normalized by 2^φ × 0.18/σ² + V1, where φ is a sharpening parameter. For computational convenience, the center size of the next higher scale is set to be the same as the surround of the current scale. Starting at the lowest scale, they seek the first scale σ_m where

V(x, y, σ_m) < ε    (2.8)

is true. Here ε is a threshold. Given a judiciously chosen scale for a given pixel,

L_d2(x, y) = L_d1(x, y) / ( 1 + V1(x, y, σ_m(x, y)) )    (2.9)

where L_d2(x, y) is the final display luminance for pixel (x, y). This function constitutes the local dodging-and-burning operator. Table 2.1 summarizes the values of the parameters in the above equations used in the first implementation by Reinhard. Realizing the difficulty of tuning these parameters, he presented another paper [15] that aims at automatic estimation.

Table 2.1. The values of the constants used in Reinhard's implementation.
  ε  = 0.05
  φ  = 8
  α1 = 0.35
  α2 = 1.6
  σ  = 8 discrete levels increasing by a factor of 1.6 from 1 to 43

The Reinhard operator is particularly attractive for hardware implementation for two reasons. First, the method requires only one global computation: the middle-gray luminance (L_s^middle-gray) in Equation 2.3. Note that this value, which we refer to as the log average luminance, is the average of the base-2 logarithm of all the pixels in the image.
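A compact software sketch of Equations 2.3-2.9 follows, with parameter values after Table 2.1. It is illustrative rather than a faithful reimplementation: scipy's gaussian_filter stands in for the kernels of Equation 2.7 (its standard deviation is used directly as α_i σ, a simplification), and pixels for which no scale passes the test of Equation 2.8 fall back to the largest scale.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reinhard_local(ls, phi=8.0, eps_thresh=0.05, alpha1=0.35, ratio=1.6,
                   delta=1e-6, num_scales=8):
    """Sketch of the Reinhard operator, Equations 2.3-2.9."""
    # Equation 2.3: log-average ("middle-gray") scene luminance.
    l_mid = np.exp(np.mean(np.log(delta + ls)))
    # Equation 2.4: scale so middle-gray maps to 18% of the display range.
    ld1 = 0.18 * ls / l_mid

    chosen = np.zeros(ld1.shape, dtype=bool)
    v1_sel = ld1.copy()
    sigma = 1.0
    for _ in range(num_scales):
        # Equations 2.6-2.7: center and surround Gaussian responses.
        v1 = gaussian_filter(ld1, alpha1 * sigma)
        v2 = gaussian_filter(ld1, alpha1 * ratio * sigma)
        # Equation 2.5: normalized center-surround contrast measure.
        v = (v1 - v2) / (2.0 ** phi * 0.18 / sigma ** 2 + v1)
        # Equation 2.8: keep the first (smallest) scale with low contrast.
        newly = (~chosen) & (np.abs(v) < eps_thresh)
        v1_sel = np.where(newly, v1, v1_sel)
        chosen |= newly
        sigma *= ratio
    v1_sel = np.where(chosen, v1_sel, v1)  # fallback: largest scale
    # Equation 2.9: local dodging-and-burning.
    return ld1 / (1.0 + v1_sel)
```

Note that everything after Equation 2.3 is purely local, which is the property that makes the operator attractive for a streaming hardware pipeline.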
Global computations are not particularly amenable to hardware, since they cannot be completed until the entire image has been input into the system. Barladian et al. [16] suggested a sampling method to calculate this log average luminance. Second, while the local dodging-and-burning technique can be computationally expensive, the process lends itself to adaptive refinement. In other words, we can vary the number of scales depending on the level of detail we wish to preserve. This allows us to trade off efficiency and accuracy, which can be crucial for real-time applications.

The authors in [17, 18] mapped the Reinhard operator to the pixel processor of a programmable graphics hardware unit. The Gaussian pyramid was implemented using convolution in the space domain. For a 256 × 256 grayscale image they maintained a frame rate of 20 frames/second. The main bottleneck was in the memory bandwidth, as one would expect in such an algorithm. Another real-time implementation of the Reinhard method was suggested in [53]. The performance of the method was measured on a Pentium4 2 GHz processor and an NVIDIA GeForce 6800GT graphics card. Using four scales of the Gaussian pyramid, they were able to process only 14 frames per second for a 1024 × 768 image.

2.1.3 Local TMOs that use edge-preserving filters

Local TMOs that use edge-preserving filters fall into two broad classes: iterative methods and noniterative methods. They are generally characterized by complex computation and are difficult to implement in real-time hardware.

The low curvature image simplifier (LCIS [19]) is an example of an iterative method that follows the artistic technique of image rendering. LCIS offers a hierarchy of boundaries and shadings in the same way that linear filters such as wavelet filter banks and image pyramids offer a hierarchy of sinusoids. Each level of the hierarchy is made of a simplified version of the original scene containing sharp boundaries and smooth shadings.
This hierarchy is used to convert the HDR image into a highly detailed display image. Though its results are appealing, LCIS smoothing is slow, and can sometimes result in excessive details in the display image.

Another iterative method, called the gradient domain technique, was discussed in [20]. The authors compressed the magnitude of the large image gradients that are responsible for the high dynamic range. They then iteratively solved a Poisson equation to find an image that best fits the compressed gradients in the least-squares sense. The gradient domain technique converges more quickly, requires fewer parameters, and avoids the often excessively busy or noisy appearance of the LCIS method. The gradient domain technique was used in [21] to construct a class of image fusion methods. The authors automatically combined images of a scene captured under different illuminations. They carried into the combined image the gradients from each input image that appear to be locally important. Another multi-scale image enhancement technique based on gradients is described in [22]. The method equalizes image details at multiple scales in the image, resulting in effective dynamic range compression and sharpening of faint image details.

A third iterative method is described in [23]. The authors adopt the level set framework, developed in [24], to extract the reflectance of the image. Then, they compress the illumination and add the reflectance to get the final displayable image. Computations are done in the logarithmic domain. This iterative method is more flexible than LCIS because the amount of detail added to the final image can be controlled; thus, the excessive details sometimes seen in LCIS-processed images can be avoided.

Nonlinear filters such as bilateral filters and trilateral filters perform good quality edge-preserving smoothing in a single pass.
The bilateral filter is a nonlinear filter whose output is a weighted average of the input in the space domain as well as the luminance domain. The value of the output pixel is influenced mainly by pixels that are spatially close and that have a similar luminance. Two Gaussian kernels are used to average the pixels in the space and luminance domains. The authors in [25] suggested a piecewise-linear bilateral filter in order to accelerate computation. They discretize the set of possible pixel luminances into segment values, and compute a linear filter for each such value using the FFT. The final output of the filter for a pixel is a linear interpolation between the outputs of the segment values closest to the pixel luminance. They also used subsampling methods in the space domain to further accelerate the filter.

Similar to gradient domain techniques, bilateral filters were also used for image fusion. The authors in [26] enhanced photographs shot in dark environments by combining a picture taken with the available light and one taken with the flash. They used the bilateral filter to decompose the images into reflectance and illumination. The image was reconstructed using the illumination of the image taken with available lighting and the reflectance of the image taken with the flash.

A more sophisticated trilateral filter was introduced in [27]. The output is a weighted average of the input in the space domain, luminance domain, and gradient domain. The major contributions of this filter are as follows. First, its filter window is skewed or tilted by the smoothed image gradient vector to track high gradient regions. Second, the local neighborhood or domain automatically adapts to local image features to smooth the largest possible region with similar smoothed gradient values. Third, all its internal parameters can be derived from a single user-supplied value.

A local model of eye adaptation based on bilateral filters was introduced in [28].
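The bilateral filter described above can be sketched in its direct, brute-force form; the window radius and kernel widths below are illustrative. The piecewise-linear FFT acceleration of [25] is omitted for clarity.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Brute-force bilateral filter: each output pixel is a weighted
    average whose weights are the product of a spatial Gaussian and a
    luminance-difference Gaussian, so edges are preserved."""
    h, w = img.shape
    pad = np.pad(img, radius, mode="reflect")
    out = np.zeros_like(img, dtype=float)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))  # space-domain kernel
    for i in range(h):
        for j in range(w):
            window = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # luminance-domain kernel, centered on the current pixel value
            range_w = np.exp(-(window - img[i, j])**2 / (2 * sigma_r**2))
            weights = spatial * range_w
            out[i, j] = (weights * window).sum() / weights.sum()
    return out
```

Pixels across a strong luminance edge receive near-zero range weights, which is what keeps the illumination estimate from blurring across the edge and causing halos.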
Eye adaptation is the mechanism that allows our eyes to respond to varying illumination levels. Thanks to this highly localized mechanism, our visual system is able to see luminances at extreme ends of the dynamic range. This model targets complex HDR images with simultaneously differently illuminated regions. It adjusts each pixel in the space and time domains to simulate the adaptation of the human visual system.

2.1.4 Local TMOs based on image appearance models

Image appearance models have greater flexibility than dedicated tone mapping algorithms, as they are designed to predict how images appear perceptually, and not designed for the single purpose of dynamic range compression. Pattanaik et al. [29] developed a computational model of adaptation and spatial vision for realistic tone mapping. The model incorporates a multi-scale representation of luminance and pattern in human vision that accounts for the changes in threshold visibility, visual acuity, suprathreshold brightness, and apparent contrast that occur with changes in scene illumination. The model was incorporated into a TMO that maps the vast ranges of luminance found in real and synthetic scenes into the small fixed ranges available on conventional display devices such as CRTs and printers.

A second image appearance model, called the image color appearance model (iCAM), was introduced in 2002 [32]. The goal of iCAM was to combine traditional color appearance capabilities with spatial vision and image quality metrics. iCAM was designed to be computationally simpler than Pattanaik's method [29], with similar capabilities. The authors in [33] described the use of iCAM for tone mapping HDR images.

2.1.5 Local TMOs based on general image processing techniques

Visual contrast depends heavily on the local contrast, which can be defined as the luminance ratio of a pixel to its local surround.
The authors in [34] described the mathematical condition to preserve the local contrast as follows:

L_d(x, y) / L_d^local-average(x, y) = L_s(x, y) / L_s^local-average(x, y)    (2.10)

Here, L_s(x, y) and L_s^local-average(x, y) denote the scene luminance level and the scene local average, and L_d(x, y) and L_d^local-average(x, y) denote the displayed luminance level and the displayed local average of each pixel (x, y), respectively. After mathematical derivations and approximations they found that:

L_d(x, y) = p(L_s(x, y)) × r(L_s(x, y), L_s^local-average(x, y))    (2.11)

where p is a first-order differentiable global TMO, and

r(L_s(x, y), L_s^local-average(x, y)) = [ L_s(x, y) / L_s^local-average(x, y) ]^( α × [ 1 − (L_s(x, y) / p(L_s(x, y))) × dp(L_s(x, y))/dL_s(x, y) ] )    (2.12)

where α denotes a gain parameter for the local contrast enhancement. The local average L_s^local-average(x, y) is calculated by convolving the input image with a spatial averaging filter. However, the proposed algorithm has high computation costs and requires large memory. This was pinpointed at the end of the paper by the authors themselves.

Rizzi et al. [35] defined a kind of perceived image as:

L_d(i) = [ Σ_{j∈image} m(i, j) { log(ε + L_s(i)) − log(ε + L_s(j)) } / d(i, j) ] / [ Σ_{j∈image} 1 / d(i, j) ]    (2.13)

where ε is a small value avoiding the logarithm of 0, d(i, j) is a distance function, and m(i, j) is a modification operator. Following the ratio rule of the Retinex algorithm, they computed the displayed luminance of each pixel by taking the difference of logarithms between the scene luminance of the pixel and the other pixels in the image. In addition, they weighted each contribution by the inverse of the distance d(i, j), giving stronger influence to pixels near i.
Finally, the intensity of the pixels' interaction is controlled by a modification operator m(i, j), which can change its shape depending on local information. However, the computational cost of this method is quite high.

Sa et al. [36] described a new technique that uses active scene illumination to perform foreground-background segmentation and recover partial HDR information. They show how to produce global tone-mapped images, where background and foreground receive different treatments. Segmentation was also used in [37] to define the regions for dynamic range compression in a chest radiograph. The segmentation algorithm employs the difference in X-ray transmission between the lung and mediastinum. A region-based TMO was defined using a warping transformation of the correlated distribution between the pixel and its local average. This method was used to enhance the lung and mediastinum equally.

The authors in [38] presented an approach to tone mapping of HDR images inspired by the anchoring theory. The key concept of this method is the decomposition of an HDR image into areas (frameworks) of consistent luminance and the local calculation of the lightness values. The net lightness of an image is calculated via the merging of the frameworks proportionally to their strength. The approach does not affect the local contrast, and preserves the natural colors of an HDR image by using only linear transformations on luminance.

Tumblin et al. described two methods for display of HDR images [39]. The first builds a display image from several layers of lighting and surface properties. Only the lighting layers are compressed, drastically reducing dynamic range while preserving much of the image detail. This method is practical only for synthetic images, where the layers can be retained from the rendering process. The second method interactively adjusts the displayed image to preserve local contrasts in a small foveal neighborhood.
Unlike the first method, this technique is usable on any image.

Unlike traditional single-scale techniques, wavelet-based algorithms offer the capability of modifying image components adaptively based on their spatial and frequency properties. The authors in [40, 41, and 42] proposed a novel image enhancement method, human vision system controlled color image enhancement and evaluation. The enhancement process takes place in the wavelet transform domain. A contrast sensitivity function (CSF) implementation is employed in the wavelet domain based on an invariant single factor weighting. The algorithm outperforms Land's multiscale Retinex. However, dynamic range compression was not implemented in this algorithm.

In summary, we have presented five different categories of local tone mapping operators. The only two categories that are important for our research are local TMOs based on the Retinex theory and the zone system. The remaining categories are characterized by complex computation and cannot be practically implemented in real-time hardware. They were mentioned in this section of related work for the sake of completeness.

2.2 High dynamic range display devices

TMOs have been developed to compress the dynamic range of HDR images so that these images can be displayed on conventional 8-bit display devices, including the cathode ray tube, liquid crystal display (LCD), and projector-based display. This section describes an alternative: to display the HDR images directly on an HDR display system.

LCD screens cannot completely prevent light transmission. Hence, the dynamic range of an LCD is defined by the ratio between the light emitted in the brightest state and the light emitted in the darkest state. In order to maintain a reasonable black level of about 1 cd/m², the LCD brightness is limited to a maximum of about 300 cd/m². The authors in [43] inserted an additional light modulator between the backlight and the light modulator of the LCD screen.
These two modulators in series provide an extremely dark state with very low light emission, which then makes it possible to increase the brightness of the backlight dramatically without losing the black state. Optically, this series of modulators results in a multiplication of the individual dynamic ranges; for example, two modulators with contrast ratios of 300:1 each would yield a combined contrast ratio of 90000:1.

A wide-field high dynamic range viewer was presented in [44]. The viewer is constructed by combining a bright uniform backlight (20 kcd/m²) and layered transparencies with a set of LEEP ARV-1 optics. They used two transparencies, layered one on top of the other. By combining the ranges in this way, they doubled the dynamic range in log terms. To avoid problems with transparency alignment and ghosting, they reduced the resolution of one layer using a Gaussian blur function. The viewer was able to reproduce the absolute luminance levels and full dynamic range of almost any visual environment.

The authors in [45] presented a series of psychophysical experiments to validate an HDR device against a real scene. From the results of the studies, it was clear that in most circumstances, an image viewed on the HDR device is similar in visibility and contrast to the real scene. This is significant if we are to use these devices as tools to simulate real scenes. When the same HDR photograph is displayed on a typical monitor, the visibility is very different from reality.

2.3 Evaluation studies

Every representation or display of a visual scene involves some degree of compromise. It is not yet possible to recreate the full range of spectral, spatial, and luminous detail that is encountered by the human visual system (HVS) in normal experiences of scenes. Hence, it is essential to assess the perceptual fidelity of the displayed scene according to reasonable criteria, such as visibility or discriminability of key features. Moreover, the criteria should be defined and tied to the purpose for which the displayed image is intended.
The authors in [47] discuss evaluations of the extent to which luminance, contrast, and visibility are preserved with three different methods for representing real-world scenes. Method one involves using psychophysical data from contrast charts to select the best print from among a density-varied series of photographic prints. Method two uses tone mapping algorithms to compress the luminance information in the HDR image into the luminance range available in the display, while preserving visible contrast as much as possible. Method three uses the wide-field high dynamic range viewer, explained in Section 2.2, to present an image with a much wider dynamic range than is available in a photographic print or a CRT display. In general, each of these methods represents an improvement over a simple photographic representation. Given an understanding of the limitations of each approach, methods such as these can support practical decisions in visual design and reconstruction.

The field of tone mapping assumes thorough knowledge of both the objective and subjective attributes of an image. However, the image attributes used in tone mapping are not well defined and consistent. The authors in [48] present an overview of image attributes that are used extensively in different tone mapping methods. The attributes include brightness, contrast, color reproduction, detail reproduction, visual acuity simulation, glare simulation, and artifacts. They propose a scheme of relationships between these attributes, leading to the definition of an overall quality measure which they call naturalness.

Finally, the authors in [49] present the results of a series of psychophysical experiments to validate six frequently used global and local tone mapping operators against linearly mapped high dynamic range scenes displayed on a novel HDR device.
The four local TMOs that were chosen for this investigation are the bilateral filter, the Reinhard method, the iCAM model and the local eye adaptation method. In the first experiment, participants were asked to make simple comparisons with the ideal reference image. The Reinhard method performed best for gray scale images. In the second experiment, participants were asked to judge specific attributes of details. The local eye adaptation method was the best at preserving details. In this chapter we have presented the different local tone mapping operators found in the literature. In addition, high dynamic range display devices were presented as an alternative to tone mapping and as an evaluation tool for different tone mapping operators. Finally, we presented some evaluation criteria to judge the quality of tone-mapped images, as well as an evaluation study of six of the most famous tone mapping operators in the literature. In summary, the Reinhard method was the most appealing method for our research. It produced qualitatively good tone-mapped gray scale images, as shown by the evaluation study in [49], and has been implemented in real time, although not at video frame rates, using very complex hardware such as graphics cards. Other local TMOs, such as the eye adaptation model that was found successful in the same evaluation study, require much more complex computations and are too difficult to implement practically in real-time hardware. The challenge of our research is to implement the Reinhard method on an embedded platform, to process large-resolution images at video frame rates around 60 frames/second while producing qualitatively good tone-mapped images.

CHAPTER III

DIFFERENT LOCAL ESTIMATES OF ILLUMINATION

A key part of local tone mapping operators is computation of a local estimate of illumination around each pixel. Using two-dimensional FIR filters, we calculate a number of local averages in the neighborhood of every pixel in the image.
Then, we apply an average selection criterion to choose the best local illumination estimate around each pixel from among these local averages. The idea behind the average selection criterion is to choose the largest neighborhood around the pixel that does not contain a large illumination variation. This chapter describes the different FIR blocks and average selection criteria used by the tone mapping algorithms described in Chapter 5. Section 3.1 explains three different FIR blocks in detail. Section 3.2 explains two different average selection criteria in detail. The different FIR blocks are combined with the appropriate selection criterion to find three different local illumination estimates.

3.1 FIR blocks

The three different FIR blocks have some common properties. First, the computation of the average inside every block can be done with simple accumulator structures. Second, symmetric extension is applied at the borders to preserve image quality; we use even symmetric extension, which presumes that the image is extended on all four sides with a reflected copy of the pixels near the border. Third, the square windows inside every block are implemented using two one-dimensional filters. The vertical filter calculates the average of the input image in the vertical direction. The horizontal filter calculates the average of the image processed by the vertical filter in the horizontal direction. Because the pixels enter the system in row-by-row order, it is not sufficient to use a single accumulator to calculate the output of the vertical filter. Instead, we use a set of accumulators equal in number to the pixels in every row. The set of accumulators is implemented in memory. For example, if the image contains 512 pixels/row, there are 512 versions of the accumulator stored together in a single 512-word memory. The set of accumulators is updated row-by-row as the pixels enter the system.
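The per-column accumulator idea can be illustrated with a minimal Python sketch. This is a software stand-in, not the dissertation's hardware: the window height and image are illustrative, and border extension is omitted for brevity.

```python
# One accumulator per image column, stored in a row-width array (the hardware
# uses a row-width memory), updated row by row as the pixel stream arrives.

def vertical_moving_average(image, height):
    """image: list of rows of pixels; returns one row of column averages for
    every position of a sliding vertical window of the given height."""
    width = len(image[0])
    acc = [0] * width                # the "set of accumulators", one per column
    out = []
    for r, row in enumerate(image):
        for c in range(width):
            acc[c] += row[c]                     # pixel entering the window
            if r >= height:
                acc[c] -= image[r - height][c]   # pixel leaving the window
        if r >= height - 1:                      # window is full: emit averages
            out.append([a / height for a in acc])
    return out
```

For a 512-pixel row, the accumulator array corresponds to the single 512-word memory described above.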
The three different FIR blocks use different windows to calculate the estimate of local illumination. The first FIR block uses a one-dimensional window in the vertical direction, a one-dimensional window in the horizontal direction, and one two-dimensional square window of constant weights. Its hardware complexity is low, but it gives a poor estimate of local illumination. The second FIR block uses four two-dimensional square windows of constant weights. Its hardware complexity is moderate and it gives a better illumination estimate than the first FIR block. The third FIR block uses four two-dimensional square windows of Gaussian-like weights. Its hardware complexity is high compared to that of the previous two FIR blocks, but it gives the best local illumination estimate among the three. The following subsections will explain the hardware implementation of the different FIR blocks, including:

• Update formulas for the accumulators of the vertical and horizontal filters.

• Memory organizations that are used to prepare the data for the update formulas of the accumulators and to supply a delayed input pixel stream synchronized with its local average, ready for normalization.

• Block diagrams of some basic building blocks that are used to compute the local average, especially for the third FIR block.

3.1.1 FIR block 1

The first FIR block was suggested in [52]. It uses a vertical window of 30 × 1 pixels, a horizontal window of 1 × 30 pixels and a square window of 30 × 30 pixels. The weights of each window are constants with values of 1/32, 1/32, and 1/1024 for the vertical, horizontal and square window, respectively.
The update of the horizontal window for an image of 512 × 480 pixels is given in Equation 3.1:

a_H[n] = \begin{cases}
0 & \text{for } n = 0 \\
a_H[n-1] + P[n] + P[n] & \text{for } n = 1, \ldots, 15 \\
a_H[n-1] + P[n] - P[2 \times 15 - n + 1] & \text{for } n = 16, \ldots, 30 \\
a_H[n-1] + P[n] - P[n-30] & \text{for } n = 31, \ldots, 512 \\
a_H[n-1] + P[2 \times 512 - n + 1] - P[n-30] & \text{for } n = 513, \ldots, 526
\end{cases}   (3.1)

where a_H is the accumulator of the horizontal window and P is the input pixel stream.

Figure 3.1. The structure of the windows used by the first FIR block.

Figure 3.2. Memory organization for the vertical filter of FIR block 1.

The update of the vertical window is given by:

s_V[n] = \begin{cases}
0 & \text{for } n = 0 \\
s_V[n-1] + R[n] + R[n] & \text{for } n = 1, \ldots, 15 \\
s_V[n-1] + R[n] - R[2 \times 15 - n + 1] & \text{for } n = 16, \ldots, 30 \\
s_V[n-1] + R[n] - R[n-30] & \text{for } n = 31, \ldots, 480 \\
s_V[n-1] + R[2 \times 480 - n + 1] - R[n-30] & \text{for } n = 481, \ldots, 494
\end{cases}   (3.2)

where s_V is the set of accumulators for the vertical window and R is a row of the input pixel stream. The update of the square window is given by:

a_S[n] = \begin{cases}
0 & \text{for } n = 0 \\
a_S[n-1] + P_{ver\_ave}[n] + P_{ver\_ave}[n] & \text{for } n = 1, \ldots, 15 \\
a_S[n-1] + P_{ver\_ave}[n] - P_{ver\_ave}[2 \times 15 - n + 1] & \text{for } n = 16, \ldots, 30 \\
a_S[n-1] + P_{ver\_ave}[n] - P_{ver\_ave}[n-30] & \text{for } n = 31, \ldots, 512 \\
a_S[n-1] + P_{ver\_ave}[2 \times 512 - n + 1] - P_{ver\_ave}[n-30] & \text{for } n = 513, \ldots, 526
\end{cases}   (3.3)

where a_S is the accumulator of the square window and P_ver_ave is the vertical average around the pixel. The structure of the windows is shown in Figure 3.1. A small memory of 32 pixels is used as a delay block/stack to prepare the second and third terms of Equation 3.1. The memory is always updated by the incoming pixels. The output ports of the memory act as a stack, delay block, or both depending on the position of the incoming pixel in the image. For the first fifteen pixels, a_H takes its previous value plus twice the incoming pixel. As the fifteenth pixel enters the system, symmetric extension from the left is complete.
The system then starts producing outputs; the horizontal average is produced by dividing the value of the horizontal accumulator by thirty-two. For the next fifteen pixels, a_H takes its previous value plus the incoming pixel minus the value at the top of the stack, to remove one symmetrically extended pixel. For the remaining pixels of the row, a_H takes its previous value plus the incoming pixel minus the value at the end of the delay block, which represents the pixel leaving the sliding window. After all the pixels of the row enter the system, a buffer period is added to accomplish symmetric extension on the right. Here, a_H takes its previous value plus the value at the top of the stack minus the value at the end of the delay block. A similar memory organization and accumulator updating scheme is used to calculate the average of the square window. Instead of the pixel stream, we use the output of the vertical filter to update the accumulator of the square window, a_S. The vertical filter, on the other hand, needs a special memory organization. As mentioned before, the set of vertical accumulators is updated row-by-row. Although the updating scheme operates similarly to that of the horizontal filter, the memory needed is much larger. Figure 3.2 shows the memory organization of the vertical filter. In addition to preparing the data for the vertical accumulator updates, the memory is also used to supply a delayed input pixel stream. Rather than implement a separate memory for each purpose, we use memory words more efficiently by developing a single memory organization capable of simultaneously supplying data to the vertical filter and the delayed pixel stream. We divide the thirty rows needed for the vertical window into three separate memories holding fourteen, one, and fifteen rows, respectively. The physical memories are sized up to hold a number of rows that is a power-of-two.
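The four-phase accumulator update described above for the horizontal filter can be modeled in software. The sketch below is a Python stand-in for the hardware, written for a generic half-window h (h = 15 for the 30-pixel window in the text, h ≥ 2 assumed) and checked against direct averaging over a symmetrically extended row.

```python
# Software model of the four phases of the horizontal accumulator update
# (Equation 3.1), for a sliding window of length 2*h over one image row.

def horizontal_average(row, h):
    n_pix, win = len(row), 2 * h
    acc = 0.0
    out = []
    for n in range(h):                     # phase 1: left extension, each
        acc += 2 * row[n]                  # of the first h pixels counts twice
    out.append(acc / win)
    for n in range(h, win):                # phase 2: pop reflected pixels
        acc += row[n] - row[2 * h - n - 1]
        out.append(acc / win)
    for n in range(win, n_pix):            # phase 3: steady state
        acc += row[n] - row[n - win]
        out.append(acc / win)
    for n in range(n_pix, n_pix + h - 1):  # phase 4: right-extension buffer
        acc += row[2 * n_pix - n - 1] - row[n - win]
        out.append(acc / win)
    return out                             # one average per image pixel

def reference_average(row, h):
    """Direct computation over a symmetrically extended row, for checking."""
    ext = row[:h][::-1] + row + row[-(h - 1):][::-1]
    return [sum(ext[i:i + 2 * h]) / (2 * h) for i in range(len(row))]
```

Both functions produce exactly one output per image pixel, matching the 512 outputs produced for a 512-pixel row in the text.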
The memory organization ensures that the second and third terms of Equation 3.2 and the delayed pixel stream are in three different physical memories, so that they can be read simultaneously without conflict. The second port of the first two memories is used for memory writes as data is moved from one memory to the next. The physical memories used have two addresses. The first address is used to read/write at a certain location, while the second address is used only to read from a certain location. Each address is calculated using two counters: a row count that follows the rows in the image and a pixel count that follows the pixels in every row. The pixel counter is common to all addresses. Table 3.1 shows the states of the row counters used to generate the read/write and read address of every memory as the rows of the image enter the system. In the table, ++ denotes that the row counter of the corresponding memory should be incremented, -- denotes that it should be decremented, and NA denotes that it should be left unchanged. The row counters have a minimum value of zero and maximum values of fourteen, two, and fifteen for the first, second and third memory, respectively. Each counter rolls over when it reaches its minimum or maximum value. A multiplexer is used to supply the data to be written in the third memory. The state of this multiplexer is also shown in Table 3.1.

3.1.2 FIR block 2

The second FIR block uses four square windows of size 8x8, 14x14, 28x28 and 56x56 pixels. The weights in each window are constant, with values of 1/64, 1/196, 1/784 and 1/3136 for the first, second, third and fourth window, respectively. Note that the weights in each window sum to 1; each window computes an average of the pixels surrounding the pixel in the middle of the window. The structure of the windows is shown in Figure 3.3.
The updates of the sets of accumulators used to calculate the output of the vertical filters of the four different square windows are:

s_v^{8}[n] = \begin{cases}
0 & \text{for } n = 0 \\
s_v^{8}[n-1] + 2 \times R[n] & \text{for } n = 1, \ldots, 4 \\
s_v^{8}[n-1] + R[n] - R[2 \times 4 - n + 1] & \text{for } n = 5, \ldots, 8 \\
s_v^{8}[n-1] + R[n] - R[n-8] & \text{for } n = 9, \ldots, 480 \\
s_v^{8}[n-1] + R[2 \times 480 - n + 1] - R[n-8] & \text{for } n = 481, \ldots, 483
\end{cases}   (3.4)

s_v^{14}[n] = \begin{cases}
0 & \text{for } n = 0 \\
s_v^{14}[n-1] + 2 \times R[n] & \text{for } n = 1, \ldots, 7 \\
s_v^{14}[n-1] + R[n] - R[2 \times 7 - n + 1] & \text{for } n = 8, \ldots, 14 \\
s_v^{14}[n-1] + R[n] - R[n-14] & \text{for } n = 15, \ldots, 480 \\
s_v^{14}[n-1] + R[2 \times 480 - n + 1] - R[n-14] & \text{for } n = 481, \ldots, 486
\end{cases}   (3.5)

s_v^{28}[n] = \begin{cases}
0 & \text{for } n = 0 \\
s_v^{28}[n-1] + 2 \times R[n] & \text{for } n = 1, \ldots, 14 \\
s_v^{28}[n-1] + R[n] - R[2 \times 14 - n + 1] & \text{for } n = 15, \ldots, 28 \\
s_v^{28}[n-1] + R[n] - R[n-28] & \text{for } n = 29, \ldots, 480 \\
s_v^{28}[n-1] + R[2 \times 480 - n + 1] - R[n-28] & \text{for } n = 481, \ldots, 493
\end{cases}   (3.6)

s_v^{56}[n] = \begin{cases}
0 & \text{for } n = 0 \\
s_v^{56}[n-1] + 2 \times R[n] & \text{for } n = 1, \ldots, 28 \\
s_v^{56}[n-1] + R[n] - R[2 \times 28 - n + 1] & \text{for } n = 29, \ldots, 56 \\
s_v^{56}[n-1] + R[n] - R[n-56] & \text{for } n = 57, \ldots, 480 \\
s_v^{56}[n-1] + R[2 \times 480 - n + 1] - R[n-56] & \text{for } n = 481, \ldots, 507
\end{cases}   (3.7)

where s_v^{8}, s_v^{14}, s_v^{28} and s_v^{56} correspond to the sets of accumulators used to calculate the vertical averages of size 8, 14, 28 and 56, respectively, and R is a row of the input pixel stream.

Table 3.1. Different states of the row counters used for the read/write and read addresses of the memories, and the state of the mux select, for the vertical filter of FIR block 1. Memory 1 is a delay block/stack, memory 2 a delay block, and memory 3 a delay block/stack.

Row in image   memory 1          memory 2          memory 3          mux select
               read/write  read  read/write  read  read/write  read
1              NA          NA    NA          NA    NA          NA    pixel stream
2 to 15        ++          NA    NA          NA    ++          ++    memory 2
16             ++          ++    ++          NA    ++          NA    memory 2
17 to 480      ++          ++    ++          ++    --          --    memory 2
481 to 494     --          ++    ++          ++    --          --    memory 2

Figure 3.3. The structure of the windows used by the second FIR block.

Symmetric extension is needed at the top border of the image.
Symmetric extension is accomplished by having s_v^8 take its previous value plus twice the input row as the first four rows enter the system. As rows five to eight enter the system, s_v^8 takes its previous value plus the input row minus the contents of a stack; this stack holds the extension row that is leaving the window at the top as the window shifts down. As the remaining rows enter the system, s_v^8 takes its previous value plus the input row minus the contents of a delay block; the delay block holds the image row that is leaving the window at the top as the window shifts down. After all the rows enter the system, a buffer period is added to accomplish symmetric extension at the bottom. For the buffer period, s_v^8 takes its previous value plus the contents of a stack minus the contents of a delay block. A similar update scheme is used for the three other sets of accumulators. The updating scheme for the sets of accumulators used to calculate the four different vertical averages is now completely described with one exception: a memory organization for supplying data to the four differently-sized windows. Memory is also needed to supply a delayed input pixel stream, synchronized such that each input pixel appears simultaneously with its different averages. The memory organization described here for the second FIR block will also be used for the third FIR block.

Figure 3.4. Memory organization of the vertical filter of the second and third FIR blocks.

As shown in Figure 3.4, we divided the fifty-six rows of pixels needed for the update of the largest window into ten separate memories. Memories are numbered from left to right. Memory one holds fourteen rows of pixels and is used as a delay block as well as a stack. Memories two, three, four, and five hold seven, three, three, and one row of pixels, respectively, and are used as delay blocks.
Memories six, seven, eight, nine and ten hold one, three, three, seven and fourteen rows, respectively, and are used as delay blocks as well as stacks. All these memories are implemented using two-port physical memories. The physical memories are sized up to hold a number of rows that is a power-of-two. A single-port physical memory of sixteen rows is added to the above organization to hold a copy of rows 454 to 466 of the image. The memory organization ensures that all pixels needed to update the four different sets of accumulators in any given clock cycle are found in different physical memories, so that they can be read simultaneously without conflict. This includes the addition and subtraction terms for each update equation in 3.4 to 3.7. The second port of the two-port physical memories is used for memory writes as data is moved from one memory to the next. As rows one to fourteen enter the system, they are written in memory one. At the same time, they are written in memories six to nine in the following manner: row one in memory six, rows two to four in memory seven, rows five to seven in memory eight, and rows eight to fourteen in memory nine. As rows fifteen to twenty-eight enter the system, they are written in memory one and memory ten. At the same time, the contents of memory one are moved sequentially to memories two, three, four and five. As the remaining rows enter the system, they are written to memory one. At the same time, the contents of every memory are transferred to its neighbor. However, the behavior of memories one to five is slightly different from the behavior of memories six to ten. While the rows are written to and read from memories one to five in the forward direction, they are written and read in memories six to ten in the backward direction. This helps memories six to ten act as a stack for rows twenty-nine to fifty-six and as a delay block for the remaining rows.
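The net effect of this organization, four accumulator sets of different window heights all updated from a single incoming row stream, can be sketched in software. This is a minimal illustrative model, not the hardware itself: heights and image are tiny, and border extension is omitted.

```python
# Four (here: two) accumulator sets of different window heights, all fed by
# the same row stream. In hardware, only the most recent max(heights) rows
# need to be stored, which is what the ten row memories provide.

def multi_height_column_sums(rows, heights):
    width = len(rows[0])
    accs = {h: [0] * width for h in heights}   # one accumulator set per height
    outputs = {h: [] for h in heights}
    for r, row in enumerate(rows):
        for h in heights:
            acc = accs[h]
            for c in range(width):
                acc[c] += row[c]               # row entering every window
                if r >= h:
                    acc[c] -= rows[r - h][c]   # row leaving this window
            if r >= h - 1:
                outputs[h].append(list(acc))   # vertical sums for this height
    return outputs
```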
During the first fourteen rows of the buffer period that follows the last row of the image, memory one is not written. Rather, memory one acts as a stack and delay block. The contents of memory five are written to memory six as well as the additional memory. All other memories keep functioning as before. During the last thirteen rows of the buffer period, memory one acts as a stack. The contents of memory five are no longer written to the additional memory. The additional memory acts as a stack. The remaining memories keep functioning as before.

Figure 3.5. A block diagram of the hardware used to implement the four horizontal filters of the second FIR block.

The output of the vertical filters is fed to the input of the horizontal filters to calculate the average of the square windows. The updates of the accumulators used to calculate the horizontal average of the four different square windows are given by the following equations:

a_S^{8}[n] = \begin{cases}
0 & \text{for } n = 0 \\
a_S^{8}[n-1] + 2 \times P^{8}_{ver\_ave}[n] & \text{for } n = 1, \ldots, 4 \\
a_S^{8}[n-1] + P^{8}_{ver\_ave}[n] - P^{8}_{ver\_ave}[2 \times 4 - n + 1] & \text{for } n = 5, \ldots, 8 \\
a_S^{8}[n-1] + P^{8}_{ver\_ave}[n] - P^{8}_{ver\_ave}[n-8] & \text{for } n = 9, \ldots, 512 \\
a_S^{8}[n-1] + P^{8}_{ver\_ave}[2 \times 512 - n + 1] - P^{8}_{ver\_ave}[n-8] & \text{for } n = 513, \ldots, 515
\end{cases}   (3.8)

a_S^{14}[n] = \begin{cases}
0 & \text{for } n = 0 \\
a_S^{14}[n-1] + 2 \times P^{14}_{ver\_ave}[n] & \text{for } n = 1, \ldots, 7 \\
a_S^{14}[n-1] + P^{14}_{ver\_ave}[n] - P^{14}_{ver\_ave}[2 \times 7 - n + 1] & \text{for } n = 8, \ldots, 14 \\
a_S^{14}[n-1] + P^{14}_{ver\_ave}[n] - P^{14}_{ver\_ave}[n-14] & \text{for } n = 15, \ldots, 512 \\
a_S^{14}[n-1] + P^{14}_{ver\_ave}[2 \times 512 - n + 1] - P^{14}_{ver\_ave}[n-14] & \text{for } n = 513, \ldots, 518
\end{cases}   (3.9)

a_S^{28}[n] = \begin{cases}
0 & \text{for } n = 0 \\
a_S^{28}[n-1] + 2 \times P^{28}_{ver\_ave}[n] & \text{for } n = 1, \ldots, 14 \\
a_S^{28}[n-1] + P^{28}_{ver\_ave}[n] - P^{28}_{ver\_ave}[2 \times 14 - n + 1] & \text{for } n = 15, \ldots, 28 \\
a_S^{28}[n-1] + P^{28}_{ver\_ave}[n] - P^{28}_{ver\_ave}[n-28] & \text{for } n = 29, \ldots, 512 \\
a_S^{28}[n-1] + P^{28}_{ver\_ave}[2 \times 512 - n + 1] - P^{28}_{ver\_ave}[n-28] & \text{for } n = 513, \ldots, 525
\end{cases}   (3.10)

a_S^{56}[n] = \begin{cases}
0 & \text{for } n = 0 \\
a_S^{56}[n-1] + 2 \times P^{56}_{ver\_ave}[n] & \text{for } n = 1, \ldots, 28 \\
a_S^{56}[n-1] + P^{56}_{ver\_ave}[n] - P^{56}_{ver\_ave}[2 \times 28 - n + 1] & \text{for } n = 29, \ldots, 56 \\
a_S^{56}[n-1] + P^{56}_{ver\_ave}[n] - P^{56}_{ver\_ave}[n-56] & \text{for } n = 57, \ldots, 512 \\
a_S^{56}[n-1] + P^{56}_{ver\_ave}[2 \times 512 - n + 1] - P^{56}_{ver\_ave}[n-56] & \text{for } n = 513, \ldots, 538
\end{cases}   (3.11)

where a_S and P_ver_ave are the accumulator of the horizontal window and the vertical average around the pixel, respectively. Figure 3.5 shows the hardware used to implement the horizontal filters. A set of registers at the input of every filter is used to prepare the data for the update equations. The sets are used as delay blocks and stacks of size fifty-six, twenty-eight, fourteen and eight for horizontal filters four, three, two and one, respectively. Sets of registers of size fourteen, twenty-one and twenty-four are connected to the outputs of horizontal filters three, two and one, respectively. These sets of registers are used as delay blocks so that the four different averages appear synchronously at the output. Another set of registers is used to delay the pixel stream so that each pixel appears synchronously with its four different square averages; in this way, it is ready for normalization.

3.1.3 FIR block 3

The third FIR block uses four square windows of size 8x8, 14x14, 28x28 and 56x56 pixels. A weighted average is computed in each window, with higher weights for the pixels in the middle and an approximately exponential decay of weights around the middle. Figure 3.6 shows the weights of these windows. Expressions for the weights are given in Table 3.2, where ⌈·⌉ denotes the ceiling function. Note that these windows are an approximation of the Gaussian surrounds used in Reinhard's local TMO [14] to calculate the local estimate of illumination, as described in Section 2.1.2. Like a Gaussian filter, higher weights are given to the pixels in the center of the window.
The special relationship among the weights ensures that computation of the averages can be done with simple accumulator structures. We demonstrate the idea using two one-dimensional windows that are used, as described later, to generate all four two-dimensional windows.

Figure 3.6. The base-2 logarithm of the pixel weights for the four different windows used by the third FIR block.

Table 3.2. Pixel weights for the four windows in the third FIR block.

window   upper left quadrant        upper right quadrant       lower left quadrant        lower right quadrant
8x8      2^{2i-8} × 2^{2j-8}        2^{2i-8} × 2^{10-2j}       2^{10-2i} × 2^{2j-8}       2^{10-2i} × 2^{10-2j}
14x14    2^{i-7} × 2^{j-7}          2^{i-7} × 2^{8-j}          2^{8-i} × 2^{j-7}          2^{8-i} × 2^{8-j}
28x28    2^{⌈i/2⌉-7} × 2^{⌈j/2⌉-7}  2^{⌈i/2⌉-7} × 2^{8-⌈j/2⌉}  2^{8-⌈i/2⌉} × 2^{⌈j/2⌉-7}  2^{8-⌈i/2⌉} × 2^{8-⌈j/2⌉}
56x56    2^{⌈i/4⌉-7} × 2^{⌈j/4⌉-7}  2^{⌈i/4⌉-7} × 2^{8-⌈j/4⌉}  2^{8-⌈i/4⌉} × 2^{⌈j/4⌉-7}  2^{8-⌈i/4⌉} × 2^{8-⌈j/4⌉}

The first one-dimensional window is given by the weights:

w[k] = \begin{cases}
\frac{1}{2^6} 2^{k-1} & \text{for } k = 1, \ldots, 7 \\
\frac{1}{2^6} 2^{14-k} & \text{for } k = 8, \ldots, 14
\end{cases}   (3.12)

These weights are shown on a logarithmic scale in Figure 3.7. The window has a length of fourteen pixels and is characterized by even symmetry. The output of the basic window for pixel n can be expressed as:

y[n] = \frac{1}{2^6} \left( \sum_{k=1}^{7} P[n-8+k] \times 2^{k-1} + \sum_{k=8}^{14} P[n-8+k] \times 2^{14-k} \right)   (3.13)

Note that in the first half of the window, each weight is twice the one immediately preceding it. This half of the window can be considered a rising geometric series with a gain of two. Similarly, the second half of the window is a falling geometric series, also with a gain of two, so that the overall gain of the window is four. We can take advantage of the structure of the window in order to compute the window output using very simple accumulator-based hardware.
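As a quick numeric check (illustrative only, not part of the design), the 14-tap weights of Equation 3.12 can be enumerated and their sum compared with the nominal gain of four used for normalization:

```python
# Enumerate the 14-tap exponential weights of Equation 3.12.

def exp_weights_14():
    # w[k] = 2^(k-1)/2^6 for k = 1..7 and 2^(14-k)/2^6 for k = 8..14
    return [2 ** (k - 1) / 2 ** 6 if k <= 7 else 2 ** (14 - k) / 2 ** 6
            for k in range(1, 15)]

w = exp_weights_14()
# The two central taps (k = 7 and k = 8) carry the peak weight of 1, and the
# exact sum of the weights is 254/64 = 3.96875, which the hardware
# normalizes by the nominal gain factor of four.
```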
One accumulator, a1, is devoted to the rising geometric series that is the first summation in Equation 3.13; a second accumulator, a2, tracks the falling geometric series that is the second summation in the equation. The output of the window can then be found by simply adding the two accumulators together and dividing by the gain factor of four. As we slide the window one pixel further along the data, we need only update the accumulators in the following simple way. For the first accumulator, the incoming pixel has the largest weight, of 2^6. All pixels already in the contents of the accumulator have a weight of one-half of whatever the weight was before the sliding of the window. Therefore, we can simply halve the entire contents of the accumulator, subtract the pixel on the extreme left hand side of the window weighted by 2^{-1}, and then add the pixel in the middle of the window weighted by 2^6. The second accumulator is responsible for the falling geometric series. Here, the incoming pixel has the smallest weight, 2^0, and the accumulator is doubled each time the window slides. Therefore, we can simply double the entire contents of the accumulator, add the newest pixel, and subtract the pixel in the middle of the window weighted by 2^7. The accumulator updates described so far apply only for windows that lie completely within the borders of the image; modified updates are required near the image borders. These updates accomplish symmetric extension while still inputting each image pixel only once. The full form of the updates for the accumulators and the calculation of the window average at the output of the window are given in the following equations:

Figure 3.7. The basic one-dimensional window on the logarithmic scale.
a_1[n] = \begin{cases}
0 & \text{for } n = 0, \ldots, 6 \\
a_2[n] & \text{for } n = 7 \\
\frac{a_1[n-1]}{2} + 2^6 \times P[n-7] - 2^{-1} \times P[2 \times 7 - n + 1] & \text{for } n = 8, \ldots, 14 \\
\frac{a_1[n-1]}{2} + 2^6 \times P[n-7] - 2^{-1} \times P[n-14] & \text{for } n = 15, \ldots, 518
\end{cases}   (3.14)

a_2[n] = \begin{cases}
0 & \text{for } n = 0 \\
2 \times a_2[n-1] + P[n] & \text{for } n = 1, \ldots, 7 \\
2 \times a_2[n-1] - 2^7 \times P[n-7] + P[n] & \text{for } n = 8, \ldots, 512 \\
2 \times a_2[n-1] - 2^7 \times P[n-7] + P[2 \times 512 - n] & \text{for } n = 513, \ldots, 518
\end{cases}   (3.15)

P_{ave}[n] = \begin{cases}
0 & \text{for } n = 0, \ldots, 6 \\
\frac{1}{2} \left( \frac{a_1[n]}{2^7} + \frac{a_2[n]}{2^7} \right) & \text{for } n = 7, \ldots, 518
\end{cases}   (3.16)

For our one-dimensional example, both accumulators are reset at the beginning of the computation. As the first six pixels enter the system, a1 is disabled, and a2 takes twice its previous value plus the incoming pixel. During these first six pixels, the window is not yet full, and therefore the output is disabled. As the seventh pixel enters the window, symmetric extension of the left hand side of the row is accomplished by transferring the value of a2 to a1. The window is now full, and the first output can be produced; computation of the average, P_ave, at the output of the filter according to Equation 3.16 is enabled. The first seven pixels are also sent to a stack. Pixels from the stack are used in updating a1 for the next seven pixels. The updates for the accumulators must also be modified as we near the end of the row of pixels, in order to accomplish symmetric extension of the right hand side of the image. Here, as the last six pixels of a row come in, we send them to a stack. Pixels from the stack are used in updating a2, thereby accomplishing the reflection of pixels needed at the end of the row. During this phase, the computation of the average continues to be enabled, as we produce the last six window outputs. All the functions described in Equations 3.14 through 3.16 are implemented inside an arithmetic logic unit (ALU1).
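The steady-state behavior of the two accumulators can be checked with a small software model. This sketch covers interior pixels only (the case away from the borders); it is an illustrative stand-in, not the actual hardware, and it verifies one sliding step against direct weighted sums.

```python
# Two-accumulator model for the 14-tap window: a1 holds the rising series
# (raw weights 2^0..2^6 over the left half), a2 the falling series
# (2^6..2^0 over the right half).

def direct_sums(window):
    """Raw accumulator contents for a full 14-pixel window."""
    a1 = sum(p * 2 ** k for k, p in enumerate(window[:7]))
    a2 = sum(p * 2 ** (6 - k) for k, p in enumerate(window[7:]))
    return a1, a2

def slide(a1, a2, p_left, p_mid, p_new):
    """One steady-state step: p_left leaves on the left, p_mid crosses from
    the falling half to the rising half, p_new enters on the right."""
    a1 = a1 / 2 + 2 ** 6 * p_mid - p_left / 2   # halve, add middle, drop left
    a2 = 2 * a2 - 2 ** 7 * p_mid + p_new        # double, drop middle, add new
    return a1, a2

def output(a1, a2):
    # Normalize by the 2^6 weight scale times the nominal gain of four.
    return (a1 + a2) / 2 ** 8
```

After one slide, the accumulators agree exactly with direct sums over the shifted window, which is what makes the single-add/subtract hardware update possible.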
Figure 3.8 shows the hardware inside ALU1 for the computation performed in the middle of the image; acc1, acc2 and ave represent the first accumulator, the second accumulator, and the pixel average, respectively. In the figure, each line is annotated with its bit width; the input bit width used in this example is 21 bits. ALU1 for this case consists mainly of a number of right-shifts and left-shifts, as well as three adders and two subtractors. Similar hardware is used for the other cases; muxes are used to choose between the cases. ALU1 is used to generate the three windows of size 14x14, 28x28 and 56x56 pixels.

Figure 3.8. Basic function of ALU1 when processing the middle of the image.

The second one-dimensional window is given by the weights:

w[k] = \begin{cases}
\frac{1}{2^6} 2^{2k-2} & \text{for } k = 1, \ldots, 4 \\
\frac{1}{2^6} 2^{16-2k} & \text{for } k = 5, \ldots, 8
\end{cases}   (3.17)

The window has a length of eight pixels and is characterized by even symmetry. The output of the basic window for pixel n can be expressed as:

y[n] = \frac{1}{2^6} \left( \sum_{k=1}^{4} P[n-5+k] \times 2^{2k-2} + \sum_{k=5}^{8} P[n-5+k] \times 2^{16-2k} \right)   (3.18)

Note that in the first half of the window, each weight is four times the one immediately preceding it. This half of the window can be considered a rising geometric series with a gain of 1.333. Similarly, the second half of the window is a falling geometric series, also with a gain of 1.333, so that the overall gain of the window is 2.666. We can take advantage of the structure of the window in order to compute the window output using very simple accumulator-based hardware. One accumulator, a1, is devoted to the rising geometric series that is the first summation in Equation 3.18; a second accumulator, a2, tracks the falling geometric series that is the second summation in the equation. The output of the window can then be found by simply adding the two accumulators together and dividing by the gain factor of 2.666.
The full form of the updates for the accumulators and the calculation of the window average at the output of the window is given in the following equations:

a_1[n] = \begin{cases}
0 & \text{for } n = 0, \ldots, 3 \\
a_2[n] & \text{for } n = 4 \\
\frac{a_1[n-1]}{4} + 2^6 \times P[n-4] - 2^{-2} \times P[2 \times 4 - n + 1] & \text{for } n = 5, \ldots, 8 \\
\frac{a_1[n-1]}{4} + 2^6 \times P[n-4] - 2^{-2} \times P[n-8] & \text{for } n = 9, \ldots, 515
\end{cases}   (3.19)

a_2[n] = \begin{cases}
0 & \text{for } n = 0 \\
4 \times a_2[n-1] + P[n] & \text{for } n = 1, \ldots, 4 \\
4 \times a_2[n-1] - 2^8 \times P[n-4] + P[n] & \text{for } n = 5, \ldots, 512 \\
4 \times a_2[n-1] - 2^8 \times P[n-4] + P[2 \times 512 - n] & \text{for } n = 513, \ldots, 515
\end{cases}   (3.20)

P_{ave}[n] = \begin{cases}
0 & \text{for } n = 0, \ldots, 3 \\
\frac{3}{8} \left( \frac{a_1[n]}{2^6} + \frac{a_2[n]}{2^6} \right) & \text{for } n = 4, \ldots, 515
\end{cases}   (3.21)

All the functions described in Equations 3.19 through 3.21 are implemented inside an arithmetic logic unit (ALU2). Figure 3.9 shows the hardware inside ALU2 for the computation performed in the middle of the image. Similar hardware is used for the other cases; muxes are used to choose between the cases. ALU2 is used to generate the smallest window, of size 8x8 pixels.

Figure 3.9. Basic function of ALU2 when processing the middle of the image.

The horizontal computation block (HCB) is used to compute one-dimensional window averages along the rows of an image. We have two kinds of HCBs, each using a different ALU (ALU1 or ALU2), along with two registers to hold the values of acc1 and acc2. The general architecture of HCB1 and HCB2 is shown in Figure 3.10. An enable signal allows a particular horizontal computation block to be selectively enabled or disabled. The inputs to a horizontal computation block are the new pixel, the pixel in the middle of the window and the pixel at the extreme left end of the window. A similar block, called the vertical computation block (VCB), is used to compute one-dimensional window averages along the columns of an image. We have two VCBs, each using a different ALU, along with two sets of registers to hold the values of acc1 and acc2.
The length of each set of accumulators is equal to the number of pixels in a row. The general architecture of VCB1 and VCB2 is shown in Figure 3.11. The vertical computation block has two enable signals: enable2 is used to selectively enable or disable the updates of the accumulators, and enable1 is used to read accumulator values and use them to calculate the average at the output of the block. We implement a two-dimensional window by first processing the pixels vertically, down the columns of the image, and then processing the output of the vertical window horizontally. The smallest window uses one VCB2 followed by one HCB2 to get a window of size 8x8 pixels. If we were to use one VCB1 followed by one HCB1, we would get a two-dimensional window of 14x14 pixels. However, our algorithm relies on two additional windows, of sizes 28x28 and 56x56. The weights for the pixels in the two additional windows, which were given in Table 3.2, can be easily implemented using HCB1 and VCB1 as building blocks. We use the 56x56 pixel window as an example; construction of the smaller window is similar. The 56x56 pixel window weights are derived by repeating each window weight in the basic 14x14 window four times in the row and column directions. Accordingly, we use four each of the VCB1 and HCB1 blocks. Figure 3.12 is a block diagram of the required hardware. As the rows of image pixels enter the system, we route them round-robin style to the four VCB1s, sending the first row to the first VCB1, the second row to the second VCB1, the third row to the third VCB1, the fourth row to the fourth VCB1, and then the fifth row back to the first VCB1, and so on. The overall vertical average is found by summing the outputs of the four VCB1s at every time step. A delay block/stack of 56 pixels is used to buffer the vertical averages on their way to the HCB1s.
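The reason four copies of the 14-tap filter realize the 56-tap window can be shown in one dimension: repeating each 14-tap weight four times is the same as running the 14-tap filter over the four interleaved phases of the data and summing, which is exactly the round-robin routing described above. The sketch below uses raw (un-normalized) weights for clarity and is illustrative only.

```python
# 14-tap raw weights (rising series 2^0..2^6, falling series 2^6..2^0) and
# the 56-tap window obtained by repeating each weight four times.
W14 = [2 ** k for k in range(7)] + [2 ** (6 - k) for k in range(7)]
W56 = [W14[j // 4] for j in range(56)]

def window56_direct(x):
    """Direct 56-tap weighted sum over 56 samples."""
    return sum(w * p for w, p in zip(W56, x))

def window56_round_robin(x):
    """Same result from four 14-tap filters over the interleaved phases."""
    return sum(sum(w * p for w, p in zip(W14, x[phase::4]))
               for phase in range(4))
```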
The pixels are routed round-robin style to the four HCB1s, and the outputs of the HCB1s are summed at each time step to produce the final output, a weighted average in a two-dimensional window.

Figure 3.10. General architecture of horizontal computation block (HCB1 or HCB2).

Figure 3.11. General architecture of vertical computation block (VCB1 or VCB2).

Figure 3.12. Hardware for the 56x56 pixel window.

Figure 3.13. Block diagram for computation of the multi-scale average.

Figure 3.14. Memory organization to supply data to the four windows, and for the delayed pixel stream.

Because the four windows, with their different sizes, require different processing latencies, delay blocks are used at the outputs of the smaller windows for the purpose of synchronizing the four different averages corresponding to a given pixel, as shown in Figure 3.13. The hardware architecture of FIR block 3 is now completely described with one exception: a memory organization for supplying data to the four differently-sized windows. Memory is also used to supply a delayed version of the pixel stream so that it appears synchronously with the four different vertical averages. As shown in Figure 3.14, the memory organization is the same as the one used for FIR block 2 with one exception: the read port of memory six is used to generate the pixels in the middle of the four different windows. The control of the memory organization is described in detail in Section 3.1.2.

3.2 Average selection criteria

This section describes two methods for estimating the local illumination around a pixel. The first method calculates a composite average from the vertical, horizontal and square windows of FIR block 1. The second method selects one of the outputs of the different square windows used by FIR blocks 2 and 3 as a local average for the pixel. The idea behind both methods is to choose the largest neighborhood around the pixel that does not contain a large illumination variation.
3.2.1 Average selection criterion 1

FIR block 1, which was suggested in [52] and described in Section 3.1.1, computes three averages around each pixel: a vertical average (Pver_ave), a horizontal average (Phor_ave), and a square average (Psquare_ave). The three averages are computed using a vertical window of 30×1 pixels, a horizontal window of 1×30 pixels and a square window of 30×30 pixels. The three averages can be combined to create a composite average:

Pcom_ave = [wV × (Pver_ave − Psquare_ave) + wH × (Phor_ave − Psquare_ave)] / 256 + Psquare_ave    (3.22)

where wV and wH represent vertical and horizontal weights, respectively. To calculate the weights of the vertical and horizontal averages we use the difference between two consecutive horizontal averages in the vertical direction (dH) and the difference between two consecutive vertical averages in the horizontal direction (dV), where wV = dH × (16 − dV) and wH = dV × (16 − dH). The values of dV and dH are truncated and clamped to a minimum of zero and a maximum of sixteen. The idea behind the composite average is to decrease the weight of the vertical average if the difference between two consecutive vertical averages is large and the difference between two consecutive horizontal averages is small. This would happen if the image had an abrupt vertical change in illumination. Similarly, if the difference between two consecutive horizontal averages is large and the difference between two consecutive vertical averages is small, the weight of the horizontal average will be small. As an example, if the horizontal difference is equal to sixteen and the vertical difference is equal to zero, then the local pixel average is equal to the average in the vertical direction only.

3.2.2 Average selection criterion 2

FIR blocks 2 and 3, described in Sections 3.1.2 and 3.1.3, respectively, produce a number of local averages around each pixel using successively larger surrounds.
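Criterion 1 above is small enough to capture in a Python sketch (illustrative names; the clamping and the divide-by-256 follow Equation 3.22 and the weight definitions in Section 3.2.1):

```python
def composite_average(p_ver, p_hor, p_sq, d_v, d_h):
    """Composite average of Equation 3.22. d_v and d_h are the directional
    differences; each is clamped to 0..16 before forming the weights
    wV = dH*(16 - dV) and wH = dV*(16 - dH)."""
    d_v = max(0, min(16, d_v))
    d_h = max(0, min(16, d_h))
    w_v = d_h * (16 - d_v)        # small when a vertical illumination edge is seen
    w_h = d_v * (16 - d_h)        # small when a horizontal illumination edge is seen
    return (w_v * (p_ver - p_sq) + w_h * (p_hor - p_sq)) // 256 + p_sq
```

With a horizontal difference of sixteen and a vertical difference of zero the weights become wV = 256 and wH = 0, and the function returns the vertical average exactly, matching the worked example in the text.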
This selection criterion selects one of the different local averages calculated in FIR block 2 or 3 as a suitable estimate of local illumination for the pixel. The difficulty is in determining how large a surrounding area to use in calculating the local average; if too small a surround is used, it does not give an accurate estimate of the local illumination, but if too large a surround is used, the surround may encompass a change in brightness that will throw off the estimate. Halo effects are caused when too large a surround encompasses a sudden change in brightness levels, causing a pixel to be normalized incorrectly. Reinhard's solution [14] is to consider successively larger surrounds. A smaller surround is eliminated from consideration if the relative difference between its estimate and the estimate of the next larger surround is no more than a set threshold; then, the larger surround is judged to include no additional sudden changes in brightness level, so that it is judged to be a better estimate of illumination. The algorithm used is similar to Reinhard's selection criterion. The pseudocode of the algorithm is given in Table 3.3, where ave1, ave2, ave3 and ave4 are the four input averages and ε is a selection parameter. Two hardware blocks that are helpful in implementing this pseudocode are described in detail in Sections 4.1 and 4.2. The first, called convert-to-floating, converts fixed-point numbers to floating-point. The second, called reciprocal, finds the reciprocal of the mantissa of a floating-point number. Notice that for special values of ε, we need only the exponents of the respective ratios to select the correct average. For example, if ε = 0.0625 = 0.5 × 2^−3, any ratio that has an exponent larger than −3 will be larger than ε. The process of computing the exponent of ratio1 will be discussed in detail below. The exponents of the other ratios are computed in a similar manner.

Table 3.3. Pseudocode for average selection criterion 2.
ratio1 = abs(ave1 − ave2) / ave1
ratio2 = abs(ave2 − ave3) / ave2
ratio3 = abs(ave3 − ave4) / ave3
if ratio1 >= ε then
    ave = ave1
elseif ratio2 >= ε then
    ave = ave2
elseif ratio3 >= ε then
    ave = ave3
else
    ave = ave4
end

Figure 3.15. The internal hardware for calculating the exponent of ratio1.

The internal hardware for calculating the exponent of ratio1 is shown in Figure 3.15. We start by getting the absolute difference between ave1 and ave2. Both the absolute difference and ave1 are converted to floating-point representation using the convert-to-floating block. Then we find the reciprocal of the mantissa of ave1 using the reciprocal block. The ratio1 mantissa is found by multiplying the reciprocal of the mantissa of ave1 with the mantissa of the absolute difference. Finally, to get the ratio1 exponent (exp_ratio1), we subtract from the exponent of the absolute difference (exp_abs_diff) the sum of the exponent of ave1 (exp_ave1) and the complement (M̄) of the most significant bit M of the ratio1 mantissa:

exp_ratio1 = exp_abs_diff − (exp_ave1 + M̄)    (3.25)

The complement of the most significant bit of the ratio1 mantissa is included in Equation 3.25 to make sure that the ratio1 mantissa is greater than or equal to 0.5.

Figure 3.16. MGIP local illumination estimate.

The three different FIR blocks, described in Section 3.1, and the two different selection criteria, described in Section 3.2, are used to implement three different local illumination estimates. The first estimate is implemented inside a system developed by Lockheed Martin for multi-gain image processing called MGIP [52]. The second and third estimates are two different fixed-point approximations of the Reinhard method that use two different FIR blocks. A high level block diagram of the MGIP local illumination estimate is shown in Figure 3.16. The input to the estimate is the HDR pixel stream.
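In software, the selection of Table 3.3 reduces to a short loop (a floating-point sketch; the hardware compares only exponents for the special values of ε discussed above, and our function generalizes the table from four averages to any number):

```python
def select_average(aves, eps=0.5 * 2 ** -3):
    """Walk the scales from smallest to largest and return the first average
    whose relative difference from the next scale is at least eps; if every
    difference is small, the largest surround is trusted (cf. Table 3.3)."""
    for k in range(len(aves) - 1):
        ratio = abs(aves[k] - aves[k + 1]) / aves[k]
        if ratio >= eps:
            return aves[k]
    return aves[-1]
```

Because ε = 0.5 × 2^−3 is 0.5 times a power of two, the hardware never forms these ratios explicitly; it only checks whether the exponent of each ratio exceeds −3.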
FIR block 1 takes the HDR input pixel stream and calculates three different averages, Pver_ave, Phor_ave, and Psquare_ave. Then, average selection criterion 1 uses the three averages to calculate a composite average for the pixel and uses it as a local illumination estimate.

Figure 3.17. Four-scale constant weight local illumination estimate.

Figure 3.18. Four-scale approximate Gaussian local illumination estimate.

The second estimate is called the four-scale constant weight pyramid local illumination estimate. A high level block diagram of the estimate is shown in Figure 3.17. FIR block 2 takes the HDR input pixel stream and calculates four different averages using four square windows of constant weights around every pixel. Then, average selection criterion 2 selects the largest of these four windows that does not encompass a sudden change in illumination and uses its average as an appropriate estimate of local illumination for the pixel. The third estimate is called the four-scale approximate Gaussian pyramid local illumination estimate. A high level block diagram of the estimate is shown in Figure 3.18. FIR block 3 takes the HDR input pixel stream and calculates four different averages using four square windows of Gaussian-like weights, computed using power series, around every pixel. Then, average selection criterion 2 selects the largest of these four windows that does not encompass a sudden change in illumination and uses its average as an appropriate estimate of local illumination for the pixel. The three different local illumination estimates will be used by the local TMOs described in Chapter V.

CHAPTER IV

LOG AVERAGE LUMINANCE AND NORMALIZATION METHODS

The local illumination estimate, explained in the previous chapter, is combined with a logarithmic average of the luminance in the scene to normalize each pixel and find the display luminance.
This chapter describes the computations performed by the log average luminance and the normalization methods, and explains their hardware implementation. The chapter is organized as follows. First, two hardware blocks that are helpful in implementing both methods are described: the convert-to-floating block in Section 4.1 and the reciprocal block in Section 4.2. Section 4.3 describes the hardware implementation of the log average luminance. Finally, Section 4.4 describes the normalization method in detail.

4.1 Convert-to-floating block

The convert-to-floating block takes as input a 32-bit fixed-point number. The output is a floating-point number with 17 bits of mantissa and 5 bits of exponent. The internal hardware of this block is shown in Figure 4.1. It consists of a number of registers and multiplexers. It starts by padding the input with 16 zeros from the right, for a total of 48 bits. If there is significant information in the top 16 bits, the most significant bit of the exponent is set to '1', and the most significant 32 bits of the first register are transferred to the second register; otherwise, the most significant bit of the exponent is set to '0', and the least significant 32 bits of the first register are transferred to the second register. The process continues in a similar manner, considering fewer bits at each step, until the values of all the bits of the exponent are decided and the appropriate bits have been shifted to the output mantissa. The MSB of the mantissa is always set to one, and the value of the output exponent is shifted by one accordingly. Hence the smallest number that can be represented by this block is 0.5 × 2^1 and the largest number is (1 − 2^−18) × 2^32.

Figure 4.1. The internal hardware of the convert-to-floating block.

4.2 Reciprocal block

The reciprocal block should produce the reciprocal of the mantissa. We borrow from the Newton-Raphson algorithm [55] to implement this block.
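The normalization performed by the convert-to-floating block can be mimicked with integer shifts (a behavioural sketch, not the register-and-mux binary search of Figure 4.1; the 5-bit exponent width and zero handling are simplified here):

```python
def convert_to_floating(x, mant_bits=17):
    """Normalise a fixed-point integer to a mantissa with its MSB set
    (value in [0.5, 1)) and an exponent e with x = mantissa * 2**(e - mant_bits),
    in the spirit of the convert-to-floating block of Section 4.1."""
    if x == 0:
        return 0, 0
    e = x.bit_length()                    # x lies in [2**(e-1), 2**e)
    if e < mant_bits:
        mant = x << (mant_bits - e)       # shift the leading '1' to the top
    else:
        mant = x >> (e - mant_bits)       # keep the top mant_bits bits
    return mant, e
```

For x = 4092 this returns mantissa 130944 and exponent 12; 130944/2^17 = 0.99902…, and 0.99902… × 2^12 recovers 4092 exactly.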
The Newton-Raphson algorithm is used to solve the equation f(X) = 0 iteratively, where

X_{n+1} = X_n − f(X_n) / f′(X_n)    (4.1)

To find the reciprocal of the mantissa b, we set f(X) = 1/X − b = 0. Then f′(X) = −1/X². After substituting, the following recursive solution for the reciprocal is obtained:

X_{n+1} = X_n × (2 − b·X_n)    (4.2)

The precision of the solution doubles with every iteration. The input to the reciprocal block is a 17-bit mantissa with the MSB set to one. We are guaranteed that the MSB of the reciprocal is also a one. The most significant eight bits of the mantissa, not counting the MSB, are used as an address to an 8-bit look-up table to get a rough estimate of the reciprocal, X0. This estimate is used as an initial value for the Newton-Raphson algorithm. After one iteration of the algorithm, we get a precision of sixteen bits; we have one additional bit from the MSB of the reciprocal. This is enough to represent the mantissa of the reciprocal.

4.3 Log average luminance block

The log average luminance block calculates the log average luminance of the whole image. It does this by computing the average of the base-2 logarithms of all the pixels in the image. The log average luminance is estimated as 2 to the power of this average. Figure 4.2 shows the block diagram of the log average luminance. It is divided into three main subblocks: base-2 logarithm of the pixel stream, average computation, and inverse base-2 logarithm of the average. For every subblock there are different approximation methods. The approximation methods introduce error in the estimation, while requiring varying amounts of hardware. In order to test the effects of the estimation on the final global average, we conducted a number of experiments in which different approximation methods for every subblock are compared to the floating-point calculation. Taking the base-2 logarithm of an integer x is an expensive computation.
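One Newton-Raphson step from a coarse table seed is easy to validate numerically (a Python sketch: a rounding of 1/b stands in for the 8-bit ROM of the reciprocal block, and floats stand in for the 17-bit datapath):

```python
def reciprocal_mantissa(b, lut_bits=8):
    """Reciprocal of a mantissa b in [0.5, 1) via one Newton-Raphson step
    x1 = x0 * (2 - b * x0), seeded from a lut_bits-bit approximation of 1/b
    (standing in for the look-up table of Section 4.2)."""
    step = 2.0 ** -lut_bits
    x0 = round((1.0 / b) / step) * step   # coarse seed, as a small ROM would give
    return x0 * (2.0 - b * x0)            # error roughly squares per iteration
```

Starting from roughly eight good bits, the single iteration leaves an error below 2^−16, matching the claim that one step suffices for a 17-bit mantissa.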
As a first approach, the logarithm can be computed using a look-up table; however, if the integer x has many bits, the look-up table will be large and slow. We have experience with one system that uses synthesized images, fused from images taken under different conditions, with 21-bit input pixels. For such a system, a full look-up table with 2^21 entries is not feasible from a hardware perspective. An estimate of the logarithm can be found based on detection of the number of leading zeros in the integer x. The weight of the most significant one bit in x is the integer part of the base-2 logarithm. The remaining bits in x determine the fraction part of the logarithm. For example, if a pixel has luminance value 4092, recognizing that the most significant one bit has weight 2^11, we write the logarithm as:

log2(4092) = log2(2^11 + 2044) = log2(2^11 × (1 + 2044/2^11)) = 11 + log2(1 + 2044/2^11)    (4.3)

Figure 4.2. The block diagram of the log average luminance.

We can use the bits following the most significant one bit in a number of different ways to estimate the fraction part of the base-2 logarithm; we need to compute log2(1 + f), where f is a number between 0 and 1. One possibility is to simply use the bits themselves directly as an approximation of the fraction part [51]. Another is to use a Taylor series expansion of log2(1 + f). Unfortunately, the Taylor series converges slowly, so that a large number of terms of the series must be used to produce a reasonable approximation; we use twelve terms in our example. We propose a third method: to use a fixed number of bits following the most significant '1' to look up the value of log2(1 + f) in a table. The size of the look-up table and the error in the estimation depend on how many bits we choose to use as the address and how many fraction bits we choose to store in the table.
For an address size of eight and a word size of eight, log2(1 + 2044/2^11) ≅ 0.9960, so that our estimate of log2(4092) is 11.9960. The actual value is 11.9986. The three different estimation methods of the base-2 logarithm of the pixel stream were tested using grayscale fixed-point high dynamic range images derived from the Debevec library. We substituted the first subblock of the log average luminance by the three different estimation methods using 8-bit precision. The average computation and inverse base-2 logarithm of the average were done in floating-point precision. Table 4.1 shows the percentage error of the log average luminance for the three different methods as well as the hardware cost. From the test, our proposed method gives a good trade-off between the hardware cost and the percentage error at the output of the log average luminance. It is much lower in cost than the Taylor series method, and gives lower errors than either of the other two methods.

Table 4.1. Comparison of methods for approximating the fraction part of the logarithm.

Method                                 Hardware cost                                % error: Nave  Memorial  Rosette  GroveC  GroveD  Vinesunset
Bits after the most significant '1'    none                                                  4.55  4.06      4.55     4.56    4.47    4.69
Taylor series                          11 adders, 11 dividers, and 12 multipliers            0.98  0.46      1.05     0.91    0.91    0.88
Look-up table (proposed)               256-byte memory                                       0.85  0.34      0.88     0.83    0.82    0.83

Table 4.2. The percentage error of the log average luminance with approximations of the base-2 logarithm and a fixed-point average calculation.

          Nave   Memorial  Rosette  GroveC  GroveD  Vinesunset
% error   2.17   4.14      1.59     2.06    2.24    2.19

Table 4.3. The percentage error of the log average luminance with a fully hardware-friendly approximation.

          Nave   Memorial  Rosette  GroveC  GroveD  Vinesunset
% error   2.27   4.20      1.66     2.09    2.28    2.07

The base-2 logarithm of the pixel stream is added to a large accumulator that keeps a total for the image.
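The proposed estimate — leading-one detection for the integer part, a 256-entry table for log2(1 + f) — can be prototyped in a few lines (a Python sketch; the table entries are generated on the fly here, where the hardware would store a 256-byte ROM):

```python
import math

def log2_approx(x, addr_bits=8, word_bits=8):
    """Estimate log2(x) for a positive integer x: the position of the most
    significant '1' gives the integer part, and the addr_bits bits below it
    address a table of log2(1 + f) values with word_bits fraction bits."""
    e = x.bit_length() - 1                # weight of the most significant '1'
    frac_field = x - (1 << e)             # the bits below the leading one
    addr = (frac_field << addr_bits) >> e if e else 0
    entry = int(math.log2(1.0 + addr / 2 ** addr_bits) * 2 ** word_bits)
    return e + entry / 2 ** word_bits
```

For the pixel value 4092 used in the text this gives 11 + 255/256 ≈ 11.9961, against the exact 11.9986.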
The width of the accumulator depends on the resolution and dynamic range of the image. For example, for a 1024x768 image with 13 bits of precision for the base-2 logarithm (five for the integer part and eight for the fraction part), we need an accumulator of 33 bits. The log average is found by dividing the accumulator by the number of pixels N in the image. To compute this average, we find the reciprocal 1/N and multiply it with the output of the accumulator. Table 4.2 shows the percentage error at the output of the log average luminance block when using our proposed method for approximating the base-2 logarithm, the fixed-point version of the average computation, and the floating-point version of the inverse base-2 logarithm of the average. The level of error is suitable for our application. The final step of the log average luminance is to compute the inverse base-2 logarithm of the average. We divide the average into an integer part, x, and a fraction part, f. Hence, 2^ave = 2^x × 2^f, where 2^f is a number between 1 and 2. To determine 2^f, we can again use one of the three methods described in the estimation of the base-2 logarithm. As mentioned before, the look-up table is a good trade-off between hardware cost and the percentage error at the output of the log average luminance. The look-up table we used has an address size of eight and a word size of eight. The output of the look-up table is added to '1' and fed to a barrel shifter controlled by the integer part x of the average. The final output of the log average luminance is computed by truncating the fraction bits of the output of the barrel shifter. Table 4.3 shows the percentage error at the output of the log average luminance block, now fully implemented in hardware, for the different input images.
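The inverse direction uses the same idea in reverse (a behavioural Python sketch of this subblock, under our own naming: a small table for 2^f − 1, an implicit leading '1', a barrel shift by the integer part, and truncation of the fraction bits):

```python
def inv_log2_approx(ave, addr_bits=8, word_bits=8):
    """Estimate 2**ave for ave >= 0: split into integer part x and fraction f,
    look up 2**f with word_bits fraction bits, shift by x, then truncate."""
    x = int(ave)                          # integer part drives the barrel shifter
    f = ave - x
    addr = int(f * 2 ** addr_bits)        # top addr_bits of the fraction
    entry = int((2.0 ** (addr / 2 ** addr_bits) - 1.0) * 2 ** word_bits)
    mant = (1 << word_bits) + entry       # implicit '1' plus the table output
    return (mant << x) >> word_bits       # barrel shift, then drop fraction bits
```

For example, 2^10 comes out exactly as 1024, while 2^3.5 truncates to 11 (the exact value is 11.31…).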
4.4 Normalization block

The normalization block takes the local illumination estimate (Ls_local_average) and the logarithmic average of the luminance (Ls_log_average) as inputs to normalize the HDR pixel stream. We add these two quantities to get the normalization value V for every pixel. The display luminance Ld is given in Equation 4.4:

Ld = Ls / (Ls_local_average + a × Ls_log_average)    (4.4)

where a is a weighting factor. The display luminance is represented in floating-point with 8 bits of mantissa and 5 bits of exponent. The internal hardware for implementing Equation 4.4 is shown in Figure 4.3.

Figure 4.3. Hardware internal to the normalization block for calculating the display luminance mantissa and exponent.

We start by getting V from Ls_local_average and Ls_log_average. Both V and the scene luminance Ls are converted to floating-point representation using the convert-to-floating block. Then we find the reciprocal of the mantissa of the normalization value, 1/V_mantissa, using the reciprocal block. The mantissa of the display luminance, Ld_mantissa, is found by multiplying 1/V_mantissa with the mantissa of the scene luminance, Ls_mantissa. If the most significant bit M of the product is equal to '1', the most significant 8 bits of the product are transferred to the output. If M = 0, the next 8 bits of the product are transferred to the output. Finally, to get the display luminance exponent Ld_exp, we subtract from the exponent of the scene luminance (Ls_exp) the sum of the exponent of the normalization value (V_exp) and the complement (M̄) of M:

Ld_exp = Ls_exp − (V_exp + M̄)    (4.5)

The final step of the stage is to convert the display luminance to fixed-point representation. The display luminance has gray levels between 0 and 255. Hence, it can be represented in 8 bits.
We start by adding 8 to the exponent of the display luminance, which is equivalent to multiplying the display luminance by 256. Then we transfer the most significant bits from the mantissa of the display luminance to the output according to the algorithm shown in Table 4.4, where ⌊·⌋ is the floor function. In this chapter, as well as in Chapter III, we have presented the basic building blocks needed to build the local tone mapping operators that we have implemented in hardware. All of the building blocks are simple in hardware and suitable for implementation on embedded platforms. The building blocks are the result of careful trade-offs of both image processing and hardware performance aspects, as will be seen in the next chapter.

Table 4.4. The algorithm to transfer the correct bits of the luminance mantissa to the output in the normalization block.

if Ld_exp < 1 then
    Ld = 0
elseif Ld_exp > 8 then
    Ld = 255
else
    Ld = ⌊Ld_mantissa × 2^(Ld_exp)⌋
end

CHAPTER V

RESULTS FOR LOCAL TONE MAPPERS IMPLEMENTED IN HARDWARE

This chapter describes details of the synthesis and simulation of the three different local tone mappers that were implemented in hardware. The first local tone mapper was implemented inside a system developed by Lockheed Martin for multi-gain image processing called MGIP [52]. The second and third local tone mappers are two different fixed-point approximations of the Reinhard method that use two different FIR blocks. The three different local tone mappers were written in VHDL and simulated using Modelsim SE version 6.0. The hardware was synthesized using Quartus tools. The input images were of variable size and dynamic range. To accommodate this, we kept the number of rows in an image, the number of pixels in a row, and the input bit width as generic parameters in the hardware descriptions, so that variations of a local tone mapper for different image sizes and ranges could be easily synthesized.
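The normalization of Equation 4.4 and the final conversion of Table 4.4 can be emulated together in floating point (a sketch using Python's frexp to play the role of the convert-to-floating block; the 8-bit mantissa quantization of the real datapath is ignored, and the exponent bookkeeping follows our frexp convention rather than the hardware's):

```python
import math

def normalize_pixel(ls, local_avg, log_avg, a=4):
    """Display luminance of Equation 4.4: Ld = Ls / V with
    V = Ls_local_average + a * Ls_log_average, computed the way the hardware
    does it: reciprocal of the mantissa of V, one multiply, and an exponent
    adjustment driven by the product's MSB (cf. Equation 4.5)."""
    v = local_avg + a * log_avg
    m_ls, e_ls = math.frexp(ls)           # ls = m_ls * 2**e_ls, m_ls in [0.5, 1)
    m_v, e_v = math.frexp(v)
    prod = m_ls * (1.0 / m_v)             # in (0.5, 2)
    if prod >= 1.0:                       # MSB set: renormalise downward
        return (prod / 2.0) * 2.0 ** (e_ls - e_v + 1)
    return prod * 2.0 ** (e_ls - e_v)

def to_display_8bit(ld):
    """Table 4.4: add 8 to the exponent (scale by 256), clamp, then floor."""
    if ld <= 0.0:
        return 0
    mant, exp = math.frexp(ld)
    exp += 8                              # multiply the display luminance by 256
    if exp < 1:
        return 0
    if exp > 8:
        return 255
    return int(mant * 2 ** exp)           # floor of the shifted mantissa
```

normalize_pixel(200.0, 30.0, 5.0) equals 200/50 = 4.0, and to_display_8bit maps 0.5 to 128 and anything of 1.0 or more to 255.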
The output images produced by the local tone mappers were written to text files, then transformed to bitmap images using Matlab, for display and analysis. In addition, a fixed-point Matlab simulation was developed for each of the local tone mappers that were implemented in hardware. The hardware of each local tone mapper was thoroughly verified by comparing the output that it produces to the output of the corresponding fixed-point Matlab simulation, for a set of testbench images. The outline of the chapter is as follows. The block diagrams of the different local tone mappers are described in Section 5.1. The simulation results of the different local tone mappers are given in Section 5.2. The hardware costs of the different local tone mappers are given in Section 5.3.

Figure 5.1. Block diagram of the MGIP mapper.

Figure 5.2. Block diagram of the four-scale constant weight pyramid mapper.

Figure 5.3. Block diagram of the four-scale approximate Gaussian pyramid mapper.

Table 5.1. The parameters of the MGIP mapper for different images.

          Rosette  Memorial  Nave  GroveC  GroveD  Vinesunset
dif_sel   -4       -6        -2    -5      -6      -8
a         4        4         4     4       4       0.5

5.1 Block diagrams of the different local tone mappers

High level block diagrams of the three different local tone mappers are shown in Figures 5.1 to 5.3, respectively. The only difference between the three local tone mappers is the block used to produce the local illumination estimate. The three local illumination estimates used are the MGIP, the four-scale constant weight pyramid, and the four-scale approximate Gaussian pyramid. All local illumination estimates were described in detail in Chapter III. The input to the different local tone mappers is the HDR pixel stream that represents the scene luminance (Ls).
The local illumination estimate block takes the scene luminance and produces a local illumination estimate for every pixel (Ls_local_average). Another copy of the scene luminance is sent to the log luminance average block, described in Section 4.3, to generate the scene log average luminance (Ls_log_average). Finally, the normalization block, described in Section 4.4, takes Ls, Ls_local_average, and Ls_log_average to compute the tone mapped pixel stream that represents the display luminance (Ld).

5.2 Simulation results for the different local tone mappers

The MGIP mapper was tested using HDR images from the Debevec library. Table 5.1 shows the values of dif_sel and the weights a of the global average used for every image. These values were selected experimentally by qualitatively evaluating the tone-mapped images. Figure 5.4 shows the output of the mapper for the different images. A major limitation of the MGIP mapper is that its method of producing a local illumination estimate is not robust. We see this clearly in Table 5.1; the parameter dif_sel must take on different values for different images in order to get high quality results. This poses difficulty for a user who would like to tone-map images in real time, as there is no good way to determine appropriate values for the parameters on-line. In contrast, the selection parameter ε for the four-scale constant weight and approximate Gaussian pyramids can be chosen a priori in a way that works well for a wide variety of images. In addition, halo artifacts are present in all images processed by the MGIP mapper. These artifacts appear very clearly around the windows and light sources of Rosette, Memorial and Nave. On the other hand, the mapper does succeed in preserving the details and in enhancing the edges and contrast of all six images. The four-scale constant weight pyramid mapper was tested using HDR images from the Debevec library.
We used a fixed ε = 0.5 × 2^−3 and a fixed a = 4, except for Vinesunset, where a = 0.5. Because Vinesunset is a very dark image, we need a smaller weight for the scene log average luminance in order to see details clearly in the image. The output of the mapper is shown in Figure 5.5. The mapper succeeds in decreasing the halo artifacts around the window of Rosette, the windows of Memorial, and the windows and light source of Nave. Details are preserved in all six images; however, edges and contrast are slightly less enhanced than in the output of the MGIP mapper. Vinesunset in particular is more visually pleasing when processed by MGIP than by the four-scale constant weight pyramid mapper. The four-scale approximate Gaussian pyramid mapper was tested using HDR images from the Debevec library. Again, we used a fixed ε = 0.5 × 2^−3 and a fixed a = 4, except for Vinesunset, where a = 0.5. The output of the mapper is shown in Figure 5.6. Halo artifacts are almost completely absent in all six images. Details are preserved. Edges and contrast are enhanced well. It is the approximate Gaussian weight distribution in the two-dimensional FIR filters used inside the local illumination estimate that gives this mapper its success. It is always difficult to judge the quality of an image quantitatively. The literature often uses measures like the peak signal-to-noise ratio (PSNR) of the processed image compared to an ideal original image. We next present the PSNR of the three different local tone mappers compared to the output of a variation of the floating-point Reinhard tone mapping operator, which we consider a gold standard for comparison purposes. The variation of Reinhard's tone mapping operator was built to more closely match our local tone mappers. A fixed ε = 0.5 × 2^−3 and a = 4 were used for the four-scale constant weight pyramid mapper, the four-scale approximate Gaussian pyramid mapper, and the variation of the Reinhard tone mapping operator.
The dif_sel values of Table 5.1 and a fixed a = 4 were used for the MGIP mapper. Table 5.2 summarizes the results. From the results of the table we notice the following: the PSNR of the MGIP mapper was small for those images that contain large windows and significant halo artifacts. The MGIP mapper performed fairly well for Vinesunset. The four-scale constant weight pyramid mapper performed better than the MGIP mapper for all images. The four-scale approximate Gaussian pyramid mapper gave large PSNR values for all images. However, its performance varies across images. The lowest PSNR was for Rosette; this is due to the fact that the Rosette image contains a very large window where most of the halos exist. The highest PSNR was for Vinesunset; notice that Vinesunset does not contain any noticeable halo artifacts in the output of any of the local tone mappers.

Table 5.2. The PSNR of the three different local tone mappers for the set of input images.

            MGIP   Four-scale constant   Four-scale approximate
                   weight pyramid        Gaussian pyramid
Nave        19.8   29.1                  33.2
Memorial    23.0   30.9                  34.4
Rosette     19.6   25.1                  28.5
GroveC      22.5   29.4                  33.5
GroveD      23.3   29.8                  33.6
Vinesunset  30.8   41.4                  42.6

5.3 Hardware costs of the different local tone mappers

The three different local tone mappers were implemented using Altera FPGAs from the Stratix II family. The architectures were sized to accommodate high resolution images of high dynamic range, with 768 × 1024 pixels and 28 bits per pixel. It should be noted that memory requirements would be less for lower resolution images or images of smaller dynamic range. The architecture was described in VHDL and synthesized for the Stratix device using Altera's Quartus II v5.0 toolset. Table 5.3 summarizes the synthesis report from Quartus. It is important to note that although the systems have different complexities and costs, they all run at about the same high speed.
The simplicity of hardware for all local tone mappers is reflected in the low utilization of look-up tables. All local tone mappers use a significant percentage of the available embedded memory. It is clear that processing algorithms for high resolution images, in general, require significant amounts of memory. If they are to be implemented on a single chip, they need a specialized FPGA device with extended memory capabilities. In summary, this chapter has shown the trade-offs of both the image processing and hardware performance aspects. The four-scale approximate Gaussian pyramid mapper gave large PSNR values and qualitatively good tone mapped images. However, the hardware complexity and the amount of memory used in this local tone mapper are larger than needed for the four-scale constant weight pyramid mapper and the MGIP mapper. Nevertheless, the algorithm was successfully synthesized for an FPGA platform with a high operating frequency. This result has encouraged us to look at even more complex ideas that will be discussed in the following chapter. Our target is to reach the best possible quality of tone mapped images using the available memory and look-up tables of the new FPGA families.

Table 5.3. The synthesis report from Quartus for the three different local tone mappers.

Local tone mapper                        Total bits of memory    Total logic cells  Maximum operating frequency
MGIP                                     1,071,360 / 2,544,192   2492 / 48352       85.11 MHz
Four-scale constant weight pyramid       2,548,736 / 4,520,448   14014 / 72768      93.41 MHz
Four-scale approximate Gaussian pyramid  2,952,960 / 4,520,488   17553 / 72768      83.83 MHz

Figure 5.4. HDR images (Rosette, GroveC, GroveD, Vinesunset, Nave, Memorial) from the Debevec library after being processed by the multi-gain image processing mapper.

Figure 5.5. HDR images (Rosette, GroveC, GroveD, Vinesunset, Nave, Memorial) from the Debevec library after being processed by the four-scale constant weight pyramid mapper.
Figure 5.6. HDR images from the Debevec library (Rosette, GroveC, GroveD, Vinesunset, Nave, Memorial) after being processed by the four-scale approximate Gaussian mapper.

CHAPTER VI

NINE-SCALE APPROXIMATE GAUSSIAN PYRAMID MAPPER

The original Reinhard local tone mapper [14] uses a nine-scale Gaussian pyramid to estimate the local illumination for each pixel. Using more scales allows for better contrast because the pixel is normalized by using the largest surround possible from an increased set of surrounds. Chapter V showed a successful implementation of a four-scale approximate Gaussian pyramid mapper that operates at high speed. This chapter describes a nine-scale version of the mapper. The outline of this chapter is as follows. Our nine-scale approximate Gaussian pyramid mapper is described in detail in Section 6.1. Its FPGA implementation and results are given in Section 6.2. Finally, we show the simulation results of the algorithm using a set of HDR images from the Debevec library in Section 6.3.

6.1 Nine-scale Gaussian pyramid mapper

The block diagram for our nine-scale Gaussian pyramid mapper is shown in Figure 6.1. The input to the mapper is the HDR pixel stream that represents the scene luminance (Ls). The local illumination estimate block takes the scene luminance and produces a local illumination estimate for every pixel (Ls_local_average). Another copy of the scene luminance is sent to the log luminance average block to generate the scene log average luminance (Ls_log_average). Finally, the normalization block takes Ls, Ls_local_average, and Ls_log_average to compute the tone-mapped pixel stream that represents the display luminance (Ld). The log average luminance block and the normalization block were explained in detail in Section 4.3 and Section 4.4, respectively. The local illumination estimate block will be explained in detail in this section. We next consider a hardware-friendly approximation to the Gaussian pyramid.
Our approximation to Reinhard's nine-scale Gaussian pyramid uses nine different two-dimensional windows to calculate nine different averages around a pixel, and selects the most suitable one as an estimate of the pixel's local illumination. When calculating the local averages around the pixel, the original Reinhard operator weights the pixels in a surround according to the Gaussian function:

g(i, j, σ) = (1 / (πσ²)) · e^(−(i² + j²)/σ²)     (5.1)

where i and j are the indices indicating the distance from the pixel, and σ is the scale of the surround; this scale controls the rate of decay of the Gaussian surround around the pixel being processed. The required coefficients, with their exponential relationship, are in general difficult to implement in fixed-point mathematics. We replace the coefficients with ones that serve the same purpose but are much more hardware-friendly. Figures 6.2 and 6.3 show the windows for the original Gaussian surround and their hardware-friendly approximations, respectively. The intensity shown in both figures is proportional to the logarithm of the corresponding weight in the window. Expressions for the pixel weights in the upper left quadrant are given in Figure 6.3, where ⌈·⌉ denotes the ceiling function; the other three quadrants can be obtained via symmetry.

Figure 6.1. Block diagram of the nine-scale approximate Gaussian pyramid mapper, showing the input Ls, the intermediate signals Ls_local_average and Ls_log_average, and the output Ld.

Figure 6.2. The windows for the original Gaussian surround, for scales s = 1, 1.5, 2, 2.5, 3, 4, 8, 12 and 16.

Figure 6.3. The hardware approximation for the original Gaussian surrounds, with upper-left-quadrant weights:
w_i,j = 16^(i−2) × 16^(j−2)
w_i,j = 8^(i−3) × 8^(j−3)
w_i,j = 4^(i−4) × 4^(j−4)
w_i,j = 3^(i−5) × 3^(j−5)
w_i,j = (5/2)^(i−6) × (5/2)^(j−6)
w_i,j = 2^(i−8) × 2^(j−8)
w_i,j = 2^(⌈i/2⌉−8) × 2^(⌈j/2⌉−8)
w_i,j = 2^(⌈i/3⌉−8) × 2^(⌈j/3⌉−8)
w_i,j = 2^(⌈i/4⌉−8) × 2^(⌈j/4⌉−8)
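The contrast between the true Gaussian of Equation 5.1 and the hardware-friendly weights can be sketched numerically. This is an illustrative reading of the quadrant expressions, assuming quadrant indices i, j = 1 … half-width with the centre pixel at i = j = half-width; `gaussian_weight` and `approx_weight` are our own helper names:

```python
import math

def gaussian_weight(i, j, sigma):
    """True Gaussian surround weight, as in Equation 5.1."""
    return math.exp(-(i * i + j * j) / sigma ** 2) / (math.pi * sigma ** 2)

def approx_weight(i, j, base, half):
    """Hardware-friendly weight base^(i-half) * base^(j-half) for the
    upper-left quadrant (base 16, 8, 4, 3, 5/2 or 2, per scale)."""
    return base ** (i - half) * base ** (j - half)

# First scale: powers of 16 over a 4x4 window (quadrant half-width 2).
w = [[approx_weight(i, j, 16, 2) for j in (1, 2)] for i in (1, 2)]
# Centre weight is 1; weights fall off by a factor of 16 per step away from it.
```

Every weight is a power of 16, so multiplying a pixel by its weight reduces to a 4-bit shift, which is the point of the approximation.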
As in a Gaussian surround, larger weights are given to the pixels in the center of the window; however, we approximate the steep exponential drop-offs of the six smaller windows by implementing filters with coefficients that are powers of sixteen, eight, four, three, five over two, and two, respectively; the less steep drop-offs of the three largest windows are approximated by using filter coefficients that are powers of two repeated two, three, and four times, respectively. Notice that, for the four-scale approximate Gaussian pyramid described in Section 3.1.3, we only used a small set of these filter coefficients. This special relationship of the weights ensures that computation of the averages can be done with simple hardware.

The two-dimensional square windows are implemented using two one-dimensional filters. The vertical filter calculates the average of the input image along the columns. The horizontal filter calculates the average of the image processed by the vertical filter along the rows. In the remainder of this section, we first describe how the vertical filters are implemented, including details of the memory organization used to provide pixels to all nine scales. Structures for the horizontal filters are then described. The process for deciding which of the nine scales provides the best local estimate of illumination for a given pixel is similar to the average selection criterion 2 described in Section 3.2.2. Together, these subsections completely describe our approximation to the Gaussian pyramid and how it is used to provide a local estimate of illumination.

6.1.1 Structure of vertical filters

All vertical filters are implemented using simple accumulator structures. One accumulator is devoted to the falling geometric series. A second accumulator tracks the rising geometric series.
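The separability idea above, a vertical pass down the columns followed by a horizontal pass along the rows, can be sketched with a toy kernel. The helper names and the 3-tap kernel are illustrative only, not the geometric-series filters of the actual design:

```python
# A separable 2-D weighted average: filter each column, then each row.
def filter_columns(img, kernel):
    """Apply a centred 1-D kernel down every column (borders truncated)."""
    h, w = len(img), len(img[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for c in range(w):
        for r in range(h):
            out[r][c] = sum(kernel[k + d] * img[r + d][c]
                            for d in range(-k, k + 1) if 0 <= r + d < h)
    return out

def filter_rows(img, kernel):
    """Same operation along rows: transpose, filter columns, transpose back."""
    t = [list(col) for col in zip(*img)]
    return [list(col) for col in zip(*filter_columns(t, kernel))]

kernel = [0.25, 0.5, 0.25]                 # toy stand-in, weights sum to 1
img = [[1.0] * 5 for _ in range(5)]
out = filter_rows(filter_columns(img, kernel), kernel)
# An interior pixel sees the full 3x3 separable window, so it stays at 1.0.
```

The two 1-D passes cost O(window) each instead of O(window²) for a direct 2-D sum, which is what makes the vertical/horizontal split attractive in hardware.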
Taking the vertical filter for the fifth scale as an example, when the window slides one pixel further along the data, the accumulators are updated in the following simple way:

acc1[n] = (2/5) · acc1[n−1] + M[n] − (2/5)^6 · T[n]     (5.2)

acc2[n] = (5/2) · acc2[n−1] + (2/5)^6 · B[n] − M[n]     (5.3)

ave[n] = (acc1[n] + acc2[n]) / ((1 + 2/5) · (1 − (2/5)^6) / (1 − 2/5))     (5.4)

where B is the pixel entering the window at the bottom, M is the pixel in the middle of the window, and T is the pixel leaving the window at the top. While the accumulator updates described in Equations 5.2 to 5.4 apply only to windows that lie completely within the borders of the image, the updates can be easily modified using stacks to accomplish symmetric extension at the image borders while still inputting each image pixel only once.

Figure 6.4. The ALU for the vertical filter for the fifth scale, given a pixel stream of 28 bits.

Figure 6.5. Hardware architecture for the vertical filter for the fifth scale.

Table 6.1. The hardware details for the nine vertical filters, given an input pixel stream of 28 bits.

Scale   Scale constant            acc1 bit width   acc2 bit width   (acc1, acc2) accumulator pairs
1st     16                        36               36               1 per column
2nd     8                         37               37               1 per column
3rd     4                         36               36               1 per column
4th     3                         36               36               1 per column
5th     2.5                       36               42               1 per column
6th     2                         36               36               1 per column
7th     2, repeated twice         36               36               2 per column
8th     2, repeated three times   36               36               3 per column
9th     2, repeated four times    36               36               4 per column

Figure 6.4 shows the arithmetic logic unit (ALU) used in the vertical filter for the fifth scale for an example input pixel width of 28 bits. Each signal is annotated with its bit width. Data flows from top to bottom; when two signal vectors come together, they are concatenated (this is used to pad signals with zeroes at the top or bottom). When a signal is split into two, the number of bits indicated on the least significant end of the signal is discarded.
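One consistent reading of the falling-series update (the acc1 recurrence of Equation 5.2) can be checked in software: sliding the recurrence along a column should reproduce a direct evaluation of the weighted sum. The decay ratio 2/5 and half-window of 6 follow the fifth-scale example; the test column is arbitrary:

```python
# Check the sliding-window recurrence for the falling geometric series
# against a direct weighted sum (ratio r = 2/5, half-window 6).
r = 2 / 5
HALF = 6
x = [float(v % 17 + 1) for v in range(40)]   # arbitrary test column

# initialise with the window whose middle pixel is row HALF-1
acc1 = sum(r ** k * x[HALF - 1 - k] for k in range(HALF))
for n in range(HALF, len(x)):
    # slide the window one row down: decay, add the new middle pixel M[n],
    # and exactly remove the pixel T[n] = x[n-6] leaving at the top
    acc1 = r * acc1 + x[n] - r ** HALF * x[n - HALF]

direct = sum(r ** k * x[len(x) - 1 - k] for k in range(HALF))
```

The recurrence costs a constant amount of work per pixel regardless of window size, which is why the accumulator structure is attractive for the vertical filters.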
The simplicity of hardware is reflected in the number of shift-and-add operations. We use eight extra integer bits for both accumulators to eliminate the possibility of overflow. We also allocate six fraction bits for the second accumulator. These bits are necessary to make the computation for the second accumulator stable; since the computation has a gain larger than one, we need to be sure that we exactly remove all the contributions of a pixel as it exits the window. When computing the average, we are interested only in the integer portions of both accumulators, and so the fraction bits of the second accumulator are truncated.

Figure 6.5 shows the hardware architecture of the vertical filter for the fifth scale. Because the pixels enter the system in a row-by-row order, it is not sufficient to use a single accumulator to calculate the output of the vertical filter. Instead, we use a set of accumulators, one for each column of the image. If, for example, the image contains 1024 pixels in a row, there is a set of 1024 accumulators for acc1 and a set of 1024 accumulators for acc2; each set is stored together in a single 1024-word memory. The accumulators in the set are updated one by one as the pixels enter the system.

Figure 6.6. Memory organization for the nine-scale approximate Gaussian pyramid.

Similar hardware is used to implement the vertical filters for the other eight scales. The basic differences are in the scale of the geometric series, the number of accumulators, and the bits allocated for each accumulator. Table 6.1 shows the hardware details for the filters for all nine scales, given an input pixel stream of 28 bits. The hardware architecture used to calculate the nine different vertical averages is now completely described with one exception: a memory organization for supplying data to the nine differently-sized windows.
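The per-column accumulator bank can be sketched as follows. `WIDTH`, the decay `RATIO` and the half-window are illustrative stand-ins, and a plain list stands in for the 1024-word accumulator memory:

```python
# One falling-series accumulator per image column: pixels arrive row by row,
# so a whole row of updates happens between successive uses of each value.
WIDTH, RATIO, HALF = 8, 0.5, 8
acc1 = [0.0] * WIDTH                      # one accumulator word per column

def vertical_update(row_mid, row_top):
    """Apply the falling-series recurrence to every column as one image
    row streams in (row_mid: middle-of-window row, row_top: leaving row)."""
    for col in range(WIDTH):
        acc1[col] = RATIO * acc1[col] + row_mid[col] - RATIO ** HALF * row_top[col]

# With a constant image, each accumulator settles at the geometric sum.
for _ in range(60):
    vertical_update([1.0] * WIDTH, [1.0] * WIDTH)
```

On constant input the fixed point of the recurrence is the geometric-series sum (1 − RATIO^HALF)/(1 − RATIO), which is the total weight the window applies to that column.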
Memory is also needed to supply a delayed input pixel stream, synchronized such that each input pixel appears simultaneously with its different averages. As shown in Figure 6.6, we divide the sixty-four rows of pixels needed for the update of the largest window into nineteen separate memories, each implemented using dual-port physical memories. The memory organization ensures that all pixels needed in a single time step are in different physical memories; these pixels include the nine pixels at the bottoms of the nine windows, the nine pixels at the tops, the one pixel that lies in the (common) middle of the nine windows, and the delayed version of the input pixel stream. Because they are in different physical memories, all pixels needed to update the accumulators in a given time step can be read simultaneously.

6.1.2 Structure of horizontal filters

Although the accumulator structures described above are the simplest method to implement geometric series functions, they are not sufficient to implement the horizontal filters that use more complex scale constants. For the vertical filters, the accumulators are updated row by row; hence, we have a number of clocks between updates in which computations can be done, and an accumulator structure is fast enough regardless of the complexity of the scale constant. For the horizontal filters the accumulators must be updated pixel by pixel, and therefore all computations must be done in a single clock cycle. For scales involving more complex update computations, the accumulator structure is not a good choice because the critical path of the system will necessarily be long, so that the maximum operating frequency and throughput are low. Instead, other structures are used.

Table 6.2. Hardware details for the horizontal filters.
Scale   Scale constant            Implementation method   Latency (clock cycles)   Hardware cost (logic cells)
1st     16                        factored form FIR       2 + 4                    194
2nd     8                         factored form FIR       3 + 7                    269
3rd     4                         factored form FIR       4 + 7                    326
4th     3                         direct form FIR         5 + 7                    562
5th     2.5                       direct form FIR         6 + 9                    1594
6th     2                         accumulator-based       8 + 2                    521
7th     2, repeated twice         accumulator-based       16 + 3                   975
8th     2, repeated three times   accumulator-based       24 + 6                   1459
9th     2, repeated four times    accumulator-based       32 + 4                   1858

Figure 6.7. Hardware architecture of the factored form FIR filter used for the horizontal filter for the first scale.

Table 6.3. The coefficients for the horizontal filters for the fourth and fifth scales, in canonical signed digit representation.

Fourth scale                          Fifth scale
Coefficient   CSD representation      Coefficient   CSD representation
1             1                       1             1
3             101̄                     2.5           10.1
3²            1001                    2.5²          101̄0.01
3³            1001̄01̄                  2.5³          10000.1̄01
3⁴            1010001                 2.5⁴          101001̄.0001
                                      2.5⁵          101̄00010.1̄0101

Figure 6.8. The ALU for the horizontal filter for the sixth scale, given an input pixel stream of 28 bits.

Figure 6.9. The hardware architecture for the horizontal filter for the sixth scale.

Table 6.2 shows the implementation methods used for the horizontal filters for all nine scales. For each scale, the resulting latency in clock cycles and hardware cost in logic cells are given. Latency depends on half the length of the filter and on the degree of pipelining needed to achieve high throughput; in the table, it is shown as the sum of these two factors. The different implementation methods used are the factored form FIR, the direct form FIR, and the accumulator structure. Figure 6.7 shows the hardware architecture for the horizontal filter for the first scale as an example of the factored form FIR implementation. We divide the filter into two halves representing the rising and falling parts of the geometric series. To implement the filter taps of the rising geometric series, we delay the pixel stream and multiply by the gain factor.
Conversely, the filter taps of the falling geometric series are implemented by delaying the pixel stream and dividing by the gain factor. Because the gain factor of the series is a power of two, we implement multiplication and division by simply shifting the pixel stream to the left and right, respectively. We then fold the two halves of the filter and add the outputs of the taps with equal gain. The final output of the filter is computed by adding all the outputs of the taps. A similar architecture is used to implement the horizontal filters for the second and third scales.

Horizontal filters for the fourth and fifth scales are implemented using a direct form FIR structure. Both filters were realized using a tree of shift-and-add operations. We represent the coefficients of the filters in canonical signed digit (CSD) representation to decrease the depth of the tree; this decreases both the hardware cost and latency of the filter. Table 6.3 shows the CSD representation of the coefficients of the filters. Coefficients are given in binary, with an overbar indicating that the corresponding bit should be subtracted.

Finally, the accumulator-based structure is used to implement the horizontal filters for the sixth, seventh, eighth, and ninth scales. As an example, the arithmetic logic unit and hardware architecture for the horizontal filter for the sixth scale are shown in Figures 6.8 and 6.9, respectively. The signal P represents the incoming pixel stream; at every clock, we need the newest pixel P[n] entering the window, the pixel P[n−8] in the middle of the window, and the pixel P[n−16] leaving the window. Figure 6.8 shows that the number of operations required to update every accumulator is quite small; the update is simple enough to be implemented in a single clock.
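CSD recoding itself is straightforward to express in software. A common conversion (not necessarily the one used in the authors' toolflow) examines two bits at a time so that no two adjacent signed digits are nonzero:

```python
def to_csd(n):
    """Convert a positive integer to canonical signed digit form, returned
    as a list of digits in {-1, 0, 1}, least significant digit first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)        # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d                 # subtracting d clears the low bit exactly
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# 27 = 32 - 4 - 1: three nonzero digits instead of the four in binary 11011,
# so a shift-and-add tree for x*27 needs one fewer adder.
```

Each nonzero CSD digit costs one adder (or subtractor) in the shift-and-add tree, so minimizing nonzero digits directly reduces the tree depth mentioned above.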
As shown in Figure 6.2, the windows for the seventh, eighth, and ninth scales are derived by repeating each window weight in the basic sixth-scale window two, three and four times, respectively, in both the row and column directions. Accordingly, we use copies of the vertical and horizontal hardware architecture of filter six to implement the larger windows. As the rows of image pixels enter the system, we route them round-robin style to the different copies of the vertical filter. The overall vertical average is found by summing the outputs of the different copies. The vertical averages are routed round-robin style to the different copies of the horizontal filter, and the outputs of the copies are summed at each time step to produce the final output of the two-dimensional window.

6.2 Results of Hardware Synthesis

The proposed architecture was described in VHDL, and synthesized for a Stratix II FPGA device using Altera's Quartus II v5.0 toolset; functionality of the system was verified by simulating the processing of test images using Modelsim and comparing the output to the output of a fixed-point Matlab simulation. The architecture was sized in order to accommodate high resolution images of high dynamic range with 1024×768 pixels and 28 bits per pixel. It should be noted that memory requirements would be less for lower resolution images or images of smaller dynamic range. Table 6.4 summarizes the synthesis report from Quartus. The simplicity of hardware is reflected in the clock speed achieved, and in the low utilization of logic cells. The implementation used a significant percentage of the available embedded memory. It is clear that processing algorithms for high resolution images, in general, require significant amounts of memory. If they are to be implemented on a single chip, they need a specialized FPGA device with extended memory capabilities. Our architecture needs a horizontal blanking period of 64 pixels, and a vertical blanking period of 32 rows.
The horizontal blanking period is divided into three parts. Three clocks at the beginning of every row are necessary to check the start of the row and prepare the system to process the row according to its position in the frame. Thirty-two clocks at the end of every row are necessary to compensate for the latency of the largest horizontal filter. Twenty-nine extra clocks at the end of every row are necessary for the hardware latency. A vertical blanking period is necessary for the latency of the largest vertical filter. Given that we would like to achieve a video frame rate of 60 frames per second, and that there are (1024+64) × (768+32) or 870,400 pixels in the frame when we include the blanking periods, we need to be able to process 60 × 870,400 = 52.22 megapixels per second. Our architecture, which has a maximum operating frequency of 77.15 MHz, can accommodate this by taking in one pixel per clock.

Table 6.4. Summary of hardware synthesis report.

Device: EP2S130F780C4
Total bits of memory: 3,153,408 / 6,747,840 (46%)
Total logic cells: 34,806 / 106,032 (32%)
Maximum operating frequency: 77.15 MHz
Latency between rows: 3 (controller) + 32 (filter_size/2) + 29 (hardware)
Latency between frames: 32 rows (filter_size/2)

6.3 Experiments and Results

Figure 6.10 shows the set of HDR images from the Debevec library after processing by the nine-scale approximate Gaussian pyramid mapper. Evaluating the tone mapped images visually, we see that the mapper gives comparable results to the original method. In particular, for the Nave image the mapper is able to show details in the dark areas while still allowing the windows to be admired. We are also able to see the small dark details surrounding the bright window in the dome of the church in the Memorial image without losing the texture of this window. These two images presented the largest challenge for the four-scale approximate Gaussian pyramid mapper described in the previous chapter.
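The frame-rate arithmetic of Section 6.2 can be double-checked mechanically: active resolution plus blanking gives the pixel rate that a one-pixel-per-clock design must sustain.

```python
# Pixel-rate check for 1024x768 at 60 fps with the stated blanking periods.
cols, rows = 1024, 768
h_blank, v_blank = 64, 32
fps = 60

pixels_per_frame = (cols + h_blank) * (rows + v_blank)
required_mhz = fps * pixels_per_frame / 1e6
# 52.224 Mpixels/s is comfortably below the synthesized 77.15 MHz clock.
```

Taking in one pixel per clock, the margin between 52.22 Mpixels/s and the 77.15 MHz maximum operating frequency is what makes the 60 fps target achievable.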
Halo artifacts are absent in all images; these would manifest themselves as black or bright bands around the church windows and behind the trees in the natural images. Details in the dark areas can be seen, and edges look sharp. We have also conducted a study of the error introduced by our hardware approximations to the Reinhard operator. Our gold standard is a floating-point version of the Reinhard operator that uses a nine-level true Gaussian pyramid; the nine levels have scales of 1, 1.5, 2, 2.5, 3, 4, 8, 12 and 16, respectively. These scales were chosen because they correspond to the values chosen for our hardware approximation. Figure 6.11 shows the HDR images from the Debevec library after being processed by the floating-point version of the Reinhard operator. Considering the processed image from the gold standard to be the signal, and the difference between the processed images from the gold standard and the proposed hardware architecture to be the noise, the peak signal-to-noise ratio (PSNR) was calculated for each test image. Results are shown in Table 6.5. The size of the PSNR values shows that our fixed-point approximation matches the floating-point version well, and gives us confidence that the approximation is reasonable. Notice that the PSNR values are more consistent than the results of the four-scale approximate Gaussian pyramid presented in Section 5.2. The nine-scale approximate Gaussian pyramid local tone mapper, with its simplicity and high operating frequency, is a promising algorithm for real-time display of high dynamic range images on standard LCD screens. While the example system implemented here is for 1024×768 gray scale images with 28 bits per pixel, the design can be easily parameterized to deal with images of different input dynamic ranges and displays of different resolutions. Extension to color will be discussed in the next chapter.

Table 6.5. PSNR results for the test images.
Image        PSNR
Rosette      34.0 dB
Nave         35.6 dB
GroveC       35.9 dB
GroveD       35.5 dB
Memorial     33.7 dB
Vinesunset   33.2 dB

Figure 6.10. The HDR images from the Debevec library (Rosette, GroveC, GroveD, Vinesunset, Nave, Memorial) after being processed by the nine-scale approximate Gaussian pyramid mapper for grayscale images.

Figure 6.11. The HDR images from the Debevec library (Rosette, GroveC, GroveD, Vinesunset, Nave, Memorial) after being processed by the floating-point version of the nine-scale Reinhard local tone mapper for grayscale images.

CHAPTER VII

TONE MAPPING OF COLOR HIGH DYNAMIC RANGE IMAGES

This chapter extends the nine-scale approximate Gaussian pyramid mapper to the color domain. First the luminance component is calculated from the HDR RGB triplet. This component forms a gray scale version of the image. The log average luminance of the scene and the local illumination estimate around every pixel are calculated from the luminance component. These two quantities are used to map the original RGB triplet to an RGB triplet that can be displayed with clear detail on a standard display device without halo artifacts. The outline of this chapter is as follows. The nine-scale approximate Gaussian pyramid mapper for color images is described in detail in Section 7.1. Section 7.2 gives the results of implementing the mapper on an FPGA. Finally, we show simulation results for the mapper using the fixed-point color version of the set of images from the Debevec library in Section 7.3.

7.1 Nine-scale approximate Gaussian pyramid mapper for color images

The block diagram of our nine-scale approximate Gaussian pyramid mapper for color images is shown in Figure 7.1 and is made up of five subblocks. Starting with the HDR RGB triplet, we calculate the luminance component using the luminance computation subblock. The luminance component enters the approximate Gaussian pyramid and log average subblocks. At the same time, the HDR RGB triplet is buffered in the buffer subblock.
Finally, the local estimate of illumination, the log average of the scene luminance, the luminance component, and the buffered RGB triplet are fed simultaneously to the color tone mapping subblock to get the mapped RGB triplet. The approximate Gaussian pyramid and log average subblocks are the same blocks used inside the mapper for gray scale images and were explained previously in Section 6.1 and Section 4.2, respectively. The luminance calculation, buffer and color tone mapping subblocks will be discussed in detail in this section.

Figure 7.1. Block diagram of the nine-scale approximate Gaussian pyramid mapper for color images.

Table 7.1. Coefficients for the computation of scene luminance using Equation 7.1.

Real coefficient   Fixed-point approximation (decimal)   Canonical signed digit representation
0.27               0.269531250                           0.010001010
0.67               0.669921875                           1.01̄01̄01̄001̄
0.06               0.060546875                           0.000100001̄

The HDR image is represented as an RGB triplet, where each element is a 32-bit fixed-point value, with 16 bits of integer and 16 bits of fraction. The luminance calculation subblock scales the RGB triplet by 2^16 in order to transform the numbers to integers, and calculates the scene luminance component of a pixel in fixed-point mathematics, using the equation described in [14], as:

Ls = 0.27 × Rs + 0.67 × Gs + 0.06 × Bs.     (7.1)

Equation 7.1 was realized using a multiplierless tree of shift-and-add operations. We represent the coefficients of the equation in canonical signed digit representation to reduce the number of required terms, and to thereby decrease the depth of the tree. Implementing the coefficients requires approximating them in fixed-point. Table 7.1 shows the original coefficients and their approximations, both in decimal and canonical signed digit. Notice that the sum of the approximate coefficients is still exactly 1; this is important so that the luminance component remains an exact weighted average of the HDR RGB triplet.
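The shift-and-add evaluation of Equation 7.1 can be sketched with the Table 7.1 fixed-point coefficients expanded into sums of powers of two. The expansions below are our own reading of the stated decimal values, and the arithmetic is shown in floating point for clarity; the hardware works on scaled integers:

```python
def luminance_fixed(r, g, b):
    """Multiplierless sketch of Equation 7.1 with the fixed-point
    coefficients written as power-of-two terms (our expansion):
      0.269531250 = 2^-2 + 2^-6 + 2^-8
      0.669921875 = 1 - 2^-2 - 2^-4 - 2^-6 - 2^-9
      0.060546875 = 2^-4 - 2^-9
    Each division by a power of two would be a wire shift in hardware."""
    lr = r / 4 + r / 64 + r / 256
    lg = g - g / 4 - g / 16 - g / 64 - g / 512
    lb = b / 16 - b / 512
    return lr + lg + lb

# Because the three coefficients sum to exactly 1, a gray pixel maps to itself.
```

The exact-sum property noted above is visible here: for R = G = B the three partial sums recombine to the input value with no rounding residue.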
The luminance component of an RGB triplet is processed by the approximate Gaussian pyramid subblock to estimate the local illumination, and by the log average luminance subblock to compute the log average of the scene luminance. Meanwhile, the RGB triplet is buffered in the buffer subblock so that it can be presented to the color tone mapping subblock together with its local illumination estimate and the log average of the scene luminance. The buffer subblock requires three separate external memories, one each for the red, green, and blue pixels of the triplet; each pixel is 32 bits. The latency of the approximate Gaussian pyramid is 32 rows of the input image, as shown in Table 6.4. Hence, for 1024 × 768 images, each of the three memories should be of size 32×1024×32 bits, to hold 32 rows. The control of the external memories is not implemented in the current version of the architecture.

Each buffered RGB triplet is fed to the color tone mapping subblock simultaneously with its luminance component Ls, the log average of the scene luminance Ls_log_average, and the local illumination estimate Ls_local_average. Tone mapped color components are computed using the approach developed by Hunt in [54] and adopted in [53], which considers the sensitivity of the rods in the human vision system and the blue shift of the subjective hue of colors for night scenes. The computations are given in Equations 7.2 to 7.4. The equations give the base-2 logarithm of the mapped RGB triplet; the final calculations are inverse base-2 logarithms, to get the mapped RGB triplet. The base-2 logarithm and its inverse are computed using the approach previously described in Section 4.2.
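Folding the four log terms of Equations 7.2 to 7.4 back out of the log domain gives a direct form that is convenient for a software check. The key value a = 4 is an assumption carried over from the earlier MGIP experiments, and `tone_map_rgb` is our own helper name:

```python
import math

def tone_map_rgb(rs, gs, bs, ls, ls_local, ls_logavg, a=4):
    """Sketch of Equations 7.2-7.4: each mapped channel is 2 raised to a sum
    of base-2 log terms. The denominator terms are shared by all channels."""
    denom = (math.log2(ls + 2621) - math.log2(ls)
             + math.log2(ls_local + a * ls_logavg))
    rd = 2 ** (math.log2(rs + 2752) - denom)
    gd = 2 ** (math.log2(gs + 2542) - denom)
    bd = 2 ** (math.log2(bs + 3328) - denom)
    return rd, gd, bd
```

Because the log-domain sum collapses to a product of ratios, each channel equals (channel + offset) × Ls / ((Ls + 2621) × (Ls_local_average + a × Ls_log_average)), which is what the assertion below checks.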
log2 Rd = log2(Rs + 2752) − log2(Ls + 2621) + log2(Ls) − log2(Ls_local_average + a × Ls_log_average)     (7.2)

log2 Gd = log2(Gs + 2542) − log2(Ls + 2621) + log2(Ls) − log2(Ls_local_average + a × Ls_log_average)     (7.3)

log2 Bd = log2(Bs + 3328) − log2(Ls + 2621) + log2(Ls) − log2(Ls_local_average + a × Ls_log_average)     (7.4)

Table 7.2. Summary of hardware synthesis report.

Device: EP2S130F780C4
Total bits of memory: 3,586,304 / 6,747,840 (53%)
Total logic cells: 44,572 / 106,032 (42%)
Maximum operating frequency: 65.86 MHz
Latency between rows: 3 (controller) + 32 (filter_size/2) + 36 (hardware)
Latency between frames: 32 rows (filter_size/2)

7.2 Results of Hardware Synthesis

The nine-scale approximate Gaussian pyramid mapper for color images was described in VHDL, and synthesized for a Stratix II FPGA device using Altera's Quartus II v5.0 toolset; the mapper was sized in order to accommodate 3 × 32 bit color HDR input images of 1024×768 pixels. Table 7.2 summarizes the synthesis report from Quartus. Comparing the results to Table 6.4, which summarized the results of the synthesis of the mapper for grayscale images, we notice that the mapper for color images requires 12% more internal memory than the mapper for grayscale images; this is because we chose to use 32-bit pixels for the color images, rather than the 28 bits used for the grayscale images. The hardware latency of the mapper for color images is increased because of the new subblocks. The operating frequency is 14.6% less than that of the mapper for grayscale images due to the additional complexity; however, it still meets our target video frame rate. Given that we would like to achieve a video frame rate of 60 frames per second, and that there are (1024+71) × (768+32) or 876,000 pixels in the image when we include the latency between rows and the latency between frames, we need to be able to process 60 × 876,000, that is, 52.56 megapixels per second.
Our mapper for color images, which has a maximum operating frequency of 65.86 MHz, can accommodate this by taking in one pixel per clock. In contrast, the latest hardware implementation of the Reinhard operator [53] achieves 14 frames per second for an HDR color image of the same bit depth and pixel resolution. Thus, our hardware implementation is significantly better in terms of throughput.

7.3 Experiments and Results

The functionality of the nine-scale approximate Gaussian pyramid mapper for color images was verified by simulating the processing of test images using Modelsim and comparing the output to the output of a fixed-point Matlab simulation. Figure 7.2 shows the color version of the set of HDR images from the Debevec library. For display before local tone mapping, the intensity of the pixels was linearly mapped using the following equation:

Rd = 128 × Rs / Rs_mean
Gd = 128 × Gs / Gs_mean     (7.5)
Bd = 128 × Bs / Bs_mean

where Rs_mean, Gs_mean, and Bs_mean are the mean values of the original HDR RGB triplet. Figure 7.3 shows the set of images after being processed by our nine-scale approximate Gaussian pyramid mapper for color images. The fact that the colors look very natural is due to the Hunt approach that is implemented inside our color tone mapping subblock. Like the local tone mapped grayscale images, the mapped color images are free of halo artifacts, with preservation of details and enhancement of edges and contrast. The approximate Gaussian pyramid chooses its local illumination estimate from the different scales based on the selection parameter ε. It selects the largest surround possible around the pixel that does not include a drastic change in illumination. An experiment was conducted to determine the effect of this selection parameter.
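The selection idea, taking the largest surround whose average does not change drastically from the next smaller one, can be sketched as follows. The exact criterion of Section 3.2.2 may differ in detail, so this is an illustrative stand-in with our own helper name:

```python
def select_scale(averages, eps=2 ** -4):
    """Walk the per-scale averages from the smallest surround to the largest
    and keep enlarging while adjacent scales stay close (relative difference
    below eps); stop at the first drastic change in illumination."""
    chosen = 0
    for s in range(len(averages) - 1):
        v, v_next = averages[s], averages[s + 1]
        if abs(v - v_next) > eps * abs(v):
            break                   # illumination changes drastically: stop
        chosen = s + 1
    return chosen
```

With a large eps the walk rarely stops, so large surrounds dominate (good contrast, but halo risk); a small eps stops early, favoring small surrounds (halo-free, lower contrast), which mirrors the trade-off studied in Table 7.3.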
The Nave image, which with its windows is one of the most difficult images from the Debevec library to successfully tone map, was processed with six different values of ε: 2^−1, 2^−2, 2^−3, 2^−4, 2^−5, and 2^−6. Table 7.3 shows the percentage of the time that each of the nine scales was chosen for the six different values of ε. From the table, we see that when ε is large, large scales are selected more often than small scales. As ε gets smaller, small scales are selected more often than large scales. The resulting six images are shown in Figure 7.4, annotated with the ε value used. Using large values of ε gives high contrast images, but halo effects appear as rings close to the windows. Using small values of ε gives images that are free from halo artifacts but suffer from low contrast. The highest contrast possible without visually distracting halo artifacts was obtained using ε = 2^−4, the value we have chosen to use in general.

In summary, we have extended the nine-scale approximate Gaussian pyramid mapper to the color domain. The mapper for color images can process a 3 × 32 bit HDR color image of 1024×768 pixels. It is compatible with a video rate of 60 frames per second. The simulation results showed qualitatively that the processed images have natural color. An experiment on the selection parameter ε used by the approximate Gaussian pyramid to choose its local illumination estimate from the nine different scales has provided insight into how to choose a value for ε that results in good contrast in processed images while avoiding halo artifacts.

Table 7.3. The percentage of the time that each of the nine scales was chosen for the six different values of ε.
Scale   Window size   ε = 2^−1   ε = 2^−2   ε = 2^−3   ε = 2^−4   ε = 2^−5   ε = 2^−6
1       4 × 4         0.94       1.44       2.40       4.10       7.89       16.03
1.5     6 × 6         0.57       1.16       1.88       3.80       8.67       18.29
2       8 × 8         0.07       0.30       0.61       1.31       4.33       12.26
2.5     10 × 10       0.04       0.19       0.33       0.35       0.53       1.36
3       12 × 12       0.80       1.55       2.26       4.21       6.78       6.55
4       16 × 16       5.80       9.01       17.92      29.73      35.54      27.48
8       32 × 32       2.84       3.37       3.93       5.95       7.91       7.12
12      48 × 48       2.07       3.05       4.47       5.77       8.26       6.73
16      64 × 64       86.83      79.89      66.17      44.73      20.05      4.14

Figure 7.2. The set of color HDR images (Vinesunset, GroveD, Memorial, GroveC, Rosette, Nave) used to test the nine-scale approximate Gaussian pyramid mapper for color images.

Figure 7.3. The set of color HDR images (Nave, Rosette, GroveC, GroveD, Vinesunset, Memorial) after being processed by the nine-scale approximate Gaussian pyramid mapper for color images.

Figure 7.4. The effect of the selection parameter ε on the processed Nave image, for ε = 2^−1, 2^−2, 2^−3, 2^−4, 2^−5, and 2^−6.

CHAPTER VIII

CONCLUSIONS

8.1 Summary of Accomplished Research

The purpose of this work was to implement local tone mapping operators in real-time on an FPGA platform, using approximations to existing algorithms that reduce complexity without impacting performance. The three different local TMOs that were implemented are the MGIP mapper, a four-scale constant weight pyramid mapper and a four-scale approximate Gaussian pyramid mapper. In addition, a nine-scale version of the approximate Gaussian pyramid mapper was developed. The nine-scale approximate Gaussian pyramid mapper was also extended to the color domain. The different local TMOs were described in VHDL, and synthesized for a Stratix II FPGA device using Altera's Quartus II v5.0 toolset. Processing of test images by the hardware architectures was simulated using Modelsim. Functionality of all systems was verified by comparing the images produced by hardware simulation with those produced by fixed-point Matlab simulations.
The different algorithms were compared in terms of both their hardware complexity and their image processing performance, evaluated both qualitatively and quantitatively. The hardware complexity was taken from the synthesis report of the Quartus tool; of main interest is the utilization of logic cells and memory bits. Performance was measured quantitatively in terms of the peak signal-to-noise ratio of the output images produced by the hardware architectures compared to a gold-standard floating-point version of the Reinhard operator. Performance was measured qualitatively by noting the absence of halo artifacts, and by looking for contrast, edge enhancement, and the preservation of details in the processed images. We were also interested in the naturalness of color in images processed by the nine-scale approximate Gaussian pyramid mapper for color images.

8.2 Results of Hardware Synthesis and Experiments

All of the grayscale local TMOs were sized to accommodate high-resolution images of high dynamic range, with 1024 × 768 pixels and 28 bits per pixel. The nine-scale approximate Gaussian pyramid mapper for color images targeted a 3 × 32-bit color HDR image of 1024 × 768 pixels. The operating frequencies range from 93 MHz for the four-scale constant weight pyramid mapper to 65 MHz for the nine-scale approximate Gaussian pyramid mapper for color images. While the more complex mappers were slower, the operating frequencies of all of the local TMOs were compatible with a video rate of 60 frames per second.

In conclusion, the complexity of the hardware increased with the performance of the algorithm. For example, images processed by the simplest mapper, the MGIP mapper, suffered from halo artifacts. Moreover, the PSNR numbers of the MGIP mapper when compared to the Reinhard operator were the smallest. On the other hand, it had the lowest utilization of logic cells and memory bits of all the local TMOs.
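The PSNR figure of merit used above is the standard one. As a point of reference, it can be sketched as follows; the function names and the default peak value are our choices for illustration, not taken from the dissertation.

```python
import numpy as np

def psnr(gold, hardware, peak=255.0):
    """Peak signal-to-noise ratio, in dB, of a hardware-produced image
    against a gold-standard image (e.g., the output of a floating-point
    Reinhard operator). peak is the maximum representable pixel value."""
    err = np.asarray(gold, dtype=np.float64) - np.asarray(hardware, dtype=np.float64)
    mse = np.mean(err ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak * peak / mse)
```

For intuition, an 8-bit image that differs from the gold standard by exactly one gray level at every pixel scores about 48 dB; a bit-exact match gives an infinite PSNR.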
The nine-scale version of the Gaussian pyramid mapper gave us high-contrast tone-mapped images without halo artifacts. Its performance was better than that of the four-scale version of the mapper. It should be noted that the nine-scale architecture did not require significantly more memory than the four-scale architecture; this was due to a clever organization of the internal memory available in the FPGA. However, the number of logic cells increased due to the additional complexity of the filters. The nine-scale approximate Gaussian pyramid mapper for color images produced output images whose color was natural, and was also designed to target images of larger dynamic range. We believe that the nine-scale approximate Gaussian pyramid for color images is a promising algorithm for real-time display of HDR color images on standard LCD screens.

8.3 Future Work

The implemented local TMOs are yet to be embedded inside practical systems such as future HDR digital cameras or LCD screens. In order to do so, the design of all algorithms should be parameterized to deal with images of different input dynamic ranges and displays of different resolutions. Testing the local TMOs on HDR video streams, instead of the current set of still images derived from the Debevec library, is also a challenge. In order to tackle this challenge, the algorithms should be implemented on an FPGA board with the necessary capabilities, such as external memory and serial connections, and carrying the FPGA chip targeted by our synthesis tool. It would also be necessary to find a source of HDR video streams.

Another area of interest is related to the size and storage of HDR color images. An HDR color image in uncompressed form requires a large amount of memory, and compression of the image for storage is highly desirable. JPEG2000 is a commonly used standard for compression of images of ordinary dynamic range.
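One way such a compression standard could be combined with a local TMO is sketched below as a toy model. A single global illumination estimate stands in for the local one, no actual JPEG2000 coding is performed, and all function names are our assumptions; the point is only the splitting of an HDR image into a tone-mapped field and an illumination field, and its exact inversion.

```python
import numpy as np

def encode_fields(hdr_rgb):
    """Split an HDR RGB image into two fields for storage: a reduced-range
    tone-mapped triplet and an illumination estimate. Here a global mean
    luminance stands in for a true local illumination estimate."""
    lum = hdr_rgb.mean(axis=-1)                # per-pixel luminance
    illum = np.full(lum.shape, lum.mean())     # illumination estimate field
    ldr_rgb = hdr_rgb / illum[..., None]       # tone-mapped RGB field
    return ldr_rgb, illum

def decode_fields(ldr_rgb, illum, with_illum=True):
    """Receiver side: return the mapped triplet alone, or recombine both
    fields to restore the original HDR triplet. Without the illumination
    field the HDR image cannot be recovered."""
    if not with_illum:
        return ldr_rgb
    return ldr_rgb * illum[..., None]
```

Because the tone map here is a plain division by the stored illumination, recombining the two fields reconstructs the HDR triplet exactly; with lossy compression of each field, reconstruction would instead be approximate.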
It would be possible to use the local tone mappers developed here in conjunction with JPEG2000. Three channels of a JPEG2000 engine could be used to compress a 3 × 12-bit local tone-mapped RGB triplet. At the same time, it makes sense to store local illumination estimate information about the image, because without this information the HDR image cannot be restored. A separate 12-bit channel of the JPEG2000 engine could be used to compress the local illumination estimate information. At the receiver side, the user can choose to get the mapped RGB triplet alone, or to combine data from both fields to get the original HDR RGB triplet. We are currently pursuing this idea, and the preliminary experiments are very promising [57]. The color Memorial image, which is 4.5 MB when uncompressed, can be reduced in size by the proposed encoder to 231 KB, while still having a high PSNR of 60 dB.

A third area of interest is energy saving for LCD screens. Currently available LCD displays require two power sources, one for the backlight and one to drive the individual pixel elements. The display backlight is the single largest power consumer in a typical portable apparatus, accounting for almost 30-50% of the battery drain with the display at maximum luminance [56]. Following the concept described in [46], we propose a backlight scaling technique that is based on embedded local TMOs. The TMO maps the original image to a transformed image such that the perceived brightness of the image is preserved while its dynamic range is reduced. This reduction in the dynamic range of the image increases the potential for backlight scaling, and therefore for saving energy consumed by the backlight. We are also planning to pursue the idea of generalizing the design concepts used in this dissertation to other applications.
The Gaussian pyramid implemented here is used not only for local tone mapping, but also for other image processing applications; it is a general tool for multiscale image representation. Thus, our hardware-friendly approximation to the Gaussian pyramid is likely to help in doing other kinds of image processing in real time as well. Also, our memory organization design can be generalized to other image processing applications that use multiple windows around a pixel and apply symmetric extensions on the borders of the images. Finally, we believe that our method for calculating the base 2 logarithm of a number, used here in calculating the log average of the pixels in an image, can be used for other real-time embedded applications that require logarithmic arithmetic.

REFERENCES

[1] K. Chui, M. Herf, P. Shirley, S. Swamy, C. Wang, and K. Zimmerman, “Spatially Nonuniform Scaling Functions for High Contrast Images,” Proceedings of the Graphics Interface, pp. 245-253, 1993.

[2] E. H. Land and J. J. McCann, “Lightness and Retinex Theory,” Journal of the Optical Society of America, vol. 31, pp. 1-11, 1971.

[3] D. Gadia, D. Marini, and A. Rizzi, “Tuning Retinex for HDR image visualization,” Proceedings of the Second European Conference on Color in Graphics, Imaging, and Vision and Sixth International Symposium on Multispectral Color Science, pp. 326-331, 2004.

[4] D. Marini and A. Rizzi, “A computational approach to color adaptation effects,” Image and Vision Computing, vol. 18, pp. 1005-1014, 2000.

[5] R. Kimmel, M. Elad, D. Shaked, R. Keshet, and I. Sobel, “A Variational Framework for Retinex,” International Journal of Computer Vision, vol. 52, pp. 7-23, 2003.

[6] M. Elad, R. Kimmel, D. Shaked, and R. Keshet, “Reduced complexity Retinex algorithm via the variational approach,” Journal of Visual Communication and Image Representation, vol. 14, pp. 369-388, 2003.

[7] D. J. Jobson, Z. Rahman, and G. A. Woodell, “Properties and performance of a center/surround Retinex,” IEEE Transactions on Image Processing, vol. 6, pp.
451-462, 1997.

[8] D. J. Jobson, Z. Rahman, and G. A. Woodell, “A multiscale Retinex for bridging the gap between color images and the human observation of scenes,” IEEE Transactions on Image Processing, vol. 6, pp. 965-976, 1997.

[9] Z. Rahman, D. J. Jobson, and G. A. Woodell, “Multi-scale Retinex for color image enhancement,” Proceedings of the International Conference on Image Processing, vol. 3, pp. 1003-1006, 1996.

[10] G. D. Hines, Z. Rahman, D. J. Jobson, and G. A. Woodell, “DSP implementation of the Retinex image enhancement algorithm,” Proceedings of the SPIE Visual Information Processing XIII, vol. 5438, pp. 13-24, 2004.

[11] L. Tao and V. K. Asari, “An integrated neighborhood dependent approach for nonlinear enhancement of color images,” Information Technology: Coding and Computing, vol. 2, pp. 138-139, 2004.

[12] M. Herscovitz and O. Yadid-Pecht, “A modified multi scale Retinex algorithm with an improved global impression of brightness for wide dynamic range pictures,” Machine Vision and Applications, vol. 15, pp. 220-228, 2004.

[13] K. Barand and B. Funt, “Investigations into multi-scale Retinex,” Color Imaging Vision and Technology, pp. 9-17, 1999.

[14] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 267-276, 2002.

[15] E. Reinhard, “Parameter Estimation for Photographic Tone Reproduction,” Journal of Graphics Tools, vol. 7, pp. 45-52, 2002.

[16] B. K. Barladian, A. G. Voloboi, V. A. Galaktionov, and E. A. Kopylov, “An effective tone mapping operator for high dynamic range images,” Programmirovanie, vol. 30, 2004.

[17] N. Goodnight, R. Wang, C. Woolley, and G. Humphreys, “Interactive time-dependent tone mapping using programmable graphics hardware,” ACM International Conference Proceeding Series: 14th Eurographics Workshop on Rendering, vol. 44, pp. 26-37, 2003.

[18] N. Goodnight, R. Wang, and G.
Humphreys, “Computation on programmable graphics hardware,” IEEE Computer Graphics and Applications, vol. 25, pp. 12-15, 2005.

[19] J. Tumblin and G. Turk, “LCIS: a boundary hierarchy for detail-preserving contrast reduction,” Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 83-90, 1999.

[20] R. Fattal, D. Lischinski, and M. Werman, “Gradient domain high dynamic range compression,” Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 249-256, 2002.

[21] R. Raskar, A. Ilie, and J. Yu, “Image fusion for context enhancement and video surrealism,” Proceedings of the 3rd International Symposium on Non-photorealistic Animation and Rendering, pp. 85-93, 2004.

[22] K. V. Velde, “Multi-scale color image enhancement,” Proceedings of the International Conference on Image Processing, vol. 3, pp. 584-587, 1999.

[23] R. Xu and S. N. Pattanaik, “High dynamic range image display using level set framework,” Proceedings of the 11th International Conference in Central Europe on Computer Graphics, Visualization and Digital Interactive Media, vol. 11, pp. 530-537, 2003.

[24] S. Osher and J. A. Sethain, “Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations,” Journal of Computational Physics, vol. 79, pp. 12-49, 1988.

[25] F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high dynamic range images,” Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 257-266, 2002.

[26] E. Eisemann and F. Durand, “Flash photography enhancement via intrinsic relighting,” ACM Transactions on Graphics, vol. 23, pp. 673-678, 2004.

[27] P. Choudhury and J. Tumblin, “The trilateral filter for high contrast images and meshes,” Proceedings of the Eurographics Symposium on Rendering 2003, pp. 186-196, 2003.

[28] P. Ledda, L. P. Santos, and A.
Chalmers, “A local model of eye adaptation for high dynamic range images,” Proceedings of the 3rd International Conference on Computer Graphics, Virtual Reality, Visualization and Interaction in Africa, pp. 151-160, 2004.

[29] S. N. Pattanaik, J. A. Ferwerda, M. D. Fairchild, and D. P. Greenberg, “A multiscale model of adaptation and spatial vision for realistic image display,” Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 287-298, 1998.

[30] P. J. Burt and E. H. Adeleson, “A multiresolution spline with applications to image mosaics,” ACM Transactions on Graphics, vol. 2, pp. 217-236, 1983.

[31] P. J. Burt and E. H. Adeleson, “The Laplacian Pyramid as a compact image code,” IEEE Transactions on Communications, vol. 31, no. 4, pp. 532-540, 1983.

[32] M. D. Fairchild and G. M. Johnson, “Meet iCAM: An image color appearance model,” IS&T/SID 8th Color Imaging Conference, pp. 33-38, 2002.

[33] G. Johnson and M. Fairchild, “Rendering HDR images,” Proceedings of the IS&T/SID 11th Color Imaging Conference, pp. 36-41, 2003.

[34] Y. Monobe, H. Yamashita, T. Kurosawa, and H. Kotera, “Dynamic range compression preserving local image contrast for digital video camera,” IEEE Transactions on Consumer Electronics, vol. 51, pp. 1-10, 2005.

[35] A. Rizzi, C. Gatta, B. Piacentini, M. Fierro, and D. Marini, “Human-visual-system-inspired tone mapping algorithm for HDR images,” SPIE Human Vision and Electronic Imaging IX, vol. 5292, pp. 57-68, 2004.

[36] A. Sa, M. B. Vieira, P. C. Carvalho, and L. Velho, “Range-enhanced active foreground extraction,” Proceedings of the IEEE International Conference on Image Processing, vol. 2, pp. 81-84, 2005.

[37] O. Tsujii, M. T. Freedman, and S. K. Mu, “Anatomic region-based dynamic range compression for chest radiographs using warping transformation of correlated distribution,” IEEE Transactions on Medical Imaging, vol. 17, pp.
407-418, 1998.

[38] G. Krawczyk, K. Myszkowski, and H.-P. Seidel, “Lightness perception in tone reproduction for high dynamic range images,” Proceedings of the Computer Graphics Forum, European Association for Computer Graphics 26th Annual Conference, vol. 24, pp. 635-645, 2005.

[39] J. Tumblin, J. K. Hodgins, and B. K. Guenter, “Two methods for display of high contrast images,” ACM Transactions on Graphics, vol. 18, pp. 56-94, 1999.

[40] H. Kaiqi, W. Zhenyang, and W. Qiao, “Image enhancement based on the statistics of visual representation,” Image and Vision Computing, vol. 23, pp. 51-57, 2005.

[41] H. Kaiqi, W. Zhenyang, F. George, and C. Francis, “Color image denoising with wavelet thresholding based on human visual system model,” Signal Processing: Communication, vol. 20, pp. 115-127, April 2004.

[42] H. Kaiqi, W. Zhenyang, and W. Qiao, “Color image enhancement and evaluation algorithm based on human visual system,” IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 721-724, 2004.

[43] H. Seetzen, W. Heidrich, W. Stuerzlinger, G. Ward, L. Whitehead, M. Trentacoste, A. Ghosh, and A. Vorozcovs, “High dynamic range display systems,” ACM Transactions on Graphics, vol. 23, pp. 760-768, 2004.

[44] P. Ledda, G. Ward, and A. Chalmers, “A wide field, high dynamic range, stereographic viewer,” Proceedings of the 1st International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, pp. 237-244, 2003.

[45] P. Ledda, A. Chalmers, and H. Seetzen, “HDR displays: a validation against reality,” Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 2777-2782, 2004.

[46] A. Iranli and M. Pedram, “DTM: dynamic tone mapping for backlight scaling,” Proceedings of the 42nd Design Automation Conference, pp. 612-616, 2005.

[47] E. Phillips, G. Ward, and T. J.
Ayres, “Faithful presentation of luminance contrast: evaluation of photographic and computational display methods,” The International Society for Optical Engineering Human Vision and Electronic Imaging VII, vol. 5007, pp. 450-455, 2003.

[48] M. Cadik and P. Slavik, “The naturalness of reproduced high dynamic range images,” Proceedings of the Ninth International Conference on Information Visualisation, pp. 920-925, 2005.

[49] P. Ledda, A. Chalmers, T. Troscianko, and H. Seetzen, “Evaluation of tone mapping operators using a high dynamic range display,” ACM Transactions on Graphics, vol. 24, pp. 640-648, 2005.

[50] P. E. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs,” Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 369-378, 1997.

[51] T. N. Hau, T. Li, and K. A. Vijayan, “A nonlinear technique for enhancement of color images: an architectural perspective for real-time applications,” Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop, pp. 124-129, 2004.

[52] H. J. Takto, D. R. Sebok, and T. J. Sebok, “Dynamic range enhancement for imaging sensors,” Lockheed Martin Corporation, United States Patent no. US 6,501,504 B1, Dec. 2002.

[53] G. Krawczyk, K. Myszkowski, and H.-P. Seidel, “Perceptual effects in real-time tone mapping,” Proceedings of the 21st Spring Conference on Computer Graphics, pp. 195-202, 2005.

[54] R. Hunt, The Reproduction of Colour in Photography, Printing and Television, 5th Edition, Fountain Press, 1995.

[55] I. Koren, Computer Arithmetic Algorithms, 2nd Edition, A K Peters, 2002.

[56] J. Williams, “A fourth generation of LCD backlight technology,” Linear Technology Application Note, vol. 65, Nov. 1995.

[57] F. Hassan and J. E. Carletta, “A high throughput encoder for high dynamic range images,” Proceedings of the IEEE International Conference on Image Processing, Sep. 2007.