Performance Analysis of Overlapped Motion Compensated Temporal Interpolation using Open Multiprocessing
Motion Compensation is an essential part of many video compression techniques, and video compression is required for the storage and transmission of video. Motion Compensation is a computationally complex and data intensive process. Overlapped Motion Compensated Temporal Interpolation (OMCTI) is a block based approach for the temporal interpolation of skipped frames. It generates interpolated frames with considerably improved quality. In this paper, an Open Multiprocessing (OpenMP) based multithreaded solution is proposed to reduce the computation time of Overlapped Motion Compensated Temporal Interpolation. The OpenMP based solution is tested on a multi-core processor to evaluate its performance. The paper concludes with a discussion of the experimental results.
2012 1st International Conference on Future Trends in Computing and Communication Technologies

Madiha Sher (1), Muhammad Asif Manzoor (2), Munaza Sher (1), Nasru Minallah (1)
(1) Department of Computer Systems Engineering, University of Engineering & Technology, Peshawar, Pakistan
(2) Department of Computer Engineering, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia

Index terms -- Motion Compensated Temporal Interpolation (MCTI), Overlapped Motion Compensated Temporal Interpolation (OMCTI), Block based search, OpenMP, Multithreading.

I. INTRODUCTION

Many modern multimedia applications require low bit rate transmission of video sequences. This limitation is imposed by the limited bandwidth of the transmission channel, so video compression plays a very significant role in these applications. Every digitized video contains a substantial amount of redundant data, and compression is achieved by taking advantage of these redundancies. The redundant data contained in a video is generally classified into two categories: statistical redundancy and subjective redundancy. The objective of video compression is to remove both spatial and temporal domain redundancies. Motion Compensation is widely used in video compression because of its capability to exploit the high temporal correlation between consecutive frames of a video sequence. To meet a low bit rate encoding requirement, temporal subsampling is a very valuable technique, and it can be combined with any video compression method to gain a very high compression ratio. The skipped frames must then be reconstructed at the receiver end in order to achieve a high frame rate. Simple frame reconstruction methods, such as frame repetition at the receiver end, can produce undesirable results in the output video sequence.

Motion Compensated Temporal Interpolation [1] (MCTI) is a technique proposed by Chi-Kong Wong et al. for the generation of the skipped frames. MCTI is a block based Motion Compensation algorithm. In MCTI, block motions are compensated by tracking the blocks between successive received frames at the receiver end. The trajectory of each block is calculated, and the blocks are then placed at the appropriate locations in the interpolated frames according to the calculated trajectories. Because MCTI is a block based algorithm, the inserted frames tend to be blocky. In order to reduce this blocky effect, an Overlapped Motion Compensated Temporal Interpolation [2] (OMCTI) technique is proposed by Chi-Kong Wong et al. The main disadvantage of these motion compensation algorithms is their massive computational requirements.

Parallel programming approaches can be used to overcome the massive computational requirements of motion compensation. Multithreading is an efficient parallel programming model for improving the computing capability of the underlying system. OpenMP (OMP) is a parallel programming API which is used to develop multithreaded applications for shared memory architecture systems. These systems can have one or multiple cores. In this paper, we discuss the OpenMP based multithreaded implementation of Overlapped Motion Compensated Temporal Interpolation for the construction of skipped frames.

The rest of the paper is organized as follows. Section II reviews the state-of-the-art literature, giving an overview of some motion compensation algorithms and some OpenMP applications. MCTI, OMCTI and the parallel implementation are discussed in Section III. Our experimental results are discussed in Section IV. Finally, Section V concludes this work.

II. LITERATURE REVIEW

In [3], a zonal algorithm is used for motion compensation. This algorithm requires less computation and produces efficient results. The predicted frame is constructed from the current frame and the previous frame by tracking the macroblocks of the previous frame in the current frame. To improve the quality of the interpolated frame, Multihypothesis Motion Compensated Temporal Interpolation is used. The proposed technique reduces artifacts by identifying whether the examined frame can be predicted or not.

Wavelet based compression produces better subjective quality than block transformation coding, as it does not produce blocking artifacts [4]. In this approach, frames are split into blocks and a 2D motion vector is computed per block. For motion estimation, the center frame of the group is split into moving regions that have similar motion. Prediction of the frames adjacent to the central frame is based on these segments.

The use of a background frame as a reference frame to improve the accuracy of motion compensation for video compression is presented in [5]. Among a certain number of continuous frames, the blocks that have remained unchanged are assembled into a background frame. Moving objects do not influence the background frame, so better prediction is obtained when it is used as a reference along with the traditional previous frame. This algorithm also increases the coding efficiency.

Temporal scalability is not supported where there is no motion, as it results in blurring of the low frequency frames [6]. The computational complexity of Motion Compensated Temporal Filtering (MCTF) can be reduced by applying MCTF only to those blocks which have some motion, and applying temporal filtering to the remaining blocks. Temporal filtering is applied along the motion trajectory to remove the blurring effect in the low frequency frames and to reduce the energy in the high frequency frames, which improves compression efficiency.

In [7], simulation results for various algorithms (blot search, cross search, logarithmic search) for motion compensation are obtained. On the basis of results obtained from a software reference model, blot search proves to be ideal for hardware implementation on FPGA. In [7], an RS232 interface with a maximum baud rate of 115200 bps is used for the considered experimental setup. The designed motion estimation unit is able to process 55 frames/sec of High Definition TV (HDTV) at a resolution of 1920 x 1080 pixels, at a frequency of 116 MHz and a power consumption of 543 mW.

In [8], a pipelined implementation of the Gauss-Jordan algorithm is done using OpenMP programming. In this implementation, each of the p threads is assigned [n/p] contiguous rows of the matrix, where n is the number of matrix rows. Each thread executes n successive steps of the algorithm. The pipelined implementation proves to be better than other parallel versions of Gauss-Jordan, such as RowBlock and RowCyclic.

With multicore technology being widely used, multithreading methods are used to improve computation capacity by executing multiple threads of the same application on different cores. Chen Lin et al. [9] have discussed a multithreaded PDE-based image registration process using OpenMP on a dual core computer. In this work, the image is divided into four sub images, and every sub image is treated as an independent task to be performed on a separate core to improve the efficiency of the underlying system.

III. METHODOLOGY

A. Motion Compensated Temporal Interpolation

MCTI is a technique for up-sampling the received video sequence at the destination in order to interpolate the skipped frames. Generally, for the sake of improved compression efficiency, frames are skipped to meet the low bit rate requirement of the transmitted video sequence. The interpolated frames are inserted in-between the successively received video frames. This insertion increases the effective frame rate of the received video, which results in smoother object motion. The MCTI algorithm assumes that the motion of an object is translational and quite slow, so that between successive video frames the object motion is almost linear with respect to time. For any kth frame, the objective of temporal interpolation is to construct a new frame to be added between the present kth and the previous (k-1)th received frames at the decoder side, as shown in Fig. 1.

Figure 1. Motion Estimation and Interpolation

In MCTI, all three frames (the inserted frame, the present kth frame and the previous (k-1)th frame) are partitioned into blocks of fixed size N x N. A motion vector is calculated for every block. To find this motion vector, backward and forward block based motion estimation is performed. The Mean Absolute Difference (MAD) is selected as the matching measure for each block because of its simplicity. For a block A situated at position (x, y) in the currently received kth frame, a search area of size (2W + 1) x (2W + 1) is defined in the previously received (k-1)th frame. A motion search is performed in this search area using MAD as the distortion measure. The MAD between the currently received frame block at position (x, y) and the previously received frame block at position (x + m, y + n) is defined as:

$$ MAD_{(x,y)}(m, n) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| f_k(x+i,\, y+j) - f_{k-1}(x+m+i,\, y+n+j) \right| \qquad (1) $$

where f_k and f_{k-1} denote the pixel intensities of the currently and previously received frames.

The block having the minimum MAD_{(x,y)}(m, n) over all locations (m, n) within the defined (2W + 1) x (2W + 1) search area is considered the best matched block. In this way the motion vector is calculated for every block of the inserted frame, and Motion Compensation is then used to construct the interpolated frame.

B. Block Based Search for Motion Vector Estimation

A block based search algorithm determines the displacement of an object on a block by block basis. For every block in the current kth frame, a block is matched from the previous (k-1)th frame according to a certain criterion. In the case of MCTI, the Mean Absolute Difference is used as the selection criterion. Full search is the most basic algorithm for block based Motion Compensation. In full search, every block of the current kth frame is compared with every possible block in the previous (k-1)th frame within the fixed size window, and the best match is found. Although full search is very simple and gives the best Peak Signal-to-Noise Ratio among all the block based search algorithms, it is the worst algorithm in terms of the computation required.

Several other algorithms reduce the computation time compared to full search at the expense of degraded quality. These are sub-optimal block based search algorithms. One of them is the Diamond Search algorithm [10], which uses a diamond shape as its basic search pattern. The algorithm places no limit on the number of search steps taken to reach the solution. It uses two fixed patterns: the Large Diamond Search Pattern (LDSP) and the Small Diamond Search Pattern (SDSP). Fig. 2 shows the two patterns and the search method. The Diamond Search algorithm works as follows:

1. Initially the LDSP is centered at the origin of the fixed size search window, and the 9 points of the LDSP are tested. If the minimum MAD point is located at the center, go to Step 3; otherwise, go to Step 2.
2. The minimum MAD point found in the previous search step is re-positioned as the center point to form a new LDSP. If the new minimum MAD point is located at the center position, go to Step 3; otherwise, repeat Step 2.
3. Switch the search pattern from LDSP to SDSP. The minimum MAD point found in this step is the final motion vector, which refers to the best matched block.

Figure 2. Diamond Search Algorithm

C. Overlapped Motion Compensated Temporal Interpolation

Frames interpolated using MCTI contain visible defects. Because MCTI is a block based temporal interpolation scheme, it depends upon only one motion vector for every block, and this motion vector is estimated by using the MAD between the best matched blocks of the currently received frame and the previously received frame, as discussed in the previous section. Every block is therefore created without considering the neighboring blocks. This dependence on a single motion vector yields the blocky effect in the newly constructed frame. To overcome this problem, overlapped motion compensated temporal interpolation is used. OMCTI utilizes the motion vector of the block itself along with the motion vectors of its neighboring blocks to interpolate the blocks of the inserted frame. A block A is accompanied by 4 neighboring blocks B, C, D and E, as shown in Fig. 3. Assume that their corresponding motion vectors are calculated using MCTI and are represented by Va, Vb, Vc, Vd and Ve respectively. Every block is partitioned into 4 sub blocks; block A is divided into A1, A2, A3 and A4. For the interpolation of a sub block in the inserted frame, the block's own motion vector and the motion vectors of two adjacent blocks are used. Weighted masks are used in such a way that the weight of the block's own motion vector is the highest and the sum of all the weights for a sub block is unity. The motion vectors Va, Vb and Vc are used for the prediction of sub block A1. Similarly, the motion vectors Va, Vc and Vd, along with their respective weights, are used for the interpolation of sub block A2. This overlapping scheme reduces the blocky effect in the resulting video.

Figure 3. Sub Blocks for overlapped MCTI

D. Parallel Implementation

Video coding techniques are very important for removing redundant information from video in order to reduce its memory requirements for transmission and storage. Motion Compensation is a central element of many video codecs. Motion Compensation algorithms rely on a search algorithm to find the best match between two consecutive frames received at the destination and generate an intermediate frame in order to increase the frame rate of the received video. The search algorithms for block matching are the computationally expensive portion of Motion Compensation, and these computational requirements can make it impractical. In this paper, we propose a parallel implementation of the search algorithms using OpenMP in order to reduce the computation time. OpenMP is a parallel programming API which is used to develop multithreaded applications for shared memory architecture systems. In this approach, multiple threads are created for searching; each thread performs the search on different blocks to find the best match, and the results are combined after all threads have completed. In this way the search is performed on several blocks simultaneously, resulting in an overall reduction in computation time.

The main theme of this work is to implement the basic algorithm of Overlapped Motion Compensated Temporal Interpolation using the OpenMP API to reduce its computation time. Performance in terms of timing is improved by using this OpenMP based technique without making any modification to the algorithm. As discussed, the search algorithms used for temporal interpolation are the computationally intensive part of the algorithm, hence only the search algorithms are implemented using OpenMP directives. The searching process is handled by multiple threads, while the rest of the algorithm (before and after searching) is executed by only one thread. The number of threads varies from two to five. The time required by this multithreaded approach is compared with the time required by the sequential single threaded approach. The change in required time is expressed as a percentage to indicate the performance improvement.

IV. EXPERIMENTAL RESULTS

A. System Platform and Experimental Process

For our experimental evaluation we used an Intel (R) Core i3-2310M CPU with two processor cores, a 2.10 GHz clock speed and 3 GB of memory. The system ran Windows 7 and Microsoft Visual Studio 2010 Ultimate. All programs were implemented in C using the OpenMP interface.

Several sets of test videos were used to evaluate the performance of the parallel algorithm. The practical execution time was used as the measure for comparison. Practical execution time is the total time in seconds an algorithm needs to complete the computation. To decrease random variation, the execution time was measured as an average of 10 runs.

B. Evaluation and Analysis of Performance Model

The proposed technique is applied to various video sequences of different formats, such as QCIF and CIF, to show the actual performance improvement. The performance is improved because of the OpenMP based multithreaded implementation of the block search algorithm. Both search algorithms, full search and diamond search, are used for the evaluation of the results. Creation of threads also incurs some computational cost. Different numbers of threads are generated to further evaluate the impact of multithreading on the performance. The performance enhancements for the different video sequences are shown in Figs. 4 to 7.

Figure 4. CIF format, Diamond search
Figure 5. QCIF format, Diamond search
Figure 6. CIF format, Full Search
Figure 7. QCIF format, Full Search

The percentage performance improvement is shown on the y-axis, while the number of threads created is shown on the x-axis. The figures show that the proposed solution gives the best performance when three or four threads are created. The first two figures show the performance improvement when using the diamond search algorithm, whereas the next two figures show the performance improvement when using the full search algorithm. As full search is computationally more expensive, its performance improves more.

Fig. 4 shows the result of the proposed solution for the CIF video format using the diamond search algorithm.
Fig. 5 shows the result of the proposed solution for the QCIF video format using the diamond search algorithm.
Fig. 6 shows the result of the proposed solution for the CIF video format using the full search algorithm.
Fig. 7 shows the result of the proposed solution for the QCIF video format using the full search algorithm.

The figures show the percentage improvement of the OpenMP based approach when executing it with different numbers of threads. This percentage is calculated as the difference between the execution times of the sequential and the multithreaded applications, divided by the execution time of the single threaded application. The best results are achieved by using three or four threads. The proposed approach shows approximately 30% improvement when it is implemented using diamond search as the searching algorithm and about 50% improvement when implemented using the full search algorithm. The performance is improved in every case, but the magnitude of the improvement differs, and the main reason for this difference is the number of threads. When too few threads are generated, the application cannot fully utilize the multi-core architecture. When the number of threads is increased too far, a major portion of the time is consumed by the creation of the threads and by context switching between them, which turns multithreading into an overhead. Hence the number of threads must balance both concerns; it can be neither very low nor very high. The optimal number of threads for our work is three to four.

V. CONCLUSION

In this paper, we have proposed an OpenMP based multithreaded implementation of Overlapped Motion Compensated Temporal Interpolation. Motion Compensated Temporal Interpolation produces a blocky effect in the constructed video. Overlapping reduces this blocky effect, which results in a better video output. With the inclusion of the multithreading approach, the computation time is reduced. Although some time is required for the creation of threads, the overall performance enhancement is sufficient to justify this overhead. The proposed scheme is tested on various video sequences of different formats, and the results are presented in the paper.

REFERENCES

[1] Chi-kong Wong and Oscar C. Au, "Fast Motion Compensated Temporal Interpolation for Video", Proc. SPIE: Visual Communication and Image Processing, 1995.
[2] Chi-kong Wong, Oscar C. Au, and C. W. Tang, "Motion Compensated Temporal Interpolation for video", Proc. of IEEE ISCAS, 1996.
[3] Alexis M. Tourapis, Hye-Yeon Cheong, Ming L. Liou, and Oscar C. Au, "Temporal Interpolation Of Video Sequences Using Zonal Based-Algorithms", Proc. of Conference on Image Processing, 2001.
[4] Danny Lazar and Amir Averbuch, "Wavelet Video Compression Using Region Based Motion Estimation And Compensation", Proc. of IEEE International Conference on Acoustics, Speech and Video Processing, 2001.
[5] Rong Ding, Qionghai Dai, Wenli Xu, Dongdong Zhu, and Hao Yin, "Background-frame Based Motion Compensation for Video Compression", Proc. of IEEE International Conference on Multimedia and Expo, 2004.
[6] Karunakar A. K. and Manohara Pai M. M., "Computationally Efficient MCTF for Scalable Video Coding", Proc. of IEEE International Conference on Advance Computing and Communication, 2006.
[7] Vikram Arkalgud Chandrasetty and Shridhar R Laddh, "A Novel Dual Processing Architecture for Implementation of Motion Estimation Unit of H.264 AVC on FPGA", IEEE Symposium on Industrial Electronics and Applications, 2009.
[8] Panagiotis D. Michailidis and Konstantinos G. Margaritis, "Open Multi Processing (OpenMP) of Gauss-Jordan Method for Solving System of Linear Equations", Proc. of IEEE 11th International Conference on Computer and Information Technology, 2011.
[9] Lin Yang, Leiguang Gong, Hong Zhang, John L. Nosher, and David J. Foran, "A Multicore Based Parallel Image Registration Method", Proc. of 31st Annual International Conference of the IEEE EMBS, 2009.
[10] Shan Zhu and Kai-Kuang Ma, "A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation", IEEE Transactions on Image Processing, 2000.
[11] Sherief M. Hashimaa, Imbaby I. Mahmoud, and Atef A. Elazm, "Hardware Implementation of Diamond Search Algorithm for Motion Estimation and Object Tracking", Proc. of the 7th Conference on Nuclear and Particle Physics, 2009.
[12] C. W. Tang and O. C. Au, "Unidirectional Motion Compensated Temporal Interpolation", IEEE International Symposium on Circuits and Systems, 1997.
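
The MAD criterion of Section III-A and the full search of Section III-B can be illustrated with a minimal C sketch. It is not the authors' code: the names `MotionVector`, `block_mad` and `full_search`, the 8-bit row-major frame layout, and the choice to skip candidates that fall outside the frame are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Displacement (m, n) of the best match and its MAD score (hypothetical type). */
typedef struct {
    int m;
    int n;
    double mad;
} MotionVector;

/* MAD between the N x N block at (x, y) in cur and the block at
   (x + m, y + n) in prev. Frames are assumed 8-bit, row-major, width w.
   The caller must ensure that both blocks lie inside the frame. */
double block_mad(const unsigned char *cur, const unsigned char *prev,
                 int w, int x, int y, int m, int n, int N)
{
    long sum = 0;
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            int a = cur[(y + j) * w + (x + i)];
            int b = prev[(y + n + j) * w + (x + m + i)];
            sum += abs(a - b);
        }
    }
    return (double)sum / (double)(N * N);
}

/* Full search: test every displacement in the (2W+1) x (2W+1) window and
   keep the one with minimum MAD. Out-of-frame candidates are skipped
   (an assumption; the paper does not say how borders are handled). */
MotionVector full_search(const unsigned char *cur, const unsigned char *prev,
                         int w, int h, int x, int y, int N, int W)
{
    assert(N > 0 && W >= 0);
    assert(x >= 0 && y >= 0 && x + N <= w && y + N <= h);

    MotionVector best = {0, 0, block_mad(cur, prev, w, x, y, 0, 0, N)};
    for (int n = -W; n <= W; n++) {
        for (int m = -W; m <= W; m++) {
            if (x + m < 0 || y + n < 0 || x + m + N > w || y + n + N > h)
                continue;
            double d = block_mad(cur, prev, w, x, y, m, n, N);
            if (d < best.mad) {
                best.m = m;
                best.n = n;
                best.mad = d;
            }
        }
    }
    return best;
}
```

Calling `full_search(cur, prev, w, h, x, y, N, W)` for one block costs up to (2W + 1)^2 MAD evaluations, which is why the paper treats the search as the part worth parallelizing.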
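
The three Diamond Search steps listed in Section III-B can be sketched in C as below. This is an illustrative reading of those steps, not the implementation used in the paper: the helper names (`Offset`, `candidate_ok`, `best_in_pattern`, `diamond_search`), the frame layout, and the rule that the centre wins ties are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int m; int n; } Offset;   /* hypothetical displacement type */

/* MAD between the N x N block at (x, y) in cur and at (x+m, y+n) in prev. */
double block_mad(const unsigned char *cur, const unsigned char *prev,
                 int w, int x, int y, int m, int n, int N)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += abs((int)cur[(y + j) * w + (x + i)] -
                       (int)prev[(y + n + j) * w + (x + m + i)]);
    return (double)sum / (double)(N * N);
}

/* 1 if displacement (m, n) stays inside the search window and the frame. */
int candidate_ok(int w, int h, int x, int y, int m, int n, int N, int W)
{
    return abs(m) <= W && abs(n) <= W &&
           x + m >= 0 && y + n >= 0 && x + m + N <= w && y + n + N <= h;
}

/* Test the pattern points around centre c and return the minimum-MAD
   point; the centre is kept unless a point is strictly better. */
Offset best_in_pattern(const unsigned char *cur, const unsigned char *prev,
                       int w, int h, int x, int y, int N, int W,
                       Offset c, const Offset *pat, int count)
{
    Offset best = c;
    double best_mad = block_mad(cur, prev, w, x, y, c.m, c.n, N);
    for (int k = 0; k < count; k++) {
        int m = c.m + pat[k].m, n = c.n + pat[k].n;
        if (!candidate_ok(w, h, x, y, m, n, N, W))
            continue;
        double d = block_mad(cur, prev, w, x, y, m, n, N);
        if (d < best_mad) {
            best_mad = d;
            best.m = m;
            best.n = n;
        }
    }
    return best;
}

Offset diamond_search(const unsigned char *cur, const unsigned char *prev,
                      int w, int h, int x, int y, int N, int W)
{
    /* LDSP: centre plus these 8 points (9 in total); SDSP: centre plus 4. */
    static const Offset ldsp[8] = {{0, -2}, {1, -1}, {2, 0}, {1, 1},
                                   {0, 2}, {-1, 1}, {-2, 0}, {-1, -1}};
    static const Offset sdsp[4] = {{0, -1}, {1, 0}, {0, 1}, {-1, 0}};

    assert(candidate_ok(w, h, x, y, 0, 0, N, W));
    Offset c = {0, 0};
    for (;;) {                              /* Steps 1 and 2 */
        Offset b = best_in_pattern(cur, prev, w, h, x, y, N, W, c, ldsp, 8);
        if (b.m == c.m && b.n == c.n)
            break;                          /* minimum at centre: Step 3 */
        c = b;                              /* re-centre the LDSP */
    }
    return best_in_pattern(cur, prev, w, h, x, y, N, W, c, sdsp, 4);
}
```

The loop terminates because the centre only moves when the MAD strictly decreases and the window is finite. Unlike full search, the result is not guaranteed to be the global minimum of the window.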
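
The weighting idea of Section III-C can be shown with a very small C sketch. The paper only states that the block's own motion vector gets the largest weight and that the weights of a sub block sum to one; the scalar weights 0.50 and 0.25 below, the use of one weight per prediction instead of a per-pixel mask, and the function name `blend_subblock` are assumptions chosen for illustration.

```c
#include <assert.h>

/* Hypothetical weights: own vector highest, all three sum to 1.0. */
#define W_OWN       0.50
#define W_NEIGHBOUR 0.25

/* Blend three motion compensated predictions of the same sub block:
   pred_own comes from the block's own vector (e.g. Va), pred_n1 and
   pred_n2 from the two adjacent blocks' vectors (e.g. Vb and Vc for A1).
   Each buffer holds count pixels. */
void blend_subblock(const unsigned char *pred_own,
                    const unsigned char *pred_n1,
                    const unsigned char *pred_n2,
                    int count, unsigned char *out)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        double v = W_OWN * pred_own[i] +
                   W_NEIGHBOUR * pred_n1[i] +
                   W_NEIGHBOUR * pred_n2[i];
        out[i] = (unsigned char)(v + 0.5);   /* round to nearest */
    }
}
```

Because the weights sum to one, a region where all three predictions agree is reproduced unchanged, while disagreement near block borders is smoothed rather than shown as a hard edge.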
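
The parallelization described in Section III-D, where each thread searches different blocks and the rest of the algorithm stays sequential, might look like the following C sketch. It is an assumption-laden illustration, not the authors' code: `estimate_motion`, `search_block` and the flat block indexing are hypothetical, and full search is used as the per-block search. Only the standard OpenMP `parallel for` directive with its `num_threads` and `schedule` clauses is relied on; without OpenMP enabled the pragma is ignored and the loop runs sequentially.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int m; int n; } MotionVector;   /* hypothetical type */

/* Full search for the N x N block at (x, y); out-of-frame candidates
   are skipped. Returns the minimum-MAD displacement. */
MotionVector search_block(const unsigned char *cur, const unsigned char *prev,
                          int w, int h, int x, int y, int N, int W)
{
    MotionVector best = {0, 0};
    double best_mad = -1.0;                 /* -1 means "no candidate yet" */
    for (int n = -W; n <= W; n++) {
        for (int m = -W; m <= W; m++) {
            if (x + m < 0 || y + n < 0 || x + m + N > w || y + n + N > h)
                continue;
            long sum = 0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    sum += abs((int)cur[(y + j) * w + (x + i)] -
                               (int)prev[(y + n + j) * w + (x + m + i)]);
            double mad = (double)sum / (double)(N * N);
            if (best_mad < 0.0 || mad < best_mad) {
                best_mad = mad;
                best.m = m;
                best.n = n;
            }
        }
    }
    return best;
}

/* Compute one motion vector per block. Blocks are independent, so the
   loop over blocks is shared among nthreads threads; each iteration
   writes only its own element of mv, so no locking is needed. */
void estimate_motion(const unsigned char *cur, const unsigned char *prev,
                     int w, int h, int N, int W, int nthreads,
                     MotionVector *mv)
{
    assert(N > 0 && W >= 0 && nthreads > 0);
    int bw = w / N;                         /* blocks per row */
    int nblocks = bw * (h / N);
    int b;

    #pragma omp parallel for num_threads(nthreads) schedule(static)
    for (b = 0; b < nblocks; b++) {
        int x = (b % bw) * N;
        int y = (b / bw) * N;
        mv[b] = search_block(cur, prev, w, h, x, y, N, W);
    }
}
```

Since every block is searched independently, the motion field does not depend on the number of threads; only the running time changes, which matches the paper's statement that the algorithm itself is left unmodified.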
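
The percentage improvement reported in Section IV-B can be written as a formula. The symbols T_seq (execution time of the single threaded run) and T_par (execution time of the multithreaded run) are introduced here for illustration and do not appear in the paper.

```latex
\text{Improvement}\,(\%) = \frac{T_{seq} - T_{par}}{T_{seq}} \times 100
```

As a purely illustrative example with made-up timings (not measurements from the paper), T_seq = 10 s and T_par = 7 s give (10 - 7) / 10 x 100 = 30%.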