Extraction of Time Space Information by sanmelody


Methods and Standards for Lossless Compression
Video Coding Techniques and Hardware Architectures Design
Department of Electronic Engineering, FJU

CHAPTER 6
VLSI Architectures for Motion Estimation




1-D Systolic Array

A Family of VLSI Designs for the Motion Compensation Block-Matching Algorithm
K. M. Yang, M. T. Sun, and L. Wu
IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1317-1325, Oct. 1989

Main Features
• They allow full-search capability, which is the optimal solution in block matching.
• They accept sequential inputs to save pin count but perform parallel processing.
• They use common buses for data transfers, which saves silicon area.
• They are very flexible, modular designs capable of processing different block sizes, e.g. 8 × 8, 16 × 16, or 32 × 32.
• They are cascadable, i.e., cascaded chips allow a larger tracking area.
• They contain testing circuitry to increase testability.
• This is the first chip design for block-matching motion estimation in the world.

Architecture Design
• To fully utilize the processing power of the processing elements (PEs), a special data flow has to be derived that keeps the PEs as busy as possible.
• The data are used repeatedly at different search positions.
• Two data-flow techniques that allow the designs to achieve 100 percent efficiency are described in the following. One broadcasts the previous frame data and the other broadcasts the current block data.

Notations
[Figure: the current block a(i, j), with upper-left corner a(Ia, Ja) and indexes 0 to 15, and the previous-frame search area b(k, l), with upper-left corner b(Ib, Jb), pixel b(Ib, Jb+15), and indexes 0 to 31. Other labels in the figure: c, p, p'.]

Broadcasting the Previous Frame Data
• While b(Ib, Jb+15) is being input, it can be broadcast to all processors that need it.
• This removes the burden of repeatedly accessing the same data from the previous frame.

Broadcasting Reference Frame
• The 16 PE columns represent the calculation of the error measurement for 16 search positions.
• Except for a very short initial delay, all the PEs are busy all the time, so the utilization is 100%.
• The address generator generates each address by summing a base address and a running index.
• The base address, (Ia, Ja) or (Ib, Jb), which is defined as the upper-left corner of a block, remains the same for the entire processing of that block, and the running indexes (i, j) and (k, l) follow an identical sequence for all blocks.

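The broadcasting scheme and the address generator described above can be sketched in Python. This is a minimal software model under stated assumptions, not the chip's actual logic: the function names (address, sad_row_broadcast, sad_row_direct) are hypothetical, the error measurement is taken to be the sum of absolute differences, and PE n is assumed to handle horizontal displacement n.

```python
N = 16        # block is N x N pixels
SEARCH = 16   # horizontal search positions, one per PE


def address(base, index):
    """Address generator: base address (upper-left corner) plus running index."""
    return base[0] + index[0], base[1] + index[1]


def sad_row_broadcast(cur, prev, base_a, base_b):
    """Model 16 PEs sharing each broadcast previous-frame pixel.

    PE n accumulates the error for horizontal displacement n (an assumption).
    Each pixel b(k, l) is fetched once and used by every PE n with 0 <= l - n < N.
    """
    acc = [0] * SEARCH
    for k in range(N):
        for l in range(N + SEARCH - 1):
            bi, bj = address(base_b, (k, l))
            b = prev[bi][bj]              # single fetch, broadcast to all PEs
            for n in range(SEARCH):
                j = l - n                 # current-block column paired with b in PE n
                if 0 <= j < N:
                    ai, aj = address(base_a, (k, j))
                    acc[n] += abs(cur[ai][aj] - b)
    return acc


def sad_row_direct(cur, prev, base_a, base_b):
    """Reference model: every displacement fetches its own previous-frame data."""
    acc = []
    for n in range(SEARCH):
        total = 0
        for i in range(N):
            for j in range(N):
                ai, aj = address(base_a, (i, j))
                bi, bj = address(base_b, (i, j + n))
                total += abs(cur[ai][aj] - prev[bi][bj])
        acc.append(total)
    return acc
```

Both functions return the same 16 error values. In this model the broadcast version fetches 16 × 31 = 496 previous-frame pixels per block, whereas the direct version fetches 16 × 256 = 4096.
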
Basic Data Flow
[Figure]

Architecture of PE
a, b → Subtractor → Latch (a-b) → Absolute Value Function → Latch (|a-b|) → Accumulator → Latch
• These sub-operations are performed in a pipelined fashion, which reduces the cycle time.
• The accumulator in the last stage of the PE has 16-bit precision to accommodate the largest possible error measurement.

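The three pipelined sub-operations and the accumulator width can be illustrated with a small Python model. It is only a sketch: the names pe_accumulate and accumulator_bits are hypothetical, and the 8-bit pixel width is an assumption, since the slide does not state it.

```python
def pe_accumulate(pairs):
    """Cycle-by-cycle model of subtractor -> absolute value -> accumulator."""
    latch_diff = None   # latch holding a - b
    latch_abs = None    # latch holding |a - b|
    acc = 0             # accumulator output latch
    # Two extra empty cycles drain the last pair through the pipeline.
    for pair in list(pairs) + [None, None]:
        # Evaluate stages back to front so each reads the previous cycle's latch.
        if latch_abs is not None:
            acc += latch_abs
        latch_abs = None if latch_diff is None else abs(latch_diff)
        latch_diff = None if pair is None else pair[0] - pair[1]
    return acc


def accumulator_bits(block_size, pixel_bits=8):
    """Bits needed for the worst-case sum of absolute differences.

    pixel_bits=8 is an assumption, not a figure taken from the slides.
    """
    worst = block_size * block_size * (2 ** pixel_bits - 1)
    return worst.bit_length()
```

Under the 8-bit assumption, a 16 × 16 block has a worst-case error of 256 × 255 = 65280, which fits in the 16-bit accumulator mentioned above.
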
Broadcasting the Current Frame Data
[Figure labels: parallel-in-parallel-output shift registers; parallel-in-parallel-output shift registers with multiplexers]

Basic Dataflow for Broadcasting Current Block Data
[Figure]

Flexible Block Size
• Different motion-compensation schemes may use different block sizes and require large tracking ranges. It is very desirable to have a chip flexible enough for use in different systems.
• For a block size of 8 × 8, the computation required for each block is ¼ of that required for a block size of 16 × 16.
• However, each frame then contains 4 times as many blocks as with a block size of 16 × 16.

Flexible Block Size (Cont.)
• The computational load for each frame is the same for different block sizes, except that the internal dynamic range is slightly different (the tracking range is fixed).
• Both architectures discussed are flexible enough to process 8 × 8, 16 × 16, or 32 × 32 blocks as long as the tracking range is fixed to 16 searches in one coordinate.
• The same hardware containing 16 PEs can be reconfigured to process different block sizes by a very simple control signal (address generator).
• The above discussion can be generalized to other block sizes that are powers of 2.

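The claim that the per-frame load does not depend on the block size can be checked with a short calculation. The sketch below counts one absolute-difference operation per pixel per search position; the function name frame_load and the 352 × 288 frame size are hypothetical, chosen only because 8, 16 and 32 all divide both dimensions.

```python
def frame_load(width, height, block, searches_per_axis=16):
    """Return (blocks per frame, operations per block, operations per frame)."""
    blocks = (width // block) * (height // block)
    ops_per_block = block * block * searches_per_axis ** 2
    return blocks, ops_per_block, blocks * ops_per_block


# Hypothetical frame size; the slides do not specify one.
loads = {b: frame_load(352, 288, b) for b in (8, 16, 32)}
```

Halving the block side divides the work per block by 4 and multiplies the number of blocks by 4, so the third entry of each tuple is identical for all three block sizes.
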
Larger Tracking Ranges
• The tracking range is basically limited by the computation power of the PEs. If a tracking range of -16 to +15 is needed, the computation load increases by a factor of 4.
• Assuming each PE is already operating at the limit of its capability, 4 times the number of PEs will be needed.
• In this connection, essentially two chips are cascaded to provide 32-stage input registers and 32 PEs for the doubled horizontal tracking range.

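The factor of 4 follows from counting search positions. The sketch below assumes that the base tracking range is -8 to +7 (16 searches per coordinate, as stated for the single-chip case); that base range and the name search_positions are assumptions made for illustration.

```python
PES_PER_CHIP = 16  # from the slides: one chip contains 16 PEs


def search_positions(lo, hi):
    """Number of candidate displacements for a square range lo..hi per axis."""
    per_axis = hi - lo + 1
    return per_axis * per_axis


base = search_positions(-8, 7)        # assumed base range: 16 searches per axis
extended = search_positions(-16, 15)  # range named on the slide
scale = extended // base              # growth in computation load
pes_needed = PES_PER_CHIP * scale     # if every PE is already fully loaded
```

Doubling the range in both coordinates quadruples the number of positions, which is why four cascaded chips are used for the -16 to +15 range.
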
Block Diagram for Cascading Four Chips to Achieve Tracking Range of -16 to +15
[Figure: CHIP A, CHIP B, CHIP C, and CHIP D around a CMP block that outputs the motion vector; inputs labeled C1, p1, p1' on one side and C2, p2, p2' on the other.]

Overlapped Search Area
[Figure: sub-tracking areas I, II, III, and IV shown within the tracking area; both axes are marked 0, 16, 32, 47.]

Overlapped Search Area (Cont.)
• The cascaded chip design can also easily be done by assigning each chip to process one portion of the tracking area.
• While the data from the overlapped area are being input, they can be broadcast to two chips to save bandwidth. This avoids a proportional increase of the memory requirement in a cascaded-chip system.

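The bandwidth saving from broadcasting the overlapped data can be estimated with a small model of two horizontally adjacent chips. The geometry here is an assumption read from the axis marks 0, 16, 32, 47: each sub-tracking area is taken to be 32 × 32 pixels, and the second area is offset by 16 pixels. The name sub_area is hypothetical.

```python
def sub_area(row0, col0, size=32):
    """Set of pixel coordinates in one size x size sub-tracking area."""
    return {(r, c)
            for r in range(row0, row0 + size)
            for c in range(col0, col0 + size)}


# Assumed layout: two 32 x 32 areas side by side, offset by 16 columns.
areas = [sub_area(0, c) for c in (0, 16)]

overlap = areas[0] & areas[1]                 # pixels needed by both chips
separate_reads = sum(len(a) for a in areas)   # each chip fetches its own copy
broadcast_reads = len(areas[0] | areas[1])    # each pixel fetched once and shared
```

In this model the overlapped 32 × 16 strip is read once instead of twice, so the number of reads drops from 2048 to 1536.
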
Motion Estimation with Fractional Precision
• Quarter-pel precision

Fractional Motion Estimation Chip-Pair Design
[Figure: block diagram. Recovered labels: Video in; Reconstructed Video in; Current Frame Storage Memory; Previous Frame Storage Memory; Tracking Area Storage Memory; Current Block Storage Memory; Motion Compensation Chip I (Integer Precision), with output (mi, mj); Motion Compensation Chip F (Fractional Precision).]

Block Diagram of a Fractional Motion Estimation Chip
[Figure]

Interpolation
• The combination of IP1 and IP2 eases the input rate and keeps the PEs performing operations every cycle.
• The interpolated values at the output of IP1 and IP2 can be expressed as [equation missing]

Basic Data Flow for Fractional Motion Vector Estimator
[Figure]

Basic Data Flow for Fractional Motion Vector Estimator (Cont.)
[Figure]

Schematic Diagram of IP1
[Figure]

Schematic Diagram of IP2
[Figure]

Chip Layout
[Figure]

Testability
• The motion vector calculated by the chip is a function of the current block data and the data in the previous frame within the tracking range. Since the number of possible combinations of these input data is extremely large, exhaustive testing of the chip is impossible.
• To make the chip testable, it is highly desirable to have a testing circuit inside the chip that neither uses excessive chip area nor degrades performance.
• The proposed chip operates in two modes, the normal mode and the test mode, which are selected by an external signal named "test."

Testability (Cont.)
• By using tri-state buses and a decoder, the test vectors for the whole chip are reduced to much smaller sets for functionally divided modules.
• In the test mode, a test pattern is input through some of the data pins, which are normally used for one of the previous frame data inputs, and is then decoded by the Test Pattern Decoder.
• Only one module is tested at a time, and only its results are routed to an output bus and observed at the output pins.

        Methods and Standards for Lossless Compression




                                                            Array Architectures for Block
                                                                Matching Algorithms

                                                                   T. Komarek and P. Pirsch

                                                         IEEE Transactions on Circuits and Systems, vol. 36, no.
                                                                    10, Oct. 1989, pp. 1301-1308


                                                                      Block Matching Algorithm

                                                                                                      (motion vector)

                                                          The BMA is defined over a four-dimensional index space due to
                                                           its four indexes i, k, m, and n.
                                                          As an example, the BMA is decomposed into two parts which
                                                           are defined over two-dimensional index spaces.
                                                             – The first one is spanned by the indexes i and k and consists of
                                                                the accumulation of the sum s(m, n).
                                                             – The rest, which is defined over m and n, performs the minimum
                                                                search and the selection of the displacement vector
                                                                components.
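
The two-part decomposition can be sketched in Python. This is a minimal illustration under assumptions not stated on the slide: the matching criterion is the sum of absolute differences, x is the N × N reference block, y is the (N + 2p) × (N + 2p) search area, and the indexes m and n run from 0 to 2p before being shifted to signed displacements. The function names are hypothetical.

```python
def block_sum(x, y, m, n):
    """Part 1 (indexes i, k): accumulate s(m, n) for one fixed (m, n)."""
    N = len(x)
    s = 0
    for i in range(N):
        for k in range(N):  # k is the inner index, so it runs first
            s += abs(x[i][k] - y[i + m][k + n])
    return s


def full_search(x, y, p):
    """Part 2 (indexes m, n): minimum search and displacement selection."""
    best, vector = None, None
    for m in range(2 * p + 1):
        for n in range(2 * p + 1):
            s = block_sum(x, y, m, n)
            if best is None or s < best:
                # Assumption: shift by p to report a signed displacement.
                best, vector = s, (m - p, n - p)
    return best, vector
```

Keeping the accumulation and the minimum search in separate functions mirrors the split into an (i, k) index space and an (m, n) index space.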
                                                         Derivation of Systolic Arrays for Full Search
                                                                             BMA
                                                          The addition of s(m, n) starts with the index k, and is
                                                           continued over the index i for fixed m and n.
                                                                                           m and n fixed

                                                          The second part of the decomposed BMA is given by




                                                                          DG Spanned in the i, k Plane

                                                          [Figure: DG for a block size of N = 3 and a maximum displacement
                                                           of p = 2 in the i, k-plane of the decomposed full search BMA.
                                                           Nodes AD (subtraction, magnitude operation, addition) take the
                                                           inputs x(i, k) and y(i+m, k+n); nodes A (addition) combine the
                                                           partial sums s1(m, n), s2(m, n), and s3(m, n) into s(m, n);
                                                           node M (minimum search, displacement vector) receives s(m, n).
                                                           A time schedule is marked alongside the graph.]
                                                         Systolic Architecture AB1 for N = 3, p = 2

                                                         [Figure: systolic architecture AB1, data index sequences per PE]

                                                         Search area data        PE     Reference data
                                                                                 0
                                                         41 31 21 31 21 11       AD     11 21 31 11 21 31
                                                         42 32 22 32 22 12       AD     12 22 32 12 22 32
                                                         43 33 23 33 23 13       AD     13 23 33 13 23 33
                                                                                 AD  M  -> Displacement Vector

                                                         Number of time instances necessary to determine a
                                                         displacement vector:
                                                         N × (2p+1)(2p+1+N-1) = N × (2p+1)(2p+N)
                                                         Three-Dimensional Index Space Spanned by
                                                                  the Indexes i, k, and m

                                                                            Systolic Array AS2

                                                          Systolic architecture AS2 with processing elements AD, A, and M
                                                           derived from the previous DG, with the indexes of input data x(i, k)
                                                           and y(i + m, k + n). The indexes enclosed by the dashed lines belong
                                                           to data of one search area line and one reference block.
                                                                                                                  projection onto
                                                                                                                  the i, m plane




                                                                     Systolic Architecture AB2

                                                          Systolic architecture AB2 with the indexes of search area data
                                                           y(i + m, k + n). The reference block data x(i, k) remain fixed
                                                           in the AD PEs. The indexes of the data of one search area line
                                                           are enclosed by the dashed line.
                                                         Projection along the i, k-plane
                                                                          Processing Element

                                                                      Bit-Level Cell Array
                                                                                              4x4 PE array
                                                                      Bit-Level PE Array (Cont.)

                                                                                    Systolic Array AS1

                                                          Systolic architecture AS1 for N = 3 and p = 2 with the
                                                           indexes of search area data y(i + m, k + n) and
                                                           reference block data x(i, k).

                                                         [Figure: systolic array AS1, input index sequences and PE rows]

                                                         Reference data:     .. .. 32 22 12 .. .. .. .. .. 31 21 11
                                                         Search area data:   52 42 32 22 12 21 61 51 41 31 21 11

                                                         PE rows:                D   D   D   D   D
                                                                                 A   A   A   A   A
                                                                             0   M   M   M   M   M   M  -> Displacement Vector


                                                         Efficient Hybrid Tree/Linear Array
                                                                 Architectures for
                                                         Block-Matching Motion Estimation
                                                                     Algorithms
                                                                M.-J. Chen, L.-G. Chen, K.-N. Cheng, M. C. Chen


                                                          IEE Proc.-Vis. Image Signal Process., vol. 143,
                                                                  no. 4, pp. 217-222, Aug. 1996


                                                         Illustration of One-Dimensional Full Search
                                                                           Algorithm

                                                         Tree-Type Array Architecture with N = 4

                                                           Hybrid Tree/Linear Architecture

                                                            Tree-Cut Technique: Direct Form

                                                           Image pel Distribution for Memory
                                                                     Interleaving

                                                             Chip Layout and Characteristics

                                                         Analysis and Architecture Design of Variable
                                                         Block Size Motion Estimation for H.264/AVC

                                                             Ching-Yeh Chen, Shao-Yi Chien, Yu-Wen Huang,
                                                             Tung-Chien Chen, Tu-Chih Wang, and Liang-Gee Chen


                                                                  IEEE Trans. Circuits Syst. Video Technology



                                                                                 Abstract
                                                          Variable block size motion estimation (VBSME) has
                                                           become an important video coding technique, but it
                                                           increases the difficulty of hardware design.
                                                          We use inter/intra-level classification and various
                                                           data flows to analyze the impact of supporting
                                                           VBSME in different hardware architectures.
                                                          We propose two hardware architectures, which can
                                                           support traditional fixed block size motion estimation
                                                           as well as VBSME with less chip area overhead
                                                           than previous approaches.

                                                                            Abstract (Cont.)
                                                          By broadcasting reference pixel rows and
                                                           propagating partial SADs, the first design has
                                                           fewer reference pixel registers and a shorter critical
                                                           path.
                                                          The second design utilizes a 2-D distortion array and
                                                           one adder tree with a reference buffer, which can
                                                           maximize the data reuse between successive
                                                           searching candidates.
                                                          We demonstrate a 720p, 30 fps solution at 108 MHz
                                                           with a 330.2K gate count and 208K bits of on-chip
                                                           memory.
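
The quoted operating point implies a cycle budget per macroblock, which a short calculation makes concrete. This sketch assumes 720p means 1280 × 720 pixels and that macroblocks are 16 × 16; the function name is hypothetical, and the result is only the average budget implied by the figures, not a description of how the design schedules its cycles.

```python
def cycles_per_macroblock(width, height, fps, clock_hz, mb_size=16):
    """Average clock cycles available per macroblock at a given frame rate."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return clock_hz / (mbs_per_frame * fps)


# 1280x720 -> 80 * 45 = 3600 macroblocks per frame, 108000 per second
print(cycles_per_macroblock(1280, 720, 30, 108_000_000))  # 1000.0
```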

                                                                         Introduction (Cont.)
                                                          The row (column) SAD is the summation of N distortions
                                                           in a row (column).
                                                          Although FSBMA provides the best quality among
                                                           various ME algorithms, it requires the most
                                                           computation. In general, ME accounts for 50% to 90%
                                                           of the computational complexity of a typical video
                                                           coding system. Hence a hardware accelerator for ME is
                                                           required.
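
The row and column SAD definitions can be written out as a small Python sketch. It assumes the distortion of a pixel is the absolute difference between the current pixel and the reference pixel at the same position; the function names are illustrative.

```python
def distortions(cur, ref):
    """Per-pixel absolute differences for an N x N block."""
    N = len(cur)
    return [[abs(cur[i][j] - ref[i][j]) for j in range(N)] for i in range(N)]


def row_sads(d):
    """Row SAD: summation of the N distortions in each row."""
    return [sum(row) for row in d]


def column_sads(d):
    """Column SAD: summation of the N distortions in each column."""
    return [sum(col) for col in zip(*d)]


d = distortions([[1, 2], [3, 4]], [[0, 4], [3, 1]])
print(row_sads(d), column_sads(d))  # [3, 3] [1, 5]
```

Summing either list gives the same block SAD, which is why an architecture can accumulate along rows or along columns.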

                                                                                  VBSME
                                                          Variable block size motion estimation (VBSME) is a
                                                           new coding technique and provides more accurate
                                                           predictions compared to traditional fixed block size
                                                           motion estimation (FBSME).
                                                          With FBSME, if an MB contains two objects with
                                                           different motion directions, the coding performance of
                                                           this MB is degraded.
                                                          With VBSME, by contrast, the same MB can be divided
                                                           into smaller blocks in order to fit the different
                                                           motion directions.
                                                          VBSME has been adopted in the latest video coding
                                                           standards, including H.263, MPEG-4, WMV9.0, and
                                                           H.264/AVC.
                                                                               VBSME (Cont.)
                                                          In H.264/AVC, a MB with variable block size can be divided into
                                                           seven kinds of blocks: 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16,
                                                           16 × 8, and 16 × 16.
                                                          Although VBSME can achieve a higher compression ratio, it not
                                                           only requires huge computational complexity but also increases the
                                                           difficulty of hardware implementation for ME.
                                                          Traditional ME hardware architectures are designed for FBSME,
                                                           and they can be classified into two categories.
                                                            – One is an inter-level architecture, where each processing
                                                                element (PE) is responsible for one SAD of a specific
                                                                searching candidate.
                                                            – The other is an intra-level architecture, where each PE is
                                                                responsible for the distortion of a specific current pixel in the
                                                                current MB for all searching candidates.
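
The seven block sizes listed above give 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 blocks per 16 × 16 MB. One common way to obtain their SADs for a given searching candidate, assumed here for illustration rather than taken from the slides, is to compute the sixteen 4 × 4 SADs once and merge them. The (height, width) ordering of the shapes and all names below are conventions of this sketch.

```python
# (height, width) of each block shape; the ordering is this sketch's convention.
BLOCK_SHAPES = [(4, 4), (4, 8), (8, 4), (8, 8), (8, 16), (16, 8), (16, 16)]


def merge_sads(sad4):
    """sad4[r][c] is the SAD of the 4x4 block at block-row r, block-column c.

    Returns, for each shape, a grid of SADs built by summing 4x4 SADs.
    """
    result = {}
    for h, w in BLOCK_SHAPES:
        bh, bw = h // 4, w // 4  # shape measured in 4x4 units
        result[(h, w)] = [
            [sum(sad4[r + dr][c + dc] for dr in range(bh) for dc in range(bw))
             for c in range(0, 4, bw)]
            for r in range(0, 4, bh)
        ]
    return result


sads = merge_sads([[1] * 4 for _ in range(4)])
print(sum(len(g) * len(g[0]) for g in sads.values()))  # 41
```

Reuse of this kind only works when all sub-blocks are evaluated at the same candidate position, which is one reason the data flow of an architecture matters for VBSME support.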
                                                               Yang, Sun, and Wu’s Architectures

                                                          A 1-D inter-level hardware architecture (1DInterYSW).
                                                          The number of PEs is equal to the number of searching candidates
                                                           in the horizontal direction, 2Ph.
                                                          The most important concept is data broadcasting. With the
                                                           broadcasting technique, the memory bandwidth, defined as the
                                                           number of bits of reference data required in one cycle, is
                                                           reduced significantly, although some global routing is required.
                                                                   Yeo and Hu’s Architectures

                                                                   Lai and Chen’s Architecture

                                                          Reference pixels are propagated with propagation registers, and
                                                           current pixels are broadcast to the PEs.
                                                          The partial SADs are still stored and accumulated in PEs.
                                                          In addition, 2DInterLC has to load reference pixels into
                                                           propagation registers before computing SADs. The latency of
                                                           loading reference pixels can be reduced by partitioning the
                                                           search range in 2DInterLC.
                                                             Vos and Stegherr’s Architecture

                                                          Vos and Stegherr’s Architecture (Cont.)

                                                          A 2-D intra-level architecture (2DIntraVS).
                                                          The number of PEs is equal to the block size. Each
                                                           PE corresponds to one current pixel, and the current
                                                           pixels are stored in their respective PEs.
                                                          The key concept of 2DIntraVS is the scanning
                                                           order of the searching candidates: snake scan.
                                                          The computation flow is as follows.
                                                            – First, the distortion is computed in each PE, and N partial
                                                              row SADs are propagated and accumulated in the horizontal
                                                              direction.
                                                            – Second, an adder tree is used to accumulate the N row
                                                              SADs into the SAD. The accumulations of row SADs and SAD
                                                              are done in one cycle. Hence no partial SAD is required to
                                                              be stored.
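
The snake scan and the two-step accumulation can be sketched as follows. This is an illustration under assumptions: the slides do not give the sweep direction, so the scan here moves along a row of candidates and reverses on the next row, and the function names are hypothetical.

```python
def snake_scan(rows, cols):
    """Candidate order in which consecutive candidates are always adjacent."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order


def candidate_sad(cur, ref, r, c):
    """Step 1: N row SADs from per-PE distortions. Step 2: adder tree."""
    N = len(cur)
    row_sads = [
        sum(abs(cur[i][j] - ref[r + i][c + j]) for j in range(N))
        for i in range(N)
    ]
    return sum(row_sads)  # stands in for the adder tree


print(snake_scan(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

Because each candidate differs from the previous one by a single step, most reference pixels held for one candidate can be reused for the next.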

                                                                                                                            58
                                                           Video Coding Techniques and Hardware Architectures Design
                                                            Komarek and Pirsch’s Architecture




                                                         [Figure: Komarek and Pirsch’s architecture and Hsieh and Lin’s architecture.]
                                                         Komarek and Pirsch’s Architecture (Cont.)




                                                          Komarek and Pirsch contributed a detailed systolic
                                                           mapping procedure based on the dependence graph (DG).
                                                           AB2 (2DIntraKP) is a 2-D intra-level architecture.
                                                          Current pixels are stored in their corresponding PEs.
                                                           Reference pixels are propagated PE by PE in the
                                                           horizontal direction.
                                                          First, the N partial column SADs are propagated and
                                                           accumulated in the vertical direction.
                                                          After the vertical propagation, the N column SADs
                                                           are propagated in the horizontal direction.

                                                                   Hsieh and Lin’s Architecture




                                                          2DIntraHL consists of N PE arrays stacked in the vertical direction;
                                                           each PE array is a row of N PEs.
                                                          In 2DIntraHL, reference pixels are propagated one by one through
                                                           propagation registers, which allows serial data input and
                                                           increases data reuse.
                                                          Current pixels are still stored in the PEs. The N partial column SADs
                                                           are propagated in the vertical direction, from bottom to top.
                                                          In each computing cycle, each PE array generates N distortions
                                                           of a search candidate and accumulates them into the N partial
                                                           column SADs as they propagate vertically.
                                                          After the vertical accumulation, the N column SADs are summed
                                                           by the adder tree at the top in one cycle. The longer latency for
                                                           loading reference pixels and the large propagation registers are
                                                           the penalties paid for the reduced memory bitwidth and memory
                                                           bandwidth.

                                                            Proposed Propagate Partial SAD




                                                          Proposed Propagate Partial SAD (Cont.)




                                                          The architecture is composed of N PE arrays with 1-D adder trees,
                                                           arranged in the vertical direction.
                                                          Current pixels are stored in each PE, and two sets of N
                                                           contiguous reference pixels in a row are broadcast to the N PE
                                                           arrays at the same time.




                                                          Data Flow of Propagate Partial SAD





                                                                                                                          Proposed SAD Tree




                                                            Scan Order and Memory Access




                                                         Variable Block Size Motion Estimation




                                                         The Impact of Variable Block Size Motion
                                                          Estimation in Hardware Architectures




                                                          There are many ways to support VBSME in hardware
                                                           architectures.
                                                          For example, we can increase the number of PEs or
                                                           the operating frequency and perform ME separately for
                                                           each block size. Another way is to reuse the SADs of the
                                                           smallest blocks (the blocks obtained by partitioning
                                                           an MB at the smallest block size) to derive the SADs of
                                                           larger blocks.
                                                          With this method, the overhead of supporting VBSME is
                                                           only a slight increase in gate count; the other
                                                           factors, such as frequency, hardware utilization, and
                                                           memory usage, are the same as those of FBSME.
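The SAD-reuse idea can be illustrated with a short sketch. It assumes N = 16 and n = 4, and it assumes the seven H.264 partition sizes (16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4); the function name and the dictionary layout are hypothetical. Given the sixteen 4×4 SADs of one search candidate, every larger-block SAD is just a sum of the 4×4 SADs it covers.

```python
def merge_sads(sad4x4):
    """Derive the SADs of all larger blocks from a 4 x 4 grid of 4x4-block SADs.

    sad4x4[r][c] is the SAD of the 4x4 block in block-row r, block-column c.
    Keys are written width x height in pixels.
    """
    def region(r0, c0, h, w):
        # Sum the 4x4 SADs covered by a block of h x w smallest blocks.
        return sum(sad4x4[r][c]
                   for r in range(r0, r0 + h)
                   for c in range(c0, c0 + w))

    # (height, width) of each partition, counted in 4x4 blocks.
    shapes = {"4x4": (1, 1), "8x4": (1, 2), "4x8": (2, 1), "8x8": (2, 2),
              "16x8": (2, 4), "8x16": (4, 2), "16x16": (4, 4)}
    return {
        name: [region(r0, c0, h, w)
               for r0 in range(0, 4, h)
               for c0 in range(0, 4, w)]
        for name, (h, w) in shapes.items()
    }
```

Under these assumptions one candidate yields 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 SADs, and only additions are needed beyond the sixteen 4×4 SADs, which is consistent with the slide's claim that the overhead is a slight increase in gate count.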
                                                           Data Flow I–Storing in PEs (Inter-Level
                                                                       Architecture)




                                                         [Figure panels: FBSME, N = 16; VBSME, N = 16, n = 4.]

                                                          The number of bits for the data buffer in each PE increases
                                                           from $\log_2 N^2 + 8$ to $n^2 \times (\log_2 (N/n)^2 + 8)$, where $N^2$ and
                                                           $(N/n)^2$ are the numbers of pixels in one MB and in one smallest
                                                           block, respectively, and 8 is the word length of one pixel.
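As a worked check of the buffer-size expressions, the sketch below evaluates both for the values in the figure (N = 16, n = 4). The function names are hypothetical, and N and N/n are assumed to be powers of two so that the logarithms are exact.

```python
from math import log2


def buffer_bits_fbsme(N, pixel_bits=8):
    """One SAD register per PE: a sum of N*N pixel differences of pixel_bits each."""
    return int(log2(N * N)) + pixel_bits


def buffer_bits_vbsme(N, n, pixel_bits=8):
    """n*n registers per PE, one per smallest block of (N/n) x (N/n) pixels."""
    return n * n * (int(log2((N // n) ** 2)) + pixel_bits)


fbsme = buffer_bits_fbsme(16)      # 8 + 8 = 16 bits
vbsme = buffer_bits_vbsme(16, 4)   # 16 * (4 + 8) = 192 bits
```

For these values the per-PE buffer grows from 16 bits to 192 bits, a factor of 12.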
                                                         Data Flow II–Propagating with Propagation
                                                            Registers (Intra-Level Architecture)




                                                          In intra-level architectures, partial SADs can be
                                                           accumulated and propagated with propagation registers.
                                                          Each PE computes the distortion of one corresponding
                                                           current pixel in the current MB.
                                                          Propagation adders and registers accumulate these
                                                           distortions into the partial SAD.
                                                          When supporting VBSME, more propagation registers
                                                           are required to store the partial SADs of the smallest blocks.
                                                           In each propagation direction, the number of
                                                           propagation registers is n times that of the original
                                                           design, because there are n smallest blocks in the other
                                                           direction.

                                                         The Proposed Propagate Partial SAD
                                                            Architecture with Data Flow II




                                                              Data Flow III–No Partial SADs




                                                            [Figure: The proposed SAD Tree architecture with Data Flow III, where N = 16 and n = 4.]
                                                            Data Flow III–No Partial SADs (Cont.)




                                                          In intra-level architectures, it is possible that no partial SADs
                                                           need to be stored, as in SAD Tree.
                                                          Each PE computes the distortion of one current pixel for a search
                                                           candidate, and the total SAD is accumulated by an adder tree in
                                                           one cycle, as shown in Fig. 5(a).
                                                          Because there are no partial SADs in this architecture, there is no
                                                           register overhead for storing partial SADs when supporting
                                                           VBSME.
                                                          Only the adder tree has to be reorganized to support VBSME.
                                                          That is, we partition the 2-D adder tree so that it first produces
                                                           the SADs of the smallest blocks and then derives the SADs of
                                                           larger blocks from them. Although there is no additional
                                                           register overhead, the extra additions in the adder tree needed
                                                           to support VBSME do require additional area.
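A software analogue of the partitioned adder tree is sketched below. It is only an illustration under assumptions: the function name is hypothetical, the candidate block is passed in already aligned with the current block, and the nested loops model additions that the hardware performs combinationally in one cycle.

```python
def sad_tree(current, candidate, n=4):
    """Return (sub_sads, total) for one search candidate.

    current, candidate: N x N lists of pixel values.
    sub_sads: n x n grid holding the SAD of each (N/n) x (N/n) smallest block.
    total: SAD of the whole block, formed only from the sub-block SADs.
    """
    N = len(current)
    size = N // n  # edge length of a smallest block, in pixels
    sub_sads = [[0] * n for _ in range(n)]
    for y in range(N):
        for x in range(N):
            # Lower levels of the tree: each distortion feeds its own sub-block.
            sub_sads[y // size][x // size] += abs(current[y][x] - candidate[y][x])
    # Upper levels of the tree: combine the sub-block SADs.
    total = sum(sum(row) for row in sub_sads)
    return sub_sads, total
```

No value is carried from one candidate to the next, which mirrors the absence of partial-SAD registers; the cost of VBSME shows up only as the extra outputs tapped from the middle of the tree.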
                                                         THE PARALLELISM, CYCLES, LATENCY, AND
                                                             DATA FLOW OF EIGHT HARDWARE
                                                                    ARCHITECTURES




                                                         THE DATA BUFFER AND MEMORY BITWIDTH
                                                          OF EIGHT HARDWARE ARCHITECTURES




                                                                                An Example




                                                          The specifications of ME are as follows. The MB size is 16×16,
                                                           and the search range is Ph = 64 and Pv = 32.
                                                          The frame size is D1 (720 × 480).
                                                          When VBSME is supported, an MB can be partitioned into at most
                                                           sixteen 4×4 blocks.
                                                          We use Verilog HDL and Synopsys Design Compiler with the
                                                           Artisan UMC 0.18 µm cell library to implement each hardware
                                                           architecture.
                                                          Because the critical path in some architectures is too long,
                                                           their maximum operating frequency is limited unless the
                                                           architecture is modified. The frame rate is therefore set to
                                                           only 10 frames per second (fps).
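A rough throughput estimate helps explain the choice of frame rate. The sketch below is an illustration only: the function name is hypothetical, and it assumes that a search range P means 2P candidate positions along that axis (that is, [-P, P)) and that every MB of the frame is searched in full.

```python
def sads_per_second(width, height, mb, p_h, p_v, fps):
    """Number of MB-level SADs a full search must produce per second."""
    mbs_per_frame = (width // mb) * (height // mb)
    candidates_per_mb = (2 * p_h) * (2 * p_v)
    return mbs_per_frame * candidates_per_mb * fps


at_10_fps = sads_per_second(720, 480, 16, 64, 32, 10)   # 110_592_000
at_30_fps = sads_per_second(720, 480, 16, 64, 32, 30)   # 331_776_000
```

Under these assumptions, an idealised architecture that delivers one SAD per cycle with no latency or bubble cycles would already need about 110.6 MHz at 10 fps, and three times that at 30 fps.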

                                                                  Area and Required Frequency




                                                          Among these eight hardware architectures, all inter-level
                                                           architectures with Data Flow I increase gate count dramatically.
                                                           Their chip area is at least five times that of FBSME.
                                                                                 Latency




                                                          The latency is defined as the number of start-up cycles
                                                           that the hardware takes to generate the first SAD.
                                                          If a module has a long latency that cannot be
                                                           shortened by parallel architectures, the benefit of
                                                           parallel computation is reduced. That is, a shorter
                                                           latency is better for video coding systems.
                                                          Two factors affect the latency.
                                                            – Hardware architecture
                                                            – Memory bandwidth
                                                          Compared with these hardware architectures, the other
                                                           intra-level architectures, such as the proposed Propagate
                                                           Partial SAD and SAD Tree, have shorter latencies.
                                                                                 Utilization




                                                          In general, inter-level architectures can compute
                                                           continuously, MB by MB, so the initial cycles can be
                                                           neglected and the utilization is 100%.
                                                          Therefore, we define the utilization per MB as
                                                           computing cycles / operating cycles.
                                                          The operating cycles consist of three parts: latency,
                                                           computing cycles, and bubble cycles. Computing cycles
                                                           are the cycles in which at least one SAD is produced.
                                                           That is, if the utilization is 100%, at least one SAD is
                                                           produced in every cycle. With fewer operating cycles,
                                                           the penalty of the latency becomes more apparent.
                                                          The more bubble cycles there are, the lower the utilization is.
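The utilization definition translates directly into a small function. This is a sketch with a hypothetical function name, and the cycle counts in the example are made-up values chosen only to show the trend, not measurements of any of the eight architectures.

```python
def utilization(computing_cycles, latency_cycles, bubble_cycles=0):
    """Utilization per MB: computing cycles / operating cycles,
    where operating cycles = latency + computing + bubble cycles."""
    operating_cycles = latency_cycles + computing_cycles + bubble_cycles
    return computing_cycles / operating_cycles


# Same hypothetical 256-cycle latency, different amounts of useful work per MB.
long_run = utilization(computing_cycles=8192, latency_cycles=256)
short_run = utilization(computing_cycles=2048, latency_cycles=256)
```

With the same latency, the run with fewer computing cycles has the lower utilization (about 0.89 against about 0.97 here), which is the sense in which fewer operating cycles make the latency penalty more apparent.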
                                                                            Memory Usage




                                                          Memory usage consists of two parts: memory bitwidth and
                                                           memory bandwidth.
                                                          Memory bitwidth is defined as the number of bits that the
                                                           hardware has to access from memory in each cycle, and
                                                           memory bandwidth is redefined as the number of bits
                                                           that the hardware has to access from memory for one MB.
                                                          Memory bandwidth affects the load on the system bus when
                                                           there is no on-chip memory, or the power consumption of the
                                                           on-chip memory when there is one. Memory bitwidth is the key
                                                           to the data arrangement of on-chip memories.
                                                          Memory bitwidth and bandwidth are affected by the data
                                                           reuse scheme and the operating cycles.
                                                                              Hexagonal Plot




                                                          The closer the point is to the center, the worse the
                                                           performance is.
                                                          Note that, in various video coding systems or hardware
                                                           system platforms, the weighting of each axis will be
                                                           very different.
                                                          We can use these hexagonal plots to select the optimal
                                                           architecture based on different constraints for the
                                                           system integration.

                                                                                                                          Hexagonal Plots





                                                                                                                          Hexagonal Plots





                                                                                                                          Hexagonal Plots





                                                                                                                          Hexagonal Plots




                                                           Hardware Architecture of H.264 Integer
                                                                    Motion Estimation




                                                          Based on the above analysis, we propose ME hardware for
                                                           H.264/AVC integer-pixel motion estimation (IME) as an example.
                                                          Our specification supports two frame sizes.
                                                            – One is D1 Format with four reference frames, 30 fps. In
                                                              the previous frame, the search range is [-64,64) and
                                                              [-32,32) in the horizontal and vertical directions. In the
                                                              remaining reference frames, the search range is [-32,32)
                                                              and [-16,16) in the horizontal and vertical directions.
                                                            – The other is 720p with one reference frame, 30 fps.
                                                              The search range is the same as that of the previous
                                                              frame in D1 Format.
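The two operating points can be compared by counting search candidates. The sketch below is an estimate under assumptions: the names are hypothetical, D1 is taken as 720 × 480 and 720p as 1280 × 720, the MB size is 16×16, "four reference frames" is read as one previous frame plus three further frames, and every MB is searched in full.

```python
def candidates_per_second(width, height, fps, ranges, mb=16):
    """Search candidates per second for full search over all reference frames.

    ranges: one ((h_lo, h_hi), (v_lo, v_hi)) half-open search range per reference frame.
    """
    mbs_per_frame = (width // mb) * (height // mb)
    per_mb = sum((h_hi - h_lo) * (v_hi - v_lo)
                 for (h_lo, h_hi), (v_lo, v_hi) in ranges)
    return mbs_per_frame * per_mb * fps


PREVIOUS = ((-64, 64), (-32, 32))   # 128 x 64 = 8192 candidates
REST = ((-32, 32), (-16, 16))       # 64 x 32 = 2048 candidates

d1 = candidates_per_second(720, 480, 30, [PREVIOUS, REST, REST, REST])
hd720 = candidates_per_second(1280, 720, 30, [PREVIOUS])
```

Under these assumptions D1 requires about 5.8 × 10^8 candidates per second and 720p about 8.8 × 10^8, so the single-reference 720p mode is the more demanding of the two.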
                                                           Hardware Architecture of H.264 Integer
                                                                Motion Estimation (Cont.)




                                                          Under our specification, instruction profiling of the reference
                                                           software JM7.3 estimates the complexity of H.264 in D1 Format at
                                                           2.4 tera-instructions per second and 3.8 terabytes per second of
                                                           memory access, both dominated by IME.
                                                          The very large computational complexity can be handled by
                                                           parallel computation, but the huge external memory bandwidth
                                                           cannot. Therefore, the memory bandwidth is a difficult
                                                           challenge for hardware design.
                                                          There are still two problems.
                                                            – First, because of VBSME and Lagrangian mode decision,
                                                               the data dependency of the motion vector predictor
                                                               prohibits parallel computation among the smaller blocks in
                                                               an MB.
                                                            – Second, when high processing capability is necessary, the
                                                               hardware cost of ME architectures with high degrees of
                                                               parallelism also has to be considered.
                                                                          Modified Algorithm




                                                          First, we divide the computation of ME into two parts,
                                                           integer-pixel ME (IME) and fractional-pixel ME (FME), and
                                                           propose a separate hardware accelerator for each. This
                                                           significantly improves the utilization of the hardware
                                                           accelerators.
                                                          Second, in the original Lagrangian mode decision, the
                                                           MV predictor of a block is the median of the MVs of the
                                                           top, top-right, and left neighboring 4×4 blocks. In a
                                                           parallel hardware architecture, however, the coding modes
                                                           of the neighboring 4×4 blocks cannot be decided in
                                                           parallel, so the exact predictor is unavailable,
                                                           especially when the block size is 4×4.
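The dependency described above can be sketched in a few lines. This is a minimal illustration, not the reference software: the function names are hypothetical, and the rate term is a simple stand-in (sum of absolute MV differences) for the real bit count of the MV difference. It shows that the Lagrangian cost of a block needs the predictor, and the predictor needs the final MVs of the neighboring blocks.

```python
def median3(a, b, c):
    """Median of three scalar values."""
    return sorted((a, b, c))[1]


def mv_predictor(mv_left, mv_top, mv_top_right):
    """Component-wise median of the three neighboring MVs (x, y)."""
    return (
        median3(mv_left[0], mv_top[0], mv_top_right[0]),
        median3(mv_left[1], mv_top[1], mv_top_right[1]),
    )


def mv_rate_proxy(mv, mvp):
    """Hypothetical rate proxy: size of the MV difference to the predictor.

    The real encoder counts the bits needed to code the difference; this
    stand-in only preserves the dependence on the predictor.
    """
    return abs(mv[0] - mvp[0]) + abs(mv[1] - mvp[1])


def lagrangian_cost(sad, mv, mvp, lam):
    """Cost = distortion (SAD) + lambda * rate of the MV difference."""
    return sad + lam * mv_rate_proxy(mv, mvp)


if __name__ == "__main__":
    # The neighbors' MVs must already be final before this can be evaluated.
    mvp = mv_predictor((1, 5), (3, -2), (2, 0))
    print("predictor:", mvp)
    print("cost:", lagrangian_cost(sad=100, mv=(4, 1), mvp=mvp, lam=4))
```

Because `lagrangian_cost` cannot be evaluated until the neighbors' MVs are known, blocks inside one MB cannot be searched simultaneously with the exact predictor. A predictor that depends only on data outside the current MB removes that dependency.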
                                                         The motion vector predictor for (a) the 4×8 block,
                                                         (b) the 16×16 block, and (c) the modified motion
                                                                   vector predictor for all blocks.




                                                         Hardware Architecture with M-parallelism




                                                          In our specification, we require eight sets of Propagate Partial
                                                           SAD or SAD Tree to achieve real-time computation.
                                                          Eight sets of Propagate Partial SAD and SAD Tree, which can
                                                           process eight successive candidates in a row at the same time,
                                                           are combined as Eight-Parallel Propagate Partial SAD and Eight-
                                                           Parallel SAD Tree, respectively.
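As a software analogy of the eight-parallel arrangement, the sketch below computes the SADs of eight horizontally successive candidates for one current block. It is a behavioral model under stated assumptions, not the hardware design: the function names are hypothetical, the loop over lanes runs sequentially here, and in hardware each lane would be a separate Propagate Partial SAD or SAD Tree unit working at the same time.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences between `cur` and the reference block
    whose top-left corner is at (top, left) in `ref`."""
    n = len(cur)
    return sum(
        abs(cur[y][x] - ref[top + y][left + x])
        for y in range(n)
        for x in range(n)
    )


def eight_parallel_sad(cur, ref, top, left, lanes=8):
    """SADs of `lanes` successive candidates in one row, starting at (top, left).

    Lane k models one hardware unit evaluating the candidate at column left + k.
    Neighboring candidates overlap heavily, so the lanes can share most of the
    reference pixels they read.
    """
    return [sad(cur, ref, top, left + k) for k in range(lanes)]


if __name__ == "__main__":
    # Small synthetic example: a horizontal gradient as the reference area.
    ref = [[x for x in range(12)] for _ in range(4)]
    cur = [[x + 3 for x in range(4)] for _ in range(4)]
    sads = eight_parallel_sad(cur, ref, top=0, left=0)
    print("SADs:", sads)
    print("best lane:", sads.index(min(sads)))
```

For an N×N block, the eight candidates together touch only an N×(N+7) reference region, which is the data-sharing property that makes processing successive candidates in a row attractive.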




                                                         Hardware Architecture of H.264 Integer
                                                                  Motion Estimation.




                                                         Comparison of RD Curves Between JM7.3
                                                              and Our Proposed Encoder




                                                            Memory Reduction of H.264 IME



