High Speed VLSI Architecture for Bit Plane Encoder of JPEG2000

                          Amit Kumar Gupta, David Taubman, Saeid Nooshabadi
                                             University of New South Wales
                                                   Sydney, Australia

  Abstract–The Bit Plane Coder is a part of the JPEG2000 embedded block
coder. Its throughput plays a key role in deciding the overall throughput of
a JPEG2000 encoder. In this paper we present a parallel pipeline VLSI
architecture for the bit plane encoder which processes a complete stripe
column concurrently during every pass. The hardware requirements and the
critical path delay of the proposed technique are compared with the existing
solutions. The experimental results show that the proposed architecture has
2.6 times greater throughput than existing architectures, with a
comparatively small increase in hardware cost.

Fig. 1. Stripe Based Scanning Order and Context Window for a sample location
(The context window shown is for the sample location represented by empty
circle).
                    I. Introduction

  The embedded block coding algorithm [5] of JPEG2000 is a part of the EBCOT
algorithm proposed in [2]. It consists of Bit Plane Coding (BPC) and
Arithmetic Coding (AC) modules. The BPC module encodes the so-called
code-block of quantized wavelet coefficients, and provides context-data pairs
to be encoded by the arithmetic coder. BPC works sequentially on each
bit-plane of the code-block. The bit at each sample location in a bit-plane
is encoded in one of three non-overlapping passes. The coding pass membership
of every location is determined dynamically, depending on the image
statistics. This dynamic nature imposes restrictions, in terms of hardware
cost and critical path, on realizing a high throughput BPC architecture.
  The existing VLSI architectures [3,4] for BPC generate at most one
context-data pair in a single clock cycle. In this paper, we present a VLSI
architecture for the BPC module which generates up to 10 context-data pairs
in a single clock cycle. The architecture exploits the fact that the coding
pass membership of multiple sample locations can be calculated concurrently.
We also review the coding pass membership tests described in [3] and propose
necessary amendments. An APEX20KE FPGA is used as a common platform to
examine the hardware cost and critical path performance of the proposed
architecture in comparison to existing architectures.
  In Section II we explain the BPC algorithm and existing architectures. In
Section III we introduce the proposed algorithm and architecture. The results
are presented in Section IV, and we discuss the impact of the proposed
architecture on the overall throughput of the embedded block coder in
Section V.

  II. Bit Plane Coding And Existing Solutions

  The BPC module encodes the data bit-plane by bit-plane, starting with the
most significant bit-plane for a given code block. Within a bit-plane the
data is scanned in stripe fashion, with 4 rows per stripe, in
column-by-column order from left to right (Fig. 1). Each bit at every
location is coded in one of three non-overlapping passes: the Significance
Propagation Pass (SP); the Magnitude Refinement Pass (MR); and the Cleanup
Pass (CP). A significance state is defined for every bit location which,
along with the significance state of the eight neighboring locations in the
context window (Fig. 1), governs its pass membership. All locations are
assigned insignificant state initially. A location becomes significant
immediately after its first non-zero bit has been coded. A sample location
belongs to:
  SP pass, if it is currently insignificant but at least one of its neighbors
(neighboring locations in the context window) is significant;
  MR pass, if the location is significant and has not been coded in the SP
pass; and
  CP pass, if the location has not been coded in either the SP or MR passes.
  One of the three different coding primitives, Zero Coding, Run-length
Coding and Magnitude Refinement Coding, is used to generate the context for a
data bit, depending on the sample location’s coding pass membership. The
context for coding the sign bit of each sample location is generated
immediately after the sample location becomes significant, using the Sign
Coding primitive.
  Two main architectures have been proposed to date. Both employ stripe
column based processing, as suggested in [1]. The first architecture is a
simple timing architecture [4] which checks each sample location for its
membership in the current pass and, if the sample location belongs to the
current pass, generates its context. The second architecture [3] is more
efficient as it employs a sample skipping strategy to skip those samples,
within a stripe column, which do not belong to the current pass. Both
architectures generate at most one context-data pair in a clock cycle, but
the first architecture wastes a clock cycle whenever it encounters a sample
which does not belong to the current coding pass.
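
The pass membership rules of this section can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the hardware described in this paper: the function name classify_bit_plane and the list-of-lists representation are hypothetical, locations outside the code block are treated as insignificant, and only pass membership (not context formation) is modelled.

```python
def classify_bit_plane(sig, bits):
    """Label every location of one bit-plane as "SP", "MR" or "CP".

    sig  : 2D list of bools, significance state before this bit-plane
    bits : 2D list of 0/1, data bits of this bit-plane
    Locations outside the code block count as insignificant (assumption).
    """
    rows, cols = len(sig), len(sig[0])
    sig = [row[:] for row in sig]              # work on a copy
    passes = [[None] * cols for _ in range(rows)]

    def scan():
        # Stripe-based scan: 4 rows per stripe, column by column.
        for top in range(0, rows, 4):
            for c in range(cols):
                for r in range(top, min(top + 4, rows)):
                    yield r, c

    def has_significant_neighbor(r, c):
        # Eight-neighbor context window around (r, c).
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                    if sig[rr][cc]:
                        return True
        return False

    # SP pass: insignificant locations with a significant neighbor.
    for r, c in scan():
        if not sig[r][c] and has_significant_neighbor(r, c):
            passes[r][c] = "SP"
            if bits[r][c]:
                sig[r][c] = True               # takes effect immediately

    # MR pass: significant locations not coded in the SP pass.
    for r, c in scan():
        if passes[r][c] is None and sig[r][c]:
            passes[r][c] = "MR"

    # CP pass: everything not coded so far.
    for r, c in scan():
        if passes[r][c] is None:
            passes[r][c] = "CP"
            if bits[r][c]:
                sig[r][c] = True
    return passes


if __name__ == "__main__":
    sig = [[True, False], [False, False], [False, False], [False, False]]
    bits = [[1, 0], [1, 0], [0, 0], [0, 0]]
    for row in classify_bit_plane(sig, bits):
        print(row)
```

In this example the location in row 2 of the first column has no significant neighbor at the start of the bit-plane. It joins the SP pass only because the location above it turns significant earlier in the same pass, which is the dynamic behaviour that makes concurrent membership testing difficult.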
   III. Proposed Algorithm and Architecture

  Our proposed algorithm processes a complete stripe column in a single
clock cycle during each coding pass. Thus it generates anywhere between 0 and
10 context-data pairs in a single clock cycle. The extreme case of 10
context-data pairs happens only during the CP pass when the run-mode
condition is satisfied but a run-interrupt occurs immediately at the first
sample location in the stripe column.
  To explain the algorithm, we define 3 state bits, a Significant Status (SS)
bit, a Magnitude Refinement (MR) bit and a Coding State (CS) bit, for each
bit location. For all locations, the SS and MR bits are initialized to zero
before coding the first bit-plane for the code block, while the CS bit is
initialized to zero before coding each bit-plane. The SS bit maintains the
significance status of the location; it becomes 1 after the first non-zero
bit of the location has been coded. The MR status bit maintains a delayed
version of the SS bit. More specifically, the MR bit is set to 1 for a
location when it is first coded in the MR pass. The CS bit maintains the
coding state of a location; it is set to 1 when the location has been coded
in the current bit-plane.
  The overall context window for stripe column based processing is shown in
Fig. 2. The shaded column in Fig. 2 is the current stripe column to be coded.
Pass membership of all 4 sample locations in the current stripe column is
required to be able to process a complete stripe column concurrently. This
requires data, CS and SS bits for the 4 sample locations {0 − 3} of the
current stripe column, and SS bits for all 14 neighboring sample locations
{A − N} as shown in Fig. 2. We also need access to all 18 sign bits in the
stripe column’s context window in order to calculate the sign contexts
concurrently.

        A  G  I
        B  0  J
        C  1  K
        D  2  L
        E  3  M
        F  H  N

Fig. 2. Context window for stripe column processing and respective context
window for each sample location in stripe column

We use the following terminology:
j - location in the current stripe column’s context window {0 − 3, A − N}.
ν[j] - data bit at location j.
σ[j] - significance state of location j before coding the current stripe
column during the current pass.
σ′[j] - significance state of location j after coding the current stripe
column during the current pass. This value may differ from σ[j] only for
locations {0 − 3}.
π[j] - coding state at location j.
Kσ[j] - single bit variable which signifies whether any sample location in
the context window of sample location j is significant.
K̄σ[j] - single bit variable which signifies whether any sample location in
the context window of location j, excluding the locations in the current
stripe column, is significant. For example, for location 0 the considered
locations are A, G, I, B, J, C, and K only.
Pflag[j] - pass flag of location j. A single bit variable; value 1 means that
location j belongs to the current pass.
| - bit-wise OR operation.
& - bit-wise AND operation.
˜ - bit-wise negation.
  In the proposed algorithm, for every stripe column, during every pass, we
calculate
1. Pass flags for each sample location.
2. A Personalized Set of SS Bits (PSSB) for each sample location. The meaning
and significance of the PSSB is explained later in this section.
3. Run-mode eligibility, using the Kσ[j] bits of all sample locations in the
current stripe column.
4. Run-interruption, using the data bits of all sample locations in the
current stripe column.
5. The context of each sample location, from its own set of state bits.
  Concurrent pass membership testing of the stripe column’s sample locations
is the fundamental block in the proposed algorithm. The strategy suggested in
[3] has an inherent assumption that the significance state of none of the
sample locations in the stripe column will change during the current pass
under consideration. Thus this algorithm fails to take the effect of
significance propagation into account. In the proposed algorithm we also
consider the data bits to be coded for the sample locations {0, 1, 2}
(Fig. 2) in order to correctly deduce the coding pass membership for all four
sample locations in the stripe column concurrently.

A. Concurrent membership testing
  To contrast the proposed pass membership testing methodology with that in
[3], we first present the pseudo code to concurrently generate the membership
of all sample locations in a stripe column during the SP pass. While
generating the membership for location j ∈ {1, 2, 3}, we also consider the
possibility that the (j − 1)th location may belong to the SP pass and may
turn significant in this bit-plane.

Location 0:
if (σ[0]) Pass=MR;
elseif (K̄σ[0] | σ[1]) Pass=SP;
else Pass=CP;
Location 1:
if (σ[1]) Pass=MR;
elseif (K̄σ[1] | σ[0] | σ[2]) Pass=SP;
elseif (K̄σ[0] & ν[0]) Pass=SP;
else Pass=CP;
Location 2:
if (σ[2]) Pass=MR;
elseif (K̄σ[2] | σ[1] | σ[3]) Pass=SP;
elseif (((K̄σ[0] & ν[0]) | σ[0] | K̄σ[1]) & ν[1]) Pass=SP;
else Pass=CP;
Location 3:
if (σ[3]) Pass=MR;
elseif (K̄σ[3] | σ[2]) Pass=SP;
elseif (((((K̄σ[0] & ν[0]) | σ[0] | K̄σ[1]) & ν[1]) | σ[1] | K̄σ[2]) & ν[2]) Pass=SP;
else Pass=CP;

  It is important to point out that the above pseudo-code intentionally
contains an extra separate ‘elseif’ condition emphasizing the effect of
significance propagation, which may originate from any of the previous
samples in the stripe column. The logic equations for current pass membership
tests based on the pseudo-code are:
SP pass:

 Pflag[0] = (K̄σ[0] | σ[1]) & (˜σ[0]).                                (1)
 Pflag[1] = (K̄σ[1] | σ[0] | σ[2] | (Pflag[0] & ν[0])) & (˜σ[1]).     (2)
 Pflag[2] = (K̄σ[2] | σ[1] | σ[3] | (Pflag[1] & ν[1])) & (˜σ[2]).     (3)
 Pflag[3] = (K̄σ[3] | σ[2] | (Pflag[2] & ν[2])) & (˜σ[3]).            (4)

MR pass:

 Pflag[j] = σ[j] & (˜π[j]); j ∈ {0, 1, 2, 3}.                        (5)

CP pass:

 Pflag[j] = ˜π[j]; j ∈ {0, 1, 2, 3}.                                 (6)

B. Generation of the Personalized Set of SS Bits

  By PSSB of a sample location, we mean the set of SS bits for locations in
its context window, generated while taking care of the possibility that the
significance state of an immediate vertical neighbour may also change during
the current pass. Specifically, for sample locations 1, 2, and 3, we must
take into account the possible change of significance in locations 0, 1, and
2, respectively, during the current pass. The PSSBs are as follows:
PSSB[0] = {σ[B], σ[J], σ[G], σ[1], σ[A], σ[I], σ[K], σ[C]};
PSSB[1] = {σ[C], σ[K], σ′[0], σ[2], σ[B], σ[J], σ[L], σ[D]};
PSSB[2] = {σ[D], σ[L], σ′[1], σ[3], σ[C], σ[K], σ[M], σ[E]};
PSSB[3] = {σ[E], σ[M], σ′[2], σ[H], σ[D], σ[L], σ[N], σ[F]};
where σ′[j] represents the updated SS bit for location j and can be generated
as

 σ′[j] = ((Pflag[j] & ν[j]) | σ[j]); j ∈ {0, 1, 2}.                  (7)

The proposed algorithm generates a maximum of 10 output context-data pairs,
each with its corresponding active signal. The context-active signals are
calculated based on the current coding pass, the data bits of the stripe
column and the SS bits of the stripe column’s context window.

C. Architecture

  The block diagram of the proposed architecture is shown in Fig. 3. We use a
3 stage pipelined architecture to optimize the critical path. Stage 0 is the
control unit, providing the interface to the main control unit of the
JPEG2000 encoder system. It also controls the memory read and write address
generation for data, sign and state bits and coding pass information.

Fig. 3. Block diagram of proposed VLSI architecture for BPC module

  Stage 1 is responsible for generating intermediate variables; it consists
of the following blocks:
  Boundary Handler: This block takes care of boundary conditions. For the
first stripe column in every stripe, the SS bits of column 1 in the stripe
column’s context block should be taken to be 0. Similarly, for the last
stripe column in every stripe, the SS bits in column 3 are taken to be 0.
This block reads from the SS bit register and generates SS bits for columns 1
and 3, in accordance with the boundary conditions. The BPC module reads a new
stripe column every clock cycle (assuming there is no stall generated from
the AC module). This helps to avoid investing an extra clock cycle at stripe
boundaries and aids in creating smooth memory access patterns for data, sign
and state bits. The other boundary condition, corresponding to the case where
the last stripe does not contain all 4 rows, is handled by assigning 1 to the
CS state bits of the out-of-bound sample locations.
  Pflag Generation: This block generates the pass membership flags, using the
logic equations (1)-(6) given in Section III-A.
  KSig (Kσ[j]) Generation: This block generates Kσ[j] for all sample
locations in the stripe column. It is used by the Run Mode Signal Generation
block and the MR Context Generation block.
  Sample Sigma Generation: This block generates the PSSB for all sample
locations in the stripe column.
  Regs Update Logic: This block updates the state bits of all sample
locations, depending on their pass membership.
  Run Mode Signal Generation: This block generates the run-mode and
run-interrupt signals, using Kσ[j] and the data bits. The run-mode signal
specifies whether the current stripe column will enter run-mode or not. The
run-interrupt signal indicates a run interruption in the current stripe
column.
  Stage 2 is responsible for calculating the contexts, depending on the PSSB,
MR, sign and data bits. It contains processing blocks for generating ZC
contexts, MR contexts, sign contexts and run-length contexts. It also
generates the context-active signal, depending on the current pass, run-mode
signal, run-interrupt signal, pass flags, and data bits, to identify the
active context-data pairs. A multiplexer is also required to select between
the ZC and MR contexts, depending on the current pass.
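
The concurrent SP-pass membership test of Section III-A can be checked in software against a one-sample-at-a-time scan. The sketch below is illustrative only and is not the hardware implementation: the function names are hypothetical, and it assumes that the upper-neighbour term of each flag is the updated significance σ′[j − 1] of equation (7), that is, the old state of the sample above OR-ed with its propagation term.

```python
from itertools import product


def sp_flags_unrolled(s, k, v):
    """Closed-form SP-pass flags for locations 0..3 of one stripe column.

    s[j]: significance before the pass (sigma[j])
    k[j]: True if any neighbour outside the stripe column is significant
    v[j]: data bit of the current bit-plane (nu[j])
    """
    p0 = (k[0] or s[1]) and not s[0]
    s0 = s[0] or (p0 and v[0])        # updated significance of location 0
    p1 = (k[1] or s[2] or s0) and not s[1]
    s1 = s[1] or (p1 and v[1])        # updated significance of location 1
    p2 = (k[2] or s[3] or s1) and not s[2]
    s2 = s[2] or (p2 and v[2])        # updated significance of location 2
    p3 = (k[3] or s2) and not s[3]
    return [p0, p1, p2, p3]


def sp_flags_scan(s, k, v):
    """Reference model: visit the four samples one by one, top to bottom."""
    sig = list(s)
    flags = []
    for j in range(4):
        above = sig[j - 1] if j > 0 else False   # already updated this pass
        below = sig[j + 1] if j < 3 else False   # not visited yet
        in_sp = (k[j] or above or below) and not sig[j]
        flags.append(in_sp)
        if in_sp and v[j]:
            sig[j] = True                        # significance propagates
    return flags


def all_inputs():
    # Every combination of the 12 input bits (4 each for s, k and v).
    for bits in product([False, True], repeat=12):
        yield bits[0:4], bits[4:8], bits[8:12]


mismatches = sum(
    sp_flags_unrolled(s, k, v) != sp_flags_scan(s, k, v)
    for s, k, v in all_inputs()
)
print(mismatches)
```

The unrolled form contains no loop-carried state apart from the short chain through s0, s1 and s2, so all four flags can be produced by combinational logic in a single clock cycle. The exhaustive comparison covers all 4096 input combinations for a single stripe column.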
                              IV. Results

  We used an APEX20KE FPGA to implement the proposed architecture. To compare
the performance of our architecture, we also implemented the simple timing
[4] and sample skipping [3] architectures on the same platform. Critical path
delay and hardware cost (number of cells) are used as the basis for
comparison. The state, data and sign memories are not included in the
hardware cost analysis, since they are present in all three architectures.

A. Throughput Analysis

  We use the following parameters for throughput analysis:
Size of code block = [64, 64].
Bit depth of wavelet coefficients = 16 (1 sign, 15 data bits).
Average # of empty columns (from [1]) = 23634.
  In the simple timing architecture [4] each location is checked three times
for its membership, once during each pass. The stripe column based sample
skipping architecture [3] processes only those samples in a stripe column
which belong to the current pass, by calculating the pass membership of all 4
samples in a stripe column concurrently. The proposed architecture consumes
only a single clock cycle for each stripe column during each pass. Table I
presents the average throughput figures for the BPC module, calculated using
the above mentioned parameters and the operating frequencies listed in
Table II. The results show that the proposed BPC architecture has
approximately 2.6 times greater throughput than existing architectures.

   Architecture        Cycles/code block    Throughput (10^6 samples/sec)
   Simple Timing            156590                    1.17
   Sample Skipping           89170                    1.77
   Proposed                  46080                    4.59
                                   TABLE I
                             Throughput analysis

B. Critical path delay and hardware cost analysis

  As shown in Table II, the hardware cost of the proposed architecture is
just 1.3 times that of the sample skipping architecture, while the proposed
architecture has a 34% improvement in operating frequency. The main reasons
behind the critical path improvement are the efficient use of pipelining and
the absence of the extra sequential logic for the multiplexers used to skip
samples [3].

   Architecture        Clk (MHz)    Area (Gates)
   Simple Timing          51.7          631
   Sample Skipping        38.6          710
   Proposed               51.7          956
                                   TABLE II
               Critical path delay and hardware cost analysis

                             V. Discussion

  Up to this point we have ignored the impact of the arithmetic coder on the
overall throughput of the embedded block coder. The overall throughput
depends on the throughput of both the bit plane and arithmetic coder as well
as the intermediate buffering used to couple them. Traditional arithmetic
coder architectures can accept at most one context-data pair per clock cycle.
It is possible, however, to realise concurrent symbol processing in the
arithmetic coder [5] with some extra hardware cost. Our preliminary
simulation results show that the proposed BPC architecture can fulfill the
high input demands of a multiple symbol arithmetic coder, resulting in nearly
a 33% increase in overall throughput. However, existing BPC architectures
cannot realize such improvements, due to the absence of concurrent symbol
processing. Even with a single symbol/clk-cycle arithmetic coder, the
proposed architecture achieves approx. 10% better overall throughput due to a
smaller number of stalls at the arithmetic coder. Additionally, our
architecture can also be augmented with speedup techniques like multiple
column skipping [3] and parallel context modelling [6]. The concurrent symbol
processing technique can also be scaled to process multiple stripe columns
concurrently. Our ongoing work addresses the impact of employing such speedup
techniques, intermediate buffer length, and the extra hardware cost involved,
in improving the overall throughput of the embedded block coder.

                             VI. Conclusion

  We have presented an algorithm and VLSI architecture for a BPC engine which
processes a complete stripe column during each pass concurrently. The
architecture is efficiently designed to handle the boundary conditions and is
pipelined to enhance the clock rate. The results show that the proposed
architecture has 2.6 times greater throughput with little extra hardware
cost. Also, the proposed architecture enables the use of multi-symbol
arithmetic encoders to realize a complete high throughput embedded block
encoder. Additionally, the proposed architecture takes care of the boundary
conditions in an efficient way, which in turn results in a smooth memory
access pattern.

                               References

[1] D. S. Taubman and M. W. Marcellin, JPEG 2000: Image Compression
    Fundamentals, Standards and Practice. Norwell, MA: Kluwer, 2002.
[2] D. Taubman, "High performance scalable image compression with EBCOT,"
    IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
[3] Chung-Jr Lian, Kuan-Fu Chen, Hong-Hui Chen, Liang-Gee Chen, "Analysis and
    architecture design of block-coding engine for EBCOT in JPEG2000," IEEE
    Trans. Circuits and Systems for Video Technology, vol. 13, pp. 219-230,
    March 2003.
[4] Kishore Andra, Tinku Acharya, Chaitali Chakrabarti, "Efficient VLSI
    implementation of bit plane coder of JPEG2000," in Proc. SPIE Int. Conf.
    Applications of Digital Image Processing XXIV, vol. 4472, pp. 246-257.
[5] D. Taubman, E. Ordentlich, M. Weinberger, G. Seroussi, I. Ueno, and
    F. Ono, "Embedded block coding in JPEG2000," in Proc. 2000 International
    Conference on Image Processing, vol. 2, pp. 33-36, Sept 2000.
[6] Yijun Li, Ramy E. Aly, Beth Wilson, Magdy A. Bayoumi, "Analysis and
    Enhancements for EBCOT in high speed JPEG2000 architectures," in Midwest
    Symp. on Ckts. and Systems, vol. 2, pp. 207-210, Aug. 2002.
