# Arithmetic coding

A PRESENTATION ON CODING THEORY

From the Department of Computer Science
College of Natural Sciences (COLNAS)
University of Agriculture, Abeokuta (UNAAB)

INFORMATION AND COMMUNICATION THEORY   BY: OBI KENNETH ABANG, COURTESY OF 2010 SETS


1.0 Coding Theory
1.1   Data compression (or source coding)

1.1.2 Principle of source coding

1.2   Error correction (or channel coding)

2.0 Arithmetic coding
2.1   Arithmetic coding and elementary number theory

2.1.1 Example

2.2   Theoretical limit of compressed message

2.3   Using probabilities instead of frequencies

2.4   Implementation details for the probability concept

2.4.1 Defining a model

2.4.2 A simplified example

2.4.3 Encoding and decoding

2.5   Precision and renormalization

2.6   Connections between arithmetic coding and Huffman coding

2.7   US patents on arithmetic coding

2.8   Benchmarks and other technical characteristics

3.0 Huffman coding
3.1      History

3.2      Problem definition

3.2.1 Basic technique

3.2.2 Example
3.3      Main properties

3.4      Variations

3.4.1 n-ary Huffman coding

3.4.3 Huffman template algorithm

3.4.4 Length-limited Huffman coding

3.4.5 Huffman coding with unequal letter costs

3.4.6 The canonical Huffman code

3.4.7 Model reconstruction

3.5      Applications

4.0 Implementation
5.0 References

1.0 Coding Theory
Coding theory is studied by various scientific disciplines — such as information theory,
electrical engineering, mathematics, and computer science — for the purpose of designing
efficient and reliable data transmission methods. This typically involves the removal of
redundancy and the correction (or detection) of errors in the transmitted data. It also includes the
study of the properties of codes and their fitness for a specific application.

Thus, there are essentially two aspects to Coding theory:

1. Data compression (or source coding)
2. Error correction (or channel coding)

1.1      DATA COMPRESSION (OR SOURCE CODING)

It deals with the properties of codes and with their fitness for a specific application. Source
encoding attempts to compress the data from a source in order to transmit it more efficiently.
This practice is found every day on the Internet where the common "Zip" data compression is
used to reduce the network load and make files smaller. The second, channel encoding, adds
extra data bits to make the transmission of data more robust to disturbances present on the
transmission channel. The ordinary user may not be aware of many applications using channel
coding. A typical music CD uses the Reed-Solomon code to correct for scratches and dust. In
this application the transmission channel is the CD itself. Cell phones also use coding techniques
to correct for the fading and noise of high frequency radio transmission. Data modems, telephone
transmissions, and NASA all employ channel coding techniques to get the bits through, for
example the turbo code and LDPC codes.

There are different coding methods, which include:

1. Arithmetic coding
2. Huffman coding
3. Range coding
4. Cyclic codes
5. Hamming coding


1.1.2 PRINCIPLE OF SOURCE CODING

Entropy of a source is the measure of information. Basically source codes try to reduce the
redundancy present in the source, and represent the source with fewer bits that carry more
information.

Data compression which explicitly tries to minimize the average length of messages according to
a particular assumed probability model is called entropy encoding.

Various techniques used by source coding schemes try to approach the entropy limit of the
source: C(x) ≥ H(x), where H(x) is the entropy of the source (its bitrate) and C(x) is the bitrate
after compression. In particular, no source coding scheme can do better than the entropy of the source.
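The inequality above can be illustrated with a short sketch (Python is assumed throughout these examples; the three-symbol distribution is a hypothetical illustration):

```python
import math

# Entropy H(x) of a memoryless source: the lower bound, in bits per symbol,
# on the rate C(x) of any lossless source code.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical source with symbol probabilities 1/2, 1/4, 1/4.
print(entropy([0.5, 0.25, 0.25]))   # 1.5
```

No lossless code for this source can average fewer than 1.5 bits per symbol; because the probabilities are powers of 1/2, a code with lengths 1, 2, 2 meets the bound exactly.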

1.2      ERROR CORRECTION (OR CHANNEL CODING)

The aim of channel coding theory is to find codes which transmit quickly, contain many valid
code words and can correct or at least detect many errors. While not mutually exclusive,
performance in these areas is a trade off. So, different codes are optimal for different
applications. The needed properties of this code mainly depend on the probability of errors
happening during transmission. In a typical CD, the impairment is mainly dust or scratches. Thus
codes are used in an interleaved manner. The data is spread out over the disk. Although not a
very good code, a simple repeat code can serve as an understandable example. Suppose we take a
block of data bits (representing sound) and send it three times. At the receiver we will examine
the three repetitions bit by bit and take a majority vote. The twist on this is that we don't merely
send the bits in order. We interleave them. The block of data bits is first divided into 4 smaller
blocks. Then we cycle through the block and send one bit from the first, then the second, etc.
This is done three times to spread the data out over the surface of the disk. In the context of the
simple repeat code, this may not appear effective. However, there are more powerful codes
known which are very effective at correcting the "burst" error of a scratch or a dust spot when
this interleaving technique is used.

Other codes are more appropriate for different applications. Deep space communications are
limited by the thermal noise of the receiver, which is more of a continuous nature than a bursty
nature. Likewise, narrowband modems are limited by the noise present in the telephone network,
which is also better modeled as a continuous disturbance. Cell phones are subject to rapid fading:
the high frequencies used can cause rapid fading of the signal even if the receiver is moved only a
few inches. Again, there is a class of channel codes designed to combat fading.

2.0 Arithmetic coding
Arithmetic coding is a form of variable-length entropy encoding (the process of representing
information in the most compact form) used in lossless data compression. Among all the
different entropy-coding methods and their possible applications in compression, arithmetic
coding stands out in terms of elegance, effectiveness and versatility, since it is able to work
most efficiently in the largest number of circumstances and purposes.
Features

Among its most desirable features we have the following.

1. When applied to independent and identically distributed (i.i.d.) sources, the compression
of each symbol is provably optimal.
2. It is effective in a wide range of situations and compression ratios. The same arithmetic
coding implementation can effectively code all the diverse data created by different
processes, such as modeling parameters, transform coefficients, signaling, etc.
3. It simplifies automatic modeling of complex sources, yielding near-optimal or
significantly improved compression for sources that are not i.i.d.
4. Its main process is arithmetic, which is supported with ever-increasing efficiency by all
general-purpose processors and digital signal processors (CPUs, DSPs).
5. It is suited for use as a "compression black-box" by those who are not coding experts or do
not want to implement the coding algorithm themselves.
Even with all these advantages, arithmetic coding is not as popular and well understood
as other methods. Certain practical problems held back its adoption:

1. The complexity of arithmetic operations was excessive for coding applications.
2. Patents covered the most efficient implementations. Royalties and the fear of patent
infringement discouraged the use of arithmetic coding in commercial products.
3. Efficient implementations were difficult to understand.

However, this situation has since changed:

1. First, the relative efficiency of computer arithmetic improved dramatically, and new
techniques avoid the most expensive operations.
2. Second, some of the patents have expired (e.g., [11, 16]), or became obsolete.

3. Finally, we do not need to worry so much about complexity-reduction details that obscure
the inherent simplicity of the method. Current computational resources allow us to
implement simple, efficient, and royalty-free arithmetic coding.

When a string is converted to arithmetic encoding, frequently-used characters will be
stored with fewer bits and not-so-frequently occurring characters will be stored with more
bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of
entropy encoding such as Huffman coding, in that rather than separating the input into
component symbols and replacing each with a code, arithmetic coding encodes the entire
message into a single number, a fraction n where (0.0 ≤ n < 1.0).


2.1 Arithmetic coding and elementary number theory
Arithmetic coding may be examined from the perspective of number theory. It can be interpreted
as a generalized change of radix. The best way to introduce the concept is to consider an
elementary example. We may look at any sequence of symbols

DABDDB

as a number in a certain base presuming that the involved symbols form an ordered set and each
symbol in the ordered set denotes a sequential integer A=0, B=1, C=2, D=3 and so on. If we
make a table of frequencies and cumulative frequencies for this message it looks like the
following

Symbol Frequency of occurrence Cumulative frequency
A     1                       1
B     2                       3
D     3                       6

Cumulative Frequency:

The total of a frequency and all frequencies below it in a frequency distribution. It is the 'running
total' of frequencies.

So it is 1 for A, 3 for B and 6 for D.
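The running total can be sketched in one line of Python using `itertools.accumulate`:

```python
from itertools import accumulate

# Cumulative frequencies are the running totals of the frequency column.
symbols = ["A", "B", "D"]
freqs = [1, 2, 3]
cum = list(accumulate(freqs))
print(dict(zip(symbols, cum)))   # {'A': 1, 'B': 3, 'D': 6}
```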


In a positional numeral system, the radix (or base) is numerically equal to the number of different
symbols used to express the number; for example, in the decimal system the number of symbols is 10,
namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The radix is used to express any finite integer in
polynomial form. For example, the number 457 is actually 4 * 10^2 + 5 * 10^1 + 7 * 10^0,
where base 10 is presumed but not shown explicitly. When working with a different radix we
have to introduce a different set of symbols but, for convenience, we can simply use a subset of the
familiar decimal digits, such as 0, 1, 2, 3, 4, and 5 for radix 6.

If we choose radix = 6, equal to the size of the message, and convert the expression DABDDB into
a decimal number, we first map the letters into digits, 301331, and then we have

6^5 * 3 + 6^4 * 0 + 6^3 * 1 + 6^2 * 3 + 6^1 * 3 + 6^0 * 1 = 23671.

The result 23671 has a length of 15 bits, which is not close to the theoretical
limit computed via entropy, which is near 9 bits:

[− 3/6 * log2 (3/6) − 2/6 * log2 (2/6) − 1/6 * log2 (1/6)] * 6 = 8.75 bits
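Both numbers above can be checked with a short Python sketch of the plain radix change and the entropy formula:

```python
import math

# DABDDB as base-6 digits (A=0, B=1, C=2, D=3): a plain change of radix.
digits = [3, 0, 1, 3, 3, 1]
value = 0
for d in digits:
    value = value * 6 + d
print(value, value.bit_length())   # 23671 15

# Entropy of the distribution {D: 3, B: 2, A: 1}, times the message length 6.
freqs = [3, 2, 1]
n = sum(freqs)
limit = -sum(f / n * math.log2(f / n) for f in freqs) * n
print(round(limit, 2))             # 8.75
```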

In order to make it in accordance with information theory we need to slightly generalize the
classic formula for changing the radix. We have to compute LOW and HIGH limits and choose a
convenient number between them. For the computation of the LOW limit we multiply each next
term in the above expression by the product of the frequencies of all previously occurred
symbols, so it is turned into the following expression

LOW = 6^5 * 3 + 3 * [6^4 * 0 + 1 * [6^3 * 1 + 2 * [6^2 * 3 + 3 * [6^1 * 3 + 3 * [6^0 * 1]]]]] = 25002.

The HIGH limit must be the LOW limit plus the product of all frequencies. HIGH = LOW + 3 *
1 * 2 * 3 * 3 * 2 = 25002 + 108 = 25110. Now we can choose any number to represent the
message from the semi-closed interval [LOW, HIGH), which we can take as a number with the
longest possible trail of zeros, for example 25100. In case of a long message this trail of zeros
will be much longer and can be either dropped or presented as an exponent. The number 251,
received after truncation of zeros, has a length of 8 bits, which is even less than the theoretical
limit. In order to represent the computation of the LOW limit in a simple, easy to remember
format, we can offer a table. Each row contains the factors for every term in the above formula.
We can clearly see the part that distinguishes this computation of the LOW limit from the
classical change of the base. It is column 'Part 2', containing the products of frequencies for all
previously occurred symbols.

Symbol   Part 1    Part 2              Total
D        6^5 * 3                       23328
A        6^4 * 0   3                   0
B        6^3 * 1   3 * 1               648
D        6^2 * 3   3 * 1 * 2           648
D        6^1 * 3   3 * 1 * 2 * 3       324
B        6^0 * 1   3 * 1 * 2 * 3 * 3   54
                             LOW =     25002
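The table lends itself to a short sketch (Python; the cumulative lower bounds 0, 1, 3 for A, B, D serve as the "digits"):

```python
# LOW/HIGH computation for DABDDB, following the table above.
freq = {"A": 1, "B": 2, "D": 3}
low_bound = {"A": 0, "B": 1, "D": 3}    # lower end of each symbol's interval
message = "DABDDB"
n = len(message)

low = 0
prod = 1                                # product of frequencies seen so far
for i, s in enumerate(message):
    low += 6 ** (n - 1 - i) * low_bound[s] * prod
    prod *= freq[s]
high = low + prod                       # HIGH = LOW + product of all frequencies

print(low, high)                        # 25002 25110
```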

In order to complete the topic we have to show how to convert the number 25100 back to the
original message. The reverse process can be also shown in a table. It has two logical steps:
identification of the symbol, and subtraction of the corresponding term from the result.

Remainder  Identification    Identified symbol  Corrected remainder
25100      25100 / 6^5 = 3   D                  (25100 − 6^5 * 3) / 3 = 590
590        590 / 6^4 = 0     A                  (590 − 6^4 * 0) / 1 = 590
590        590 / 6^3 = 2     B                  (590 − 6^3 * 1) / 2 = 187
187        187 / 6^2 = 5     D                  (187 − 6^2 * 3) / 3 = 26
26         26 / 6^1 = 4      D                  (26 − 6^1 * 3) / 3 = 2
2          2 / 6^0 = 2       B

In identification we divide the result by the corresponding power of 6. The fractional part of the
division is discarded. The result is then matched against the cumulative intervals and the
appropriate symbol is selected from a lookup table. When the symbol is identified, the result is
corrected. The process is continued for the known length of the message or while the remaining
result is positive. As we can see, the only difference compared to the classical formula is that the
identified symbol is not a sequential integer but the integer associated with an interval: A is
always 0, B is either 1 or 2, and D is any of 3, 4, and 5. This is in exact accordance
with our intervals, which are determined by the frequencies. When all intervals are equal to 1 we have
the special case of a classic base change, but the computational part is the same.
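The same two steps, identification and correction, can be sketched in Python (the `find_symbol` helper is illustrative, not part of any library):

```python
# Decode 25100 back to DABDDB using the frequency table from the example.
freq = {"A": 1, "B": 2, "D": 3}
low_bound = {"A": 0, "B": 1, "D": 3}
n = 6

def find_symbol(q):
    # Map an integer in [0, 6) to the symbol whose interval contains it:
    # A covers 0, B covers 1-2, D covers 3-5.
    for s in ("D", "B", "A"):            # check the highest lower bound first
        if q >= low_bound[s]:
            return s

remainder = 25100
decoded = []
for i in range(n):
    power = 6 ** (n - 1 - i)
    s = find_symbol(remainder // power)  # identification (fraction discarded)
    decoded.append(s)
    remainder = (remainder - power * low_bound[s]) // freq[s]  # correction

print("".join(decoded))                  # DABDDB
```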

The generic formula for the computation of the LOW limit for a message of n symbols may be
expressed as

LOW = Σ_{i=1..n} [ n^(n−i) * C_i * Π_{k=1..i−1} f_k ]

and the HIGH limit is computed as

HIGH = LOW + Π_{k=1..n} f_k,

where C_i are the cumulative frequencies (the lower bounds of the symbol intervals) and f_k are the
frequencies of occurrence; the indexes denote the position of the symbol in the message. In case all
frequencies f_k are equal to 1 the formulas turn into the special case of expressing a number in a
different base.

2.1.1 Example

Suppose we have a message that only contains the characters A, B, and C, with the following
frequencies, expressed as fractions:

A:   0.5
B:   0.2
C:   0.3
To show how arithmetic compression works, we first set up a table, listing characters with their
probabilities along with the cumulative sum of those probabilities. The cumulative sum defines
"intervals", ranging from the bottom value to less than, but not equal to, the top value. The order
in which characters are listed in the table does not seem to be important, except to the extent that
both the coder and decoder have to know what the order is.
letter   probability   interval
________________________________

C:       0.3           0.0 : 0.3
B:       0.2           0.3 : 0.5
A:       0.5           0.5 : 1.0
________________________________

Now each character can be coded by the shortest binary fraction whose value falls in the
character's probability interval:
letter   probability   interval    binary fraction
_______________________________________________________

C:       0.3           0.0 : 0.3   0
B:       0.2           0.3 : 0.5   0.011 = 3/8 = 0.375
A:       0.5           0.5 : 1.0   0.1   = 1/2 = 0.5
_______________________________________________________

This shows how single characters can be assigned minimum-length binary codes. However,
arithmetic coding doesn't stop there: it does not simply translate the individual characters of a
message into these binary codes. It takes a subtler approach, assigning binary fractions to complete
messages.


To start, let's consider sending messages consisting of all possible two-character strings over
these three characters. We determine the probability of each two-character string by multiplying the
probabilities of its two characters, and then set up a series of intervals using those probabilities.

string   probability   interval       binary fraction
_____________________________________________________________

CC:      0.09          0.00 : 0.09    0.0001 = 1/16 = 0.0625
CB:      0.06          0.09 : 0.15    0.001 = 1/8    = 0.125
CA:      0.15          0.15 : 0.30    0.01   = 1/4   = 0.25
BC:      0.06          0.30 : 0.36    0.0101 = 5/16 = 0.3125
BB:      0.04          0.36 : 0.40    0.011 = 3/8    = 0.375
BA:      0.10          0.40 : 0.50    0.0111 = 7/16 = 0.4375
AC:      0.15          0.50 : 0.65    0.1    = 1/2   = 0.5
AB:      0.10          0.65 : 0.75    0.1011 = 11/16 = 0.6875
AA:      0.25          0.75 : 1.00    0.11   = 3/4   = 0.75
_____________________________________________________________

The higher the probability of the string, in general the shorter the binary fraction needed to
represent it.

Let's build a similar table for three characters now:

string   probability   interval        binary fraction
______________________________________________________________________

CCC      0.027         0.000 : 0.027   0.000001  = 1/64   = 0.015625
CCB      0.018         0.027 : 0.045   0.00001   = 1/32   = 0.03125
CCA      0.045         0.045 : 0.090   0.0001    = 1/16   = 0.0625
CBC      0.018         0.090 : 0.108   0.00011   = 3/32   = 0.09375
CBB      0.012         0.108 : 0.120   0.000111  = 7/64   = 0.109375
CBA      0.03          0.120 : 0.150   0.001     = 1/8    = 0.125
CAC      0.045         0.150 : 0.195   0.0011    = 3/16   = 0.1875
CAB      0.03          0.195 : 0.225   0.00111   = 7/32   = 0.21875
CAA      0.075         0.225 : 0.300   0.01      = 1/4    = 0.25

BCC      0.018         0.300 : 0.318   0.0101    = 5/16   = 0.3125
BCB      0.012         0.318 : 0.330   0.010101  = 21/64  = 0.328125
BCA      0.03          0.330 : 0.360   0.01011   = 11/32  = 0.34375
BBC      0.012         0.360 : 0.372   0.0101111 = 47/128 = 0.3671875
BBB      0.008         0.372 : 0.380   0.011     = 3/8    = 0.375
BBA      0.02          0.380 : 0.400   0.011001  = 25/64  = 0.390625
BAC      0.03          0.400 : 0.430   0.01101   = 13/32  = 0.40625
BAB      0.02          0.430 : 0.450   0.0111    = 7/16   = 0.4375
BAA      0.05          0.450 : 0.500   0.01111   = 15/32  = 0.46875

ACC      0.045         0.500 : 0.545   0.1       = 1/2    = 0.5
ACB      0.03          0.545 : 0.575   0.1001    = 9/16   = 0.5625
ACA      0.075         0.575 : 0.650   0.101     = 5/8    = 0.625
ABC      0.03          0.650 : 0.680   0.10101   = 21/32  = 0.65625
ABB      0.02          0.680 : 0.700   0.1011    = 11/16  = 0.6875
ABA      0.05          0.700 : 0.750   0.10111   = 23/32  = 0.71875
AAC      0.075         0.750 : 0.825   0.11      = 3/4    = 0.75
AAB      0.05          0.825 : 0.875   0.11011   = 27/32  = 0.84375
AAA      0.125         0.875 : 1.000   0.111     = 7/8    = 0.875
______________________________________________________________________

Obviously this same procedure can be followed for more characters, resulting in a longer binary
fractional value. What arithmetic coding does is find the probability value of an entire message,
and arrange it as part of a numerical order that allows its unique identification.

Let's stop here and send one of the binary strings defined in the table above to a decoder. We'll
arbitrarily select the binary string with the decimal value of 0.21875 from the table above.

This value was obtained using the probability values and intervals defined earlier:

string   probability   interval
________________________________

C:       0.3           0.0 : 0.3
B:       0.2           0.3 : 0.5
A:       0.5           0.5 : 1.0
________________________________
The value 0.21875 clearly falls into the interval for "C", so "C" must be the first character. We
can then "zoom in" on the characters that follow the "C" by subtracting the bottom value of the
interval for "C", which happens to be 0, and dividing the result by the width of the probability
interval for "C", which is 0.3:
(0.21875 - 0) / 0.3         =    0.72917
This is a simple shift and scaling operation.

The result falls into the probability interval for "A", and so the second character must be "A". We
can then zoom in on the next character by the same approach as before, subtracting the bottom
value of the interval for "A", which is 0.5, and dividing the result by the width of the probability
interval for "A", which is also 0.5:

(0.72917 - 0.5) / 0.5           = 0.4583
This clearly falls into the probability interval for "B", and so the string has been correctly
uncompressed to "CAB", which is the correct answer.
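The whole zoom-in loop can be sketched as follows (floating-point Python, adequate only for short illustrative messages):

```python
# Interval decoding by repeated shift-and-scale, as in the walkthrough above.
intervals = {"C": (0.0, 0.3), "B": (0.3, 0.5), "A": (0.5, 1.0)}

def decode(value, length):
    out = []
    for _ in range(length):
        for sym, (lo, hi) in intervals.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)   # shift and scale
                break
    return "".join(out)

print(decode(0.21875, 3))   # CAB
```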

Unfortunately, this leaves behind a remainder that can be decoded into an indefinitely long string
of bogus characters. This is an artifact of using decimal floating-point math to perform the
calculations in this example. In practice, arithmetic coding is based on binary fixed-point math,
which avoids this problem.

One other problem is the fact that the binary fraction that is output by the arithmetic coder is of
indefinite length, and the decoder has no idea of where the string ends if it isn't told. In practice,
a length header can be sent to indicate how long the fraction is, or an end-of-transmission symbol
of some sort can be used to tell the decoder where the end of the fraction is.

2.2 Theoretical limit of compressed message
It is easy to show that the computed LOW limit never exceeds n^n, independently of the order of
symbols, where n is the size of the message. That means that the binary length of the LOW limit
can be estimated as log2 (n^n) = n * log2 (n). After the computation of the HIGH limit and the
reduction of the message by selecting a number from the interval [LOW, HIGH) with the longest
trail of zeros, we can presume that this length can be reduced by

log2 ( Π_{k=1..n} f_k )

bits. Since each frequency occurs in the product exactly as many times as its own value,
we can use the alphabet A for the computation of the product:

Π_{k=1..n} f_k = Π_{a∈A} f_a^(f_a)

We can see this clearly in the above example: the product of all frequencies in the message is 3 * 1
* 2 * 3 * 3 * 2, where 3 occurs exactly 3 times, 2 occurs exactly 2 times, and so on. Applying log2,
the estimated number of bits in the compressed message is

n * log2 (n) − Σ_{a∈A} f_a * log2 (f_a).

Using the numbers from above example we can see the exact match with the Shannon entropy
limit calculated before

6 * log2 (6) − 3 * log2 (3) − 2 * log2 (2) − 1 * log2 (1) = 8.75.
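A quick numeric check (Python) that the frequency-based formula and the entropy-based formula give the same value for this example:

```python
import math

# n*log2(n) - sum(f*log2(f))  versus  n * E, for frequencies {D:3, B:2, A:1}.
freqs = [3, 2, 1]
n = sum(freqs)

via_freqs = n * math.log2(n) - sum(f * math.log2(f) for f in freqs)
via_entropy = -n * sum(f / n * math.log2(f / n) for f in freqs)

print(round(via_freqs, 2), round(via_entropy, 2))   # 8.75 8.75
```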

It is not a coincidence. The formulas

n * log2 (n) − Σ_{a∈A} f_a * log2 (f_a)

and

− n * Σ_{a∈A} p_a * log2 (p_a)

can be algebraically converted into each other. The latter represents the entropy E multiplied by
the length of the message, using the probabilities p_a = f_a / n of the occurrences of the involved
symbols. The entropy E was introduced by Claude Shannon in his fundamental work "A Mathematical
Theory of Communication" as a statistical characteristic used in estimating the quantity of
information.

Another fundamental property that should be mentioned is the relationship between entropy and
the number of possible permutations. It was already shown that compression is achieved as a result
of the uneven distribution of symbols: it depends on the product of all frequencies in a message and
not on the order of symbols. If we consider the distribution as a fixed parameter, we can simply
enumerate all possible messages and pass the index of the message instead of the message itself.
The maximum possible index is equal to the number of distinct permutations of the message, and
the bit length of this number is estimated by the formula

log2 ( n! / Π_{a∈A} f_a! ).

The two previous expressions thus estimate the number of bits in a compressed message as

n * log2 (n) − Σ_{a∈A} f_a * log2 (f_a)   ≈   log2 ( n! / Π_{a∈A} f_a! ).

These two estimates may have a noticeable difference for short messages, but starting from about
1000 symbols there is only a fraction of a percent difference. That means that entropy is
close to the bit length of the number of possible permutations divided by the size of the message.
The reason these formulas provide close results is the Stirling approximation for the logarithm
of the factorial,

n * log2 (n) − n * log2 (e) ≈ log2 (n!),

where both sides approach each other as n increases. Taking all these relationships into
consideration, we may define arithmetic coding as converting a message to a whole number whose
length is close to the bit length of the number of possible permutations for the provided
statistical distribution of symbols and does not depend on the particular order of symbols.
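The closing claim can be checked numerically (Python; the frequency vectors are hypothetical examples):

```python
import math

# Entropy-based estimate vs. the bit length of the number of distinct
# permutations (the multinomial coefficient n! / prod(f_a!)).
def estimates(freqs):
    n = sum(freqs)
    entropy_bits = n * math.log2(n) - sum(f * math.log2(f) for f in freqs)
    perms = math.factorial(n)
    for f in freqs:
        perms //= math.factorial(f)
    return entropy_bits, math.log2(perms)

print(estimates([3, 2, 1]))          # noticeable gap for a 6-symbol message
print(estimates([500, 300, 200]))    # under 1% gap for 1000 symbols
```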


2.3 Using probabilities instead of frequencies
Traditionally arithmetic coding was explained by using probabilities and not frequencies. The
very detailed explanation can be found in Introduction to Arithmetic Coding, where many other
sources are listed in bibliography. From a theoretical point of view usage of probabilities or
frequencies does not make any difference because they can be converted into each other by
multiplying both parts of equations by a constant. However, some interesting properties were
overlooked by researchers because of their exclusive use of probabilities: such as the idea that
arithmetic coding is only a generalized form of changing the base; and a relationship between
entropy and the number of permutations. Here we can show the traditional approach to the
explanation by simply dividing the expression for the LOW limit by the constant n n, or, for our
particular case, by the constant 66. In this case the left-hand side will contain the cumulative
probabilities and probabilities and the right-hand side will be a fraction. The table illustrating the
computation of the LOW limit will also be changed accordingly.

Symbol   Part 1   Part 2                        Total
D        3/6                                    3/6
A        0/6      3/6                           0/6^2
B        1/6      3/6 * 1/6                     3/6^3
D        3/6      3/6 * 1/6 * 2/6               18/6^4
D        3/6      3/6 * 1/6 * 2/6 * 3/6         54/6^5
B        1/6      3/6 * 1/6 * 2/6 * 3/6 * 3/6   54/6^6
                              LOW =             0.53587962962962962962962962962963
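The same LOW value can be computed as an exact fraction, a sketch using Python's `fractions` module to keep the arithmetic exact:

```python
from fractions import Fraction as F

# LOW for DABDDB computed with probabilities: every term of the frequency
# version divided by 6^6.
prob = {"A": F(1, 6), "B": F(2, 6), "D": F(3, 6)}
cum = {"A": F(0, 6), "B": F(1, 6), "D": F(3, 6)}   # cumulative lower bounds

low = F(0)
prod = F(1)                 # running product of earlier symbols' probabilities
for s in "DABDDB":
    low += cum[s] * prod
    prod *= prob[s]

print(low)                  # 463/864, i.e. 25002/46656 = 0.535879629...
```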

In the same way, the HIGH limit is computed as LOW plus the product of all probabilities, and the
reduction of the message is achieved by selecting the shortest fraction within the interval
[LOW, HIGH). It can be proven that the result always belongs to the interval [0, 1). Each step in
the encoding adds a smaller and smaller number to the fraction, which requires the special
numerical treatment known as renormalization. When the concept was derived and explained by
mathematicians and passed to programmers for implementation, the programmers stayed bound to the
probability concept, which resulted in tremendous inconveniences in programming and delayed
reliable implementations by years or, possibly, decades. The domination of the probability concept
can be seen in every patent on arithmetic coding filed before the year 2000. The explanation of
arithmetic coding as mapping a message onto the [0, 1) interval was included in the claims,
leaving those patents open to circumvention by programs that neither compute probabilities nor
deal with renormalization. The frequency concept avoids many computational challenges. In the
same way as probabilities, frequencies are slightly adjusted to convenient numbers: they are
scaled so that the base is a power of 2, turning one multiplication into a binary shift. The long
products of frequencies are computed as a mantissa and an exponent, where the mantissa is
maintained throughout the compression and the exponent results in an additional binary shift.
The final computational part is adding numbers that overlap each other, managing the propagation
of the carry:

47568690
  34598908
    996245

and so on. The frequency concept is thus reduced to computing integers shifted relative to each
other and adding them in a stair-like structure. This also perfectly explains the fractional bit
lengths of an optimal encoding mentioned by Shannon, who showed the possibility of encoding a
particular symbol in −log2 (p) bits even though that number is fractional: the fractional bit
length is achieved by a variable shift computed at every step. While implementation details vary
from one arithmetic coder to another, they all have one thing in common: the limits LOW and
HIGH are computed at every step. The frequency-type approach does not need the computation
of the HIGH limit at all; it is not part of the numerical implementation, and the computational
burden is roughly halved.

2.4 Implementation details for the probability concept
2.4.1 Defining a model

Arithmetic coders produce near-optimal output for a given set of symbols and probabilities (the
optimal value is −log2 (P) bits for each symbol of probability P; see the source coding theorem).
Compression algorithms that use arithmetic coding start by determining a model of the data –
basically a prediction of what patterns will be found in the symbols of the message. The more
accurate this prediction is, the closer to optimal the output will be.

Example: a simple, static model for describing the output of a particular monitoring instrument
over time might be:

   60% chance of symbol NEUTRAL
   20% chance of symbol POSITIVE
   10% chance of symbol NEGATIVE
   10% chance of symbol END-OF-DATA. (The presence of this symbol means that the
stream will be 'internally terminated', as is fairly common in data compression; when this
symbol appears in the data stream, the decoder will know that the entire stream has been
decoded.)

Models can also handle alphabets other than the simple four-symbol set chosen for this example.
More sophisticated models are also possible: higher-order modeling changes its estimation of the
current probability of a symbol based on the symbols that precede it (the context), so that in a
model for English text, for example, the percentage chance of "u" would be much higher when it
follows a "Q" or a "q". Models can even be adaptive, so that they continuously change their
prediction of the data based on what the stream actually contains. The decoder must have the
same model as the encoder.

2.4.2 A simplified example

As an example of how a sequence of symbols is encoded, consider a sequence taken from a set
of three symbols, A, B, and C, each equally likely to occur. Simple block encoding would use 2
bits per symbol, which is wasteful: one of the bit variations is never used.

A more efficient solution is to represent the sequence as a rational number between 0 and 1 in
base 3, where each digit represents a symbol. For example, the sequence "ABBCAB" could
become 0.011201 in base 3. The next step is to encode this ternary number using a fixed-point
binary number of sufficient precision to recover it, such as 0.001011001 in base 2; this is only 9 bits, 25%
smaller than the naive block encoding. This is feasible for long sequences because there are
efficient, in-place algorithms for converting the base of arbitrarily precise numbers.

Finally, knowing the original string had length 6, one can simply convert back to base 3, round
to 6 digits, and recover the string.
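The round trip above can be sketched with exact rational arithmetic. This is a toy illustration, not a production coder; the 9-bit value 0.001011001 comes from the text, and "round to 6 digits" is implemented here as round-to-nearest on the last digit:

```python
from fractions import Fraction

SYMBOLS = "ABC"
seq = "ABBCAB"

# Interpret the sequence as the base-3 fraction 0.011201...
value = sum(Fraction(SYMBOLS.index(s), 3 ** (i + 1)) for i, s in enumerate(seq))
assert value == Fraction(127, 729)

# The 9-bit binary fraction 0.001011001 from the text approximates that value.
approx = Fraction(int("001011001", 2), 2 ** 9)

# Decode: expand back to base 3 and round the last of the 6 digits to nearest.
digits, f = [], approx
for _ in range(len(seq)):
    f *= 3
    digits.append(int(f))
    f -= int(f)
if f >= Fraction(1, 2):        # round-to-nearest on the final digit
    digits[-1] += 1            # (no carry needed in this example)
decoded = "".join(SYMBOLS[d] for d in digits)
print(decoded)  # ABBCAB
```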

2.4.3 Encoding and decoding

In general, each step of the encoding process, except for the very last, is the same; the encoder
has basically just three pieces of data to consider:

   The next symbol that needs to be encoded
   The current interval (at the very start of the encoding process, the interval is set to [0,1),
but that will change)


   The probabilities the model assigns to each of the various symbols that are possible at
this stage (as mentioned earlier, higher-order or adaptive models mean that these
probabilities are not necessarily the same in each step.)

The encoder divides the current interval into sub-intervals, each representing a fraction of the
current interval proportional to the probability of that symbol in the current context. Whichever
interval corresponds to the actual symbol that is next to be encoded becomes the interval used in
the next step.

Example: for the four-symbol model above:

   the interval for NEUTRAL would be [0, 0.6)
   the interval for POSITIVE would be [0.6, 0.8)
   the interval for NEGATIVE would be [0.8, 0.9)
   the interval for END-OF-DATA would be [0.9, 1).

When all symbols have been encoded, the resulting interval unambiguously identifies the
sequence of symbols that produced it. Anyone who has the same final interval and model that is
being used can reconstruct the symbol sequence that must have entered the encoder to result in
that final interval.

It is not necessary to transmit the final interval, however; it is only necessary to transmit one
fraction that lies within that interval. In particular, it is only necessary to transmit enough digits
(in whatever base) of the fraction so that all fractions that begin with those digits fall into the
final interval.
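A minimal sketch of this interval narrowing, using the four-symbol model above with exact fractions (the function and variable names are illustrative, not a standard API):

```python
from fractions import Fraction

# Cumulative intervals for the four-symbol model of section 2.4.1
model = {
    "NEUTRAL":     (Fraction(0),     Fraction(6, 10)),
    "POSITIVE":    (Fraction(6, 10), Fraction(8, 10)),
    "NEGATIVE":    (Fraction(8, 10), Fraction(9, 10)),
    "END-OF-DATA": (Fraction(9, 10), Fraction(1)),
}

def encode(message):
    """Narrow [0, 1) once per symbol; the final interval identifies the message."""
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        sym_low, sym_high = model[sym]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

low, high = encode(["NEUTRAL", "NEGATIVE", "END-OF-DATA"])
print(float(low), float(high))  # 0.534 0.54: any fraction in [0.534, 0.54) works
```

Transmitting any binary fraction inside the final interval (for instance its midpoint, truncated to just enough bits) lets the decoder replay the same subdivisions and recover the message.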

2.5 Precision and renormalization
The above explanations of arithmetic coding contain some simplification. In particular, they are
written as if the encoder first calculated the fractions representing the endpoints of the interval in
full, using infinite precision, and only converted the fraction to its final form at the end of
encoding. Rather than try to simulate infinite precision, most arithmetic coders instead operate at
a fixed limit of precision which they know the decoder will be able to match, and round the
calculated fractions to their nearest equivalents at that precision. An example shows how this
would work if the model called for the interval [0,1) to be divided into thirds, and this was
approximated with 8-bit precision. Note that since the precision is now known, so are the binary
ranges we will be able to use.

Symbol   Probability     Interval reduced to        Interval reduced to            Range in binary
         (as fraction)   eight-bit precision        eight-bit precision
                         (as fractions)             (in binary)

A        1/3             [0, 85/256)                [0.00000000, 0.01010101)       00000000 - 01010100
B        1/3             [85/256, 171/256)          [0.01010101, 0.10101011)       01010101 - 10101010
C        1/3             [171/256, 1)               [0.10101011, 1.00000000)       10101011 - 11111111

A process called renormalization keeps the finite precision from becoming a limit on the total
number of symbols that can be encoded. Whenever the range is reduced to the point where all
values in the range share certain beginning digits, those digits are sent to the output. However
many digits of precision the computer can handle, it is now handling fewer than that, so the
existing digits are shifted left and, at the right, new digits are added to expand the range as
widely as possible. This occurs in two of the three cases from our previous example.

Symbol   Probability   Range                 Digits that can be   Range after
                                             sent to output       renormalization

A        1/3           00000000 - 01010100   0                    00000000 - 10101001
B        1/3           01010101 - 10101010   None                 01010101 - 10101010
C        1/3           10101011 - 11111111   1                    01010110 - 11111111
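The behaviour in this table can be mimicked with a toy 8-bit renormalizer. This is an illustrative sketch only; real coders must also handle the underflow case where the bounds straddle the midpoint without sharing a top bit:

```python
# Toy 8-bit renormalizer: when LOW and HIGH share their top bit, that bit is
# settled, so it is emitted and both bounds are shifted left (LOW refilled
# with 0s, HIGH with 1s). Underflow handling is omitted for brevity.
BITS = 8
TOP = 1 << (BITS - 1)
MASK = (1 << BITS) - 1

def renormalize(low, high, output):
    while (low & TOP) == (high & TOP):
        output.append(1 if low & TOP else 0)
        low = (low << 1) & MASK
        high = ((high << 1) & MASK) | 1
    return low, high

out = []
low_a, high_a = renormalize(0b00000000, 0b01010100, out)  # symbol A: emits 0
low_b, high_b = renormalize(0b01010101, 0b10101010, out)  # symbol B: emits nothing
low_c, high_c = renormalize(0b10101011, 0b11111111, out)  # symbol C: emits 1
print(out, format(low_a, "08b"), format(high_a, "08b"))
```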

2.6 Connections between arithmetic coding and Huffman
coding
There is great similarity between arithmetic coding and Huffman coding – in fact, it has been
shown that Huffman is just a specialized case of arithmetic coding – but because arithmetic
coding translates the entire message into one number represented in base b, rather than
translating each symbol of the message into a series of digits in base b, it will sometimes
approach optimal entropy encoding much more closely than Huffman can.

In fact, a Huffman code corresponds closely to an arithmetic code where each of the frequencies
is rounded to a nearby power of ½ — for this reason Huffman deals relatively poorly with


distributions where symbols have frequencies far from a power of ½, such as 0.75 or 0.375. This
includes most distributions where there is either a small number of symbols (such as just the
bits 0 and 1) or where one or two symbols dominate the rest.

For an alphabet {a, b, c} with equal probabilities of 1/3, Huffman coding may produce the
following code:

   a → 0: 50%
   b → 10: 25%
   c → 11: 25%

This code has an expected length of (1 + 2 + 2)/3 ≈ 1.667 bits per symbol, an inefficiency of
5 percent compared to log2(3) ≈ 1.585 bits per symbol for arithmetic coding.

For an alphabet {0, 1} with probabilities 0.625 and 0.375, Huffman encoding treats them as
though they had 0.5 probability each, assigning 1 bit to each value, which does not achieve any
compression over naive block encoding. Arithmetic coding approaches the optimal compression
ratio of:

−0.625 log2(0.625) − 0.375 log2(0.375) ≈ 0.954 bits per symbol.

When the symbol 0 has a high probability of 0.95, the difference is much greater:

−0.95 log2(0.95) − 0.05 log2(0.05) ≈ 0.286 bits per symbol.
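Both figures are simply the Shannon entropy of the respective symbol distribution, which can be checked numerically:

```python
from math import log2

def entropy(ps):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in ps)

print(round(entropy([0.625, 0.375]), 3))  # 0.954
print(round(entropy([0.95, 0.05]), 3))    # 0.286
```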

One simple way to address this weakness is to concatenate symbols to form a new alphabet in
which each symbol represents a sequence of symbols in the original alphabet. In the above
example, grouping sequences of three symbols before encoding would produce new "super-
symbols" with the following frequencies:

   000: 85.7%
   001, 010, 100:   4.5% each
   011, 101, 110:   .24% each
   111: 0.0125%

With this grouping, Huffman coding averages 1.3 bits for every three symbols, or 0.433 bits per
symbol, compared with one bit per symbol in the original encoding.
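The grouped frequencies quoted above are just products of the per-bit probabilities, as a quick check shows:

```python
from itertools import product

p = {"0": 0.95, "1": 0.05}

# Probability of each 3-bit "super-symbol" is the product of its bit probabilities.
groups = {"".join(bits): p[bits[0]] * p[bits[1]] * p[bits[2]]
          for bits in product("01", repeat=3)}

for word, prob in sorted(groups.items()):
    print(word, f"{prob * 100:.4f}%")
```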

2.7 US patents on arithmetic coding


A variety of specific techniques for arithmetic coding are covered by US patents. Some of these
patents may be essential for implementing the algorithms for arithmetic coding that are specified
in some formal international standards. When this is the case, such patents are generally
available for licensing under what is called "reasonable and non-discriminatory" (RAND)
licensing terms (at least as a matter of standards-committee policy). In some well-known
instances (including some involving IBM patents) such licenses are available free, and in other
instances, licensing fees are required. The availability of licenses under RAND terms does not
necessarily satisfy everyone who might want to use the technology, as what may be "reasonable"
fees for a company preparing a proprietary software product may seem much less reasonable for
a free software or open source project.

At least one significant compression software program, bzip2, deliberately discontinued the use
of arithmetic coding in favor of Huffman coding due to the patent situation. Also, encoders and
decoders of the JPEG file format, which has options for both Huffman encoding and arithmetic
coding, typically only support the Huffman encoding option, due to patent concerns; the result is
that nearly all JPEGs in use today use Huffman encoding.[2]

Some US patents relating to arithmetic coding are listed below.

   U.S. Patent 4,122,440 — (IBM) Filed March 4, 1977, Granted 24 October 1978 (Now
expired)
   U.S. Patent 4,286,256 — (IBM) Granted 25 August 1981 (Now expired)
   U.S. Patent 4,467,317 — (IBM) Granted 21 August 1984 (Now expired)
   U.S. Patent 4,652,856 — (IBM) Granted 4 February 1986 (Now expired)
   U.S. Patent 4,891,643 — (IBM) Filed 15 September 1986, granted 2 January 1990 (Now
expired)
   U.S. Patent 4,905,297 — (IBM) Filed 18 November 1988, granted 27 February 1990
(Now expired)
   U.S. Patent 4,933,883 — (IBM) Filed 3 May 1988, granted 12 June 1990
   U.S. Patent 4,935,882 — (IBM) Filed 20 July 1988, granted 19 June 1990
   U.S. Patent 4,989,000 — Filed 19 June 1989, granted 29 January 1991
   U.S. Patent 5,099,440 — (IBM) Filed 5 January 1990, granted 24 March 1992
   U.S. Patent 5,272,478 — (Ricoh) Filed 17 August 1992, granted 21 December 1993

Note: This list is not exhaustive. See the following link for a list of more patents. [3] The Dirac
codec uses arithmetic coding and is not patent pending.[4]

Patents on arithmetic coding may exist in other jurisdictions; see software patents for a
discussion of the patentability of software around the world.

2.8 Benchmarks and other technical characteristics

Every programmatic implementation of arithmetic encoding has a different compression ratio
and performance. While compression ratios vary only a little (usually under 1%) the code
execution time can vary by a factor of 10. Choosing the right encoder from a list of publicly
available encoders is not a simple task because performance and compression ratio depend also
on the type of data, particularly on the size of the alphabet (number of different symbols). Of
two particular encoders, one may have better performance for small alphabets while the other
shows better performance for large alphabets. Most encoders have limitations on the size of the
alphabet, and many of them are designed for a binary alphabet only (zero and one).


CHAPTER 3

3.0 Huffman Coding
In computer science and information theory, Huffman coding is an entropy encoding algorithm
used for lossless data compression. The term refers to the use of a variable-length code table for
encoding a source symbol (such as a character in a file) where the variable-length code table has
been derived in a particular way based on the estimated probability of occurrence for each
possible value of the source symbol. It was developed by David A. Huffman while he was a
Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of
Minimum-Redundancy Codes".

Huffman coding uses a specific method for choosing the representation for each symbol,
resulting in a prefix code (sometimes called "prefix-free codes", that is, the bit string
representing some particular symbol is never a prefix of the bit string representing any other
symbol) that expresses the most common characters using shorter strings of bits than are used for
less common source symbols. Huffman was able to design the most efficient compression
method of this type: no other mapping of individual source symbols to unique strings of bits will
produce a smaller average output size when the actual symbol frequencies agree with those used
to create the code. A method was later found to do this in linear time if input probabilities (also
known as weights) are sorted.

For a set of symbols with a uniform probability distribution and a number of members which is a
power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII
coding. Huffman coding is such a widespread method for creating prefix codes that the term
"Huffman code" is widely used as a synonym for "prefix code" even when such a code is not
produced by Huffman's algorithm.

Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e. a stream
of unrelated symbols) with a known input probability distribution, it is not optimal when the
symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown,
not identically distributed, or not independent (e.g., "cat" is more common than "cta"). Other
methods such as arithmetic coding and LZW coding often have better compression capability:
both of these methods can combine an arbitrary number of symbols for more efficient coding,
and generally adapt to the actual input statistics, the latter of which is useful when input
probabilities are not precisely known or vary significantly within the stream. However, the
limitations of Huffman coding should not be overstated; it can be used adaptively,
accommodating unknown, changing, or context-dependent probabilities. In the case of known
independent and identically-distributed random variables, combining symbols together reduces
inefficiency in a way that approaches optimality as the number of symbols combined increases.


3.1 History
In 1951, David A. Huffman and his MIT information theory classmates were given the choice of
a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the
problem of finding the most efficient binary code. Huffman, unable to prove any codes were the
most efficient, was about to give up and start studying for the final when he hit upon the idea of
using a frequency-sorted binary tree and quickly proved this method the most efficient.

In doing so, the student outdid his professor, who had worked with information theory inventor
Claude Shannon to develop a similar code. Huffman avoided the major flaw of the suboptimal
Shannon-Fano coding by building the tree from the bottom up instead of from the top down.

3.2 Problem Definition
Informal description

Given
A set of symbols and their weights (usually proportional to probabilities).
Find
A prefix-free binary code (a set of codewords) with minimum expected codeword length
(equivalently, a tree with minimum weighted path length from the root).

Formalized description

Input.
Alphabet A = (a1, a2, ..., an), which is the symbol alphabet of size n.
Set W = (w1, w2, ..., wn), which is the set of the (positive) symbol weights (usually
proportional to probabilities), i.e. wi = weight(ai), 1 ≤ i ≤ n.

Output.
Code C(A, W) = (c1, c2, ..., cn), which is the set of (binary) codewords, where ci is the
codeword for ai, 1 ≤ i ≤ n.

Goal.
Let L(C) = w1·length(c1) + ... + wn·length(cn) be the weighted path length of code C.
Condition: L(C) ≤ L(T) for any code T(A, W).

Samples

Symbols     )                   a       b         c       d       e        sum
Input
(A,W)          Weights     )                   0.10    0.15      0.30    0.16    0.29     =1

Codewords                       000     001       10      01      11

Codeword length (in bits)       3       3         2       2       2

Output C

Weighted path length            0.30    0.45      0.60    0.32    0.58     L(C)=2.25

Probability budget                                                         = 1.00

Information content (in bits)   3.32    2.74      1.74    2.64    1.79
Optimality

Entropy               0.332   0.411 0.521 0.423 0.518            H(A)=2.205


For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the
probability budgets across all symbols is always less than or equal to one. In this example, the
sum is strictly equal to one; as a result, the code is termed a complete code. If this is not the case,
you can always derive an equivalent code by adding extra symbols (with associated null
probabilities), to make the code complete while keeping it biunique.

As defined by Shannon (1948), the information content h (in bits) of each symbol ai with
non-null probability wi is

h(ai) = −log2(wi).

The entropy H (in bits) is the weighted sum, across all symbols ai with non-zero probability wi,
of the information content of each symbol:

H(A) = w1·h(a1) + ... + wn·h(an) = −(w1·log2(w1) + ... + wn·log2(wn)).

(Note: a symbol with zero probability contributes zero to the entropy. When w = 0, w·log2(w) is
an indeterminate form, and by L'Hôpital's rule its limit as w → 0+ is 0. For simplicity, symbols
with zero probability are left out of the formula above.)

As a consequence of Shannon's source coding theorem, the entropy is a measure of the smallest
codeword length that is theoretically possible for the given alphabet with associated weights. In
this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger
than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense
that no other feasible code performs better, but it is very close to the theoretical limit established
by Shannon.
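Both numbers can be reproduced directly from the sample table (a quick check, with symbol names as in the sample):

```python
from math import log2

weights = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
codes   = {"a": "000", "b": "001", "c": "10", "d": "01", "e": "11"}

# Weighted path length L(C) and entropy H(A) as defined above
L = sum(w * len(codes[s]) for s, w in weights.items())
H = -sum(w * log2(w) for w in weights.values())
print(round(L, 2), round(H, 3))  # 2.25 2.205
```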

Note that, in general, a Huffman code need not be unique, but it is always one of the codes
minimizing L(C).

3.2.1 Basic technique


A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. A
binary tree is generated from left to right by taking the two least probable symbols and putting
them together to form another equivalent symbol with a probability equal to the sum of the two.
The process is repeated until there is just one symbol. The tree can then be read backwards, from
right to left, assigning different bits to different branches. The final Huffman code is:
Symbol Code
a1       0
a2       10
a3       110
a4       111
The standard way to represent a signal made of 4 symbols is to use 2 bits/symbol, but the
entropy of the source is about 1.74 bits/symbol. If this Huffman code is used to represent the
signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical
limit because the probabilities of the symbols are different from negative powers of two.
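The bottom-up construction can be sketched with a priority queue. This is an illustrative implementation; the particular 0/1 labelling of branches is arbitrary, so the exact codewords may differ from the table above while the lengths match:

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Repeatedly merge the two least probable nodes, prefixing their codes."""
    tie = count()  # tie-breaker so equal weights never compare the dicts
    heap = [(w, next(tie), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

weights = {"a1": 0.40, "a2": 0.35, "a3": 0.20, "a4": 0.05}
codes = huffman_codes(weights)
avg = sum(weights[s] * len(c) for s, c in codes.items())
print(codes, avg)  # code lengths 1, 2, 3, 3; average length ~1.85 bits/symbol
```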

3.2.2 Example
To see how Huffman coding works, assume that a text file is to be compressed, and that the
characters in the file have the following frequencies:
A:        29
B:        64
C:        32
D:        12
E:         9
F:        66
G:        23
In practice, we need the frequencies for all the characters used in the text, including all letters,
digits, and punctuation, but to keep the example simple we'll just stick to the characters from A
to G.

The first step in building a Huffman code is to order the characters from highest to lowest
frequency of occurrence as follows:

66     64     32     29     23     12     9
F      B      C      A      G      D      E
First, the two least-frequent characters are selected, logically grouped together, and their
frequencies added. In this example, the D and E characters have a combined frequency of 21:
                                   ........
                                   : 21   :
                                   :      :
66     64     32     29     23     12     9
F      B      C      A      G      D      E
This begins the construction of a "binary tree" structure. We now again select the two elements
the lowest frequencies, regarding the D-E combination as a single element. In this case, the two
elements selected are G and the D-E combination. We group them together and add their
frequencies. This new combination has a frequency of 44:
                            ...............
                            :     44      :
                            :             :
                            :      ........
                            :      : 21   :
                            :      :      :
66     64     32     29     23     12     9
F      B      C      A      G      D      E
We continue in the same way to select the two elements with the lowest frequency, group them
together, and add their frequencies, until we run out of elements. In the third iteration, the lowest
frequencies are C and A:
              ........      ...............
              : 61   :      :     44      :
              :      :      :             :
              :      :      :      ........
              :      :      :      : 21   :
              :      :      :      :      :
66     64     32     29     23     12     9
F      B      C      A      G      D      E
The next iterations give:
              .............................
              :            105            :
              :                           :
              ........      ...............
              : 61   :      :     44      :
              :      :      :             :
              :      :      :      ........
              :      :      :      : 21   :
              :      :      :      :      :
66     64     32     29     23     12     9
F      B      C      A      G      D      E


........      .............................
: 130  :      :            105            :
:      :      :                           :
:      :      ........      ...............
:      :      : 61   :      :     44      :
:      :      :      :      :             :
:      :      :      :      :      ........
:      :      :      :      :      : 21   :
:      :      :      :      :      :      :
66     64     32     29     23     12     9
F      B      C      A      G      D      E

...........................................
:                  235                    :
:                                         :
........      .............................
: 130  :      :            105            :
:      :      :                           :
:      :      ........      ...............
:      :      : 61   :      :     44      :
:      :      :      :      :             :
:      :      :      :      :      ........
:      :      :      :      :      : 21   :
:      :      :      :      :      :      :
66     64     32     29     23     12     9
F      B      C      A      G      D      E
The result is known as a "Huffman tree". To obtain the Huffman code itself, each branch of the
tree is labeled with a 1 or 0. It doesn't matter how the 1s and 0s are assigned, though a consistent
scheme obviously is easier to deal with:
...........................................
:0                                       :1
:                                         :
........      .............................
:0     :1     :0                          :1
:      :      :                           :
:      :      ........      ...............
:      :      :0     :1     :0            :1
:      :      :      :      :             :
:      :      :      :      :      ........
:      :      :      :      :      :0     :1
:      :      :      :      :      :      :
F      B      C      A      G      D      E
Tracing down the tree gives the "Huffman codes", with the shortest codes assigned to the
characters with the greatest frequency:


F:       00
B:       01
C:      100
A:      101
G:      110
D:     1110
E:     1111

3.3     Main Properties
A Huffman coder will go through the source text file, convert each character into its
appropriate binary Huffman code, and dump the resulting bits to the output file. The Huffman
codes won't get mixed up in decoding. The best way to see that this is so is to envision the
decoder cycling through the tree structure, guided by the encoded bits it reads, moving from top
to bottom and then back to the top. As long as bits constitute legitimate Huffman codes and a bit
doesn't get scrambled or lost, the decoder will never get lost either.
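The decoder's walk can be sketched as follows, using the codes derived in section 3.2.2. This is a toy sketch; a real decoder walks actual tree nodes rather than accumulating a string, but the behaviour is the same:

```python
# Codes from the example in section 3.2.2
codes = {"F": "00", "B": "01", "C": "100", "A": "101",
         "G": "110", "D": "1110", "E": "1111"}
leaves = {bits: sym for sym, bits in codes.items()}

def decode(bitstream):
    out, path = [], ""
    for bit in bitstream:
        path += bit            # follow one branch per bit
        if path in leaves:     # reached a leaf ...
            out.append(leaves[path])
            path = ""          # ... so return to the root
    return "".join(out)

encoded = "".join(codes[ch] for ch in "FACE")
print(encoded, "->", decode(encoded))  # 001011001111 -> FACE
```

Because the codes are prefix-free, no separator bits are ever needed between codewords.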

Huffman coding is optimal when the probability of each input symbol is a negative power of
two. Prefix codes tend to have slight inefficiency on small alphabets, where probabilities often
fall between these optimal points. "Blocking", or expanding the alphabet size by coalescing
multiple symbols into "words" of fixed or variable-length before Huffman coding, usually helps,
especially when adjacent symbols are correlated (as in the case of natural language text). The
worst case for Huffman coding can happen when the probability of a symbol exceeds 2^(-1) = 0.5,
making the upper limit of inefficiency unbounded. These situations often respond well to a form
of blocking called run-length encoding; for the simple case of Bernoulli processes, Golomb
coding is a provably optimal run-length code.

Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have
seldom been large enough to offset arithmetic coding's higher computational complexity and
patent royalties. (As of July 2006, IBM owns patents on many methods of arithmetic coding in
the US; see US patents on arithmetic coding.)

3.4        Variations

Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and
others of which find optimal prefix codes (while, for example, putting different restrictions on
the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need
not even be polynomial time. An exhaustive list of papers on Huffman coding and its variations
is given by "Code and Parse Trees for Lossless Source Encoding"[1].

3.4.1 n-ary Huffman coding

The n-ary Huffman algorithm uses the {0, 1, ... , n − 1} alphabet to encode messages and build
an n-ary tree. This approach was considered by Huffman in his original paper. The same

algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are
taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of
source words can properly form an n-ary tree for Huffman coding. In this case, additional 0-
probability place holders must be added. This is because the tree must form an n to 1 contractor;
for binary coding, this is a 2 to 1 contractor, and any sized set can form such a contractor. If the
number of source words is congruent to 1 modulo n-1, then the set of source words will form a
proper Huffman tree.
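Under this condition, the number of zero-probability placeholders needed follows from counting how many nodes each merge removes. A small helper (illustrative, not a standard API):

```python
def placeholders_needed(num_symbols, n):
    """Zero-probability placeholders needed for a proper n-ary Huffman tree.

    Each merge replaces n nodes by 1, removing n - 1, so the symbol count
    must be congruent to 1 modulo n - 1 to end with a single root."""
    rem = (num_symbols - 1) % (n - 1)
    return 0 if rem == 0 else (n - 1) - rem

print(placeholders_needed(6, 3), placeholders_needed(7, 3), placeholders_needed(5, 2))  # 1 0 0
```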

3.4.2 Adaptive Huffman coding

A variation called adaptive Huffman coding calculates the probabilities dynamically based on
recent actual frequencies in the source string. This is somewhat related to the LZ family of
algorithms.

3.4.3 Huffman template algorithm

Most often, the weights used in implementations of Huffman coding represent numeric
probabilities, but the algorithm given above does not require this; it requires only a way to order
weights and to add them. The Huffman template algorithm enables one to use any kind of
weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many
combining methods (not just addition). Such algorithms can solve other minimization problems,
such as minimizing max_i [w_i + length(c_i)], a problem first applied to circuit design [2].

3.4.4     Length-limited Huffman coding

This is a variant where the goal is still to achieve a minimum weighted path length, but there is
an additional restriction that the length of each codeword must be less than a given constant. The
package-merge algorithm solves this problem with a simple greedy approach very similar to that
used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a
codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the
presorted and unsorted conventional Huffman problems, respectively.

3.4.5 Huffman coding with unequal letter costs

In the standard Huffman coding problem, it is assumed that each symbol in the set that the code
words are constructed from has an equal cost to transmit: a code word whose length is N digits
will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc.
When working under this assumption, minimizing the total cost of the message and minimizing
the total number of digits are the same thing.

Huffman coding with unequal letter costs is the generalization in which this assumption is no
longer assumed true: the letters of the encoding alphabet may have non-uniform lengths, due to
characteristics of the transmission medium. An example is the encoding alphabet of Morse code,
where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission
time is higher. The goal is still to minimize the weighted average codeword length, but it is no
longer sufficient just to minimize the number of symbols used by the message. No algorithm is
known to solve this in the same manner or with the same efficiency as conventional Huffman
coding.

3.4.6 The canonical Huffman code

If weights corresponding to the alphabetically ordered inputs are in numerical order, the
Huffman code has the same lengths as the optimal alphabetic code, which can be found from
calculating these lengths, rendering Hu-Tucker coding unnecessary. The code resulting from
numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the
code used in practice, due to ease of encoding/decoding. The technique for finding this code is
sometimes called Huffman-Shannon-Fano coding, since it is optimal like Huffman coding, but
alphabetic in weight probability, like Shannon-Fano coding. The Huffman-Shannon-Fano code
corresponding to the example is {000,001,01,10,11}, which, having the same codeword lengths
as the original solution, is also optimal.
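The {000, 001, 01, 10, 11} code above can be regenerated from the codeword lengths alone, which is the point of the canonical form. A sketch, assuming the lengths are supplied in alphabetic order and admit a prefix code (as Huffman lengths always do):

```python
def canonical_codes(pairs):
    """Rebuild codewords from (symbol, length) pairs taken in alphabetic order.

    Each codeword is the previous one plus 1, shifted to match the new length."""
    codes, code, prev_len = {}, 0, None
    for sym, length in pairs:
        if prev_len is not None:
            code += 1
            if length > prev_len:
                code <<= length - prev_len
            elif length < prev_len:
                code >>= prev_len - length
        codes[sym] = format(code, "0{}b".format(length))
        prev_len = length
    return codes

print(canonical_codes([("a", 3), ("b", 3), ("c", 2), ("d", 2), ("e", 2)]))
# {'a': '000', 'b': '001', 'c': '01', 'd': '10', 'e': '11'}
```

Only the lengths need to be stored or transmitted, which is why canonical codes are so convenient for model reconstruction.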

3.4.7 Model reconstruction

Decompression generally requires transmission of information in order to reconstruct the
compression model (methods such as adaptive Huffman do not, although they typically produce
less than optimal code lengths). Originally, symbol frequencies were passed along to the
decompressor, but this method is very inefficient, as it can produce an unacceptable level of
overhead. The most common technique utilizes canonical Huffman encoding, which only
requires Bn bits of information (where B is the number of bits per symbol and n is the size of the
source alphabet). Other methods, such as the "direct transmission" of the Huffman tree,
produce variable-length encoding of the model which can reduce the overhead to just a few
bytes, in some cases.
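The Bn-bit figure follows because, with a canonical code, the model reduces to the list of code lengths, one B-bit field per symbol. A minimal sketch of that serialization (function names are hypothetical):

```python
# Sketch: with canonical Huffman coding, transmitting the model means
# transmitting only the n code lengths, packed into B bits each (B*n bits
# total); the decoder rebuilds the identical codewords from the lengths.

def pack_lengths(lengths, B):
    """Serialize each code length into a fixed B-bit field."""
    return ''.join(format(l, '0{}b'.format(B)) for l in lengths)

def unpack_lengths(bits, B):
    """Recover the list of code lengths from the packed bit string."""
    return [int(bits[i:i + B], 2) for i in range(0, len(bits), B)]

model = pack_lengths([3, 3, 2, 2, 2], B=3)   # 15 bits for n=5, B=3
print(unpack_lengths(model, B=3))            # -> [3, 3, 2, 2, 2]
```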

Because the compressed data can include unused "trailing bits", the decompressor must be able
to determine when to stop producing output. This can be accomplished either by transmitting the
length of the decompressed data along with the compression model or by defining a special code
symbol to signify the end of input (the latter method can adversely affect code length optimality,
however).
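The length-prefix approach can be sketched as follows, reusing the canonical code {000, 001, 01, 10, 11} from section 3.4.6 over a hypothetical five-symbol alphabet: the decoder stops after emitting the declared number of symbols, so any trailing padding bits are ignored.

```python
# Sketch of the length-prefix stopping rule: the symbol count travels with
# the model, and the decoder ignores padding bits after the last symbol.

CODES = {'a': '000', 'b': '001', 'c': '01', 'd': '10', 'e': '11'}
DECODE = {v: k for k, v in CODES.items()}

def compress(message):
    bits = ''.join(CODES[s] for s in message)
    bits += '0' * (-len(bits) % 8)      # pad to a whole number of bytes
    return len(message), bits           # symbol count accompanies the data

def decompress(n_symbols, bits):
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in DECODE:           # prefix code: greedy match is safe
            out.append(DECODE[current])
            current = ''
            if len(out) == n_symbols:   # stop: the rest is padding
                break
    return ''.join(out)

n, payload = compress('decade')
print(decompress(n, payload))           # -> decade
```

The alternative end-of-input symbol would instead be appended to the alphabet with its own (typically long) codeword, which is what can cost a little optimality.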

3.5       Applications

Arithmetic coding can be viewed as a generalization of Huffman coding; indeed, in practice
arithmetic coding is often preceded by Huffman coding, as it is easier to find an arithmetic code
for a binary input than for a nonbinary input. Also, although arithmetic coding offers better
compression performance than Huffman coding, Huffman coding is still in wide use because of
its simplicity, high speed and lack of encumbrance by patents.

Huffman coding today is often used as a "back-end" to some other compression method.
DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end
model and quantization followed by Huffman coding.

5.0 References
   Background story: "Profile: David A. Huffman", Scientific American, September 1991,
pp. 54–58.
   Huffman's original article: D.A. Huffman, "A Method for the Construction of Minimum-
Redundancy Codes", Proceedings of the I.R.E., September 1952, pp. 1098–1102.
   MacKay, David J.C. (September 2003). "Chapter 6: Stream Codes"
(PDF/PostScript/DjVu/LaTeX). Information Theory, Inference, and Learning Algorithms.
Cambridge University Press. ISBN 0-521-64298-1.
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html. Retrieved 2007-12-30.
   Rissanen, Jorma (May 1976). "Generalized Kraft Inequality and Arithmetic Coding"
(PDF). IBM Journal of Research and Development 20 (3): 198–203.
http://domino.watson.ibm.com/tchjr/journalindex.nsf/4ac37cf0bdc4dd6a85256547004d47e1/53fec2e5af172a3185256bfa0067f7a0?OpenDocument. Retrieved 2007-09-21.
   Rissanen, J.J.; Langdon, G.G., Jr. (March 1979). "Arithmetic coding" (PDF). IBM
Journal of Research and Development 23 (2): 149–162.
http://researchweb.watson.ibm.com/journal/rd/232/ibmrd2302G.pdf. Retrieved 2007-09-22.
   Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.
ISBN 0-262-03293-7. Section 16.3, pp. 385–392.
   Witten, Ian H.; Neal, Radford M.; Cleary, John G. (June 1987). "Arithmetic Coding for
Data Compression" (PDF). Communications of the ACM 30 (6): 520–540.
doi:10.1145/214762.214771.
http://www.stanford.edu/class/ee398a/handouts/papers/WittenACM87ArithmCoding.pdf.
Retrieved 2007-09-21.


```