									                                                       (IJCSIS) International Journal of Computer Science and Information Security,
                                                       Vol. 8, No. 9, December 2010

 Modeling Data Transmission through a Channel
Based on Huffman Coding and Encryption Methods
Eugène C. Ezin
Institut de Mathématiques et de Sciences Physiques
Unité de Recherche en Informatique et Sciences Appliquées
Université d'Abomey-Calavi, République du Bénin

Abstract—Data transmission through a secure channel has attracted the attention of many researchers. In this paper, on the basis of an alphabet of digits and letters, we propose a model for data transmission through a secure channel. This is achieved at two levels. First, we associate each distinct symbol of the message to transmit with a probability. By doing so, we modify the well-known adaptive Huffman coding method. The obtained alphabet is used to construct the coded message, which is then transmitted through a cryptosystem. The original message is therefore coded and encrypted before delivery. The proposed model is examined. The difficulty observed in breaking the ciphered message shows that the proposed cryptosystem can be used as a secure channel for data transmission.

Keywords—Data compression, Huffman coding technique, encryption and decryption algorithms.

I. INTRODUCTION

Data transmission occurs when there is a channel between two machines. To exchange a datum, an encoding must be chosen for the transmission signals. This choice basically depends on the physical medium used to transfer the data, the guaranteed data integrity and the transmission speed. Data transmission can be simple if there are only two machines communicating, or if only a single piece of data is sent. Otherwise, it is necessary to install several transmission lines or to share the line among several different communication actors. Transmitting a message sometimes requires a safe channel to stop an unauthorized person from discovering its content.

So far, many codes have been used to represent letters and messages, among which Morse code, ASCII and Unicode are the most famous [1]. Variable-length codes, including Huffman codes, are very useful in the field of data compression. It is worth noting that Huffman binary codes have the property that no character code is a prefix of any other character code. This is due to the fact that all the characters occur only at the leaf nodes of the tree.

Applications of Huffman coding are pervasive in computer science. This coding scheme is not limited to encoding messages. Indeed, Huffman coding can be used to compress parts of both digital photographs and other files such as digital sound files (MP3) and ZIP files.

In the case of JPEG files, the main compression scheme uses a discrete cosine transform, but Huffman coding is used as a minor tool in the overall JPEG format. There are of course many other file compression techniques besides Huffman coding, but next to run-length encoding, Huffman coding is one of the simplest forms of file compression. Huffman coding can be used effectively wherever there is a need for a compact code to represent a long series of a relatively small number of distinct bytes.

For instance, in [2], Aljihit et al. present a compression/decompression scheme based on selective Huffman coding for reducing the amount of test data that must be stored on a tester and transferred to each core in a system on a chip during manufacturing test. The demand for electronic documents and services is increasing with the widespread use of digital data. Usually, electronic documents go through two separate processes: data compression to achieve a low transmission cost, and ciphering to provide security. By using Huffman codes [3], we intend to keep information retrieval functions such as indexing and searching available on the compressed file, which is not so easily possible with adaptive data compression algorithms. Huffman codes have some advantages, namely simplicity, speed and auto-synchronization. The last advantage means that it is possible to decode symbols starting from the middle of the coded text.

Another application of Huffman coding is the reduction of the peak-to-average power ratio in orthogonal frequency division multiplexing (OFDM) systems [4]. OFDM is a frequency modulation technique for transmitting large amounts of digital data over a radio wave. It works by splitting the radio signal into multiple smaller sub-signals that are then transmitted simultaneously at different frequencies to the receiver.

Klein et al. [5] analyzed the cryptographic aspects of Huffman codes used to encode a large natural-language text on CD-ROM and concluded that this problem is NP-complete for several variants of the encoding process [6]. Rivest et al. [7] cryptanalysed a Huffman-encoded text assuming that the cryptanalyst does not know the codebook. According to them, cryptanalysis in this situation is surprisingly difficult and even impossible in some cases, due to the ambiguity of the resulting encoded data. Data compression algorithms have been considered by cryptographers as a ciphering scheme [8].

Online banking is one of the most sensitive tasks performed

                                                                                                  ISSN 1947-5500

by general Internet users. Most traditional banks now offer online banking services and strongly encourage customers to bank online with peace of mind. Although banks strongly advertise an apparent 100% online security guarantee, the fine print typically makes this conditional on users fulfilling certain security requirements. In [9], Mohammad intended to spur a discussion on real-world system security and user responsibilities, in a scenario where everyday users are strongly encouraged to perform critical tasks over the Internet despite the continuing absence of appropriate tools to do so.

In this paper, we investigate the construction of a code associated with the 36 alphanumerical symbols. We assign a random probability value to each symbol of the alphanumerical alphabet considered. This can be viewed as a dynamic and adaptive Huffman coding procedure. The resulting codebook is used to code the message, which is then transmitted through an encryption system. The paper is organized as follows. Section II presents some basic concepts. Section III describes techniques for code construction. Section IV presents the Huffman algorithm. Section V describes the proposed model for data transmission. Section VI presents the conclusion and perspectives.

II. CONCEPTS

In this section, we review some basic concepts that will be used in the paper.

A. Alphabet

An alphabet A is a finite and nonempty set of elements called symbols or letters. Throughout this paper, we use the terms symbol and letter interchangeably.

We are familiar with some alphabets. For instance, the Latin alphabet contains twenty-six symbols from A to Z. The Roman numeral alphabet is composed of I, V, X, L, C, D, M. The decimal alphabet contains the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. A binary alphabet contains the symbols 0 and 1.

B. Word

A word is a finite ordered combination of symbols belonging to a given alphabet. A vocabulary is usually constructed on an alphabet to define the words that carry a meaning.

The length of a word is the number of symbols it contains, whether they are different or identical.

A word without any symbol is called the empty word and is denoted $\varepsilon$; its length is equal to 0 by convention.

The set of all words on a given alphabet A is denoted $A^*$.

C. Prefix of a word

Given two words w and u defined on an alphabet A, the word u is a prefix (or left-factor) of the word w if there exists a word v defined on the same alphabet such that $w = uv$.

D. Concatenation operator

Given a pair of words u and v, the concatenation of u and v is the word w obtained by putting the second after the first. In other words, $w = u \cdot v$ or, for simplicity, $w = uv$. Concatenation defines a monoid structure on $A^*$. The words u and v are factors of the word w. Consequently, a word can have many factorizations.

E. Morphism

Given $S^*$ and $A^*$, the sets of words on the alphabets S and A respectively, a mapping c from $S^*$ into $A^*$ is a morphism if it satisfies the following requirements:
   • $c(\varepsilon) = \varepsilon$
   • $\forall u, v \in S^*,\ c(uv) = c(u)c(v)$.

III. CODING

Given a source alphabet S and a target alphabet T, a coding is an injective morphism c from $S^*$ into $T^*$, that is, a morphism that satisfies the following:

$$\forall u, v \in S^*,\quad u \neq v \implies c(u) \neq c(v).$$

A language L is then a code if no word has two distinct factorizations into words of L.

Given a language L, one can use the Sardinas-Patterson algorithm to determine whether L is a code or not. Before presenting the different steps of this algorithm, let us first define the residual of a language and the quotient language.

A. Residual language

Let $L \subset A^*$ be a language and $u \in A^*$ a word. We call left-residual language of L by u the language denoted $u^{-1}L$ and defined by

$$u^{-1}L = \{ v \in A^* \mid \exists w \in L \text{ such that } w = uv \}.$$

To put it briefly, the left-residual of a language L by u is composed of the suffixes of the words of L that start with u, once the prefix u is deleted.

B. Quotient language

Let L and M be two languages. The left-quotient language of L by M is the language $M^{-1}L$ defined by

$$M^{-1}L = \bigcup_{u \in M} u^{-1}L.$$

In other words, $M^{-1}L$ is the union of all left-residuals of L by words belonging to M. So $M^{-1}L$ is the set of all suffixes of words in L that have a word belonging to M as prefix.

C. Sardinas-Patterson algorithm

The Sardinas-Patterson algorithm is used to verify whether a given language is a code [10]. Given a language L on an alphabet A, the following steps can be followed to check whether L is a code.

One has to define the following sets $L_n$:

$$\begin{aligned} L_0 &= L \\ L_1 &= L^{-1}L - \{\varepsilon\} \\ L_n &= L^{-1}L_{n-1} \cup L_{n-1}^{-1}L \quad \forall n \geq 2 \end{aligned} \qquad (1)$$
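As an illustration of the sets defined in (1), the following Python sketch runs the test on a finite language given as a set of non-empty strings. The function names left_quotient and is_code are hypothetical and are not taken from [10]. The stopping rule assumed here is that the language is rejected as soon as the empty word appears in some $L_i$, and accepted once a set $L_i$ repeats or becomes empty.

```python
def left_quotient(M, L):
    """Return M^{-1}L: every suffix v such that u + v is in L for some u in M."""
    return {w[len(u):] for u in M for w in L if w.startswith(u)}


def is_code(L):
    """Sardinas-Patterson test (sketch) for a finite set of non-empty words."""
    L = set(L)
    current = left_quotient(L, L) - {""}      # L_1 = L^{-1}L - {epsilon}
    seen = set()
    while current:
        if "" in current:                     # epsilon in L_i: L is not a code
            return False
        key = frozenset(current)
        if key in seen:                       # L_i was already computed: stop
            return True
        seen.add(key)
        # L_n = L^{-1}L_{n-1}  union  L_{n-1}^{-1}L
        current = left_quotient(L, current) | left_quotient(current, L)
    return True                               # empty set: no ambiguity can arise


print(is_code({"0", "10", "110"}))   # True: a prefix code
print(is_code({"0", "01", "10"}))    # False
```

The second language is rejected because the word 010 factorizes both as 0·10 and as 01·0.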


The process ends if one encounters a set $L_i$ that has already been calculated or if $\varepsilon$ belongs to $L_i$.

If there exists $L_i$ such that $\varepsilon \in L_i$, then L is not a code. Otherwise, L is a code.

As said above, the Sardinas-Patterson algorithm is a useful tool to check whether a given language is a code or not. A generalization of the Sardinas and Patterson characterization to z-codes can be found in [11]. However, a major shortcoming of this algorithm is that it does not say anything about the optimality of the code. The following section presents the Huffman algorithm for constructing an optimal code.

IV. HUFFMAN CODING

D. Huffman introduced his algorithm [3] in 1951 while solving the problem, assigned by R. M. Fano, of finding the most efficient binary code. Indeed, Shannon [12] and Fano [13] had together developed coding procedures for the purpose of proving that the average number of binary digits required per message approaches the average amount of information per message. Their coding procedures are not optimal. The Huffman (or variable-length coding) method is a lossless data compression algorithm based on the fact that some symbols have a higher frequency of occurrence than others. These symbols are encoded with fewer bits than in a fixed-length coding, which yields a higher compression on average. The idea behind Huffman coding is to use shorter bit patterns for more frequent symbols. To achieve that goal, the Huffman algorithm needs the probability of occurrence of each symbol. The symbols are stored in nodes and then sorted in ascending order of their probability value.

The algorithm selects the two nodes (children) with the smallest probabilities and constructs a new node (parent) with a probability value equal to the sum of the probabilities of its children. This process ends when the node with probability equal to 1 is constructed. The result of this algorithm is a binary tree whose root is the node with probability equal to 1 and whose last nodes (leaves) represent the original symbols. The binary tree is then traversed to reach its leaves when a certain symbol is needed. Moving to a right child inserts a 1 in the bit pattern of the symbol, while moving to a left child inserts a 0. The result is a unique bit pattern for each symbol that can be uniquely decoded. This is due to the fact that no codeword can be a prefix of any longer codeword.

A. Optimal code

The concept of optimal code was introduced in order to reduce the space occupied by coded data. The definition of an optimal code involves the average length of a code.

Let S be a source of n symbols with the frequency distribution f, and let C be a mapping code from S into $A^*$. The average length of the mapping code C is defined by

$$n_{f,C} = \sum_{s \in S} f(s) \times |C(s)|$$

in which $|C(s)|$ represents the length of the codeword $C(s)$ assigned to s.

A code C of S is optimal with respect to the distribution f if there does not exist another code K such that $n_{f,K} < n_{f,C}$.

B. Presentation of the Huffman algorithm

As described above, the Huffman algorithm repeatedly selects the two nodes (children) with the smallest probabilities and replaces them by a new node (parent) whose probability is the sum of the probabilities of its children, until the node with probability equal to 1 is constructed.

Let S be a source of n symbols with the frequency distribution f. We can assume that

$$f(s_1) \geq f(s_2) \geq \cdots \geq f(s_n).$$

Let us consider another source $S' = (S - \{s_{n-1}, s_n\}) \cup \{s_{n-1,n}\}$ with the following frequencies $f'$ for its symbols:

$$\begin{cases} f'(s) = f(s) & \text{if } s \neq s_{n-1,n} \\ f'(s_{n-1,n}) = f(s_{n-1}) + f(s_n) \end{cases}$$

Then the following mapping

$$f' : s \longmapsto \begin{cases} f(s) & \text{if } s \neq s_{n-1,n} \\ f(s_{n-1}) + f(s_n) & \text{if } s = s_{n-1,n} \end{cases}$$

defines an optimal coding set.

The variant we introduce in the Huffman algorithm is described in subsection V-A. It consists of a dynamic and adaptive procedure to assign a probability value to each symbol of the alphabet.

V. DATA TRANSMISSION SYSTEM

In this section, we describe the proposed coding method based on the Huffman coding algorithm. First, we present the technique for obtaining the probability assigned to each symbol.

A. Assignment of Probability Values to Symbols

Consider the alphanumerical alphabet A as a source of 36 symbols. The random values assigned to the symbols can be arranged into a 6 × 6 matrix. The following code generates such random values.

#include <stdlib.h>   /* rand(), RAND_MAX */

#define ROWS 6
#define COLS 6

/* Fill val with 36 random values normalised so that they sum to 1. */
void assign_probabilities(float val[ROWS][COLS])
{
    float sum = 0.0f;

    /* Draw each value uniformly in [0, 1] and accumulate the total. */
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            val[i][j] = (float)rand() / (float)RAND_MAX;
            sum = sum + val[i][j];
        }
    }
    /* Divide by the total to obtain a probability distribution. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            val[i][j] = val[i][j] / sum;
}

A sample of generated values is given in Table I.

We then assigned one probability value to each symbol in the alphabet. Without loss of generality, the result obtained


is written in Table III to construct the adaptive and dynamic code based on the Huffman algorithm. Like a Huffman code, it has the prefix property, which states that no character code is a prefix of any other character code. This is due to the fact that all the characters occur only at the leaf nodes of the tree. More information on the theory of codes can be found in [14].

TABLE I
Sample of generated random probability values for the 36 symbols.

    0.0394   0.0135   0.0463   0.0383   0.0328   0.0341
    0.0438   0.0264   0.0235   0.0464   0.0366   0.0017
    0.0061   0.0463   0.0387   0.0317   0.0259   0.0134
    0.0441   0.0466   0.0069   0.0020   0.0190   0.0022
    0.0306   0.0076   0.0204   0.0410   0.0317   0.0047
    0.0047   0.0469   0.0442   0.0451   0.0083   0.0398

TABLE II

    Symbol  Frequency    Symbol  Frequency    Symbol  Frequency    Symbol  Frequency
      0      0.0469        1      0.0466        2      0.0464        3      0.0463
      4      0.0463        5      0.0451        6      0.0442        7      0.0441
      8      0.0438        9      0.0410        A      0.0398        B      0.0394
      C      0.0387        D      0.0383        E      0.0366        F      0.0359
      G      0.0341        H      0.0328        I      0.0317        J      0.0317
      K      0.0306        L      0.0264        M      0.0235        N      0.0204
      O      0.0190        P      0.0135        Q      0.0134        R      0.0083
      S      0.0076        T      0.0069        U      0.0061        V      0.0047
      W      0.0047        X      0.0022        Y      0.0020        Z      0.0017

B. Presentation of the obtained optimal code

We construct an optimal code by assigning each of the 36 random probability values to a symbol, in the order in which the symbols appear. Without loss of generality, one can consider the probability values to be assigned in descending order with respect to the symbols of the alphabet A, as in Table II. The resulting code, obtained by applying the Huffman algorithm, is given in Table III.

TABLE III

    Symbol  Codeword     Symbol  Codeword
      0     1010           I     01110
      1     1011           J     01111
      2     1100           K     10000
      3     1101           L     10010
      4     1110           M     000010
      5     1111           N     000011
      6     00000          O     010010
      7     00010          P     100011
      8     00011          Q     100110
      9     00100          R     0100111
      A     00101          S     1000100
      B     00110          T     1000101
      C     00111          U     1001110
      D     01000          V     01001100
      E     01010          W     01001101
      F     01011          X     10011111
      G     01100          Y     00111100
      H     01101          Z     100111101

C. Proposed Data Transmission Channel

Once the original message is formed, the set of distinct symbols in the message is considered, and a random probability value is assigned to each symbol as defined in subsection V-A. The Huffman algorithm is then applied to compress the data to be transmitted. Next, an encryption method is used to encrypt the message before delivering it to the receiver. The proposed model is shown in Fig. 1.

Fig. 1. Block diagram for transmitting the original message. (Blocks, from sender to receiver: message to transmit; identification of different symbols; random probability value assignment; Huffman codewords; compressed message with Huffman algorithm; encryption method; transmitted message.)

On the receiver's side, the user needs to know the key of the encryption method and the adaptive codebook constructed from the random probability value assignment. Figure 2 shows the process for decoding the received message.

D. Decoding Huffman Algorithms

While Huffman encoding is a two-pass problem, Huffman decoding can be done in one pass by opening the encoded file and reading the frequency data out of it. It consists of the following steps:

   • create the Huffman tree based on that information (the total number of encoded bytes is the frequency at the root of the Huffman tree);
   • read data out of the file and search the tree to find the correct character to decode (for a binary tree, a 0 bit means go left and a 1 bit means go right).

Much work has been done on constructing Huffman decoding algorithms and evaluating their complexity. In [9], a parallel Huffman decoding algorithm is proposed for a concurrent-read, exclusive-write parallel random access machine using N processors. Its time complexity is $O(\log n)$. In [15], Pushpa et al. proposed a new approach to Huffman coding and, in [16], a ternary tree for a Huffman decoding technique. They used the concept of adaptive Huffman coding based on a ternary tree. The complexity of this algorithm is less than $O(\log n)$.
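The two decoding steps listed above can be sketched in Python as follows. This is a minimal illustration, not the implementation evaluated in the cited works: the codebook is a small subset of the codewords of Table III, the tree is stored as nested dictionaries, and the names build_tree, decode and encode are hypothetical.

```python
# A few codewords copied from Table III (symbol -> bit string).
CODEBOOK = {"0": "1010", "1": "1011", "2": "1100", "A": "00101", "B": "00110"}


def build_tree(codebook):
    """Build a binary tree as nested dicts; each leaf holds the decoded symbol."""
    root = {}
    for symbol, bits in codebook.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})   # create inner nodes on demand
        node[bits[-1]] = symbol               # last bit leads to the leaf
    return root


def decode(bits, root):
    """Walk the tree bit by bit (0 = left, 1 = right); emit a symbol at each leaf."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):             # reached a leaf
            out.append(node)
            node = root                       # restart from the root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


def encode(message, codebook):
    """Concatenate the codewords of the symbols of the message."""
    return "".join(codebook[ch] for ch in message)


bits = encode("2010AB", CODEBOOK)
print(bits)
print(decode(bits, build_tree(CODEBOOK)))
```

Because the selected codewords are prefix-free, the walk never needs to look ahead: the first leaf reached is always the right symbol.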

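For reference, the construction of Section IV-B combined with the random assignment of Section V-A can also be sketched in Python. The sketch below assumes Python's random module as the source of random values and a binary heap for selecting the two least probable nodes; the function names are hypothetical, and the seed is fixed only so that a run can be repeated. It also evaluates the average length $n_{f,C}$ of Section IV-A.

```python
import heapq
import random
import string

SYMBOLS = string.digits + string.ascii_uppercase   # the 36 alphanumerical symbols


def random_probabilities(symbols, rng):
    """Draw one random value per symbol and normalise so the values sum to 1."""
    values = [rng.random() for _ in symbols]
    total = sum(values)
    return {s: v / total for s, v in zip(symbols, values)}


def huffman_code(prob):
    """Merge the two least probable nodes until a single root remains."""
    # Heap entries: (probability, tie-breaker, {symbol: codeword built so far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(prob.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)     # smallest probability
        p1, _, right = heapq.heappop(heap)    # second smallest probability
        merged = {s: "0" + c for s, c in left.items()}          # left child: 0
        merged.update({s: "1" + c for s, c in right.items()})   # right child: 1
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]


def average_length(prob, code):
    """n_{f,C} = sum over s of f(s) * |C(s)|."""
    return sum(prob[s] * len(code[s]) for s in prob)


rng = random.Random(2010)   # fixed seed, used only to make the run repeatable
prob = random_probabilities(SYMBOLS, rng)
code = huffman_code(prob)
for s in SYMBOLS[:5]:
    print(s, round(prob[s], 4), code[s])
print("average length:", round(average_length(prob, code), 4))
```

Since the probabilities are drawn at random, a different seed produces a different codebook, which is the property the proposed model relies on.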

Fig. 2. Block diagram for decoding the received message. (Blocks, on the receiver's side: received message; decryption; codewords; Huffman decoding, using the alphabet symbols and their codewords; message.)

VI. CONCLUSION AND FURTHER WORK

In this paper, we proposed a model for data transmission through a channel by introducing the use of a dynamic and adaptive Huffman code with random probability value assignment. The resulting alphabet is used to code the message to transmit. The coded message is then encrypted with an encryption algorithm known in cryptography.

In our future work, we will implement such a model by trying some algorithms in cryptography in order to evaluate

REFERENCES

[8] G. Simmons, Contemporary Cryptology: The Science of Information Integrity. IEEE Press, 1991.
[9] M. Mohammad, "Authentication and securing personal information in an untrusted internet," Ph.D. dissertation, Carleton University, Ottawa, Ontario, 2009.
[10] A. Sardinas and G. W. Patterson, "A necessary and sufficient condition for the unique decomposition of coded messages," Convention Record of the Institute of Radio Engineers, National Convention, Part 8: Information Theory, pp. 104–108, 1953.
[11] M. Madonia et al., "A generalization of Sardinas and Patterson algorithm to z-code," Theoretical Computer Science, vol. 108, 1992, pp. 251–270.
[12] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, 1948, pp. 398–403.
[13] R. M. Fano, "The transmission of information," Research Laboratory of Electronics, MIT, Cambridge, Massachusetts, Tech. Rep., 1949.
[14] J. Berstel and D. Perrin, Theory of Codes. Academic Press, 2002.
[15] R. S. Pushpa and G. Madhu, "A new approach to Huffman coding," Journal of Computer Science, vol. 4, no. 4, 2010.
[16] R. S. Pushpa and G. Madhu, "Ternary tree and a new Huffman decoding technique," International Journal of Computer Science and Network Security, vol. 10, no. 3, 2010.

AUTHOR PROFILE

Eugène C. Ezin received his Ph.D. degree with the highest level of distinction in 2001, after research work carried out on neural and fuzzy systems for speech applications at the International Institute for Advanced Scientific Studies in Italy.
its security level. This future work will include the evaluation
                                                                                                                 Since 2007, he is a senior lecturer in
of the confidentiality, integrity, and authentification.
                                                                                                                 computer science. He is a reviewer of
                           ACKNOWLEDGMENT                                                  Mexican International Conference on Artificial Intelligence.
                                                                                           His research interests include high performance computing,
  We thank anonymous reviewers for their review efforts.
                                                                                           neural network and fuzzy systems, signal processing, cryptog-
                                 R EFERENCES                                               raphy, modeling and simulation. He is currently in charge of
                                                                                           the master program in computer science and applied sciences
 [1] S. Roman, Coding and Information Theory. Springer-Verlag, 1992.
                                                                                           at the Institut de Math´ matiques et de Sciences Physiques of
 [2] J. Abhijit, “Computer-aided design of integrated circuits and systems,”
     IEEE Transaction, vol. 22, no. 6, 2003.                                               the Abomey-Calavi University in Republic of Benin.
 [3] D. A. Huffman, “A method of maximum and minimum redundancy
     codes,” in Proc. IRE’1952, 1952, pp. 1098–1101.
 [4] A. Ashraf, “Peak-to-average power ratio reduction in ofdm systems us-
     ing huffman coding,” in Proc. of World Academy of Science, Engineering
     and Technology, vol. 33.
 [5] S. T. Klein et al., “Storing text retrieval systems on cdrom: Compression
     and encryption considerations,” in ACM Transactions on Information
     Systems, vol. 7, no. 3, 1989.
 [6] S. T. Klein and A. S. Fraenkel, “Algorithmica 12,” 1989.
 [7] R. L. Rivest et al., “On breaking a huffman code,” in Proc. IEEE
     Transactions on Information Theory, vol. 42, no. 3, 1996.
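
ILLUSTRATIVE SKETCH

   As an illustration of the model summarized in Section VI and of the decoding chain of Fig. 2, the following Python sketch may help. It is not the implementation announced as future work. It rests on several assumptions that do not come from the paper: all function names are hypothetical, the Huffman code is built once (statically) from the random probabilities rather than adaptively, and the encryption step is a toy XOR keystream derived from SHA-256, used only as a placeholder for "an encryption algorithm known in cryptography". The placeholder cipher should not be regarded as secure.

```python
import hashlib
import heapq
import random
from itertools import count


def random_probabilities(symbols, rng):
    """Assign each distinct symbol a random probability; values sum to 1."""
    weights = [rng.random() + 1e-9 for _ in symbols]
    total = sum(weights)
    return {s: w / total for s, w in zip(symbols, weights)}


def huffman_code(probs):
    """Build a binary prefix code from a symbol -> probability mapping."""
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


def encode(message, code):
    """Concatenate the codewords of the message symbols."""
    return "".join(code[ch] for ch in message)


def decode(bits, code):
    """Invert encode(); relies on the prefix property of the code."""
    inverse = {w: s for s, w in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    if word:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


def xor_cipher(bits, key):
    """Placeholder cipher (assumption): XOR with a SHA-256 keystream.

    Applying it twice with the same key restores the input.
    """
    stream, counter = [], 0
    while len(stream) < len(bits):
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            stream.extend((byte >> (7 - i)) & 1 for i in range(8))
        counter += 1
    return "".join(str(int(b) ^ k) for b, k in zip(bits, stream))


# Sender side: random probabilities -> codewords -> coded message -> ciphertext.
message = "HUFFMAN2010"
key = b"shared-key"  # hypothetical shared secret
code = huffman_code(random_probabilities(sorted(set(message)), random.Random(42)))
coded = encode(message, code)
cipher = xor_cipher(coded, key)

# Receiver side (Fig. 2): decryption, then Huffman decoding with the code table.
recovered = decode(xor_cipher(cipher, key), code)
assert recovered == message
```

   In this sketch the receiver needs both the key and the table of alphabet symbols with their codewords, which corresponds to the "Alphabet symbols and their codewords" input of the decoding block diagram. Because the probabilities are drawn at random rather than measured from the message, the code remains uniquely decodable but is in general not of minimum average length.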

                                                                                                                           ISSN 1947-5500
