(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 9, December 2010

Modeling Data Transmission through a Channel Based on Huffman Coding and Encryption Methods

Eugène C. Ezin
Institut de Mathématiques et de Sciences Physiques
Unité de Recherche en Informatique et Sciences Appliquées
Université d'Abomey-Calavi, République du Bénin
eugene.ezin@imsp-uac.org

Abstract—Data transmission through a secure channel requires the attention of many researchers. In this paper, on the basis of an alphabet of ciphers and letters, we propose a model for data transmission through a secure channel. This is achieved at two levels. First, we associate each distinct symbol in the message to transmit with a probability; by doing so, we modify the well-known adaptive Huffman coding method. The obtained alphabet is then used to construct the coded message, which passes through a cryptosystem: the original message is coded and encrypted before its delivery. The proposed model is examined, and the difficulty observed in breaking the ciphered message shows that the proposed cryptosystem can be used as a secure channel for data transmission.

Keywords—Data compression, Huffman coding technique, encryption and decryption algorithms.

I. INTRODUCTION

Data transmission occurs when there is a channel between two machines. To exchange a datum, an encoding must be chosen for the transmission signals. This choice basically depends on the physical medium used to transfer the data, the guaranteed data integrity, and the transmission speed. Data transmission can be simple if there are only two machines communicating, or if only a single piece of data is sent. Otherwise, it is necessary to install several transmission lines or to share the line among several different communication actors. Transmitting a message sometimes requires a safe channel to stop an unauthorized person from discovering its content.

So far, many codes have been used to represent letters and messages, among which Morse code, ASCII, and Unicode are the most famous [1]. Variable-length codes, including Huffman codes, are very useful in the data compression field. It is worth noting that Huffman binary codes have the property that no character code is a prefix of any other character code, simply because all the characters occur only at the leaf nodes of the tree.

The request for electronic documents and services is increasing with the widespread use of digital data. Usually, electronic documents go through two separate processes: data compression to achieve low transmission cost, and ciphering to provide security. By using Huffman codes [3] we intend to keep information retrieval system functions such as indexing and searching in the compressed file, which is not so easily possible with adaptive data compression algorithms. Huffman codes have some advantages, namely simplicity, speed, and self-synchronization; the latter means that it is possible to decode symbols starting from the middle of the coded text.

Applications of Huffman coding are pervasive in computer science. This coding scheme is not limited to encoding messages: Huffman coding can be used to compress parts of digital photographs as well as other files such as digital sound files (MP3) and ZIP files. In the case of JPEG files, the main compression scheme uses a discrete cosine transform, and Huffman coding is used as a minor tool in the overall JPEG format. There are of course many other file compression techniques besides Huffman coding, but next to run-length encoding, Huffman coding is one of the simplest forms of file compression. It can be used effectively wherever there is a need for a compact code to represent a long series of a relatively small number of distinct bytes. For instance, in [2] Abhijit et al. present a compression/decompression scheme based on selective Huffman coding for reducing the amount of test data that must be stored on a tester and transferred to each core in a system-on-a-chip during manufacturing test.

Another application of Huffman coding is the reduction of the peak-to-average ratio in orthogonal frequency division multiplexing (OFDM) systems [4]. OFDM is a frequency modulation technique for transmitting large amounts of digital data over a radio wave; it works by splitting the radio signal into multiple smaller sub-signals that are then transmitted simultaneously at different frequencies to the receiver.

Klein et al. [5] analyzed the cryptographic aspects of Huffman codes used to encode a large natural language on CD-ROM and concluded that this problem is NP-complete for several variants of the encoding process [6]. Rivest et al. [7] cryptanalysed a Huffman-encoded text assuming that the cryptanalyst does not know the codebook. According to them, cryptanalysis in this situation is surprisingly difficult, and even impossible in some cases, due to the ambiguity of the resulting encoded data. Data compression algorithms have thus been considered by cryptographers as a ciphering scheme [8].

Online banking is one of the most sensitive tasks performed by general Internet users. Most traditional banks now offer online banking services and strongly encourage customers to do online banking with peace of mind. Although banks strongly advertise an apparent 100% online security guarantee, the fine print typically makes this conditional on users fulfilling certain security requirements. In [9], Mohammad intended to spur a discussion on real-world system security and user responsibilities, in a scenario where everyday users are strongly encouraged to perform critical tasks over the Internet despite the continuing absence of appropriate tools to do so.

In this paper we investigate the construction of a code associated with the 36 alphanumerical symbols. We define a random probability value for every symbol of the considered alphanumerical alphabet; this can be viewed as a dynamic, adaptive Huffman coding procedure. The obtained symbol codebook is used to code the message to transmit through an encryption system. The paper is organized as follows. Section II presents some basic concepts. Section III describes techniques for code construction. Section IV presents the Huffman algorithm. Section V describes the proposed model for data transmission. Section VI presents the conclusion and perspectives.

II. CONCEPTS

In this section, we review some basic concepts that will be used in the paper.

A. Alphabet

An alphabet A is a finite and nonempty set of elements called symbols or letters. Throughout this paper, we will use the terms symbols and letters interchangeably. We are familiar with some alphabets. For instance, the Latin alphabet contains twenty-six symbols from A to Z. The Roman digit alphabet is composed of I, V, X, L, C, D, M. The decimal alphabet contains the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. A binary alphabet contains the symbols 0 and 1.

B. Word

A word is a finite ordered combination of symbols belonging to a given alphabet. A vocabulary is usually constructed on an alphabet for the definition of words with a meaning. The length of a word is the number of symbols it contains, whether they are different or identical. The word without any symbol is called the empty word; it is denoted ε and has length 0 by convention. The set of all words on a given alphabet A is denoted A*.

C. Prefix of a word

Given two words w and u defined on an alphabet A, the word u is a prefix (or left factor) of the word w if there exists a word v defined on the same alphabet such that w = uv.

D. Concatenation operator

Given a couple of words u and v, the concatenation of u and v is the word w obtained by putting the second after the first. In other words, w = u · v, or for simplicity w = uv. Concatenation defines a monoid structure on A*. The words u and v are factors of the word w; in consequence, a word can have many factorizations.

E. Morphism

Given S* and A*, the sets of words on the alphabets S and A respectively, a mapping c from S* into A* is a morphism if it satisfies the following requirements:
• c(ε) = ε;
• ∀ u, v ∈ S*, c(uv) = c(u)c(v).

III. CODING

Given a source alphabet S and a target alphabet T, a coding is an injective (immersion) morphism c from S* into T*, that is, one that satisfies

∀ u, v ∈ S*, u ≠ v ⟹ c(u) ≠ c(v).

A language is then a code if no word has two factorizations. Given a language L, one can use the Sardinas–Patterson algorithm to determine whether L is a code. Before presenting the steps of this algorithm, let us first define the residual of a language and the quotient language.

A. Residual language

Let L ⊂ A* be a language and u ∈ A* a word. The left-residual language of L by u is the language denoted u⁻¹L and defined by

u⁻¹L = {v ∈ A* | ∃ w ∈ L such that w = uv}.

To put it briefly, the left-residual of a language L by u is composed of the suffixes of the words of L from which the prefix u has been deleted.

B. Quotient language

Let L and M be two languages. The left-quotient language of L by M is the language M⁻¹L defined by

M⁻¹L = ∪_{u ∈ M} u⁻¹L.

In other words, M⁻¹L is the union of all left-residuals of L by words belonging to M. Thus M⁻¹L is the set of all suffixes of words in L that have a word belonging to M as a prefix.

C. Sardinas–Patterson algorithm

The Sardinas–Patterson algorithm is used to verify whether a given language is a code [10]. Given a language L on an alphabet A, one defines the following sets Ln:

L0 = L
L1 = L⁻¹L − {ε}                                      (1)
Ln = L⁻¹L_{n−1} ∪ L_{n−1}⁻¹L,  ∀ n ≥ 2

The process ends when one encounters an Li already calculated, or when ε belongs to some Li. If there exists an Li such that ε ∈ Li, then L is not a code; otherwise L is a code.
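For a finite language, the sets Ln above can be computed directly. The following is our own minimal sketch, not taken from the paper; the helper names lquot and is_code are ours:

```python
def lquot(m, l):
    """Left quotient M^{-1}L = { v : u v is in L for some u in M }."""
    return {w[len(u):] for u in m for w in l if w.startswith(u)}

def is_code(lang):
    """Sardinas-Patterson test: True iff `lang` is uniquely decodable."""
    lang = set(lang)
    ln = lquot(lang, lang) - {""}        # L1 = L^{-1}L minus the empty word
    seen = set()
    while ln and frozenset(ln) not in seen:
        if "" in ln:                     # empty word in some Ln => not a code
            return False
        seen.add(frozenset(ln))
        ln = lquot(lang, ln) | lquot(ln, lang)   # Ln for n >= 2
    return True                          # Ln repeated or became empty => code
```

For example, {0, 10, 11} is a code, while {0, 01, 10} is not, since the word 010 factorizes both as 0|10 and as 01|0.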
As said above, the Sardinas–Patterson algorithm is a useful tool to check whether a given language is a code. A generalization of the Sardinas and Patterson characterization to z-codes can be found in [11]. However, a major shortcoming of this algorithm is that it says nothing about the optimality of the code. The following section presents the Huffman algorithm for constructing an optimal code.

IV. HUFFMAN CODING

D. A. Huffman introduced his algorithm in 1951 [3] when solving the problem of finding the most efficient binary code, a problem assigned by R. M. Fano. Indeed, Shannon [12] and Fano [13] had developed coding procedures for the purpose of proving that the average number of binary digits required per message approaches the average amount of information per message; their coding procedures, however, are not optimal. The Huffman (or variable-length coding) method is a lossless data compression algorithm based on the fact that some symbols have a higher frequency of occurrence than others. These symbols are encoded in fewer bits than with fixed-length coding, producing on average a higher compression. The idea behind Huffman coding is to use shorter bit patterns for the more frequent symbols. To achieve that goal, the Huffman algorithm needs the probability of occurrence of each symbol. The symbols are stored in nodes, which are sorted in ascending order of their probability values.

The algorithm selects the two nodes (children) with the smallest probabilities and constructs a new node (parent) with a probability value equal to the sum of the probabilities of its children. This process ends when the node with probability equal to 1 is constructed. The result of this algorithm is a binary tree data structure whose root is the node with probability equal to 1 and whose leaves represent the original symbols. The binary tree is then traversed down to a leaf whenever a certain symbol is needed. Moving to a right child inserts a 1 in the bit pattern of the symbol, while moving to a left child inserts a 0. The result is a unique bit pattern for each symbol that can be uniquely decoded, due to the fact that no code word is a prefix of any longer code word.

A. Optimal code

The concept of optimal code was introduced in order to reduce the space occupied by coded data. The definition of an optimal code involves the average length of a code. Let S be a source of n symbols with frequency distribution f, and let C be a mapping code from S into A*. The average length of the mapping code C is defined by

n_{f,C} = Σ_{s ∈ S} f(s) × |C(s)|

in which |C(s)| represents the length of the code word C(s). A code C of S is optimal with respect to the distribution f if there does not exist another code K such that n_{f,K} < n_{f,C}.

B. Presentation of the Huffman algorithm

The Huffman algorithm repeatedly selects the two nodes (children) with the smallest probabilities and constructs a new node (parent) whose probability is the sum of the probabilities of its children, until the node with probability equal to 1 is obtained; the leaves of the resulting binary tree represent the original symbols.

Let S be a source of n symbols with frequency distribution f. We can assume that

f(s1) ≥ f(s2) ≥ · · · ≥ f(sn).

Let us consider another source S′ = (S − {s_{n−1}, s_n}) ∪ {s_{n−1,n}} with the following frequencies for its symbols:

f′(s) = f(s) if s ≠ s_{n−1,n}
f′(s_{n−1,n}) = f(s_{n−1}) + f(s_n).

An optimal code for the reduced source S′ then yields an optimal code for S: one bit is appended to the code word of s_{n−1,n} to distinguish s_{n−1} from s_n.

The variant we introduce in the Huffman algorithm is described in subsection V-A. It consists of a dynamic and adaptive procedure that assigns a probability value to each symbol of the alphabet.

V. DATA TRANSMISSION SYSTEM

In this section we describe the proposed coding method based on the Huffman coding algorithm. First we present the technique for obtaining the probability assigned to each symbol.

A. Assignment of Probability Values to Symbols

Consider the alphanumerical alphabet A as a source of 36 symbols. The random values assigned to the symbols can be arranged into a 6 × 6 matrix. The following C-style pseudo-code generates such random values and normalizes them so that they sum to 1 (here rand(0, 1) stands for a uniform random draw in [0, 1]):

    #define ROWS 6
    #define COLS 6

    float val[ROWS][COLS], sum = 0.0f;

    /* draw a uniform random value for each cell and accumulate the total */
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            val[i][j] = rand(0, 1);
            sum = sum + val[i][j];
        }
    }

    /* normalize so that the 36 values form a probability distribution */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            val[i][j] = val[i][j] / sum;

A sample of generated values is given in Table I.
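To make the merging procedure of Section IV concrete, here is a minimal sketch of our own (not the paper's implementation) that builds a codebook from a probability table such as the one generated above; the function name huffman_codebook is ours:

```python
import heapq
from itertools import count

def huffman_codebook(probs):
    """Build {symbol: bit pattern} by repeatedly merging the two
    lowest-probability nodes, as described in Section IV."""
    tick = count()  # tie-breaker so the heap never compares subtrees
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)   # the two smallest probabilities
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (a, b)))  # parent node
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node (left, right)
            walk(node[0], prefix + "0")  # moving to the left child appends a 0
            walk(node[1], prefix + "1")  # moving to the right child appends a 1
        else:
            codes[node] = prefix or "0"  # degenerate one-symbol source
    walk(heap[0][2], "")
    return codes
```

For the distribution {a: 0.5, b: 0.25, c: 0.25}, the most frequent symbol receives a one-bit pattern, the other two receive two bits each, and no code word is a prefix of another.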
We then assigned each probability value to a symbol of the alphabet. Without loss of generality, the result obtained is written in Table II and used to construct the adaptive and dynamic code, based on the Huffman algorithm, given in Table III. Like a Huffman code, it has the prefix property, which states that no character code is a prefix of any other character code; this is due to the fact that all the characters occur only at the leaf nodes of the tree. More information on the theory of codes can be found in [14].

TABLE I
A SAMPLE OF GENERATED VALUES FOR PROBABILITY ASSIGNMENT TO SYMBOLS.

0.0394  0.0135  0.0463  0.0383  0.0328  0.0341
0.0438  0.0264  0.0235  0.0464  0.0366  0.0017
0.0061  0.0463  0.0387  0.0317  0.0259  0.0134
0.0441  0.0466  0.0069  0.0020  0.0190  0.0022
0.0306  0.0076  0.0204  0.0410  0.0317  0.0047
0.0047  0.0469  0.0442  0.0451  0.0083  0.0398

TABLE II
SOURCE SYMBOLS WITH THEIR CORRESPONDING FREQUENCIES.

0: 0.0469   1: 0.0466   2: 0.0464   3: 0.0463
4: 0.0463   5: 0.0451   6: 0.0442   7: 0.0441
8: 0.0438   9: 0.0410   A: 0.0398   B: 0.0394
C: 0.0387   D: 0.0383   E: 0.0366   F: 0.0359
G: 0.0341   H: 0.0328   I: 0.0317   J: 0.0317
K: 0.0306   L: 0.0264   M: 0.0235   N: 0.0204
O: 0.0190   P: 0.0135   Q: 0.0134   R: 0.0083
S: 0.0076   T: 0.0069   U: 0.0061   V: 0.0047
W: 0.0047   X: 0.0022   Y: 0.0020   Z: 0.0017

TABLE III
SOURCE SYMBOLS WITH THEIR CORRESPONDING CODEWORDS.

0: 1010    I: 01110
1: 1011    J: 01111
2: 1100    K: 10000
3: 1101    L: 10010
4: 1110    M: 000010
5: 1111    N: 000011
6: 00000   O: 010010
7: 00010   P: 100011
8: 00011   Q: 100110
9: 00100   R: 0100111
A: 00101   S: 1000100
B: 00110   T: 1000101
C: 00111   U: 1001110
D: 01000   V: 01001100
E: 01010   W: 01001101
F: 01011   X: 10011111
G: 01100   Y: 00111100
H: 01101   Z: 100111101

Fig. 1. Block diagram for transmitting the original message (sender side: identification of the different symbols, random probability value assignment, Huffman codewords, compression of the message to transmit with the Huffman algorithm, then encryption before transmission to the receiver).
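As a quick illustration of our own (helper names are hypothetical), the digit codewords of Table III can be checked for the prefix property and used to encode a numeric message:

```python
# Codewords for the symbols 0-9, copied from Table III.
DIGIT_CODES = {
    "0": "1010", "1": "1011", "2": "1100", "3": "1101", "4": "1110",
    "5": "1111", "6": "00000", "7": "00010", "8": "00011", "9": "00100",
}

def is_prefix_free(codes):
    """True iff no codeword is a prefix of another codeword."""
    ws = list(codes.values())
    return not any(a != b and b.startswith(a) for a in ws for b in ws)

def encode(message, codes):
    """Concatenate the codewords of the message's symbols."""
    return "".join(codes[ch] for ch in message)
```

For instance, encoding the message 2010 concatenates 1100, 1010, 1011 and 1010 into a single bit string.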
B. Presentation of the obtained optimal code

We construct an optimal code by assigning each of the 36 random probability values to a symbol, in the order they appear. Without loss of generality, one can consider the probability values assigned in ascending order with respect to the symbols of the alphabet A. The resulting code obtained when applying the Huffman algorithm is given in Table III.

C. Proposed Data Transmission Channel

Once the original message is formed, the set of distinct symbols in the message is considered and each symbol receives the random probability value defined in subsection V-A. The Huffman algorithm is then applied to compress the data to be transmitted. Next, an encryption method is used to encrypt the message before delivering it to the receiver. The proposed model is shown in Fig. 1.

On the receiver's side, the user needs to know the key of the encryption method and the adaptive codebook constructed from the random probability value assignment. Figure 2 shows the process for decoding the received message.

D. Decoding Huffman Algorithms

While Huffman encoding is a two-pass problem, Huffman decoding can be done in one pass by opening the encoded file and reading the frequency data out of it. It consists of:
• creating the Huffman tree based on that information (the total number of encoded bytes is the frequency at the root of the Huffman tree);
• reading the data out of the file and searching the tree to find the correct character to decode (a 0 bit means go left, a 1 bit means go right in the binary tree).

Much work has been done on constructing Huffman decoding algorithms and evaluating their complexities. In [9] a parallel Huffman decoding algorithm for a concurrent-read, exclusive-write parallel random access memory using N processors is proposed; its time complexity is O(log n). In [15] Pushpa et al. proposed a new approach to Huffman coding, and in [16] a ternary-tree Huffman decoding technique; they used the concept of adaptive Huffman coding based on a ternary tree.
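The bit-by-bit walk described above can be sketched as follows; this is our own illustration, which for brevity matches codeword prefixes against an inverted codebook instead of an explicit tree (equivalent for a prefix-free code):

```python
def decode(bits, codes):
    """Decode a bit string produced with a prefix-free codebook.

    Reading one bit at a time mimics the tree walk: the growing buffer
    is the current path from the root, and matching a full codeword
    corresponds to reaching a leaf, which emits a symbol.
    """
    inverse = {cw: sym for sym, cw in codes.items()}
    out, path = [], ""
    for bit in bits:
        path += bit                 # 0 = go left, 1 = go right
        if path in inverse:         # reached a leaf
            out.append(inverse[path])
            path = ""               # restart from the root
    if path:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)
```

With the Table III codewords for 0, 1 and 2 (1010, 1011 and 1100), the bit string 110010101011 decodes back to the message 201.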
This algorithm's complexity is less than O(log n).

Fig. 2. Block diagram for decoding the received message (receiver side: the received message is decrypted with the decryption method, then decoded with the Huffman decoding method using the alphabet symbols and their codewords to recover the original message).

VI. CONCLUDING AND FURTHER WORKS

In this paper, we proposed a model for data transmission through a channel by introducing the use of a dynamic and adaptive Huffman code with random probability value assignment. The resulting alphabet is used to code the message to transmit; the latter is then encrypted with an encryption algorithm known in cryptography.

In our future works we will implement such a model, trying several cryptographic algorithms in order to evaluate its security level. This future work will include the evaluation of confidentiality, integrity, and authentication.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their review efforts.

REFERENCES

[1] S. Roman, Coding and Information Theory. Springer-Verlag, 1992.
[2] J. Abhijit, "Computer-aided design of integrated circuits and systems," IEEE Transactions, vol. 22, no. 6, 2003.
[3] D. A. Huffman, "A method for the construction of minimum-redundancy codes," in Proc. IRE, 1952, pp. 1098–1101.
[4] A. Ashraf, "Peak-to-average power ratio reduction in OFDM systems using Huffman coding," in Proc. of World Academy of Science, Engineering and Technology, vol. 33.
[5] S. T. Klein et al., "Storing text retrieval systems on CD-ROM: compression and encryption considerations," ACM Transactions on Information Systems, vol. 7, no. 3, 1989.
[6] S. T. Klein and A. S. Fraenkel, Algorithmica 12, 1989.
[7] R. L. Rivest et al., "On breaking a Huffman code," IEEE Transactions on Information Theory, vol. 42, no. 3, 1996.
[8] G. Simmons, Contemporary Cryptology: The Science of Information Integrity. IEEE Press, 1991.
[9] M. Mohammad, "Authentication and securing personal information in an untrusted internet," Ph.D. dissertation, Carleton University, Ottawa, Ontario, 2009.
[10] A. Sardinas and G. W. Patterson, "A necessary and sufficient condition for the unique decomposition of coded messages," Convention Record of the Institute of Radio Engineers, National Convention, Part 8: Information Theory, pp. 104–108, 1953.
[11] M. Madonia et al., "A generalization of the Sardinas and Patterson algorithm to z-codes," Theoretical Computer Science 108, 1992, pp. 251–270.
[12] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, 1948, pp. 379–423.
[13] R. M. Fano, "The transmission of information," Research Laboratory of Electronics, MIT, Cambridge, Massachusetts, Tech. Rep., 1949.
[14] J. Berstel and D. Perrin, Theory of Codes. Academic Press, 2002.
[15] R. S. Pushpa and G. Madhu, "A new approach to Huffman coding," Journal of Computer Science, vol. 4, no. 4, 2010.
[16] R. S. Pushpa and G. Madhu, "Ternary tree and a new Huffman decoding technique," International Journal of Computer Science and Network Security, vol. 10, no. 3, 2010.

AUTHOR PROFILE

Eugène C. Ezin received his Ph.D. degree with the highest level of distinction in 2001, after research work carried out on neural and fuzzy systems for speech applications at the International Institute for Advanced Scientific Studies in Italy. Since 2007 he has been a senior lecturer in computer science. He is a reviewer for the Mexican International Conference on Artificial Intelligence. His research interests include high performance computing, neural networks and fuzzy systems, signal processing, cryptography, and modeling and simulation. He is currently in charge of the master program in computer science and applied sciences at the Institut de Mathématiques et de Sciences Physiques of the Abomey-Calavi University in the Republic of Benin.