
RSA Encryption and Decryption using the Redundant Number System on the FPGA

Koji Nakano, Kensuke Kawakami, and Koji Shigemoto
Department of Information Engineering, Hiroshima University
Kagamiyama 1-4-1, Higashi-Hiroshima, JAPAN

Abstract

The main contribution of this paper is to present efficient hardware algorithms for the modulo exponentiation $P^E \bmod M$ used in RSA encryption and decryption, and to implement them on the FPGA. The key ideas to accelerate the modulo exponentiation are to use the Montgomery modulo multiplication on the redundant radix-64K number system in the FPGA, and to use the embedded $18 \times 18$-bit multipliers and embedded 18k-bit block RAMs effectively. Our hardware algorithms for the modulo exponentiation for $R$-bit numbers $P$, $E$, and $M$ run in less than $(2R+4)(R/16+1)$ clock cycles and in expected $(1.5R+4)(R/16+1)$ clock cycles. We have implemented our modulo exponentiation hardware algorithms on Xilinx VirtexII Pro family FPGA XC2VP30-6. The implementation results show that our hardware algorithm for 1024-bit modulo exponentiation runs in less than 2.521ms and in expected 1.892ms.

1 Introduction

It is well known that the addition of two $n$-bit numbers can be done using a ripple carry adder with the cascade of $n$ full adders [4]. The ripple carry adder has a carry chain through all the $n$ full adders. Thus, the delay time to complete the addition is proportional to $n$. The carry look-ahead adder [4, 9], which computes the carry bits using the prefix computation, can reduce the depth of the circuit. Although the delay time is $O(\log n)$, its constant factor is large and the circuit is much more complicated than the ripple carry adder. Hence, the carry look-ahead adder is not often used in actual implementations. On the other hand, redundant number systems can be used to accelerate addition. Using redundant number systems, we can remove long carry chains in the addition. The readers should refer to [9] (Chapter 3) for a comprehensive survey of redundant number systems. In our previous paper [6], we have presented the redundant radix-64K number system that accelerates arithmetic operations.

The Montgomery modulo multiplication is used to speed up the modulo multiplication $X \cdot Y \cdot 2^{-R} \bmod M$ for $R$-bit numbers $X$, $Y$, and $M$. The idea of Montgomery modulo multiplication is not to use direct modulo computation, which is very costly in terms of the computing time and hardware resources. By iterative computation of Montgomery modulo multiplication, the modulo exponentiation $P^E \bmod M$ can be computed, which is a key operation for RSA encryption and decryption [2, 10]. In our previous paper [6], we have presented an efficient implementation of the Montgomery modulo multiplication in an FPGA. The key ideas are to use the 18-bit embedded multipliers in the FPGA and to perform the computation based on the redundant radix-64K number system. The experimental results in [6] show that 1024-bit Montgomery multiplication can be performed in 1.54 μs using 6824 slices and 129 multipliers on a Xilinx Virtex II Family FPGA. However, this implementation needs a lot of multipliers and has a long critical path through multipliers. We have improved this result using 18k-bit block RAMs as look-up tables [11]. The experimental results in [11] show that 1024-bit Montgomery multiplication can be performed in 1.23 μs using 7883 slices, 64 multipliers, and 29 block RAMs.

The main contribution of this paper is to present hardware algorithms for the modulo exponentiation used in RSA encryption and decryption [10] based on our previous work [6, 11]. Both hardware algorithms using the redundant radix-64K number system run in $R/16+1$ clock cycles to complete the Montgomery modulo multiplication $X \cdot Y \cdot 2^{-R} \bmod M$ for $R$-bit numbers $X$, $Y$, and $M$. We have used these hardware algorithms to complete the modulo exponentiation $P^E \bmod M$ for $R$-bit numbers $P$, $E$, and $M$. Our implementations for both hardware algorithms run in less than $(2R+4)(R/16+1)$ clock cycles and in expected $(1.5R+4)(R/16+1)$ clock cycles. Thus, the 1024-bit modulo exponentiation can be done in less than 133380 clock cycles and in expected 100100 clock cycles. We have implemented our modulo exponentiation hardware algorithms on Xilinx VirtexII Pro family FPGA XC2VP30-6. The implementation results show that our hardware algorithm for 1024-bit modulo exponentiation runs in less than 2.521ms and in expected 1.892ms.

A lot of work has been presented for the modulo exponentiation (i.e. RSA encryption and decryption) using the Montgomery modulo multiplication. Blum et al. [3] showed that 1024-bit modulo exponentiation can be done in 11.95ms using 6633 CLBs on XC40150XV-8. Amanor et al. [1] presented a hardware algorithm for the Montgomery modulo multiplication and estimated the running time if it is used for 1024-bit modulo exponentiation. From their estimation, it runs in expected 22.7ms on XCV2000E-6. Thus, our implementation runs more than 10 times faster. Mazzeo et al. [7] have shown that RSA encryption $P^E \bmod M$ for 1024-bit numbers $P$ and $M$, and $E = 2^{16}+1$ can be done in 2.99ms using 2,902 slices on XCV2000E. Garg and Vig [5] have shown RSA encryption for the same instance in 0.167ms using 28,891 slices on a Xilinx Virtex 4 family FPGA. In our implementation, the RSA encryption for $E = 2^{16}+1$ can be done in less than $(16+6)(1024/16+1) = 1430$ clock cycles and in 0.027ms. Consequently, our implementation runs much faster than known implementations.

2 Redundant Radix-$2^r$ Numbers and Arithmetic Operations

In this paper, we use the following notation to represent the consecutive bits in a number. For a number $A$, let $A[i, j]$ ($i \ge j \ge 0$) be the consecutive bits from the $i$-th to the $j$-th bit, where the least significant bit is the 0-th bit. For example, $A[6, 2] = 11100$ for $A = 11110000$.

A $d$-digit redundant radix-$2^r$ number $A$ is a sequence of $(r+2)$-bit numbers $(a_{d-1}, a_{d-2}, \ldots, a_0)$. The value of $A$ is $\sum_{i=0}^{d-1} a_i \cdot 2^{ri}$. For each $a_i$ with $r+2$ bits, we call $a_i[r-1, 0]$ and $a_i[r+1, r]$ the principal bits and the redundant bits, respectively. For example, $(000101, 010011, 111111, 101111)$ is a 4-digit redundant radix-$2^4$ number, where the two leading bits of each digit (underlined in the original) are the redundant bits. If all the redundant bits of this redundant radix-$2^4$ number are zero, it can be converted to the non-redundant radix-$2^4$ number by just removing the redundant bits. Also, non-redundant numbers can be converted to the equivalent redundant numbers by attaching redundant bits 00 to each digit.

From the definition, the value of a $d$-digit redundant radix-$2^r$ number is up to $\sum_{i=0}^{d-1} (2^{r+2}-1) \cdot 2^{ri} = (2^{r+2}-1)(2^{rd}-1)/(2^r-1)$. However, we assume that the valid value of $A$ is up to $2^{rd}-1$. If the value of $A$ is greater than $2^{rd}-1$, it is regarded as overflow. For example, the 4-digit redundant radix-$2^4$ numbers $(010000, 000000, 000000, 000000)$ and $(001101, 110000, 000000, 000000)$ are overflows, because their values are greater than $2^{16}-1$. We assume that, if the resulting value of an operation is a $d$-digit redundant radix-$2^r$ number and it is greater than $2^{rd}-1$, a circuit or a program performing the operation need not guarantee the correct result, due to the overflow error. Clearly, if the redundant bits $a_{d-1}[r+1, r]$ of the most significant digit $a_{d-1}$ of a $d$-digit redundant radix-$2^r$ number $A$ are not zero, then the value of $A$ is an overflow. Note that $A$ can be an overflow even if $a_{d-1}[r+1, r]$ is zero.

In our previous paper [6], we have presented hardware algorithms for various arithmetic operations for redundant radix-$2^r$ numbers. For the reader's benefit, we will review the hardware algorithms for arithmetic operations. The reader should refer to [6] for the details.

2.1 Addition of Redundant Numbers

Let us see the computation of the sum of two redundant numbers. For two 4-digit redundant radix-$2^4$ numbers $A = (000101, 110011, 111111, 101111)$ and $B = (000011, 101111, 011111, 010001)$, their sum $A + B$ can be computed by the position sum as follows (the redundant bits of each digit are written above the digit they are added into):

          11     11     10
        0101   0011   1111   1111
          10     01     01
    +   0011   1111   1111   0001
    ------------------------------
      001101 010110 100001 010000

Clearly, the addition has no block carry. Let us see the addition of two $d$-digit redundant radix-$2^r$ numbers $A$ and $B$. The sum $S = A + B$ can be computed as follows:

    $s_0 = a_0[r-1, 0] + b_0[r-1, 0]$
    $s_i = a_{i-1}[r+1, r] + a_i[r-1, 0] + b_{i-1}[r+1, r] + b_i[r-1, 0]$   $(1 \le i \le d-1)$

Hence, $s_0 < 2^r + 2^r = 2^{r+1}$ and $s_i < 4 + 2^r + 4 + 2^r \le 2^{r+2}$ hold if $r \ge 2$. Thus, $S$ is a correct redundant radix-$2^r$ number.

Let us design a combinational circuit to compute the sum $A + B$. Let ADD$(2, 2, r, r)$ denote an adder circuit that computes the sum of two 2-bit and two $r$-bit integers. Also, let ADD$(a, b, c, d)$ denote the resulting value of the sum of 2-bit numbers $a$ and $b$, and $r$-bit numbers $c$ and $d$. Clearly, $s_0 = \mathrm{ADD}(0, 0, a_0[r-1, 0], b_0[r-1, 0])$ and $s_i = \mathrm{ADD}(a_{i-1}[r+1, r], b_{i-1}[r+1, r], a_i[r-1, 0], b_i[r-1, 0])$ $(1 \le i \le d-1)$. Thus we have,

Lemma 1 The addition of two $d$-digit redundant radix-$2^r$ numbers can be computed using $d$ ADD$(2, 2, r, r)$s without block carries, whenever $r \ge 2$.

2.2 Multiplication of Redundant Numbers

We show that the multiplication of $d$-digit and 1-digit redundant radix-$2^r$ numbers can be computed without block carry. Let $A = (010011, 101001, 010001)$ and $b = (100101)$. The product $A \cdot b$ can be computed using 6-bit $\times$ 6-bit $\to$ 12-bit multiplications as follows.

                      010011 101001 010001
    x                               100101
    ----------------------------------------
                      0010   0111   0101      ($P_0$)
               0101   1110   1101             ($P_1$)
    +   0010   1011   1111                    ($P_2$)
    ----------------------------------------
      000010 010000 011111 010100 000101

Clearly, we do not have block carries. Let us formally confirm that the multiplication of $d$-digit and 1-digit redundant radix-$2^r$ numbers can be computed without block carries. Let $A$ and $b$ be $d$-digit and 1-digit redundant radix-$2^r$ numbers. Also, let $P_i = a_i \cdot b$ ($0 \le i \le d-1$) be the partial multiplication. Since both $a_i$ and $b$ have $r+2$ bits, $P_i$ has $2r+4$ bits. We can compute the product $S = A \cdot b$ as follows.

    $S_0 = P_0[r-1, 0]$
    $S_1 = P_0[2r-1, r] + P_1[r-1, 0]$
    $S_i = P_{i-2}[2r+3, 2r] + P_{i-1}[2r-1, r] + P_i[r-1, 0]$   $(2 \le i \le d-1)$
    $S_d = P_{d-2}[2r+3, 2r] + P_{d-1}[2r-1, r]$
    $S_{d+1} = P_{d-1}[2r+3, 2r]$

Hence, $S_0 < 2^r$, $S_1 < 2^r + 2^r = 2^{r+1}$, $S_d < 2^4 + 2^r$, and $S_{d+1} < 2^4$ hold. Also, if $r \ge 3$ then $S_i < 2^4 + 2^r + 2^r \le 2^{r+2}$ holds. Thus, $S = (S_{d+1}, S_d, \ldots, S_0)$ is a redundant radix-$2^r$ number.

Let MUL$(r+2, r+2)$ and ADD$(4, r, r)$ denote combinational circuits to compute the $(2r+4)$-bit product of two $(r+2)$-bit numbers and the $(r+2)$-bit sum of one 4-bit and two $r$-bit numbers. Each of the partial products $P_{d-1} = a_{d-1} \cdot b$, $P_{d-2} = a_{d-2} \cdot b$, ..., $P_0 = a_0 \cdot b$ can be computed using a MUL$(r+2, r+2)$. After that, each $S_i$ can be computed using an ADD$(4, r, r)$. Thus, we have

Lemma 2 The product of $d$-digit and 1-digit redundant radix-$2^r$ numbers can be computed using $d$ MUL$(r+2, r+2)$s and $d$ ADD$(4, r, r)$s, whenever $r \ge 3$.

Next, to show a circuit to compute the product of two $d$-digit redundant numbers, we will show how to add a $(d+1)$-digit radix-$2^r$ number $C$ to the product $A \cdot b$. More specifically, we will show how to compute $T = A \cdot b + C$. Later, $C$ is used to store interim results of the product sum. Each digit of the sum $T$ can be computed as follows.

    $T_0 = P_0[r-1, 0] + c_0[r-1, 0]$
    $T_1 = P_0[2r-1, r] + P_1[r-1, 0] + c_0[r+1, r] + c_1[r-1, 0]$
    $T_i = P_{i-2}[2r+3, 2r] + P_{i-1}[2r-1, r] + P_i[r-1, 0] + c_{i-1}[r+1, r] + c_i[r-1, 0]$   $(2 \le i \le d-1)$
    $T_d = P_{d-2}[2r+3, 2r] + P_{d-1}[2r-1, r] + c_{d-1}[r+1, r] + c_d[r-1, 0]$
    $T_{d+1} = P_{d-1}[2r+3, 2r] + c_d[r+1, r]$

Clearly, each $T_i$ can be computed using an ADD$(2, 4, r, r, r)$, and the resulting value has no more than $r+2$ bits if $r \ge 4$. Thus, $T$ is a $(d+2)$-digit redundant radix-$2^r$ number and we have,

Lemma 3 For a $d$-digit redundant radix-$2^r$ number $A$, a 1-digit redundant radix-$2^r$ number $b$, and a $(d+1)$-digit redundant radix-$2^r$ number $C$, the product sum $A \cdot b + C$ can be computed using $d$ MUL$(r+2, r+2)$s, $d+2$ ADD$(2, 4, r, r, r)$s, and a $(d+1)(r+2)$-bit register, whenever $r \ge 4$.

Let $T = \mathrm{PS}(A, b, C)$ denote the circuit (or function) for Lemma 3. Using PS$(A, b, C)$ we can compute the product of two $d$-digit redundant radix-$2^r$ numbers $A$ and $B$. Let $A = (a_{d-1}, a_{d-2}, \ldots, a_0)$ and $B = (b_{d-1}, b_{d-2}, \ldots, b_0)$ be two $d$-digit redundant radix-$2^r$ numbers. We will show how to compute the product $P = (P_{2d-1}, P_{2d-2}, \ldots, P_0) = A \cdot B$ using PS$(A, b, C)$. We compute the partial products $A \cdot b_0$, $A \cdot b_1$, ..., $A \cdot b_{d-1}$ in turn. We use $C = (c_d, c_{d-1}, \ldots, c_0)$ to denote registers storing an interim $(d+1)$-digit redundant radix-$2^r$ number. We first compute PS$(A, b_0, 0)$. Then, $P_0$ is the least significant digit PS$(A, b_0, 0)[r+1, 0]$. We store the remaining $d+1$ digits PS$(A, b_0, 0)[(d+2)(r+2)-1, r+2]$ in $C$. After that, we compute PS$(A, b_1, C)$. Clearly, $P_1$ is the least significant digit PS$(A, b_1, C)[r+1, 0]$, and then we store the remaining $d+1$ digits PS$(A, b_1, C)[(d+2)(r+2)-1, r+2]$ in $C$. Continuing similarly, we can obtain the product $P = A \cdot B$. Thus we have,

Lemma 4 For two $d$-digit redundant radix-$2^r$ numbers $A$ and $B$, the product $A \cdot B$ in the redundant radix-$2^r$ representation can be computed in $d$ clock cycles using $d$ MUL$(r+2, r+2)$s, $d+2$ ADD$(2, 4, r, r, r)$s, and a $(d+1)(r+2)$-bit register, whenever $r \ge 4$.

3 Montgomery Modulo Multiplication

In the RSA encryption/decryption, the modulo exponentiation $C = P^E \bmod M$ or $P = C^D \bmod M$ is computed, where $P$ and $C$ are the plain and cypher text, and $(E, M)$ and $(D, M)$ are the encryption and decryption keys. Usually, the number of bits in $P$, $E$, $D$, and $M$ is 1024 or larger. Also, the modulo exponentiation is repeatedly computed for fixed $E$, $D$, and $M$, and various $P$ and $C$. Since the modulo operation is very costly in terms of the computing time and hardware resources, we use Montgomery modulo multiplication [8], which does not use direct modulo operations. In the Montgomery modulo multiplication, three $R$-bit numbers $X$, $Y$, and $M$ are given, and $(X \cdot Y + q \cdot M) \cdot 2^{-R} \bmod M$ is computed, where an integer $q$ is selected such that the least significant $R$ bits of $X \cdot Y + q \cdot M$ are zero. The value of $q$ can be computed as follows. Let $(-M^{-1})$ denote the minimum non-negative number such that $(-M^{-1}) \cdot M \equiv -1$ (or $2^R - 1$) $\pmod{2^R}$. If $M$ is odd, then $(-M^{-1}) < 2^R$ always holds. We can select $q$ such that $q = ((X \cdot Y) \cdot (-M^{-1}))[R-1, 0]$. Then the bits $(X \cdot Y + q \cdot M)[R-1, 0]$ are zero. For such $q$, since $0 \le X, Y < M < 2^R$ and $0 \le q < 2^R$, we can guarantee that $(X \cdot Y + q \cdot M) \cdot 2^{-R} < 2M$. Thus, by subtracting $M$ from $(X \cdot Y + q \cdot M) \cdot 2^{-R}$ if it is not less than $M$, we can obtain $(X \cdot Y + q \cdot M) \cdot 2^{-R} \bmod M$. Since $X \cdot Y + q \cdot M \equiv X \cdot Y \pmod{M}$, we write $(X \cdot Y + q \cdot M) \cdot 2^{-R} \bmod M = X \cdot Y \cdot 2^{-R} \bmod M$.

4 Modulo Exponentiation using Montgomery Modulo Multiplication

Let us see how Montgomery modulo multiplication is used to compute $C = P^E \bmod M$. Since $R$ and $M$ are fixed, we can assume that $2^R \bmod M$ and $2^{2R} \bmod M$ are computed beforehand. Let the binary representation of $E$ be $e_{R-1} e_{R-2} \cdots e_0$. The modulo exponentiation $P^E \bmod M$ can be computed using the following algorithm, which scans the exponent from the most significant bit (the left-to-right binary method):

    1. $C \leftarrow 2^R \bmod M$;
    2. $P \leftarrow P \cdot (2^{2R} \bmod M) \cdot 2^{-R} \bmod M$;
    3. for $i \leftarrow R-1$ downto 0 do
    4. begin
    5.   $C \leftarrow C \cdot C \cdot 2^{-R} \bmod M$;
    6.   if $e_i = 1$ then $C \leftarrow C \cdot P \cdot 2^{-R} \bmod M$;
    7. end
    8. $C \leftarrow C \cdot 1 \cdot 2^{-R} \bmod M$;

The formulas of the form $A \cdot B \cdot 2^{-R} \bmod M$ in lines 2, 5, 6, and 8 (underlined in the original) are computed by the Montgomery modulo multiplication. Let us confirm that $C = P^E \bmod M = P^{e_{R-1} e_{R-2} \cdots e_0} \bmod M$ holds. Clearly, after line 2 the variable $P$ stores $P \cdot 2^R \bmod M$. Let us show that, at the end of the for-loop for $i$, $C = P^{e_{R-1} e_{R-2} \cdots e_i} \cdot 2^R \bmod M$ holds, by induction on $i$. Suppose that $C = P^{e_{R-1} e_{R-2} \cdots e_i} \cdot 2^R \bmod M$ holds at the end of the for-loop for $i$. After executing line 5 of the following for-loop, we have $C = (P^{e_{R-1} e_{R-2} \cdots e_i} \cdot 2^R \bmod M) \cdot (P^{e_{R-1} e_{R-2} \cdots e_i} \cdot 2^R \bmod M) \cdot 2^{-R} \bmod M = P^{e_{R-1} e_{R-2} \cdots e_i 0} \cdot 2^R \bmod M$. If $e_{i-1} = 0$ then the Montgomery modulo multiplication in line 6 is not executed. Hence, $C = P^{e_{R-1} e_{R-2} \cdots e_{i-1}} \cdot 2^R \bmod M$ holds. If $e_{i-1} = 1$ then it is executed, and thus, we have $C = (P^{e_{R-1} e_{R-2} \cdots e_i 0} \cdot 2^R \bmod M) \cdot (P \cdot 2^R \bmod M) \cdot 2^{-R} \bmod M = P^{e_{R-1} e_{R-2} \cdots e_i 1} \cdot 2^R \bmod M = P^{e_{R-1} e_{R-2} \cdots e_{i-1}} \cdot 2^R \bmod M$. This completes the proof of the induction. Thus, after terminating the for-loop, we have $C = P^E \cdot 2^R \bmod M$. Finally, by the Montgomery modulo multiplication in line 8, we have $C = (P^E \cdot 2^R \bmod M) \cdot 1 \cdot 2^{-R} \bmod M = P^E \bmod M$.

In the worst case, $e_i = 1$ for all $i$. If this is the case, the Montgomery modulo multiplication is executed no more than $2R+2$ times. Also, since $e_i = 1$ with probability $1/2$, it is executed $1.5R+2$ times in expectation. Thus we have,

Lemma 5 The modulo exponentiation $P^E \bmod M$ for $R$-bit numbers $P$, $E$, and $M$ can be computed by executing the Montgomery modulo multiplication $2R+2$ times, and $1.5R+2$ times in expectation, if $2^R \bmod M$ and $2^{2R} \bmod M$ are given.

5 Hardware Algorithms for Montgomery Modulo Multiplication

The main purpose of this section is to review hardware algorithms [6, 11] for Montgomery modulo multiplication.

5.1 Block-Carry-Free Implementation of Montgomery Modulo Multiplication

Recall that in the Montgomery modulo multiplication, $R$-bit numbers $X$, $Y$, and $M$ are given. In this subsection, we assume $X$ and $y$ are a $d$-digit redundant radix-$2^r$ number and a 1-digit redundant radix-$2^r$ number, respectively. We will show a circuit to compute the Montgomery modulo multiplication $(X \cdot y + q \cdot M) \cdot 2^{-r}$ for such $X$, $y$, and $M$. We assume that the values of $X$ and $y$ are given to the circuit as inputs, $M$ is fixed, and $(-M^{-1})$ is computed beforehand. This assumption makes sense if Montgomery modulo multiplication is used to compute the modulo exponentiation for RSA encryption and decryption.

Recall that, using the circuit for Lemma 2, $X \cdot y$ can be computed using $d$ MUL$(r+2, r+2)$s and $d$ ADD$(4, r, r)$s. After computing $X \cdot y$, we need to compute $q$ such that the least significant $r$ bits of $X \cdot y + q \cdot M$ are zero. We can compute $q = ((X \cdot y)[r-1, 0] \cdot (-M^{-1})[r-1, 0])[r-1, 0]$ using a MUL$(r, r)$. Once $q$ is obtained, the product $q \cdot M$ is computed using the circuit for Lemma 2. Finally, the sum $X \cdot y + q \cdot M$ is computed by the circuit for Lemma 1. Note that both $X \cdot y$ and $q \cdot M$ are $(d+1)$-digit redundant radix-$2^r$ numbers. However, since the least significant $r$ bits of $X \cdot y + q \cdot M$ are zero, we can omit the addition of the least significant digit. The readers should refer to Figure 1, which illustrates the circuit for Lemma 6.

[Figure 1. Circuit to compute $X \cdot y + q \cdot M$ using multipliers]

To compute the multiplication $X \cdot y$, we can use a circuit for Lemma 2, which uses $d$ MUL$(r+2, r+2)$s and $d$ ADD$(4, r, r)$s. To compute $q$, we use a MUL$(r, r)$. After that, to compute the multiplication $q \cdot M$, we also use a circuit for Lemma 2, and the addition $X \cdot y + q \cdot M$ can be computed using $d$ ADD$(2, 2, r, r)$s by Lemma 1. Therefore, we have,

Lemma 6 [6] Montgomery modulo multiplication $(X \cdot y + q \cdot M) \cdot 2^{-r}$ for $d$-digit $X$ and 1-digit $y$ of redundant radix-$2^r$ representation can be computed using $2d+1$ MUL$(r+2, r+2)$s, $2d$ ADD$(4, r, r)$s, and $d$ ADD$(2, 2, r, r)$s, without block carries, whenever $r \ge 4$.

5.2 Montgomery Modulo Multiplication Using Memory

The circuit for Lemma 6 has a cascade of three multipliers, which can be a long critical path. Also, it needs too many multipliers. We remove the multipliers for computing $q \cdot M$ to improve the circuit for Lemma 6. The key idea is to use a memory to look up the value of $q \cdot M$. Let $f$ be a function such that $f(x) = (x \cdot (-M^{-1}))[r-1, 0] \cdot M$. The function $f$ can be computed using a $2^r$-word $(d+1)r$-bit memory as follows. The value of $f(i)$ ($0 \le i \le 2^r - 1$) is stored in address $i$ of the memory in advance. Then, by reading address $(X \cdot y)[r-1, 0]$ of the memory, we can obtain the value of $f((X \cdot y)[r-1, 0])$ in one clock cycle. Using this memory, $q \cdot M = f((X \cdot y)[r-1, 0])$ can be computed in one clock cycle. After that, the addition $X \cdot y + f((X \cdot y)[r-1, 0])$ can be computed using $d+1$ ADD$(2, 2, r, r)$s from Lemma 1. Figure 2 illustrates the circuit to compute $X \cdot y + f((X \cdot y)[r-1, 0])$.

[Figure 2. Circuit to compute $X \cdot y + f((X \cdot y)[r-1, 0])$ using a memory]

Note that the least significant digit of $X \cdot y + f((X \cdot y)[r-1, 0])$ is always zero. Hence, we can omit the computation of the least significant digit of $f$ and the following addition. Thus, we use a $2^r$-word $dr$-bit memory for computing $f((X \cdot y)[r-1, 0])$ and $d$ ADD$(2, 2, r, r)$s to compute the sum $X \cdot y + f((X \cdot y)[r-1, 0])$. Therefore, we have,

Lemma 7 [11] Montgomery modulo multiplication $(X \cdot y + q \cdot M) \cdot 2^{-r}$ for $d$-digit $X$ and $M$, and 1-digit $y$ of redundant radix-$2^r$ representation can be computed using $d$ MUL$(r+2, r+2)$s, $d+2$ ADD$(2, 4, r, r, r)$s, $d$ ADD$(2, 2, r, r)$s, and a $2^r$-word $dr$-bit memory, without block carries, whenever $r \ge 4$.

If $r = 16$ and $dr = 1024$, then the circuit for Lemma 7 needs a 64K-word 1024-bit memory of size 64M bits. Since the size of block memory of current FPGAs is up to a few megabits, this circuit cannot be implemented in FPGAs.

5.3 Montgomery Modulo Multiplication Using Fewer Memory

We will reduce the size of the memory needed to compute the function $f$. Recall that $q$ is an $r$-bit number such that the least significant $r$ bits of $X \cdot y + q \cdot M$ are zero. Let the $r$-bit number $q$ be partitioned into two $r/2$-bit numbers $q_H = q[r-1, r/2]$ and $q_L = q[r/2-1, 0]$. We can compute the values of $q_H$ and $q_L$ separately as follows. Let $(-M^{-1})$ be the minimum non-negative integer such that $(-M^{-1}) \cdot M \equiv -1 \pmod{2^{r/2}}$. Also, let $t_H = (X \cdot y)[r-1, r/2]$ and $t_L = (X \cdot y)[r/2-1, 0]$. We set $q_L = (t_L \cdot (-M^{-1}))[r/2-1, 0]$. Then, the least significant $r/2$ bits of $X \cdot y + q_L \cdot M$ are zero. Let $g$ be a function such that $g(t_L) = ((q_L \cdot M)[r-1, r/2] + c)[r/2-1, 0]$, where $q_L = (t_L \cdot (-M^{-1}))[r/2-1, 0]$, and $c = 0$ if $t_L = 0$ and $c = 1$ otherwise. Function $g$ can be computed using a combinational circuit with $r/2$ input bits and $r/2$ output bits. We set $q_H = ((t_H + g(t_L))[r/2-1, 0] \cdot (-M^{-1}))[r/2-1, 0]$. Then, the least significant digit of $X \cdot y + q_L \cdot M + q_H \cdot M \cdot 2^{r/2}$ is zero.

We will implement this idea in the same way as Lemma 7. Instead of computing $q_L$ and $q_H$, we compute $q_L \cdot M$ and $q_H \cdot M$ using a memory. Let $h$ be a function such that $h(x) = (x[r/2-1, 0] \cdot (-M^{-1}))[r/2-1, 0] \cdot M$. Similarly to $f$, function $h$ can be computed using a $2^{r/2}$-word $(dr + r/2)$-bit memory. Then, $q_L \cdot M = h(t_L)$ and $q_H \cdot M = h(t_H + g(t_L))$ hold. Thus, $X \cdot y + q \cdot M = X \cdot y + h(t_L) + h(t_H + g(t_L)) \cdot 2^{r/2}$. The readers should refer to Figure 3, which illustrates the circuit to compute $X \cdot y + h(t_L) + h(t_H + g(t_L)) \cdot 2^{r/2}$.

[Figure 3. Circuit to compute $X \cdot y + h(t_L) + h(t_H + g(t_L)) \cdot 2^{r/2}$]

Since an FPGA has dual-port memories, the two modules to compute $h$ in Figure 3 can be implemented by a single $2^{r/2}$-word $(dr + r/2)$-bit dual-port memory accessed at the same time. The readers may think that a combinational circuit to compute $g$ is not necessary. However, the block RAMs used to implement a memory in most FPGAs support only synchronous read. Thus, one clock cycle is necessary to read a memory. It follows that, if we use a memory to implement the computation of $g$, two clock cycles are necessary to compute $h(t_H + g(t_L))$.

Let us evaluate the hardware resources necessary to compute $X \cdot y + h(t_L) + h(t_H + g(t_L)) \cdot 2^{r/2}$. The multiplication $X \cdot y$ can be computed using $d$ MUL$(r+2, r+2)$s and $d$ ADD$(4, r, r)$s from Lemma 2. Function $g(t_L)$ can be computed using a combinational circuit with $r/2$ input bits and $r/2$ output bits (8 input bits and 8 output bits for $r = 16$), and the addition $t_H + g(t_L)$ can be computed using an ADD$(r/2, r/2)$. After that, the value of function $h$ for the two arguments can be computed using a $2^{r/2}$-word $dr$-bit dual-port memory. Finally, the sum $X \cdot y + h(t_L) + h(t_H + g(t_L)) \cdot 2^{r/2}$ can be computed using $d$ ADD$(2, 2, 2, r, r, r)$s by a straightforward generalization of Lemma 1. Consequently, we have,

Lemma 8 [11] Montgomery modulo multiplication $(X \cdot y + q \cdot M) \cdot 2^{-r}$ for $d$-digit $X$ and $M$, and 1-digit $y$ of redundant radix-$2^r$ representation can be computed using $d$ MUL$(r+2, r+2)$s, $d$ ADD$(4, r, r)$s, one ADD$(r/2, r/2)$, $d$ ADD$(2, 2, 2, r, r, r)$s, a $2^{r/2}$-word $dr$-bit dual-port memory, and a combinational circuit with $r/2$-bit input and $r/2$-bit output, without block carries, whenever $r \ge 4$.

5.4 Montgomery Modulo Multiplication for Two $d$-digit Numbers

Recall that a hardware algorithm for the product of $d$-digit and 1-digit radix-$2^r$ numbers is shown for Lemma 2. Using this hardware algorithm iteratively, the product of two $d$-digit radix-$2^r$ numbers can be computed as shown in Lemma 4. In the previous subsections, hardware algorithms for the Montgomery modulo multiplication $(X \cdot y + q \cdot M) \cdot 2^{-r}$ for $d$-digit $X$ and $M$, and 1-digit $y$ are shown for Lemmas 6 and 8. Using the same technique, we can obtain hardware algorithms for $d$-digit numbers. More specifically, from Lemma 6, we have,

Theorem 9 Montgomery modulo multiplication $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$ for three $d$-digit redundant radix-$2^r$ numbers $X$, $Y$ and $M$ can be computed in $d$ clock cycles using $2d+1$ MUL$(r+2, r+2)$s, $2d$ ADD$(4, r, r)$s, $d$ ADD$(2, 2, 2, r, r, r)$s, and a $(d+1)(r+2)$-bit register, without block carries, whenever $r \ge 4$.

Further, from Lemma 8, we have,

Theorem 10 Montgomery modulo multiplication $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$ for three $d$-digit redundant radix-$2^r$ numbers $X$, $Y$ and $M$ can be computed in $d$ clock cycles using $d$ MUL$(r+2, r+2)$s, $d+2$ ADD$(2, 4, r, r, r)$s, $d$ ADD$(2, 2, 2, r, r, r)$s, a $2^{r/2}$-word $dr$-bit dual-port memory, a combinational circuit with $r/2$-bit input and $r/2$-bit output, and a $(d+1)(r+2)$-bit register, without block carries, whenever $r \ge 4$.

The readers should refer to [6, 11] for the details.

6 Modulo Exponentiation on the FPGA

The hardware algorithms for the Montgomery modulo multiplication shown in Section 5 compute $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$. Recall that $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$ can be larger than $M$. Thus, we need to subtract $M$ if it is no less than $M$ to obtain $X \cdot Y \cdot 2^{-dr} \bmod M$. In other words, we need to check if $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$ is less than $M$. Unfortunately, the comparison of two redundant radix-$2^r$ numbers is not obvious. To perform the comparison, we need to convert them into non-redundant numbers, and the conversion is very costly. Therefore, we do not check if $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$ is less than $M$. Alternatively, we check if the redundant bits of the most significant digit are not zero. If this is the case, we add $-M$ to $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$. Since the redundant bits of the most significant digit are either 00 or 01, we can guarantee that, after the addition, they are 00. Note that $-M$ may not be added even if $(X \cdot Y + q \cdot M) \cdot 2^{-dr}$ is no less than $M$. However, since we can guarantee that the redundant bits of the most significant digit are 00, we can avoid the overflow. Thus, the Montgomery modulo multiplication hardware algorithms for Theorems 9 and 10 can be modified to obtain the resulting value with the redundant bits of the most significant digit being 00 in $d+1$ clock cycles.

Suppose that these modified circuits for Montgomery modulo multiplication are used to compute the modulo exponentiation based on the algorithm in Section 4. At the end of the algorithm the value $C$ is stored as a $d$-digit redundant radix-$2^r$ number. We first convert it to the $d$-digit non-redundant radix-$2^r$ number. For this purpose, we add zero $d$ times. After that, all the redundant bits are zero. Note that the resulting value can be no less than $M$. Hence we check if it is no less than $M$. If this is the case, we add $-M$ and perform the $d$ iterations of adding zero again to convert it to a non-redundant number. If $M \ge 2^{dr-1}$, we can guarantee that the value thus obtained is less than $M$. In this way, we can obtain $P^E \bmod M$ in the non-redundant number system.

Let us evaluate the clock cycles to obtain $P^E \bmod M$ using Theorems 9 or 10. The Montgomery modulo multiplication takes $d+1$ clock cycles, and it is executed at most $2R+2$ times from Lemma 5. After that, the result is converted to a non-redundant number in $d$ clock cycles. If the resulting value is no less than $M$, $-M$ is added and then zero is added $d$ times to it, in $d+1$ clock cycles in total. Thus, the modulo exponentiation can be computed in $(2R+2)(d+1) + 2d + 1 < (2R+4)(R/r+1)$ clock cycles and in expected $(1.5R+4)(R/r+1)$ clock cycles.

Theorem 11 The modulo exponentiation $P^E \bmod M$ for $R$-bit numbers can be computed using the hardware algorithms for Theorems 9 or 10 in less than $(2R+4)(R/r+1)$ clock cycles.

If we use non-redundant numbers, the conversion from the redundant numbers is not necessary. If this is the case, the modulo exponentiation can be computed in $(2R+2)(R/r+1)$ clock cycles and in expected $(1.5R+2)(R/r+1)$ clock cycles.

7 Experimental Results

We have implemented our hardware algorithms for the modulo exponentiation on Virtex II Pro Family FPGA XC2VP30-6, which has 13,696 slices, 136 $18 \times 18$-bit multipliers, and 136 18k-bit dual-port block RAMs. We have used XST in ISE Foundation 10.1i for logic synthesis and analysis. Since this FPGA has 18-bit multipliers as building blocks, it makes sense to let $r = 16$. Thus, we use the redundant radix-64K (i.e. radix-$2^{16}$) number system.

Table 1 shows the experimental results of the modulo exponentiation shown in Theorem 9 and Theorem 11, which use $18 \times 18$-bit multipliers. Table 2 shows those for Theorem 10 and Theorem 11, which use $18 \times 18$-bit multipliers and 18k-bit block RAMs. In both tables, the performance is evaluated for $d$-digit redundant radix-64K numbers and non-redundant numbers. Clearly, the clock frequency for redundant numbers is almost constant, while it decreases as the number of bits increases for non-redundant numbers. For example, in Table 2, the 1024-bit modulo exponentiation runs at 52.9MHz for redundant numbers and at 18.46MHz for non-redundant numbers. Hence, the total computing time for redundant numbers is 2.521ms, while that for non-redundant numbers is 7.218ms. Thus, we have achieved a speedup factor of 2.86 using the redundant number system.

Table 1. Modulo exponentiation using 18-bit multipliers (Theorems 9 and 11)

    bits                           64      128      256      512     1024
    redundant
      clock (MHz)               40.80    41.50    40.38    40.30    40.21
      clock cycles (worst)        660     2340     8772    33924   133380
      time (ms, worst)          0.016    0.056    0.217    0.841    3.317
      clock cycles (expected)     500     1764     6596    25476   100100
      time (ms, expected)       0.012    0.042    0.161    0.632    2.489
      slices                      865     1708     2905     5811    13467
      multipliers                   9       17       33       65      129
    non-redundant
      clock (MHz)               47.43    41.75    33.61    25.12    16.38
      clock cycles (worst)        650     2322     8738    33858   133250
      time (ms, worst)          0.013    0.055    0.259    1.347    8.134
      clock cycles (expected)     480     1746     6562    25344    99970
      time (ms, expected)       0.010    0.041    0.195    1.008    6.103
      slices                      530      974     1971     3909     7549
      multipliers                   9       17       31       63      123

Table 2. Modulo exponentiation using 18-bit multipliers and 18k-bit block RAMs (Theorems 10 and 11)

    bits                           64      128      256      512     1024
    redundant
      clock (MHz)               53.20    52.51    52.77    52.57    52.90
      clock cycles (worst)        660     2340     8772    33924   133380
      time (ms, worst)          0.012    0.044    0.166    0.645    2.521
      clock cycles (expected)     500     1764     6596    25476   100100
      time (ms, expected)       0.009    0.033    0.124    0.484    1.892
      slices                      896     1652     3204     5868    11589
      multipliers                   4        8       16       32       64
      block RAMs                    2        4        8       15       29
    non-redundant
      clock (MHz)               68.06    58.83    44.47    30.40    18.46
      clock cycles (worst)        650     2322     8738    33858   133250
      time (ms, worst)          0.009    0.039    0.196    1.113    7.218
      clock cycles (expected)     480     1746     6562    25344    99970
      time (ms, expected)       0.007    0.029    0.147    0.833    5.415
      slices                      613     1086     2034     3911     7708
      multipliers                   4        8       15       31       61
      block RAMs                    2        4        8       15       29

8 Conclusion

We have presented hardware algorithms for the modulo exponentiation used in RSA encryption and decryption. The best algorithm runs in less than 2.521ms and in expected 1.892ms to compute $P^E \bmod M$ for 1024-bit numbers $P$, $E$, and $M$ on Xilinx Virtex II Pro Family FPGA XC2VP30-6. It also runs in 0.027ms if $E = 2^{16}+1$. Our hardware algorithms run faster than previously presented algorithms.

References

[1] D. N. Amanor, C. Paar, J. Pelzl, V. Bunimov, and M. Schimmler. Efficient hardware architectures for modular multiplication on FPGAs. In Proc. of International Conference on Field Programmable Logic and Applications, pages 539–542, 2005.
[2] T. Blum and C. Paar. High-radix Montgomery modular exponentiation on reconfigurable hardware. IEEE Trans. on Computers, 50(7):759–764, 2001.
[3] T. Blum and C. Paar. High-radix Montgomery modular exponentiation on reconfigurable hardware. IEEE Transactions on Computers, 50(7):759–764, 2001.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[5] R. Garg and R. Vig. An efficient Montgomery multiplication algorithm and RSA cryptographic processor. In Proc. of International Conference on Computational Intelligence and Multimedia Applications, pages 188–195, 2007.
[6] K. Kawakami, K. Shigemoto, and K. Nakano. Redundant radix-$2^r$ number system for accelerating arithmetic operations on the FPGAs. In Proc. of International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pages 370–377, 2008.
[7] A. Mazzeo, L. Romano, G. P. Saggese, and N. Mazzocca. FPGA-based implementation of a serial RSA processor. In Proc. of Design, Automation and Test in Europe Conference and Exhibition, 2003.
[8] P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, 1985.
[9] B. Parhami. Computer Arithmetic - Algorithm and Hardware Designs. Oxford University Press, 2000.
[10] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21:120–126, 1978.
[11] K. Shigemoto, K. Kawakami, and K. Nakano. Accelerating Montgomery modulo multiplication for redundant radix-64K number system on the FPGA using dual-port block RAMs. In Proc. of International Conference on Embedded and Ubiquitous Computing (EUC), pages 44–51, 2008.

Authorized licensed use limited to: Bharat University. Downloaded on September 1, 2009 at 10:32 from IEEE Xplore. Restrictions apply.
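The exponentiation procedure of Section 4 and the cycle bounds of Section 6 can be checked with a small software model. The sketch below is only an illustration on ordinary Python integers, not the hardware design: the function names (neg_inv_mod_pow2, mont_mul, mont_pow, worst_case_cycles, expected_cycles) are hypothetical, it assumes Python 3.8 or later for the modular inverse via pow, and it assumes M is odd with P < M < 2^R.

```python
def neg_inv_mod_pow2(m, bits):
    """Return (-m^{-1}) mod 2**bits for an odd modulus m (hypothetical helper)."""
    return (-pow(m, -1, 1 << bits)) % (1 << bits)


def mont_mul(x, y, m, r_bits, neg_m_inv):
    """Return x*y*2^(-R) mod m without any division by m (Section 3)."""
    mask = (1 << r_bits) - 1
    t = x * y
    q = ((t & mask) * neg_m_inv) & mask   # makes the low R bits of t + q*m zero
    u = (t + q * m) >> r_bits             # exact shift, u < 2m
    return u - m if u >= m else u         # at most one subtraction is needed


def mont_pow(p, e, m, r_bits):
    """Follow lines 1-8 of the Section 4 algorithm.

    Returns (p**e mod m, number of Montgomery multiplications executed).
    """
    neg_m_inv = neg_inv_mod_pow2(m, r_bits)
    c = (1 << r_bits) % m                 # line 1: 2^R mod M (precomputed)
    r2 = (1 << (2 * r_bits)) % m          # 2^(2R) mod M (precomputed)
    count = 0
    p = mont_mul(p, r2, m, r_bits, neg_m_inv)      # line 2
    count += 1
    for i in range(r_bits - 1, -1, -1):            # line 3
        c = mont_mul(c, c, m, r_bits, neg_m_inv)   # line 5
        count += 1
        if (e >> i) & 1:
            c = mont_mul(c, p, m, r_bits, neg_m_inv)  # line 6
            count += 1
    c = mont_mul(c, 1, m, r_bits, neg_m_inv)       # line 8
    count += 1
    return c, count


def worst_case_cycles(r_bits, r=16):
    """Upper bound (2R+4)(R/r+1) stated in Theorem 11."""
    return (2 * r_bits + 4) * (r_bits // r + 1)


def expected_cycles(r_bits, r=16):
    """Expected count (1.5R+4)(R/r+1) stated in Section 6."""
    return (3 * r_bits // 2 + 4) * (r_bits // r + 1)
```

For R = 1024 the two cycle formulas give 133380 and 100100, which are the figures listed in Tables 1 and 2. The multiplication count returned by mont_pow is R + 2 plus the number of 1 bits in E, so it never exceeds the 2R + 2 of Lemma 5.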

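The block-carry-free addition of Section 2.1 and the digit-by-number multiplication of Section 2.2 can likewise be modelled in a few lines. This is a minimal sketch under stated assumptions: digits are held in Python lists with the least significant digit first (the paper writes the most significant digit first), the function names value, add_redundant, and mul_by_digit are hypothetical, and the most significant digits of the addends are assumed to have zero redundant bits so that no overflow occurs.

```python
def value(digits, r):
    """Value of a redundant radix-2^r number; digits[0] is the least significant digit."""
    return sum(x << (r * i) for i, x in enumerate(digits))


def add_redundant(a, b, r):
    """Digit-wise sum of two d-digit numbers as in Section 2.1.

    Each output digit adds the principal bits of the current digits and the
    redundant bits of the previous digits, so no carry crosses more than one digit.
    """
    mask = (1 << r) - 1
    s = []
    for i in range(len(a)):
        from_below = (a[i - 1] >> r) + (b[i - 1] >> r) if i > 0 else 0
        s.append(from_below + (a[i] & mask) + (b[i] & mask))
    return s


def mul_by_digit(a, b, r):
    """Product of a d-digit number and a 1-digit number as in Section 2.2.

    The (2r+4)-bit partial products are split into low r bits, middle r bits,
    and high 4 bits, and the pieces with equal weight are added.
    """
    mask = (1 << r) - 1
    d = len(a)
    p = [x * b for x in a]
    s = []
    for i in range(d + 2):
        low = p[i] & mask if i < d else 0
        mid = (p[i - 1] >> r) & mask if 1 <= i <= d else 0
        high = p[i - 2] >> (2 * r) if i >= 2 else 0
        s.append(high + mid + low)
    return s
```

With r = 4, add_redundant applied to the two addends of the Section 2.1 diagram yields the digits 001101, 010110, 100001, 010000 (most significant first), and mul_by_digit applied to the multiplicand digits 010011, 101001, 010001 and the multiplier 100101 of the Section 2.2 diagram yields 000010, 010000, 011111, 010100, 000101.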
