# INTRODUCTION

ntegrating security protocols in embedded systems is not an easy proposition. Embedded system designers have so far failed to install required levels of security in embedded systems, due to unusual design constraints like storage limitation, restricted processor performance and easy power drain in embedded devices. Regular cryptography algorithms are not suitable for Embedded systems due to wide footprint.

Due to resource constrains in the design and implementation of secure protocols, system designers are well advised to use area efficient versions of existing, proven security protocols, rather than developing their own protocols or implementations. This call for refined, area and space optimized; easy to deploy versions of original cryptography algorithms tailor made for resource constrained Embedded system.

Of the available choices, AES is the most powerful, most secured encryption algorithm with a key size ranging from 128 to 256. RSA is another well established and most preferred public key cryptography algorithm. To provide security equivalent to AES, RSA public-key sizes would have to range between 3,072 and 15,000 bits long, too big for embedded implementation. One appealing solution to the key size problem is the promising family of asymmetric algorithms known as Elliptic Curve Cryptography, or ECC.

Victor Miller and Neal Koblitz proposed the concept of elliptic curve cryptography in the mid of 1980's as an advancement in public key cryptographic systems such as DSA and RSA. The main advantage of ECC is the usage of shorter key helping compact implementations, resulting in faster cryptographic operations, running on smaller chips or more compact software. For hardware-based implementations of security functions, the benefits of ECC are more in comparison to RSA and AES. Optimized ECC chip designs have been designed and are as much as 37 times faster than its software counterparts. ECC offers other advantages of small software footprint, low hardware implementation costs, low bandwidth requirements, high device performance. Due to these many advantages of ECC a number of hardware implementations have been proposed, and included in many standards such as IEEE 1363and NIST.

Modular multiplication is the most primitive and critical operation in ECC. The design of Finite field multipliers is the complex design issue in the implementation of the ECC processor. A number of multipliers with different area and time complexity are reported in the available literatures. The Karatsuba Ofman algorithm is agreed upon as a most efficient multiplication algorithm and is widely adopted in VLSI implementation. Here input operands are processed as the "most significant half" and the "least significant half".

In proposed multiplier, instead of splitting input operands into the "most significant half" and the "least significant half", our method split operands according to the parity of multiplicands's exponent. Both the space and time complexities of the resulting multiplier are found to be much better than that of traditional multiplier. Here we have concentrated and investigated on area optimization. This is a significant achievement if we intend to use this multiplier in FPGA implementation of elliptical curve cryptography for embedded systems.

The basic step of Karatsuba Ofman's algorithm is a formula that computes the product of two large numbers using three multiplications of smaller numbers, each with about half as many digits as operands, plus some additions and digit shifts. Karatsuba Ofman method of multiplication is a faster way of multiplying two integers of length n. For the first step of the algorithm, it initially requires the breaking of multi digit integers into parts. These parts can then be used in three multiplications to produce a solution.

The Karatsuba algorithm is an effective multiplication algorithm. It diminishes the multiplication of two n -digit numbers to at most single-digit multiplications in general (and exactly when n is a power of 2). It is along these lines quicker than the traditional algorithms, which requires n 2 single-digit products. If n = 2 10 = 1024, in particular, the precise counts are 3 10 = 59,049 and (2 10 ) 2 = 1,048,576, respectively. Here the operands ? and ? can be divided into two equalsize parts L ? and H ? , L ? and H ? respectively, which represent the 2 / l higher and lower order bits of ? and ? . We can split them in two parts as follows:
i l i i i l l i i i l i i x a x a x a x ? ? ? ? = ? = ? = + = = 1 2 0 1 2 1 0 ) ( ? i l i i i l i l i l x a x a ? ? ? = ? = + + = 1 2 0 1 2 0 2 2 ? L H l x x ? ? ? + = 2 ) ( Likewise, L H l x x ? ? ? + = 2 ) ( The product * ) (x ? can be computed as ) ( ). ( )* ( x x x ? ? ? = By using above equation output can be represented as, 2 ) ( )* ( l H L L H l H H L L x x x ? ? ? ? ? ? ? ? ? + + + = To further improve the computation of the product * ) (x ? equation can be modified as, 2 ) )( ( ( )* ( l L H L H H H L L l H H L L x x x ? ? ? ? ? ? ? ? ? ? ? ? ? + + + + + + = II.

# Related Work

Several modular multiplication algorithms have been proposed. Of them all, Karatsuba-Ofman algorithm, KOA is widely used for performing modular arithmatic. In [1] a variant of Karatsuba multiplier of the type GF ((2 n ) 8 is presented and is highly area efficient. Kimmo U. Järvinen et.al [2] adopted an efficient implementation of point multiplication on Koblitz curves and was designed for extremely-constrained, secure applications. In that approach a new technique was introduced for point addition which required fewer registers and small processor. In [3] Hossein Mahdizadeh and Massoud Masoumi built elliptic curve cryptographic processor by organizing multipliers in parallel.

In [4], hybrid multiplier was proposed that intelligently switches between two variants of multiplier depending on the size of multiplicands. The Karatsuba multiplier is efficient algorithm ensuring fewer LUTs and stable number of Flip-flops for the smaller bit multiplications, while the systolic variant ensures fewer LUTs count for the bigger size multiplicands. This hybrid multiplier does the initial recursion using the systolic algorithm while final small sized multiplications are accomplished using the Karatsuba algorithm. Area analysis report suggested that by using a hybrid multiplier instead of just traditional Karatsuba Multiplier, eventually helps in reducing FPGA footprint. This hybrid multiplier exhibits a savior of 7.56 % in terms of Flip flop slices. Involves 52 % fewer LUTs and utilizes 47% fewer slices as compared to traditional Karatsuba multiplier for 256 bit multiplication.

In [5], FPGA based area efficient ECC processor was built using Digit multiplier. Instead of processing vector (polynomial) bit by bit or parallelly, operands are process in 16-bit word format. This Modular multiplier exhibits a savior of 23.88% in terms of Flip flop slices. Proposed Multiplier involves 62 % fewer LUTs and utilizes 59% fewer slices as compared to traditional Karatsuba multiplier for 256 bit multiplication. Further reduction was achieved in ECC processor by employing efficient double point multiplication algorithm.

Area and speed efficient 163 bit Scalar multiplier with improved area and speed was designed by Sujoy Sinha Roy et.al [6]. Scalar multiplication was performed on Xilinx Virtex V platforms over GF (2 163 ). The implementation used a novel three stage pipelined bitparallel Karatsuba multiplier with subquadratic complexity. Scalar multiplication algorithm, optimized field primitives, balanced pipeline stages, and enhanced scheduling of point arithmetic all contributed to a highspeed architecture with a significantly small area Hybrid binary-ternary number system for elliptic curve cryptosystems was designed by Jithra Adikari et.al [7].

In their newly designed system three novel algorithms for both single and double scalar multiplication were implemented. The first algorithm is w-HBTF and the other two algorithms, namely, HBTJF and RHBTJF. The output results showed that hybrid algorithms are almost always faster than classical w-NAF methods or JSF. Kazuo Sakiyama et.al [8] implemented modular multiplication algorithm that integrates three different 


# PROPOSED AREA OPTIMIZED LOW LATENCY MULTIPLIER

Proposed multiplier is based on Karatsuba Ofman Algorithm. Two operands A and B in KOA can be represented in polynomial format as
A=? a i X i ???1 ??=0 =?? ?? ? ?? ?? +?? ?? ?1 ??=0 ?? ?? + ? ?? ?? ?? ?? ?? ?1 ??=0 = X m A H +A L Similarly, B=? b i X i ???1 ??=0 =?? ?? ? ?? ?? +?? ?? ?1 ??=0 ?? ?? + ? ?? ?? ?? ?? ?? ?1 ??=0 = X m B H +B L Where A H =? ?? ?? +?? ?? ?1 ??=0 ?? ?? and A L =? ?? ?? ?? ?? ?? ?1 ??=0 B H =? ?? ?? +?? ?? ?1 ??=0 ?? ?? and B L =? ?? ?? ?? ?? ?? ?1 ??=0
The product of A and B can be written as
AB=A H B H X 2m +{[(A H +A L )(B H +B L )] -[A H B H +A L B L ]}Xm+A L B L
In Modulo 2 domain, addition and subtraction operation can be accomplished using XOR Gate and product using AND gate. Total gates involved for implementing above expression will be five AND gates for accomplishing five multiplication activities and four XOR gates for performing modulo 2 addition. Further, it is a three level realization, as input operand has to pass through maximum of three levels of XOR gate before reaching output line because of overlapping. Therefore, total XOR gate delay for implementing the above expression will be 3T D besides the cost of the recursive computation of the three partial products. Thus basic KOA multiplier utilizes five AND gates and four XOR gates to accomplish the basic multiplication activity.

Instead of splitting input operands into the "most significant half" and the "least significant half", the method split operands according to the parity of X's exponent.

Accordingly, Operands A and B may be rewritten as
A=? a i X i ???1 ??=0 =? ?? 2?? ?? ?1 ??=0 ?? 2?? + ? ?? 2??+1 ?? 2??+1 ?? ?1 ??=0 =? ?? 2?? ?? ?1 ??=0 ?? 2?? + ?? ? ?? 2??+1 ?? 2?? ?? ?1 ??=0 Similarly, B=? b i X i ???1 ??=0 =? ?? 2?? ?? ?1 ??=0 ?? 2?? + ? ?? 2??+1 ?? 2??+1 ?? ?1 ??=0 =? ?? 2?? ?? ?1 ??=0 ?? 2?? + ?? ? ?? 2??+1 ?? 2?? ?? ?1 ??=0
Now let us define
A E = ? ?? 2?? ?? ?1 ??=0 ?? 2?? A O =? ?? 2??+1 ?? 2?? ?? ?1 ??=0 B E =? ?? 2?? ?? ?1 ??=0 ?? 2?? B O =? ?? 2??+1 ?? 2?? ?? ?1 ??=0
Let i=2 and X 2 =Y, Then product AB can be written as
AB= (A E (y) + XA O (y)) (B E (y) + XB O (y)) ={A E (y)B E (y)+X 2 A O (y)B O (y)}+ X{A E (y)B O (y) + A O (y)B E (y))}
Applying KOA formula to above expression,


# = {[A E (y)B E (y) + YA O (y)B O (y)]} + X{[(A E (y) + A O (y))(B E (y) + B O (y))]-[A E (y)B E (y) + A O (y)B O (y)]}

In VLSI implementation of above expression multiplying a polynomial by x or y = x 2 is equivalent to shifting its coefficients left, and no gate is required. For implementing this revised expression four AND gates and three XOR gates are required. The expressions in the three square brackets can be computed concurrently, and these addition operations require one XOR gate delay 1T D . We conclude that computing product AB needs only a total delay of 2T D besides the cost of the recursive computation of the three partial products.

Compared to the 3T D gate delays required in traditional formula, one XOR gate delay 1T D is saved for each recursive iteration. Thus compared to the four EXOR gates delays required in traditional formula, one XOR gate and also one AND gate is saved for each recursive iteration, which is a significant achievement considering embedded systems are highly resource constrained. . Thus this variant of KOA multiplier utilizes four AND gates and three XOR gates to accomplish the basic multiplication activity. From the above basic concept, it is observed that by splitting the operand polynomials, based on the parity, results in reduction of hardware (fewer AND and XOR gates) and latency time (fewer levels due to little overlapping). By extending this very concept of splitting the operands based on parity to FPGA implementation, we have explored and implemented FPGA based finite field multiplier and compared FPGA footprint with that of traditional KOA multiplier. Also we have incorporated the proposed multiplier in ECC processor and investigated its impact on the footprint of entire processor. In this paper, we have only concentrated on area occupancy of proposed multiplier, its impact on processor footprint but haven't investigated quantum of latency improvement, which forms the future scope of investigation.

IV.


# RESULTS AND DISCUSSION

In this section, we focus on the FPGA implementation of the proposed multiplier. The proposed architecture is coded in verilog HDL and is synthesized using Xilinx ISE version 14.4 design software and is implemented on Xilinx Virtex-4 xc4vlx200ff1513 FPGA. The RTL schematic for the proposed Finite Field multiplier is shown in figure 1.   Since the area of the complete processor mainly depends on the incorporated GF multiplier, most compared to traditional Karatsuba Ofman multiplier for 16 bit multiplication. This helps in bringing down the footprint of modular multiplier on FPGA while building cryptography schemes. 


# b) Area Report of Crypto Processor

The ECC processor implementation uses a double point multiplication algorithm proposed in [9]. We have adapted revised and area optimized version of Montgomery's PRAC algorithm [10]. The area comparision is carried out for crypto processors using proposed multiplier the one proposed in [11].

Table 4 below shows device utilization summary of Crypto processor, when the above proposed multiplier is employed for performing modular multiplications. Processor needs 376 slices out of 89,088 available slices in the target device. Among the 178,176 available four input LUTs only 573 are used. The multiplier also needs only 451 out of 178,176 Flip Flops.


# Table 4 : Resource Utilization by Ecc Processor

We have investigated area occupancy of cryptoprocessr built using proposed variant of Karatsuba similar ECC work. Bar chart representation shown in figure 3 compares the resources utilized by proposed ECC with a similar work [11]. The comparison of implemented ECC processor employing proposed modular multiplier and the crypto Processor using traditional Karatsuba Multiplier with respect to the area occupied (Slice registers, Slices, LUTs and IOBs) is tabulated in Table 5 and compared in figure 3. Proposed implementation utilize about 23.93 % reduced slices and 62.65 % fewer LUTs. This is a significant achievement as we intend to put this ECC hardware for providing security services in embedded system which is known to be highly resources thirsty.  V.


# CONCLUSION

We have proposed a novel method to implement the modular multiplier for performing critical multiplication activities in ECC processor. Area occupancy of the resulting multiplier is found to be far superior than that of traditional KOA multiplier. This contributes in bringing down the footprint of entire ECC process. While injecting this modification, nowhere we have tampered with the basic flow and structure of basic ECC protocol. Just by processing the operands in different style, we have achieved area optimization of complete ECC protocol and made it suitable for embedded system with no compromise on security of ECC algorithm. 


# REFERENCES RÉFÉRENCES REFERENCIAS
![XV Issue III Version I existing algorithms, based on Barrett reduction and Montgomery reduction. This modular multiplier is highly speed efficient. III.](image-2.png "Je")
1![Fig. 1 : RTL Diagram of Proposed multiplier](image-3.png "Fig. 1 :")
2![Fig. 2 : Resource utilization comparison of two KOA variants for L=16 A bar chart representation shown in figure 2 compares the resource utilization of Proposed multiplier with traditional Karatsuba Ofman Multiplier for L =16. Thus proposed multiplier exhibits a savior of 20 % in terms of Flip flop slices. Proposed Multiplier involves 53 % fewer LUTs and utilizes51.72% fewer slices as](image-4.png "Fig. 2 :")
3![Fig.3: Resource utilization comparison of ECC processor using proposed multiplier and[11] ](image-5.png "Fig. 3 :")
![1. C. Paar A new architecture for a parallel finite field multiplier with low complexity based on composite fields. IEEE Transactions on Computers, 45(7): 856-861, July 1996.](image-6.png "")
1Karatsuba Ofman MultiplierProposed New Version MultiplierMultiplicandFlip flopsLUTsOccupiedIOBsFlip flopsLUTsOccupiedIOBsSizeSlicesSlices2 bit172516915161394 bit32754317213119138 bit4615079332957322116 bit5416287654578423732 bit10231516812981142786964 bit266665411257145270143133128 bit64617821000513273526272261256 bit11181376170089135301038529517
2
3
5
			© 2015 Global Journals Inc. (US) Global Journal of Researches in Engineering ( ) Volume XV Issue
		
		
* 
	
		Efficient algorithm and architecture for elliptic curve cryptography for extremely constrained secure applications
		
			RAzarderakhsh
		
		
			KJarvinen
		
		
			Mozaffari-Kermani
		
	
		IEEE Transactions on circuits and systems-I
		
			61
			4
			April 2014
		
	
* 
	
		Novel architecture for efficient FPGA implementation of elliptic curve cryptographic processor over GF(2 163 )
		
			HosseinMahdizadeh
		
		
			MassoudMasoumi
		
	
		IEEE Transactions on very large scale integration (vlsi) systems
				
			Dec.2013
			21
			
		
* 
	
		Low footprint Hybrid Finite field multiplier for Embedded cryptography
		
			Sunil Devidas Bobade
		
		
			Dr
		
		
			RVijay
		
		
			Mankar
		
	
		International Journal of Computer Science and Information Security(IJCSIS)
		
			13
			3
			
			March 2015
		
	
* 
	
		Space optimized Multiplier Architecture for Embedded cryptography
		
			Sunil Devidas Bobade
		
		
			Dr
		
		
			RVijay
		
		
			Mankar
		
	
		International Journal of Computer and Applications (IJCA)
		
			113
			14
			
			March 2015
		
	
* 
	
		Theoretical modeling of elliptic curve scalar multiplier on LUT-based FPGAs for area and speed
		
			.S SRoy
		
		
			Rebeiro
		
		
			DMukhopadhyay
		
	
		IEEE Transactions on Very Large Scale Integration (VLSI) systems
				
			May 2013
			21
		
	
* 
	
		Efficient algorithm and architecture for elliptic curve cryptography for extremely constrained secure applications
		
			RAzarderakhsh
		
		
			KJarvinen
		
		
			Mozaffari-Kermani
		
	
		IEEE Transactions on circuits and systems-I
		
			61
			4
			April 2014
		
	
* 
	
		Tripartite modular multiplication
		
			KazuoSakiyama
		
		
			MiroslavKnezevica
		
		
			JunfengFana
		
		
			BartPreneela
		
		
			IngridVerbauwhedea
		
	
		Integration, the VLSI Journal
		
			44
			4
			
			September 2011
		
	
* 
	
		Theoretical modeling of elliptic curve scalar multiplier on LUT-based FPGAs for area and speed
		
			.S SRoy
		
		
			Rebeiro
		
		
			Mukhopadhyay
		
	
		IEEE Transactions on Very Large Scale Integration (VLSI) systems
				
			May 2013
			21
		
	
* 
	
		A New Double Point Multiplication Algorithm and its Application to Binary Elliptic Curves with Endomorphisms
		
			RAzarderakhsh
		
		
			KKarabina
		
	
		IEEE Transactions on Computers
		
			2013
		
	
	to appear: pp


* 
	
		Reconfigurable Architecture for Elliptic Curve Cryptography Using FPGA
		
			ARahuman
		
		
			GAthisha
		
		
			2013
			Hindawi Publishing Corporation Mathematical Problems in Engineering