Prime field elliptic curve cryptography coprocessor
By designing a prime field elliptic curve cryptography coprocessor, and employing NAF encoding and a pipelined structure, the problems of complex structure and high power consumption in existing technologies are solved, achieving high-performance computing speed and frequency, which is suitable for the encryption and decryption needs of the 5G era.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA NORMAL UNIV
- Filing Date
- 2022-08-01
- Publication Date
- 2026-06-23
AI Technical Summary
Existing elliptic curve cryptography coprocessors suffer from complex structures, high power consumption, and low computational performance, making it difficult to meet the data encryption requirements of the 5G era.
Design a prime field elliptic curve cryptography coprocessor comprising a register module, a NAF encoding module, an arithmetic module, and a controller module. The NAF-encoded point multiplication algorithm is adopted. The arithmetic module consists of a modular multiplier unit, a modular squarer unit, a first modular adder unit, and a second modular adder unit, arranged as three-stage pipelines so that modular multiplication and modular squaring are performed in parallel. The controller module controls the data flow in a pipelined manner.
It achieves high-performance computing speed, reaching a maximum operating frequency of 390MHz, with a good performance trade-off between area and time, making it suitable for high-speed encryption and decryption applications.
Figure CN115421791B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of information security technology, and in particular to an elliptic curve cryptography coprocessor for prime number fields.
Background Technology
[0002] Mobile communication is evolving from traditional human-to-human connections to connections between humans and things, and between things themselves; the Internet of Things (IoT) is an inevitable trend. With the deepening development of the IoT, the number of IoT smart terminal devices is exploding, bringing more complex security challenges such as device authentication, data protection, and wireless communication. Therefore, designing a secure cryptographic coprocessor for IoT systems to ensure the security and privacy of data transmission is crucial. Furthermore, with the large-scale deployment and application of 5G devices, IoT applications demand higher speeds for data encryption; the performance of existing secure cryptographic coprocessors can no longer meet the needs of the 5G era, urgently requiring a high-performance, high-efficiency secure cryptographic coprocessor.
[0003] Elliptic Curve Cryptography (ECC) is an effective solution for achieving this goal. Since it was proposed by Koblitz and Miller in 1985, ECC has rapidly become the most popular next-generation public-key cryptosystem due to its advantages such as short keys, low latency, high security, and fast processing speed. It has been accepted and standardized by international standards organizations such as ANSI, IEEE, NIST, and SCA, and is widely used in the fields of IoT and information security. ECC can be implemented on both software and hardware platforms. ECC schemes implemented on FPGA hardware feature reprogrammability, configurability, and high-efficiency cryptographic processing performance, making them very suitable for IoT security applications.
[0004] However, existing elliptic curve cryptography coprocessors still suffer from drawbacks such as complex structure and high power consumption. Furthermore, due to the complexity of ECC theory and the large amount of computation, existing technologies struggle to meet the computational performance requirements. In current technologies, using low-bit-width multipliers leads to lower arithmetic unit performance; using a large number of multipliers results in high power consumption; and using high-bit-width multipliers, due to hardware limitations, forces operations to run at lower frequencies, resulting in lower computational performance.
Summary of the Invention
[0005] This invention aims to at least solve one of the technical problems existing in the prior art. To this end, this invention proposes an elliptic curve cryptography coprocessor for prime number fields.
[0006] The technical solution adopted in this invention is:
[0007] This invention includes a prime field elliptic curve cryptography coprocessor, comprising a register module, a NAF encoding module, an arithmetic module, and a controller module;
[0008] The register module is used to receive the input raw data and store the coordinate data and intermediate data generated during the point multiplication operation;
[0009] The NAF encoding module is used to execute the NAF-encoded point multiplication algorithm to reduce the number of point addition and point doubling operations in the point multiplication process;
[0010] The arithmetic module includes a modular multiplier unit, a modular squarer unit, a first modular adder unit, and a second modular adder unit. The modular multiplier unit is connected to the first modular adder unit and the second modular adder unit to form a three-stage pipeline structure to perform modular multiplication operations. The modular squarer unit is connected to the first modular adder unit and the second modular adder unit to form a three-stage pipeline structure to perform modular squaring operations. The modular multiplier unit and the modular squarer unit perform operations in parallel.
[0011] The controller module is used to control the data flow of the point multiplication operation in a sequential manner.
[0012] Furthermore, the register module includes a random number register and a coordinate register, the random number register being connected to the NAF encoding module;
[0013] The random number register is used to store random number data;
[0014] The coordinate register is used to store coordinate data.
[0015] Furthermore, the modular multiplier unit includes a multiplication operation component and a modular reduction and subtraction operation component. The output of the multiplication operation component is the input of the modular reduction and subtraction operation component, and the output of the modular reduction and subtraction operation component flows into the first modular adder unit and the second modular adder unit.
[0016] Furthermore, the area of the modular squarer unit is 0.66 times the area of the modular multiplier unit.
[0017] Furthermore, the modular squarer unit includes a square calculation component and a modular reduction calculation component. The output of the square calculation component is the input of the modular reduction calculation component, and the output of the modular reduction calculation component flows into the first modular adder unit and the second modular adder unit.
[0018] Furthermore, the first modular adder unit includes a first adder, a second adder, multiple shift registers, and multiple data selectors. The first adder and the second adder are connected in series. The multiple data selectors are connected in stages and connected to the first adder to control the data input to the first adder. The multiple shift registers and the multiple data selectors are connected and connected to the input terminal of the second adder. The output terminal of the second adder is connected to one of the data selectors and then to two of the shift registers.
[0019] Furthermore, the output of the first modular adder unit is designed with four sets of interfaces to output corresponding data.
[0020] Furthermore, the structure of the second modular adder unit is the same as that of the first modular adder unit.
[0021] The beneficial effects of this invention are:
[0022] This invention provides a prime-field elliptic curve cryptography coprocessor, comprising a register module, a NAF encoding module, an arithmetic module, and a controller module. The arithmetic module includes a modular multiplier unit, a modular squaring unit, a first modular adder unit, and a second modular adder unit. The modular multiplier unit is connected to the first and second modular adder units to form a three-stage pipelined structure for performing modular multiplication operations. The modular squaring unit is also connected to the first and second modular adder units to form a three-stage pipelined structure for performing modular squaring operations. The modular multiplier unit and the modular squaring unit perform operations in parallel, which can accelerate the computation speed, achieve higher performance, and simultaneously achieve an optimal trade-off between area and performance, making it very suitable for high-speed encryption and decryption applications.
[0023] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description or may be learned by practice of the invention.
Attached Figure Description
[0024] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments taken in conjunction with the following drawings, in which:
[0025] Figure 1 is a framework diagram of the prime field elliptic curve cryptography coprocessor described in an embodiment of the present invention;
[0026] Figure 2 is a schematic diagram of the NAF-encoded point multiplication algorithm described in an embodiment of the present invention;
[0027] Figure 3 is a schematic diagram of the ordinary Montgomery algorithm described in an embodiment of the present invention;
[0028] Figure 4 is a schematic diagram of the improved Montgomery algorithm described in an embodiment of the present invention;
[0029] Figure 5 is a structural diagram of the arithmetic module described in an embodiment of the present invention;
[0030] Figure 6 is a schematic diagram of the modular multiplication operation described in an embodiment of the present invention;
[0031] Figure 7 is a schematic diagram of the simplification process of the squaring operation described in an embodiment of the present invention;
[0032] Figure 8 is a schematic diagram of the improved Montgomery modular squaring algorithm described in an embodiment of the present invention;
[0033] Figure 9 is a schematic diagram of the modular squaring operation described in an embodiment of the present invention;
[0034] Figure 10 is a schematic diagram of the improved modular inverse algorithm described in an embodiment of the present invention;
[0035] Figure 11 is a schematic diagram of the adder described in an embodiment of the present invention;
[0036] Figure 12 is a schematic diagram of the structure of the first modular adder unit in an embodiment of the present invention.
Detailed Implementation
[0037] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0038] In the description of this invention, it should be understood that the orientation descriptions, such as up, down, front, back, left, right, etc., are based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limiting this invention.
[0039] In the description of this invention, "several" means one or more, "multiple" means two or more, and "greater than," "less than," "exceeding," etc., are understood to exclude the stated number. The use of terms such as "first," "second," "third," etc., is merely for distinguishing technical features and should not be construed as indicating or implying relative importance, or implicitly specifying the number of indicated technical features or their sequential relationship.
[0040] In the description of this invention, unless otherwise explicitly defined, terms such as "set up," "install," and "connect" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this invention in conjunction with the specific content of the technical solution.
[0041] The core operations in Elliptic Curve Cryptography (ECC) are basic operations on the base field and point multiplication operations on elliptic curves. An elliptic curve defined over the binary field $GF(2^m)$ satisfies the following equation:
[0042] $E: y^2 + xy = x^3 + ax^2 + b$ (1);
[0043] This equation is called the Weierstrass equation, where $a, b \in GF(2^m)$ and $b \neq 0$. The elements of an elliptic curve are the set of all points $(x, y)$ satisfying equation (1), together with the point at infinity $O$. The most important operations on an elliptic curve are point addition (PA) and point doubling (PD). For $P \neq Q$, PA is expressed as $R(x_R, y_R) = P + Q$; for $P = Q$, PD is expressed as $R(x_R, y_R) = 2P$.
[0044] The formulas for calculating PA and PD differ between coordinate systems. Table 1 lists the computational complexity of PA and PD for the curve $y^2 + xy = x^3 + \dots$ under various coordinate systems. Using mixed Jacobian-affine coordinates to calculate PA and Jacobian coordinates to calculate PD avoids the most time-consuming inversion operation, reduces the number of modular multiplication and modular squaring operations, and achieves the highest overall computational speed.
[0045] Table 1 Comparison of computational complexity in different coordinate systems
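[0046]
| Coordinate system | PA | PD |
| --- | --- | --- |
| Affine | 1I + 2M + 1S | 1I + 2M + 2S |
| Projective | 12M + 2S | 7M + 3S |
| Jacobian | 12M + 4S | 4M + 4S |
| Chudnovsky | 11M + 3S | 5M + 4S |
| Mixed Jacobian-affine | 8M + 3S | |
I, M, and S denote modular inversion, modular multiplication, and modular squaring, respectively.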
[0047] Switching between coordinate systems requires a coordinate transformation. Formula (2) converts affine coordinates to Jacobian coordinates, and formula (3) converts Jacobian coordinates to affine coordinates.
[0048] $(x, y) \rightarrow \{(X, Y, Z) \mid X = x,\ Y = y,\ Z = 1\}$ (2);
[0049] $(X, Y, Z) \rightarrow \{(x, y) \mid x = X / Z^2,\ y = Y / Z^3\}$ (3);
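Formulas (2) and (3) can be sketched in Python over a prime field. This is only an illustration: the function names and the small prime 97 used in the usage note are assumptions, and the division in formula (3) is taken to mean multiplication by a modular inverse.

```python
def affine_to_jacobian(x, y):
    """Formula (2): (x, y) -> (X, Y, Z) with X = x, Y = y, Z = 1."""
    return (x, y, 1)


def jacobian_to_affine(X, Y, Z, p):
    """Formula (3): x = X / Z^2, y = Y / Z^3, computed modulo the prime p.

    A single modular inversion of Z is needed, which is why the conversion
    back to affine coordinates is deferred to the end of a point multiplication.
    """
    z_inv = pow(Z, -1, p)            # modular inverse (Python 3.8+)
    z_inv2 = z_inv * z_inv % p
    return (X * z_inv2 % p, Y * z_inv2 * z_inv % p)


def rescale(X, Y, Z, lam, p):
    """Return another Jacobian representative of the same affine point."""
    return (X * lam * lam % p, Y * pow(lam, 3, p) % p, Z * lam % p)
```

For example, with p = 97 the triple returned by `rescale(5, 11, 1, 7, 97)` converts back to the affine point (5, 11), showing that Jacobian coordinates are only defined up to the scaling factor.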
[0050] In mixed Jacobian-affine coordinates, PA is calculated using the following formula:
[0051]
[0052] In mixed Jacobian coordinates, the formula for calculating PD is:
[0053]
[0054] The point multiplication (PM) operation on an elliptic curve is defined as kP = (k / 2)P + (k / 2)P, where k is a positive integer. PM can be decomposed into two operations: PA and PD. To accelerate the operation speed of point multiplication on elliptic curves and achieve an optimal trade-off between area and performance, this embodiment proposes a prime-field elliptic curve cryptography coprocessor. Within a 130nm CMOS standard cell library, this coprocessor achieves an area of 222.37k gates, a maximum operating frequency of 390MHz, and a multiplication time of 215µs. This coprocessor circuit exhibits significant advantages in both performance and latency, making it highly suitable for high-speed encryption and decryption applications in the Internet of Things (IoT).
[0055] Referring to Figure 1, this invention proposes a prime-field elliptic curve cryptography coprocessor, comprising a register module, a NAF encoding module, an arithmetic module, and a controller module; wherein,
[0056] The register module is used to receive the raw input data and store the coordinate data and intermediate data generated during the point multiplication operation;
[0057] The NAF encoding module is used to execute the NAF-encoded point multiplication algorithm to reduce the number of point addition and point doubling operations during the point multiplication process;
[0058] The arithmetic module includes a modular multiplier unit, a modular squarer unit, a first modular adder unit, and a second modular adder unit. The modular multiplier unit is connected to the first modular adder unit and the second modular adder unit to form a three-stage pipeline structure to perform modular multiplication operations. The modular squarer unit is connected to the first modular adder unit and the second modular adder unit to form a three-stage pipeline structure to perform modular squaring operations. The modular multiplier unit and the modular squarer unit perform operations in parallel.
[0059] The controller module is used to control the data flow of point multiplication operations in a pipelined manner.
[0060] In this embodiment, the register module includes random number registers k and h and coordinate registers x2, y2, x3, y3, and z3, which are used to store coordinate data and intermediate data generated during the point multiplication operation of the elliptic curve cryptography algorithm.
[0061] In this embodiment, because the NAF-encoded point multiplication algorithm calculates PM more efficiently, a NAF encoding module is added to the prime field elliptic curve cryptography coprocessor to execute this algorithm, thereby reducing the number of point addition and point doubling operations during the point multiplication process. The principle of the NAF-encoded point multiplication algorithm is to reduce the number of non-zero digits of a positive integer k through NAF encoding. The NAF encoding of k can be represented as... where $k_i \in \{0, \pm 1\}$, $k_{l-1} \neq 0$, and no two consecutive digits $k_i$ are non-zero. NAF encoding has the fewest non-zero digits, about 1/3 of the bit length. Using the NAF-encoded point multiplication algorithm, the cost of PM can be reduced to t PD operations and t/3 PA operations.
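The NAF property described above can be illustrated with the standard textbook recoding, in which k is written as the sum of $k_i \cdot 2^i$ with digits in {0, ±1}. The sketch below is that well-known procedure, not the circuit of the NAF encoding module; the function name `naf` is an assumption made for the example.

```python
def naf(k):
    """Return the NAF digits of a positive integer k, least significant first.

    Each digit is 0, +1, or -1, and no two adjacent digits are both non-zero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k & 3)   # +1 when k % 4 == 1, -1 when k % 4 == 3
            k -= d            # makes k divisible by 4, forcing the next digit to 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

For k = 7 the binary form 111 has three non-zero bits, while `naf(7)` gives `[-1, 0, 0, 1]` (that is, 8 − 1) with only two, which is the reduction in point additions that the paragraph refers to.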
[0062] However, when calculating PM using the NAF-encoded point multiplication algorithm, the binary k must first be converted to NAF form, which costs a significant number of clock cycles. Referring to Figure 2, which is a schematic diagram of the NAF-encoded point multiplication algorithm, precomputing h = 3k avoids the NAF conversion at the cost of only one additional register. In the algorithm, $h_i$ and $k_i$ are scanned simultaneously from left to right, and PD is executed once for each bit scanned. When $h_i = 1$ and $k_i = 0$, PA is executed once; when $h_i = 0$ and $k_i = 1$, PS (point subtraction) is executed once. Since Q − P = Q + (−P), where −P = (x, −y) for P = (x, y), PS can be converted to PA at almost no cost.
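The h = 3k scan can be sketched as follows. This is a minimal model, assuming the scan starts at the top bit of h and stops at bit 1; Figure 2 is not reproduced here, so the exact loop bounds are an assumption. The group operations are passed in as functions, so the sketch can be exercised with plain integers standing in for curve points.

```python
def point_mul_3k(k, P, add, double, neg):
    """Compute k*P by scanning the bits of h = 3k and k together (k >= 1).

    Returns the result and a count of PD / PA / PS operations.
    """
    h = 3 * k
    Q = P                                   # the top bit of h always acts as a +1 digit
    ops = {"PD": 0, "PA": 0, "PS": 0}
    for i in range(h.bit_length() - 2, 0, -1):
        Q = double(Q)                       # one PD per scanned bit
        ops["PD"] += 1
        h_i, k_i = (h >> i) & 1, (k >> i) & 1
        if h_i == 1 and k_i == 0:
            Q = add(Q, P)                   # PA
            ops["PA"] += 1
        elif h_i == 0 and k_i == 1:
            Q = add(Q, neg(P))              # PS, done as PA with -P
            ops["PS"] += 1
    return Q, ops
```

With integers as a stand-in group (`add = lambda a, b: a + b`, `double = lambda a: 2 * a`, `neg = lambda a: -a`), `point_mul_3k(7, 1, add, double, neg)` returns 7 after three doublings and one subtraction, matching the NAF form 8 − 1.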
[0063] Next, the modular multiplier unit, modular squarer unit, first modular adder unit, and second modular adder unit in the arithmetic module will be described.
[0064] Modular multiplication is the most important and critical operation in ECC cryptosystems, and the choice of modular multiplication algorithm directly affects overall system efficiency. For $a, b \in F_p$, modular multiplication in the finite field can be computed by multiplying a and b as integers and then reducing the product modulo p, but the modulo operation is computationally expensive. The Montgomery algorithm effectively reduces this complexity. Its principle is to interleave multiplication and modular reduction, replacing the modulo operation with lower-cost addition and shift operations. The Montgomery algorithm requires preprocessing and post-processing of the data before and after computation, so it is inefficient for a single modular multiplication, but it is well suited to computation-intensive applications such as the ECC algorithm. Referring to Figure 3, which is a schematic diagram of the standard Montgomery algorithm, the algorithm divides the k-bit operand a into s = k / w segments of w bits each. It replaces the k-bit × k-bit multiplier with a w-bit × k-bit multiplier and requires only s iterations to obtain the Montgomery modular multiplication result, effectively reducing hardware area and power consumption. In the standard Montgomery algorithm, steps 2.2 to 2.4 are equivalent to one modular reduction $R(x) = x \cdot 2^{-w} \bmod p$. Referring to Figure 4, this invention proposes an improved Montgomery algorithm, namely the Montgomery fast partial modular reduction algorithm. By combining the standard Montgomery algorithm with the NIST prime, a special modular reduction structure avoids the two multiplications in steps 2.2 to 2.4 and quickly yields the result of $R(x) = x \cdot 2^{-w} \bmod p_{256}$.
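A word-serial Montgomery multiplication of the kind described above can be sketched as follows. Figure 3 is not reproduced in the text, so the loop body is the usual textbook form (accumulate, derive a quotient word, add a multiple of p, shift); the mapping of its lines to steps 2.1 to 2.4 is an assumption, as are the function and variable names.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1   # NIST P-256 prime


def montgomery_mul(a, b, p, k=256, w=64):
    """Return a * b * 2^(-k) mod p for odd p and 0 <= a, b < p."""
    mask = (1 << w) - 1
    p_neg_inv = (-pow(p, -1, 1 << w)) & mask   # -p^(-1) mod 2^w, precomputed once
    u = 0
    for i in range(k // w):                    # s = k / w iterations
        a_i = (a >> (w * i)) & mask            # next w-bit segment of a
        u += a_i * b                           # w-bit x k-bit multiply-accumulate
        q = (u * p_neg_inv) & mask             # first extra multiplication
        u = (u + q * p) >> w                   # second extra multiplication, exact shift
    return u - p if u >= p else u              # final conditional subtraction
```

The two multiplications inside the loop that involve `p_neg_inv` and `p` are the ones a special-form prime makes unnecessary, which is the motivation for the improved algorithm.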
[0065] The improved Montgomery algorithm is based on the following observation. When calculating $R(x) = x \cdot 2^{-64} \bmod p_{256}$, we have $x \bmod p_{256} \equiv x \cdot (p_{256} + 1) \bmod p_{256}$, and the lower 64 bits of $p_{256} + 1 = 2^{256} - 2^{224} + 2^{192} + 2^{96}$ are all 0. Writing $x = c \cdot 2^{64} + l$, it is only necessary to calculate $r = l \cdot 2^{-64} \bmod p_{256} = (l \cdot (p_{256} + 1) \gg 64) \bmod p_{256} = (l \ll 192) - (l \ll 160) + (l \ll 128) + (l \ll 32)$ to obtain the result $R(x) = (x \cdot (p_{256} + 1) \gg 64) \bmod p_{256} = (c + r) \bmod p_{256}$.
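The shift-and-add reduction above can be checked numerically. The sketch below is an arithmetic model only, with assumed function names; the final `% P256` stands in for the conditional correction that the hardware performs in the modular adder units.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
MASK64 = (1 << 64) - 1


def partial_reduce(x):
    """Return a value congruent to x * 2^(-64) mod p256 using only shifts and adds."""
    c, l = x >> 64, x & MASK64                              # x = c * 2^64 + l
    r = (l << 192) - (l << 160) + (l << 128) + (l << 32)    # l * (p256 + 1) >> 64
    return c + r                                            # not yet fully reduced


def montgomery_mul_p256(a, b):
    """Return a * b * 2^(-256) mod p256 with four partial reductions."""
    u = 0
    for i in range(4):
        a_i = (a >> (64 * i)) & MASK64
        u = partial_reduce(u + a_i * b)
    return u % P256
```

Compared with a generic word-serial Montgomery loop, no quotient word and no multiplication by p appear: the special form of $p_{256} + 1$ turns the reduction into four shifted copies of l.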
[0066] Referring to Figure 5, the modular multiplier unit is connected to the first modular adder unit and the second modular adder unit, forming a three-stage pipeline structure to perform modular multiplication operations. The modular multiplier unit includes a multiplication component and a modular reduction component. The output of the multiplication component is the input of the modular reduction component, and the output of the modular reduction component flows into the first and second modular adder units. The first and second modular adder units are shared resources of the ECC system and are connected to the modular multiplier unit to perform step 3 of the standard Montgomery algorithm shown in Figure 3. The multiplication component uses four 64-bit × 64-bit multipliers to perform a 64-bit × 256-bit multiplication, concatenating the four 128-bit intermediate results into two 256-bit values. The results are stored in registers c0 and c1, represented as $mul(a_i, b) = c0 + c1 \cdot 2^{64}$. The data concatenation process is shown in Table 2. To reduce system latency and area, the modular reduction component uses carry-save adders (CSA) to replace some adders, where CSA257 and CSA258 are used to calculate the addition $u = u + a_i \cdot b$ in step 2.1 of the standard Montgomery algorithm shown in Figure 3. Since a CSA adder outputs two values, a 64-bit adder is used to add the lower 64 bits of the two values to obtain the value of $h \cdot 2^{64} + l$. The calculation method for r is shown in Table 3: by concatenating l as shown in Table 3, using a 96-bit adder to calculate the high 96 bits of r, and then concatenating them with the low 160 bits, the value of $r = l \cdot 2^{-64} \bmod p_{256}$ is obtained. CSA256 is used to calculate $R(x) = (c + r) \bmod p_{256}$. The result is stored in registers d0 and d1 and participates in the calculation of the next loop. After four loops, the result of CSA256 is sent to the first and second modular adder units to perform the final modular addition, obtaining the Montgomery modular multiplication result $a \cdot b \cdot r^{-1} \bmod p_{256}$.
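The concatenation of four 128-bit products into two 256-bit values can be illustrated as follows. Table 2 is not reproduced in the text, so the pairing used here (even-indexed products in c0, odd-indexed products in c1) is an assumption chosen so that the stated identity $mul(a_i, b) = c0 + c1 \cdot 2^{64}$ holds.

```python
MASK64 = (1 << 64) - 1


def mul_64x256(a_i, b):
    """Split a 64-bit x 256-bit product into two 256-bit values (c0, c1)."""
    b0, b1, b2, b3 = [(b >> (64 * j)) & MASK64 for j in range(4)]
    p0, p1, p2, p3 = a_i * b0, a_i * b1, a_i * b2, a_i * b3   # four 128-bit products
    c0 = (p2 << 128) | p0     # products of weight 2^0 and 2^128 do not overlap
    c1 = (p3 << 128) | p1     # products of weight 2^64 and 2^192, relative to 2^64
    return c0, c1
```

Because the two products inside each of c0 and c1 occupy disjoint bit ranges, forming them needs only wiring; the single real addition `c0 + (c1 << 64)` is the one that can be absorbed into the carry-save adders.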
[0067] Table 2 Concatenation of the operation data of the multiplication component in the modular multiplier unit
[0068]
[0069] Table 3 Calculation method of R
[0070]
[0071] The modular multiplication process is shown in Figure 6. The multiplication component, the modular reduction component, and the first and second modular adder units form a three-stage pipeline structure. A single modular multiplication takes 6 clock cycles, and only 4 clock cycles per operation when pipelined. The modular multiplier unit implemented using the improved Montgomery algorithm achieves a good balance between area and performance.
[0072] Next, the modular squarer unit will be explained.
[0073] The modular squaring unit is designed based on the modular multiplier unit. Squaring, as a special type of multiplication, allows for the merging of intermediate multiplication results, saving approximately half the computational cost. The modular reduction part uses the same Montgomery fast partial modular reduction algorithm as the modular multiplier unit.
[0074] To fit the structure of $R(x) = x \cdot 2^{-64} \bmod p_{256}$, the squaring operation must complete its output within 4 clock cycles, and the output of each clock cycle must be 64 bits higher than the result of the previous clock cycle. By dividing the input a of the modular squarer unit into 8 segments of 32 bits each, and merging and rearranging the intermediate multiplication results, four results conforming to the structure of $R(x) = x \cdot 2^{-64} \bmod p_{256}$ are obtained. This embodiment uses nine 32-bit × 32-bit multipliers to implement the squaring operation. Nine multiplications are performed in parallel per clock cycle, and the inputs to the multipliers are selected by a data selector. The nine 64-bit intermediate results are denoted result9 to result1 from most significant to least significant. The squaring process is shown in Figure 7. In the result of each clock cycle, the gray part is concatenated into a 321-bit value c2, and the white part is concatenated into a 256-bit value c3, represented as $squ(a) = c2 + c3 \cdot 2^{33}$. In the simplification process of the squaring operation shown in Figure 7, a total of 8 multiplication results cannot be merged. These occupy the least significant part of c2 in the first three clock cycles and all of c2 in the last clock cycle, which results in two different calculation methods for c2. The calculation methods for c2 and c3 are shown in Tables 4, 5 and 6.
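The arithmetic behind the merged squaring can be checked with a short model. It does not reproduce the per-cycle schedule of Figure 7 or the c2/c3 layouts of Tables 4 to 6; it only shows, under assumed names, that 8 square terms plus 28 cross terms (36 products, or 9 multipliers over 4 cycles) reproduce a², and that every cross term carries the factor $2^{33}$.

```python
MASK32 = (1 << 32) - 1


def square_by_segments(a):
    """Square a 256-bit value from 32-bit segments; return (a^2, number of products)."""
    seg = [(a >> (32 * i)) & MASK32 for i in range(8)]
    # The 8 square terms a_i^2 cannot be merged with anything else.
    squares = sum((seg[i] * seg[i]) << (64 * i) for i in range(8))
    # Each cross term a_i * a_j (i < j) appears twice in a^2, so it is computed
    # once and doubled: 2 * 2^(32(i+j)) = 2^33 * 2^(32(i+j-1)).
    cross = sum((seg[i] * seg[j]) << (32 * (i + j - 1))
                for i in range(8) for j in range(i + 1, 8))
    products = 8 + 28
    return squares + (cross << 33), products
```

A full 256-bit × 256-bit product from 32-bit segments would need 64 partial products, so merging the symmetric cross terms saves close to half of them, in line with the saving stated for the modular squarer unit.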
[0075] Table 4 Calculation of c2 in the first three clock cycles
[0076]
[0077] Table 5 Calculation of c2 in the fourth clock cycle
[0078]
| Bit position | 320 | 319-256 | 255-192 | 191-128 | 127-64 | 63-0 |
| --- | --- | --- | --- | --- | --- | --- |
| c2 | 0 | result9 | result7 | result5 | result3 | result1 |
[0079] Table 6 Calculation of c3
[0080]
| Bit position | 279-216 | 215-152 | 151-98 | 97-34 | 33-0 |
| --- | --- | --- | --- | --- | --- |
| c3 | result8 | result6 | result4 | result2 | |
[0081] The circuit structure of the modular squarer unit is shown in Figure 5. The modular squarer unit includes a square calculation component and a modular reduction calculation component. The output of the square calculation component is the input of the modular reduction calculation component, and the output of the modular reduction calculation component flows into the first modular adder unit and the second modular adder unit. In this embodiment, the circuit structure of the modular reduction calculation component is basically the same as that of the modular reduction component in the modular multiplier unit, but the CSA adders are modified to match the bit widths of the data. CSA258 and CSA289 are used to calculate the addition in $u = u + squ(a)$, and the 64-bit adder and 96-bit adder are used to calculate $r = l \cdot 2^{-64} \bmod p_{256}$. CSA257 is used to calculate $R(x) = (c + r) \bmod p_{256}$. The result is stored in registers d2 and d3 and participates in the calculation of the next loop. After four loops, the result of CSA257 is sent to the first and second modular adder units to perform the final modular addition, obtaining the Montgomery modular squaring result $a^2 \cdot r^{-1} \bmod p_{256}$.
[0082] The result of CSA257 must be checked before it is output to the first and second modular adder units. The algorithm flow of the modular squarer unit is shown as Algorithm 4 in Figure 8. When Algorithm 4 iterates to step 5, the highest bit of c2 is 0, and the value of the upper 65 bits of c2 is $a_7 \cdot a_7 \le (2^{32} - 1)(2^{32} - 1) = 2^{64} - 2^{33} + 1$, so $c2 < 2^{320} - 2^{289} + 2^{256} + 2^{256} - 1$. After the data passes through CSA258 and CSA289, $u = u + squ(a) < 2^{320} - 2^{288} - 2^{287}$. Then, after the 64-bit shift register, $c < 2^{256} - 2^{224} - 2^{223} < p_{256}$. Finally, at CSA257, since $r = (l \ll 192) - (l \ll 160) + (l \ll 128) + (l \ll 32) < p_{256}$, it follows that $c + r < 2 \cdot p_{256}$, so modular addition can be applied directly to the output of CSA257.
[0083] The modular squaring process is shown in Figure 9. The squaring component, the modular reduction component, and the first and second modular adder units form a three-stage pipelined architecture to perform modular squaring operations. A single modular squaring operation requires 6 clock cycles, and only 4 clock cycles per operation when pipelined. The main resource overhead of the modular squarer unit is nine 32-bit × 32-bit multipliers, and its area is only 0.66 times that of a modular multiplier unit using four 64-bit × 64-bit multipliers.
[0084] As shown in Table 1, calculating PA requires 8M+3S, and calculating PD requires 4M+6S. With the NAF-encoded point multiplication algorithm, if only modular multiplication is used to calculate PM, with every modular squaring replaced by a modular multiplication, the total computation time is 10M*n + 11M*n / 3 = 13.67M*n. If modular multiplication and modular squaring are instead performed in parallel, the computation time is 5M*n + 8M*n / 3 = 7.67M*n, which is only 0.56 times the original amount. Therefore, designing a modular squarer unit for calculating PM is essential.
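The arithmetic in this comparison can be reproduced exactly with fractions. The per-operation costs (10M and 11M without a squarer, 5M and 8M with one) and the one-PA-per-three-bits rate are taken from the paragraph above; the function name is an assumption for the example.

```python
from fractions import Fraction


def pm_cost_per_bit(pd_cost, pa_cost, pa_per_bit=Fraction(1, 3)):
    """Cost of point multiplication per scalar bit, in units of one modular multiplication."""
    return pd_cost + pa_cost * pa_per_bit


multiplier_only = pm_cost_per_bit(10, 11)   # every squaring done on the multiplier
with_squarer = pm_cost_per_bit(5, 8)        # multiplier and squarer working in parallel
ratio = with_squarer / multiplier_only

print(float(multiplier_only), float(with_squarer), float(ratio))
```

The exact values are 41/3 and 23/3 multiplications per bit, so the ratio is 23/41, which rounds to the 0.56 quoted in the text.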
[0085] In this embodiment, modular inversion is required during coordinate transformation. Modular inversion is typically implemented using the Extended Euclidean Algorithm (EEA). Common EEA variants include binary and quaternary algorithms. The binary algorithm is well suited to hardware implementation, requiring only iterative subtraction and shift operations to calculate the modular inverse, and needs an average of 363 clock cycles per modular inverse. Compared to the binary algorithm, the quaternary algorithm scans 2 bits per cycle instead of 1 bit, theoretically achieving twice the speed. However, it consumes significant circuit resources and is unsuitable for hardware implementation. Therefore, referring to Figure 10, this embodiment proposes a binary-quaternary hybrid algorithm as an improved modular inversion algorithm. It integrates some of the decision conditions of the quaternary algorithm into the flow of the binary algorithm. One modular inversion requires about 300 clock cycles, which is only 83% of that of the binary algorithm.
[0086] In Algorithm 5 shown in Figure 10, the division by 4 and division by 2 of u and v can be implemented with a simple right shift. For the division by 4 and division by 2 of x1 and x2, the least significant bits of x1 and x2 are examined first, then adjusted to 0 using +p, +2p, or +3p operations, and finally the right shift is applied. The modular inversion unit uses two adders to update the values of registers u and v, and two modular adder units to update the values of registers x1 and x2. The structure of the adders is shown in Figure 11. To minimize the area of the ECC circuit system, in this embodiment registers u and v are multiplexed with the registers that store the input affine coordinate data $P = (x_p, y_p)$, and registers x1 and x2 are multiplexed with the input registers of the modular adder unit.
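The idea of mixing divide-by-4 steps into the binary algorithm can be sketched as follows. Algorithm 5 is not reproduced in the text, so this is not the patented flow: it is the well-known binary extended Euclidean inversion with one assumed extension (shift by two bits whenever u or v is divisible by 4), and all function names are hypothetical.

```python
def half_mod(x, p):
    """x / 2 mod p: add p first when x is odd so the right shift is exact."""
    return (x + p) >> 1 if x & 1 else x >> 1


def quarter_mod(x, p):
    """x / 4 mod p: add 0, p, 2p, or 3p so the two low bits become zero, then shift."""
    t = (-x * pow(p, -1, 4)) % 4
    return (x + t * p) >> 2


def mod_inverse(a, p):
    """Return a^(-1) mod p for an odd prime p and a not divisible by p."""
    if a % p == 0:
        raise ValueError("a has no inverse modulo p")
    u, v, x1, x2 = a % p, p, 1, 0       # invariants: a*x1 = u and a*x2 = v (mod p)
    while u != 1 and v != 1:
        while u & 1 == 0:
            if u & 3 == 0:
                u, x1 = u >> 2, quarter_mod(x1, p)
            else:
                u, x1 = u >> 1, half_mod(x1, p)
        while v & 1 == 0:
            if v & 3 == 0:
                v, x2 = v >> 2, quarter_mod(x2, p)
            else:
                v, x2 = v >> 1, half_mod(x2, p)
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return (x1 if u == 1 else x2) % p
```

Only shifts, additions, and subtractions are used, which is why this family of algorithms maps well onto the adders and modular adder units described above.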
[0087] The modular adder unit serves as a common arithmetic resource of the ECC point multiplication processor and is designed to meet all functional requirements of modular inversion, modular multiplication, modular squaring, point addition, point doubling, and point multiplication. Starting from two cascaded 257-bit adders, several data selectors and shift registers are added to implement operations such as (a+b) mod p, (a-b) mod p, a/4 mod p, a/2 mod p, (a-b)/2 mod p, and a+b. The modular adder unit has four sets of output interfaces, so it can supply input data to other modules as required.
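A behavioural model can make the operating modes concrete. The sketch below assumes the listed modes are addition, subtraction, halving, and quartering modulo p, plus an unreduced addition; the function name `modular_adder`, the mode strings, and the helper `_shift_mod` are hypothetical, and the code models only the arithmetic, not the selector and shift-register wiring of the actual circuit.

```python
def _shift_mod(x, p, bits):
    """Return x / 2**bits mod p for odd p by adding multiples of p, then shifting."""
    mask = (1 << bits) - 1
    k = 0
    while (x + k * p) & mask:   # terminates with k < 2**bits because p is odd
        k += 1
    return (x + k * p) >> bits


def modular_adder(op, a, b, p):
    """Behavioural model; assumes 0 <= a, b < p and p odd (illustrative only)."""
    if op == "add":             # a + b without reduction
        return a + b
    if op == "addmod":          # (a + b) mod p
        s = a + b               # first adder
        t = s - p               # second adder: trial subtraction of p
        return t if t >= 0 else s
    if op == "submod":          # (a - b) mod p
        d = a - b               # first adder
        return d + p if d < 0 else d   # second adder: corrective addition of p
    if op == "div2mod":         # a / 2 mod p
        return _shift_mod(a, p, 1)
    if op == "div4mod":         # a / 4 mod p
        return _shift_mod(a, p, 2)
    if op == "subdiv2mod":      # (a - b) / 2 mod p
        d = a - b
        d = d + p if d < 0 else d
        return _shift_mod(d, p, 1)
    raise ValueError("unknown mode: " + op)


print(modular_adder("addmod", 90, 20, 97))  # 13
print(modular_adder("submod", 20, 90, 97))  # 27
```

The trial subtraction in `addmod` shows why two cascaded adders suffice for a reduced sum: the second adder computes s - p and a selector picks whichever result lies in range.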
[0088] Specifically, referring to Figure 12, the first modular adder unit includes a first adder, a second adder, multiple shift registers, and multiple data selectors. The first adder and the second adder are connected in series. Several data selectors are connected in stages ahead of the first adder and control the data input to it. The shift registers and data selectors are interconnected and feed the input of the second adder. The output of the second adder is connected to one data selector and then to two shift registers. The structure of the second modular adder unit is the same as that of the first modular adder unit.
[0089] This invention exploits the fact that modular multiplication, modular squaring, and modular addition can be computed in parallel. It expands the point addition (PA) and point doubling (PD) operations into a series of finite-field computation steps and uses the controller module to reschedule and coordinate the use of each arithmetic unit, which improves the computational speed and data throughput of the system. Because the computation order within PA and PD is constrained by data dependencies, at the beginning of an operation the controller module schedules only modular squaring, leaving the modular multiplier unit idle, and at the end of an operation it schedules only modular multiplication and modular addition, leaving the modular squarer unit idle. By analyzing the possible sequences of adjacent point addition and point doubling operations in the NAF point multiplication algorithm, the idle slots of the modular multiplier and modular squarer units are filled by chaining operations head to tail, so that the next PA or PD can begin before the previous one has finished. The steps of the PA and PD calculations are shown in Tables 7 and 8. After optimization, one PD takes 21 clock cycles and one PA takes 32 clock cycles. A complete point multiplication (PM) takes approximately 21 × 256 + 32 × 256/3 + 300 ≈ 8406 clock cycles.
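The cycle estimate can be reproduced with a short sketch. It assumes a 256-bit scalar and the standard NAF recoding, in which on average about one digit in three is nonzero; that average is consistent with the 256/3 term in the estimate. The function names `naf` and `estimate_pm_cycles` are hypothetical.

```python
def naf(k):
    """Non-adjacent form of k >= 0, least significant digit first.

    Digits are in {-1, 0, 1} and no two adjacent digits are both nonzero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)     # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def estimate_pm_cycles(bits=256, pd_cycles=21, pa_cycles=32, inversion_cycles=300):
    """One PD per scalar bit, one PA per nonzero NAF digit (about bits / 3), one inversion."""
    return pd_cycles * bits + pa_cycles * bits / 3 + inversion_cycles


print(naf(7))                      # [-1, 0, 0, 1], i.e. 7 = 8 - 1
print(int(estimate_pm_cycles()))   # 8406
```

For a specific scalar the PA count is the number of nonzero digits returned by `naf`, so the actual cycle count varies around this average.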
[0090] Table 7 Calculation Flow of PA
[0091]
[0092] Table 8 Calculation Flow of PD
[0093]
[0094] This embodiment proposes a prime field elliptic curve cryptography coprocessor that performs modular multiplication and modular squaring in parallel. In a 130 nm CMOS standard process, the maximum operating frequency reaches 390 MHz, the encryption time is 0.0215 ms, the area is 222.4 kGE, and the area-time product is 4.78 (kGE × ms). This balance of area and speed makes the coprocessor well suited to high-speed encryption and decryption applications.
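The quoted figures are mutually consistent, as the following arithmetic sketch shows. It assumes the encryption time corresponds to one point multiplication of about 8406 clock cycles at the maximum clock frequency, and that the 4.78 figure is area multiplied by time; the function names are hypothetical.

```python
def pm_latency_ms(cycles, clock_hz):
    """Latency in milliseconds for a given cycle count and clock frequency."""
    return cycles / clock_hz * 1e3


def area_time_product(area_kge, time_ms):
    """Area-time product in kGE x ms; a smaller value means a better trade-off."""
    return area_kge * time_ms


latency = pm_latency_ms(8406, 390e6)
print(latency)                            # about 0.0216 ms before truncation
print(area_time_product(222.4, 0.0215))   # about 4.78
```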
Claims
1. A prime field elliptic curve cryptography coprocessor, characterized in that it includes a register module, a NAF encoding module, an arithmetic module, and a controller module; the register module is used to receive the input raw data and to store the coordinate data and intermediate data generated during the point multiplication operation; the NAF encoding module is used to execute the NAF-encoded point multiplication algorithm to reduce the number of point addition and point doubling operations in the point multiplication process; the arithmetic module includes a modular multiplier unit, a modular squarer unit, a first modular adder unit, and a second modular adder unit; the modular multiplier unit is connected to the first modular adder unit and the second modular adder unit to form a three-stage pipeline structure that performs modular multiplication; the modular squarer unit is connected to the first modular adder unit and the second modular adder unit to form a three-stage pipeline structure that performs modular squaring; the modular multiplier unit and the modular squarer unit operate in parallel; the controller module is used to control the data flow of the point multiplication operation in a pipelined manner; the modular multiplier unit includes a multiplication component and a modular reduction and subtraction component, the output of the multiplication component is the input of the modular reduction and subtraction component, and the output of the modular reduction and subtraction component flows into the first modular adder unit and the second modular adder unit; the modular squarer unit includes a squaring component and a modular reduction component, the output of the squaring component is the input of the modular reduction component, and the output of the modular reduction component flows into the first modular adder unit and the second modular adder unit; the first modular adder unit includes a first adder, a second adder, multiple shift registers, and multiple data selectors; the first adder and the second adder are connected in series; the multiple data selectors are connected in stages and connected to the first adder to control the data input to the first adder; the multiple shift registers and the multiple data selectors are interconnected and connected to the input terminal of the second adder; the output terminal of the second adder is connected to one of the data selectors and then to two of the shift registers; and the structure of the second modular adder unit is the same as that of the first modular adder unit.
2. The prime field elliptic curve cryptography coprocessor according to claim 1, characterized in that the register module includes a random number register and a coordinate register, and the random number register is connected to the NAF encoding module; the random number register is used to store random number data; and the coordinate register is used to store coordinate data.
3. The prime field elliptic curve cryptography coprocessor according to claim 1, characterized in that the area of the modular squarer unit is 0.66 times the area of the modular multiplier unit.
4. The prime field elliptic curve cryptography coprocessor according to claim 1, characterized in that the output of the first modular adder unit has four sets of interfaces for outputting the corresponding data.