Efficient implementation system and method of SM4 cryptographic algorithm based on FPGA
By combining a cyclic key expansion architecture and a pipelined encryption architecture on an FPGA, and employing an optimized design of lookup table and algebraic S-boxes, the problems of high resource consumption and low efficiency in existing technologies are solved, and a more efficient implementation of the SM4 cryptographic algorithm is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2024-01-03
- Publication Date
- 2026-06-19
AI Technical Summary
Existing FPGA-based implementations of the SM4 cryptographic algorithm suffer from high resource consumption and low efficiency. Especially in scenarios with limited hardware resources, the loop-based architecture has low throughput and efficiency, while the pipeline-based architecture has high throughput but excessive resource consumption.
A cyclic key expansion architecture using a lookup-table S-box is combined with a pipelined encryption architecture using a pipelined algebraic S-box. Each round operation submodule performs the XOR operations, algebraic S-box substitution, and cyclic shifts of one SM4 round, so that encryption proceeds as an efficient pipeline.
It achieves a higher clock frequency, higher encryption throughput, and lower hardware resource consumption, improving encryption efficiency on low-resource devices.
Figure CN117938360B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of cryptography technology, and specifically relates to an efficient implementation system and method for the SM4 cryptographic algorithm based on FPGA.
Background Technology
[0002] FPGA-based implementations of the SM4 algorithm fall into two categories: cyclic (iterative) and pipelined. Among cyclic designs, Jin et al. implemented a cyclic SM4 on a Xilinx Virtex-4 FPGA that completes one encryption every 36 clock cycles: 4 cycles for data buffering and 32 cycles for encryption. Gao et al. used a Moore state machine to control the encryption process and stored pre-computed round keys in registers, saving the time required for key expansion. Guan et al. and He et al. used a Moore state machine to control data processing, with the encryption module and the round key expansion module running in parallel; each round key produced by the round key generation module is passed directly to the encryption module, so no extra registers are needed to store the round keys and the encryption module never has to wait. Among pipelined designs, Jin et al. used 128 dual-port BRAMs to store the round keys and the lookup-table S-boxes required for encryption. Gao et al. implemented SM4 with a 32-stage pipeline, improving encryption throughput and speed. Guan et al. halved the 32-stage pipeline and combined it with two iterations of computation to complete the 32 rounds of encryption.
[0003] S-box substitution is the step of SM4 that provides confusion and is therefore crucial to its security. How the S-box is implemented also affects the efficiency of SM4 on FPGAs. S-box implementations fall into two classes: table lookup and algebraic computation. Existing FPGA implementations of SM4, both cyclic and pipelined, use table lookup, in which each input indexes a table to obtain the corresponding output. However, table lookup occupies a large chip area and is inefficient. Liu et al. analyzed the properties of the SM4 S-box and proposed constructing it from finite field inversion. Compared with table lookup, this approach computes the S-box output with algebraic operations, giving a more compact circuit suited to designs with strict chip area constraints. Wang et al. proposed an S-box implementation based on finite field inversion. It reduces hardware resource consumption; however, on FPGAs the heavy computation and complex combinational logic cause long circuit delay, which significantly lowers the achievable clock frequency of FPGA-based SM4 implementations.
[0004] Analysis reveals two main problems with existing FPGA-based SM4 implementations. First, existing encryption architectures cannot balance resource consumption against performance: cyclic architectures suit resource-constrained scenarios but have low throughput and efficiency, while pipelined architectures have high throughput but consume excessive resources. Second, existing S-box implementations consume many hardware resources, occupy a large chip area, and are inefficient.
Summary of the Invention
[0005] To address the aforementioned problems in the existing technology, this invention provides an efficient implementation system and method for the SM4 cryptographic algorithm based on FPGA. The technical problem to be solved by this invention is achieved through the following technical solution:
[0006] In a first aspect, the present invention provides an efficient implementation system for the SM4 cryptographic algorithm based on FPGA, comprising: a round key expansion module and an encryption module, wherein the encryption module comprises 32 cascaded round operation sub-modules; wherein,
[0007] The round key expansion module is used to generate 32 round keys based on the key input by the user, and input the 32 round keys into the 32 round operation sub-modules in the encryption module respectively;
[0008] The encryption module is used to generate ciphertext based on the processing results obtained by the 32 round operation sub-modules performing round operations on the input data using their own round keys.
[0009] In one embodiment of the present invention, the encryption module further includes a register corresponding to each round operation submodule, and each round operation submodule includes a round operation unit and an algebraic S-box, wherein:
[0010] The round operation unit divides its input data bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$; XORs blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ with its own round key $RK_i$; divides the resulting first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$; feeds each of these blocks into an algebraic S-box; cyclically shifts the permutation data output by the algebraic S-boxes; XORs the cyclically shifted permutation data with block $K_i$; and stores the processing result of the $i$-th round operation submodule in the corresponding register, where $i = 0, 1, 2, \ldots, 31$;
[0011] The input data of each round operation submodule is the processing result of the previous round operation submodule; the input data of the 0th round operation submodule is the plaintext to be encrypted, and the processing result of the 31st round operation submodule is the ciphertext of that plaintext.
[0012] In a second aspect, the present invention provides an efficient implementation method of the SM4 cryptographic algorithm based on FPGA, which is applied to the system described above;
[0013] The efficient implementation method of the FPGA-based SM4 cryptographic algorithm includes:
[0014] Obtain the key input by the user and the plaintext to be encrypted;
[0015] Generate 32 round keys $RK_i$ from the key, $i = 0, 1, 2, \ldots, 31$;
[0016] Use the 32 round keys $RK_i$ to perform 32 rounds of operations on the plaintext to be encrypted, obtaining its ciphertext.
[0017] In one embodiment of the present invention, each round of operation processes the input data according to the following steps:
[0018] Divide the input data of the $i$-th round operation bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$;
[0019] XOR blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ with the round key $RK_i$ of the $i$-th round operation to obtain the first data;
[0020] Divide the first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$, and feed each block into an algebraic S-box to obtain the permutation data;
[0021] Concatenate the permutation data of the algebraic S-boxes and cyclically shift the result;
[0022] XOR the cyclically shifted permutation data with block $K_i$, and store the result of the $i$-th round operation in the corresponding register.
[0023] In one embodiment of the present invention, before the step of dividing the input data of the $i$-th round operation bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$, the method further includes:
[0024] Obtaining the processing result of the $(i-1)$-th round operation from the register of the $(i-1)$-th round operation submodule and using it as the input data of the $i$-th round operation.
[0025] The input data for the 0th round of operation is the plaintext to be encrypted, and the processing result of the 31st round of operation is the ciphertext after the plaintext has been encrypted.
[0026] In one embodiment of the present invention, the step of XORing blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ with the round key $RK_i$ of the $i$-th round operation to obtain the first data includes:
[0027] XOR blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ to obtain the first XOR result;
[0028] XOR the first XOR result with the round key $RK_i$ of the $i$-th round operation to obtain the first data.
[0029] In one embodiment of the present invention, the step of dividing the first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$ and feeding each block into an algebraic S-box to obtain the permutation data includes:
[0030] Divide the first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$;
[0031] Feed blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$ into the algebraic S-boxes to obtain the first permutation data corresponding to $L_i$, the second corresponding to $L_{i+1}$, the third corresponding to $L_{i+2}$, and the fourth corresponding to $L_{i+3}$.
[0032] In one embodiment of the present invention, the step of concatenating the permutation data of the algebraic S-boxes and cyclically shifting the result includes:
[0033] Concatenate the first, second, third, and fourth permutation data, then cyclically shift the concatenated data left by 2, 10, 18, and 24 bits respectively to obtain the first, second, third, and fourth shifted data.
[0034] In one embodiment of the present invention, the step of XORing the cyclically shifted permutation data with block $K_i$ and storing the result of the $i$-th round operation in the register includes:
[0035] Perform an XOR operation between the first shifted data and the second shifted data to obtain the second data;
[0036] The third shifted data is XORed with the fourth shifted data to obtain the third data.
[0037] The second data, the third data, and the concatenated data are XORed to obtain the fourth data.
[0038] XOR the fourth data with block $K_i$ to obtain block $K_{i+4}$, the processing result of the $i$-th round operation;
[0039] Store block $K_{i+4}$, the processing result of the $i$-th round operation, in the register.
[0040] In one embodiment of the present invention, the step of dividing the input data of the $i$-th round operation bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ includes:
[0041] Obtain the input data of the $i$-th round operation, the input data being 128 bits;
[0042] Use bits 0-31 of the input data as block $K_i$, bits 32-63 as block $K_{i+1}$, bits 64-95 as block $K_{i+2}$, and bits 96-127 as block $K_{i+3}$.
[0043] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0044] This invention provides an efficient implementation system and method for the SM4 cryptographic algorithm based on FPGA. The system combines a cyclic key expansion architecture that uses a lookup-table S-box with a pipelined encryption architecture that uses a pipelined algebraic S-box. The resulting encryption architecture achieves a higher clock frequency, higher encryption throughput, and lower hardware resource consumption, overcoming the low encryption efficiency and high hardware resource consumption of existing technologies and allowing low-resource hardware devices to encrypt more efficiently.
[0045] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
Description of the Drawings
[0046] Figure 1 is a schematic structural diagram of the efficient FPGA-based SM4 implementation system provided in an embodiment of the present invention;
[0047] Figure 2 is a schematic structural diagram of the round key generation module provided in an embodiment of the present invention;
[0048] Figure 3 is a schematic structural diagram of the round operation submodule provided in an embodiment of the present invention;
[0049] Figure 4 is a flowchart of the efficient FPGA-based SM4 implementation method provided in an embodiment of the present invention;
[0050] Figure 5 is a schematic diagram of the round operation provided in an embodiment of the present invention;
[0051] Figure 6 is a schematic diagram of the processing procedure of an algebraic S-box in the related art;
[0052] Figure 7 is a schematic diagram of the processing procedure of the algebraic S-box provided in an embodiment of the present invention;
[0053] Figure 8 is a schematic diagram of the round key expansion process provided in an embodiment of the present invention.
Detailed Description of the Embodiments
[0054] The present invention will be further described in detail below with reference to specific embodiments, but the implementation of the present invention is not limited thereto.
[0055] Figure 1 is a schematic structural diagram of the efficient FPGA-based SM4 implementation system provided in an embodiment of the present invention. As shown in Figure 1, this embodiment provides an efficient implementation system for the SM4 cryptographic algorithm based on FPGA, including a round key expansion module and an encryption module, wherein the encryption module includes 32 cascaded round operation sub-modules, wherein:
[0056] The round key expansion module is used to generate 32 round keys based on the key input by the user, and input the 32 round keys into the 32 round operation sub-modules in the encryption module respectively;
[0057] The encryption module is used to generate ciphertext based on the processing results obtained by the 32 round operation sub-modules performing round operations on the input data using their own round keys.
[0058] In this embodiment, the round key expansion module includes 34 registers: 32 store the expanded round keys, one stores the key expansion completion signal, and one serves as a counter of completed round operations. The module also includes four groups of multiplexers with four functions: first, checking the reset signal and, on reset, initializing the counter to 0; second, checking whether the current operation is the first round and, if so, assigning the initial value of the data to be processed; third, selecting, according to the current round count, the register that receives the new round key; and fourth, checking whether the current operation is the last round and, if so, asserting the key expansion completion signal. The round key expansion module further includes a parameter acquisition module, consisting of a multiplexer, which outputs the parameter $CK_i$ required by the $i$-th round operation according to the round count.
[0059] The efficient FPGA-based SM4 implementation system combines a cyclic architecture with a multi-stage pipeline architecture. Figure 2 is a schematic structural diagram of the round key generation module provided in an embodiment of the present invention. Specifically, as shown in Figure 2, the round key expansion module adopts a cyclic architecture: a single round key expansion module is instantiated and reused iteratively for 32 round operations to obtain the 32 round keys, which reduces hardware resource usage. The S-box substitution during round key generation is implemented with a lookup-table S-box to reduce circuit delay and raise the key expansion frequency. The encryption module, in turn, contains 32 cascaded round operation sub-modules whose registers form a 32-stage pipeline, so that plaintext blocks are encrypted in parallel through the 32 pipeline stages. Data therefore never waits between round operations, which improves the overall throughput of the circuit.
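The throughput difference between the cyclic and pipelined organisations described above can be illustrated with a small cycle-count model. This is a minimal sketch, not the patent's RTL: it assumes each of the 32 round stages takes exactly one clock, that a new plaintext block can enter every clock, and that a cyclic core spends one clock per round plus an optional fixed `overhead` (Jin et al.'s 36-cycle figure corresponds to `overhead=4`). The function names `pipelined_cycles` and `cyclic_cycles` are hypothetical.

```python
from collections import deque


def pipelined_cycles(num_blocks: int, stages: int = 32) -> int:
    """Clock cycles until the last of num_blocks leaves a `stages`-deep pipeline."""
    regs = [None] * stages              # one pipeline register per round stage
    pending = deque(range(num_blocks))  # plaintext blocks waiting to enter
    finished = cycles = 0
    while finished < num_blocks:
        cycles += 1
        # Clock edge: a new block (or a bubble) enters stage 0 and every
        # other register captures the output of the stage before it.
        regs = [pending.popleft() if pending else None] + regs[:-1]
        if regs[-1] is not None:        # last register now holds a ciphertext
            finished += 1
    return cycles


def cyclic_cycles(num_blocks: int, rounds: int = 32, overhead: int = 0) -> int:
    """A single reused round unit: each block occupies it for every round."""
    return num_blocks * (rounds + overhead)


for n in (1, 1000):
    print(n, pipelined_cycles(n), cyclic_cycles(n))
```

After the 32-cycle fill latency the pipeline completes one block per clock, so for long streams it approaches a 32x advantage over the cyclic core at the same frequency, at the cost of 32 copies of the round logic. If registers are also inserted inside each algebraic S-box, `stages` would grow accordingly: latency increases, but the steady-state rate stays at one block per clock.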
[0060] Optionally, the encryption module includes multiple registers corresponding to each round operation submodule. The round operation submodule includes: a round operation unit and an algebraic S-box; wherein,
[0061] The round operation unit divides its input data bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$; XORs blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ with its own round key $RK_i$; divides the resulting first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$; feeds each of these blocks into an algebraic S-box; cyclically shifts the permutation data output by the algebraic S-boxes; XORs the cyclically shifted permutation data with block $K_i$; and stores the processing result of the $i$-th round operation submodule in a register;
[0062] The input data for the round operation submodule is the processing result of the previous round operation submodule. The input data for the 0th round operation submodule is the plaintext to be encrypted. The processing result of the 31st round operation submodule is the ciphertext after the plaintext has been encrypted.
[0063] In this embodiment, the round operation submodule includes a round operation unit and an algebraic S-box; the algebraic S-box implements the S-box substitution. Figure 3 is a schematic structural diagram of the round operation submodule provided in an embodiment of the present invention. Optionally, as shown in Figure 3, the algebraic S-box includes a first merging operation subunit, a composite field inversion subunit, and a second merging operation subunit. The first merging operation subunit consists of a group of XOR circuits; the composite field inversion subunit consists of an array of XOR circuits, AND gate circuits, and shift circuits; and the second merging operation subunit consists of a group of XOR circuits. Registers placed between the first merging operation subunit, the composite field inversion subunit, and the second merging operation subunit turn the algebraic S-box of this embodiment into a pipelined algebraic S-box.
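The composite field inversion subunit ultimately computes a multiplicative inverse, so a slow behavioural reference in $GF(2^8)$ is handy for exhaustively checking the map, invert, and map-back chain of a hardware inverter. A minimal sketch, assuming the field polynomial $x^8+x^7+x^6+x^5+x^4+x^2+1$ (0x1F5) from the published algebraic description of the SM4 S-box (the text above does not state it), and computing the inverse as $a^{254}$ rather than with a tower-field circuit; `gf_mul` and `gf_inv` are hypothetical names.

```python
SM4_POLY = 0x1F5  # assumed field polynomial x^8+x^7+x^6+x^5+x^4+x^2+1


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^8) modulo SM4_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # reduce as soon as the degree reaches 8
            a ^= SM4_POLY
    return result


def gf_inv(a: int) -> int:
    """Inverse as a^254 (nonzero elements form a group of order 255); 0 maps to 0."""
    result, base, e = 1, a, 254
    while e:                   # square-and-multiply exponentiation
        if e & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        e >>= 1
    return result if a else 0
```

A hardware inverter's outputs, mapped back through the inverse isomorphism, can then be compared against `gf_inv` for all 256 inputs.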
[0064] Figure 4 is a flowchart of the efficient FPGA-based SM4 implementation method provided in an embodiment of the present invention. As shown in Figure 4, this embodiment also provides an efficient implementation method for the SM4 cryptographic algorithm based on FPGA, which is applied to the efficient FPGA-based SM4 implementation system described above.
[0065] The method includes:
[0066] S1. Obtain the key input by the user and the plaintext to be encrypted;
[0067] S2. Generate 32 round keys $RK_i$ from the key, $i = 0, 1, 2, \ldots, 31$;
[0068] S3. Use the 32 round keys $RK_i$ to perform 32 rounds of operations on the plaintext to be encrypted, obtaining its ciphertext.
[0069] Figure 5 is a schematic diagram of the round operation provided in an embodiment of the present invention. Every round operation follows the same process; as shown in Figure 5, the input data is processed in the following steps:
[0070] Divide the input data of the $i$-th round operation bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$;
[0071] XOR blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ with the round key $RK_i$ of the $i$-th round operation to obtain the first data;
[0072] Divide the first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$, and feed each block into an algebraic S-box to obtain the permutation data;
[0073] Concatenate the permutation data of the algebraic S-boxes and cyclically shift the result;
[0074] XOR the cyclically shifted permutation data with block $K_i$, and store the result of the $i$-th round operation in a register.
[0075] Optionally, before the step of dividing the input data of the $i$-th round operation bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$, the method further includes:
[0076] Obtaining the processing result of the $(i-1)$-th round operation from the register of the $(i-1)$-th round operation submodule and using it as the input data of the $i$-th round operation.
[0077] In this process, the input data for the 0th round of operation is the plaintext to be encrypted, and the processing result of the 31st round of operation is the ciphertext after the plaintext has been encrypted.
[0078] Optionally, the step of XORing blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ with the round key $RK_i$ of the $i$-th round operation to obtain the first data includes:
[0079] XOR blocks $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ to obtain the first XOR result;
[0080] XOR the first XOR result with the round key $RK_i$ of the $i$-th round operation to obtain the first data.
[0081] Optionally, the step of dividing the first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$ and feeding each block into an algebraic S-box to obtain the permutation data includes:
[0082] Divide the first data bitwise into blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$;
[0083] Feed blocks $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$ into the algebraic S-boxes to obtain the first permutation data corresponding to $L_i$, the second corresponding to $L_{i+1}$, the third corresponding to $L_{i+2}$, and the fourth corresponding to $L_{i+3}$.
[0084] Figure 6 is a schematic diagram of the processing procedure of an algebraic S-box in the related art, and Figure 7 is a schematic diagram of the processing procedure of the algebraic S-box provided in an embodiment of the present invention. As shown in Figures 6 and 7, compared with the existing algebraic S-box, this embodiment optimizes the algebraic S-box in three ways: first, it merges the isomorphic mapping with the affine transformation, and the inverse isomorphic mapping with the inverse affine transformation; second, it reduces the amount of computation by selecting the optimal merged calculation matrices, thereby reducing the overall hardware resource consumption; and third, it divides the algebraic S-box into separate calculation modules and inserts registers between them to form a pipelined algebraic S-box, thereby raising the overall clock frequency.
[0085] These improvements are explained in detail below.
[0086] (1) Combining isomorphic mappings and affine transformation matrices
[0087] The algebraic S-box computation consists of two affine transformations and one nonlinear finite field inversion. Its algebraic structure is $S(a) = I(a \cdot A_1 + C_1) \cdot A_2 + C_2$, where $a$ is the 8-bit S-box input and $I$ denotes multiplicative inversion in $GF(2^8)$. The isomorphic mapping can be merged with the affine transformation, and the inverse isomorphic mapping with the inverse affine transformation, to further improve computational efficiency and reduce resource consumption. Specifically, if the isomorphic mapping and the inverse isomorphic mapping are denoted $T_1$ and $T_2$ respectively, then $S(a) = I((a \cdot A_1 + C_1) \cdot T_1) \cdot T_2 \cdot A_2 + C_2$, and hence $S(a) = I(a \cdot A_1 \cdot T_1 + C_1 \cdot T_1) \cdot T_2 \cdot A_2 + C_2$, where the inversion $I$ is now carried out in the composite field. Writing $M_1 = A_1 \cdot T_1$, $N_1 = C_1 \cdot T_1$, and $M_2 = T_2 \cdot A_2$ gives $S(a) = I(a \cdot M_1 + N_1) \cdot M_2 + C_2$. The first merging operation is $a \cdot M_1 + N_1$, which merges the isomorphic mapping with the affine transformation; the second merging operation is $x \cdot M_2 + C_2$, applied to the inversion output $x$, which merges the inverse isomorphic mapping with the inverse affine transformation. The optimized algebraic S-box operation process of this embodiment is shown in Figure 7.
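The merging step relies only on the associativity and linearity of matrix products over $GF(2)$: $(a \cdot A_1 + C_1) \cdot T_1 = a \cdot (A_1 T_1) + C_1 T_1$. The sketch below checks this identity exhaustively for all 256 inputs. It is an illustration under stated assumptions: bytes are treated as row vectors, matrices are lists of eight row bitmasks, and because the actual $A_1$, $A_2$, $C_1$, $T_1$, $T_2$ and the merged matrices are not reproduced in this text, seeded random stand-in matrices are used. `vec_mat` and `mat_mul` are hypothetical helper names.

```python
import random


def vec_mat(a: int, m: list[int]) -> int:
    """Row vector a (bit k = coordinate k) times an 8x8 GF(2) matrix m.

    m[k] is row k stored as an 8-bit mask; the product is the XOR of
    the rows selected by the set bits of a.
    """
    out = 0
    for k in range(8):
        if (a >> k) & 1:
            out ^= m[k]
    return out


def mat_mul(x: list[int], y: list[int]) -> list[int]:
    """x.y in the row-vector convention: row k of x.y is (row k of x).y."""
    return [vec_mat(row, y) for row in x]


rng = random.Random(2024)


def rand_matrix() -> list[int]:
    """Stand-in 8x8 matrix; NOT the SM4 or isomorphism matrices."""
    return [rng.randrange(256) for _ in range(8)]


A1, T1, T2, A2 = rand_matrix(), rand_matrix(), rand_matrix(), rand_matrix()
C1 = rng.randrange(256)

# First merging operation: M1 = A1.T1 and N1 = C1.T1 replace two stages.
M1, N1 = mat_mul(A1, T1), vec_mat(C1, T1)
first_ok = all(vec_mat(vec_mat(a, A1) ^ C1, T1) == vec_mat(a, M1) ^ N1
               for a in range(256))

# Second merging operation: M2 = T2.A2 (C2 is added afterwards, unchanged).
M2 = mat_mul(T2, A2)
second_ok = all(vec_mat(vec_mat(x, T2), A2) == vec_mat(x, M2)
                for x in range(256))
print(first_ok, second_ok)
```

In hardware each merged product is a single XOR network, so two matrix stages collapse into one; which merged matrices are cheapest depends on the chosen basis, which is what the parameter search below addresses.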
[0088] (2) Selecting the optimal merge matrix
[0089] For the isomorphic mapping, a normal basis can be used to represent an element of $GF(2^8)$ as a linear polynomial over $GF(2^4)$: $g(y) = a_1 Y^{16} + a_0 Y$, where multiplication is performed modulo the irreducible polynomial $r(y) = y^2 + \tau y + \eta$; $[Y^{16}, Y]$, the two roots of $r(y) = 0$, is called a normal basis of this field. An element of $GF(2^4)$ can likewise be represented as a linear polynomial over $GF(2^2)$: $a(z) = b_1 Z^4 + b_0 Z$, with multiplication modulo the irreducible polynomial $t(z) = z^2 + \mu z + \rho$, where $[Z^4, Z]$ are the two roots of $t(z) = 0$. An element of $GF(2^2)$ can be represented as a linear polynomial over $GF(2)$: $b(w) = c_1 W^2 + c_0 W$, with multiplication modulo the irreducible polynomial $s(w) = w^2 + w + 1$, where $[W^2, W]$ are the two roots of $s(w) = 0$. The inverse isomorphic mapping matrix for the mapping from $GF(2^8)$ to $GF(((2^2)^2)^2)$ is formed from the basis elements $[Y^{16}Z^4W^2, Y^{16}Z^4W, Y^{16}ZW^2, Y^{16}ZW, YZ^4W^2, YZ^4W, YZW^2, YZW]$.
[0090] Different choices of parameters and roots for the irreducible polynomials $r(y) = y^2 + \tau y + \eta$, $t(z) = z^2 + \mu z + \rho$, and $s(w) = w^2 + w + 1$ yield different isomorphic and inverse isomorphic mapping matrices, hence different merged matrices $M_1$, $N_1$, and $M_2$, and therefore different first and second merging operations. When the optimized algebraic S-box is implemented on an FPGA, these different merging operations consume different amounts of hardware resources. All parameter combinations for which the algebraic S-box produces correct outputs were enumerated, yielding 8 combinations. Each combination was implemented on an FPGA in combinational logic and its hardware resource consumption analyzed, as shown in Table 1. Resource consumption is minimized, at 32 LUTs, when $[W^2, W]$, $[Z^4, Z]$, and $[Y^{16}, Y]$ are [0x5C, 0x5D], [0xC, 0xD], and [0xBF, 0xBE] respectively. The resulting first merging operation matrix is:
[0091]
[0092] The second merging operation matrix is:
[0093]
[0094] The algebraic S-box implemented on FPGA based on this parameter combination can reduce the amount of computation, thereby reducing the overall hardware resource consumption when implementing the SM4 cryptographic algorithm on FPGA, while improving overall efficiency.
[0095] Table 1 Resource consumption of the algebraic S-box for different parameter combinations
[0096]
| [W^2, W] | [Z^4, Z] | [Y^16, Y] | Resource consumption (LUTs) |
|---|---|---|---|
| [0x5C, 0x5D] | [0xC, 0xD] | [0xBE, 0xBF] | 35 |
| [0x5C, 0x5D] | [0xC, 0xD] | [0xBF, 0xBE] | 32 |
| [0x5C, 0x5D] | [0xD, 0xC] | [0xEE, 0xEF] | 86 |
| [0x5C, 0x5D] | [0xD, 0xC] | [0xEF, 0xEE] | 86 |
| [0x5D, 0x5C] | [0x50, 0x51] | [0x94, 0x95] | 33 |
| [0x5D, 0x5C] | [0x50, 0x51] | [0x95, 0x94] | 35 |
| [0x5D, 0x5C] | [0x51, 0x50] | [0x98, 0x99] | 85 |
| [0x5D, 0x5C] | [0x51, 0x50] | [0x99, 0x98] | 86 |
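The parameter pairs in Table 1 are elements of $GF(2^8)$, and each pair should be a pair of conjugate roots lying in the intended subfield: $W \in GF(4)$, $Z \in GF(16)$ but not $GF(4)$, and $Y$ outside $GF(16)$. The sketch below checks exactly that. It assumes the field polynomial $x^8+x^7+x^6+x^5+x^4+x^2+1$ (0x1F5) of the published algebraic SM4 S-box description, which the text does not state; `valid_tower` and the other names are hypothetical.

```python
SM4_POLY = 0x1F5  # assumed field polynomial x^8+x^7+x^6+x^5+x^4+x^2+1


def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo SM4_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= SM4_POLY
    return result


def gf_pow(a: int, e: int) -> int:
    """a**e in GF(2^8) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def valid_tower(w2: int, w: int, z4: int, z: int, y16: int, y: int) -> bool:
    """True if [W^2, W], [Z^4, Z], [Y^16, Y] are conjugate root pairs in
    GF(4), GF(16), GF(256), each outside the next smaller subfield."""
    w_ok = gf_pow(w, 4) == w and w not in (0, 1) and gf_pow(w, 2) == w2
    z_ok = gf_pow(z, 16) == z and gf_pow(z, 4) != z and gf_pow(z, 4) == z4
    y_ok = gf_pow(y, 16) != y and gf_pow(y, 16) == y16
    return w_ok and z_ok and y_ok


# Table 1 rows: ([W^2, W], [Z^4, Z], [Y^16, Y], LUTs)
TABLE_1 = [
    ((0x5C, 0x5D), (0xC, 0xD), (0xBE, 0xBF), 35),
    ((0x5C, 0x5D), (0xC, 0xD), (0xBF, 0xBE), 32),
    ((0x5C, 0x5D), (0xD, 0xC), (0xEE, 0xEF), 86),
    ((0x5C, 0x5D), (0xD, 0xC), (0xEF, 0xEE), 86),
    ((0x5D, 0x5C), (0x50, 0x51), (0x94, 0x95), 33),
    ((0x5D, 0x5C), (0x50, 0x51), (0x95, 0x94), 35),
    ((0x5D, 0x5C), (0x51, 0x50), (0x98, 0x99), 85),
    ((0x5D, 0x5C), (0x51, 0x50), (0x99, 0x98), 86),
]

for w_pair, z_pair, y_pair, luts in TABLE_1:
    print(luts, valid_tower(*w_pair, *z_pair, *y_pair))
```

This only confirms the algebraic consistency of each combination; the LUT counts themselves come from synthesising the resulting merged matrices and cannot be derived from this check.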
[0097] Optionally, the step of concatenating the permutation data of each algebraic S-box and performing a cyclic shift includes:
[0098] After concatenating the first, second, third, and fourth permutation data bit by bit, the concatenated data is then cyclically shifted left by 2 bits, 10 bits, 18 bits, and 24 bits respectively to obtain the first, second, third, and fourth shifted data.
[0099] Optionally, the step of XORing the cyclically shifted permutation data with block $K_i$ and storing the result of the $i$-th round operation in the corresponding register includes:
[0100] Perform an XOR operation between the first shifted data and the second shifted data to obtain the second data;
[0101] Perform an XOR operation between the third shifted data and the fourth shifted data to obtain the third data;
[0102] The second and third data are XORed with the concatenated data to obtain the fourth data.
[0103] XOR the fourth data with block $K_i$ to obtain block $K_{i+4}$, the processing result of the $i$-th round operation;
[0104] Store block $K_{i+4}$, the processing result of the $i$-th round operation, in the register.
[0105] Optionally, the step of dividing the input data of the $i$-th round operation bitwise into blocks $K_i$, $K_{i+1}$, $K_{i+2}$, and $K_{i+3}$ includes:
[0106] Obtain the input data of the $i$-th round operation; the input data is 128 bits.
[0107] Use bits 0-31 of the input data as block $K_i$, bits 32-63 as block $K_{i+1}$, bits 64-95 as block $K_{i+2}$, and bits 96-127 as block $K_{i+3}$.
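The splitting, XOR, and shift/XOR steps of one round can be sketched as follows. The sketch assumes, as in the SM4 standard, that $K_i$ to $K_{i+3}$ are the four 32-bit words of the 128-bit round input (bit 0 taken as the most significant bit) and that $L_i$ to $L_{i+3}$ are the four bytes of the 32-bit first data. The S-box outputs are passed in as arguments, so any S-box implementation can be plugged in; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split_words(block: int) -> list[int]:
    """128-bit round input -> [K_i, K_i+1, K_i+2, K_i+3], most significant first."""
    return [(block >> (96 - 32 * j)) & MASK32 for j in range(4)]


def split_bytes(word: int) -> list[int]:
    """32-bit first data -> [L_i, L_i+1, L_i+2, L_i+3], most significant first."""
    return [(word >> (24 - 8 * j)) & 0xFF for j in range(4)]


def rotl32(x: int, n: int) -> int:
    """Cyclic left shift of a 32-bit word by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def first_data(k1: int, k2: int, k3: int, rk: int) -> int:
    """First XOR result K_i+1 ^ K_i+2 ^ K_i+3, then XOR with RK_i."""
    return (k1 ^ k2 ^ k3) ^ rk


def next_block(k0: int, perm: list[int]) -> int:
    """K_i+4 from K_i and the first..fourth permutation data (S-box outputs)."""
    b = (perm[0] << 24) | (perm[1] << 16) | (perm[2] << 8) | perm[3]  # concatenate
    second = rotl32(b, 2) ^ rotl32(b, 10)   # first ^ second shifted data
    third = rotl32(b, 18) ^ rotl32(b, 24)   # third ^ fourth shifted data
    fourth = second ^ third ^ b             # ^ concatenated data
    return fourth ^ k0                      # ^ K_i
```

For example, with an all-zero round input every S-box sees 0x00, whose SM4 S-box output is 0xD6, and `next_block(0, [0xD6] * 4)` returns 0x5B5B5B5B.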
[0108] Figure 8 This is a schematic diagram illustrating the process of extending the round key according to an embodiment of the present invention. Please refer to... Figure 8 The calculation process of the round key and Figure 5 The round operation process is similar. Specifically, taking the generation of the round key RK0 as an example, the key input by the user is first divided into blocks MK bit by bit. i MK block i+1 MK block i+2 and block MK i+3 And using parameter FK i FK i+1 FK i+2 and FK i+3 Perform an XOR operation, then split the resulting fifth data bitwise into blocks ML. i ML block i+1 ML block i+2 and block ML i+3 Further study of block ML i+1 ML block i+2 and block ML i+3 Perform an XOR operation and then AND with the parameter CK. iPerform an XOR operation to obtain the sixth data; divide the sixth data into bitwise blocks ML'. i ML' i+1 ML' i+2 and block ML' i+3 Input the lookup table S-box respectively, and obtain the fifth, sixth, seventh, and eighth permutation data accordingly; concatenate the above fifth, sixth, seventh, and eighth permutation data, and perform an XOR operation on the concatenated data, the concatenated data after left circular shift by 13 bits, and the concatenated data after left circular shift by 23 bits, and then AND with ML. i Perform an XOR operation to obtain the round key RK0.
[0109] Round keys RK1, RK2, ..., RK31 are all computed by the same process, so the details are not repeated here.
[0110] The following simulation experiment further illustrates the efficient implementation system and method of the SM4 cryptographic algorithm based on FPGA provided by this invention.
[0111] Specifically, in this embodiment, the Xilinx xc7z100ffg900-2 device (Zynq-7000 family) is targeted, and simulation is performed in Vivado 2019.1.
[0112] The methods compared in the simulation experiments are: a lookup-table-based 32-stage pipelined SM4 encryption architecture (Scheme 1), a cyclic SM4 encryption architecture (Scheme 2), and a 32-stage pipelined SM4 encryption architecture with a 2-stage intra-round pipeline (Scheme 3).
[0113] A set of plaintext was then encrypted using Scheme 1, Scheme 2, Scheme 3, and the FPGA-based SM4 implementation method of this invention, and the hardware resource consumption, frequency, and throughput of the four methods were compared. The results are shown in Table 2.
[0114] Table 2 Simulation Results
[0115]

| Metric | Scheme 1 | Scheme 2 | Scheme 3 | This invention |
|---|---|---|---|---|
| LUTs consumed | 7466 | 697 | 14852 | 6740 |
| Frequency (MHz) | 250 | 333 | 259 | 342 |
| Throughput (Gb/s) | 30.52 | 1.27 | 33.152 | 43.776 |
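If a fully pipelined design is assumed to accept one 128-bit block per clock cycle, its throughput is simply the clock frequency times 128 bits. The short check below (the function name is hypothetical) reproduces the 43.776 Gb/s entry for this invention at 342 MHz and the 33.152 Gb/s entry for Scheme 3 at 259 MHz. The same formula does not reproduce the Scheme 1 and Scheme 2 entries (for example, 250 MHz × 128 bit gives 32 Gb/s rather than 30.52 Gb/s), so those entries may rest on a different unit convention or cycle count.

```python
BLOCK_BITS = 128  # SM4 block size in bits

def throughput_gbps(f_mhz: float, blocks_per_cycle: float = 1.0) -> float:
    """Throughput in Gb/s (10**9 bit/s) at clock f_mhz, assuming
    `blocks_per_cycle` 128-bit blocks complete every clock cycle."""
    return f_mhz * 1e6 * BLOCK_BITS * blocks_per_cycle / 1e9

print(round(throughput_gbps(342), 3))  # this invention: 43.776
print(round(throughput_gbps(259), 3))  # Scheme 3: 33.152
```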
[0116] As Table 2 shows, because the present invention combines a cyclic key expansion architecture with a lookup table S-box, and a pipelined encryption architecture with a pipelined, optimized algebraic S-box, it achieves the highest clock frequency and throughput of the four designs while consuming fewer LUTs than the pipelined Schemes 1 and 3.
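An algebraic S-box computes its output with field arithmetic instead of reading a stored table. Published algebraic analyses of SM4 express its S-box as an affine transform, a multiplicative inversion in GF(2^8) defined by the polynomial x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1 (0x1F5), and a second affine transform. The sketch below shows only that inversion core in plain software form: the affine matrix and constant are omitted, the names are hypothetical, and a pipelined hardware S-box might instead decompose the inversion into smaller-field (composite-field) operations rather than use exponentiation.

```python
SM4_POLY = 0x1F5  # x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1

def gf_mul(a: int, b: int, poly: int = SM4_POLY) -> int:
    """Multiply a and b in GF(2^8) modulo `poly` (shift-and-add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a & 0x100:        # a degree-8 term appeared: reduce modulo poly
            a ^= poly
    return result

def gf_inv(a: int, poly: int = SM4_POLY) -> int:
    """Inverse via a^254 (a^255 = 1 for nonzero a); 0 maps to 0."""
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf_mul(result, base, poly)
        base = gf_mul(base, base, poly)
        e >>= 1
    return result

# every nonzero element times its inverse gives 1
assert all(gf_mul(x, gf_inv(x)) == 1 for x in range(1, 256))
```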
[0117] As can be seen from the above embodiments, the beneficial effects of the present invention are as follows:
[0118] This invention provides an efficient implementation system and method for the SM4 cryptographic algorithm based on FPGA. The system adopts a cyclic key expansion architecture combined with a lookup table S-box, and a pipelined encryption architecture combined with a pipelined algebraic S-box. The resulting encryption architecture achieves a higher clock frequency, higher throughput, and lower hardware resource consumption than comparable pipelined designs, overcoming the low encryption efficiency and high hardware resource consumption of existing technologies and enabling hardware devices with limited resources to encrypt more efficiently.
[0119] In the description of this invention, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.
[0120] The use of terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples" indicates that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. In addition, those skilled in the art can combine and integrate the different embodiments or examples described in this specification.
[0121] Although this application has been described herein in conjunction with various embodiments, other variations of the disclosed embodiments can be understood and implemented by those skilled in the art in carrying out the claimed application by reviewing the accompanying drawings, the disclosure, and the appended claims.
[0122] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.
Claims
1. A high-efficiency implementation system for the SM4 cryptographic algorithm based on FPGA, characterized in that it comprises a round key expansion module and an encryption module, the encryption module comprising 32 cascaded round operation submodules; wherein the round key expansion module is configured to generate 32 round keys from the key input by the user and to input the 32 round keys into the 32 round operation submodules of the encryption module respectively; the encryption module is configured to generate ciphertext from the processing results obtained by the 32 round operation submodules performing round operations on their input data with their respective round keys; the encryption module includes a plurality of registers, one corresponding to each round operation submodule, and each round operation submodule includes a round operation unit and an algebraic S-box; wherein the round operation unit is configured to divide the input data bitwise into blocks K_i, K_{i+1}, K_{i+2} and K_{i+3}, XOR its own round key RK_i with blocks K_{i+1}, K_{i+2} and K_{i+3}, divide the resulting first data bitwise into blocks L_i, L_{i+1}, L_{i+2} and L_{i+3}, input blocks L_i, L_{i+1}, L_{i+2} and L_{i+3} into respective algebraic S-boxes, cyclically shift the permutation data output by the algebraic S-boxes, and XOR the cyclically shifted permutation data with block K_i to obtain the processing result of the i-th round operation submodule, which is stored in the corresponding register, where i = 0, 1, 2, ..., 31; the input data of each round operation submodule is the processing result of the preceding round operation submodule, the input data of the 0th round operation submodule is the plaintext to be encrypted, and the processing result of the 31st round operation submodule is the ciphertext obtained by encrypting the plaintext.
The round key expansion module adopts a cyclic architecture, in which the pre-instantiated round key expansion logic is reused iteratively for 32 round operations to obtain the 32 round keys.
2. An efficient implementation method for the SM4 cryptographic algorithm based on FPGA, characterized in that it is applied to the system of claim 1 and comprises: obtaining the key input by the user and the plaintext to be encrypted; generating 32 round keys RK_i from the key, i = 0, 1, 2, ..., 31; and performing 32 round operations on the plaintext to be encrypted using the 32 round keys RK_i to obtain the encrypted ciphertext.
3. The method of claim 2, wherein each round operation processes its input data through the following steps: dividing the input data of the i-th round operation bitwise into blocks K_i, K_{i+1}, K_{i+2} and K_{i+3}; XORing the round key RK_i corresponding to the i-th round operation with blocks K_{i+1}, K_{i+2} and K_{i+3} to obtain first data; dividing the first data bitwise into blocks L_i, L_{i+1}, L_{i+2} and L_{i+3}, and inputting blocks L_i, L_{i+1}, L_{i+2} and L_{i+3} into respective algebraic S-boxes to obtain permutation data; concatenating the permutation data of the algebraic S-boxes and cyclically shifting the result; and XORing the cyclically shifted permutation data with block K_i to obtain the processing result of the i-th round operation, which is stored in the corresponding register.
4. The method of claim 3, wherein, before the step of dividing the input data of the i-th round operation bitwise into blocks K_i, K_{i+1}, K_{i+2} and K_{i+3}, the method further comprises: retrieving the processing result of round i-1 from the register corresponding to the i-th round operation submodule as the input data of the i-th round operation; wherein the input data of the 0th round operation is the plaintext to be encrypted, and the processing result of the 31st round operation is the ciphertext obtained by encrypting the plaintext.
5. The method of claim 3, wherein the step of XORing the round key RK_i corresponding to the i-th round operation with blocks K_{i+1}, K_{i+2} and K_{i+3} to obtain the first data comprises: XORing blocks K_{i+1}, K_{i+2} and K_{i+3} to obtain a first XOR result; and XORing the first XOR result with the round key RK_i corresponding to the i-th round operation to obtain the first data.
6. The method of claim 5, wherein the step of dividing the first data bitwise into blocks L_i, L_{i+1}, L_{i+2} and L_{i+3} and inputting them into respective algebraic S-boxes to obtain permutation data comprises: dividing the first data bitwise into blocks L_i, L_{i+1}, L_{i+2} and L_{i+3}; and inputting blocks L_i, L_{i+1}, L_{i+2} and L_{i+3} into the algebraic S-boxes to obtain first permutation data corresponding to block L_i, second permutation data corresponding to block L_{i+1}, third permutation data corresponding to block L_{i+2}, and fourth permutation data corresponding to block L_{i+3}.
7. The method of claim 6, wherein the step of concatenating the permutation data of the algebraic S-boxes and cyclically shifting the result comprises: concatenating the first, second, third and fourth permutation data bitwise, and cyclically shifting the concatenated data left by 2, 10, 18 and 24 bits respectively to obtain first, second, third and fourth shifted data.
8. The method of claim 7, wherein the step of XORing the cyclically shifted permutation data with block K_i to obtain the processing result of the i-th round operation and storing it in the corresponding register comprises: XORing the first shifted data with the second shifted data to obtain second data; XORing the third shifted data with the fourth shifted data to obtain third data; XORing the second data, the third data and the concatenated data to obtain fourth data; XORing the fourth data with block K_i to obtain the processing result of the i-th round operation, block K_{i+4}; and storing the processing result block K_{i+4} of the i-th round operation in the corresponding register.
9. The method of claim 3, wherein the step of dividing the input data of the i-th round operation bitwise into blocks K_i, K_{i+1}, K_{i+2} and K_{i+3} comprises: obtaining the input data of the i-th round operation, the input data being 32 bits; and using bits 0-7 of the input data as block K_i, bits 8-15 as block K_{i+1}, bits 16-23 as block K_{i+2}, and bits 24-31 as block K_{i+3}.
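Claims 1 and 4 describe 32 cascaded round operation submodules, each writing its result into its own register, which is what allows a new plaintext block to enter every clock cycle. The cycle-level Python model below illustrates that behavior under stated assumptions: the names are hypothetical, and `toy_round` is an arbitrary stand-in with the SM4 data flow rather than the SM4 round function. Once the pipeline is full, one result leaves per clock after a latency of 32 clocks, and every result matches applying the 32 rounds one after another.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[int, int, int, int]          # (K_i, K_{i+1}, K_{i+2}, K_{i+3})
RoundFn = Callable[[State, int], State]

def clock(regs: List[Optional[State]], new_in: Optional[State],
          round_fn: RoundFn, keys: Sequence[int]):
    """One clock edge: every stage reads its predecessor's old register value."""
    out = regs[-1]                          # result leaving the last stage
    prev = [new_in] + regs[:-1]
    regs = [None if x is None else round_fn(x, keys[i]) for i, x in enumerate(prev)]
    return regs, out

def run_stream(blocks: Sequence[State], round_fn: RoundFn, keys: Sequence[int]) -> List[State]:
    """Feed one block per clock, then extra empty clocks to flush the pipeline."""
    regs: List[Optional[State]] = [None] * len(keys)
    outputs = []
    for x in list(blocks) + [None] * len(keys):
        regs, out = clock(regs, x, round_fn, keys)
        if out is not None:
            outputs.append(out)
    return outputs

def sequential(block: State, round_fn: RoundFn, keys: Sequence[int]) -> State:
    """Reference: apply all rounds one after another."""
    for rk in keys:
        block = round_fn(block, rk)
    return block

def toy_round(s: State, rk: int) -> State:
    """Arbitrary stand-in with the SM4 data flow, NOT the SM4 round function."""
    k0, k1, k2, k3 = s
    return (k1, k2, k3, k0 ^ (((k1 ^ k2 ^ k3 ^ rk) * 3) & 0xFFFFFFFF))

keys = list(range(32))
blocks = [(n, n + 1, n + 2, n + 3) for n in range(5)]
assert run_stream(blocks, toy_round, keys) == [sequential(b, toy_round, keys) for b in blocks]
```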
Citation Information
Patent Citations
SM4 algorithm realization system of pipeline structure
CN105049194A
SM4 algorithm operation method, system and device and computer readable storage medium
CN114598451A