An acceleration method of an SM4 block cipher algorithm and an instruction set processor
By employing the SM4 extended instruction set with parallel pipeline and instruction-level parallelism on domestically produced processors, the SM4 block cipher algorithm was accelerated, solving the problem of slow execution speed of domestically produced processors and improving the execution efficiency and flexibility of the algorithm.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI HIGH-PERFORMANCE INTEGRATED CIRCUIT DESIGN CENT
- Filing Date
- 2022-10-19
- Publication Date
- 2026-06-16
AI Technical Summary
How to improve the speed of domestic processors in executing the SM4 block cipher algorithm, especially when using an independent instruction set, is a problem that current technologies have not yet effectively solved.
By employing parallel pipeline and instruction-level parallelism techniques, and through the SM4 extended instruction set, including SM4 round key generation instructions and SM4 round function iteration instructions, the SM4 block cipher algorithm is accelerated. The key expansion and encryption/decryption algorithms are executed in parallel using the fixed-length 32-bit format instructions of the RISC architecture.
It significantly improves the execution speed of the SM4 block cipher algorithm, simplifies the software program, and realizes the parallel potential of key expansion and encryption/decryption algorithms, with good scalability and design flexibility.
Figure CN115658148B_ABST
Description
Technical Field
[0001] This invention relates to the fields of processor design and information security technology, and in particular to an acceleration method and instruction set processor for the SM4 block cipher algorithm.
Background Technology
[0002] Cryptography is a crucial safeguard for information security. Countries worldwide prioritize the research and implementation of cryptographic algorithms and have successively proposed their own standard cryptographic algorithm systems. The Chinese national cryptographic algorithms (Guomi) are a series of cryptographic algorithms and specifications independently developed by the State Cryptography Administration of China to ensure the security of commercial cryptography in China. The SM4 block cipher algorithm, released by the State Cryptography Administration on March 21, 2012, is an important component of China's standard cryptographic algorithm system, so accelerating its execution on processors is of considerable significance. With the implementation of China's Cryptography Law and Cybersecurity Law, the SM4 block cipher algorithm is being used ever more widely, and how to execute it more efficiently has become a research hotspot.
[0003] The SM4 block cipher algorithm is a typical block cipher algorithm, mainly consisting of the SM4 key expansion algorithm and the SM4 encryption / decryption algorithm. The block length (plaintext and ciphertext) and the key length are both 128 bits. The encryption, decryption, and key expansion algorithms all employ a 32-round nonlinear iterative structure. The decryption and encryption algorithms have the same structure: both consist of the same 32 rounds of nonlinear round function iteration followed by a reverse-order transformation R. The only difference is that the round keys are used in reverse order; the decryption round keys are the encryption round keys in reverse order.
[0004] The system parameters of the SM4 block cipher algorithm can be represented as $FK = (FK_0, FK_1, FK_2, FK_3)$, and the fixed parameters of the algorithm as $CK = (CK_0, CK_1, \ldots, CK_{31})$, where $FK_i$ ($i = 0, 1, 2, 3$) and $CK_i$ ($i = 0, 1, 2, \ldots, 31$) are 32-bit words used in the key expansion algorithm. The encryption key can be represented as $MK = (MK_0, MK_1, MK_2, MK_3)$, where $MK_i$ ($i = 0, 1, 2, 3$) is a 32-bit word. Using the key expansion algorithm, the encryption key generates the round keys, which are represented as $(rk_0, rk_1, \ldots, rk_{31})$, where $rk_i$ ($i = 0, 1, 2, \ldots, 31$) is a 32-bit word.
[0005] Let the input be $(X_0, X_1, X_2, X_3)$ and the round key be $rk$. Then the round function of the SM4 block cipher algorithm is $F(X_0, X_1, X_2, X_3, rk) = X_0 \oplus T(X_1 \oplus X_2 \oplus X_3 \oplus rk)$, where the composite permutation $T: Z_2^{32} \to Z_2^{32}$ is an invertible transformation composed of a nonlinear transformation $\tau$ and a linear transformation $L$, i.e., $T(\cdot) = L(\tau(\cdot))$. The nonlinear transformation $\tau$ consists of four parallel S-boxes. Let the input be $A = (a_0, a_1, a_2, a_3)$ and the output be $B = (b_0, b_1, b_2, b_3)$; then $(b_0, b_1, b_2, b_3) = \tau(A) = (Sbox(a_0), Sbox(a_1), Sbox(a_2), Sbox(a_3))$, where the S-box values are obtained by table lookup. The output of the nonlinear transformation $\tau$ is the input of the linear transformation $L$. Let the input be $B \in Z_2^{32}$ and the output be $C \in Z_2^{32}$; then $C = L(B) = B \oplus (B \lll 2) \oplus (B \lll 10) \oplus (B \lll 18) \oplus (B \lll 24)$.
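The round function described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the S-box table used here is a stand-in byte permutation (an assumption made only so the sketch runs) and is NOT the S-box defined by the SM4 standard. The function names are likewise illustrative.

```python
MASK32 = 0xFFFFFFFF

# ASSUMPTION: stand-in byte permutation, NOT the real SM4 S-box table.
PLACEHOLDER_SBOX = [(7 * x + 3) % 256 for x in range(256)]


def rotl32(x, n):
    """Circular left shift (<<<) of a 32-bit word by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def tau(a, sbox=PLACEHOLDER_SBOX):
    """Nonlinear transformation: apply the S-box to each of the 4 bytes of a."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox[(a >> shift) & 0xFF] << shift
    return out


def lin_l(b):
    """Linear transformation L(B) = B ^ (B<<<2) ^ (B<<<10) ^ (B<<<18) ^ (B<<<24)."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)


def t_transform(x, sbox=PLACEHOLDER_SBOX):
    """Composite permutation T(x) = L(tau(x))."""
    return lin_l(tau(x, sbox))


def round_f(x0, x1, x2, x3, rk, sbox=PLACEHOLDER_SBOX):
    """Round function F(X0, X1, X2, X3, rk) = X0 ^ T(X1 ^ X2 ^ X3 ^ rk)."""
    return x0 ^ t_transform(x1 ^ x2 ^ x3 ^ rk, sbox)
```

Passing the S-box as a parameter keeps the structure of T separate from the table contents, so the real table from the standard can be substituted without touching the rest of the code.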
[0006] The SM4 block cipher algorithm consists of 32 iterations and 1 reverse-order transformation R. Let the plaintext input be $(X_0, X_1, X_2, X_3) \in (Z_2^{32})^4$, the ciphertext output be $(Y_0, Y_1, Y_2, Y_3) \in (Z_2^{32})^4$, and the round keys be $rk_i \in Z_2^{32}$ ($i = 0, 1, 2, \ldots, 31$). The iterative process of the encryption algorithm is as follows:
[0007] (1) 32 iterations: $X_{i+4} = F(X_i, X_{i+1}, X_{i+2}, X_{i+3}, rk_i)$, $i = 0, 1, 2, \ldots, 31$;
[0008] (2) Reverse-order transformation: $(Y_0, Y_1, Y_2, Y_3) = R(X_{32}, X_{33}, X_{34}, X_{35}) = (X_{35}, X_{34}, X_{33}, X_{32})$.
[0009] The decryption transformation of the SM4 block cipher algorithm has the same structure as the encryption transformation; the only difference is the order in which the round keys are used. During decryption, the round keys are used in the order $(rk_{31}, rk_{30}, \ldots, rk_0)$.
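The 32-round iteration, the reverse-order transformation R, and the reversed round key order used for decryption can be illustrated as follows. This is a sketch under stated assumptions: the S-box is a stand-in permutation rather than the real SM4 table, and the round keys in the usage example are arbitrary values rather than the output of the key expansion algorithm. Because of the algorithm's iterative structure, decryption recovers the input regardless of which S-box or round keys are used.

```python
MASK32 = 0xFFFFFFFF

# ASSUMPTION: stand-in byte permutation, NOT the real SM4 S-box table.
PLACEHOLDER_SBOX = [(7 * x + 3) % 256 for x in range(256)]


def rotl32(x, n):
    """Circular left shift of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def t_transform(x):
    """T(x) = L(tau(x)) with the placeholder S-box."""
    b = 0
    for shift in (24, 16, 8, 0):
        b |= PLACEHOLDER_SBOX[(x >> shift) & 0xFF] << shift
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)


def crypt_block(block, round_keys):
    """Run one round per key, then apply the reverse-order transformation R."""
    x = list(block)  # x[i] holds X_i
    for rk in round_keys:
        # X_{i+4} = X_i ^ T(X_{i+1} ^ X_{i+2} ^ X_{i+3} ^ rk_i)
        x.append(x[-4] ^ t_transform(x[-3] ^ x[-2] ^ x[-1] ^ rk))
    return tuple(reversed(x[-4:]))  # (X35, X34, X33, X32)


def encrypt_block(block, round_keys):
    return crypt_block(block, round_keys)


def decrypt_block(block, round_keys):
    # Same structure; only the round key order is reversed.
    return crypt_block(block, round_keys[::-1])


# Usage with arbitrary (hypothetical) round keys:
demo_keys = [(i * 0x9E3779B9) & MASK32 for i in range(32)]
demo_plain = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
demo_cipher = encrypt_block(demo_plain, demo_keys)
print(decrypt_block(demo_cipher, demo_keys) == demo_plain)
```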
[0010] The round keys of the SM4 block cipher algorithm are generated from the encryption key by the key expansion algorithm.
[0011] Given the encryption key $MK = (MK_0, MK_1, MK_2, MK_3) \in (Z_2^{32})^4$, the round keys are generated as follows:
[0012] $(K_0, K_1, K_2, K_3) = (MK_0 \oplus FK_0, MK_1 \oplus FK_1, MK_2 \oplus FK_2, MK_3 \oplus FK_3)$,
[0013] $rk_i = K_{i+4} = K_i \oplus T'(K_{i+1} \oplus K_{i+2} \oplus K_{i+3} \oplus CK_i)$, $i = 0, 1, 2, \ldots, 31$; where:
[0014] (1) $T'$ is obtained from the composite permutation $T$ by replacing the linear transformation $L$ with $L'$:
[0015] $L'(B) = B \oplus (B \lll 13) \oplus (B \lll 23)$
[0016] (2) The method for determining the value of system parameter FK is as follows:
[0017] $FK_0 = (A3B1BAC6)$, $FK_1 = (56AA3350)$, $FK_2 = (677D9197)$, $FK_3 = (B27022DC)$.
[0018] (3) The method for determining the value of the fixed parameter CK in the algorithm is as follows:
[0019] $CK_i$ ($i = 0, 1, 2, \ldots, 31$) are the fixed parameters of the algorithm. Let $ck_{i,j}$ be the $j$-th byte of $CK_i$ ($i = 0, 1, 2, \ldots, 31$; $j = 0, 1, 2, 3$), i.e. $CK_i = (ck_{i,0}, ck_{i,1}, ck_{i,2}, ck_{i,3}) \in (Z_2^8)^4$; then $ck_{i,j} = (4i + j) \times 7 \pmod{256}$.
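The parameter definitions above translate directly into code. The following sketch computes the 32 fixed parameters from the byte formula and forms the initial intermediate keys from an encryption key. It assumes byte 0 of each word is the most significant byte; the function names are illustrative.

```python
# System parameters FK0..FK3 as listed in the text (hexadecimal).
FK = (0xA3B1BAC6, 0x56AA3350, 0x677D9197, 0xB27022DC)


def ck_word(i):
    """Build CK_i from its bytes ck_{i,j} = (4i + j) * 7 mod 256, j = 0..3.

    ASSUMPTION: ck_{i,0} is the most significant byte of the 32-bit word.
    """
    word = 0
    for j in range(4):
        word = (word << 8) | (((4 * i + j) * 7) % 256)
    return word


# All 32 fixed parameters CK_0..CK_31.
CK = [ck_word(i) for i in range(32)]


def initial_k(mk):
    """(K0, K1, K2, K3) = (MK0 ^ FK0, MK1 ^ FK1, MK2 ^ FK2, MK3 ^ FK3)."""
    return tuple(m ^ f for m, f in zip(mk, FK))


print(" ".join(f"{w:08X}" for w in CK[:4]))
```

Because the fixed parameters follow a simple arithmetic rule, hardware can derive them from a small index rather than storing them as data, which is consistent with the immediate-selected parameters described later in the document.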
[0020] The SM4 block cipher algorithm, encompassing encryption, decryption, and key expansion, requires significant computational resources and therefore benefits from specialized acceleration techniques. Currently, methods for accelerating SM4 fall into two categories: software implementations and hardware implementations. Software implementations include techniques such as AES-NI instruction set acceleration, bitslicing, and SIMD-based parallel computing. However, software implementations offer limited room for optimization and limited applicability, and are susceptible to security threats such as side-channel attacks. Hardware implementations optimize SM4 efficiency through key technologies such as composite-field techniques, using dedicated hardware such as FPGAs, ASICs, and GPUs. They offer high acceleration efficiency but are costly and lack versatility and scalability. If an Instruction Set Architecture (ISA) extension is used to accelerate SM4, it can both speed up execution and provide scalability and design flexibility, effectively improving the performance of general-purpose processors executing SM4.
[0021] Currently, the problem of how to improve the performance of domestic processors with independently developed instruction sets in executing the SM4 block cipher algorithm has not been effectively solved. It is therefore urgent to explore a method for accelerating the SM4 block cipher algorithm on domestic processors, thereby improving the speed at which they execute it.
Summary of the Invention
[0022] The technical problem to be solved by the present invention is to provide an acceleration method and instruction set processor for the SM4 block cipher algorithm, which can greatly improve the speed of executing the SM4 block cipher algorithm and simplify the software program.
[0023] The technical solution adopted by this invention to solve its technical problem is as follows: an acceleration method for the SM4 block cipher algorithm is provided which, based on the SM4 extended instruction set, uses parallel pipelining and instruction-level parallelism to accelerate the implementation of the SM4 block cipher algorithm. The SM4 block cipher algorithm includes an SM4 key expansion algorithm and an SM4 encryption / decryption algorithm. The SM4 extended instruction set adopts a RISC architecture, with instructions in a fixed-length 32-bit format and with both source and destination operands being 256 bits wide. The SM4 extended instruction set includes SM4 round key generation instructions and SM4 round function iteration instructions. The SM4 round key generation instruction uses a multiple-SM4-round-key parallel generation algorithm to accelerate the SM4 key expansion algorithm: taking the previous four 32-bit intermediate keys K3~K0 and the eight 32-bit algorithm fixed parameters CK7~CK0 related to the subsequent eight round keys as input, a single execution generates eight round keys during the SM4 round key expansion process. The SM4 round function iteration instruction uses a multi-round SM4 iterative parallel execution algorithm to accelerate the SM4 encryption / decryption algorithm: taking two unrelated sets of the current four intermediate words, W3~W0 and W'3~W'0, together with the eight round keys rk7~rk0 used in the subsequent eight rounds as input, a single execution performs eight rounds of the SM4 encryption / decryption round function on each set and produces two sets of four unrelated intermediate words.
[0024] When using parallel pipeline and instruction-level parallelism techniques to accelerate the implementation of the SM4 block cipher algorithm:
[0025] Encryption includes the following steps:
[0026] (A) Generate the initial values (K3, K2, K1, K0) of the round key iteration using general instructions in a general-purpose processor;
[0027] (B) The first SM4 round key generation instruction is executed with the initial value of the round key iteration (K3,K2,K1,K0) and the fixed system parameters CK7~CK0 as input to generate the 8 round keys rk7~rk0 of the SM4 block cipher algorithm;
[0028] (C) With the round keys $rk_7$~$rk_4$ and the system fixed parameters $CK_{15}$~$CK_8$ as input, execute the second SM4 round key generation instruction to generate the 8 round keys $rk_{15}$~$rk_8$ of the SM4 block cipher algorithm; simultaneously, with the round keys $rk_7$~$rk_0$ and two sets of unrelated plaintext $(W_3, W_2, W_1, W_0, W'_3, W'_2, W'_1, W'_0)$ as input, execute the first SM4 round function iteration instruction, completing the 1st to 8th round function iterations of the SM4 encryption algorithm and obtaining the iterative working words $(W^{(7)}_3, W^{(7)}_2, W^{(7)}_1, W^{(7)}_0, {W'}^{(7)}_3, {W'}^{(7)}_2, {W'}^{(7)}_1, {W'}^{(7)}_0)$;
[0029] (D) With the round keys $rk_{15}$~$rk_{12}$ and the system fixed parameters $CK_{23}$~$CK_{16}$ as input, execute the third SM4 round key generation instruction to generate the 8 round keys $rk_{23}$~$rk_{16}$ of the SM4 block cipher algorithm; simultaneously, with the round keys $rk_{15}$~$rk_8$ and the iterative working words $(W^{(7)}_3, W^{(7)}_2, W^{(7)}_1, W^{(7)}_0, {W'}^{(7)}_3, {W'}^{(7)}_2, {W'}^{(7)}_1, {W'}^{(7)}_0)$ as input, execute the second SM4 round function iteration instruction, completing the 9th to 16th round function iterations of the SM4 encryption algorithm and obtaining the iterative working words $(W^{(15)}_3, W^{(15)}_2, W^{(15)}_1, W^{(15)}_0, {W'}^{(15)}_3, {W'}^{(15)}_2, {W'}^{(15)}_1, {W'}^{(15)}_0)$;
[0030] (E) With the round keys $rk_{23}$~$rk_{20}$ and the system fixed parameters $CK_{31}$~$CK_{24}$ as input, execute the fourth SM4 round key generation instruction to generate the 8 round keys $rk_{31}$~$rk_{24}$ of the SM4 block cipher algorithm; simultaneously, with the round keys $rk_{23}$~$rk_{16}$ and the iterative working words $(W^{(15)}_3, W^{(15)}_2, W^{(15)}_1, W^{(15)}_0, {W'}^{(15)}_3, {W'}^{(15)}_2, {W'}^{(15)}_1, {W'}^{(15)}_0)$ as input, execute the third SM4 round function iteration instruction (VSM4R), completing the 17th to 24th round function iterations of the SM4 encryption algorithm and obtaining the iterative working words $(W^{(23)}_3, W^{(23)}_2, W^{(23)}_1, W^{(23)}_0, {W'}^{(23)}_3, {W'}^{(23)}_2, {W'}^{(23)}_1, {W'}^{(23)}_0)$;
[0031] (F) With the round keys $rk_{31}$~$rk_{24}$ and the iterative working words $(W^{(23)}_3, W^{(23)}_2, W^{(23)}_1, W^{(23)}_0, {W'}^{(23)}_3, {W'}^{(23)}_2, {W'}^{(23)}_1, {W'}^{(23)}_0)$ as input, execute the fourth SM4 round function iteration instruction, completing the 25th to 32nd round function iterations of the SM4 encryption algorithm and obtaining the iterative working words $(W^{(31)}_3, W^{(31)}_2, W^{(31)}_1, W^{(31)}_0, {W'}^{(31)}_3, {W'}^{(31)}_2, {W'}^{(31)}_1, {W'}^{(31)}_0)$;
[0032] (G) Output the iterative working words $(W^{(31)}_3, W^{(31)}_2, W^{(31)}_1, W^{(31)}_0, {W'}^{(31)}_3, {W'}^{(31)}_2, {W'}^{(31)}_1, {W'}^{(31)}_0)$ in reverse order to obtain the ciphertext $(Y_3, Y_2, Y_1, Y_0, Y'_3, Y'_2, Y'_1, Y'_0)$ of the encryption algorithm;
[0033] Decryption includes the following steps:
[0034] (a) Generate the initial values (K3, K2, K1, K0) of the round key iteration using general instructions in a general-purpose processor;
[0035] (b) With the round key iteration initial values $(K_3, K_2, K_1, K_0)$ and the system fixed parameters $CK_{31}$~$CK_0$ as source operand data, execute the SM4 round key generation instruction 4 times in sequence to generate the 32 round keys $rk_{31}$~$rk_0$ of the SM4 block cipher algorithm;
[0036] (c) With the round keys $rk_{31}$~$rk_0$ and the ciphertext $(Y_3, Y_2, Y_1, Y_0, Y'_3, Y'_2, Y'_1, Y'_0)$ as input, execute the SM4 round function iteration instruction 4 times in sequence, completing the 32 round function iterations of the SM4 decryption algorithm and obtaining the iterative working words $(W^{(31)}_3, W^{(31)}_2, W^{(31)}_1, W^{(31)}_0, {W'}^{(31)}_3, {W'}^{(31)}_2, {W'}^{(31)}_1, {W'}^{(31)}_0)$;
[0037] (d) Output the iterative working words $(W^{(31)}_3, W^{(31)}_2, W^{(31)}_1, W^{(31)}_0, {W'}^{(31)}_3, {W'}^{(31)}_2, {W'}^{(31)}_1, {W'}^{(31)}_0)$ in reverse order to obtain the plaintext result of the decryption algorithm $(W_3, W_2, W_1, W_0, W'_3, W'_2, W'_1, W'_0)$.
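The difference between the encryption and decryption flows can be summarized as a list of issue groups, where the instructions in one group have no dependency on each other and can be issued together. The sketch below merely restates the steps above in code; the labels such as "VSM4KEY #2" are illustrative names for the n-th execution of each instruction.

```python
def encryption_issue_groups():
    """Groups for encryption steps (B)-(F).

    Round keys are consumed in generation order, so the n-th round function
    instruction can overlap with the (n+1)-th key generation instruction.
    """
    groups = [["VSM4KEY #1"]]
    for n in range(1, 4):
        groups.append([f"VSM4KEY #{n + 1}", f"VSM4R #{n}"])
    groups.append(["VSM4R #4"])
    return groups


def decryption_issue_groups():
    """Groups for decryption steps (b)-(c).

    Decryption starts with rk31, which only the last key generation
    instruction produces, so all four key generations precede the first
    round function instruction.
    """
    keys = [[f"VSM4KEY #{n}"] for n in range(1, 5)]
    rounds = [[f"VSM4R #{n}"] for n in range(1, 5)]
    return keys + rounds


for step, group in enumerate(encryption_issue_groups(), start=1):
    print(step, " || ".join(group))
```

Under this view encryption has a dependent chain of five groups and decryption a chain of eight, which is why the two flows are described separately.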
[0038] The SM4 round key generation instruction adopts the simple arithmetic instruction format with an immediate value, specifically VSM4KEY Va,#b,Vc. The instruction operates on the operand in a 256-bit source register Va and an 8-bit immediate operand, and stores the result in a 256-bit destination register Vc. In the 32-bit instruction word, bits [31:26] are a 6-bit opcode; bits [25:21] select one register from the register file of 32 256-bit registers as the source register Va, which holds the source operand of the instruction; bits [20:13] are an 8-bit immediate operand indicating the number of loop iterations; bits [12:5] are an 8-bit function code that determines the specific function of the instruction; and bits [4:0] select one register from the register file of 32 256-bit registers as the destination register Vc, which stores the result of the instruction.
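The field layout of the immediate-format instruction can be made concrete with a small encoder. This is a sketch only: the text gives the bit positions but not the actual opcode or function code values, so the numbers used in the example call are hypothetical, as is the function name.

```python
def encode_vsm4key(opcode, va, imm8, func, vc):
    """Pack the fields of the immediate-format instruction VSM4KEY Va,#b,Vc.

    Layout from the text: [31:26] opcode, [25:21] Va, [20:13] 8-bit immediate,
    [12:5] function code, [4:0] Vc.
    """
    fields = (
        ("opcode", opcode, 6),
        ("va", va, 5),
        ("imm8", imm8, 8),
        ("func", func, 8),
        ("vc", vc, 5),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (opcode << 26) | (va << 21) | (imm8 << 13) | (func << 5) | vc


# Hypothetical opcode 0x1A and function code 0x40, chosen only for illustration.
word = encode_vsm4key(opcode=0x1A, va=3, imm8=0, func=0x40, vc=4)
print(f"{word:08X}")
```

The five field widths (6 + 5 + 8 + 8 + 5) add up to exactly 32 bits, so the format has no spare bits.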
[0039] The SM4 round key generation instruction operates as follows: based on the previous four intermediate keys K3~K0, it generates the subsequent eight round keys rk7~rk0 of the SM4 block cipher algorithm. The intermediate keys K3~K0 are stored in the high 128 bits of the source register Va, the algorithm fixed parameters CK7~CK0 are determined by the immediate value #b, and the generated results rk7~rk0 are stored in the destination register Vc. The results rk7~rk0 are equal to the intermediate keys $K_{11}$~$K_4$ respectively. For i equal to 0, 1, 2, 3, 4, 5, 6, 7, the generation logic of $K_{i+4}$ is: $K_{i+4}$ = $K_i$ XOR Temp2 XOR (Temp2<<<13) XOR (Temp2<<<23), where XOR represents bitwise XOR, <<< represents circular left shift, and Temp2 is a 32-bit intermediate variable word whose generation logic is: Temp2[31:0] = SBOX(Temp1). The SBOX(X) function performs table lookups in parallel on the 4 bytes of the 32-bit data X to obtain a new 32-bit data. Temp1 is a 32-bit intermediate variable word whose generation logic is: Temp1[31:0] = $K_{i+1}$ XOR $K_{i+2}$ XOR $K_{i+3}$ XOR $CK_i$, where $CK_i$ is a 32-bit algorithm fixed parameter and {CK7~CK0} = SELCK(#b). The SELCK(#b) function determines the algorithm fixed parameters $CK_i$ used in the key expansion process of the SM4 block cipher algorithm: the eight 32-bit values CK7~CK0 are determined from #b according to the SM4 block cipher algorithm. Executing the SM4 round key generation instruction once generates eight round keys of the SM4 block cipher algorithm. The instruction is executed four times in sequence; each time, the high 128 bits of the source register Va are updated with the high 128 bits of the destination register Vc and the immediate value #b is incremented by 1, thereby generating the 32 round keys $rk_{31}$~$rk_0$ of the SM4 block cipher algorithm.
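A behavioral model helps check that four executions of the round key generation instruction reproduce the ordinary 32-step key expansion. The sketch below makes several assumptions that are not stated in the text: SELCK(#b) is taken to select CK(8b)~CK(8b+7), registers are modeled as Python lists rather than 256-bit values (so no bit layout is implied), and the S-box is a stand-in permutation, NOT the real SM4 table.

```python
MASK32 = 0xFFFFFFFF

# ASSUMPTION: stand-in byte permutation, NOT the real SM4 S-box table.
PLACEHOLDER_SBOX = [(7 * x + 3) % 256 for x in range(256)]


def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32


def sbox32(x):
    """SBOX(X): substitute each of the 4 bytes of X independently."""
    return sum(PLACEHOLDER_SBOX[(x >> s) & 0xFF] << s for s in (0, 8, 16, 24))


def ck_word(i):
    """CK_i with bytes ck_{i,j} = (4i + j) * 7 mod 256."""
    word = 0
    for j in range(4):
        word = (word << 8) | (((4 * i + j) * 7) % 256)
    return word


def selck(b):
    """ASSUMPTION: immediate #b selects the fixed parameters CK_{8b}..CK_{8b+7}."""
    return [ck_word(8 * b + i) for i in range(8)]


def vsm4key(k, b):
    """Model of one execution: k = [K0, K1, K2, K3] -> [rk0..rk7] = [K4..K11]."""
    ks = list(k)
    cks = selck(b)
    for i in range(8):
        temp1 = ks[i + 1] ^ ks[i + 2] ^ ks[i + 3] ^ cks[i]
        temp2 = sbox32(temp1)
        ks.append(ks[i] ^ temp2 ^ rotl32(temp2, 13) ^ rotl32(temp2, 23))
    return ks[4:]


def expand_key(k):
    """Four executions; the last four keys of each result seed the next call."""
    round_keys = []
    for b in range(4):
        out = vsm4key(k, b)
        round_keys += out
        k = out[4:]  # models copying the high 128 bits of Vc into Va
    return round_keys


print(len(expand_key([1, 2, 3, 4])))
```

Only the last four round keys of each execution are needed as input to the next one, which matches the described update of the high 128 bits of Va.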
[0040] The SM4 round function iteration instruction adopts the simple arithmetic instruction format in register form, specifically VSM4R Va,Vb,Vc. The instruction operates on the two operands in two 256-bit source registers Va and Vb and stores the result in a 256-bit destination register Vc. In the 32-bit instruction word, bits [31:26] are the 6-bit opcode; bits [25:21] select one register from the register file of 32 256-bit registers as the source register Va; bits [20:16] select one register from the same register file as the source register Vb; bits [15:13] are always "0"; bits [12:5] are the 8-bit function code that determines the specific function of the instruction; and bits [4:0] select one register from the same register file as the destination register Vc, which stores the result of the instruction.
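The register-format layout can be illustrated from the decoding side. The following sketch splits a 32-bit instruction word into the fields listed above and rejects words whose reserved bits [15:13] are not zero; the function name and the field values in the example are hypothetical, since the text does not give concrete opcode or function code values.

```python
def decode_vsm4r(word):
    """Split a register-format instruction word (VSM4R Va,Vb,Vc) into fields.

    Layout from the text: [31:26] opcode, [25:21] Va, [20:16] Vb,
    [15:13] always zero, [12:5] function code, [4:0] Vc.
    """
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    if (word >> 13) & 0x7:
        raise ValueError("bits [15:13] must be zero in the register format")
    return {
        "opcode": (word >> 26) & 0x3F,
        "va": (word >> 21) & 0x1F,
        "vb": (word >> 16) & 0x1F,
        "func": (word >> 5) & 0xFF,
        "vc": word & 0x1F,
    }


# Hypothetical field values, for illustration only.
example = (0x1A << 26) | (4 << 21) | (5 << 16) | (0x41 << 5) | 6
print(decode_vsm4r(example))
```

Compared with the immediate format, the 8-bit immediate field is replaced by a 5-bit register index plus three zero bits, so both formats share the same opcode, function code, and destination positions.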
[0041] The SM4 round function iteration instruction operates as follows: based on the current four intermediate words W3~W0 and the eight round keys rk7~rk0 used in the subsequent eight rounds, it generates the four intermediate words $W_{11}$~$W_8$ obtained after eight round iterations of the SM4 block cipher algorithm, and it performs two such 128-bit operations in parallel. W3~W0 are stored in the high or low 128 bits of the source register Va, rk7~rk0 are stored in the source register Vb, and the resulting $W_{11}$~$W_8$ are stored in the corresponding high or low 128 bits of the destination register Vc. In the 8 iterations, for j equal to 0, 1, 2, 3, 4, 5, 6, 7, the generation logic of the result $W_{j+4}$ is: $W_{j+4}$ = $W_j$ XOR Temp2 XOR (Temp2<<<2) XOR (Temp2<<<10) XOR (Temp2<<<18) XOR (Temp2<<<24), where XOR represents bitwise XOR, <<< represents circular left shift, and Temp2 is a 32-bit intermediate variable word whose generation logic is: Temp2[31:0] = SBOX(Temp1). The SBOX(X) function performs table lookups in parallel on the 4 bytes of the 32-bit data X to obtain a new 32-bit data. Temp1 is a 32-bit intermediate variable word whose generation logic is: Temp1[31:0] = $W_{j+1}$ XOR $W_{j+2}$ XOR $W_{j+3}$ XOR $rk_j$. Executing the SM4 round function iteration instruction once completes 8 iterations of the encryption / decryption round function for two sets of SM4 block cipher data. The instruction is executed 4 times in sequence; each time, the source register Vb is updated with 8 new round keys and the source register Va is updated with the data in the destination register Vc produced by the previous execution, thereby generating the final 4 words of each of the two sets.
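The two-lane behavior of the round function iteration instruction can be modeled in the same way. In this sketch each 128-bit half of Va is represented as a list of four words and Vb as a list of eight round keys, so no particular bit layout inside the registers is implied. The S-box is a stand-in permutation (NOT the real SM4 table), and the round keys in the example are arbitrary values.

```python
MASK32 = 0xFFFFFFFF

# ASSUMPTION: stand-in byte permutation, NOT the real SM4 S-box table.
PLACEHOLDER_SBOX = [(7 * x + 3) % 256 for x in range(256)]


def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32


def sbox32(x):
    """SBOX(X): substitute each of the 4 bytes of X independently."""
    return sum(PLACEHOLDER_SBOX[(x >> s) & 0xFF] << s for s in (0, 8, 16, 24))


def vsm4r_lane(w, rks):
    """One 128-bit lane: w = [W0, W1, W2, W3], rks = [rk0..rk7] -> [W8..W11]."""
    ws = list(w)
    for j in range(8):
        temp1 = ws[j + 1] ^ ws[j + 2] ^ ws[j + 3] ^ rks[j]
        temp2 = sbox32(temp1)
        ws.append(ws[j] ^ temp2 ^ rotl32(temp2, 2) ^ rotl32(temp2, 10)
                  ^ rotl32(temp2, 18) ^ rotl32(temp2, 24))
    return ws[8:]


def vsm4r(va, vb):
    """Model of one execution: both lanes of va use the same 8 round keys in vb."""
    return tuple(vsm4r_lane(lane, vb) for lane in va)


def run_32_rounds(va, round_keys):
    """Four executions, then reverse-order output of each lane."""
    for n in range(4):
        va = vsm4r(va, round_keys[8 * n:8 * n + 8])
    return tuple(lane[::-1] for lane in va)


# Usage with arbitrary (hypothetical) round keys and two unrelated blocks.
keys = [(i * 0x9E3779B9 + 1) & MASK32 for i in range(32)]
blocks = ([1, 2, 3, 4], [5, 6, 7, 8])
cipher = run_32_rounds(blocks, keys)
print(run_32_rounds(cipher, keys[::-1]) == blocks)
```

Both lanes share the round keys in Vb, which is why the two blocks processed together must be encrypted or decrypted under the same key.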
[0042] The technical solution adopted by the present invention to solve its technical problem is: to provide an instruction set processor, including a register file, an SM4 round key generation instruction execution unit and an SM4 round function iteration instruction execution unit, wherein the SM4 round key generation instruction execution unit and the SM4 round function iteration instruction execution unit are placed on different execution pipelines and occupy different read and write ports of the register file respectively;
[0043] The SM4 round key generation instruction execution unit has:
[0044] Two input terminals, one for inputting a 128-bit operand A and the other for inputting an 8-bit immediate operand B;
[0045] One output terminal is used to output a 256-bit execution result;
[0046] The SM4 round key generation instruction execution unit directly implements shift operations and processes system parameters using hardware logic, and uses hardware lookup tables to implement SBOX operations and pipelined parallel processing of 8 round keys; the SM4 round key generation instruction execution unit can pipeline the SM4 round key generation instructions.
[0047] The SM4 round function iteration instruction execution unit has:
[0048] Two input terminals, one for inputting a 256-bit operand A and the other for inputting a 256-bit operand B;
[0049] One output terminal is used to output a 256-bit execution result.
[0050] The SM4 round function iteration instruction execution unit directly implements shift operations and processes system parameters using hardware logic, and uses hardware lookup tables to implement SBOX operations and pipelined parallel processing of 8 round functions. The SM4 round function iteration instruction execution unit can pipeline the SM4 round function iteration instructions.
[0051] The SM4 round key generation instruction execution unit is configured with 8 levels of iterative execution stations and 1 level of output station, with a total execution delay of 9 clock cycles; the SM4 round function iteration instruction execution unit is configured with 8 levels of iterative execution stations and 1 level of output station, with a total execution delay of 9 clock cycles; the instruction set processor supports the parallel pipelined execution of the SM4 round key generation instruction and the SM4 round function iteration instruction.
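The effect of the 9-cycle latency on a single pair of blocks can be estimated with a simple timeline. This is an idealized sketch, not a statement about the actual processor: it assumes each group of instructions issues as soon as the results it depends on are available, that instructions in the same group issue in the same cycle on the two pipelines, and that the general-purpose setup and final output steps cost nothing.

```python
LATENCY = 9  # 8 iterative execution stations + 1 output station, per the text


def schedule(dependent_groups, latency=LATENCY):
    """Return (issue_cycle, done_cycle, group) for a chain of dependent groups."""
    timeline = []
    cycle = 0
    for group in dependent_groups:
        timeline.append((cycle, cycle + latency, group))
        cycle += latency  # the next group needs this group's results
    return timeline


# Encryption: key generation n+1 overlaps with round function iteration n.
ENCRYPT = [
    ("VSM4KEY#1",),
    ("VSM4KEY#2", "VSM4R#1"),
    ("VSM4KEY#3", "VSM4R#2"),
    ("VSM4KEY#4", "VSM4R#3"),
    ("VSM4R#4",),
]

# Decryption: all round keys are generated before the first round iteration.
DECRYPT = [(f"VSM4KEY#{n}",) for n in range(1, 5)] + \
          [(f"VSM4R#{n}",) for n in range(1, 5)]

for issue, done, group in schedule(ENCRYPT):
    print(f"cycle {issue:2d} -> {done:2d}: {' || '.join(group)}")
```

In this model a single dependent chain leaves most pipeline stations idle at any moment, which is the capacity that additional, data-independent blocks can fill when the instructions are pipelined.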
[0052] Beneficial effects
[0053] By adopting the above-mentioned technical solution, the present invention has the following advantages and positive effects compared with the prior art:
[0054] This invention employs the SM4 round key generation instruction (VSM4KEY) of the multi-round key parallel generation algorithm to achieve parallel generation of multiple round keys, and the SM4 round function iteration instruction (VSM4R) of the multi-round SM4 iterative parallel execution algorithm to achieve multi-round iterative parallel execution of SM4 encryption and decryption round functions, which greatly accelerates the execution speed of the SM4 block cipher algorithm. Using the SM4 extended instruction set of this invention to write SM4 block cipher algorithm programs can complete all the functions of the SM4 key expansion algorithm and the SM4 encryption and decryption algorithm in the SM4 block cipher algorithm, which significantly simplifies the software program, facilitates algorithm writing, and reduces the storage overhead of the algorithm.
[0055] This invention fully realizes the parallel potential of the SM4 block cipher algorithm and the SM4 extended instruction set by using parallel pipeline and instruction-level parallelism techniques, which significantly accelerates the execution speed of the SM4 block cipher algorithm.
[0056] In this invention, the execution delay of both the VSM4KEY and VSM4R instructions is 9 clock cycles, and two pipelines are provided that support parallel pipelined execution of the instructions. Execution speed is improved by implementing the shift operations of the algorithm directly in hardware logic, by handling the system parameters, fixed parameters and S-box values in hardware logic, and by implementing the SBOX(x) function with a hardware lookup table. Using this processor, round key generation and the encryption (decryption) iterations for 9 mutually unrelated sets of data can be completed in as few as 54 (81) clock cycles, which greatly improves the execution speed of the SM4 block cipher algorithm.
[0057] This invention realizes the parallel potential of the key expansion algorithm and the encryption / decryption algorithm in the SM4 block cipher algorithm, and has the advantages of easy portability and good scalability, making it easy to integrate into the existing execution components of general-purpose processors. It can be applied to RISC processors or dedicated cryptographic chips to improve their performance in executing the SM4 block cipher algorithm.
Attached Figure Description
[0058] Figure 1 is a flowchart of the execution process of the SM4 extended instruction set;
[0059] Figure 2 is a flowchart of the encryption algorithm implementation in the method for accelerating the SM4 block cipher algorithm;
[0060] Figure 3 is a flowchart of the decryption algorithm implementation in the method for accelerating the SM4 block cipher algorithm;
[0061] Figure 4 is a flowchart of the multiple-SM4-round-key parallel generation algorithm;
[0062] Figure 5 is a flowchart of the multi-round SM4 iterative parallel execution algorithm;
[0063] Figure 6 shows the simple arithmetic instruction format in immediate form;
[0064] Figure 7 shows the simple arithmetic instruction format in register form;
[0065] Figure 8 is a structural diagram of the VSM4KEY instruction execution unit;
[0066] Figure 9 is a structural diagram of the VSM4R instruction execution unit;
[0067] Figure 10 is a block diagram of the processor core or processor execution pipeline of an embodiment of the present invention.
Detailed Implementation
[0068] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be understood that after reading the teachings of this invention, those skilled in the art can make various alterations or modifications to the invention, and these equivalent forms also fall within the scope defined by the appended claims.
[0069] The SM4 block cipher algorithm mainly consists of two parts: the SM4 key expansion algorithm and the SM4 encryption / decryption algorithm. Both involve 32 rounds of nonlinear iterative operations. The core of the algorithm lies in the round functions of the key expansion algorithm and the encryption / decryption algorithm. Therefore, the key to accelerating the SM4 block cipher algorithm is to fully explore and realize the inherent parallelism of the round functions of the key expansion algorithm and the encryption / decryption algorithm, and to achieve parallel execution of the round functions as much as possible.
[0070] The inventors of this invention discovered parallel potential in both the round key generation of the SM4 key expansion algorithm and the round iteration of the SM4 encryption / decryption algorithm. Parallelism of the round functions can be achieved using dedicated instructions, enabling the parallel generation of multiple round keys and the completion of multiple round iterations of the encryption / decryption round functions in a single operation. Dedicated circuitry can support the parallel pipelined execution of the SM4 round key generation instruction (VSM4KEY) and the SM4 round function iteration instruction (VSM4R). Furthermore, the round functions of the SM4 key expansion algorithm and the SM4 encryption algorithm contain numerous shift and XOR operations. Directly implementing shift and XOR operations using hardware logic can effectively accelerate the execution of the round functions. Implementing system parameters, fixed parameters, and the SBOX(x) function processing using hardware logic can significantly accelerate parameter processing and the execution of the nonlinear transformation τ. Pipeline execution technology allows for the parallel execution of multiple unrelated SM4 block cipher algorithms.
[0071] The embodiments of the present invention relate to an acceleration method for the SM4 block cipher algorithm. The method is based on the SM4 extended instruction set, which adopts a RISC architecture. All instructions use a fixed-length 32-bit format, and both source operands and results are 256 bits wide. As shown in Figure 1, the SM4 extended instruction set includes an SM4 round key generation instruction (VSM4KEY) for accelerating the SM4 key expansion algorithm and an SM4 round function iteration instruction (VSM4R) for accelerating the SM4 encryption and decryption algorithm. Parallel pipelining and instruction-level parallelism are used to accelerate the implementation of the SM4 block cipher algorithm, with the VSM4KEY and VSM4R instructions executed in parallel pipelines. Multiple data-independent SM4 block cipher computations can occupy different execution stations of the pipeline at the same time, each completing its own round key expansion and encryption / decryption.
[0072] When parallel pipelining and instruction-level parallelism are used to accelerate the implementation of the SM4 block cipher algorithm, the specific process of the encryption algorithm in the method for accelerating the SM4 block cipher algorithm is shown in Figure 2 and includes the following steps:
[0073] 1) Generate the initial values (K3, K2, K1, K0) of the round key iteration using general-purpose instructions in a general-purpose processor;
[0074] 2) Execute the first SM4 round key generation instruction (VSM4KEY) with the initial iteration value (K3,K2,K1,K0) and the fixed system parameters CK7~CK0 as input to generate the 8 round keys rk7~rk0 of the SM4 block cipher algorithm;
[0075] 3) Using the round keys rk7~rk4 generated in step 2) and the system fixed parameters CK15~CK8 as input, execute the second SM4 round key generation instruction (VSM4KEY) to generate the eight round keys rk15~rk8 of the SM4 block cipher algorithm. Simultaneously, using the round keys rk7~rk0 generated in step 2) and two sets of unrelated plaintext (W3,W2,W1,W0,W'3,W'2,W'1,W'0) as input, execute the first SM4 round function iteration instruction (VSM4R) to complete round function iterations 1 to 8 of the SM4 encryption algorithm, obtaining the iteration working words (W3^(7),W2^(7),W1^(7),W0^(7),W'3^(7),W'2^(7),W'1^(7),W'0^(7));
[0076] 4) Using the round keys rk15~rk12 generated in step 3) and the system fixed parameters CK23~CK16 as input, execute the third SM4 round key generation instruction (VSM4KEY) to generate the eight round keys rk23~rk16 of the SM4 block cipher algorithm. Simultaneously, using the round keys rk15~rk8 generated in step 3) and the iteration working words (W3^(7),W2^(7),W1^(7),W0^(7),W'3^(7),W'2^(7),W'1^(7),W'0^(7)) as input, execute the second SM4 round function iteration instruction (VSM4R) to complete round function iterations 9 to 16 of the SM4 encryption algorithm, obtaining the iteration working words (W3^(15),W2^(15),W1^(15),W0^(15),W'3^(15),W'2^(15),W'1^(15),W'0^(15));
[0077] 5) Using the round keys rk23~rk20 generated in step 4) and the system fixed parameters CK31~CK24 as input, execute the fourth SM4 round key generation instruction (VSM4KEY) to generate the eight round keys rk31~rk24 of the SM4 block cipher algorithm. Simultaneously, using the round keys rk23~rk16 generated in step 4) and the iteration working words (W3^(15),W2^(15),W1^(15),W0^(15),W'3^(15),W'2^(15),W'1^(15),W'0^(15)) as input, execute the third SM4 round function iteration instruction (VSM4R) to complete round function iterations 17 to 24 of the SM4 encryption algorithm, obtaining the iteration working words (W3^(23),W2^(23),W1^(23),W0^(23),W'3^(23),W'2^(23),W'1^(23),W'0^(23));
[0078] 6) Using the round keys rk31~rk24 generated in step 5) and the iteration working words (W3^(23),W2^(23),W1^(23),W0^(23),W'3^(23),W'2^(23),W'1^(23),W'0^(23)) as input, execute the fourth SM4 round function iteration instruction (VSM4R) to complete round function iterations 25 to 32 of the SM4 encryption algorithm, obtaining the iteration working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31));
[0079] 7) Output the iteration working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31)) from step 6) in reverse order to obtain the ciphertext (Y3,Y2,Y1,Y0,Y'3,Y'2,Y'1,Y'0) of the encryption algorithm.
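The interleaving in the encryption steps above can be summarized with a small scheduling sketch. It performs no cryptography; it only lists which instructions the text issues together in each step. The function name encryption_schedule and the tuple layout are illustrative assumptions, not part of the patent.

```python
def encryption_schedule():
    """Return, for steps 2) to 6), the instructions that are issued together."""
    steps = []
    for s in range(5):
        issued = []
        if s < 4:
            # VSM4KEY number s+1 consumes CK[8s+7..8s] and produces rk[8s+7..8s].
            issued.append(("VSM4KEY", f"CK{8*s+7}~CK{8*s}", f"rk{8*s+7}~rk{8*s}"))
        if s >= 1:
            # VSM4R number s consumes the round keys produced one step earlier.
            r = s - 1
            issued.append(("VSM4R", f"rk{8*r+7}~rk{8*r}", f"rounds {8*r+1}-{8*r+8}"))
        steps.append(issued)
    return steps


for number, issued in enumerate(encryption_schedule(), start=2):
    print(f"step {number}):", "; ".join(" ".join(item) for item in issued))
```

Steps 3) to 5) each contain one VSM4KEY and one VSM4R, which is the overlap that the method relies on; steps 2) and 6) contain a single instruction.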
[0080] The specific process of the decryption algorithm in the method for accelerating the SM4 block cipher algorithm is shown in Figure 3 and includes the following steps:
[0081] 1) Generate the initial values (K3, K2, K1, K0) of the round key iteration using general-purpose instructions in a general-purpose processor;
[0082] 2) Using the initial values (K3, K2, K1, K0) and the system fixed parameters CK31~CK0 as source operand data, execute the SM4 round key generation instruction (VSM4KEY) 4 times in sequence to generate the 32 round keys rk31~rk0 of the SM4 block cipher algorithm;
[0083] 3) Using the round keys rk31~rk0 generated in step 2) and the ciphertext (Y3,Y2,Y1,Y0,Y'3,Y'2,Y'1,Y'0) as input, execute the SM4 round function iteration instruction (VSM4R) 4 times in sequence to complete the 32 round function iterations of the SM4 decryption algorithm, obtaining the iteration working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31));
[0084] 4) Output the iteration working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31)) from step 3) in reverse order to obtain the plaintext result of the decryption algorithm (W3,W2,W1,W0,W'3,W'2,W'1,W'0).
[0085] The SM4 round key generation instruction (VSM4KEY) employs a parallel generation algorithm for multiple SM4 round keys; a single execution generates eight round keys of the SM4 block cipher algorithm. As shown in Figure 4, the algorithm takes as inputs the previous four 32-bit intermediate keys K3~K0 and the eight 32-bit fixed algorithm parameters CK7~CK0 associated with the subsequent eight round keys, and completes the generation of eight round keys in one execution during SM4 round key expansion.
[0086] The SM4 round function iteration instruction (VSM4R) employs a multi-round SM4 iterative parallel execution algorithm, completing eight iterations of the SM4 encryption/decryption round function in a single execution. As shown in Figure 5, the algorithm takes as inputs two unrelated sets of the current four intermediate words, W3~W0 and W'3~W'0, together with the eight round keys rk7~rk0 used in the subsequent eight rounds, and produces the two unrelated sets of four intermediate words that result from eight rounds of the SM4 encryption/decryption round function.
[0087] The SM4 round key generation instruction (VSM4KEY) uses the simple arithmetic instruction format with an immediate value. The instruction format is VSM4KEY Va,#b,Vc: the instruction operates on the operand in a 256-bit source register Va and an 8-bit immediate operand, and stores the result in a 256-bit destination register Vc. As shown in Figure 6, bits [31:26] of the 32-bit instruction hold the 6-bit opcode; bits [25:21] select one of the 32 256-bit registers of the register file as the source register Va, which holds the source operand of the instruction; bits [20:13] hold the 8-bit immediate operand, which indicates the number of loop iterations; bits [12:5] hold the 8-bit function code, which determines the specific function of the instruction; and bits [4:0] select one of the 32 256-bit registers of the register file as the destination register Vc, which receives the result of the instruction.
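The field layout of VSM4KEY described above can be checked with a short packing sketch. The bit positions come from the text; the opcode and function code values below are hypothetical placeholders, since the text does not state them, and the function names are illustrative.

```python
OPCODE_SM4 = 0x1A    # hypothetical 6-bit opcode; the real value is not given in the text
FUNC_VSM4KEY = 0x50  # hypothetical 8-bit function code; also not given in the text


def encode_vsm4key(va, imm8, vc, opcode=OPCODE_SM4, func=FUNC_VSM4KEY):
    """Pack VSM4KEY Va,#b,Vc into a 32-bit word using the described bit fields."""
    if not (0 <= va < 32 and 0 <= vc < 32):
        raise ValueError("register numbers must be in 0..31")
    if not 0 <= imm8 < 256:
        raise ValueError("immediate must fit in 8 bits")
    if not (0 <= opcode < 64 and 0 <= func < 256):
        raise ValueError("opcode is 6 bits wide and the function code is 8 bits wide")
    # [31:26] opcode | [25:21] Va | [20:13] #b | [12:5] function code | [4:0] Vc
    return (opcode << 26) | (va << 21) | (imm8 << 13) | (func << 5) | vc


def decode_vsm4key(word):
    """Split a 32-bit word back into the five VSM4KEY fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "va": (word >> 21) & 0x1F,
        "imm8": (word >> 13) & 0xFF,
        "func": (word >> 5) & 0xFF,
        "vc": word & 0x1F,
    }


word = encode_vsm4key(va=1, imm8=2, vc=3)
print(f"{word:#010x}", decode_vsm4key(word))
```

The five field widths (6 + 5 + 8 + 8 + 5) add up to exactly 32 bits, so the format has no spare bits.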
[0088] The function of the SM4 round key generation instruction (VSM4KEY) is to generate the subsequent eight round keys rk7 to rk0 of the SM4 block cipher algorithm based on the previous four intermediate keys K3 to K0. K3 to K0 are stored in the high 128 bits of the source register Va, the algorithm's fixed parameters CK7 to CK0 are determined by the immediate value #b, and the generated results rk7 to rk0 are stored in the destination register Vc. The operations performed by the SM4 round key generation instruction (VSM4KEY) are as follows:
[0089]
[0090]
[0091] The SELCK(#b) function determines the fixed parameters CKi used in the key expansion process of the SM4 block cipher algorithm. The value of #b selects the eight 32-bit values CK7 to CK0. The specific values are shown in Table 1, where all data are hexadecimal numbers.
[0092] Table 1. Values of the SELCK(#b) function in the SM4 round key generation instruction (VSM4KEY)
[0093]
[0094] Executing the SM4 round key generation instruction (VSM4KEY) once generates 8 round keys of the SM4 block cipher algorithm. Executing the instruction 4 times in sequence, each time updating Va with the high 128 bits of the generated Vc and incrementing the immediate value #b by 1, generates the 32 round keys rk31~rk0 of the SM4 block cipher algorithm.
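As an illustration of what SELCK(#b) selects, the sketch below computes the fixed parameters from the formula in the public SM4 standard, where byte j of CKi equals (4i + j) × 7 mod 256. It is not a reproduction of Table 1. The mapping of #b to the group CK(8b+7)~CK(8b) is an assumption inferred from the immediates 0 to 3 used in the embodiment, and the names ck and selck are illustrative.

```python
def ck(i):
    """Fixed parameter CK_i of SM4: byte j (most significant first) is (4*i + j) * 7 mod 256."""
    word = 0
    for j in range(4):
        word = (word << 8) | ((4 * i + j) * 7 % 256)
    return word


def selck(b):
    """Assumed model of SELCK(#b): return [CK_(8b+7), ..., CK_(8b)] for b in 0..3."""
    if b not in range(4):
        raise ValueError("only immediates 0..3 are used for the 32 SM4 round keys")
    return [ck(8 * b + k) for k in range(7, -1, -1)]


for b in range(4):
    print(b, " ".join(f"{c:08x}" for c in selck(b)))
```

Because the constants follow a closed formula, a hardware implementation could either store them in a lookup table or generate them from #b; the text only says that dedicated hardware logic handles them.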
[0095] The SM4 round function iteration instruction (VSM4R) uses the register-based simple arithmetic instruction format. The instruction format is VSM4R Va,Vb,Vc: the instruction operates on the two operands in the 256-bit source registers Va and Vb and stores the result in a 256-bit destination register Vc. As shown in Figure 7, bits [31:26] of the 32-bit instruction hold the 6-bit opcode; bits [25:21] select one of the 32 256-bit registers of the register file as the source register Va; bits [20:16] select one of the 32 256-bit registers of the register file as the source register Vb; bits [15:13] are always all "0"; bits [12:5] hold the 8-bit function code, which determines the specific function of the instruction; and bits [4:0] select one of the 32 256-bit registers of the register file as the destination register Vc, which receives the result of the instruction.
[0096] The function of the SM4 round function iteration instruction (VSM4R) is to generate the four intermediate words W11~W8 obtained after eight rounds of the SM4 block cipher algorithm, based on the current four intermediate words W3~W0 and the eight round keys rk7~rk0 used in the subsequent eight rounds. Two 128-bit operations can be performed in parallel: W3~W0 are stored in the high or low 128 bits of the source register Va, rk7~rk0 are stored in the source register Vb, and the resulting W11~W8 are stored in the corresponding high or low 128 bits of the destination register Vc. The operation performed by the SM4 round function iteration instruction (VSM4R) is as follows:
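The VSM4R layout differs from VSM4KEY in that it carries a second register field and three bits that must be zero. A decoder sketch that enforces this is shown below; the FIELDS table and function names are illustrative, and any opcode or function code passed in is a hypothetical value, since the text does not give the real encodings.

```python
# (name, lowest bit, width) for each field of the VSM4R format described in the text
FIELDS = (("opcode", 26, 6), ("va", 21, 5), ("vb", 16, 5),
          ("reserved", 13, 3), ("func", 5, 8), ("vc", 0, 5))


def decode_vsm4r(word):
    """Split a 32-bit VSM4R word into fields and reject nonzero bits [15:13]."""
    if not 0 <= word < 1 << 32:
        raise ValueError("instruction word must fit in 32 bits")
    fields = {name: (word >> shift) & ((1 << width) - 1)
              for name, shift, width in FIELDS}
    if fields["reserved"] != 0:
        raise ValueError("bits [15:13] must be zero")
    return fields


def encode_vsm4r(opcode, va, vb, func, vc):
    """Pack VSM4R Va,Vb,Vc; opcode and func values are supplied by the caller."""
    values = {"opcode": opcode, "va": va, "vb": vb,
              "reserved": 0, "func": func, "vc": vc}
    word = 0
    for name, shift, width in FIELDS:
        if not 0 <= values[name] < 1 << width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= values[name] << shift
    return word


print(decode_vsm4r(encode_vsm4r(opcode=0x1A, va=3, vb=4, func=0x51, vc=5)))
```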
[0097]
[0098] A single execution of the SM4 round function iteration instruction (VSM4R) completes 8 iterations of the encryption/decryption round function for two independent SM4 blocks. Before each execution, Vb is updated with 8 new round keys and Va is updated with the previously generated Vc. Executing the instruction 4 times in sequence produces the final four words of both SM4 blocks.
[0099] In the operations defined by the SM4 extended instruction set, the SBOX(X) function performs a parallel table lookup on the four bytes of the 32-bit data X (X[31:24], X[23:16], X[15:8], X[7:0]) to obtain a new 32-bit value, i.e., SBOX(X) = {SBOX(X[31:24]), SBOX(X[23:16]), SBOX(X[15:8]), SBOX(X[7:0])}. The specific values are shown in Table 2, where all data are hexadecimal numbers. For example, if an input byte is 0xef, the result of the SBOX lookup is the value in row e, column f of the table: SBOX(0xef) = 0x84.
[0100] Table 2. Byte-based lookup table for SBOX(X) in the SM4 extended instruction set.
[0101]
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0   d6 90 e9 fe cc e1 3d b7 16 b6 14 c2 28 fb 2c 05
1   2b 67 9a 76 2a be 04 c3 aa 44 13 26 49 86 06 99
2   9c 42 50 f4 91 ef 98 7a 33 54 0b 43 ed cf ac 62
3   e4 b3 1c a9 c9 08 e8 95 80 df 94 fa 75 8f 3f a6
4   47 07 a7 fc f3 73 17 ba 83 59 3c 19 e6 85 4f a8
5   68 6b 81 b2 71 64 da 8b f8 eb 0f 4b 70 56 9d 35
6   1e 24 0e 5e 63 58 d1 a2 25 22 7c 3b 01 21 78 87
7   d4 00 46 57 9f d3 27 52 4c 36 02 e7 a0 c4 c8 9e
8   ea bf 8a d2 40 c7 38 b5 a3 f7 f2 ce f9 61 15 a1
9   e0 ae 5d a4 9b 34 1a 55 ad 93 32 30 f5 8c b1 e3
a   1d f6 e2 2e 82 66 ca 60 c0 29 23 ab 0d 53 4e 6f
b   d5 db 37 45 de fd 8e 2f 03 ff 6a 72 6d 6c 5b 51
c   8d 1b af 92 bb dd bc 7f 11 d9 5c 41 1f 10 5a d8
d   0a c1 31 88 a5 cd 7b bd 2d 74 d0 12 b8 e5 b4 b0
e   89 69 97 4a 0c 96 77 7e 65 b9 f1 09 c5 6e c6 84
f   18 f0 7d ec 3a dc 4d 20 79 ee 5f 3e d7 cb 39 48
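Taken together, the instruction descriptions above can be modeled in software. The following is a minimal functional sketch for a single 128-bit lane, not the patented hardware: the round and key-expansion formulas and the FK constants come from the public SM4 standard (the patent's own operation listings are not reproduced in this text), the S-box is Table 2, and all function names are illustrative. For decryption, the sketch assumes that the round keys are supplied to the VSM4R model in reverse order, which the text does not spell out.

```python
SBOX = bytes.fromhex(
    "d690e9fecce13db716b614c228fb2c05" "2b679a762abe04c3aa44132649860699"
    "9c4250f491ef987a33540b43edcfac62" "e4b31ca9c908e89580df94fa758f3fa6"
    "4707a7fcf37317ba83593c19e6854fa8" "686b81b27164da8bf8eb0f4b70569d35"
    "1e240e5e6358d1a225227c3b01217887" "d40046579fd327524c3602e7a0c4c89e"
    "eabf8ad240c738b5a3f7f2cef96115a1" "e0ae5da49b341a55ad933230f58cb1e3"
    "1df6e22e8266ca60c02923ab0d534e6f" "d5db3745defd8e2f03ff6a726d6c5b51"
    "8d1baf92bbddbc7f11d95c411f105ad8" "0ac13188a5cd7bbd2d74d012b8e5b4b0"
    "8969974a0c96777e65b9f109c56ec684" "18f07dec3adc4d2079ee5f3ed7cb3948"
)
FK = (0xA3B1BAC6, 0x56AA3350, 0x677D9197, 0xB27022DC)  # system parameters from the SM4 standard
MASK = 0xFFFFFFFF


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def sbox32(x):
    """SBOX(X): look up each of the four bytes of X in Table 2."""
    return int.from_bytes(bytes(SBOX[b] for b in x.to_bytes(4, "big")), "big")


def ck(i):
    """Fixed parameter CK_i: byte j is (4*i + j) * 7 mod 256."""
    return int.from_bytes(bytes((4 * i + j) * 7 % 256 for j in range(4)), "big")


def vsm4key(k, b):
    """Model of VSM4KEY: four intermediate keys (oldest first) -> eight round keys."""
    k = list(k)
    for i in range(8):
        t = sbox32(k[-3] ^ k[-2] ^ k[-1] ^ ck(8 * b + i))
        k.append(k[-4] ^ t ^ rotl(t, 13) ^ rotl(t, 23))
    return k[4:]


def vsm4r(w, round_keys):
    """Model of one VSM4R lane: four working words plus eight round keys -> four words."""
    w = list(w)
    for rk in round_keys:
        t = sbox32(w[-3] ^ w[-2] ^ w[-1] ^ rk)
        w.append(w[-4] ^ t ^ rotl(t, 2) ^ rotl(t, 10) ^ rotl(t, 18) ^ rotl(t, 24))
    return w[-4:]


def words(block):
    """Split 16 bytes into four big-endian 32-bit words."""
    return [int.from_bytes(block[i:i + 4], "big") for i in range(0, 16, 4)]


def expand(key):
    """Four VSM4KEY calls; each call is fed the last four keys of the previous result."""
    k = [m ^ f for m, f in zip(words(key), FK)]
    groups = []
    for b in range(4):
        group = vsm4key(k, b)
        groups.append(group)
        k = group[4:]
    return groups


def crypt(key, block, decrypt=False):
    """Four VSM4R calls followed by the reverse-order output step."""
    groups = expand(key)
    if decrypt:  # assumption: keys are presented to VSM4R in reverse order
        groups = [group[::-1] for group in reversed(groups)]
    w = words(block)
    for group in groups:
        w = vsm4r(w, group)
    return b"".join(x.to_bytes(4, "big") for x in reversed(w))


key = bytes.fromhex("0123456789abcdeffedcba9876543210")
print(crypt(key, key).hex())
```

The example encrypts the sample block from the SM4 standard. A second lane would simply apply vsm4r to a second, unrelated block with the same round keys, which is what the 256-bit Va operand provides in hardware.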
[0102] Embodiments of the present invention also relate to an instruction set processor which, as shown in Figure 10, includes a register file, an SM4 round key generation instruction execution unit, and an SM4 round function iteration instruction execution unit. The two execution units are placed on different execution pipelines and occupy different read and write ports of the register file. (They can also be placed on the same execution pipeline and share the register file read and write ports, but then a VSM4KEY instruction and a VSM4R instruction cannot start or complete at the same time.) The execution latency of both the VSM4KEY instruction and the VSM4R instruction is 9 clock cycles, and parallel pipelined execution of the two instructions is supported, thereby achieving higher computing speed.
[0103] The VSM4KEY instruction execution unit receives and executes VSM4KEY instructions, as shown in Figure 8. Its inputs are a 128-bit operand A (the previous four intermediate keys, from the register file) and an 8-bit immediate operand B (used for a hardware lookup to obtain the system parameters CKi+7~CKi+0). Its output is a 256-bit execution result (written back to the register file), namely eight round keys of the SM4 block cipher algorithm. The unit is pipelined: one execution completes one VSM4KEY instruction (generating eight round keys), and four consecutive executions generate the 32 round keys rk31~rk0 of the SM4 block cipher algorithm. The unit has 8 iteration execution stations and 1 output station, for a total execution latency of 9 clock cycles. It supports pipelined execution and improves hardware execution speed by implementing shift operations directly in hardware logic, processing system parameters with dedicated hardware logic, implementing the SBOX operation as a hardware lookup table, and processing the 8 round keys in a pipelined, parallel manner.
[0104] The VSM4R instruction execution unit receives and executes VSM4R instructions, as shown in Figure 9. Its inputs are a 256-bit operand A (two unrelated sets of intermediate iteration working words from the register file, each set consisting of four 32-bit words) and a 256-bit operand B (eight 32-bit round keys from the register file). Its output is a 256-bit execution result (written back to the register file), namely the two unrelated sets of iteration working words updated after eight rounds of iteration (each set consisting of four 32-bit SM4 iteration working words). The unit is pipelined: one execution completes one VSM4R instruction, that is, 8 iterations of the SM4 encryption/decryption round function, and four consecutive executions produce the final iteration results for two unrelated SM4 blocks (each consisting of four 32-bit SM4 iteration working words). The unit has 8 iteration execution stations and 1 output station, for a total execution latency of 9 clock cycles. It supports pipelined execution and improves hardware execution speed by implementing shift operations directly in hardware logic, processing system parameters with dedicated hardware logic, implementing the SBOX operation as a hardware lookup table, and processing the 8 round functions in a pipelined, parallel manner, thereby enhancing the acceleration effect.
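The timing behavior of one such execution unit (8 iteration stations plus 1 output station) can be illustrated with a toy cycle-level model. It is a sketch under simplifying assumptions: at most one instruction enters the unit per cycle, there are no stalls, and the labels and the function name run_pipeline are hypothetical.

```python
from collections import deque

STAGES = 9  # 8 iteration execution stations + 1 output station


def run_pipeline(issue_cycle):
    """Map each instruction label to the cycle at whose end its result is available.

    issue_cycle maps a label to the cycle (>= 1) in which it enters station 1.
    """
    if any(cycle < 1 for cycle in issue_cycle.values()):
        raise ValueError("issue cycles start at 1")
    entering = {cycle: label for label, cycle in issue_cycle.items()}
    if len(entering) != len(issue_cycle):
        raise ValueError("only one instruction can enter the unit per cycle")
    pipe = deque([None] * STAGES)
    done = {}
    cycle = 0
    while len(done) < len(issue_cycle):
        cycle += 1
        pipe.pop()                            # drop the result written back last cycle
        pipe.appendleft(entering.get(cycle))  # a new instruction enters station 1
        if pipe[-1] is not None:              # instruction now in the output station
            done[pipe[-1]] = cycle
    return done


print(run_pipeline({"stream A, VSM4KEY 1": 1, "stream B, VSM4KEY 1": 2}))
```

An instruction issued in cycle 1 completes at the end of cycle 9, and an independent instruction issued one cycle later completes one cycle later, which is why the idle stations can be filled with work from other, data-independent SM4 streams.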
[0105] The present invention will be further illustrated below through a specific embodiment: an acceleration method for the SM4 block cipher algorithm in a general-purpose processor.
[0106] Before the first execution of the SM4 round key generation instruction (VSM4KEY) to expand the SM4 round key, the initial round key iteration value (K3, K2, K1, K0) is first generated using a general-purpose instruction in a general-purpose processor. The initial round key iteration value (K3, K2, K1, K0), plaintext, or ciphertext is then loaded into a register. The following details the process of accelerating the encryption algorithm of the SM4 block cipher using the method and processor of this invention:
[0107] (1) During clock cycles 1 to 9 (beats 1 to 9), the processor accelerating the SM4 block cipher algorithm executes the first SM4 round key generation instruction (VSM4KEY). The inputs to the VSM4KEY instruction unit are the initial round key iteration values {K3,K2,K1,K0} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000000. During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 9 (beat 9), the VSM4KEY instruction unit completes the first VSM4KEY instruction, generating the eight round keys {rk7~rk0} of the SM4 block cipher algorithm.
[0108] (2) During clock cycles 10 to 18 (beats 10 to 18), the processor accelerating the SM4 block cipher algorithm executes the second SM4 round key generation instruction (VSM4KEY) and the first SM4 round function iteration instruction (VSM4R) in parallel. The inputs to the VSM4KEY instruction unit are the round keys {rk7~rk4} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000001. The inputs to the VSM4R instruction unit are two unrelated 128-bit plaintext blocks {W3^(0)~W0^(0), W'3^(0)~W'0^(0)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk7~rk0} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 18 (beat 18), the VSM4KEY instruction unit completes the second VSM4KEY instruction, generating the next eight round keys {rk15~rk8} of the SM4 block cipher algorithm, and the VSM4R instruction unit completes the first VSM4R instruction, finishing round function iterations 1 to 8 of the SM4 encryption process and generating the intermediate iteration working words {W3^(7)~W0^(7), W'3^(7)~W'0^(7)}.
[0109] (3) During clock cycles 19 to 27 (beats 19 to 27), the processor accelerating the SM4 block cipher algorithm executes the third SM4 round key generation instruction (VSM4KEY) and the second SM4 round function iteration instruction (VSM4R) in parallel. The inputs to the VSM4KEY instruction unit are the round keys {rk15~rk12} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000010. The inputs to the VSM4R instruction unit are the two unrelated sets of 128-bit intermediate iteration working words {W3^(7)~W0^(7), W'3^(7)~W'0^(7)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk15~rk8} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 27 (beat 27), the VSM4KEY instruction unit completes the third VSM4KEY instruction, generating the next eight round keys {rk23~rk16} of the SM4 block cipher algorithm, and the VSM4R instruction unit completes the second VSM4R instruction, finishing round function iterations 9 to 16 of the SM4 encryption process and generating the intermediate iteration working words {W3^(15)~W0^(15), W'3^(15)~W'0^(15)}.
[0110] (4) During clock cycles 28 to 36 (beats 28 to 36), the processor accelerating the SM4 block cipher algorithm executes the fourth SM4 round key generation instruction (VSM4KEY) and the third SM4 round function iteration instruction (VSM4R) in parallel. The inputs to the VSM4KEY instruction unit are the round keys {rk23~rk20} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000011. The inputs to the VSM4R instruction unit are the two unrelated sets of 128-bit intermediate iteration working words {W3^(15)~W0^(15), W'3^(15)~W'0^(15)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk23~rk16} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 36 (beat 36), the VSM4KEY instruction unit completes the fourth VSM4KEY instruction, generating the next eight round keys {rk31~rk24} of the SM4 block cipher algorithm, and the VSM4R instruction unit completes the third VSM4R instruction, finishing round function iterations 17 to 24 of the SM4 encryption process and generating the intermediate iteration working words {W3^(23)~W0^(23), W'3^(23)~W'0^(23)}.
[0111] (5) During clock cycles 37 to 45 (beats 37 to 45), the processor accelerating the SM4 block cipher algorithm executes the fourth SM4 round function iteration instruction (VSM4R). The inputs to the VSM4R instruction unit are the two unrelated sets of 128-bit intermediate iteration working words {W3^(23)~W0^(23), W'3^(23)~W'0^(23)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk31~rk24} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 45 (beat 45), the VSM4R instruction unit completes the fourth VSM4R instruction, finishing round function iterations 25 to 32 of the SM4 encryption process and generating the intermediate iteration working words {W3^(31)~W0^(31), W'3^(31)~W'0^(31)}. At this point, the processor has completed one full pass of key expansion and encryption round function iteration for the SM4 block cipher algorithm. The ciphertext is obtained simply by outputting the intermediate iteration working words {W3^(31)~W0^(31), W'3^(31)~W'0^(31)} in reverse order.
[0112] Considering that processors accelerating the SM4 block cipher algorithm support fully pipelined parallel instruction execution, under continuous pipelined execution, the generation and encryption iteration of nine sets of unrelated SM4 block cipher round keys can be completed in as little as 54 cycles, greatly accelerating the execution speed of the SM4 block cipher algorithm. If multiple instruction execution units for accelerating the SM4 block cipher algorithm can be set up in the processor, the effect of accelerating the execution of the SM4 block cipher algorithm can be further improved.
[0113] The following details the process of accelerating the decryption algorithm of the SM4 block cipher using the method of this invention:
[0114] (1) During clock cycles 1 to 9 (beats 1 to 9), the processor accelerating the SM4 block cipher algorithm executes the first SM4 round key generation instruction (VSM4KEY). The inputs to the VSM4KEY instruction unit are the initial round key iteration values {K3,K2,K1,K0} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000000. During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 9 (beat 9), the VSM4KEY instruction unit completes the first VSM4KEY instruction, generating the eight round keys {rk7~rk0} of the SM4 block cipher algorithm.
[0115] (2) During clock cycles 10 to 18 (beats 10 to 18), the processor accelerating the SM4 block cipher algorithm executes the second SM4 round key generation instruction (VSM4KEY). The inputs to the VSM4KEY instruction unit are the round keys {rk7~rk4} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000001. During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 18 (beat 18), the VSM4KEY instruction unit completes the second VSM4KEY instruction, generating the next eight round keys {rk15~rk8} of the SM4 block cipher algorithm.
[0116] (3) During clock cycles 19 to 27 (beats 19 to 27), the processor accelerating the SM4 block cipher algorithm executes the third SM4 round key generation instruction (VSM4KEY). The inputs to the VSM4KEY instruction unit are the round keys {rk15~rk12} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000010. During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 27 (beat 27), the VSM4KEY instruction unit completes the third VSM4KEY instruction, generating the next eight round keys {rk23~rk16} of the SM4 block cipher algorithm.
[0117] (4) During clock cycles 28 to 36 (beats 28 to 36), the processor accelerating the SM4 block cipher algorithm executes the fourth SM4 round key generation instruction (VSM4KEY). The inputs to the VSM4KEY instruction unit are the round keys {rk23~rk20} (as the high 128 bits of source operand A of the VSM4KEY instruction) and the 8-bit immediate operand 8'b00000011. During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 36 (beat 36), the VSM4KEY instruction unit completes the fourth VSM4KEY instruction, generating the next eight round keys {rk31~rk24} of the SM4 block cipher algorithm.
[0118] (5) During clock cycles 37 to 45 (beats 37 to 45), the processor accelerating the SM4 block cipher algorithm executes the first SM4 round function iteration instruction (VSM4R). The inputs to the VSM4R instruction unit are two unrelated 128-bit ciphertext blocks {W3^(0)~W0^(0), W'3^(0)~W'0^(0)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk31~rk24} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 45 (beat 45), the VSM4R instruction unit completes the first VSM4R instruction, finishing round function iterations 1 to 8 of the SM4 decryption process and generating the intermediate iteration working words {W3^(7)~W0^(7), W'3^(7)~W'0^(7)}.
[0119] (6) During clock cycles 46 to 54 (beats 46 to 54), the processor accelerating the SM4 block cipher algorithm executes the second SM4 round function iteration instruction (VSM4R). The inputs to the VSM4R instruction unit are the two unrelated sets of 128-bit intermediate iteration working words {W3^(7)~W0^(7), W'3^(7)~W'0^(7)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk23~rk16} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 54 (beat 54), the VSM4R instruction unit completes the second VSM4R instruction, finishing round function iterations 9 to 16 of the SM4 decryption process and generating the intermediate iteration working words {W3^(15)~W0^(15), W'3^(15)~W'0^(15)}.
[0120] (7) During clock cycles 55 to 63 (beats 55 to 63), the processor accelerating the SM4 block cipher algorithm executes the third SM4 round function iteration instruction (VSM4R). The inputs to the VSM4R instruction unit are the two unrelated sets of 128-bit intermediate iteration working words {W3^(15)~W0^(15), W'3^(15)~W'0^(15)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk15~rk8} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 63 (beat 63), the VSM4R instruction unit completes the third VSM4R instruction, finishing round function iterations 17 to 24 of the SM4 decryption process and generating the intermediate iteration working words {W3^(23)~W0^(23), W'3^(23)~W'0^(23)}.
[0121] (8) During clock cycles 64 to 72 (beats 64 to 72), the processor accelerating the SM4 block cipher algorithm executes the fourth SM4 round function iteration instruction (VSM4R). The inputs to the VSM4R instruction unit are the two unrelated sets of 128-bit intermediate iteration working words {W3^(23)~W0^(23), W'3^(23)~W'0^(23)} (as source operand A of the VSM4R instruction) and the 256-bit set of round keys {rk7~rk0} (as source operand B of the VSM4R instruction). During execution, the idle execution stations of the VSM4KEY instruction unit and the VSM4R instruction unit can execute instructions from other, data-independent SM4 block cipher programs. At the end of clock cycle 72 (beat 72), the VSM4R instruction unit completes the fourth VSM4R instruction, finishing round function iterations 25 to 32 of the SM4 decryption process and generating the intermediate iteration working words {W3^(31)~W0^(31), W'3^(31)~W'0^(31)}. At this point, the processor has completed one full pass of key expansion and decryption round function iteration for the SM4 block cipher algorithm. The decrypted plaintext is obtained simply by outputting the intermediate iteration working words {W3^(31)~W0^(31), W'3^(31)~W'0^(31)} in reverse order.
[0122] Considering that processors accelerating the SM4 block cipher algorithm support parallel instruction pipelined execution, under continuous pipelined execution, the generation and decryption iteration of nine sets of unrelated SM4 block cipher round keys can be completed in as little as 81 cycles, greatly accelerating the execution speed of the SM4 block cipher decryption algorithm. If multiple instruction execution units for accelerating the SM4 block cipher algorithm can be set in the processor, the effect of accelerating the execution of the SM4 block cipher decryption algorithm can be further improved.
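The single-stream latencies in the two walkthroughs (45 cycles for encryption, 72 for decryption) follow from the data dependencies and the 9-cycle instruction latency. The sketch below recomputes them from a dependency list. It assumes that the two execution units can start instructions in the same cycle and ignores issue limits; it does not attempt to reproduce the multi-stream figures quoted in the text. The names finish_times and chain are illustrative.

```python
LATENCY = 9  # execution latency of VSM4KEY and VSM4R in clock cycles


def finish_times(deps):
    """deps maps each instruction to its prerequisites, listed in dependency order.

    An instruction starts in the cycle after its last prerequisite completes.
    """
    done = {}
    for name, prerequisites in deps.items():
        start = max((done[p] for p in prerequisites), default=0) + 1
        done[name] = start + LATENCY - 1
    return done


def chain(decrypt):
    """Dependencies of the four VSM4KEY and four VSM4R instructions of one stream."""
    deps = {}
    for i in range(1, 5):
        deps[f"KEY{i}"] = [f"KEY{i - 1}"] if i > 1 else []
    for i in range(1, 5):
        # Encryption consumes key groups in generation order; decryption in reverse.
        key = f"KEY{5 - i}" if decrypt else f"KEY{i}"
        deps[f"R{i}"] = [key] + ([f"R{i - 1}"] if i > 1 else [])
    return deps


print("encryption:", max(finish_times(chain(decrypt=False)).values()), "cycles")
print("decryption:", max(finish_times(chain(decrypt=True)).values()), "cycles")
```

Decryption is longer because its first VSM4R needs rk31~rk24, which only the fourth VSM4KEY produces, so no round function work can overlap with key expansion within a single stream.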
Claims
1. A method for accelerating the SM4 block cipher algorithm, characterized in that, based on an SM4 extended instruction set, parallel pipelines and instruction-level parallelism are used to accelerate the implementation of the SM4 block cipher algorithm. The SM4 block cipher algorithm includes an SM4 key expansion algorithm and an SM4 encryption/decryption algorithm. The SM4 extended instruction set adopts a RISC architecture, with instructions in a fixed-length 32-bit format and with source and destination operands of 256 bits. The SM4 extended instruction set includes an SM4 round key generation instruction and an SM4 round function iteration instruction. The SM4 round key generation instruction employs a parallel generation algorithm for multiple SM4 round keys to accelerate the SM4 key expansion algorithm; the algorithm takes the previous four 32-bit intermediate keys K3~K0 and the eight 32-bit fixed algorithm parameters CK7~CK0 associated with the subsequent eight round keys as inputs and, during SM4 round key expansion, completes the generation of eight round keys in one execution. The SM4 round function iteration instruction employs a multi-round SM4 iterative parallel execution algorithm to accelerate the SM4 encryption/decryption algorithm; the algorithm takes two unrelated sets of the current four intermediate words, W3~W0 and W'3~W'0, and the eight round keys rk7~rk0 used in the subsequent eight rounds as inputs and, during encryption/decryption with the SM4 block cipher algorithm, generates the two unrelated sets of four intermediate words that result from eight rounds of the SM4 encryption/decryption round function.
2. The method for accelerating the SM4 block cipher algorithm according to claim 1, characterized in that, When using parallel pipeline and instruction-level parallelism techniques to accelerate the implementation of the SM4 block cipher algorithm:
Encryption includes the following steps:
(A) Generate the initial values (K3, K2, K1, K0) of the round key iteration using general instructions of a general-purpose processor;
(B) Taking the initial values of the round key iteration (K3,K2,K1,K0) and the system fixed parameters CK7~CK0 as input, execute the first SM4 round key generation instruction to generate the 8 round keys rk7~rk0 of the SM4 block cipher algorithm;
(C) Taking the round keys rk7~rk4 and the system fixed parameters CK15~CK8 as input, execute the second SM4 round key generation instruction to generate the 8 round keys rk15~rk8 of the SM4 block cipher algorithm; simultaneously, taking the round keys rk7~rk0 and two sets of unrelated plaintext (W3,W2,W1,W0,W'3,W'2,W'1,W'0) as input, execute the first SM4 round function iteration instruction, completing round function iterations 1 to 8 of the SM4 encryption algorithm and obtaining the iterative working words (W3^(7),W2^(7),W1^(7),W0^(7),W'3^(7),W'2^(7),W'1^(7),W'0^(7));
(D) Taking the round keys rk15~rk12 and the system fixed parameters CK23~CK16 as input, execute the third SM4 round key generation instruction to generate the 8 round keys rk23~rk16 of the SM4 block cipher algorithm; simultaneously, taking the round keys rk15~rk8 and the iterative working words (W3^(7),W2^(7),W1^(7),W0^(7),W'3^(7),W'2^(7),W'1^(7),W'0^(7)) as input, execute the second SM4 round function iteration instruction, completing round function iterations 9 to 16 of the SM4 encryption algorithm and obtaining the iterative working words (W3^(15),W2^(15),W1^(15),W0^(15),W'3^(15),W'2^(15),W'1^(15),W'0^(15));
(E) Taking the round keys rk23~rk20 and the system fixed parameters CK31~CK24 as input, execute the fourth SM4 round key generation instruction to generate the 8 round keys rk31~rk24 of the SM4 block cipher algorithm; simultaneously, taking the round keys rk23~rk16 and the iterative working words (W3^(15),W2^(15),W1^(15),W0^(15),W'3^(15),W'2^(15),W'1^(15),W'0^(15)) as input, execute the third SM4 round function iteration instruction (VSM4R), completing round function iterations 17 to 24 of the SM4 encryption algorithm and obtaining the iterative working words (W3^(23),W2^(23),W1^(23),W0^(23),W'3^(23),W'2^(23),W'1^(23),W'0^(23));
(F) Taking the round keys rk31~rk24 and the iterative working words (W3^(23),W2^(23),W1^(23),W0^(23),W'3^(23),W'2^(23),W'1^(23),W'0^(23)) as input, execute the fourth SM4 round function iteration instruction, completing round function iterations 25 to 32 of the SM4 encryption algorithm and obtaining the iterative working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31));
(G) Output the iterative working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31)) in reverse order to obtain the ciphertext (Y3,Y2,Y1,Y0,Y'3,Y'2,Y'1,Y'0) of the encryption algorithm;
Decryption includes the following steps:
(a) Generate the initial values (K3, K2, K1, K0) of the round key iteration using general instructions of a general-purpose processor;
(b) Taking the initial values of the round key iteration (K3, K2, K1, K0) and the system fixed parameters CK31~CK0 as source operand data, execute the SM4 round key generation instruction 4 times in sequence to generate the 32 round keys rk31~rk0 of the SM4 block cipher algorithm;
(c) Taking the round keys rk31~rk0 and the ciphertext (Y3,Y2,Y1,Y0,Y'3,Y'2,Y'1,Y'0) as input, execute the SM4 round function iteration instruction 4 times in sequence, completing the 32 round function iterations of the SM4 decryption algorithm and obtaining the iterative working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31));
(d) Output the iterative working words (W3^(31),W2^(31),W1^(31),W0^(31),W'3^(31),W'2^(31),W'1^(31),W'0^(31)) in reverse order to obtain the plaintext result (W3,W2,W1,W0,W'3,W'2,W'1,W'0) of the decryption algorithm.
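By way of illustration only, the pairing of instructions in encryption steps (B) to (F) of claim 2 can be listed by a short Python sketch. The function name `encryption_issue_groups` is hypothetical, and the immediate values #0 to #3 assume that the immediate starts at 0 and is incremented by 1 per execution; the claims state the increment but not the starting value.

```python
def encryption_issue_groups():
    """List, per step, the instructions that claim 2 executes simultaneously."""
    # Step (B): only the first round key generation instruction can run.
    groups = [["VSM4KEY #0 -> rk7~rk0"]]
    # Steps (C) to (F): round function instruction n consumes the keys that
    # key generation instruction n produced, while key generation n+1 runs.
    for n in range(4):
        group = []
        if n < 3:
            lo, hi = 8 * (n + 1), 8 * (n + 1) + 7
            group.append(f"VSM4KEY #{n + 1} -> rk{hi}~rk{lo}")
        group.append(
            f"VSM4R rounds {8 * n + 1}-{8 * n + 8} with rk{8 * n + 7}~rk{8 * n}"
        )
        groups.append(group)
    return groups


for step, group in zip("BCDEF", encryption_issue_groups()):
    print(f"({step})", " || ".join(group))
```

Each printed line corresponds to one step; entries joined by `||` are the two instructions that the claim executes simultaneously on different pipelines.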
3. The method for accelerating the SM4 block cipher algorithm according to claim 1, characterized in that, The SM4 round key generation instruction adopts a simple arithmetic instruction format with an immediate value, specifically VSM4KEY Va,#b,Vc. This instruction operates on an operand in a 256-bit source register Va and an 8-bit immediate operand, and the result is stored in a 256-bit destination register Vc. In the 32-bit instruction, bits [31:26] represent a 6-bit opcode; bits [25:21] select one register of the register file consisting of 32 256-bit registers as the source register Va, which stores the source operand; bits [20:13] represent an 8-bit immediate operand indicating the loop iteration count; bits [12:5] represent an 8-bit function code determining the specific instruction function; and bits [4:0] select one register of the register file consisting of 32 256-bit registers as the destination register Vc, which stores the result of the instruction's operation.
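As a minimal sketch of the field layout recited in claim 3, the following Python function packs the five fields of VSM4KEY into a 32-bit word. The function name is hypothetical, and the opcode and function-code values used in the usage line are placeholders, since the claims do not disclose the actual encodings.

```python
def encode_vsm4key(opcode, va, imm8, func, vc):
    """Pack VSM4KEY Va,#b,Vc fields, most significant field first."""
    # (value, width) pairs in order: [31:26] [25:21] [20:13] [12:5] [4:0]
    fields = (
        ("opcode", opcode, 6),
        ("va", va, 5),
        ("imm8", imm8, 8),
        ("func", func, 8),
        ("vc", vc, 5),
    )
    word = 0
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


# Placeholder opcode 0x1A and function code 0x5C (assumed, not from the claims).
word = encode_vsm4key(opcode=0x1A, va=3, imm8=2, func=0x5C, vc=7)
print(f"{word:#010x}")
```

The field widths sum to 32 bits (6 + 5 + 8 + 8 + 5), so shifting the accumulated word left by each width places every field at the bit range stated in the claim.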
4. The method for accelerating the SM4 block cipher algorithm according to claim 1, characterized in that, The SM4 round key generation instruction is as follows: Based on the previous four intermediate keys K3~K0, generate the subsequent eight round keys rk7~rk0 of the SM4 block cipher algorithm; wherein the intermediate keys K3~K0 are stored in the high 128 bits of the source register Va, the algorithm fixed parameters CK7~CK0 are determined according to the immediate value #b, and the generated results rk7~rk0 are stored in the destination register Vc; the results rk7~rk0 are respectively equal to the intermediate keys K11~K4. For i equal to 0, 1, 2, 3, 4, 5, 6, 7, the generation logic of K_{i+4} is: K_{i+4} = K_i XOR Temp2 XOR (Temp2 <<< 13) XOR (Temp2 <<< 23), where XOR represents bitwise XOR and <<< represents circular left shift. Temp2 is a 32-bit intermediate variable word whose generation logic is: Temp2[31:0] = SBOX(Temp1), where the SBOX(X) function performs a table lookup in parallel on the 4 bytes of the 32-bit data X to obtain a new 32-bit data. Temp1 is a 32-bit intermediate variable word whose generation logic is: Temp1[31:0] = K_{i+1} XOR K_{i+2} XOR K_{i+3} XOR CK_i, where CK_i is a 32-bit fixed parameter of the algorithm and {CK7~CK0} = SELCK(#b); the SELCK(#b) function determines the fixed parameters CK_i used in the key expansion process of the SM4 block cipher algorithm, that is, the values of CK7~CK0, eight 32-bit data words, are determined from #b according to the SM4 block cipher algorithm. Executing the SM4 round key generation instruction once generates eight round keys of the SM4 block cipher algorithm. The SM4 round key generation instruction is executed four times in sequence; each time, the high 128 bits of the source register Va are updated with the high 128 bits of the destination register Vc and the immediate value #b is incremented by 1, thus generating the 32 round keys rk31~rk0 of the SM4 block cipher algorithm.
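The generation logic of claim 4 can be modelled functionally in Python as a sketch under stated assumptions. All function names are hypothetical. `selck` assumes that #b takes the values 0 to 3 and selects CK_{8b}~CK_{8b+7}; the claim states only that #b determines the parameters and is incremented by 1 per execution. The CK_i bytes follow the (4i + j) × 7 mod 256 rule of the SM4 specification. The S-box table is not reproduced here: `PLACEHOLDER_SBOX` is a stand-in byte permutation so that the sketch runs, and must be replaced by the table from the SM4 specification to obtain real round keys.

```python
MASK32 = 0xFFFFFFFF

# STAND-IN permutation so the sketch runs; this is NOT the SM4 S-box.
PLACEHOLDER_SBOX = [(17 * x + 91) % 256 for x in range(256)]


def rotl32(x, n):
    """Circular left shift of a 32-bit word (the <<< operator)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def sbox_word(x, sbox):
    """SBOX(X): look up each of the 4 bytes of X independently."""
    return int.from_bytes(bytes(sbox[b] for b in x.to_bytes(4, "big")), "big")


def ck(i):
    """Fixed parameter CK_i; byte j is (4*i + j) * 7 mod 256."""
    return int.from_bytes(bytes((7 * (4 * i + j)) % 256 for j in range(4)), "big")


def selck(b):
    """Assumed SELCK(#b): the eight parameters for execution number b."""
    return [ck(8 * b + i) for i in range(8)]


def vsm4key(k, b, sbox):
    """Model one VSM4KEY execution: k = [K0, K1, K2, K3] -> [rk0, ..., rk7]."""
    k = list(k)
    cks = selck(b)
    for i in range(8):
        temp1 = k[i + 1] ^ k[i + 2] ^ k[i + 3] ^ cks[i]
        temp2 = sbox_word(temp1, sbox)
        k.append(k[i] ^ temp2 ^ rotl32(temp2, 13) ^ rotl32(temp2, 23))
    return k[4:]  # K4..K11, i.e. rk0..rk7


def expand_round_keys(k, sbox):
    """Four executions in sequence, feeding rk4..rk7 back as the next K0..K3."""
    rks = []
    for b in range(4):
        out = vsm4key(k, b, sbox)
        rks.extend(out)
        k = out[4:]  # models the high 128 bits of Vc updating Va
    return rks


rks = expand_round_keys([0x11111111, 0x22222222, 0x33333333, 0x44444444],
                        PLACEHOLDER_SBOX)
print(len(rks))
```

The lists here are ordered from the lowest index upward, so `out[4:]` holds the four most recent intermediate keys, which the claim places in the high 128 bits of the destination register.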
5. The method for accelerating the SM4 block cipher algorithm according to claim 1, characterized in that, The SM4 round function iteration instruction adopts a simple arithmetic instruction format in register format, specifically VSM4R Va,Vb,Vc. It is used to instruct two operands in two 256-bit source registers Va and Vb to perform operations, and the result is stored in a 256-bit destination register Vc. In the 32-bit instruction, bits [31:26] represent the 6-bit opcode, bits [25:21] indicate that one of the register files consisting of 32 256-bit registers is selected as the source register Va to store the source operands of the instruction, bits [20:16] indicate that one of the register files consisting of 32 256-bit registers is selected as the source register Vb to store the source operands of the instruction, bits [15:13] are always all "0", bits [12:5] represent the 8-bit function code used to determine the specific function of the instruction, and bits [4:0] indicate that one of the register files consisting of 32 256-bit registers is selected as the destination register Vc to store the result of the instruction.
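As a minimal sketch of the register-format layout recited in claim 5, the following Python function splits a 32-bit VSM4R instruction word into its fields and rejects words whose bits [15:13] are not all zero. The function name is hypothetical, and the opcode and function-code values in the usage line are placeholders, since the claims do not disclose the actual encodings.

```python
def decode_vsm4r(word):
    """Split a 32-bit VSM4R Va,Vb,Vc instruction word into its fields."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must fit in 32 bits")
    if (word >> 13) & 0b111:
        raise ValueError("bits [15:13] must be all zero")
    return {
        "opcode": (word >> 26) & 0x3F,  # bits [31:26]
        "va": (word >> 21) & 0x1F,      # bits [25:21], source register Va
        "vb": (word >> 16) & 0x1F,      # bits [20:16], source register Vb
        "func": (word >> 5) & 0xFF,     # bits [12:5], function code
        "vc": word & 0x1F,              # bits [4:0], destination register Vc
    }


# Placeholder opcode 0x1A and function code 0x5D (assumed, not from the claims).
example = (0x1A << 26) | (3 << 21) | (4 << 16) | (0x5D << 5) | 7
print(decode_vsm4r(example))
```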
6. The method for accelerating the SM4 block cipher algorithm according to claim 1, characterized in that, The SM4 round function iteration instruction is specifically as follows: Based on the current four intermediate words W3~W0 and the eight round keys rk7~rk0 used in the subsequent eight rounds of operation, generate the four intermediate words W11~W8 after eight rounds of iteration of the SM4 block cipher algorithm, with two 128-bit operations performed in parallel. W3~W0 are stored in the high or low 128 bits of the source register Va, rk7~rk0 are stored in the source register Vb, and the resulting W11~W8 are stored in the corresponding high or low 128 bits of the destination register Vc. In the 8 iterations, for j equal to 0, 1, 2, 3, 4, 5, 6, 7, the generation logic of the result W_{j+4} is: W_{j+4} = W_j XOR Temp2 XOR (Temp2 <<< 2) XOR (Temp2 <<< 10) XOR (Temp2 <<< 18) XOR (Temp2 <<< 24), where XOR represents bitwise XOR and <<< represents circular left shift. Temp2 is a 32-bit intermediate variable word whose generation logic is: Temp2[31:0] = SBOX(Temp1), where the SBOX(X) function performs a table lookup in parallel on the 4 bytes of the 32-bit data X to obtain a new 32-bit data. Temp1 is a 32-bit intermediate variable word whose generation logic is: Temp1[31:0] = W_{j+1} XOR W_{j+2} XOR W_{j+3} XOR rk_j. Executing the SM4 round function iteration instruction once completes 8 iterations of the encryption and decryption round function for two sets of data of the SM4 block cipher algorithm. The SM4 round function iteration instruction is executed 4 times in sequence; each time, the source register Vb is updated with 8 new round keys and the data in the source register Va is updated with the data in the generated destination register Vc, so as to generate the final 4 words of each of the two sets of data of the SM4 block cipher algorithm.
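The iteration logic of claim 6 can be modelled functionally in Python as a sketch under stated assumptions. All function names are hypothetical, and the two 128-bit halves of Va are represented as two separate lists of four words. The S-box table is not reproduced here: `PLACEHOLDER_SBOX` is a stand-in byte permutation so that the sketch runs, and must be replaced by the table from the SM4 specification to obtain real ciphertext. Supplying the round keys in reverse order for decryption follows the SM4 specification; claim 2 does not spell out the order.

```python
MASK32 = 0xFFFFFFFF

# STAND-IN permutation so the sketch runs; this is NOT the SM4 S-box.
PLACEHOLDER_SBOX = [(17 * x + 91) % 256 for x in range(256)]


def rotl32(x, n):
    """Circular left shift of a 32-bit word (the <<< operator)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def sbox_word(x, sbox):
    """SBOX(X): look up each of the 4 bytes of X independently."""
    return int.from_bytes(bytes(sbox[b] for b in x.to_bytes(4, "big")), "big")


def rounds8(w, rks, sbox):
    """Eight rounds on one lane: w = [W0..W3], rks = [rk0..rk7] -> [W8..W11]."""
    w = list(w)
    for j in range(8):
        temp1 = w[j + 1] ^ w[j + 2] ^ w[j + 3] ^ rks[j]
        temp2 = sbox_word(temp1, sbox)
        w.append(w[j] ^ temp2 ^ rotl32(temp2, 2) ^ rotl32(temp2, 10)
                 ^ rotl32(temp2, 18) ^ rotl32(temp2, 24))
    return w[8:12]


def vsm4r(lane_a, lane_b, rks, sbox):
    """Model one VSM4R execution: two unrelated lanes share the same 8 keys."""
    return rounds8(lane_a, rks, sbox), rounds8(lane_b, rks, sbox)


def crypt_two_blocks(block_a, block_b, rks32, sbox):
    """Four VSM4R executions, then output each lane in reverse word order."""
    a, b = list(block_a), list(block_b)
    for n in range(4):
        a, b = vsm4r(a, b, rks32[8 * n:8 * n + 8], sbox)
    return a[::-1], b[::-1]


# Arbitrary illustrative round keys and data, not derived from a real key.
keys = [(0x9E3779B9 * (i + 1)) & MASK32 for i in range(32)]
pa, pb = [1, 2, 3, 4], [5, 6, 7, 8]
ca, cb = crypt_two_blocks(pa, pb, keys, PLACEHOLDER_SBOX)
print(crypt_two_blocks(ca, cb, keys[::-1], PLACEHOLDER_SBOX) == (pa, pb))
```

Because each round only XORs a function of three words into the fourth, running the same routine with the round keys reversed undoes the encryption for any choice of byte table, which is why the round trip in the usage lines holds even with the stand-in table.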
7. An instruction set processor, characterized in that, It includes a register file, an SM4 round key generation instruction execution unit, and an SM4 round function iteration instruction execution unit. The SM4 round key generation instruction execution unit and the SM4 round function iteration instruction execution unit are placed on different execution pipelines and occupy different read and write ports of the register file, respectively. The SM4 round key generation instruction execution unit has two input terminals, one for inputting a 128-bit operand A and the other for inputting an 8-bit immediate operand B, and one output terminal for outputting a 256-bit execution result. The SM4 round key generation instruction execution unit implements shift operations and the processing of system parameters directly in hardware logic, and uses hardware lookup tables to implement the SBOX operations and the pipelined parallel processing of 8 round keys; the SM4 round key generation instruction execution unit can execute SM4 round key generation instructions in a pipelined manner. The SM4 round function iteration instruction execution unit has two input terminals, one for inputting a 256-bit operand A and the other for inputting a 256-bit operand B, and one output terminal for outputting a 256-bit execution result. The SM4 round function iteration instruction execution unit implements shift operations and the processing of system parameters directly in hardware logic, and uses hardware lookup tables to implement the SBOX operations and the pipelined parallel processing of 8 round functions; the SM4 round function iteration instruction execution unit can execute SM4 round function iteration instructions in a pipelined manner.
8. The instruction set processor according to claim 7, characterized in that, The SM4 round key generation instruction execution unit is configured with 8 levels of iterative execution stations and 1 level of output station, with a total execution delay of 9 clock cycles; the SM4 round function iteration instruction execution unit is configured with 8 levels of iterative execution stations and 1 level of output station, with a total execution delay of 9 clock cycles; the instruction set processor supports the parallel pipelined execution of the SM4 round key generation instruction and the SM4 round function iteration instruction.
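To illustrate what the 9-cycle execution delay of claim 8 implies for the encryption flow of claim 2, the following Python sketch computes finish times along the data dependencies. It is an idealized model, not a statement of measured performance: it assumes that each instruction starts in the cycle its last input becomes available, that results are forwarded with no extra delay, and that nothing else stalls the pipelines. The names `KEYn` and `Rn` are hypothetical labels for the n-th round key generation and round function iteration instructions.

```python
LATENCY = 9  # 8 iterative execution stations + 1 output station (claim 8)


def finish_times(latency=LATENCY):
    """Cycle at which each instruction's result is available (idealized)."""
    done = {}
    # Each key generation instruction needs the keys from the previous one.
    for n in range(4):
        start = done[f"KEY{n - 1}"] if n else 0
        done[f"KEY{n}"] = start + latency
    # Each round function instruction needs its 8 round keys and, after the
    # first, the working words produced by the previous round instruction.
    for n in range(4):
        start = done[f"KEY{n}"]
        if n:
            start = max(start, done[f"R{n - 1}"])
        done[f"R{n}"] = start + latency
    return done


times = finish_times()
overlapped = times["R3"]   # two pipelines working in parallel
serial = 8 * LATENCY       # the same 8 instructions executed back to back
print(overlapped, serial)
```

Under these assumptions the model gives 45 cycles for one pair of blocks with the two units overlapped, against 72 cycles if the same eight instructions were executed strictly one after another.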