Method, acceleration hardware and polynomial multiplier for performing target transformation
By introducing paired storage and rearrangement units into the acceleration hardware, the data flow is optimized, the speed bottleneck of polynomial multiplication is relieved, computational efficiency and scalability are improved, and the security requirements of a quantum computing environment are met.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG ANT SECRET TECH CO LTD
- Filing Date
- 2025-01-10
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, polynomial multiplication faces a bottleneck in computational speed in post-quantum cryptography schemes, especially in quantum computing environments. Existing accelerator designs suffer from insufficient data storage and retrieval efficiency, failing to meet the demands for high-efficiency computation.
By introducing paired storage and rearrangement units into the acceleration hardware, optimizing the data flow, matching the speed of the memory to that of the computing engine, and implementing butterfly operations according to the data flow graph, the computational performance of polynomial multiplication is improved.
It achieves speed unification of memory and computing engine, improves the utilization of computing units, enhances the execution efficiency and scalability of polynomial multiplication, and adapts to changes in different security parameters.
Figure CN122309898A_ABST
Description
[0001] This application is a divisional application; the parent application number is 2025100453616, the application date is January 10, 2025, and the parent invention is entitled "Method for Performing Target Transformation, Acceleration Hardware and Polynomial Multiplier".
Technical Field
[0002] One or more embodiments of this specification relate to the optimization of hardware for accelerating target transformations, and more particularly to accelerating target transformations and the execution of polynomial multiplications by utilizing pairwise storage and rearrangement of data.
Background
[0003] The Fast Fourier Transform (FFT) and the Number-Theoretic Transform (NTT) are the most critical steps in accelerating polynomial multiplication, and they have a wide range of applications in communication and encryption. For example, the FFT enables the transformation between the time and frequency domains in digital signal processing, while the NTT, and negative wrapped convolution (NWC) based on the NTT, accelerate polynomial multiplication over finite fields in hardware acceleration chip designs for fully homomorphic encryption.
[0004] Especially with the continuous advancement of quantum computers, efficient quantum algorithms can solve the mathematical problems relied upon by mainstream public-key cryptosystems such as RSA and ECC, thus posing a security threat to these systems. Therefore, developing more secure post-quantum cryptosystems has become a new research focus. Post-quantum cryptography (PQC) aims to ensure security even when attackers possess large-scale quantum computers. Currently, most existing post-quantum cryptographic schemes are based on lattice theory, and polynomial multiplication often becomes the main computational bottleneck in these schemes. To accelerate polynomial multiplication, the NTT is widely used, which can reduce the time complexity from O(n^2) to O(n log2(n)).
[0005] In the various computational scenarios mentioned above, the efficiency of polynomial multiplication, or its contained FFT or NTT operations, is a significant factor affecting computational performance.
[0006] Therefore, improved solutions are desired to increase the speed of polynomial multiplication, or of the FFT or NTT operations it contains, thereby improving the computational performance of related application scenarios.
Summary of the Invention
[0007] This specification describes one or more embodiments of a scheme to accelerate the execution of a target transformation by means of pairwise storage and rearrangement of data, thereby improving the performance of the target transformation and the corresponding polynomial multiplication.
[0008] According to a first aspect, a method is provided for performing a target transformation by acceleration hardware, the target transformation comprising N stages of transformation operations, the acceleration hardware comprising N circuit sections corresponding to the N stages, wherein any i-th stage (excluding the first and last stages) corresponds to an i-th circuit section comprising a controller, a processing engine, a memory, and first and second rearrangement units, the transformation operation of the i-th stage comprising:
[0009] The controller reads the first data pair and the second data pair from the first read address and the second read address in the memory of the (i-1)th stage, respectively;
[0010] The first rearrangement unit performs a first rearrangement operation on the first and second data pairs, and outputs the third and fourth data pairs in sequence.
[0011] The processing engine performs radix-2 butterfly operations on the third and fourth data pairs in turn, and outputs the first and second result pairs in turn.
[0012] The second rearrangement unit performs a second rearrangement operation on the first and second result pairs to obtain the fifth data pair and the sixth data pair.
[0013] The controller writes the fifth data pair and the sixth data pair into the first write address and the second write address in the memory of the i-th stage, respectively.
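The five steps above can be sketched in software as follows. This is a minimal illustration rather than the claimed circuit: the function names, the modular DIT butterfly, and the twiddle parameters `w3` and `w4` are assumptions introduced here, and each memory is modeled as a Python list holding one data pair per address.

```python
def rearrange(pair_a, pair_b):
    """Form one new pair from the first elements and another from the second elements."""
    return (pair_a[0], pair_b[0]), (pair_a[1], pair_b[1])


def butterfly_dit(x, y, w, q):
    """Radix-2 DIT butterfly modulo q: returns (x + w*y, x - w*y)."""
    t = (w * y) % q
    return (x + t) % q, (x - t) % q


def intermediate_step(prev_mem, cur_mem, addr1, addr2, w3, w4, q):
    """One pair of butterfly operations of an intermediate stage (hypothetical model)."""
    # Controller: read two storage pairs from the memory of the previous stage.
    first_pair, second_pair = prev_mem[addr1], prev_mem[addr2]
    # First rearrangement unit: storage pairs -> computation pairs.
    third_pair, fourth_pair = rearrange(first_pair, second_pair)
    # Processing engine: two radix-2 butterflies, one after the other.
    result1 = butterfly_dit(third_pair[0], third_pair[1], w3, q)
    result2 = butterfly_dit(fourth_pair[0], fourth_pair[1], w4, q)
    # Second rearrangement unit: computation pairs -> storage pairs.
    fifth_pair, sixth_pair = rearrange(result1, result2)
    # Controller: write back to the same two addresses in the current stage's memory.
    cur_mem[addr1], cur_mem[addr2] = fifth_pair, sixth_pair


prev_mem = [(1, 2), (3, 4)]
cur_mem = [None, None]
intermediate_step(prev_mem, cur_mem, 0, 1, w3=1, w4=4, q=17)
```

Applying the same rearrangement before and after the butterflies is what keeps adjacent results together at one address, so the next stage can read them with a single access.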
[0014] In one embodiment, the target transformation is performed on an n-point input sequence, where n is a power of 2. The memory of the i-th stage stores the n results obtained after the transformation operation of the i-th stage through consecutive n / 2 addresses, and the values of the first and second read addresses are the same as the values of the first and second write addresses, respectively.
[0015] In one embodiment, the second read address and the second write address are, respectively, the first read address and the first write address plus 2 raised to the power of i-1.
[0016] In one embodiment, the second read address and the second write address are, respectively, the first read address and the first write address plus 2 raised to the power of N-i-2.
[0017] In one embodiment, in the transformation operation of the i-th stage, the n / 2 addresses are sequentially and evenly divided into 2^(N-i-1) groups; the transformation operation of the i-th stage further includes:
[0018] After reading the second data pair, the controller reads data from the third read address and the fourth read address in the memory of the (i-1)th stage for two consecutive clock cycles.
[0019] Wherein, if the second read address is the last address in the group, then the third read address is the second read address plus 1; otherwise, the third read address is the first read address plus 1;
[0020] The fourth read address is the third read address plus 2 raised to the power of i-1.
[0021] In one embodiment, in the transformation operation of the i-th stage, the n / 2 addresses are sequentially divided into 2^i groups; the transformation operation of the i-th stage further includes:
[0022] After reading the second data pair, the controller reads data from the third read address and the fourth read address in the memory of the (i-1)th stage for two consecutive clock cycles.
[0023] Wherein, if the second read address is the last address in the group, then the third read address is the second read address plus 1; otherwise, the third read address is the first read address plus 1;
[0024] The fourth read address is the third read address plus 2 raised to the power of N-i-2.
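The two grouped addressing embodiments above can be sketched as a small address generator. This is a sketch under stated assumptions: the function name and the `form` argument are hypothetical, n = 2^N, the offset is 2^(i-1) for the first (DIT-style) variant and 2^(N-i-2) for the second (DIF-style) variant, and groups are assumed to be aligned blocks of twice the offset.

```python
def address_pairs(n, i, form="DIT"):
    """List the (first, second) read address pairs of stage i, in access order.

    Assumes n is a power of two and that stage i has a non-negative offset
    exponent (i >= 1 for "DIT", i <= N - 2 for "DIF").
    """
    num_addresses = n // 2
    big_n = n.bit_length() - 1            # N, with n = 2**N
    exponent = i - 1 if form == "DIT" else big_n - i - 2
    if exponent < 0:
        raise ValueError("stage index out of range for this form")
    offset = 1 << exponent
    group_size = 2 * offset               # addresses per group
    pairs = []
    first = 0
    while first < num_addresses:
        second = first + offset
        pairs.append((first, second))
        # Rule from the text: jump past the group when its last address was
        # just used as the second address, otherwise step by one.
        last_in_group = second % group_size == group_size - 1
        first = second + 1 if last_in_group else first + 1
    return pairs


print(address_pairs(16, 2))          # DIT-style, stage 2 of a 16-point transform
print(address_pairs(16, 1, "DIF"))   # DIF-style, stage 1 of a 16-point transform
```

Because the only stage-dependent quantity is the offset, the same stepping rule serves every stage, which is the property the text relies on for scalability.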
[0025] In one embodiment, the controller reads a first data pair and a second data pair from a first read address and a second read address in the memory of the (i-1)th stage, respectively, including:
[0026] In response to the completion of writing data pairs in all addresses of the first group in the (i-1)th phase, in the first clock cycle, the controller reads a first data pair from the first read address and inputs it into the first rearrangement unit; in the second clock cycle following the first clock cycle, the controller reads a second data pair from the second read address and inputs it into the first rearrangement unit.
[0027] In one embodiment, the processing engine sequentially performs radix-2 butterfly operations on the third and fourth data pairs, respectively, including:
[0028] In the third clock cycle, the processing engine begins to perform a radix-2 butterfly operation on the third data pair, and after t clock cycles, the first result pair is output to the second rearrangement unit.
[0029] In the fourth clock cycle following the third clock cycle, the processing engine begins to perform a radix-2 butterfly operation on the fourth data pair, and after t clock cycles, the resulting second result pair is output to the second rearrangement unit.
[0030] In one embodiment, the controller writes the fifth data pair and the sixth data pair to the first write address and the second write address in the memory of the i-th stage, respectively, including:
[0031] In the (t+5)th clock cycle, the controller writes the fifth data pair to the first write address;
[0032] In the (t+6)th clock cycle, the controller writes the sixth data pair to the second write address.
[0033] In one embodiment, the first rearrangement unit performs a first rearrangement operation on the first and second data pairs, including:
[0034] During the first clock cycle, the first rearrangement unit receives the first data pair;
[0035] During the second clock cycle, the first rearrangement unit receives the second data pair and temporarily stores the first data pair;
[0036] In the third clock cycle following the second clock cycle, the first rearrangement unit outputs the first data from the first data pair and the first data from the second data pair as the third data pair, and temporarily stores the second data from the second data pair.
[0037] In the fourth clock cycle following the third clock cycle, the first rearrangement unit outputs the second data from the first data pair and the second data from the second data pair as the fourth data pair.
[0038] In one embodiment, the second rearrangement unit performs a second rearrangement operation on the first and second result pairs, including:
[0039] In the (t+3)th clock cycle, the second rearrangement unit receives the first result pair;
[0040] In the (t+4)th clock cycle, the second rearrangement unit receives the second result pair and temporarily stores the first result pair;
[0041] In the (t+5)th clock cycle, the second rearrangement unit outputs the first result of the first result pair and the first result of the second result pair as the fifth data pair, and temporarily stores the second result of the second result pair;
[0042] In the (t+6)th clock cycle, the second rearrangement unit outputs the second result of the first result pair and the second result of the second result pair as the sixth data pair.
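The clock-cycle embodiments above can be collected into a single timeline for one pair of butterfly operations. The sketch below only tabulates the cycle numbers stated in the text; the function name is hypothetical and `t` stands for the butterfly latency in clock cycles.

```python
def pair_timeline(t):
    """Return (cycle, event) tuples for one pair of butterfly operations."""
    events = [
        (1, "controller reads first data pair; rearrangement unit 1 receives it"),
        (2, "controller reads second data pair; unit 1 holds the first pair"),
        (3, "unit 1 outputs third data pair; engine starts butterfly 1"),
        (4, "unit 1 outputs fourth data pair; engine starts butterfly 2"),
        (t + 3, "unit 2 receives first result pair"),
        (t + 4, "unit 2 receives second result pair; holds the first"),
        (t + 5, "unit 2 outputs fifth data pair; written to first write address"),
        (t + 6, "unit 2 outputs sixth data pair; written to second write address"),
    ]
    # Sort by cycle so that small latencies still print in time order.
    return sorted(events, key=lambda event: event[0])


for cycle, event in pair_timeline(4):
    print(f"cycle {cycle:2d}: {event}")
```

Every unit is busy for two consecutive cycles per pair of butterflies, so a new pair of reads can be issued every two cycles while earlier pairs are still in flight.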
[0043] In one embodiment, the 0th circuit section corresponding to the first stage among the N circuit sections includes a 0th-order controller, a 0th-order processing engine, and a 0th-order memory. The transformation operation of the 0th stage includes:
[0044] The 0th-order processing engine sequentially performs a radix-2 butterfly operation on each of the n / 2 adjacent data pairs in the n-point input sequence arranged in bit reverse order, and outputs n / 2 result pairs in sequence.
[0045] The 0th-order controller writes the n / 2 result pairs sequentially to n / 2 consecutive addresses in the 0th-order memory.
[0046] In one embodiment, the (N-1)th circuit section corresponding to the final stage among the N circuit sections includes an (N-1)th order controller and an (N-1)th order processing engine, and the transformation operation of the (N-1)th stage includes:
[0047] The (N-1)th stage controller sequentially reads n / 2 data pairs from n / 2 addresses in the memory of the (N-2)th stage;
[0048] The (N-1)th order processing engine sequentially performs radix-2 butterfly operations on the n / 2 data pairs and outputs n / 2 result pairs sequentially.
[0049] In one embodiment, the 0th circuit section corresponding to the first stage, or the (N-1)th circuit section corresponding to the last stage, among the N circuit sections has the same hardware structure as the i-th circuit section and performs the same transformation operation.
[0050] In one embodiment, the target transform is a Fast Fourier Transform (FFT), an Inverse Fast Fourier Transform (IFFT), a Number-Theoretic Transform (NTT), an Inverse Number-Theoretic Transform (INTT), an NTTP transform obtained by fusing the preprocessing and the NTT in negative wrapped convolution (NWC), or an INTTP transform obtained by fusing the INTT and the postprocessing in NWC; the N-stage transform operations are performed using either the decimation-in-time (DIT) or the decimation-in-frequency (DIF) method.
[0051] According to a second aspect, acceleration hardware for performing a target transformation comprising N stages of transformation operations is provided. The acceleration hardware includes N circuit sections corresponding to the N stages, wherein the i-th circuit section corresponding to any i-th stage (excluding the first and last stages) includes a controller, a processing engine, a memory, and first and second rearrangement units; in the transformation operation of the i-th stage:
[0052] The controller is configured to read the first data pair and the second data pair from the first read address and the second read address in the memory of the (i-1)th stage, respectively.
[0053] The first rearrangement unit is configured to perform a first rearrangement operation on the first and second data pairs and output the third and fourth data pairs in sequence.
[0054] The processing engine is configured to perform radix-2 butterfly operations on the third and fourth data pairs in sequence, and output the first and second result pairs in sequence.
[0055] The second rearrangement unit is configured to perform a second rearrangement operation on the first result pair and the second result pair to obtain a fifth data pair and a sixth data pair.
[0056] The controller is also configured to write the fifth data pair and the sixth data pair to the first write address and the second write address in the memory of the i-th stage, respectively.
[0057] In one embodiment, the first rearrangement unit includes a first register, a second register, a first multiplexer, a second multiplexer, and a multiplexing control unit.
[0058] In this configuration, one input of the first multiplexer and one input of the second multiplexer both receive the first data from the currently read data pair.
[0059] The first register receives and temporarily stores the second data in the currently read data pair. The output of the first register is coupled to another input of the first multiplexer and another input of the second multiplexer.
[0060] The output of the first multiplexer is coupled to the input of the second register.
[0061] The second register outputs the first data in the currently rearranged data pair, and the second multiplexer outputs the second data in the currently rearranged data pair.
[0062] The multiplexing control unit is configured to control the first multiplexer and the second multiplexer to alternately select one of their two inputs as the output.
[0063] In one embodiment, the multiplexing control unit includes a third register, a fourth register, and an inverter.
[0064] The output of the third register is coupled to the control terminal of the first multiplexer, the input of the fourth register, and the input of the inverter.
[0065] The output of the fourth register is coupled to the control terminal of the second multiplexer.
[0066] The output of the inverter is coupled to the input of the third register.
[0067] In one embodiment, the first rearrangement unit and the second rearrangement unit have the same hardware structure.
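One possible reading of the register and multiplexer structure described above can be modeled cycle by cycle. This is an interpretation, not the disclosed circuit: the class name, the select polarity (0 selects the incoming first data, 1 selects the first register), and the reset values are assumptions, and additional latency such as memory read delay is not modeled.

```python
class RearrangementUnit:
    """Cycle-level model: two data registers, two multiplexers, toggle control."""

    def __init__(self):
        self.reg1 = None   # holds the second data of the pair read last cycle
        self.reg2 = None   # holds the first data of the rearranged output pair
        self.reg3 = 0      # select of multiplexer 1, toggled through the inverter
        self.reg4 = 0      # select of multiplexer 2, reg3 delayed by one cycle

    def clock(self, pair):
        """Apply one input pair for one clock cycle and return the output pair."""
        first_data, second_data = pair
        # Combinational part: both multiplexers choose between the incoming
        # first data and the output of the first register.
        mux1 = self.reg1 if self.reg3 else first_data
        mux2 = self.reg1 if self.reg4 else first_data
        output = (self.reg2, mux2)
        # Sequential part: all registers update on the clock edge.
        self.reg4 = self.reg3
        self.reg3 = 1 - self.reg3
        self.reg2 = mux1
        self.reg1 = second_data
        return output


unit = RearrangementUnit()
stream = [("p0", "p1"), ("q0", "q1"), ("r0", "r1"), ("s0", "s1"), (None, None)]
outputs = [unit.clock(pair) for pair in stream]
print(outputs[1:])  # the first output is produced before the unit is filled
```

In this model the unit accepts one pair per cycle and emits one rearranged pair per cycle after a one-cycle fill, which is consistent with the streaming behavior the text describes.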
[0068] In one embodiment, the processing engine includes a modular multiplication unit comprising a first, a second, and a third integer multiplier, wherein the first integer multiplier is implemented using a digital signal processor (DSP), and the second and third integer multipliers are implemented using a lookup table (LUT).
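The text does not state which reduction algorithm the three integer multipliers implement. One well-known scheme that uses exactly three multiplications is Barrett reduction, sketched below purely as an assumption for illustration; the function names and the mapping of multipliers to DSP or LUT resources in the comments are hypothetical.

```python
def barrett_setup(q):
    """Precompute the shift width k and the constant mu = floor(2^(2k) / q)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_mulmod(a, b, q, k, mu):
    """Compute (a * b) mod q for 0 <= a, b < q with three integer multiplications."""
    z = a * b                        # multiplier 1: variable by variable
    estimate = (z * mu) >> (2 * k)   # multiplier 2: by the constant mu
    r = z - estimate * q             # multiplier 3: by the constant q
    # The quotient estimate can be too small by at most a small amount,
    # so a conditional subtraction finishes the reduction.
    while r >= q:
        r -= q
    return r


q = 3329
k, mu = barrett_setup(q)
print(barrett_mulmod(1234, 2345, q, k, mu))
```

In such a scheme only the first multiplication has two variable operands, while the other two multiply by fixed constants, which is one reason a design might treat the multipliers differently.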
[0069] According to a third aspect, a polynomial multiplier is provided, which receives an n-point first input sequence and an n-point second input sequence, and outputs an n-point polynomial multiplication output sequence; the polynomial multiplier includes a first transformation module, a second transformation module, a point-by-point multiplication hardware module, and a third transformation module, wherein the first transformation module, the second transformation module, and the third transformation module all include the acceleration hardware as described in the second aspect.
[0070] The first transformation module is configured to perform NTTP on the first input sequence in DIF form and output a first output sequence;
[0071] The second transformation module is configured to perform NTTP on the second input sequence in DIF form and output a second output sequence;
[0072] The point-by-point multiplication hardware module is configured to perform point-by-point multiplication on the first and second output sequences and output a third output sequence.
[0073] The third transformation module is configured to perform INTTP on the third output sequence in the form of DIT, and output the n-point polynomial multiplication output sequence.
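The four modules of the polynomial multiplier can be mirrored in software as follows. This is a functional sketch only: the transforms are evaluated directly from their definitions in O(n^2) time rather than as DIF or DIT butterfly networks, the function names are hypothetical, and `psi` is assumed to be a primitive 2n-th root of unity modulo a prime `q`. The call `pow(x, -1, q)` requires Python 3.8 or later.

```python
def nttp(a, psi, q):
    """Fused preprocessing + NTT: A_k = sum_j a_j * psi^((2k+1) * j) mod q."""
    n = len(a)
    return [sum(a[j] * pow(psi, (2 * k + 1) * j, q) for j in range(n)) % q
            for k in range(n)]


def inttp(a_hat, psi, q):
    """Fused INTT + postprocessing: a_j = n^-1 * sum_k A_k * psi^(-(2k+1) * j) mod q."""
    n = len(a_hat)
    psi_inv = pow(psi, -1, q)
    n_inv = pow(n, -1, q)
    return [n_inv * sum(a_hat[k] * pow(psi_inv, (2 * k + 1) * j, q)
                        for k in range(n)) % q
            for j in range(n)]


def polynomial_multiply(a, b, psi, q):
    """Multiply a(x) and b(x) modulo (x^n + 1, q) using the three transform modules."""
    first_output = nttp(a, psi, q)                    # first transformation module
    second_output = nttp(b, psi, q)                   # second transformation module
    third_output = [x * y % q for x, y in zip(first_output, second_output)]
    return inttp(third_output, psi, q)                # third transformation module


# q = 17, n = 8, psi = 3 (3 has order 16 modulo 17).
a = [1, 2, 3, 4, 5, 6, 7, 8]
x_only = [0, 1, 0, 0, 0, 0, 0, 0]     # the polynomial b(x) = x
print(polynomial_multiply(a, x_only, 3, 17))
```

Multiplying by x shifts the coefficients by one position and negates the one that wraps around, which gives a quick sanity check of the negative wrapped behavior.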
[0074] In the embodiments of this specification, a method for performing a target transformation using acceleration hardware, and the corresponding acceleration hardware, are proposed. In the i-th circuit section corresponding to any i-th stage (excluding the first and last stages) of the acceleration hardware, a controller, a processing engine, a memory, and first and second rearrangement units are configured. A pair of data is stored at one address in the memory, and the first and second rearrangement units convert between the pairs stored in the memory and the pairs computed by the butterfly operation in the processing engine, thereby eliminating the mismatch between the two. This achieves an efficient data flow that unifies the speed of the computing unit and the memory, thereby improving the utilization of the computing unit and enhancing the overall transformation performance.
Description of the Drawings
[0075] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0076] Figure 1 shows a schematic diagram of the structure of dedicated acceleration hardware for performing a DIT-form target transformation according to one embodiment;
[0077] Figure 2 shows a flowchart for one pair of butterfly operations in an arbitrary intermediate (i-th) stage of a method for performing a target transformation by acceleration hardware, according to one embodiment;
[0078] Figure 3 shows a data flow diagram (butterfly diagram) of a 16-point DIT NTT according to one embodiment, in which the address access order of each stage, the computation order of the butterfly units, and the access address of each data item are marked;
[0079] Figure 4 shows a schematic diagram of the structure of a first rearrangement unit in the acceleration hardware according to one embodiment;
[0080] Figure 5 shows a schematic diagram of the structure of a processing engine in the acceleration hardware according to one embodiment;
[0081] Figure 6 shows a schematic diagram of the structure of a polynomial multiplier according to one embodiment.
Detailed Implementation
[0082] The solution provided in this specification will now be described with reference to the accompanying drawings.
[0083] The NTT converts the coefficient representation of a polynomial into its transform-domain representation, and the INTT is its inverse operation. In mainstream post-quantum cryptography schemes, NTT and INTT are usually defined on a polynomial ring. Furthermore, the negative wrapped convolution (NWC) method is used to avoid the additional computational cost caused by zero padding in polynomial multiplication.
[0084] The following pseudocode illustrates the operation of NWC, which provides an efficient method for calculating polynomial products c(x).
[0085]
[0086] Here, a(x) and b(x) are polynomials in the above polynomial ring, with coefficients a_{n-1} to a_0 and b_{n-1} to b_0 respectively, where a_i, b_i ∈ [0, q). Polynomial multiplication of a(x) and b(x) yields a polynomial c(x) with n coefficients.
[0087] As shown in the pseudocode above, when performing polynomial multiplication, NTT and INTT operations are performed in step 3. Before performing the NTT operation, the preprocessing shown in steps 1 and 2 is performed, and after performing the INTT operation, the postprocessing shown in step 4 is performed.
[0088] To accelerate computation, the preprocessing steps can be integrated into the NTT, and the postprocessing steps can be integrated into the INTT.
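The NWC procedure described above (preprocessing, NTT, pointwise multiplication, INTT, postprocessing) can be sketched as follows. This is a reference sketch under assumptions: the step comments follow the numbering used in the text, the transforms are evaluated naively in O(n^2) time for clarity, the function names are hypothetical, and `psi` is assumed to be a primitive 2n-th root of unity modulo a prime `q` (Python 3.8 or later for `pow(x, -1, q)`).

```python
def ntt_naive(a, omega, q):
    """Direct evaluation of the NTT: A_k = sum_j a_j * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def nwc_multiply(a, b, psi, q):
    """Negative wrapped convolution: c(x) = a(x) * b(x) mod (x^n + 1, q)."""
    n = len(a)
    omega = psi * psi % q
    # Steps 1 and 2 (preprocessing): scale coefficient i by psi^i.
    a_scaled = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_scaled = [b[i] * pow(psi, i, q) % q for i in range(n)]
    # Step 3: NTT of both inputs, pointwise product, then INTT.
    product = [x * y % q for x, y in zip(ntt_naive(a_scaled, omega, q),
                                         ntt_naive(b_scaled, omega, q))]
    n_inv = pow(n, -1, q)
    c_scaled = [n_inv * v % q
                for v in ntt_naive(product, pow(omega, -1, q), q)]
    # Step 4 (postprocessing): scale coefficient i by psi^(-i).
    psi_inv = pow(psi, -1, q)
    return [c_scaled[i] * pow(psi_inv, i, q) % q for i in range(n)]


def negacyclic_schoolbook(a, b, q):
    """O(n^2) reference: multiply and reduce with x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
print(nwc_multiply(a, b, psi=3, q=17) == negacyclic_schoolbook(a, b, 17))
```

The scaling loops in steps 1, 2, and 4 are the extra passes that the fused transforms fold into the twiddle factors.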
[0089] For simplicity, this specification introduces NTTP to represent the transformation obtained by fusing preprocessing and NTT in NWC, and introduces INTTP to represent the transformation obtained by fusing INTT and postprocessing in NWC, as shown in formulas (1) and (2) below:
[0090] (1)
[0091] (2)
[0092] where ω_n is a primitive n-th root of unity, satisfying ω_n^n ≡ 1 (mod q), and ψ_{2n} is a primitive 2n-th root of unity, satisfying ψ_{2n}^2 ≡ ω_n (mod q).
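For reference, a commonly used way of writing the fused transforms is given below. It is an assumption based on the standard NWC definitions, with ω_n a primitive n-th root of unity and ψ_{2n} a primitive 2n-th root of unity modulo q such that ψ_{2n}^2 ≡ ω_n, and it may differ in notation from the formulas of the original filing.

```latex
% NTTP: preprocessing by \psi_{2n}^{j} fused into the forward NTT
\hat{a}_k = \sum_{j=0}^{n-1} a_j \, \psi_{2n}^{\,j} \, \omega_n^{\,jk} \bmod q
          = \sum_{j=0}^{n-1} a_j \, \psi_{2n}^{\,(2k+1)j} \bmod q,
\qquad k = 0, 1, \dots, n-1

% INTTP: inverse NTT fused with postprocessing by \psi_{2n}^{-j} and the factor n^{-1}
a_j = n^{-1} \, \psi_{2n}^{-j} \sum_{k=0}^{n-1} \hat{a}_k \, \omega_n^{-jk} \bmod q
    = n^{-1} \sum_{k=0}^{n-1} \hat{a}_k \, \psi_{2n}^{-(2k+1)j} \bmod q,
\qquad j = 0, 1, \dots, n-1
```

Written this way, the only change relative to a plain NTT or INTT is the exponent of the multiplication factor, which is why the same butterfly data flow can serve both.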
[0093] It can be seen that the algorithmic principles of NTTP and NTT are consistent, except that there is an additional preprocessing factor in the multiplication factors. NTT is a variant of FFT in the finite field, adding a modulo operation compared to FFT, but its algorithmic principle is still consistent with that of the discrete form of FFT. Similarly, the algorithmic principles of INTTP, as well as the inverse transforms such as INTT and IFFT, are consistent with NTTP; the main differences lie in the factors in the multiplication and the final multiplication by the constant 1 / n. These transformations can all be implemented using the same data flow graph (butterfly diagram). Therefore, these transformations can all be implemented using the same hardware architecture to accelerate their operation.
[0094] Therefore, the discussion of embodiments of this disclosure applies to all the above-described transformations, as well as other transformations with the same or similar algorithmic principles. Thus, in the discussion of this disclosure, the target transformation can be used as a general term when referring to general schemes or general operations within schemes. The target transformation is a discrete transformation that transforms the input coefficient sequence into an output coefficient sequence, or its inverse transformation, which can be converted into an N-stage butterfly operation.
[0095] Several algorithms have been proposed, such as the Decimation-In-Time (DIT) method or the Decimation-In-Frequency (DIF) method, to convert the n-point target transformation into N stages of radix-2 butterfly operations, where n = 2^N. Dedicated accelerators for performing target transformations can be implemented using spatially parallel architectures or pipelined architectures. However, existing accelerator solutions have shortcomings in their design for data storage and retrieval, resulting in insufficient efficiency in data storage and retrieval, so the performance of target transformations still needs to be improved.
[0096] Furthermore, in some cases, the scalability of accelerators used to perform target transformations is crucial for adapting to changes in various security parameters during the standardization of PQC schemes. Several scalable accelerator architectures have been proposed; however, some of these accelerators suffer from poor versatility. Because the rules for calculating data read / write addresses differ from stage to stage, they require as many sets of data selection logic as there are stages, making it difficult to configure an arbitrary number of stages. Other accelerators have complex control units that occupy a significant amount of area beyond the computation itself, and some architectures have low utilization of their computational units. Therefore, achieving both good hardware efficiency and scalability presents a considerable challenge.
[0097] In view of this, an embodiment of this specification proposes a hardware-accelerated target transformation scheme. In this scheme, the execution of the target transformation is accelerated and its performance is improved by pairwise storage and rearrangement of data.
[0098] Specifically, since butterfly operations process a pair of data at a time, with both input and output data in pairs (hereinafter referred to as computation pairs), and memory can only perform one read / write operation per clock cycle, the embodiments in this specification propose storing a pair of data (hereinafter referred to as storage pairs) at a single memory address. This allows two data items to be processed simultaneously with each read / write operation, maintaining consistent data flow speeds for memory access and computation, and improving the efficiency of transformation operations. This is particularly advantageous for pipelined architecture acceleration hardware, as memory access and butterfly operations in the processing engine can be performed simultaneously in pairs on the pipeline, easily filling the pipeline completely and allowing all stages of the pipeline to run concurrently.
[0099] Furthermore, embodiments of this specification reveal that although in an N-order butterfly operation, there is no direct correspondence between the storage pairs of the previous stage and the computation pairs of the current stage within each non-head and tail stage (i.e., intermediate stage), rearranging two storage pairs from two addresses with a preset relationship yields two input computation pairs for two butterfly operations. Then, the two output computation pairs of these two butterfly operations can be rearranged and stored for easy retrieval in the next stage. It can be understood that in this specification, "rearrangement" refers to rearranging two data pairs so that one data point is extracted from each pair to form a new data pair, and the remaining data point from each pair forms another new data pair.
[0100] Therefore, the embodiments of this specification introduce two rearrangement units in each intermediate stage. The first rearrangement unit, located before the processing engine for butterfly operations, rearranges the two storage pairs read from the previous stage into two computation pairs, which serve as input data pairs for the two butterfly operations, respectively. The second rearrangement unit, located after the processing engine, rearranges the two computation pairs, which are the output data pairs obtained from these two butterfly operations, into two storage pairs, which are stored at two addresses in the memory of the current stage for the next stage to read and continue subsequent transformation operations. Thus, by rearranging the data pairs read from or about to be written to memory using the first and second rearrangement units, the mismatch between the data pairs in memory and the data pairs used for computation can be eliminated.
[0101] Based on the above-described pairwise memory access and rearrangement operations, the scheme of the embodiments in this specification implements an efficient data flow that can unify the speed of computing units and memory, thereby improving the utilization of computing units.
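The pairwise storage and rearrangement scheme described above can be checked end to end with a small software model of a DIT NTT. This is a sketch under assumptions rather than the disclosed hardware: memories are Python lists with one pair per address, the twiddle exponents follow the standard radix-2 DIT schedule, the function names are hypothetical, and `omega` is assumed to be a primitive n-th root of unity modulo a prime `q`.

```python
def bit_reverse(values):
    """Return the values reordered by bit-reversed index (length is a power of two)."""
    n = len(values)
    bits = n.bit_length() - 1
    return [values[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]


def rearrange(pair_a, pair_b):
    """Swap between storage pairs and computation pairs."""
    return (pair_a[0], pair_b[0]), (pair_a[1], pair_b[1])


def butterfly(x, y, w, q):
    """Radix-2 DIT butterfly modulo q."""
    t = (w * y) % q
    return (x + t) % q, (x - t) % q


def paired_dit_ntt(a, omega, q):
    """n-point DIT NTT in which every memory address holds one data pair."""
    n = len(a)
    num_stages = n.bit_length() - 1
    x = bit_reverse(a)
    # Stage 0: storage pairs already equal computation pairs, twiddle is 1.
    mem = [butterfly(x[2 * k], x[2 * k + 1], 1, q) for k in range(n // 2)]
    for i in range(1, num_stages):
        offset = 1 << (i - 1)             # distance between the two addresses
        stride = n >> (i + 1)             # twiddle exponent step of this stage
        new_mem = [None] * (n // 2)
        for addr1 in range(n // 2):
            if addr1 & offset:            # this address is read as a second address
                continue
            addr2 = addr1 + offset
            j = (2 * addr1) % (1 << i)    # position of the first element in its block
            in1, in2 = rearrange(mem[addr1], mem[addr2])
            out1 = butterfly(in1[0], in1[1], pow(omega, j * stride, q), q)
            out2 = butterfly(in2[0], in2[1], pow(omega, (j + 1) * stride, q), q)
            new_mem[addr1], new_mem[addr2] = rearrange(out1, out2)
        mem = new_mem
    return [value for pair in mem for value in pair]


q, omega = 17, 3                          # 3 has order 16 modulo 17
a = list(range(16))
print(paired_dit_ntt(a, omega, q))
```

Each loop iteration performs one read per address and one write per address while feeding two butterflies, so the memory traffic and the butterfly throughput stay in step.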
[0102] In addition, as will be detailed later, in some embodiments, the data flow control is simple and effectively reduces the hardware resources occupied by the control logic.
[0103] In addition, in some embodiments, a grouped pairwise storage access method is proposed to control the data stream, which ensures that the control logic is consistent in each stage, thus having good scalability and supporting multiple polynomial lengths and data bit widths.
[0104] The following will mainly take the NTTP in the DIT form as an example to describe in detail the solutions of the embodiments of this specification. As mentioned above, NTT, INTT, FFT, IFFT, and INTTP, etc. can all adopt the same hardware architecture, and the main differences lie in the factors in the multiplication and the multiplication with the constant 1 / n. Based on the description of the inventive concepts, technical details, etc. of the embodiments in this specification, those skilled in the art can easily understand the corresponding settings or modifications for various types of target transforms, so they will not be elaborated one by one in the following text.
[0105] Figure 1 shows a schematic structural diagram of dedicated acceleration hardware for performing a target transform in the DIT form, where the target transform includes transform operations in N stages.
[0106] As shown in Figure 1, to accelerate the execution of the target transform in the DIT form, the dedicated acceleration hardware in the embodiments of this specification is designed to include N circuit parts corresponding to the N stages. For any i-th stage other than the first and last stages, the corresponding i-th circuit part includes a controller, a processing engine, a memory, and first and second rearrangement units.
[0107] In the following text, the i-th stage is also called an intermediate stage, where 0 < i < N - 1. That is, the intermediate stages in Figure 1, namely stages 1 to N - 2, all have the same hardware structure. For simplicity, the hardware structure after stage 2 is omitted in Figure 1. The hardware structure of the last stage, namely stage N - 1, is also largely omitted in Figure 1, because in the acceleration hardware for the DIT form, the circuit part of stage N - 1 has the same hardware structure as the circuit part of an intermediate stage and performs the same transform operations.
[0108] As shown in Figure 1, the circuit part corresponding to the first stage (stage 0) in the acceleration hardware for the DIT form includes only a controller, a processing engine, and a memory, without a rearrangement unit. For the input memory (labeled "input" in Figure 1), the n-point input sequence can first be arranged in bit-reversed order, and the n / 2 input pairs, each consisting of two adjacent data items of the bit-reversed sequence, are then stored in n / 2 consecutive addresses of the input memory.
[0109] In Figure 1, solid arrows represent data streams and dashed arrows represent control signals. Above a dashed arrow, ① indicates a read control signal, ② a write control signal, and ③ an enable control signal. For example, as shown in Figure 1, in stage 0 the controller controls the read operation of the input memory and the write operation of the stage 0 memory, and generates the enable signal for the processing engine.
[0110] The transformation operation of stage 0 can be performed as follows under the control of the stage 0 controller:
[0111] In stage 0, the controller sequentially reads n / 2 pairs of input data from n / 2 addresses in the input memory.
[0112] The processing engine in stage 0 performs radix-2 butterfly operations on each of the n / 2 input pairs in sequence, and outputs n / 2 result pairs in sequence.
[0113] The controller in stage 0 writes the n / 2 result pairs sequentially to consecutive n / 2 addresses in the memory of stage 0.
[0114] Take the data flow diagram of the 16-point DIT NTT shown in Figure 3 as an example. This 16-point DIT NTT includes four stages (stages 0 to 3) of transform operations. In stage 0, each pair of adjacent input data in the 16-point input sequence, after the sequence has been arranged in bit-reversed order, is stored sequentially at addresses A0 to A7 of the input memory. Then, as shown by the address access order at the bottom of Figure 3, in stage 0 these 8 input pairs are read in the order of addresses A0 to A7 and, as indicated by the sequence numbers in the upper right corner of each butterfly unit in Figure 3, butterfly operations are performed sequentially on the eight input pairs read from addresses A0 to A7.
[0115] In addition, each address in Figure 3 corresponds to two adjacent rows of data (represented by a gray bar), indicating that the read / write address for these two rows at each stage is the address given by the leftmost label. For example, as shown in Figure 3, the n / 2 result pairs obtained from the butterfly operations in stage 0 are written sequentially to addresses A0 to A7 of the stage 0 memory.
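The input arrangement and the stage 0 operation described above can be illustrated with a short Python sketch. It is only a functional model under stated assumptions: the function names are hypothetical, n is assumed to be a power of two, and the stage 0 twiddle factor is assumed to be 1 (as in a plain cyclic DIT NTT), which the specification does not state for NTTP.

```python
def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of the index k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def pack_input_pairs(x):
    """Model of the input memory: address a holds the storage pair
    (y[2a], y[2a+1]), where y is x arranged in bit-reversed order."""
    n = len(x)
    bits = n.bit_length() - 1  # assumes n is a power of two
    y = [x[bit_reverse(k, bits)] for k in range(n)]
    return [(y[2 * a], y[2 * a + 1]) for a in range(n // 2)]


def stage0(input_mem, q):
    """One radix-2 butterfly per address, read and written in address order.
    Assumption: the twiddle factor of stage 0 is 1."""
    return [((u + v) % q, (u - v) % q) for (u, v) in input_mem]


# 16-point example: the input value equals its own index, so the pairs
# show which input indices share an address A0..A7.
pairs = pack_input_pairs(list(range(16)))
stage0_mem = stage0(pairs, 17)
```

For the 16-point case, `pairs` is `[(0, 8), (4, 12), (2, 10), (6, 14), (1, 9), (5, 13), (3, 11), (7, 15)]`, so in stage 0 each storage pair is already a computation pair and no rearrangement is needed.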
[0116] In some cases, the input memory of Stage 0 is not necessary; that is, it does not need to store input data, but instead receives input data directly from external or preceding processing circuits. In this case, the processing engine of Stage 0 can sequentially perform radix-2 butterfly operations on each of the directly received n / 2 input pairs, sequentially outputting n / 2 result pairs. The controller of Stage 0 then sequentially writes these n / 2 result pairs into consecutive n / 2 addresses in the Stage 0 memory. These n / 2 input pairs are identical to the n / 2 input pairs sequentially stored in the aforementioned input memory.
[0117] Returning to Figure 1, in any subsequent intermediate stage i, as mentioned above, in order to match storage pairs with computation pairs and keep the data flow rates of memory access and computation consistent, the embodiments of this specification process butterfly operations in pairs. That is, the transform operation of stage i can be divided into n / 4 pairs of butterfly operations, where the processing of any pair of butterfly operations can follow the steps shown in Figure 2:
[0118] Step S21: The controller reads the first data pair and the second data pair from the first read address and the second read address in the memory of the (i-1)th stage, respectively.
[0119] Step S22: The first rearrangement unit performs a first rearrangement operation on the first and second data pairs, and outputs the third and fourth data pairs in sequence.
[0120] Step S23: The processing engine performs radix-2 butterfly operations on the third data pair and the fourth data pair respectively, and outputs the first result pair and the second result pair in sequence.
[0121] Step S24: The second rearrangement unit performs a second rearrangement operation on the first and second result pairs to obtain the fifth data pair and the sixth data pair.
[0122] Step S25: The controller writes the fifth data pair and the sixth data pair into the first write address and the second write address in the memory of the i-th stage, respectively.
[0123] The controller, processing engine, and first and second rearrangement units mentioned above all belong to the circuit part of the i-th stage.
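Steps S21 to S25 for one pair of butterfly operations can be sketched functionally as follows. This is a minimal Python model, not a description of the circuit: the function names, the list-based memories, and the way the twiddle factors `w_a` and `w_b` are passed in are assumptions made for illustration, and a DIT butterfly of the form (u + w·v, u − w·v) mod q is assumed.

```python
def rearrange(p, s):
    """Rearrangement of two pairs: (p0, p1), (s0, s1) -> (p0, s0), (p1, s1).
    The same operation serves as the first and the second rearrangement."""
    return (p[0], s[0]), (p[1], s[1])


def butterfly(pair, w, q):
    """Radix-2 DIT butterfly on a computation pair (u, v) with twiddle w."""
    u, v = pair
    t = (w * v) % q
    return (u + t) % q, (u - t) % q


def process_pair(prev_mem, cur_mem, addr1, addr2, w_a, w_b, q):
    """One pair of butterfly operations in an intermediate stage."""
    first, second = prev_mem[addr1], prev_mem[addr2]   # S21: read two storage pairs
    third, fourth = rearrange(first, second)           # S22: storage -> computation pairs
    res1 = butterfly(third, w_a, q)                    # S23: two butterflies
    res2 = butterfly(fourth, w_b, q)
    fifth, sixth = rearrange(res1, res2)               # S24: computation -> storage pairs
    cur_mem[addr1], cur_mem[addr2] = fifth, sixth      # S25: write back to the same addresses


# Tiny example with modulus 17 and both twiddle factors set to 1.
prev_mem = [(1, 2), (3, 4)]
cur_mem = [None, None]
process_pair(prev_mem, cur_mem, 0, 1, 1, 1, 17)
```

In this example the computation pairs are (1, 3) and (2, 4), and `cur_mem` ends up as `[(4, 6), (15, 15)]`. Applying `rearrange` twice returns the original pairs, which is why one hardware structure can serve both rearrangement units.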
[0124] The above five steps can be regarded as five sub-stages: reading data, the first rearrangement operation, the processing engine operation, the second rearrangement operation, and writing data. The solid arrows in stage 1 or stage 2 of Figure 1 indicate the data flow through these five sub-stages.
[0125] Taking stage 1 as an example, the circuit part of any intermediate stage i is described in detail below. As shown in Figure 1, it includes: a first rearrangement unit 11, a processing engine 12, a second rearrangement unit 13, a memory 14, and a controller 15.
[0126] Memory 14 can be a random access memory (RAM) used to store the computation results of the current stage in pairs, that is, one data pair is stored at one address. For example, the two data of a storage pair can be concatenated into one data word stored at one address in memory 14. Memory 14 can store, in pairs at n / 2 consecutive addresses, the n results obtained after the transform operation of stage 1 on the n-point input sequence.
[0127] In Figure 1, the gray half of memory 14 represents the read-data sub-stage of the next stage, and the white half of memory 14 represents the write-data sub-stage of the current stage. For example, memory 14 can be a dual-port RAM, which can perform a read operation and a write operation within the same clock cycle as long as the read and write addresses differ. This ensures that the read-data and write-data sub-stages can run simultaneously during pipelined operation.
[0128] As shown in Figure 1, the controller 15 controls the read operation of the previous stage's memory and the write operation of the current stage's memory, and generates the enable signals of the first rearrangement unit 11, the processing engine 12, and the second rearrangement unit 13.
[0129] The data flow of the above five sub-stages can be realized in stage 1 through the control of controller 15.
[0130] In some examples, the same control logic can be used to process any q-th pair of butterfly operations in any intermediate stage i. For an N-stage transform in DIT form, the control logic of the last stage can also be the same as that of the intermediate stages.
[0131] For example, for any stage i, the two read addresses in the memory of the previous stage and the two write addresses in the memory of the current stage are the same in each pair of butterfly operations, and the n / 4 pairs of butterfly operations of the entire stage can be implemented using the following grouped pairwise memory access (read / write) method:
[0132] The input / output data of the stage are divided into multiple groups, each group consisting of the input / output data of butterfly units that interleave with one another in the N-stage data flow graph (butterfly graph) of the target transform. Within each group, the input / output data are further divided into an upper half and a lower half. The groups are read / written one after another from top to bottom. Within any group, adjacent pairs of input / output data in the upper half and in the lower half are read / written alternately, from top to bottom. All groups within a stage are accessed in the same way, and the next group is accessed only after the previous group has been fully accessed.
[0133] Take the data flow diagram of the 16-point DIT NTT in Figure 3 as an example. In stage 2, the first four butterfly units interleave with one another and thus form the first group, and the last four butterfly units interleave with one another and thus form the second group. The data within any group is further divided into upper and lower halves. For example, in the first group, the data at addresses A0 and A1 is the upper half, and the data at addresses A2 and A3 is the lower half. The address access order in stage 2 is: first access the first group, then access the second group. Within both the first and second groups, the upper and lower halves of the data are accessed alternately from top to bottom. From this, the specific address access order shown at the bottom of Figure 3 can be derived.
[0134] The execution order of the butterfly operations in stage 2 is also indicated by the numbers 0 to 7 in the upper right corner of the butterfly units in the diagram. Every two consecutively executed butterfly operations constitute one pair of butterfly operations. For example, the 0th and 1st butterfly units are processed as a pair, and their two computation pairs are obtained by rearranging the two storage pairs at addresses A0 and A2. The computation pair of the 0th butterfly unit consists of the upper datum of each of the two storage pairs at addresses A0 and A2, while the computation pair of the 1st butterfly unit consists of the lower datum of each of these two storage pairs.
[0135] As shown in Figure 3, the data flow implemented in the embodiments of this specification processes data in pairs using two pairing modes. One pairing mode is the storage pair: in each stage of Figure 3, every two adjacent rows of data, taken in order from top to bottom, form a storage pair that is stored at one address in memory. The other pairing mode is the computation pair, which consists of the two input data or the two output data of the processing engine, that is, of a butterfly unit. The first and second rearrangement units switch between the two pairing modes by rearranging the data, allowing both the memory and the processing engine to operate according to their respective pairing modes.
[0136] According to the above description, it can be deduced that for any stage \(i\) (\(0 < i\), i.e., not the first stage) among the \(N\) stages, in any pair of butterfly operations the second read / write address is the first read / write address plus \(2^{i-1}\). For example, in the process shown in Figure 2, the second read address and the second write address are the first read address and the first write address plus \(2^{i-1}\), respectively.
[0137] In addition, it can be deduced that for any stage \(i\) (\(0 < i\), i.e., not the first stage) among the \(N\) stages, its \(n / 2\) addresses can be sequentially and evenly divided into \(2^{N-i-1}\) groups, and the access addresses of two consecutive pairs of butterfly operations are related as follows:
[0138] If the second read / write address of the current pair is the last address in its group, the first read / write address of the next pair is the second read / write address of the current pair plus \(1\); otherwise, it is the first read / write address of the current pair plus \(1\). As mentioned before, the second read / write address of the next pair is the first read / write address of the next pair plus \(2^{i-1}\).
[0139] From this, the following formula can be deduced to obtain the entire address access sequence of the \(i\)-th stage:
[0140]
[0141] where \(\mathrm{Addr}_{i,j}\) is the \(j\)-th accessed (read / write) address within the \(i\)-th stage, \(i \in \{1, 2, \ldots, N-1\}\), \(j \in \{0, 1, \ldots, n/2 - 1\}\). For the first read / write address in any \(q\)-th pair of butterfly operations, \(j = 2q\), and for the second read / write address in any \(q\)-th pair of butterfly operations, \(j = 2q + 1\).
[0142] For the operators in the above formula, "j >> i" means shifting \(j\) to the right by \(i\) bits, "j << i" means shifting \(j\) to the left by \(i\) bits, and the vacant bits are filled with \(0\), while "j % 2" means taking the modulus of \(j\) with respect to \(2\), i.e., the remainder of \(j\) divided by \(2\).
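The address rules stated above can be checked with a small Python sketch. The function `addr_by_rules` follows the textual rules directly (second address = first address plus 2^(i-1); jump to the next group after the last address of a group). The function `addr_closed_form` is one possible closed form built from shifts and a modulus; it is a reconstruction that reproduces those rules and is not claimed to be the exact formula of the specification. Both function names are hypothetical.

```python
def addr_by_rules(i, N):
    """Address access sequence of stage i (1 <= i <= N-1) for n = 2**N points,
    generated from the pairwise rules in the text."""
    n_half = 1 << (N - 1)      # n / 2 addresses per stage
    group = 1 << i             # addresses per group: (n/2) / 2**(N-i-1)
    stride = 1 << (i - 1)      # distance between the two addresses of a pair
    seq, first = [], 0
    while len(seq) < n_half:
        second = first + stride
        seq += [first, second]
        if second % group == group - 1:   # second address is the last of its group
            first = second + 1
        else:
            first = first + 1
    return seq


def addr_closed_form(i, j):
    """Assumed closed form: group base + offset inside the half + half select."""
    return ((j >> i) << i) + ((j % (1 << i)) >> 1) + ((j % 2) << (i - 1))


# 16-point example (N = 4), stage 2: groups {A0..A3} and {A4..A7}.
stage2 = addr_by_rules(2, 4)
```

For the 16-point example, `stage2` is `[0, 2, 1, 3, 4, 6, 5, 7]`, which agrees with the description that the 0th and 1st butterfly units use addresses A0 and A2.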
[0143] Therefore, in the above embodiments, the control logic of the data flow is simple and is the same for every stage, so the design scales well.
[0144] In some examples, the acceleration hardware shown in Figure 1 can run in a pipelined manner. After the pipeline is filled, all sub-stages of all stages can run simultaneously, that is, all hardware units in Figure 1 can run simultaneously. Some hardware units in Figure 1, such as the first and second rearrangement units and the processing engine, can themselves also be implemented using a pipelined architecture.
[0145] Still taking stage 1 in Figure 1 as an example, the pipeline operation is described in detail below.
[0146] Step S21 (sub-stage 1) of Figure 2 described above can be executed by controller 15 over two consecutive clock cycles. For example, in the first clock cycle, controller 15 reads the first data pair from the first read address of the memory of the previous stage (stage 0) and inputs it into the first rearrangement unit 11, so that the first rearrangement unit 11 receives the first data pair; in the second clock cycle, controller 15 reads the second data pair from the second read address of the memory of stage 0 and inputs it into the first rearrangement unit 11, so that the first rearrangement unit 11 receives the second data pair.
[0147] After the first rearrangement unit 11 receives the first data pair in the first clock cycle, it can execute step S22 (sub-stage 2) of Figure 2 over the following three consecutive clock cycles. For example, in the second clock cycle, the first rearrangement unit 11 temporarily stores the first data pair; in the third clock cycle, the first rearrangement unit 11 temporarily stores the second data of the second data pair and outputs the first data of the first data pair and the first data of the second data pair as the third data pair, which the processing engine receives at this time; in the fourth clock cycle, the first rearrangement unit 11 outputs the second data of the first data pair and the second data of the second data pair as the fourth data pair, which the processing engine receives at this time.
[0148] In some examples, the first rearrangement unit 11 may have, for example, the hardware structure shown in Figure 4.
[0149] As shown in Figure 4, the first rearrangement unit 11 includes a first register R1, a second register R0, a first multiplexer M0, a second multiplexer M1, and a multiplexing control unit, wherein the multiplexing control unit outputs control signals f0 and f1 to control the first multiplexer M0 and the second multiplexer M1 to alternately select one of their two inputs as the output.
[0150] One input of the first multiplexer M0 and one input of the second multiplexer M1 both receive the first data d0 of the currently read data pair. In the example of Figure 4, both of these input terminals are the input terminals corresponding to "0", that is, the terminal that is selected for output when the control signal of the multiplexer is 0. However, this is only an example and not a limitation. Those skilled in the art can make any modifications as needed and change the control signal of the multiplexer accordingly.
[0151] The first register R1 receives the second data d1 of the currently read data pair and temporarily stores it in the next clock cycle. The output of the first register R1 is coupled to another input of the first multiplexer M0 and another input of the second multiplexer M1. In Figure 4, these are the input terminals corresponding to "1".
[0152] The output of the first multiplexer M0 is coupled to the input of the second register R0.
[0153] The second register R0 outputs the first data x0 in the currently rearranged data pair, and the second multiplexer M1 outputs the second data x1 in the currently rearranged data pair.
[0154] Therefore, the first register R1 is used to directly store the input data d1, and the second register R0 is used to store another input data d0 or to store the data stored in R1 in the previous clock cycle.
[0155] In Figure 4, the multiplexing control unit includes a third register F0, a fourth register F1, and an inverter Inv1, but this is merely exemplary and not restrictive, and those skilled in the art can make any modifications as needed.
[0156] The output of the third register F0 is coupled to the control terminal of the first multiplexer M0, the input terminal of the fourth register F1, and the input terminal of the inverter Inv1. The third register F0 outputs the control signal f0.
[0157] The output of the fourth register F1 is coupled to the control terminal of the second multiplexer M1. The fourth register F1 outputs the control signal f1.
[0158] The output of inverter Inv1 is coupled to the input of the third register F0.
[0159] Once the multiplexing control unit is running stably, the two control signals f0 and f1 that it outputs are inverses of each other within any one clock cycle, and both f0 and f1 are inverted in each subsequent clock cycle.
[0160] With the first rearrangement unit 11 shown in Figure 4, step S22 (sub-stage 2) of Figure 2 can be performed over three consecutive clock cycles as follows:
[0161] In the second clock cycle, d0 and d1 of the first data pair are temporarily stored in the second register R0 and the first register R1, respectively.
[0162] In the third clock cycle, d1 of the second data pair is temporarily stored in the first register R1, and the second register R0 outputs d0 of the first data pair as x0. The second multiplexer M1 outputs d0 of the second data pair as x1. At this time, the output x0 and x1 constitute the third data pair. The second register R0 temporarily stores d1 of the first data pair stored in the first register R1 in the second clock cycle.
[0163] In the fourth clock cycle, the second register R0 outputs d1 from the first data pair as x0, and the second multiplexer M1 outputs d1 from the second data pair stored in the first register R1 in the third clock cycle as x1. At this time, the output x0 and x1 constitute the fourth data pair.
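The register and multiplexer behavior described for Figure 4 can be modeled with a small cycle-by-cycle Python sketch. It follows the stated connections (M0 feeds R0, R1 latches d1, R0 drives x0, M1 drives x1, F0 toggles through the inverter, F1 follows F0 with one cycle of delay). The class name, the reset values of F0 and F1, and the alignment of a `tick` call with the numbered clock cycles in the text are assumptions of this sketch.

```python
class RearrangementUnit:
    """Behavioral model of the rearrangement unit: one tick = one clock cycle."""

    def __init__(self):
        self.r0 = None   # second register R0, drives output x0
        self.r1 = None   # first register R1, latches d1
        self.f0 = 0      # third register F0 (control of M0), assumed reset to 0
        self.f1 = 0      # fourth register F1 (control of M1), assumed reset to 0

    def tick(self, d0, d1):
        # Outputs visible during this cycle, before the clock edge.
        x0 = self.r0
        x1 = d0 if self.f1 == 0 else self.r1      # multiplexer M1
        # Register updates at the clock edge (all use the old values).
        m0_out = d0 if self.f0 == 0 else self.r1  # multiplexer M0
        self.r0, self.r1 = m0_out, d1
        self.f0, self.f1 = 1 - self.f0, self.f0   # F0 toggles, F1 follows F0
        return x0, x1


# Feed four storage pairs, then one dummy cycle to flush the last output.
unit = RearrangementUnit()
stream = [(0, 1), (2, 3), (4, 5), (6, 7), (None, None)]
outputs = [unit.tick(d0, d1) for d0, d1 in stream]
rearranged = outputs[1:]   # the first output precedes any valid data
```

Here `rearranged` is `[(0, 2), (1, 3), (4, 6), (5, 7)]`: each two consecutive input pairs are turned into two output pairs, and a new input pair is accepted on every cycle, so the unit sustains one pair per clock cycle.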
[0164] Next, after the processing engine 12 receives the third data pair in the aforementioned third clock cycle, it can begin executing step S23 (sub-stage 3) of Figure 2. First, it performs a radix-2 butterfly operation on the third data pair and, after t clock cycles (i.e., in the (t+3)th clock cycle), outputs the first result pair to the second rearrangement unit 13. After the processing engine 12 receives the fourth data pair in the aforementioned fourth clock cycle, it begins a radix-2 butterfly operation on the fourth data pair and, after t clock cycles (i.e., in the (t+4)th clock cycle), outputs the second result pair to the second rearrangement unit 13. Here, t clock cycles is the time required for one butterfly operation.
[0165] For NTT, each butterfly operation includes modular addition, modular subtraction, and modular multiplication, which typically requires multiple clock cycles to complete the calculation, i.e., t>1. However, the processing engine 12 can also adopt a pipelined circuit architecture, so the two butterfly operations can be started in two consecutive clock cycles as described above.
[0166] In one example, processing engine 12 may have, for example, the pipeline structure shown in Figure 5.
[0167] As shown in Figure 5, the processing engine 12 uses a five-stage pipeline to process butterfly operations, where the first four pipeline stages calculate the modular multiplication using the Shoup algorithm and the fifth pipeline stage calculates the modular addition and modular subtraction. In Figure 5, the first four stages of the pipeline circuit can be regarded as a modular multiplication unit that includes three integer multipliers. The first integer multiplier is implemented using a digital signal processor (DSP), while the second and third integer multipliers are implemented using lookup tables (LUTs). This optimizes the critical path and yields better synthesis results.
[0168] In addition, as shown in Figure 5, the twiddle factor (the figure shows the twiddle factor used for NTTP) and the parameter m are pre-stored in the read-only memory (ROM) of the processing engine 12.
[0169] Therefore, the processing engine 12 with the structure of Figure 5 requires 5 clock cycles to complete one butterfly operation, that is, t = 5 in the description above.
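As background for the modular multiplication unit, the following Python sketch shows the generic Shoup technique for multiplying by a constant known in advance, followed by a DIT butterfly. It is a textbook-style illustration, not the circuit of Figure 5: the word width `BETA`, the function names, and the precomputed constant `w_shoup` are assumptions, and whether `w_shoup` corresponds to the parameter m stored in the ROM is not stated in the text. The sketch uses three integer multiplications per modular multiplication, consistent with the three integer multipliers mentioned above.

```python
BETA = 16  # assumed word width in bits; inputs x must satisfy 0 <= x < 2**BETA


def shoup_precompute(w, q):
    """Precompute floor(w * 2**BETA / q) for a fixed multiplier w < q."""
    return (w << BETA) // q


def shoup_mulmod(x, w, w_shoup, q):
    """Return (x * w) mod q using the precomputed constant w_shoup."""
    hi = (x * w_shoup) >> BETA   # multiplication 1: quotient estimate
    r = w * x - hi * q           # multiplications 2 and 3; r lies in [0, 2q)
    return r - q if r >= q else r


def dit_butterfly(u, v, w, w_shoup, q):
    """Radix-2 DIT butterfly: modular multiplication, then add and subtract."""
    t = shoup_mulmod(v, w, w_shoup, q)
    return (u + t) % q, (u - t) % q


# Small example with modulus 17 and twiddle factor 3.
q, w = 17, 3
w_shoup = shoup_precompute(w, q)
result = dit_butterfly(5, 10, w, w_shoup, q)
```

In this example 3 · 10 mod 17 = 13, so `result` is `(1, 9)`. Because the quotient estimate is off by at most one, a single conditional subtraction is enough to bring the product into the range [0, q).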
[0170] Next, after the second rearrangement unit 13 receives the first result pair in the aforementioned (t+3)th clock cycle, it can execute step S24 (sub-stage 4) of Figure 2 over three consecutive clock cycles. For example, in the (t+4)th clock cycle, the second rearrangement unit 13 temporarily stores the first result pair; in the (t+5)th clock cycle, the second rearrangement unit 13 temporarily stores the second result of the second result pair, and outputs the first result of the first result pair and the first result of the second result pair as the fifth data pair, which the memory 14 receives at this time; in the (t+6)th clock cycle, the second rearrangement unit 13 outputs the second result of the first result pair and the second result of the second result pair as the sixth data pair, which the memory 14 receives at this time.
[0171] The operation of the second rearrangement unit 13 is the same as that of the first rearrangement unit 11, except that the data pairs processed are different. Therefore, the second rearrangement unit 13 can have the same hardware structure as the first rearrangement unit 11, for example the hardware structure shown in Figure 4, which will not be described again here.
[0172] Next, after the memory 14 receives the fifth data pair in the aforementioned (t+5)th clock cycle, the controller 15 can control the memory 14 to execute step S25 (sub-stage 5) of Figure 2 over two consecutive clock cycles. For example, in the (t+5)th clock cycle, the controller 15 writes the fifth data pair to the first write address of the memory 14; in the (t+6)th clock cycle, the controller 15 writes the sixth data pair to the second write address of the memory 14.
[0173] The pipelined approach described above can be used to sequentially process all butterfly operations within a stage, with some operations of two consecutive pairs of butterfly operations performed simultaneously. For example, in the third and fourth clock cycles described above, the controller 15 can already read the two data pairs at the third and fourth read addresses, which correspond to the next pair of butterfly operations, and processing continues according to the pipelined process described above, which will not be elaborated further here.
[0174] Furthermore, in the pipeline, all N stages can run simultaneously. However, since later stages need to read the computation results of earlier stages, the i-th stage has an initial delay. With the grouped pairwise memory access (read / write) method described above, the i-th stage can begin its first read operation only after the (i-1)-th stage has completed the computation of its first group of butterfly units and written the results to memory. Therefore, in response to all data pairs at the first group of addresses of the (i-1)-th stage having been written, the controller of the i-th stage begins its first read operation in the first clock cycle that follows, reading the first data pair from the first read address of the memory of the (i-1)-th stage. According to the control logic and data flow described above, in that same first clock cycle the (i-1)-th stage completes the write operation at the address that is the second read address in the address access sequence of the i-th stage. The controller of the i-th stage can therefore begin its second read operation in the second clock cycle that follows, reading the second data pair from the second read address of the memory of the (i-1)-th stage. Only the first read operation has to wait; subsequent read operations do not. This maximizes the efficiency of the pipeline.
[0175] Algorithm 1, written in pseudocode below, illustrates a complete N-stage transform operation of a pipelined DIT NTTP. This algorithm uses the grouped pairwise memory access method described above. To express the register behavior in the pseudocode of Algorithm 1, that is, to indicate that the value in a register (the value obtained in the previous sub-stage) is used by the later sub-stage before it is updated, the first sub-stage is placed at the bottom of the code segment, and the sub-stages are started one after another from bottom to top according to their conditional statements.
[0176] The pipeline operations represented by these pseudocodes have been described in detail above and will not be repeated here.
[0177] Additionally, it can be understood that since Algorithm 1 describes NTTP, the twiddle factor of the j-th executed butterfly unit in the i-th stage is:
[0178] ,
[0179] in ,and .
[0180] If Algorithm 1 is applied to other transforms, the value of the twiddle factor can be changed accordingly. For the inverse transform, a multiplication by 1 / n is also added.
[0181] Algorithm 1: Pipeline DIT NTTP
[0182] Input: array .
[0183] Output: Array Z q .
[0184] 1: Function Addr(j, / / Memory read / write address calculation
[0185] 2:
[0186] 3: Return address
[0187] 4: j = 0
[0188] 5: While j < n / 2 / / Stage 0
[0189] 6: mod
[0190] 7: mod
[0191] 8: j = j + 1
[0192] 9: i = 1
[0193] 10: While i < N / / Stage 1 to Stage N-1
[0194] 11:
[0195] 12: j = 0
[0196] 13: While j < n / 2 + 4 / / Any pair of butterfly operations in each stage contains the following 5 sub-stages
[0197] 14: If 4 ≤ Then
[0198] / / At clock t+5 and t+6, the output is rearranged twice in sub-phase 4 and two memory pairs are written in sub-phase 5.
[0199] 15: ← Addr (j − 4, )
[0200] 16:
[0201] 17:
[0202] 18: If 3 ≤ j < / 2 + 3 Then / / then implement the temporary storage operation in sub-stage 4 at clock t+4 and t+5.
[0203] 19:
[0204] 20:
[0205] 21:
[0206] 22: If 2 ≤ j < / 2 + 2 then
[0207] / / The 3rd and 4th clock cycles implement the two output rearrangements in sub-stage 2 and begin the two processing engine operations in sub-stage 3. The calculation results are obtained at clock cycles t+3 and t+4 respectively.
[0208] 23: ← ⌊ ⌋
[0209] 24:
[0210] 25:
[0211] 26:
[0212] 27: mod
[0213] 28: mod
[0214] 29: If 1 ≤ j < / 2 + 1 Then / / The second and third clock cycles implement the temporary storage operation in sub-stage 2.
[0215] 30: SRU0[0] ← ? SRU0[1]:
[0216] 31: SRU0[1] ←
[0217] 32:
[0218] 33: If j < / 2 Then / / The first and second clock cycles implement the reading of the two memory pairs in sub-stage 1.
[0219] 34: ← Addr (j, )
[0220] 35:
[0221] 36:
[0222] 37: j = j + 1
[0223] 38: i = i + 1
[0224] 39: Return A N−1
[0225] The algorithm described above delivers efficient data flow, enabling simultaneous pairwise execution of memory access and butterfly operations within the processing engine on the pipeline, while maintaining consistent speed between memory access and computation. Each clock cycle, the processing engine receives two input data points from a butterfly unit and generates two output data points for that unit. All butterfly units within a stage are computed by the processing engine in a predetermined order.
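The data flow can also be checked end to end with a functional (not cycle-accurate) Python model that stores data in pairs, visits addresses in the grouped pairwise order, and rearranges before and after each pair of butterflies. The sketch makes several assumptions that are not taken from the specification: it computes a plain cyclic NTT with a primitive n-th root of unity `w` rather than NTTP, it uses an address closed form reconstructed from the textual rules, and all function names are hypothetical. The result is compared with a direct O(n²) evaluation.

```python
def bit_reverse(k, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def addr(i, j):
    """Assumed closed form of the j-th accessed address in stage i (i >= 1)."""
    return ((j >> i) << i) + ((j % (1 << i)) >> 1) + ((j % 2) << (i - 1))


def ntt_paired(x, w, q):
    """Cyclic DIT NTT of x modulo q, computed stage by stage on paired memory."""
    n = len(x)
    N = n.bit_length() - 1                     # n = 2**N is assumed
    y = [x[bit_reverse(k, N)] for k in range(n)]
    # Stage 0: storage pairs are already computation pairs, twiddle factor 1.
    mem = [((y[2 * a] + y[2 * a + 1]) % q, (y[2 * a] - y[2 * a + 1]) % q)
           for a in range(n // 2)]
    for i in range(1, N):
        nxt = [None] * (n // 2)
        step = n >> (i + 1)                    # exponent step of this stage's twiddles
        for p in range(n // 4):                # n/4 pairs of butterfly operations
            a1, a2 = addr(i, 2 * p), addr(i, 2 * p + 1)
            (p0, p1), (s0, s1) = mem[a1], mem[a2]          # read two storage pairs
            t = (2 * a1) % (1 << i)            # position of p0 inside its butterfly block
            res = []
            for k, (u, v) in enumerate(((p0, s0), (p1, s1))):   # first rearrangement
                tv = pow(w, (t + k) * step, q) * v % q
                res.append(((u + tv) % q, (u - tv) % q))        # butterfly
            nxt[a1] = (res[0][0], res[1][0])   # second rearrangement and write
            nxt[a2] = (res[0][1], res[1][1])
        mem = nxt
    return [d for pair in mem for d in pair]


def ntt_naive(x, w, q):
    n = len(x)
    return [sum(x[j] * pow(w, j * k, q) for j in range(n)) % q for k in range(n)]


# 16-point check modulo 17, where 3 is a primitive 16th root of unity.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
matches = ntt_paired(data, 3, 17) == ntt_naive(data, 3, 17)
```

The model confirms that reading at the two paired addresses, rearranging, computing, rearranging back, and writing to the same two addresses yields the transform in natural order. It says nothing about timing, which is the subject of Algorithm 1.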
[0226] This data flow can process consecutive n-point target transforms at an average of n / 2 cycles per transform. As mentioned earlier, the i-th stage can begin its first read operation once the (i-1)th stage has completed the calculation of its first group of butterfly units and written the results to memory, which introduces a pipeline delay of \(2^{i-1} - 1\) cycles between adjacent stages. Additionally, the delay within stage 0 is 7 cycles, and the fixed delay within each subsequent stage is 9 cycles. Therefore, the total pipeline delay of the target transform data flow can be derived from the following formula:
[0227]
[0228] The DIT NTT / NTTP accelerated hardware architecture described in the embodiments of this specification can be synthesized and implemented on various FPGA platforms. Table 1 below provides a comparison of detailed information regarding hardware implementation results between the embodiments of this specification and prior art, including resource consumption, performance metrics (frequency, latency, throughput), and area-time product as an expression of hardware efficiency. The area is evaluated by converting the BRAM and DSP in the FPGA into an equivalent number of slices. Latency reflects the speed of processing NTT operations; lower latency indicates faster computation. Since the embodiments of this specification are pipelined designs, their ideal application scenario is processing consecutive NTT operations; therefore, the average latency for processing 100 consecutive NTT operations is given in the comparison results. The area-time product is an indicator that considers both hardware resource consumption and computation speed; a lower area-time product indicates better hardware efficiency.
[0229]
[0230] The data for each prior-art design used as a reference in Table 1 are taken from the publication (journal and year) indicated in the name of that design.
[0231] As can be seen from Table 1, compared with the current mainstream NTT accelerators, the NTT structure proposed in the embodiments of this specification achieves a speed improvement of up to 4.8 times and an area-time product improvement of up to 4.3 times.
[0232] The embodiments described in this specification achieve a good balance between performance and resource utilization, resulting in high hardware efficiency.
[0233] The circuit structure and operation flow of the embodiments in this specification have been described above using the DIT form as an example. By appropriately modifying it, the circuit structure and operation flow of the DIF form can be obtained.
[0234] The topology of each stage in the DIF form is the mirror image of that in the DIT form; that is, the i-th stage in the DIF form corresponds to the (N-1-i)-th stage in the DIT form. Therefore, the DIT-form acceleration hardware structure shown in Figure 1 can be horizontally flipped to obtain the DIF-form acceleration hardware structure.
[0235] The first stage of the DIF form, stage 0, corresponds to stage N-1 in Figure 1; it therefore has the same hardware structure as the intermediate-stage circuits in Figure 1 and performs the same transform operations.
[0236] The final stage N-1 of the DIF form corresponds to stage 0 in Figure 1. Therefore, the circuitry of stage N-1 consists only of the stage N-1 controller and the stage N-1 processing engine, without rearrangement units. Accordingly, the transform operation of stage N-1 includes: the stage N-1 controller sequentially reading n / 2 data pairs from n / 2 addresses in the memory of stage N-2; and the stage N-1 processing engine sequentially performing radix-2 butterfly operations on each of the n / 2 data pairs, outputting n / 2 result pairs sequentially.
[0237] For each intermediate stage of the DIF form, the circuit part is the same as that of the DIT form in Figure 1, except that in the control logic, "i" in the aforementioned calculation formulas is replaced with "N-i-1". For example, in the transform operation of the i-th stage in the DIF form, the n / 2 addresses are sequentially divided into \(2^{i}\) groups, and the second read address and the second write address in any pair of butterfly operations are the first read address and the first write address plus \(2^{N-i-2}\), respectively.
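The substitution of N-i-1 for i can be illustrated by generating the DIF address access order from the same pairwise rules as for the DIT form. The Python sketch below is an illustration under that assumption; the function name is hypothetical, and it covers only the stages that process butterflies in pairs (DIF stages 0 to N-2).

```python
def dif_address_sequence(i, N):
    """Address access order of DIF stage i (0 <= i <= N-2) for n = 2**N points,
    assuming the DIT pairwise rules with i replaced by N-i-1."""
    n_half = 1 << (N - 1)        # n / 2 addresses
    group = n_half >> i          # n/2 addresses split into 2**i groups
    stride = 1 << (N - i - 2)    # second address = first address + 2**(N-i-2)
    seq, first = [], 0
    while len(seq) < n_half:
        second = first + stride
        seq += [first, second]
        # After the last address of a group, continue with the next group.
        first = second + 1 if second % group == group - 1 else first + 1
    return seq


# 16-point example (N = 4): DIF stages 0, 1, 2 mirror DIT stages 3, 2, 1.
dif_orders = [dif_address_sequence(i, 4) for i in range(3)]
```

For N = 4 this gives `[0, 4, 1, 5, 2, 6, 3, 7]` for DIF stage 0, `[0, 2, 1, 3, 4, 6, 5, 7]` for DIF stage 1, and `[0, 1, 2, 3, 4, 5, 6, 7]` for DIF stage 2, which are the access orders of DIT stages 3, 2, and 1 respectively.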
[0238] In addition, the DIF-based processing engine performs modular multiplication after modular addition and modular subtraction.
[0239] Since the pipeline structure proposed in the embodiments of this specification can generate two outputs in each cycle, a pipelined polynomial multiplier can be implemented based on the embodiments of this specification. It is only necessary to input two polynomials into two parallel NTTP units, and pass the result of the two data points multiplied by the Barrett algorithm to an INTTP unit to obtain the result of polynomial multiplication.
[0240] Figure 6 A schematic diagram of a polynomial multiplier according to one embodiment is shown. The polynomial multiplier receives an n-point first input sequence and an n-point second input sequence, and outputs an n-point polynomial multiplication output sequence.
[0241] As shown in Figure 6, the polynomial multiplier includes a first transformation module 61, a second transformation module 62, a point-by-point multiplication hardware module 63, and a third transformation module 64, wherein the first transformation module 61, the second transformation module 62, and the third transformation module 64 can all be implemented using the acceleration hardware described in the embodiments of this specification.
[0242] The first transformation module 61 can be acceleration hardware for NTTP in DIF form according to an embodiment of this specification. It performs NTTP in DIF form on the first input sequence, i.e., the coefficient sequence of the polynomial a(x), and outputs the first output sequence a'.
[0243] The second transformation module 62 can also be acceleration hardware for NTTP in DIF form according to the embodiments of this specification, and its structure can be the same as that of the first transformation module 61. It performs NTTP in DIF form on the second input sequence, i.e., the coefficient sequence of the polynomial b(x), and outputs the second output sequence b'.
[0244] The point-by-point multiplication hardware module 63 can perform point-by-point multiplication on the first and second output sequences a' and b' to output a third output sequence c'. In the case of a pipelined architecture, this point-by-point multiplication hardware module can be implemented using the Barrett algorithm, performing point-by-point multiplication on the two data outputs of the first transformation module 61 and the second transformation module 62 in each clock cycle.
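As an illustration of Barrett-based point-by-point multiplication, the sketch below shows one common formulation of Barrett reduction in Python. It is a software model under stated assumptions, not the circuit of the embodiments: the names `barrett_params` and `barrett_mulmod`, the choice k = bit length of q, and the example modulus 3329 are all hypothetical, and word sizes and pipeline depth are not modeled.

```python
def barrett_params(q):
    """Precompute k and mu = floor(4^k / q) for a modulus q < 2^k."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_mulmod(a, b, q, k, mu):
    """Return a*b mod q for 0 <= a, b < q without a division by q."""
    z = a * b                      # product, z < q^2 < 4^k
    t = (z * mu) >> (2 * k)        # estimate of floor(z / q), never too large
    r = z - t * q                  # r is congruent to z mod q, with r >= 0
    while r >= q:                  # the estimate is slightly low at worst,
        r -= q                     # so only a few subtractions are needed
    return r


if __name__ == "__main__":
    q = 3329                       # example modulus, chosen for illustration
    k, mu = barrett_params(q)
    print(barrett_mulmod(1234, 3000, q, k, mu) == (1234 * 3000) % q)
```

Each call uses three integer products (a*b, z*mu, and t*q). This is a natural fit for a modular multiplication unit built from three integer multipliers, although mapping these products onto specific multipliers is an assumption made here for illustration.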
[0245] The third transformation module 64 can be acceleration hardware for INTTP in DIT form according to embodiments of this specification (e.g., as shown in Figure 1). It performs INTTP in DIT form on the third output sequence c', and outputs the n-point polynomial multiplication output sequence, that is, the coefficient sequence of the polynomial c(x).
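[0246] The DIF form takes a naturally ordered input sequence and produces a bit-reversed output sequence, while the DIT form takes a bit-reversed input sequence and produces a naturally ordered output sequence. Because point-by-point multiplication does not depend on the ordering of its operands, as long as both are ordered identically, the bit-reversed outputs of the DIF-form NTTP can be fed directly into the DIT-form INTTP. The bit-reversal process is thus eliminated in the polynomial multiplier described above, improving operating efficiency.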
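The data flow of the polynomial multiplier can be checked with a small software reference model. The Python sketch below is an assumption-laden illustration, not the hardware of the embodiments: all function names are hypothetical, the NWC pre-processing and post-processing are kept as separate steps rather than fused into the transforms as in NTTP and INTTP, and the modulus q is assumed to be a prime with 2n dividing q - 1. It requires Python 3.8 or later for `pow(x, -1, q)`.

```python
def find_psi(n, q):
    """Return psi with psi^n = -1 mod q, i.e. a primitive 2n-th root of unity."""
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:
            return psi
    raise ValueError("no primitive 2n-th root of unity modulo q")


def ntt_dif(a, omega, q):
    """DIF transform: natural-order input, bit-reversed output."""
    a, n = list(a), len(a)
    length = n // 2
    while length >= 1:
        w_step = pow(omega, n // (2 * length), q)
        for start in range(0, n, 2 * length):
            w = 1
            for j in range(start, start + length):
                u, v = a[j], a[j + length]
                a[j] = (u + v) % q                  # add/subtract first
                a[j + length] = ((u - v) * w) % q   # then multiply
                w = w * w_step % q
        length //= 2
    return a


def ntt_dit(a, omega, q):
    """DIT transform: bit-reversed input, natural-order output."""
    a, n = list(a), len(a)
    length = 1
    while length < n:
        w_step = pow(omega, n // (2 * length), q)
        for start in range(0, n, 2 * length):
            w = 1
            for j in range(start, start + length):
                u, v = a[j], a[j + length] * w % q  # multiply first
                a[j] = (u + v) % q                  # then add/subtract
                a[j + length] = (u - v) % q
                w = w * w_step % q
        length *= 2
    return a


def poly_mul_nwc(a, b, q):
    """Multiply a(x) and b(x) modulo (x^n + 1, q) with no bit-reversal step."""
    n = len(a)
    psi = find_psi(n, q)
    omega = psi * psi % q
    psi_inv, omega_inv, n_inv = pow(psi, -1, q), pow(omega, -1, q), pow(n, -1, q)
    # Pre-processing: scale coefficient j by psi^j (fused into NTTP in hardware).
    a_t = [x * pow(psi, j, q) % q for j, x in enumerate(a)]
    b_t = [x * pow(psi, j, q) % q for j, x in enumerate(b)]
    a_hat = ntt_dif(a_t, omega, q)                  # bit-reversed order
    b_hat = ntt_dif(b_t, omega, q)                  # same bit-reversed order
    c_hat = [x * y % q for x, y in zip(a_hat, b_hat)]
    c_t = ntt_dit(c_hat, omega_inv, q)              # back to natural order
    # Post-processing: scale by n^-1 and psi^-j (fused into INTTP in hardware).
    return [x * n_inv % q * pow(psi_inv, j, q) % q for j, x in enumerate(c_t)]


if __name__ == "__main__":
    print(poly_mul_nwc([1, 2, 3, 4, 5, 6, 7, 8], [3, 1, 4, 1, 5, 9, 2, 6], 17))
```

Because both forward transforms leave their results in the same bit-reversed order, the point-by-point products line up index by index, and the DIT-form inverse consumes them as they are. No permutation appears anywhere in `poly_mul_nwc`.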
[0247] The polynomial multipliers according to the embodiments of this specification can be applied to a variety of other devices, such as post-quantum cryptography hardware accelerators.
[0248] Those skilled in the art will recognize that, in one or more of the examples above, the functions described in this invention can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
[0249] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solution of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for performing a target transformation by accelerating hardware, the target transformation comprising N stages of transformation operations, the accelerating hardware comprising N circuit sections corresponding to the N stages, wherein any i-th stage (excluding the first and last stages) corresponds to an i-th circuit section comprising a controller, a processing engine, a memory, and first and second rearrangement units, the transformation operation of the i-th stage comprising: The controller reads the first data pair and the second data pair from the first read address and the second read address in the memory of the (i-1)th stage, respectively; One address stores one data pair; The first rearrangement unit performs a first rearrangement operation on the first and second data pairs, and outputs the third and fourth data pairs in sequence. The processing engine performs radix-2 butterfly operations on the third and fourth data pairs in turn, and outputs the first and second result pairs in turn. The second rearrangement unit performs a second rearrangement operation on the first and second result pairs to obtain the fifth data pair and the sixth data pair. The controller writes the fifth data pair and the sixth data pair into the first write address and the second write address in the memory of the i-th stage, respectively.
2. The method of claim 1, wherein, The target transformation is performed on an n-point input sequence, where n is a power of 2. The memory of the i-th stage stores the n results obtained after the transformation operation of the i-th stage through consecutive n / 2 addresses, and the values of the first and second read addresses are the same as the values of the first and second write addresses, respectively.
3. The method of claim 2, wherein, The second read address and write address are respectively the first read address and write address plus 2^(i-1); or, The second read address and write address are respectively the first read address and write address plus 2^(N-i-2).
4. The method of claim 2, wherein, In the transformation operation of the i-th stage, the n / 2 addresses are sequentially divided into 2^(N-i-1) groups or 2^i groups; The transformation operation of the i-th stage also includes: After reading the second data pair, the controller reads data from the third read address and the fourth read address in the memory of the (i-1)th stage in two consecutive clock cycles; Wherein, if the second read address is the last address in its group, then the third read address is the second read address plus 1; otherwise, the third read address is the first read address plus 1; When the addresses are divided into 2^(N-i-1) groups, the fourth read address is the third read address plus 2^(i-1); when the addresses are divided into 2^i groups, the fourth read address is the third read address plus 2^(N-i-2).
5. The method of claim 1, wherein, The controller reads the first data pair and the second data pair from the first read address and the second read address in the memory of stage i-1, respectively, including: In the first clock cycle, the controller reads a first data pair from the first read address and inputs it into the first rearrangement unit; in the second clock cycle following the first clock cycle, the controller reads a second data pair from the second read address and inputs it into the first rearrangement unit.
6. The method of claim 5, wherein, The processing engine sequentially performs radix-2 butterfly operations on the third and fourth data pairs, including: In the third clock cycle, the processing engine begins to perform a radix-2 butterfly operation on the third data pair, and after t clock cycles, the first result pair is output to the second rearrangement unit. In the fourth clock cycle following the third clock cycle, the processing engine begins to perform a radix-2 butterfly operation on the fourth data pair, and after t clock cycles, the resulting second result pair is output to the second rearrangement unit.
7. The method of claim 6, wherein, The controller writes the fifth and sixth data pairs into the first and second write addresses in the memory of the i-th stage, respectively, including: In the (t+5)th clock cycle, the controller writes the fifth data pair to the first write address; In the (t+6)th clock cycle, the controller writes the sixth data pair to the second write address.
8. The method according to claim 1, wherein, The first rearrangement unit performs a first rearrangement operation on the first and second data pairs, including: During the first clock cycle, the first rearrangement unit receives the first data pair; During the second clock cycle, the first rearrangement unit receives the second data pair and temporarily stores the first data pair; In the third clock cycle following the second clock cycle, the first rearrangement unit outputs the first data from the first data pair and the first data from the second data pair as the third data pair, and temporarily stores the second data from the second data pair. In the fourth clock cycle following the third clock cycle, the first rearrangement unit outputs the second data from the first data pair and the second data from the second data pair as the fourth data pair.
9. The method according to claim 6, wherein, The second rearrangement unit performs a second rearrangement operation on the first and second result pairs, including: In the (t+3)th clock cycle, the second rearrangement unit receives the first result pair; In the (t+4)th clock cycle, the second rearrangement unit receives the second result pair and temporarily stores the first result pair; In the (t+5)th clock cycle, the second rearrangement unit outputs the first result of the first result pair and the first result of the second result pair as the fifth data pair, and temporarily stores the second result of the second result pair; In the (t+6)th clock cycle, the second rearrangement unit outputs the second result of the first result pair and the second result of the second result pair as the sixth data pair.
10. The method according to claim 2, wherein, The 0th circuit section corresponding to the first stage among the N circuit sections includes a 0th-order controller, a 0th-order processing engine, and a 0th-order memory. The transformation operations of the 0th stage include: The 0th-order processing engine sequentially performs a radix-2 butterfly operation on each of the n / 2 input pairs consisting of two adjacent data points arranged in bit-reversed order, and sequentially outputs n / 2 result pairs. The 0th-order controller writes the n / 2 result pairs sequentially to n / 2 consecutive addresses in the 0th-order memory.
11. The method according to claim 2, wherein, The (N-1)th circuit section corresponding to the final stage among the N circuit sections includes an (N-1)th order controller and an (N-1)th order processing engine. The transformation operations in the (N-1)th stage include: The (N-1)th stage controller sequentially reads n / 2 data pairs from n / 2 addresses in the memory of the (N-2)th stage; The (N-1)th order processing engine sequentially performs radix-2 butterfly operations on the n / 2 data pairs and outputs n / 2 result pairs sequentially.
12. The method according to claim 1, wherein, The 0th circuit part corresponding to the first stage or the (N-1)th circuit part corresponding to the end stage among the N circuit parts has the same hardware structure as the i-th circuit part and performs the same transformation operation.
13. The method according to claim 1, wherein, The target transformation is a Fast Fourier Transform (FFT), an Inverse Fast Fourier Transform (IFFT), a Number Theoretic Transform (NTT), an Inverse Number Theoretic Transform (INTT), an NTTP transform obtained by fusing the pre-processing of Negative Wrapped Convolution (NWC) with the NTT, or an INTTP transform obtained by fusing the INTT with the post-processing of NWC. The transformation operations of the N stages are performed using either the decimation-in-time (DIT) or the decimation-in-frequency (DIF) method.
14. Acceleration hardware for performing a target transformation, the target transformation comprising N stages of transformation operations, the acceleration hardware comprising N circuit sections corresponding to the N stages, wherein the i-th circuit section corresponding to any i-th stage (excluding the first and last stages) comprises a controller, a processing engine, a memory, and first and second rearrangement units; in the transformation operation of the i-th stage: The controller is configured to read a first data pair and a second data pair from the first read address and the second read address in the memory of the (i-1)th stage, respectively; one address stores one data pair; The first rearrangement unit is configured to perform a first rearrangement operation on the first and second data pairs and output the third and fourth data pairs in sequence. The processing engine is configured to perform radix-2 butterfly operations on the third and fourth data pairs in sequence, and output the first and second result pairs in sequence. The second rearrangement unit is configured to perform a second rearrangement operation on the first result pair and the second result pair to obtain a fifth data pair and a sixth data pair. The controller is also configured to write the fifth data pair and the sixth data pair to the first write address and the second write address in the memory of the i-th stage, respectively.
15. The acceleration hardware according to claim 14, wherein, The first rearrangement unit includes a first register, a second register, a first multiplexer, a second multiplexer, and a multiplexing control unit. In this configuration, one input of the first multiplexer and one input of the second multiplexer both receive the first data from the currently read data pair. The first register receives and temporarily stores the second data in the currently read data pair. The output of the first register is coupled to another input of the first multiplexer and another input of the second multiplexer. The output of the first multiplexer is coupled to the input of the second register. The second register outputs the first data in the currently rearranged data pair, and the second multiplexer outputs the second data in the currently rearranged data pair. The multiplexing control unit is configured to control the first multiplexer and the second multiplexer to alternately select one of their two inputs as the output.
16. The acceleration hardware according to claim 15, wherein, The multiplexing control unit includes a third register, a fourth register, and an inverter. The output of the third register is coupled to the control terminal of the first multiplexer, the input of the fourth register, and the input of the inverter. The output of the fourth register is coupled to the control terminal of the second multiplexer. The output of the inverter is coupled to the input of the third register.
17. The acceleration hardware according to claim 14, wherein, The first rearrangement unit and the second rearrangement unit have the same hardware structure.
18. The acceleration hardware according to claim 14, wherein, The processing engine includes a modular multiplication unit, which includes a first, a second, and a third integer multiplier, wherein the first integer multiplier is implemented using a digital signal processor (DSP), while the second and third integer multipliers are implemented using a lookup table (LUT).
19. A polynomial multiplier that receives an n-point first input sequence and an n-point second input sequence, and outputs an n-point polynomial multiplication output sequence; the polynomial multiplier includes a first transformation module, a second transformation module, a point-by-point multiplication hardware module, and a third transformation module, wherein the first transformation module, the second transformation module, and the third transformation module all include the acceleration hardware according to claim 14; The first transformation module is configured to perform NTTP on the first input sequence in DIF form and output a first output sequence; The second transformation module is configured to perform NTTP on the second input sequence in DIF form and output a second output sequence; The point-by-point multiplication hardware module is configured to perform point-by-point multiplication on the first and second output sequences and output a third output sequence. The third transformation module is configured to perform INTTP on the third output sequence in the form of DIT, and output the n-point polynomial multiplication output sequence.