Index generating apparatus for scalable NTT and method thereof
An optimized NTT index generation device and method address the large memory requirements and complex access patterns of NTT computation, reducing hardware resources and increasing computing speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN UNIV OF SCI & TECH
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-30
AI Technical Summary
Existing NTT computation suffers from high memory requirements and complex memory access patterns, which have become key factors restricting further improvement in the hardware performance of post-quantum cryptography and fully homomorphic encryption.
An index generation device and method for scalable NTT are proposed, including a continuous counter, a stage counting module, and an index parsing module. The continuous counter outputs multiple continuous values, the stage counting module iterates, and the index parsing module generates the index value corresponding to each round of the PE array. The index generation algorithm is optimized to reduce hardware resources.
It improves the speed of generating index factors in scalable NTT, reduces the use of hardware resources, increases the overall computing speed, and simplifies the control circuit.
Figure CN122309899A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer technology, and specifically relates to an index generation device and method for scalable NTT.
Background Technology
[0002] In the mid-1990s, Shor and Grover each published a quantum algorithm. Shor's algorithm showed that a quantum computer can factor large integers in polynomial time, and Grover's algorithm gives a quadratic speedup for unstructured search. Shor's result showed that existing classical public-key cryptography algorithms could be broken by sufficiently large and stable quantum computers. Around 2019, quantum computers were gradually released both domestically and internationally, making research in post-quantum cryptography (PQC) an urgent priority. Post-quantum cryptography is cryptography resistant to quantum computer attacks and primarily encompasses five types of algorithms: hash-based, multivariate-based, code-based, supersingular isogeny-based, and lattice-based cryptography. Owing to its computational simplicity, ease of parallelization, and ability to implement almost all the functions of traditional public-key cryptography, lattice-based cryptography is considered highly promising.
[0003] Most of these lattice-based schemes rely on hard problems such as learning with errors (LWE), ring learning with errors (RLWE), and module learning with errors (MLWE). The core operation in the encryption process is polynomial vector multiplication, and the time complexity of schoolbook polynomial multiplication is O(n^2), where n is the number of polynomial coefficients. In lattice-based post-quantum cryptography algorithms, n is almost always greater than or equal to 256, while in lattice-based fully homomorphic encryption n can reach tens of thousands. The efficiency of polynomial multiplication therefore determines the encryption efficiency of lattice-based cryptography. Among the various polynomial multiplication methods, the number theoretic transform (NTT) reduces the complexity of polynomial multiplication to O(n log n), so NTT is a major part of lattice cryptography implementations. NTT can be seen as an FFT over a finite field. Although the FFT has lower computational complexity than the direct DFT, it still involves a large number of complex-number operations, which NTT avoids by using modular integer arithmetic. Lattice-based cryptographic schemes perform their arithmetic on a polynomial ring R_q, where q is the modulus and n is a power of 2. A polynomial with n coefficients defined on R_q is written as:
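[0004] $a(x) = \sum_{i=0}^{n-1} a_i x^i$,
[0005] Let ω be a primitive n-th root of unity modulo q, that is, ω satisfies $\omega^n \equiv 1 \pmod{q}$ and $\omega^k \not\equiv 1 \pmod{q}$ for $0 < k < n$. The NTT formula for a(x) is therefore:
[0006] $A_i = \sum_{j=0}^{n-1} a_j \, \omega^{ij} \bmod q$
[0007] where A_i are the coefficients of the transformed n-coefficient polynomial and $i \in [0, n-1]$. Similarly, the inverse transform INTT requires the modular inverses $n^{-1}$ and $\omega^{-1}$ modulo q. The formula for calculating INTT is:
[0008] $a_i = n^{-1} \sum_{j=0}^{n-1} A_j \, \omega^{-ij} \bmod q$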
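The forward and inverse transforms described above can be illustrated with a minimal Python sketch. It evaluates the defining sums directly in O(n^2) time rather than using butterflies, and the toy parameters q = 17, n = 4, ω = 4 are assumptions chosen for illustration and do not come from the patent. The pointwise product computed here corresponds to cyclic convolution; practical lattice schemes that work modulo x^n + 1 need an additional negacyclic twist that is not shown.

```python
def ntt(a, omega, q):
    """Direct O(n^2) evaluation of A_i = sum_j a_j * omega^(i*j) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]


def intt(A, omega, q):
    """Inverse transform: a_i = n^-1 * sum_j A_j * omega^(-i*j) mod q."""
    n = len(A)
    n_inv = pow(n, -1, q)          # modular inverse of n (Python 3.8+)
    omega_inv = pow(omega, -1, q)  # modular inverse of omega
    return [n_inv * sum(A[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


def cyclic_convolution(a, b, q):
    """Schoolbook O(n^2) cyclic convolution, used only as a reference."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) % q for k in range(n)]


# Toy parameters (assumed): 4 is a primitive 4th root of unity modulo 17.
q, n, omega = 17, 4, 4
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

A, B = ntt(a, omega, q), ntt(b, omega, q)
c = intt([x * y % q for x, y in zip(A, B)], omega, q)
print(c == cyclic_convolution(a, b, q))  # True
```

The pointwise multiplication in the transform domain replaces the quadratic schoolbook product, which is where the O(n log n) cost comes from once the transforms themselves are computed with butterflies.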
[0009] While NTT can simplify polynomial multiplication, it also suffers from high memory requirements and complex memory access patterns, which will be key factors limiting further improvements in the hardware performance of post-quantum cryptography and fully homomorphic cryptography.
[0010] Because different lattice-based encryption schemes involve NTT calculations with different numbers of points, a scalable NTT can adapt not only to different numbers of points but also to different moduli, widths of the arithmetic unit array, and depths of the arithmetic unit array.
Summary of the Invention
[0011] The problem this invention aims to solve is to improve the speed of generating index factors for scalable NTT while reducing the hardware resources used. It proposes an index generation device and method for scalable NTT.
[0012] To achieve the above objectives, the present invention provides the following technical solution:
[0013] An index generation device for Scalable NTT includes a continuous counter, a stage counting module, and an index parsing module. The continuous counter is connected to the stage counting module and the index parsing module, respectively, and the stage counting module is connected to the index parsing module.
[0014] The continuous counter includes a register, an adder, and a subtractor. One input of the adder is connected to the output of the register, and the other input of the adder is a constant. One input of the subtractor is connected to the output of the adder, and the other input of the subtractor is a constant. The output of the register is connected to the stage counting module.
[0015] Furthermore, the continuous counter consists of W adders, W subtractors, and W registers, where W represents the width of the PE array. The continuous counter outputs W consecutive numbers in each cycle; the stage counting module receives the output values of the W registers and iterates the stage number; the index parsing module concatenates the values of the continuous counter according to the stage number from the stage counting module to generate the index values for each round of the PE array. When NTT index generation is complete, the stage counting module outputs an end signal.
[0016] An index generation method for scalable NTT, implemented using the aforementioned index generation device for scalable NTT, is used for N-point radix-2 DIF_RN NTT on a 2×2 PE array. The index generation process includes the following steps:
[0017] S1. Define the number of operation points of the external input as N. The stage counting module calculates the total number of operation levels s based on the number of operation points of the external input. The calculation formula is:
[0018] s = log₂N - 1;
[0019] Set D to represent the depth of the PE array and W to represent the width of the PE array, and set D=2 and W=2 to complete the initialization;
[0020] S2. Construct an outer loop based on the stage counting module. Set the outer loop variable i to correspond to the i-th level in the NTT operation. The outer loop variable i starts from 0 and iteratively increases with a step size of D. When i is less than s-1, proceed to the next step. When i is equal to s-1, the outer loop ends.
[0021] S3. Construct an inner loop based on a continuous counter. For the i-th layer in the NTT operation of step S2, set the traversal of the butterfly operation group to be controlled by the inner loop variable count.
[0022] During each iteration of the inner loop, a first basic counter Count1 and a second basic counter Count2 are generated based on the inner loop variable count, with the following expression:
[0023] Count1 = count + 0
[0024] Count2 = count + 1;
[0025] When i=0, the index parsing module generates the NTT index based on the first base counter Count1 and the second base counter Count2. The generated expression is:
[0026] Index1_L = Count1 << 1
[0027] Index1_H = (Count1 << 1) + 1
[0028] Index2_L = Count2 << 1
[0029] Index2_H = (Count2 << 1) + 1
[0030] where Index1_L is the first low-order index, Index2_L is the second low-order index, Index1_H is the first high-order index, and Index2_H is the second high-order index;
[0031] When i≠0, the re-bit function is first called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed.
[0032] Shift bits [s-1-i:0] of Count1 left by i+1 bits and add bits [i-1:0] of n1 to generate the first low-order index; then add 1 shifted left by i bits to the first low-order index (that is, set bit i) to obtain the first high-order index.
[0033] Shift bits [s-1-i:0] of Count2 left by i+1 bits and add bits [i-1:0] of n2 to generate the second low-order index; then add 1 shifted left by i bits to the second low-order index (that is, set bit i) to obtain the second high-order index;
[0034] The inner loop variable count starts from 0 and iterates in increments of 2 until it reaches N / 2, then returns to step S2 for the next iteration;
[0035] S4. Based on the first low-order index, first high-order index, second low-order index, and second high-order index obtained in steps S2 and S3, control the addressing of the coefficient memory until the inner and outer loops are all completed, thus completing the generation of the index sequence for the entire N-point NTT operation.
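Steps S1 to S4 can be sketched in Python as follows. This is a software model under stated assumptions, not the patent's hardware or its pseudocode: the function names `bit_reverse` and `dif_rn_indices` are hypothetical, the outer loop is taken to run i from 0 to s-1 in steps of D, log₂N is assumed to be a multiple of D, and the high-order index is taken to be the low-order index with bit i set.

```python
D, W = 2, 2  # depth and width of the PE array (2x2 case only)


def bit_reverse(x, width):
    """Reverse the lowest `width` bits of x (models the re-bit function)."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def dif_rn_indices(N):
    """Yield (i, [(Index1_L, Index1_H), (Index2_L, Index2_H)]) per round."""
    s = N.bit_length() - 2                 # s = log2(N) - 1, N a power of 2
    for i in range(0, s, D):               # outer loop over stages (assumed bound)
        for count in range(0, N // 2, W):  # inner loop over butterfly groups
            group = []
            for c in (count, count + 1):   # Count1 and Count2
                n = bit_reverse(c, s)      # intermediate variable n1 / n2
                kept = c & ((1 << (s - i)) - 1)              # bits [s-1-i:0]
                low = (kept << (i + 1)) + (n & ((1 << i) - 1))
                high = low + (1 << i)      # set bit i
                group.append((low, high))
            yield i, group


for stage, group in list(dif_rn_indices(16))[:2]:
    print(stage, group)
# 0 [(0, 1), (2, 3)]
# 0 [(4, 5), (6, 7)]
```

For i = 0 the mask `(1 << i) - 1` is zero, so the general expression reduces to the shift-by-one case without a separate branch. Under these assumptions every stage visits each of the N coefficient addresses exactly once.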
[0036] An index generation method for scalable NTT, implemented using the aforementioned index generation device for scalable NTT, is used for radix-2 DIT_NR NTT on a 2×2 PE array. The index generation process includes the following steps:
[0037] S1. Define the number of operation points of the external input as N. The stage counting module calculates the total number of operation levels s based on the number of operation points of the external input. The calculation formula is:
[0038] s = log₂N - 1;
[0039] Set D to represent the depth of the PE array and W to represent the width of the PE array, and set D=2 to complete the initialization;
[0040] S2. Construct an outer loop based on the stage counting module. Set the outer loop variable i to correspond to the i-th level in the NTT operation. The outer loop variable i starts from 0 and iteratively increases with a step size of D. When i is less than s-1, proceed to the next step. When i is equal to s-1, the outer loop ends.
[0041] S3. Construct an inner loop based on a continuous counter. For the i-th layer in the NTT operation of step S2, set the traversal of the butterfly operation group to be controlled by the inner loop variable count.
[0042] During each iteration of the inner loop, a first basic counter Count1 and a second basic counter Count2 are generated based on the inner loop variable count, with the following expression:
[0043] Count1 = count + 0
[0044] Count2 = count + 1;
[0045] When i=0, the re-bit function is called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed.
[0046] Assign bits [s-1:0] of n1 to the first low-order index; then add 1 shifted left by s bits to the first low-order index (that is, set bit s) to obtain the first high-order index;
[0047] Assign bits [s-1:0] of n2 to the second low-order index; then add 1 shifted left by s bits to the second low-order index (that is, set bit s) to obtain the second high-order index;
[0048] When i≠0, the re-bit function is called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed.
[0049] Shift bits [i-1:0] of n1 left by s-i+1 bits and add bits [s-1:i] of n1 (bits i through s-1, taken as a number) to generate the first low-order index; then add 1 shifted left by s-i bits to the first low-order index (that is, set bit s-i) to obtain the first high-order index.
[0050] Shift bits [i-1:0] of n2 left by s-i+1 bits and add bits [s-1:i] of n2 (bits i through s-1, taken as a number) to generate the second low-order index; then add 1 shifted left by s-i bits to the second low-order index (that is, set bit s-i) to obtain the second high-order index.
[0051] The inner loop variable count starts from 0 and iteratively increases in steps of W until it reaches N / 2, then returns to step S2 for the next iteration;
[0052] S4. Based on the first low-order index, first high-order index, second low-order index, and second high-order index obtained in steps S2 and S3, control the addressing of the coefficient memory until the inner and outer loops are all completed, thus completing the generation of the index sequence for the entire N-point NTT operation.
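The DIT_NR steps above can be modelled in the same way. The following Python sketch rests on the same assumptions as before: `bit_reverse` and `dit_nr_indices` are hypothetical names, the outer loop runs i from 0 to s-1 in steps of D, log₂N is a multiple of D, and the high-order index is the low-order index with bit s-i set. It is an illustration of the described bit manipulation, not the patent's own pseudocode.

```python
D, W = 2, 2  # depth and width of the PE array (2x2 case only)


def bit_reverse(x, width):
    """Reverse the lowest `width` bits of x (models the re-bit function)."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def dit_nr_indices(N):
    """Yield (i, [(Index1_L, Index1_H), (Index2_L, Index2_H)]) per round."""
    s = N.bit_length() - 2                 # s = log2(N) - 1, N a power of 2
    for i in range(0, s, D):               # outer loop over stages (assumed bound)
        for count in range(0, N // 2, W):  # inner loop over butterfly groups
            group = []
            for c in (count, count + 1):   # Count1 and Count2
                n = bit_reverse(c, s)      # intermediate variable n1 / n2
                upper = (n & ((1 << i) - 1)) << (s - i + 1)  # bits [i-1:0] of n
                low = upper + (n >> i)                       # bits [s-1:i] of n
                high = low + (1 << (s - i))                  # set bit s-i
                group.append((low, high))
            yield i, group


for stage, group in list(dit_nr_indices(16))[:2]:
    print(stage, group)
# 0 [(0, 8), (4, 12)]
# 0 [(2, 10), (6, 14)]
```

With i = 0 the expression reduces to the bit-reversed counter plus an optional top bit, which matches the i = 0 case described separately in the text. Compared with the DIF_RN case, the differing bits start at the most significant positions and move toward the least significant ones as i grows.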
[0053] The beneficial effects of this invention are:
[0054] The present invention discloses an index generation method for scalable NTT. Based on a continuous counting module, it proposes N-point radix-2 NTT index generation algorithms (DIF_RN and DIT_NR) and a hardware structure for a 2×2 PE array, and extends them to W×D PE arrays. The index generation module uses a new NTT coefficient index generation algorithm, which reduces the resources occupied by the index generation module, simplifies the overall control circuit, and reduces the number of computation cycles required.
[0055] The present invention discloses an index generation method for scalable NTT that uses the N-point radix-2 DIT_NR NTT index generation algorithm for a 2×2 PE array. Provided N meets the point requirement of the 2×2 PE array, 4 indexes can be generated within one cycle. These indexes can be parsed to fetch the coefficients required by the PE array from the index storage module. Based on the index characteristics of radix-2 DIT_NR NTT, the algorithm generates two consecutive values using multiple accumulators and then concatenates these two values according to the current NTT operation stage, which considerably reduces resource consumption.
[0056] The present invention provides an index generation method for scalable NTT which, by optimizing existing index generation algorithms that support scalable conflict-free NTT, improves the speed of generating index factors, reduces the hardware resources used, and improves the overall computing speed of scalable NTT.
Attached Figure Description
[0057] Figure 1 is a schematic diagram of the structure of an index generation device for scalable NTT according to the present invention;
[0058] Figure 2 is a schematic diagram of the continuous counter in an index generation device for scalable NTT according to the present invention.
Detailed Implementation
[0059] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only for explaining the invention and are not intended to limit the invention; that is, the described specific embodiments are merely a part of the embodiments of the invention, and not all of them. The components of the specific embodiments of the invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations, and the invention may also have other embodiments.
[0060] Therefore, the following detailed description of specific embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected specific embodiments of the invention. All other specific embodiments obtained by those skilled in the art based on these specific embodiments without inventive effort are within the scope of protection of this invention.
[0061] To further understand the invention's content, features, and effects, the following specific embodiments are provided, along with detailed descriptions in conjunction with the accompanying drawings:
[0062] Example 1:
[0063] An index generation device for Scalable NTT includes a continuous counter, a stage counting module, and an index parsing module. The continuous counter is connected to the stage counting module and the index parsing module, respectively, and the stage counting module is connected to the index parsing module.
[0064] The continuous counter includes a register, an adder, and a subtractor. One input of the adder is connected to the output of the register, and the other input of the adder is a constant. One input of the subtractor is connected to the output of the adder, and the other input of the subtractor is a constant. The output of the register is connected to the stage counting module.
[0065] Furthermore, the continuous counter consists of W adders, W subtractors, and W registers, where W represents the width of the PE array. The continuous counter outputs W consecutive numbers in each cycle; the stage counting module receives the output values of the W registers and iterates the stage number; the index parsing module concatenates the values of the continuous counter according to the stage number from the stage counting module to generate the index values for each round of the PE array. When NTT index generation is complete, the stage counting module outputs an end signal.
[0066] Furthermore, as shown in Figure 1, the index generation device for scalable NTT starts working when its en port receives a 1. Its internal continuous counters, counter_1 to counter_w, output W consecutive numbers. In the s-th stage, the stage port of the special counter module outputs s. When counter_1 to counter_w have generated a total of N / 2 values, the stage port outputs s+1 in the next cycle. Within each stage, the output of the counter_1 port increases from 0 to N / 2 - W. The output of the counter_2 port is 1 greater than that of the counter_1 port, the output of the counter_3 port is 2 greater than that of the counter_1 port, ..., and the output of the counter_w port is W - 1 greater than that of the counter_1 port. The stage port and the counter_1 to counter_w ports of the continuous counter module are connected to the index parsing module. The index parsing module concatenates the values from ports counter_1 to counter_w according to the value of the stage port, generating the set of index values needed by the PE array. The index parsing module generates a total of 2*W index values. When the special counter module outputs the final round number of the last stage, the NTT_flag port outputs 1, indicating that the coefficient indexes for all stages of NTT have been generated. The INTT process is similar.
[0067] Furthermore, Figure 2 shows the structure of the continuous counter in the index generation device for scalable NTT. For an N-point NTT, the register width is log₂N - 1 bits. One input of each adder is the output value of its register and the other input is a constant; the adder sums the two values and its output is log₂N - 1 bits wide. One input of each subtractor is the output of the adder and the other input is the constant 1; it performs an unsigned decrement, so its output is still log₂N - 1 bits wide. Before the operation begins, all registers hold 0. When the operation begins, each register sends its value to its adder, and the W adders output 1, 2, ..., W. Because of the storage layout of the scalable NTT, the required index values start from 0, so 1 is subtracted from each adder output to obtain the values 0 to W-1. These are the values needed to generate the first round of indexes. The output value of the W-th adder is stored in all W registers. After the next rising clock edge, each register outputs W, and after the adders and subtractors the values W to 2W-1 are obtained, again W numbers. Each stage requires further rounds of the continuous counter, each following the same process. In the final round, the numbers N / 2 - W to N / 2 - 1 are generated. The value output from register 1 is sent to the stage jump module, and the continuous counter starts a new round of counting. When the stage jump module recognizes that the value output from register 1 is N / 2 - W, it increments the stage number by D in the next cycle. When the stage number is greater than log₂N, 0 is output, which pulls the Flag signal high, stops the registers from outputting data, and stops the number generation module from working.
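The register, adder, and subtractor behaviour described above can be modelled cycle by cycle in Python. This is a behavioural sketch under assumptions rather than a description of the actual circuit: the W identical registers are modelled by a single variable `reg`, the function name `continuous_counter` is hypothetical, and counting is assumed to stop once the stage number reaches log₂N.

```python
def continuous_counter(N, W=2, D=2):
    """Yield (stage, outputs) for every clock cycle of the counter model."""
    total_stages = N.bit_length() - 1   # log2(N) for N a power of 2
    stage = 0
    reg = 0                             # all W registers hold the same value
    while stage < total_stages:         # assumed stop condition (Flag high)
        adders = [reg + k for k in range(1, W + 1)]  # constants 1..W
        outputs = [a - 1 for a in adders]            # subtract the constant 1
        yield stage, outputs
        if reg == N // 2 - W:           # last round of this stage detected
            reg = 0                     # start a new round of counting
            stage += D                  # stage number advances by D
        else:
            reg = adders[-1]            # W-th adder output feeds the registers


for stage, outputs in continuous_counter(16):
    print(stage, outputs)
# 0 [0, 1]
# 0 [2, 3]
# 0 [4, 5]
# 0 [6, 7]
# 2 [0, 1]
# 2 [2, 3]
# 2 [4, 5]
# 2 [6, 7]
```

Each stage therefore takes N / (2W) cycles in this model, and the only state carried between cycles is the register value and the stage number.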
[0068] Furthermore, the index generation algorithm for scalable NTT provided by this invention supports N-point radix-2 NTT operations and N-point radix-2 INTT operations, and can be applied to different PE arrays.
[0069] According to the properties of binary numbers, if two numbers differ in exactly one bit, bit k, of their binary representations, their difference equals 2 raised to the power of k. The indices of each pair of coefficients participating in a CT or GS butterfly operation in each stage of radix-2 DIT NTT, radix-2 DIF NTT, radix-2 DIT INTT, and radix-2 DIF INTT are spaced apart by a power of 2 in exactly this way, and the index spacing in one stage is twice that of the adjacent stage, in the direction determined by the input order and the stage increment. In scalable NTT, the PE array is composed of two-dimensional PE operation units. Because of the special connections between PE operation units in different columns, a set of coefficients input into the PE array has undergone as many CT or GS butterfly operations as the depth of the PE array by the time it is output. Therefore, for the index to be applicable to the PE array, the index spacing of each pair of coefficients must be a power of 2, and the index spacing between adjacent pairs of coefficients must grow or shrink by powers of 2. From the binary property above, the binary expressions of the indices of each coefficient pair required by the PE array differ in only one bit, and the binary expressions of adjacent index pairs differ from each other in one higher bit position. Assuming the PE array consists of D PE operation units along its depth and W PE operation units along its width, it follows that there are D differing bits in the binary expressions of the parameters required by the PE array.
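The one-bit property of butterfly index pairs can be checked with a short Python sketch. The helper names `butterfly_pairs` and `differing_bits` are hypothetical, and the pairing rule (lower index has bit k clear, upper index has bit k set) is the standard radix-2 butterfly pattern rather than something specific to this patent. Note that a difference of 2^k alone is not sufficient: 3 and 5 differ by 2 yet differ in two bit positions, so the lower index must have bit k clear.

```python
def butterfly_pairs(n_points, k):
    """Index pairs (lo, hi) of a radix-2 stage whose spacing is 2**k."""
    stride = 1 << k
    return [(j, j + stride) for j in range(n_points) if (j & stride) == 0]


def differing_bits(x, y):
    """Bit positions in which x and y differ."""
    diff = x ^ y
    return [b for b in range(diff.bit_length()) if (diff >> b) & 1]


pairs = butterfly_pairs(16, 2)
print(pairs[:3])             # [(0, 4), (1, 5), (2, 6)]
print(differing_bits(0, 4))  # [2]
print(differing_bits(3, 5))  # [1, 2]
```

Because each pair differs only in bit k, an index generator can build both indices of a pair from one shared bit pattern and insert a 0 or a 1 at position k.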
[0070] As described above, the way a binary number changes as it is incremented closely matches the way the differing bits change between adjacent indices required for each calculation of the PE array. We can therefore build this binary counting process into the index generation pattern of the PE array, which greatly simplifies coefficient index generation for scalable NTT. We define the indices of the two parameters required by a PE operation unit as the high-order index and the low-order index, respectively. In stage i, the binary expressions of each pair of high-order and low-order indices differ in bit i_k: bit i_k is 0 in the low-order index and 1 in the high-order index. The PE array requires a set of indices in each round of computation. The binary expressions of these indices differ in D bits, and these D bits move from the leftmost position toward the rightmost, or vice versa, as the NTT stage i increases. The remaining D-1 of these bits take the 2^(D-1) distinct (D-1)-bit binary values, one for each index pair; concatenating bit i_k with these D-1 bits forms the D-bit patterns that distinguish the indices required by the PE array in each round. For example, for radix-2 DIF_RN NTT on a 2×2 PE array, in stage 0 the four indices required in each round differ only in bits 0 and 1 of their binary representation, and in stage 2 they differ only in bits 2 and 3.
[0071] To accommodate different rounds within the same stage, we extend the bit width of the binary number mentioned above. If the NTT point number N is n bits, then the binary number is extended to n-1 bits. Except for the lower D bits, the remaining bits are used to represent different rounds within the same stage.
[0072] Example 2:
[0073] An index generation method for scalable NTT, based on the index generation device for scalable NTT described in Embodiment 1, is used for N-point radix-2 DIF_RN NTT on a 2×2 PE array. The index generation process includes the following steps:
[0074] S1. Define the number of operation points of the external input as N. The stage counting module calculates the total number of operation levels s based on the number of operation points of the external input. The calculation formula is:
[0075] s = log₂N - 1;
[0076] Set D to represent the depth of the PE array and W to represent the width of the PE array, and set D=2 and W=2 to complete the initialization;
[0077] S2. Construct an outer loop based on the stage counting module. Set the outer loop variable i to correspond to the i-th level in the NTT operation. The outer loop variable i starts from 0 and iteratively increases with a step size of D. When i is less than s-1, proceed to the next step. When i is equal to s-1, the outer loop ends.
[0078] S3. Construct an inner loop based on a continuous counter. For the i-th layer in the NTT operation of step S2, set the traversal of the butterfly operation group to be controlled by the inner loop variable count.
[0079] During each iteration of the inner loop, a first basic counter Count1 and a second basic counter Count2 are generated based on the inner loop variable count, with the following expression:
[0080] Count1 = count + 0
[0081] Count2 = count + 1;
[0082] When i=0, the index parsing module generates the NTT index based on the first base counter Count1 and the second base counter Count2. The generated expression is:
[0083] Index1_L = Count1 << 1
[0084] Index1_H = (Count1 << 1) + 1
[0085] Index2_L = Count2 << 1
[0086] Index2_H = (Count2 << 1) + 1
[0087] where Index1_L is the first low-order index, Index2_L is the second low-order index, Index1_H is the first high-order index, and Index2_H is the second high-order index;
[0088] When i≠0, the re-bit function is first called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed.
[0089] Shift bits [s-1-i:0] of Count1 left by i+1 bits and add bits [i-1:0] of n1 to generate the first low-order index; then add 1 shifted left by i bits to the first low-order index (that is, set bit i) to obtain the first high-order index.
[0090] Shift bits [s-1-i:0] of Count2 left by i+1 bits and add bits [i-1:0] of n2 to generate the second low-order index; then add 1 shifted left by i bits to the second low-order index (that is, set bit i) to obtain the second high-order index;
[0091] The inner loop variable count starts from 0 and iterates in increments of 2 until it reaches N / 2, then returns to step S2 for the next iteration;
[0092] S4. Based on the first low-order index, first high-order index, second low-order index, and second high-order index obtained in steps S2 and S3, control the addressing of the coefficient memory until the inner and outer loops are all completed, thus completing the generation of the index sequence for the entire N-point NTT operation.
[0093] The pseudocode for this embodiment is shown in Table 1:
[0094] Table 1:
[0095]
[0096] This embodiment further illustrates 256-point radix-2 DIF_RN NTT on a 2×2 PE array. The pseudocode for the index generation method is shown in Table 2:
[0097] Table 2:
[0098]
[0099] In Tables 1 and 2, s represents the bit width of the two counters count1 and count2, N represents the number of points, D represents the depth of the PE array, and W represents the width of the PE array. count1 and count2 are the binary numbers to be concatenated, i and count are the loop variables, i represents the stage of the NTT operation, and re-bit(counter1) represents the bit reversal operation on counter1.
[0100] Radix-2 DIF of a 2x2 PE array RN The NTT index generation algorithm outputs four values: the PE operation unit in the first row and first column of the PE array, and the index values of the required coefficients corresponding to index1. L With index1 H The index value of the coefficient pair required for the PE operation unit in the second row and first column of the PE array is index2. L With index2 HLines 4 to 23 of the above algorithm represent the NTT stage loop. Each loop increments the PE array depth D. This is because the four parameters input to the PE array complete D stages of CT or GS butterfly operations after the PE array finishes calculation. In line 4, i cycles from 0 to s-1, increasing by D in each loop. Lines 5 to 22 of the algorithm form the second for loop. count1 and count2 represent the binary numbers to be concatenated. Since one binary number can generate one coefficient pair, and the PE array uses a 2x2 array, two binary numbers are needed. Lines 6 and 7 represent the iterative process of the two binary numbers. count2 is always one more than count1. From a binary perspective, comparing the binary number of count2 with count1 shows that only the last bit is different. The PE array calculates... The required index values of the four parameters must have two distinct bits in their binary representation. Therefore, the last bit of the binary representations of count2 and count1 is concatenated with 0 and 1 respectively. This ensures that two distinct bits are present in the binary representations of the four index values generated by the algorithm each time. As the NTT calculation stages increase, the different binary bits of the four index values required for each calculation of the PE array also need to be shifted. Lines 8 to 20 of the algorithm implement this function, where count1[s-1:0] represents taking the lower s-1 bits of count1, and count2[s-1:0] represents taking the lower s-1 bits of count2. 
Lines 8 to 20 of the algorithm are divided into two parts. When i is 0, lines 9 to 13 of the algorithm are executed, where count1[s-1-i:0] denotes bits s-1-i down to 0 of count1, count2[s-1-i:0] denotes bits s-1-i down to 0 of count2, count1[s-i:s-1] denotes bits s-i through s-1 of count1, and count2[s-i:s-1] denotes bits s-i through s-1 of count2. count1 shifted left by one bit is assigned to index1_L; count1 shifted left by one bit and incremented by 1 is assigned to index1_H; count2 shifted left by one bit is assigned to index2_L; and count2 shifted left by one bit and incremented by 1 is assigned to index2_H. Therefore, index1_L and index1_H differ in only one binary bit at any given time, as do index2_L and index2_H. Because the last bits of count1 and count2 differ, the two index pairs differ from each other in only two bits at any given time. When i is not 0, lines 14 to 21 of the above algorithm are executed. A bit reversal operation is first performed on count1 and count2 to obtain n1 and n2. The lower s-i bits of count1 shifted left by i+1 bits, plus bits i-1 down to 0 of n1, is assigned to index1_L; the same value plus 1 shifted left by i bits is assigned to index1_H. The lower s-i bits of count2 shifted left by i+1 bits, plus bits i-1 down to 0 of n2, is assigned to index2_L; the same value plus 1 shifted left by i bits is assigned to index2_H. Because shift operations can be replaced by concatenation in a hardware implementation, lines 9 and 12 of the above algorithm can be understood as concatenating a 1 to the right of count1 or count2, and lines 10 and 11 as concatenating a 0 to the right of count1 or count2, in each case forming an (s+1)-bit binary number. Lines 17 and 19 concatenate a 0 to the right of the lower s-i bits of count1 and then concatenate the high i bits of count1 after bit reversal, and do the same for count2. Lines 18 and 20 concatenate a 1 to the right of the lower s-i bits of count1 and then concatenate the high i bits of count1 after bit reversal, and do the same for count2.
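The DIF_RN index computation described above can be sketched in Python as follows. This is an illustrative sketch rather than the pseudo-code of the algorithm itself: the names `re_bit` and `dif_rn_indices` are hypothetical, the high-order index is assumed to equal the low-order index plus `1 << i`, the outer loop is assumed to visit i = 0, D, 2D, ... up to s-1, and log2(N) is assumed to be even so that every pass of the 2x2 PE array covers two stages.

```python
def re_bit(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value` (hypothetical helper)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def dif_rn_indices(n_points: int, depth: int = 2, width: int = 2):
    """Yield (i, index1_l, index1_h, index2_l, index2_h) for a 2x2 PE array."""
    s = n_points.bit_length() - 2            # s = log2(N) - 1, the counter bit width
    for i in range(0, s, depth):             # assumption: i = 0, D, 2D, ... up to s-1
        for count in range(0, n_points // 2, width):
            quad = []
            for c in (count, count + 1):     # count1 and count2
                if i == 0:
                    low = c << 1             # append a 0 to the right of the counter
                else:
                    n = re_bit(c, s)
                    kept = c & ((1 << (s - i)) - 1)          # count[s-1-i:0]
                    low = (kept << (i + 1)) + (n & ((1 << i) - 1))
                quad += [low, low + (1 << i)]  # assumption: high = low + (1 << i)
            yield (i, *quad)


if __name__ == "__main__":
    for row in dif_rn_indices(16):
        print(row)
```

For N = 16 the first tuple at i = 0 is (0, 0, 1, 2, 3) and the first tuple at i = 2 is (2, 0, 4, 8, 12); within each pass the generated indices cover 0 to N-1 exactly once.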
[0101] Example 3:
[0102] An index generation method for scalable NTT, based on the index generation device for scalable NTT described in Embodiment 1. The radix-2 DIT_NR NTT index generation process for a 2×2 PE array includes the following steps:
[0103] S1. Define the number of operation points of the external input as N. The stage counting module calculates the total number of operation levels s based on the number of operation points of the external input. The calculation formula is:
[0104] s = log2(N) - 1;
[0105] Set D to represent the depth of the PE array and W to represent the width of the PE array, and set D=2 to complete the initialization;
[0106] S2. Construct an outer loop based on the stage counting module. Set the outer loop variable i to correspond to the i-th level in the NTT operation. The outer loop variable i starts from 0 and iteratively increases with a step size of D. When i is less than s-1, proceed to the next step. When i is equal to s-1, the outer loop ends.
[0107] S3. Construct an inner loop based on a continuous counter. For the i-th layer in the NTT operation of step S2, set the traversal of the butterfly operation group to be controlled by the inner loop variable count.
[0108] During each iteration of the inner loop, a first basic counter Count1 and a second basic counter Count2 are generated based on the inner loop variable count, with the following expression:
[0109] Count1 = count + 0
[0110] Count2 = count + 1;
[0111] When i=0, the re-bit function is called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed.
[0112] Assign bits [s-1:0] of n1 to the first low-order index; add 1 shifted left by s bits (1<<s) to the first low-order index to obtain the first high-order index;
[0113] Assign bits [s-1:0] of n2 to the second low-order index; add 1 shifted left by s bits (1<<s) to the second low-order index to obtain the second high-order index;
[0114] When i≠0, the re-bit function is called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed.
[0115] Shift bits [i-1:0] of n1 left by s-i+1 bits and add bits s-1 down to i of n1 to generate the first low-order index; add 1 shifted left by s-i bits to the first low-order index to obtain the first high-order index;
[0116] Shift bits [i-1:0] of n2 left by s-i+1 bits and add bits s-1 down to i of n2 to generate the second low-order index; add 1 shifted left by s-i bits to the second low-order index to obtain the second high-order index;
[0117] The inner loop variable count starts from 0 and iteratively increments in steps of W until it reaches N / 2, and then returns to step S2 for the next iteration;
[0118] S4. Based on the first low-order index, first high-order index, second low-order index, and second high-order index obtained in steps S2 and S3, control the addressing of the coefficient memory until both the inner and outer loops have finished, completing the generation of the index sequence for the entire N-point NTT operation.
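Steps S1 to S4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pseudo-code of Table 3: the names `re_bit` and `dit_nr_indices` are hypothetical, each high-order index is assumed to equal the corresponding low-order index plus `1 << (s - i)`, the outer loop is assumed to visit i = 0, D, 2D, ... up to s-1, and log2(N) is assumed to be even.

```python
def re_bit(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value` (hypothetical helper)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def dit_nr_indices(n_points: int, depth: int = 2, width: int = 2):
    """Yield (i, index1_l, index1_h, index2_l, index2_h) for a 2x2 PE array."""
    s = n_points.bit_length() - 2            # S1: s = log2(N) - 1
    for i in range(0, s, depth):             # S2: outer loop over stages, step D
        for count in range(0, n_points // 2, width):   # S3: inner loop, step W
            quad = []
            for c in (count, count + 1):     # Count1 and Count2
                n = re_bit(c, s)             # n1 or n2
                if i == 0:
                    low = n                  # bits [s-1:0] of n
                else:
                    low = ((n & ((1 << i) - 1)) << (s - i + 1)) + (n >> i)
                quad += [low, low + (1 << (s - i))]   # assumption: high = low + (1 << (s - i))
            yield (i, *quad)                 # S4: the four indices address the coefficient memory


if __name__ == "__main__":
    for row in dit_nr_indices(16):
        print(row)
```

For N = 16 the first tuple at i = 0 is (0, 0, 8, 4, 12), which spans N / 2, and the first tuple at i = 2 is (2, 0, 2, 1, 3), which addresses adjacent coefficients.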
[0119] Furthermore, the pseudo-code of the radix-2 DIT_NR NTT index generation method for the 2x2 PE array is shown in Table 3:
[0120] Table 3
[0121]
[0122] The most essential difference between Example 3 and Example 2 is that it adopts a decimation-in-time (DIT) architecture instead of the decimation-in-frequency (DIF) in Example 2, which directly leads to a complete reversal of the addressing logic and operation flow. Specifically, it is manifested as follows: Example 3 is forced to rely entirely on the re-bit data to construct the index throughout the process, unlike Example 2 which only mixes and uses it at specific stages; at the same time, it introduces the maximum span offset (1<<s, i.e., N / 2) for long-distance data exchange at the initial level (i = 0), which is contrary to the logic of Algorithm 2 that starts processing from locally adjacent data. The unique bit splicing formula of this algorithm is designed specifically for the DIT architecture to ensure conflict-free access to parallel memory banks under the reverse data flow.
[0123] In Example 3, s represents the bit width of the two counters count1 and count2, N represents the number of points (N is 256 in this algorithm), D represents the depth of the PE array, W represents the width of the PE array, count1 and count2 represent the binary numbers to be concatenated, i and count are the loop variables, i represents the stage of the NTT operation, and re-bit(count1) represents the bit reversal operation on count1. Lines 4 and 5 of the algorithm represent the loops: i runs from 0 to s-1, increasing by D in each iteration, and count runs from 0 to N / 2, increasing by W in each iteration. Lines 6 and 7 represent the loop body: in each iteration, count plus 0 is assigned to count1, and count plus 1 is assigned to count2. Lines 8 to 23 are divided into two parts. When i is 0, the first part, lines 8 to 15, is executed. Lines 9 and 10 perform the bit reversal operation on count1 and count2 respectively, giving n1 and n2. Lines 11 to 14 represent the calculation process. Line 11 assigns bits s-1 down to 0 of n1 to index1_l. Line 12 adds 1 shifted left by s bits to bits s-1 down to 0 of n1 and assigns the result to index1_h. Line 13 assigns bits s-1 down to 0 of n2 to index2_l. Line 14 adds 1 shifted left by s bits to bits s-1 down to 0 of n2 and assigns the result to index2_h. When i is not 0, the second part, lines 16 to 23, is executed. Lines 17 and 18 perform the bit reversal operation on count1 and count2 respectively. Lines 19 to 22 represent the calculation process. Line 19 shifts bits i-1 down to 0 of n1 left by s-i+1 bits, adds bits s-1 down to i of n1, and assigns the result to index1_l. Line 20 takes the same value, adds 1 shifted left by s-i bits, and assigns the result to index1_h. Line 21 shifts bits i-1 down to 0 of n2 left by s-i+1 bits, adds bits s-1 down to i of n2, and assigns the result to index2_l. Line 22 takes the same value, adds 1 shifted left by s-i bits, and assigns the result to index2_h.
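As a worked illustration of the bit reversal step and the i = 0 calculation, the following Python sketch traces one inner-loop iteration for N = 256 (s = 7). The helper name `re_bit` is hypothetical, and the high-order index is assumed to be the low-order index plus `1 << s`.

```python
def re_bit(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value` (hypothetical helper)."""
    bits = format(value, f"0{width}b")    # MSB-first binary string, zero padded
    return int(bits[::-1], 2)


s = 7                                     # counter bit width for N = 256
count1, count2 = 2, 3                     # inner-loop iteration with count = 2
n1, n2 = re_bit(count1, s), re_bit(count2, s)   # 0000010 -> 0100000, 0000011 -> 1100000

# Stage i = 0, assuming high = low + (1 << s)
index1_l, index1_h = n1, n1 + (1 << s)
index2_l, index2_h = n2, n2 + (1 << s)
print(index1_l, index1_h, index2_l, index2_h)   # 32 160 96 224
```

The four indices differ only in bits 7 and 6, the two leftmost bits of the 8-bit address, which matches the two stages handled by one pass of the 2x2 PE array.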
[0124] DIT_NR means that the input data is stored in memory in natural order. Because the storage structure uses an in-place (same-address) computation data flow, after the NTT or INTT calculation is completed, the data is stored in memory in bit-reversed order. In DIT_NR, the index-value difference of a coefficient pair changes from stage to stage in the reverse order to that in DIF_RN. The two algorithms are therefore similar in structure, both using multiple binary numbers to construct the index values needed for each PE array calculation, except that in the radix-2 DIT_NR NTT index generation algorithm for a 2x2 PE array, the differing binary bits of each index move from the leftmost end to the rightmost end of the index's binary representation as the NTT stage increases.
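The reversed ordering of the index-value difference can be made concrete with a short Python sketch. The function name `pair_distances` is hypothetical, and the per-stage distances are assumptions derived from the index formulas in this description: `1 << k` at stage k for DIF_RN and `1 << (log2(N) - 1 - k)` at stage k for DIT_NR.

```python
def pair_distances(n_points: int, scheme: str) -> list[int]:
    """Index difference between the two coefficients of a butterfly, per stage."""
    stages = n_points.bit_length() - 1            # log2(N) butterfly stages
    if scheme == "DIF_RN":
        return [1 << k for k in range(stages)]    # differing bit moves right to left
    if scheme == "DIT_NR":
        return [1 << (stages - 1 - k) for k in range(stages)]   # left to right
    raise ValueError(f"unknown scheme: {scheme}")


print(pair_distances(16, "DIF_RN"))   # [1, 2, 4, 8]
print(pair_distances(16, "DIT_NR"))   # [8, 4, 2, 1]
```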
[0125] Lines 11 to 14 of the above algorithm concatenate a 0 or a 1 to the left of n1 and n2 to form the indices needed when i is 0. Lines 19 to 22 of the above algorithm concatenate a 0 or a 1 to the right of the lower i bits of n1 and n2, and then concatenate the higher s-i bits of n1 or n2. In this way, the radix-2 DIT_NR NTT for the 2x2 PE array obtains the differing index bits required at each step by moving the rightmost bit of the index's binary representation to the leftmost position. The principle is similar to that of the radix-2 DIF_RN NTT index generation algorithm for a 2x2 PE array.
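The equivalence between the shift-and-add form and the concatenation form can be checked with a small Python sketch. Both function names are hypothetical, binary strings stand in for hardware wiring, and the inserted bit is assumed to sit at position s-i of the index.

```python
def dit_index_shift(count: int, i: int, s: int, bit: int) -> int:
    """Shift-and-add form of one DIT_NR index (assumed formula)."""
    n = int(format(count, f"0{s}b")[::-1], 2)          # bit reversal over s bits
    return ((n & ((1 << i) - 1)) << (s - i + 1)) + (bit << (s - i)) + (n >> i)


def dit_index_concat(count: int, i: int, s: int, bit: int) -> int:
    """Concatenation form: {lower i bits of n, bit, upper s-i bits of n}."""
    n = format(count, f"0{s}b")[::-1]    # MSB-first string of the bit-reversed counter
    low_i = n[s - i:]                    # lower i bits of n (empty when i is 0)
    high = n[:s - i]                     # upper s-i bits of n
    return int(low_i + str(bit) + high, 2)


if __name__ == "__main__":
    s = 7
    same = all(
        dit_index_shift(c, i, s, b) == dit_index_concat(c, i, s, b)
        for i in range(0, s, 2) for c in range(1 << s) for b in (0, 1)
    )
    print(same)
```

In hardware the concatenation form needs no adder or shifter, only a rearrangement of wires, which is the point made about replacing shift operations by concatenation.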
[0126] For different PE arrays, matching can be achieved by simply increasing the number of counters and indices in the two algorithms described above. The number of counters in both algorithms should be the same as the number of PE operation units in each column of the PE array, while the number of output indices should be twice the number of PE operation units in each column of the PE array. The bit width of the counter is one less than the number of binary bits used in the NTT calculation.
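The sizing rule in the preceding paragraph can be written as a short Python helper. The function name `generator_sizing` and its dictionary keys are hypothetical, and "the number of binary bits used in the NTT calculation" is assumed to mean log2(N).

```python
def generator_sizing(n_points: int, pes_per_column: int) -> dict:
    """Counter and index counts for a PE array with `pes_per_column` units per column."""
    address_bits = n_points.bit_length() - 1      # assumption: log2(N) address bits
    return {
        "counters": pes_per_column,               # one counter per PE unit in a column
        "indices": 2 * pes_per_column,            # one coefficient pair per PE unit
        "counter_bits": address_bits - 1,         # one bit fewer than the address width
    }


print(generator_sizing(256, 2))   # {'counters': 2, 'indices': 4, 'counter_bits': 7}
```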
[0127] It should be noted that relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0128] Although this application has been described above with reference to specific embodiments, various modifications can be made and components can be replaced with equivalents without departing from the scope of this application. In particular, as long as there is no structural conflict, the features in the specific embodiments disclosed in this application can be combined with each other in any way. The lack of an exhaustive description of these combinations in this specification is merely for the sake of brevity and resource conservation. Therefore, this application is not limited to the specific embodiments disclosed herein, but includes all technical solutions falling within the scope of the claims.
Claims
1. An index generation apparatus for scalable NTT, characterized in that it includes a continuous counter, a stage counting module, and an index parsing module. The continuous counter is connected to the stage counting module and the index parsing module, respectively. The stage counting module is connected to the index parsing module. The continuous counter includes a register, an adder, and a subtractor. One input of the adder is connected to the output of the register, and the other input of the adder is a constant. One input of the subtractor is connected to the output of the adder, and the other input of the subtractor is a constant. The output of the register is connected to the stage counting module.
2. The index generation apparatus for scalable NTT according to claim 1, characterized in that the continuous counter consists of W adders, W subtractors, and W registers. The continuous counter outputs W consecutive numbers per cycle. The stage counting module receives each output value from the W registers and iterates over the stage number. The index parsing module concatenates the values of the continuous counter based on the stage number from the stage counting module, generating the index value corresponding to each round of the PE array. After the NTT index generation is complete, the stage counting module outputs an end signal, where W represents the width of the PE array.
3. A method for generating an index for scalable NTT, implemented using the index generation apparatus for scalable NTT according to any one of claims 1-2, characterized in that the N-point radix-2 DIF_RN NTT index generation process for a 2×2 PE array includes the following steps: S1. Define the number of operation points of the external input as N. The stage counting module calculates the total number of operation levels s based on the number of operation points of the external input. The calculation formula is: s = log2(N) - 1; Set D to represent the depth of the PE array and W to represent the width of the PE array, and set D=2 and W=2 to complete the initialization; S2. Construct an outer loop based on the stage counting module. Set the outer loop variable i to correspond to the i-th level in the NTT operation. The outer loop variable i starts from 0 and iteratively increases with a step size of D. When i is less than s-1, proceed to the next step. When i is equal to s-1, the outer loop ends. S3. Construct an inner loop based on the continuous counter. For the i-th layer in the NTT operation of step S2, set the traversal of the butterfly operation group to be controlled by the inner loop variable count. During each iteration of the inner loop, a first basic counter Count1 and a second basic counter Count2 are generated based on the inner loop variable count, with the following expressions: Count1 = count + 0; Count2 = count + 1; When i=0, the index parsing module generates the NTT index based on the first basic counter Count1 and the second basic counter Count2, with the following expressions: Index1_L = Count1 << 1; Index1_H = (Count1 << 1) + 1; Index2_L = Count2 << 1; Index2_H = (Count2 << 1) + 1; where Index1_L is the first low-order index, Index2_L is the second low-order index, Index1_H is the first high-order index, and Index2_H is the second high-order index; When i≠0, the re-bit function is first called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed. Shift bits [s-1-i:0] of Count1 left by i+1 bits and add bits [i-1:0] of n1 to generate the first low-order index; add 1 shifted left by i bits to the first low-order index to obtain the first high-order index. Shift bits [s-1-i:0] of Count2 left by i+1 bits and add bits [i-1:0] of n2 to generate the second low-order index; add 1 shifted left by i bits to the second low-order index to obtain the second high-order index; The inner loop variable count starts from 0 and iterates in increments of 2 until it reaches N / 2, then returns to step S2 for the next iteration; S4. Based on the first low-order index, first high-order index, second low-order index, and second high-order index obtained in steps S2 and S3, control the addressing of the coefficient memory until the inner and outer loops are all completed, thus completing the generation of the index sequence for the entire N-point NTT operation.
4. A method for generating an index for scalable NTT, implemented using the index generation apparatus for scalable NTT according to any one of claims 1-2, characterized in that the radix-2 DIT_NR NTT index generation process for a 2×2 PE array includes the following steps: S1. Define the number of operation points of the external input as N. The stage counting module calculates the total number of operation levels s based on the number of operation points of the external input. The calculation formula is: s = log2(N) - 1; Set D to represent the depth of the PE array and W to represent the width of the PE array, and set D=2 to complete the initialization; S2. Construct an outer loop based on the stage counting module. Set the outer loop variable i to correspond to the i-th level in the NTT operation. The outer loop variable i starts from 0 and iteratively increases with a step size of D. When i is less than s-1, proceed to the next step. When i is equal to s-1, the outer loop ends. S3. Construct an inner loop based on the continuous counter. For the i-th layer in the NTT operation of step S2, set the traversal of the butterfly operation group to be controlled by the inner loop variable count. During each iteration of the inner loop, a first basic counter Count1 and a second basic counter Count2 are generated based on the inner loop variable count, with the following expressions: Count1 = count + 0; Count2 = count + 1; When i=0, the re-bit function is called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed. Assign bits [s-1:0] of n1 to the first low-order index; add 1 shifted left by s bits to the first low-order index to obtain the first high-order index; Assign bits [s-1:0] of n2 to the second low-order index; add 1 shifted left by s bits to the second low-order index to obtain the second high-order index; When i≠0, the re-bit function is called to perform bit reversal on the first basic counter Count1 and the second basic counter Count2 to obtain the first intermediate variable n1 and the second intermediate variable n2, and then the hybrid shift operation is performed. Shift bits [i-1:0] of n1 left by s-i+1 bits and add bits s-1 down to i of n1 to generate the first low-order index; add 1 shifted left by s-i bits to the first low-order index to obtain the first high-order index. Shift bits [i-1:0] of n2 left by s-i+1 bits and add bits s-1 down to i of n2 to generate the second low-order index; add 1 shifted left by s-i bits to the second low-order index to obtain the second high-order index. The inner loop variable count starts from 0 and iteratively increases in steps of W until it reaches N / 2, then returns to step S2 for the next iteration; S4. Based on the first low-order index, first high-order index, second low-order index, and second high-order index obtained in steps S2 and S3, control the addressing of the coefficient memory until the inner and outer loops are all completed, thus completing the generation of the index sequence for the entire N-point NTT operation.