Method and apparatus for large-scale matrix multiplication modulo an integer power of 2
By optimizing block-parallel matrix processing and embedding the modulo constraint in the computation, the method addresses the wasted hardware resources and high computational latency of large-scale matrix multiplication modulo an integer power of 2, thereby achieving more efficient computation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-30
Figure CN122309905A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of cryptography, and in particular to a method and apparatus for large-scale matrix multiplication modulo an integer power of 2.
Background Technology
[0002] With the rapid development of quantum computing technology, traditional public-key cryptography systems theoretically face the risk of being effectively cracked, and the security of existing cryptographic systems is seriously challenged. To address the security threats posed by quantum computers, the field of cryptography has proposed a class of post-quantum cryptographic algorithms based on lattice hard problems. For example, unstructured lattice post-quantum cryptographic algorithms based on the Learning With Errors (LWE) problem and its variants have attracted widespread attention from academia and industry due to their explicit security assumptions and independence from algebraic structures. They have been studied and promoted by international standardization organizations such as the National Institute of Standards and Technology (NIST) and the International Organization for Standardization / International Electrotechnical Commission (ISO / IEC).
[0003] Unstructured lattice post-quantum cryptography algorithms commonly involve integer modular multiplication between large-scale matrices in core steps such as key generation, encryption, and decryption, and the results of these operations typically must be constrained to the integer ring modulo $2^q$. Taking the typical unstructured lattice cryptography algorithm FrodoKEM as an example, its key computational process includes matrix multiplication operations of sizes N×N and N×8, where N = 640 / 976 / 1344, the matrix elements are fixed-point integers, and the result must be truncated modulo $2^{16}$. These matrix multiplications account for a major proportion of the overall algorithm execution time, and their computational efficiency directly determines the throughput and response performance of the cryptographic algorithm.
[0004] In existing hardware implementations, matrix integer modular multiplication typically first computes the full-width matrix result and then truncates it. Taking an FPGA implementation as an example, a DSP multiplier first computes the full-width integer product, and the result is then truncated modulo $2^q$. However, this implementation does not exploit the properties of arithmetic modulo a power of 2: it generates and accumulates a large number of high-order partial products that do not contribute to the final result, resulting in low hardware resource utilization, a large number of addition levels, and large computational latency.
Summary of the Invention
[0005] This application provides a method and apparatus for large-scale matrix multiplication modulo an integer power of 2, in order to solve the problems of low hardware resource utilization, numerous addition levels, and large computational latency.
[0006] In a first aspect, this application provides a method for large-scale matrix multiplication modulo an integer power of 2, comprising: obtaining a first matrix, a second matrix, and a modulus parameter, wherein the modulus parameter is used to determine the target bit width of the modulo operation; dividing the first matrix and the second matrix each into multiple blocks to form multiple block matrix pairs; for each of the block matrix pairs, calculating the modulo multiplication result by performing multiple vector inner product operations, wherein a single vector inner product operation includes: performing a partial product expansion based on the multiple vector element pairs participating in the vector inner product operation, and retaining, according to the target bit width, the effective partial products whose bit weights after expansion are lower than the target bit width; obtaining the modulo result of the single vector inner product based on the effective partial products; collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair; and accumulating the modulo multiplication results of all block matrix pairs to obtain the final modulo multiplication result.
[0007] In a second aspect, this application provides an apparatus for large-scale matrix multiplication modulo an integer power of 2, comprising: a parameter acquisition module, used to acquire the first matrix, the second matrix, and the modulus parameter, wherein the modulus parameter is used to determine the target bit width of the modulo operation; a preprocessing module, used to divide the first matrix and the second matrix into multiple blocks to form multiple block matrix pairs; and a partial product generation module, used to calculate the modulo multiplication result for each of the block matrix pairs by performing multiple vector inner product operations, wherein a single vector inner product operation includes: performing a partial product expansion based on the multiple vector element pairs participating in the vector inner product operation, and retaining, according to the target bit width, the effective partial products whose bit weights after expansion are lower than the target bit width; obtaining the modulo result of the single vector inner product based on the effective partial products; collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair; and accumulating the modulo multiplication results of all block matrix pairs to obtain the final modulo multiplication result.
[0008] As can be seen from the above technical solutions, this application provides a method and apparatus for large-scale matrix multiplication modulo an integer power of 2. The method includes: obtaining a first matrix, a second matrix, and a modulus parameter; dividing the first matrix and the second matrix into multiple blocks to form multiple block matrix pairs; and, for each block matrix pair, calculating the modulo multiplication result by performing multiple vector inner product operations. Each vector inner product operation includes: expanding the multiplication into partial products based on the multiple vector element pairs participating in the operation, and retaining, according to the target bit width, the effective partial products whose bit weights after expansion are lower than the target bit width; obtaining the modulo result of the single vector inner product based on the effective partial products; collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair; and accumulating the modulo multiplication results of all block matrix pairs to obtain the final modulo multiplication result. The method retains only the effective partial products whose bit weights are lower than the target bit width and discards the high-order partial products that do not contribute to the modulo result, so that the subsequent vector inner product operation processes only the effective low-order data. This avoids generating and accumulating invalid high-order bits and reduces hardware resource consumption and the number of addition stages.
Attached Figure Description
[0009] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0010] Figure 1 is a flowchart of the method for large-scale matrix multiplication modulo an integer power of 2 provided in an embodiment of this application;
Figure 2 is a schematic diagram of partial product generation and effective bit window capture provided in an embodiment of this application;
Figure 3 is a schematic diagram of the basic structure of CSA addition provided in an embodiment of this application;
Figure 4 is a schematic diagram of the CSA compression tree structure provided in an embodiment of this application;
Figure 5 is a schematic diagram of partial product group summation and the CSA addition tree constructed with sign-bit integration provided in an embodiment of this application;
Figure 6 is a schematic diagram of the overall scheduling of matrix multiplication provided in an embodiment of this application;
Figure 7 is a schematic diagram of the overall circuit structure provided in an embodiment of this application.
Detailed Implementation
[0011] The embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following examples do not represent all embodiments consistent with this application.
[0012] This application provides a method for large-scale matrix multiplication modulo an integer power of 2. As shown in Figure 1, the method includes the following steps S100-S300.
[0013] S100: Obtain the first matrix, the second matrix, and the modulus parameter, wherein the modulus parameter is used to determine the target bit width of the modulo operation.
[0014] The first and second matrices are the two input matrices involved in the multiplication operation, where the elements are signed integers represented using two's complement.
[0015] In unstructured lattice post-quantum cryptography algorithms, the first and second matrices can be generated by a random number generator or are publicly defined parameters agreed upon by the algorithm. The first matrix can have dimensions such as 640×640, and the bit width of each element can be set according to algorithm requirements, such as 16 bits. In hardware implementation, the first and second matrices can be stored in on-chip memory or input row by row via an external interface.
[0016] In this embodiment, for ease of explanation, the first matrix is defined as matrix A and the second matrix is defined as matrix B, where the dimension of matrix A is X×N and the dimension of matrix B is N×Y.
[0017] The modulus parameter q specifies the exponent of the power-of-2 modulus, that is, the modulus is $2^q$, and thereby determines the effective binary bit width of the multiplication result. The modulus parameter can be configured through hardware registers or provided as an immediate value in an instruction, and its value range depends on the algorithm requirements. For example, in FrodoKEM, q is 16.
[0018] In this embodiment, the modulus parameter is converted into a target bit width Q, used to control the truncation depth of the partial product and the bit width constraint of the intermediate data, where Q = q. The elements of matrices A and B are all signed integers represented in two's complement, with bit widths of m and n respectively, where m and n can be different. For ease of hardware implementation, this embodiment assumes that the element bit widths of A and B are both 16 bits, and are the same as the target bit width Q. However, in actual implementation, the element bit width can be greater than, equal to, or less than Q, and this embodiment still applies.
[0019] The i-th bit of the k-th element in a row of matrix A is denoted ... (0≤k≤N-1, 0≤i≤m-1), and the j-th bit of the l-th element in a column of matrix B is denoted ... (0≤l≤N-1, 0≤j≤n-1).
[0020] S200: Divide the first matrix and the second matrix into multiple blocks to form multiple block matrix pairs.
[0021] A block matrix pair consists of two sub-matrices split from corresponding positions in the first matrix and the second matrix. The pair is fed into the block matrix multiplication unit. The product of a block matrix pair is the complete result matrix obtained by multiplying the block of the first matrix by the block of the second matrix. Each element of this result matrix is obtained through the subsequent vector inner product operations.
[0022] For example, the first matrix is divided into multiple first blocks, and the second matrix is divided into multiple second blocks. In this embodiment, the block size is 4×4, that is, each block contains 4 rows and 4 columns of elements. The choice of block size depends on the number of computing units, storage depth and parallelism requirements of the target hardware platform.
[0023] The block partitioning process is executed by the top-level controller. The controller calculates the total number of blocks based on the total number of rows and columns of the matrix, and assigns a unique identifier to each block. The blocks of matrix A are denoted $A_{ij}$ and the blocks of matrix B are denoted $B_{jk}$, where i, j, and k are block indices. $A_{ij}$ and $B_{jk}$ form a block matrix pair, and the result of their multiplication is denoted $C_{ik}$. The final result matrix C is obtained by accumulating the $C_{ik}$ of all block matrix pairs.
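The block scheduling described in S200 can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `block_matmul_mod` and the parameter `bs` (block size) are hypothetical, matrices are plain Python lists of signed integers, and the loops run sequentially, whereas the embodiment schedules hardware units in parallel.

```python
def block_matmul_mod(A, B, q, bs=4):
    """Blocked product of A (X x N) and B (N x Y), reduced modulo 2**q.

    Hypothetical software model: each (i0, j0, k0) triple selects one
    block matrix pair (A_ij, B_jk) whose product is accumulated into C_ik.
    """
    X, N, Y = len(A), len(B), len(B[0])
    mask = (1 << q) - 1                      # keeps only the low q bits
    C = [[0] * Y for _ in range(X)]
    for i0 in range(0, X, bs):               # block row index i
        for k0 in range(0, Y, bs):           # block column index k
            for j0 in range(0, N, bs):       # shared block index j
                for r in range(i0, min(i0 + bs, X)):
                    for c in range(k0, min(k0 + bs, Y)):
                        dot = 0              # one vector inner product
                        for t in range(j0, min(j0 + bs, N)):
                            dot = (dot + A[r][t] * B[t][c]) & mask
                        # accumulate the block result at its position
                        C[r][c] = (C[r][c] + dot) & mask
    return C
```

Because every intermediate value is masked to q bits, the model never carries data above the target bit width, and the result still equals the full-width product reduced modulo $2^q$.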
[0024] S300: For each of the block matrix pairs, the modulo multiplication result is calculated by performing multiple vector inner product operations. Each single vector inner product operation includes: expanding the partial product based on the multiple vector element pairs participating in the vector inner product operation, and retaining the effective partial product whose bit weight is lower than the target bit width after expansion; obtaining the modulo result of the single vector inner product based on the effective partial product; aggregating the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair; and summing the modulo multiplication results of all block matrix pairs to obtain the final modulo multiplication result.
[0025] For each pair of partitioned matrices, multiple vector dot product operations are performed. The vector dot product operation involves multiplying a row vector from the first matrix partition with a column vector from the second matrix partition and then summing the results. In this embodiment, the vector dot product operation is the smallest computational unit constituting the partitioned matrix multiplication. Each vector dot product operation receives several pairs of elements from the two partitions and outputs a modulo scalar result, which corresponds to an element in the result matrix.
[0026] By scheduling multiple vector inner product operation units in parallel, each element in the block result matrix is calculated. Within each vector inner product operation, the multiple pairs of matrix elements involved in the multiplication are expanded into partial products. Exploiting the properties of reduction modulo a power of 2, all invalid high-order bits are truncated according to the target bit width at the moment of partial product generation, and only the valid partial products with bit weights lower than the target bit width are retained.
[0027] These effective partial products are then compressed and summed, and combined into the modulo result of a single vector inner product. The collection of all inner product results of a single block matrix pair constitutes the multiplication result of that block pair. Finally, the multiplication results of all block pairs are accumulated according to their corresponding positions to complete the modular multiplication of the entire large matrix.
[0028] The method for large-scale matrix multiplication modulo an integer power of 2 provided in this embodiment solves the technical problems of serious resource waste, a large number of addition levels, and large operation latency in existing hardware implementations by combining matrix partitioning with fine-grained optimization at the vector inner product level.
[0029] In traditional schemes, each scalar multiplication in matrix multiplication requires generating a product with a full bit width. The higher-order partial products are discarded during subsequent modulo operations, resulting in significant computational and storage resources being used to generate unused data. This embodiment, at the most fundamental step of vector inner product operation—the partial product expansion—actively identifies and discards all partial products with bit weights equal to or exceeding the target bit width, based on the modulus parameter. This eliminates the generation of invalid higher-order data at the source of the computation process, allowing subsequent addition compression networks to process only valid partial products with strictly constrained bit widths. Therefore, the hardware logic for generating, transmitting, and accumulating invalid higher-order data is completely eliminated, directly reducing circuit area and dynamic power consumption.
[0030] In traditional schemes, when adding multiple partial products with different weights, it is necessary to pad the shorter bit-width vector with zeros to extend it to the longest bit-width. This operation introduces a large number of redundant addition units and unnecessary signal flipping. This embodiment categorizes all valid partial products according to their weight shift amounts, ensuring that all vectors within the same group have the same bit width, thus eliminating the need for zero-padding during intra-group compression. Furthermore, weight alignment is used between groups for accumulation, avoiding additional logic overhead caused by bit-width differences. This structured compression strategy significantly reduces the number of adder stages and interconnect complexity.
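The weight-grouping idea can be sketched in software as follows. This is a simplified illustration, not the circuit of the embodiment: the function name `inner_product_grouped` is hypothetical, and the multiplier is first reduced to its unsigned q-bit residue so that every bit has a positive weight, which sidesteps the sign-bit path the text describes elsewhere.

```python
def inner_product_grouped(a_vec, b_vec, q):
    """Inner product modulo 2**q with partial products grouped by shift j.

    Assumption for illustration: each multiplier b is replaced by
    b mod 2**q, so all of its bits carry positive weight.
    """
    mask = (1 << q) - 1
    groups = {j: [] for j in range(q)}       # one group per weight shift
    for a, b in zip(a_vec, b_vec):
        b_u = b & mask
        for j in range(q):
            if (b_u >> j) & 1:
                # keep only the low q-j bits: same width inside group j
                groups[j].append(a & ((1 << (q - j)) - 1))
    total = 0
    for j, terms in groups.items():
        width_mask = (1 << (q - j)) - 1
        group_sum = sum(terms) & width_mask  # intra-group sum, no padding
        total = (total + (group_sum << j)) & mask   # align by weight
    return total
```

Within group j every vector is q-j bits wide, so no zero-padding is needed before the intra-group sum; alignment by the shift j happens only once per group.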
[0031] In traditional schemes, the modulo operation is placed at the very end of the multiplication process, so the invalid high-order bits of the partial products run through the entire addition tree, wasting resources. This embodiment embeds the modulo constraint within the compression process of each carry-save adder (CSA). Any carry signal whose bit weight reaches or exceeds the target bit width after its left shift is cut off at the hardware connection level and is not propagated to the next level. This mechanism keeps the bit width of the entire compression network within the target bit width and discards invalid parts of the partial products promptly, thus reducing resource waste. Furthermore, since the tree topology of the compression network forms a feed-forward cutset, it meets the conditions for pipeline insertion, and pipeline registers can be inserted to shorten the critical path. This significantly reduces the logic depth of each addition stage and increases the circuit's maximum operating frequency.
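A bit-level software model may make the embedded modulo constraint concrete. The sketch below assumes a standard 3:2 carry-save adder; the names `csa` and `csa_tree_sum` are hypothetical, and the final two-operand addition is done with ordinary integer addition rather than a modeled carry-propagate adder.

```python
def csa(x, y, z, q):
    """3:2 carry-save adder on q-bit words with the carry cut at bit q."""
    mask = (1 << q) - 1
    s = (x ^ y ^ z) & mask                            # bitwise sum word
    # the carry word is shifted left by one; the bit that would land at
    # weight 2**q is dropped instead of being passed to the next level
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return s, c


def csa_tree_sum(operands, q):
    """Reduce any number of operands modulo 2**q with truncated CSA levels."""
    mask = (1 << q) - 1
    ops = [v & mask for v in operands]
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):           # compress triples
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2], q))
        rem = len(ops) % 3
        if rem:                                       # pass leftovers through
            nxt.extend(ops[-rem:])
        ops = nxt
    return sum(ops) & mask                            # final addition
```

Every level keeps exactly q bits per word, which mirrors the statement that the width of the compression network stays locked to the target bit width.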
[0032] This embodiment also employs a matrix partitioning strategy to decompose large-scale matrix multiplication into several independently schedulable block matrix pairs. This allows vector inner product operation units to be reused with fixed bit width and size, eliminating the need to redesign hardware for matrices of different sizes. The combination of partitioning and accumulation enables this embodiment to seamlessly adapt to matrix multiplication of arbitrary dimensions, while providing a natural scheduling unit for pipelined parallelism, further improving overall throughput.
[0033] In summary, the technical solution described in this embodiment restructures matrix multiplication modulo a power of 2 through a series of interconnected steps: block division, partial product truncation, weight grouping and compression, and embedded modulo operation. In hardware implementation, this translates to smaller chip area, lower power consumption, shorter latency, and higher clock frequency, providing a practical solution for deploying post-quantum cryptography algorithms on resource-constrained platforms.
[0034] The following provides a detailed description of vector dot product operations. In some embodiments, when expanding the partial product of multiple vector elements participating in the vector dot product operation, the process includes: extracting the sign bit and value bit of the vector elements; generating a partial product of value bits corresponding to the value bits of the multiplier based on the value bits; generating a partial product of sign bits corresponding to the sign bits of the multiplier based on the sign bits; and expanding the partial product of multiplication based on the partial product of value bits and the partial product of sign bits.
[0035] By extracting the sign bit and the value bit and generating partial products of the value bits and the sign bit respectively, the two's complement multiplication is transformed into unsigned numerical operations, simplifying the multiplication kernel and uniformly supporting the processing of positive and negative numbers.
[0036] Specifically, vector elements are individual values in a matrix that participate in the vector dot product operation. Each pair of vector elements consists of a multiplicand and a multiplier, both represented in two's complement form. To perform correct two's complement multiplication, the multiplier and multiplicand are determined based on the sign bits of the two vector elements. If the two sign bits are opposite, the negative number is used as the multiplier and the positive number as the multiplicand; if the two sign bits are the same, there are no special requirements for the allocation of the multiplier and multiplicand.
[0037] The sign bit is the highest-order binary bit in two's complement representation and indicates whether a number is positive or negative. A sign bit of 0 indicates that the number is positive or zero, and a sign bit of 1 indicates that the number is negative. The value bits are the remaining lower-order bits in two's complement representation, excluding the sign bit, and carry the positive-weight part of the number's value.
[0038] In the vector dot product operation unit, whenever a pair of vector elements (denoted as a and b) is received, the sign bit and value bits of the pair are first extracted. The sign bits, denoted $s_a$ and $s_b$, are each taken from the highest bit of the two's complement representation; the value bits, denoted $a_{val}$ and $b_{val}$, are taken from the remaining low-order bits below the most significant bit.
[0039] During the expansion process, the sign bit and the value bit are extracted first. For each pair of vector elements participating in the vector inner product operation, the highest bit in the two's complement representation of the pair of elements is taken as the sign bit, and the remaining low-order bit string is taken as the value bit. After the sign bit and the value bit are separated, the sign bit is independently reserved for subsequent sign processing path, while the value bit is sent to the unsigned value operation path.
[0040] This embodiment assumes that the vector element with the smaller bit width is used as the multiplier to reduce the number of partial products. Here, let the multiplier be b and the multiplicand be a, with bit widths of n and m respectively, where n ≤ m. Their values can be expanded according to the definition of two's complement as follows: $a = -a_{m-1} \cdot 2^{m-1} + \sum_{i=0}^{m-2} a_i \cdot 2^i$; $b = -b_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} b_j \cdot 2^j$, where $a_i$ and $b_j$ denote the i-th bit of a and the j-th bit of b. Based on the above expansion, this application decomposes the multiplication a×b into the sum of n partial product terms: the value of the multiplier b is the sum of a negative-weight most significant bit and positive-weight lower bits, and each partial product is generated by one bit of the multiplier.
[0041] The value-bit partial products correspond to the lower n-1 bits (j = 0, 1, ..., n-2) of the multiplier b. Each such bit $b_j$ has a positive weight, and the raw partial product it generates is the logical AND of the multiplicand a with $b_j$, that is, $b_j$ is ANDed bitwise with every bit of a. Each raw partial product must then be shifted left by j bits according to its multiplier bit position j, to reflect the actual weight of that bit in the multiplication. After the shift, the partial product has the correct multiplication weight and is the partial product corresponding to that value bit of the multiplier.
[0042] The carry compensation signal does not participate in the generation of the current partial product set, but is retained independently. How it is integrated is explained later, in the parallel compression of the partial product set. In some embodiments, generating the sign-bit partial product corresponding to the multiplier sign bit includes: in response to the sign bit representing a negative number, performing a bitwise NOT operation on the value bits to obtain the inverted value bits; defining the arithmetic compensation amount corresponding to the bitwise NOT operation as a carry compensation signal with a predetermined weight; constructing the partial product from the inverted value bits; and supplying the carry compensation signal as an implicit carry input to the carry-save adder that compresses the partial product set with the corresponding weight, to generate the sign-bit partial product.
[0043] By separating the inversion of the sign-bit partial product from its arithmetic compensation, turning the latter into a carry compensation signal, and injecting that signal into the carry-save adder, signed multiplication compensation can be completed without an additional adder, thus reducing hardware overhead.
[0044] For the highest bit of the multiplier b, i.e., the sign bit $s_b$ (j = n-1), the weight is negative ($-2^{n-1}$). If $s_b = 0$, the partial product corresponding to that bit is the zero vector. If $s_b = 1$, the contribution of that bit is $-a \cdot 2^{n-1}$. In two's complement arithmetic, -a is equivalent to inverting a bit by bit and adding 1. In this embodiment, a distributed processing method is used to generate the sign-bit partial product.
[0045] When the sign bit $s_b$ represents a negative number, a bitwise inversion is first performed on the value bits of the multiplicand a to obtain the inverted value bits ~a. This inverted result is used as the base partial product of the sign-bit partial product, and its weight shift is n-1.
[0046] Meanwhile, the arithmetic compensation "+1" corresponding to the bitwise NOT operation is extracted independently and defined as a single-bit carry compensation signal. The weight of the carry compensation signal is set to correspond to the weight shift of the sign bit, that is, it has the same predetermined weight $2^{n-1}$ as the sign bit. When the sign bit $s_b$ represents a positive number, the base partial product is a zero vector, and the carry compensation signal is also zero. Therefore, regardless of the value of $s_b$, the sign-bit partial product can be computed uniformly as $s_b$ & (~a).
[0047] The carry compensation signal is not included in the partial product set along with the base partial product. Instead, it is separated and configured as an implicit carry input in the subsequent compression stage, and supplied to the carry-save adder that compresses the partial product set at the corresponding weight level (weight shift amount n-1).
[0048] The carry compensation signal is directly connected to the carry input of the carry-save adder. Regardless of whether the sign bit is 0 or 1, arithmetic compensation can be completed through this path. Through the above separation, the generation of the sign-bit partial product is distributed over two parallel paths: the base partial product enters the partial product set to participate in compression, while the carry compensation signal is injected directly into the carry input port of the compression tree. Together, the two paths realize the sign-bit partial product corresponding to the multiplier sign bit.
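The two-path treatment of the sign bit can be checked with a small arithmetic model. The sketch below is an assumption-laden illustration: the function name `signed_mul_mod` is hypothetical, the multiplicand is sign-extended to q bits before inversion (a modeling convenience), and the carry compensation is simply added at the end instead of being wired into a carry-save adder input.

```python
def signed_mul_mod(a, b, n, q):
    """a * b modulo 2**q for an n-bit two's-complement multiplier b."""
    mask = (1 << q) - 1
    a_q = a & mask                        # multiplicand as a q-bit word
    b_bits = b & ((1 << n) - 1)           # n-bit pattern of the multiplier
    s_b = (b_bits >> (n - 1)) & 1         # multiplier sign bit
    partials = []
    for j in range(n - 1):                # value bits: positive weights
        b_j = (b_bits >> j) & 1
        partials.append(((a_q if b_j else 0) << j) & mask)
    # sign bit: base partial product s_b & (~a), shifted by n-1
    inverted = (~a_q) & mask
    partials.append(((inverted if s_b else 0) << (n - 1)) & mask)
    # the "+1" of the negation, kept as a separate signal of weight 2**(n-1)
    carry_comp = (s_b << (n - 1)) & mask
    return (sum(partials) + carry_comp) & mask
```

For a = -3, b = -5, n = 4 and q = 8, the value-bit partial products sum to 503, the inverted base partial product contributes 16, the carry compensation contributes 8, and the masked total is 15, which matches (-3)·(-5).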
[0049] In some embodiments, retaining the effective partial products whose bit weights after expansion are lower than the target bit width specifically includes the following steps: determining the weight shift amount based on the multiplier bit position associated with the expanded partial product; if the weight shift amount is greater than or equal to the target bit width, discarding the partial product; if the weight shift amount is less than the target bit width, extracting the least significant bits of the partial product and keeping them, wherein the number of extracted bits is equal to the target bit width minus the weight shift amount; and if the target bit width is greater than or equal to the multiplicand bit width, filling the missing high bits of the partial product with the logical AND of the multiplicand sign bit and the corresponding multiplier bit, to obtain an effective partial product with a bit width equal to the target bit width.
[0050] By discarding, truncating, and filling in the sign bit according to the weight shift amount, the effective bit window is truncated during the partial product generation stage, eliminating invalid high-order partial products from the source and reducing the amount of subsequent calculations.
[0051] The final result must be reduced modulo $2^q$, that is, only q bits of the final result need to be retained, and every data bit with a weight greater than or equal to $2^q$ is invalid. This embodiment therefore performs an immediate truncation based on the target bit width q during the partial product generation stage. For any partial product generated from the j-th bit of the multiplier and aligned by shifting it left by j bits, the weight shift amount is j.
[0052] If the weight shift amount j is greater than or equal to the target bit width q, then the least significant bit of this partial product already has a weight of at least $2^q$, and none of its bits contributes to the result modulo $2^q$. This partial product is therefore discarded directly in this embodiment. It does not consume any hardware resources and does not participate in any subsequent calculation.
[0053] If the weight shift amount j is less than the target bit width q, the lower bits of the partial product may contribute to the modulo result. In this case, the lowest q-j bits are extracted from the partial product and retained as the effective partial product. The number of extracted bits is exactly equal to the target bit width minus the weight shift amount, and the weights of these bits in the final result range from $2^j$ up to $2^{q-1}$.
[0054] When the target bit width q is greater than or equal to the multiplicand bit width m, for the partial products that satisfy the weight shift amount j ≤ q-m, the high-order bits of the q-j bits retained after truncation contain gaps due to insufficient multiplicand bit width.
[0055] This embodiment employs a sign bit padding strategy, that is, filling the missing high-order bits with the sign bit s of the multiplicand. a With corresponding multiplier digits Logic and Results This padding operation expands the higher bits of the partial product according to the sign characteristics of the multiplicand, thereby maintaining the numerical correctness of signed number multiplication in the sense of modular arithmetic.
[0056] Taking a target bit width q = 8, a multiplicand bit width m = 6, and a multiplier bit width n = 5 as an example, Figure 2 As shown. When the weight shift j ≥ 7, the partial product is completely discarded. When j = 0 to 6, the lower 8-j bits are truncated from the corresponding partial product. For the partial product where j = 0 to 2, since 8-j > 6, there is a gap in the higher bits after truncating. The logic AND result fills the missing high bits, expanding the bit width of the effective part product to 8 bits.
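By way of illustration only, the discard, truncate, and fill rules of paragraphs [0052] to [0055] can be checked numerically with the following Python sketch. It is not the circuit of this application: it assumes an m-bit two's-complement multiplicand, treats every multiplier bit as an unsigned numerical bit (the separate sign bit partial product and its carry compensation signal are not modelled), and the function names are hypothetical.

```python
def effective_window(a_bits, m, b_j, j, q):
    """Return the q-j low bits kept for multiplier bit j, or None if discarded.

    a_bits: m-bit two's-complement pattern of the multiplicand (0 <= a_bits < 2**m).
    b_j:    value (0 or 1) of multiplier bit j.
    """
    if j >= q:                        # weight shift >= target width: discard
        return None
    keep = q - j                      # number of bits that can still matter
    pp = a_bits if b_j else 0         # multiplicand ANDed with the multiplier bit
    if keep <= m:                     # window no wider than the multiplicand
        return pp & ((1 << keep) - 1)
    sign = (a_bits >> (m - 1)) & 1    # multiplicand sign bit s_a
    fill = sign & b_j                 # logical AND used to fill the gap
    high = ((1 << (keep - m)) - 1) << m if fill else 0
    return pp | high


def product_mod_pow2(a_bits, m, b, n, q):
    """Add the retained windows at their weights and keep the low q bits."""
    total = 0
    for j in range(n):
        window = effective_window(a_bits, m, (b >> j) & 1, j, q)
        if window is not None:
            total += window << j
    return total & ((1 << q) - 1)


def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    return bits - (1 << width) if bits >> (width - 1) else bits


# q = 8, m = 6, n = 5; multiplicand -3 (0b111101) and multiplier 21 are arbitrary
print(product_mod_pow2(0b111101, 6, 21, 5, 8))
```

Under these assumptions the summed windows agree with the signed product reduced modulo 2^q, which is the property the padding step is meant to preserve.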
[0057] After the above truncation and padding operations, all retained valid partial products have a uniform bit width q, and the weights have been aligned to their corresponding positions. In this embodiment, the j-th valid partial product vector of the k-th pair of vector element multiplications in the vector inner product is denoted as... This identifier is used for indexing and scheduling of the partial product in subsequent weighted grouping and parallel compression phases.
[0058] Through the above steps, this embodiment achieves real-time truncation based on the target bit width while expanding the partial product of multiplication, eliminating the generation of invalid high-bit partial products from the data source and laying the foundation for subsequent weighted grouping and parallel compression. The independent separation and subsequent injection of the carry compensation signal in the sign bit partial product enables the sign bit processing of positive and negative number multiplication to be uniformly implemented under the same hardware architecture, eliminating the need to add an additional adder for negative number multiplication.
[0059] In some embodiments, the method includes: obtaining the position of the multiplier binary bit associated with each partial product in the effective partial products; grouping effective partial products with the same binary weight into the same set according to the binary weight represented by the position of the multiplier binary bit, forming multiple sets of partial products; performing multi-level parallel compression on the effective partial products in each set of partial products to obtain a compression result; and adding the compression result after shifting and aligning it according to the corresponding binary weight to obtain the modulo result of the single vector inner product.
[0060] By grouping the effective partial products of the same binary weights into the same set and performing multi-level parallel compression and shift-aligned addition, zero-padding operations are eliminated, the number of addition stages is reduced, and the parallelism is improved.
[0061] In this embodiment, an effective partial product is the intermediate binary vector obtained in the previous stage after truncation and padding; it has a uniform bit width equal to the target bit width q and an aligned weight. Here k is the index of the vector element pair, and j is the multiplier bit position associated with the partial product. Each effective partial product is associated with the position j of the multiplier bit that generated it. The binary weight represented by that multiplier bit position is the actual weight of the bit in the multiplication expansion, 2^j.
[0062] First, a weight classification operation is performed. A single vector inner product involves N vector element multiplications, each of which produces several effective partial products. This embodiment groups all effective partial products with the same multiplier bit position j into one set, denoted as the partial product set G_j.
[0063] This classification operation allows subsequent compression processes to be processed in parallel within the same weight level, eliminating the need for low-bit zero padding.
[0064] Then, multi-level parallel compression is performed on the effective partial products in each partial product set to obtain the compression result, specifically including the following steps: the partial product set is divided into groups of a preset number of vectors to obtain multiple groups of effective partial products; each group of effective partial products is fed into a compression network and compressed by a carry-preserving adder, which outputs a sum vector and a carry vector, the compression network including multiple carry-preserving adders; the carry vector is shifted left by one bit before participating in the next compression stage; in each compression stage, the portion of the left-shifted carry vector whose bit weight reaches or exceeds the target bit width is configured as invalid on the transmission path; all output sum vectors and carry vectors, together with the ungrouped remaining vectors, serve as the input vectors of the next compression stage; grouping and compression are repeated until two output vectors remain; the two output vectors are summed using a carry-pass adder to obtain a summation result; and the bit width of the summation result is restricted to the target bit width to obtain the compression result of the partial product set.
[0065] By grouping by a preset number, compressing with a carry-preserving adder, shifting the carry left, and configuring invalid carry, multi-level parallel compression with embedded modulo is achieved, reducing hardware resource consumption and shortening the critical path.
[0066] This embodiment uses a tree-structured compression network composed of carry-save adders (CSAs), referred to in this application as carry-preserving adders, to compress each partial product set G_j in parallel. A carry-preserving adder is a three-input, two-output adder unit. Its inputs are three binary vectors of the same bit width, and its outputs are a sum vector and a carry vector. The core characteristic of this adder is that the sum of the sum vector and the carry vector equals the sum of the three input vectors, where the carry vector must be shifted left by one bit before participating in the next stage of operation so that it matches its actual weight.
[0067] A single partial product set G_j contains N effective partial product vectors with a bit width of q. First, the set is divided into multiple vector groups of a preset number of vectors. In this embodiment, the preset number is three, that is, three vectors form one group. If N is an integer multiple of 3 (N = 3n), all vectors are divided into n groups of three vectors each. If N = 3n + 1, the first 3n vectors are divided into n groups, and the remaining 1 vector is treated as an ungrouped vector and goes directly to the next level. If N = 3n + 2, the first 3n vectors are divided into n groups, and the remaining 2 vectors are treated as ungrouped vectors and go directly to the next level.
[0068] Accordingly, the CSA basic unit shown in Figure 3 is constructed. It performs the addition of three addends: through its internal full adders it converts the addition of three numbers into the addition of two numbers, and it eliminates the significant delay of serial carry propagation. Figure 3 takes the addition of three 4-bit numbers as an example to explain how the CSA reduces the number of addends and the delay, laying the foundation for the addition compression tree introduced later, in which the number of addends decreases layer by layer from 3N+i to 2N+i.
[0069] For each group of three valid partial product vectors, they are fed into a carry-preserving adder. This adder performs full addition on each bit of the three vectors in parallel, producing the sum bit and carry bit for that bit. The sum bits of all bits form the sum vector, and the carry bits of all bits form the carry vector.
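As a minimal sketch of the three-input, two-output unit described above, the following Python fragment models each vector as an integer and applies a full adder to every bit position at once. The function name and the three 4-bit operands are arbitrary illustrations and are not taken from Figure 3.

```python
def csa(x, y, z):
    """3:2 compressor: one full adder per bit position, no carry chain between positions."""
    s = x ^ y ^ z                      # sum bit of each position
    c = (x & y) | (x & z) | (y & z)    # carry bit of each position (majority function)
    return s, c


s, c = csa(0b1011, 0b0110, 0b1101)
# The carry vector carries twice the weight of the sum vector,
# hence the one-bit left shift before the two are combined.
assert s + (c << 1) == 0b1011 + 0b0110 + 0b1101
```

Because no bit position waits for a carry from its neighbour, the delay of this unit is that of a single full adder regardless of the vector width.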
[0070] After the first compression stage, the output for each partial product set G_j contains the generated sum vectors and carry vectors and possibly some ungrouped vectors. These vectors are then used as the input vectors of the next compression stage, and the grouping and compression operations are repeated until the number of input vectors is reduced to two.
[0071] In some embodiments, configuring as invalid on the transmission path, in each compression stage, the portion of the left-shifted carry vector whose bit weight reaches or exceeds the target bit width includes: in each compression stage, when the carry-preserving adder outputs a carry vector, determining, based on the target bit width, the bit weights of the carry vector after it is shifted left by one bit; passing the portion whose bit weight is lower than the target bit width to the next compression stage; and configuring the portion whose bit weight reaches or exceeds the target bit width as invalid.
[0072] By determining the carry-left shift weight and only propagating carry bits with weights lower than the target bit width, the embedded modulus is accurately implemented, ensuring that the compression process is always constrained within the target bit width and avoiding invalid carry propagation.
[0073] This embodiment performs an embedded modulo operation in each compression stage. After the carry vector output by the carry-preserving adder is shifted left by one bit, the original i-th carry signal corresponds to weight 2^(i+1). Based on the target bit width q, if the bit position i+1 of the carry signal is less than q, the signal lies within the effective range of the modulo 2^q operation and is passed to the next compression stage; if i+1 ≥ q, the signal has exceeded the effective range of the target bit width and does not contribute to the final modulo result. In this embodiment, such a carry signal is configured to be invalid in the hardware transmission path, that is, its connection to subsequent circuits is cut, so that it takes no part in any subsequent operation. This processing is carried out in parallel at every level of the compression tree and at the output of every carry-preserving adder, ensuring that the bit width of all intermediate data is always constrained within the target bit width q throughout the entire compression process.
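The embedded modulo step can be sketched in Python as a mask applied to the left-shifted carry vector. This is an illustrative software model only: in hardware the masked bit is simply left unconnected, and the name csa_mod is hypothetical.

```python
def csa_mod(x, y, z, q):
    """Carry-save step whose shifted carry keeps only bit positions below q."""
    mask = (1 << q) - 1
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1   # original carry i now has weight 2**(i+1)
    return s, c & mask                       # the carry with i + 1 >= q is dropped


q = 4
s, c = csa_mod(0b1011, 0b0110, 0b1101, q)
# Both outputs stay q bits wide, and the pair is still correct modulo 2**q.
assert s < 2**q and c < 2**q
assert (s + c) % 2**q == (0b1011 + 0b0110 + 0b1101) % 2**q
```

Dropping the top carry changes the represented value only by a multiple of 2^q, which is why the truncation can be applied at every stage rather than once at the end.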
[0074] In some embodiments, the step of using a carry-pass adder to sum the output vectors to obtain a summation result includes: inputting the two output vectors into the carry-pass adder, performing binary addition through the carry-pass adder to obtain a full-width addition result; extracting the low-order portion of the full-width addition result with a number of bits equal to the target bit width; and determining the low-order portion as the summation result.
[0075] By summing with a carry-propagating adder and keeping only the low-order bits, the compressed result is merged into a standard binary number and the final modulo operation is performed, ensuring that the bit width of the result strictly conforms to the target bit width.
[0076] When the partial product set G_j has been compressed through multiple levels and only two output vectors remain, these two vectors are still in carry-preserving format: one is a sum vector and the other is a carry vector, and their numerical sum equals, modulo 2^q, the sum of all effective partial products in G_j.
[0077] This embodiment uses a carry-propagating adder to combine the two vectors into a standard binary number. The carry-propagating adder performs binary addition with carry propagation, adding each bit of the two vectors bit by bit and propagating the carry, and outputting a full-width addition result. The bit width of this addition result may reach q+1 bits, meaning that the most significant bit may generate a carry output.
[0078] Because this application is limited to operations modulo 2^q, only the lowest q bits of the final result need to be retained. Therefore, in this embodiment, the lowest q bits are taken from the full-width addition result of the carry-passing adder, and all bits above them are discarded. The resulting q-bit binary number is defined as the compression result of the partial product set G_j, that is, the sum, in the modulo 2^q sense, of all effective partial products with weight 2^j.
[0079] Take a partial product set G_j containing 4 vectors as an example, as shown in Figure 4. The first compression stage inputs three vectors into a carry-preserving adder, which outputs a sum vector and a carry vector, while the remaining vector is passed directly to the next stage. The second compression stage inputs the sum vector, the carry vector, and the remaining vector from the first stage into a carry-preserving adder, which outputs a new sum vector and a new carry vector. At this point, the number of vectors has been reduced to two, and the compression process terminates. The final stage sums the two vectors with a carry-passing adder and keeps the lower q bits, obtaining H_j.
[0080] Since there are n distinct multiplier bit positions (j = 0, 1, ..., n-1) in a single vector inner product, there are n corresponding partial product sets G_0 to G_{n-1}. This embodiment deploys n structurally identical CSA addition trees in parallel, compresses every G_j simultaneously, and obtains n summary vectors H_0, H_1, ..., H_{n-1} in parallel, which respectively represent the total contribution of all N multiplications at weights 2^0, 2^1, ..., 2^(n-1).
[0081] Then, a carry-pass adder is used to sum the two output vectors, and the lower q bits are kept to obtain the compression result of G_j.
[0082] After obtaining the n summary vectors H_j, this embodiment needs to sum them into a single inner product result. Because H_j represents the accumulated term of weight 2^j, its actual contribution is H_j × 2^j. Each H_j is therefore shifted left by j bits, with the low bits padded with zeros, to obtain an intermediate vector of width q+j. To stay within the modulo 2^q constraint, only the low q bits are retained after shifting (that is, the high bits beyond q are truncated), so that all intermediate vectors have a uniform bit width of q and aligned weights.
[0083] Subsequently, this embodiment again uses a compression tree constructed from carry-preserving adders to perform multi-level parallel compression on these n shifted and aligned vectors. This compression process is identical to the aforementioned partial product set compression: grouping vectors in threes, CSA compression, left-shifting the carry, embedded modulo (discarding carries with bit weight ≥ q), repeating the compression until two vectors remain, summing with a carry-passing adder, and keeping the lower q bits. The final result is the modulo result of the single vector inner product in the modulo 2^q sense, denoted as Result.
[0084] Take N=4, m=6, n=5, q=8 as an example, as shown in Figure 5. The five partial product sets G0 to G4 are compressed to obtain H0 to H4. H0 is not shifted, H1 is shifted left by 1 bit, H2 by 2 bits, H3 by 3 bits, and H4 by 4 bits. After shifting, the lower 8 bits are kept. The five 8-bit vectors are input into the CSA compression tree, and after multi-stage compression and a carry-passing adder, the 8-bit inner product result is output.
[0085] Through the aforementioned multi-level parallel compression, embedded modulo operation, sign bit integration, and cross-weight accumulation mechanisms, this embodiment efficiently completes the modulo 2^q operation of large-scale vector inner products without any multipliers, relying solely on carry-preserving adders and carry-passing adders, and achieves dual optimization of hardware resource consumption and computation latency.
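The two-stage reduction of a four-vector set can be reproduced with the following Python sketch of the compression tree. It is a software model under stated assumptions, not the hardware of this application: vectors are Python integers, the four input values are arbitrary, and compress_set and csa_mod are hypothetical names.

```python
def csa_mod(x, y, z, q):
    """Carry-save step with the shifted carry limited to q bits."""
    mask = (1 << q) - 1
    return x ^ y ^ z, (((x & y) | (x & z) | (y & z)) << 1) & mask


def compress_set(vectors, q):
    """Reduce q-bit vectors to their sum modulo 2**q; also return the stage count."""
    mask = (1 << q) - 1
    level = [v & mask for v in vectors]
    stages = 0
    while len(level) > 2:
        full = len(level) - len(level) % 3     # vectors that form complete groups of three
        nxt = []
        for i in range(0, full, 3):
            x, y, z = level[i:i + 3]
            nxt.extend(csa_mod(x, y, z, q))    # sum vector and carry vector
        nxt.extend(level[full:])               # ungrouped vectors pass straight through
        level = nxt
        stages += 1
    # Final carry-propagating addition, keeping only the low q bits.
    return sum(level) & mask, stages


result, stages = compress_set([200, 77, 145, 250], q=8)
print(result, stages)
```

With four inputs the loop runs twice, matching the two compression stages described above; each stage shortens the list because every group of three is replaced by two vectors.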
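The complete flow of weight grouping, per-weight compression, shift alignment, and final compression can be sketched in Python as follows. This is an illustrative model only: it assumes signed multiplicands and unsigned n-bit multipliers (the sign bit partial product and its carry compensation are not modelled), the operand values are arbitrary rather than taken from Figure 5, and all function names are hypothetical.

```python
def csa_tree_sum(vectors, q):
    """Sum q-bit vectors modulo 2**q with 3:2 compression and embedded modulo."""
    mask = (1 << q) - 1
    level = [v & mask for v in vectors]
    while len(level) > 2:
        full = len(level) - len(level) % 3
        nxt = []
        for i in range(0, full, 3):
            x, y, z = level[i:i + 3]
            nxt.append(x ^ y ^ z)
            nxt.append((((x & y) | (x & z) | (y & z)) << 1) & mask)
        nxt.extend(level[full:])
        level = nxt
    return sum(level) & mask               # final carry-propagating add, low q bits


def inner_product_mod_pow2(a, b, n, q):
    """a: signed multiplicands, b: unsigned n-bit multipliers (assumption)."""
    mask = (1 << q) - 1
    # G[j]: q-bit effective partial products generated by multiplier bit j.
    G = [[(a_k & mask) if (b_k >> j) & 1 else 0 for a_k, b_k in zip(a, b)]
         for j in range(n)]
    H = [csa_tree_sum(G_j, q) for G_j in G]                    # one tree per weight
    aligned = [(H_j << j) & mask for j, H_j in enumerate(H)]   # shift, keep low q bits
    return csa_tree_sum(aligned, q)


a = [-3, 17, -32, 5]        # N = 4 multiplicands that fit in m = 6 signed bits
b = [21, 7, 30, 12]         # multipliers that fit in n = 5 bits
print(inner_product_mod_pow2(a, b, n=5, q=8))
```

In this model no multiplication is performed: every product is replaced by selected, shifted copies of the multiplicand, which mirrors the multiplier-free structure described in this embodiment.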
[0086] In some embodiments, collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair includes: assigning one vector inner product operation to each element position in the result matrix of the block matrix pair multiplication; feeding, in a pipelined manner, the input vectors corresponding to the element positions into the vector inner product calculation hardware in consecutive calculation cycles; after the pipeline is filled, having the vector inner product calculation hardware continuously output the modulo results of element positions in the subsequent calculation cycles; and writing the continuously output modulo results in sequence into the corresponding positions of the result matrix of the block matrix pair multiplication, to obtain the modulo multiplication result of the block matrix pair.
[0087] In this embodiment, a block matrix pair is a multiplication unit consisting of a block of the first matrix and a block of the second matrix. For a block matrix multiplication of size m×m, the resulting matrix is also m×m, where each element position corresponds to an independent vector inner product operation. First, a vector inner product operation is assigned to each element position of the resulting matrix, requiring a total of m×m operations.
[0088] When hardware resources are sufficient, multiple vector dot product calculation hardware units are placed in parallel to complete the calculation of all or part of the element positions within one cycle. When hardware resources are limited, a pipelined approach is used to time-division multiplex the vector dot product calculation hardware: in consecutive calculation cycles, each cycle feeds a batch of input vector pairs corresponding to several element positions to the hardware, which has a fixed pipeline depth x. In the first x-1 cycles, the pipeline is in the filling stage, with no valid results output; from the x-th cycle onwards, the pipeline is fully filled, and thereafter, the modulo result of one element position is continuously output in each cycle.
[0089] Taking a block dimension of m=4, a pipeline depth of x=3, and four sets of parallel hardware as an example, a total of 16 vector inner product operations are required, which is equivalent to 4 rounds of processing input vector groups, each group consisting of 4 vector pairs. In the first to third cycles, the first to third groups of input vector pairs are continuously fed in, each group containing 4 input vector pairs, filling the pipeline. From the fourth cycle onwards, each cycle outputs a set of modulo results while simultaneously feeding in a new set of input vector pairs; by the seventh cycle, all 4 sets of results have been output. The output modulo results are written to storage units sequentially according to their row and column indices in the result matrix. When the results for all m×m element positions have been written, the complete modulo multiplication result of the block matrix pair is obtained.
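The cycle counts in this example can be reproduced with a small Python timing model. It is a sketch under one explicit assumption drawn from the worked example, namely that a batch fed in cycle c produces its results in cycle c + x; the function name is hypothetical and no circuit behaviour beyond that timing rule is modelled.

```python
def pipeline_schedule(num_ops, units, depth):
    """Return (feed_cycle, output_cycle) for each batch, cycles counted from 1.

    Assumes one batch of `units` inner products is fed per cycle and that a
    batch fed in cycle c emerges in cycle c + depth.
    """
    batches = -(-num_ops // units)          # ceiling division
    return [(c, c + depth) for c in range(1, batches + 1)]


# 4x4 block: 16 inner products, four parallel units, pipeline depth x = 3
schedule = pipeline_schedule(num_ops=16, units=4, depth=3)
print(schedule)
```

Under this assumption the last batch is fed in the fourth cycle and its results appear in the seventh, so the total latency is the number of batches plus the pipeline depth.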
[0090] At the overall scheduling level of large-scale matrix multiplication, the first and second matrices are each divided into multiple blocks, and the blocks of the first matrix are generated sequentially in a predetermined order. Taking a first matrix of 640×640, a second matrix of 640×8, and a block dimension of 4×4 as an example, the first matrix is divided into 160×160 blocks, and the second matrix is divided into 160×2 blocks. Each time a 4×4 block of the first matrix is generated in parallel, that block is scheduled for block matrix multiplication with the corresponding block of the second matrix, and the calculated block result is accumulated at the corresponding position in the target output matrix. When all blocks of the first matrix have been generated and all multiplication results have been accumulated, the final modulo multiplication result of the first and second matrices is obtained. The specific block generation and scheduling process is shown in Figure 6.
[0091] In this example, the block dimension is 4, so matrices A, B, and E can be expressed, with the block matrix as the basic unit, by the following formulas: [formula]; [formula]; [formula]. The matrix multiplication is denoted as: [formula], where 1 ≤ i, j ≤ 160. That is, each time four rows of A are generated from left to right, a complete multiplication with matrix B is performed from top to bottom, and the corresponding four rows of matrix E are continuously accumulated. When all of matrix A has been generated, the large matrix multiplication is also complete.
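The block-by-block accumulation described above can be sketched in Python as follows. This is a software illustration only: it assumes the standard block product identity (each result block is the sum of products of matching blocks), uses ordinary integer multiplication in place of the vector inner product hardware, and the function name and the small 8×8 test matrices are hypothetical.

```python
def blocked_matmul_mod_pow2(A, B, q, blk=4):
    """E = A x B modulo 2**q, accumulated block by block.

    All matrix dimensions are assumed to be multiples of blk.
    """
    mask = (1 << q) - 1
    rows, inner, cols = len(A), len(B), len(B[0])
    E = [[0] * cols for _ in range(rows)]
    for bi in range(0, rows, blk):             # four rows of A at a time
        for bj in range(0, inner, blk):        # blocks of A taken left to right
            for bk in range(0, cols, blk):     # matching blocks of B
                for r in range(bi, bi + blk):
                    for c in range(bk, bk + blk):
                        acc = sum(A[r][t] * B[t][c] for t in range(bj, bj + blk))
                        # Accumulate into the output, keeping the low q bits.
                        E[r][c] = (E[r][c] + acc) & mask
    return E


A = [[(3 * r + c) % 7 - 3 for c in range(8)] for r in range(8)]
B = [[(r - 2 * c) % 5 for c in range(8)] for r in range(8)]
E = blocked_matmul_mod_pow2(A, B, q=16)
```

Because reduction modulo 2^q commutes with addition, truncating after each block accumulation gives the same result as truncating once at the end, so the output matrix never needs more than q bits per element.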
[0092] The circuit structure that implements the algorithm proposed in this application is shown in Figure 7.
[0093] Through the aforementioned pipeline scheduling and hardware resource reuse mechanism, this embodiment ensures that the vector inner product calculation hardware continues to operate at full load while reducing the demand for the number of parallel computing units, thus achieving an optimized balance between computing throughput and hardware resource overhead.
[0094] Based on the above-described method for multiplying large-scale matrices by powers modulo 2, some embodiments of this application also provide a device for multiplying large-scale matrices by powers modulo 2, comprising: a parameter acquisition module, used to acquire the first matrix, the second matrix, and the modulus parameter, wherein the modulus parameter is used to determine the target bit width of the modulo operation; a preprocessing module, used to divide the first matrix and the second matrix into multiple blocks to form multiple block matrix pairs; and a partial product generation module, used to calculate the modulo multiplication result for each of the block matrix pairs by performing multiple vector inner product operations. A single vector inner product operation includes: performing partial product expansion based on the multiple vector element pairs participating in the vector inner product operation, and retaining, according to the target bit width, the effective partial products whose bit weights are lower than the target bit width after expansion; and obtaining the modulo result of the single vector inner product based on the effective partial products. The modulo results of all vector inner products of the same block matrix pair are collected to obtain the modulo multiplication result of the block matrix pair, and the modulo multiplication results of all block matrix pairs are accumulated to obtain the final modulo multiplication result.
[0095] To further achieve efficient pipelined operation of vector inner product, some embodiments also include a valid bit window truncation module, a same-bit width grouping module, and a carry-preserving adder compression network.
[0096] In this embodiment, the single vector inner product operation is jointly implemented by a partial product generation module, a valid bit window truncation module, a same-width grouping module, and a carry-preserving adder compression network. First, the partial product generation module generates a partial product of the numerical bits corresponding to the multiplier's numerical bits and a partial product of the sign bits corresponding to the multiplier's sign bits for each pair of vector elements participating in the inner product operation. The generation of the sign bit partial product employs a distributed processing method, where the arithmetic compensation amount corresponding to the bit-by-bit inversion operation is independently designated as a carry compensation signal. This signal does not enter the compression network with the partial product but is instead supplied as an implicit carry input to the subsequent compression process.
[0097] Subsequently, the effective bit window truncation module truncates each partial product in real time according to the target bit width. If the weight shift amount of the partial product is greater than or equal to the target bit width, the partial product is completely discarded; if the weight shift amount is less than the target bit width, only the lowest bits of the partial product are kept, their number being the target bit width minus the weight shift amount. When the target bit width is greater than or equal to the multiplicand bit width, the high-bit gaps are filled with the logical AND of the multiplicand sign bit and the corresponding multiplier bit, yielding effective partial products whose bit width is uniformly equal to the target bit width.
[0098] The same-width grouping module groups all effective partial products with the same binary weight into the same partial product set. Each partial product set is independently input into a tree-structured compression network composed of carry-preserving adders. The compression network compresses the vectors in groups of three. In each compression stage, the carry-preserving adder outputs a sum vector and a carry vector. After the carry vector is shifted left by one bit, the portion whose bit weight reaches or exceeds the target bit width is configured as invalid in the transmission path. After multiple compression stages, when only two vectors remain, a carry-propagating adder sums these two vectors and the low-order bits are kept to obtain the compression result of the partial product set. Meanwhile, during compression at the corresponding weight level, the carry compensation signal separated in the previous stage is fed into the compression network as an implicit carry input, completing the accumulation of the arithmetic compensation for the sign bit partial product.
[0099] After the partial product set compression results of each weight level are generated in parallel, they are left-shifted and aligned according to their respective binary weights. All the shifted and aligned vectors are then input into the carry-preserving adder compression network for multi-level compression and summation, finally obtaining the modulo result of the single vector inner product in the sense of modulo 2 to the power of q.
[0100] The technical solutions of this invention are not limited to the specific implementations described in the above embodiments. Based on the technical concept of this invention, those skilled in the art can adopt other alternative solutions to achieve the same or equivalent large matrix modulo multiplication operations according to different application scenarios and hardware resource constraints.
[0101] One possible alternative is block convolution multiplication. This scheme divides the two integers involved in the operation into multiple data blocks of a fixed bit width, and uses several small-bit-width multipliers to calculate the products between the blocks in parallel or in a time-shared manner. All block products are accumulated by diagonal convolution, and the carries arising during accumulation are processed uniformly. Finally, the result is truncated once to the required bit width to obtain the multiplication result modulo 2^q. This scheme has an intuitive structure, makes it easy to balance circuit area against operation timing, and allows complete matrix multiplication operations to be built on this multiplication unit.
[0102] Another possible alternative is fast large-number multiplication. This approach includes implementations such as Karatsuba multiplication, Toom-Cook multiplication, the fast Fourier transform, and number-theoretic transforms. The core idea is to transform large integer multiplication into lower-dimensional sub-multiplications or frequency-domain convolutions, thereby significantly reducing the number of multiplications. During execution, this approach can also retain only the intermediate coefficients that contribute to the low q bits of the result modulo 2^q, and truncate the result at the end of the calculation. This approach can further reduce computational complexity and improve throughput in scenarios where the operands of the multiplication have extremely large bit widths.
[0103] The aforementioned alternative solutions differ from the technical solution of this invention in their specific implementation structure, but each of them can achieve the core function of large-matrix multiplication modulo an integer power of 2, and they provide flexible choices in application scenarios with different hardware resource constraints and performance requirements. By describing the above alternative embodiments in the specification, the scope of protection of this invention is further clarified and expanded, to prevent others from circumventing the technical solution defined by the claims of this invention merely through formal modifications or substitutions.
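As a rough sketch of this alternative, the following Python function splits two operands into fixed-width blocks, accumulates the block products along diagonals, and truncates once at the end. The block width w, the function name, and the treatment of operands as q-bit patterns are assumptions made for illustration; the paragraph above does not fix any of them.

```python
def block_mul_mod_pow2(a, b, q, w=4):
    """Multiply modulo 2**q using w-bit blocks and diagonal accumulation."""
    mask = (1 << q) - 1
    a &= mask
    b &= mask
    blocks = -(-q // w)                                   # ceiling division
    A = [(a >> (w * i)) & ((1 << w) - 1) for i in range(blocks)]
    B = [(b >> (w * i)) & ((1 << w) - 1) for i in range(blocks)]
    acc = 0
    for d in range(blocks):                               # diagonal d has weight 2**(w*d)
        diag = sum(A[i] * B[d - i] for i in range(d + 1))
        acc += diag << (w * d)
    # Diagonals with d >= blocks have weight >= 2**q and are never computed.
    return acc & mask                                     # single truncation at the end


print(block_mul_mod_pow2(0xBEEF, 0x1234, q=16))
```

Only the diagonals below the target bit width are evaluated, so roughly half of the block multiplications of a full product are skipped.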
[0104] Similar parts between the embodiments provided in this application can be referred to mutually. The specific implementation methods provided above are only a few examples under the overall concept of this application and do not constitute a limitation on the scope of protection of this application. For those skilled in the art, any other implementation methods extended from the solution of this application without creative effort shall fall within the scope of protection of this application.
Claims
1. A method for multiplication of large-scale matrices by powers modulo 2, characterized in that, include: Obtain the first matrix, the second matrix, and the modulus parameter, wherein the modulus parameter is used to determine the target bit width of the modulo operation; The first matrix and the second matrix are each divided into multiple blocks to form multiple block matrix pairs; For each of the said block matrix pairs, the modulo multiplication result is calculated by performing multiple vector inner product operations; The single vector inner product operation includes: performing a partial product expansion based on multiple vector element pairs participating in the vector inner product operation, and retaining the effective partial products whose bit weights are lower than the target bit width after expansion according to the target bit width; obtaining the modulo result of the single vector inner product based on the effective partial products; collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair; and accumulating the modulo multiplication results of all block matrix pairs to obtain the final modulo multiplication result.
2. The method for multiplication of large-scale matrices by powers modulo 2 according to claim 1, characterized in that, The expansion of the partial product based on the multiple vector element pairs participating in the vector inner product operation includes: Extract the sign and value bits of the vector elements; Based on the numerical bits, generate a partial product of numerical bits corresponding to the multiplier numerical bits; Based on the sign bit, generate a sign bit partial product corresponding to the sign bit of the multiplier; The multiplicative partial product is expanded based on the partial product of the numerical bits and the partial product of the sign bits.
3. The method for multiplication of large-scale matrices by powers modulo 2 according to claim 1, characterized in that, The step of obtaining the modulo result of the single vector inner product based on the effective partial product includes: Obtain the position of the multiplier binary bit associated with each partial product in the effective partial products; Based on the binary weights represented by the positions of the binary bits of the multiplier, valid partial products with the same binary weights are grouped into the same set, forming multiple sets of partial products; Multi-level parallel compression is performed on the effective partial products in each of the partial product sets to obtain the compression result; The compression results are shifted and aligned according to the corresponding binary weights, and then added together to obtain the modulo result of the single vector inner product.
4. The method for large-scale matrix multiplication modulo an integer power of 2 according to claim 3, characterized in that performing multi-level parallel compression on the effective partial products in each of the partial product sets to obtain the compression result comprises: dividing the partial product set into groups of a predetermined number of vectors to obtain a plurality of groups of effective partial products; feeding each group of effective partial products into a compression network, where it is compressed by a carry-save adder that outputs a sum vector and a carry vector, wherein the compression network comprises a plurality of carry-save adders, the carry vector is shifted left by one bit before participating in the next compression stage, and, in each compression stage, the signal portion of the left-shifted carry vector whose bit weight reaches or exceeds the target bit width is configured as invalid on the transmission path; using all output sum vectors and carry vectors, together with the remaining ungrouped vectors, as the input vectors of the next compression stage, and repeating the grouping and compression until only two output vectors remain; summing the two output vectors using a carry-propagate adder to obtain a summation result; and obtaining the compression result of the partial product set by restricting the bit width of the summation result to within the target bit width.
5. The method for large-scale matrix multiplication modulo an integer power of 2 according to claim 4, characterized in that configuring as invalid on the transmission path, in each compression stage, the signal portion of the left-shifted carry vector whose bit weight reaches or exceeds the target bit width comprises: in each compression stage, when the carry-save adder outputs a carry vector, determining, based on the target bit width, the bit weights of the carry vector after it is shifted left by one bit; passing the portion whose bit weight is lower than the target bit width to the next compression stage; and configuring the portion whose bit weight reaches or exceeds the target bit width as invalid.
6. The method for large-scale matrix multiplication modulo an integer power of 2 according to claim 4, characterized in that summing the two output vectors using a carry-propagate adder to obtain the summation result comprises: inputting the two output vectors to the carry-propagate adder and performing binary addition through the carry-propagate adder to obtain a full-width addition result; extracting, from the full-width addition result, the low-order bits whose number is equal to the target bit width; and determining the low-order bits as the summation result.
7. The method for large-scale matrix multiplication modulo an integer power of 2 according to claim 2, characterized in that generating, based on the sign bit, the sign-bit partial product corresponding to the sign bit of the multiplier comprises: in response to the sign bit indicating a negative number, performing a bitwise inversion on the value bits to obtain inverted value bits; defining the arithmetic compensation amount corresponding to the bitwise inversion as a carry compensation signal with a predetermined weight; forming a partial product from the inverted value bits; and feeding the carry compensation signal, as an implicit carry input, into the carry-save adder that compresses the partial product set of the corresponding weight, so as to generate the sign-bit partial product.
8. The method for large-scale matrix multiplication modulo an integer power of 2 according to claim 2, characterized in that retaining, according to the target bit width, the effective partial products whose bit weights after expansion are lower than the target bit width comprises: determining a weight shift amount based on the multiplier bit position of the expanded partial product; if the weight shift amount is greater than or equal to the target bit width, discarding the partial product; if the weight shift amount is less than the target bit width, retaining the low-order bits of the partial product, wherein the number of retained low-order bits is equal to the target bit width minus the weight shift amount; and if the target bit width is greater than or equal to the multiplicand bit width, filling the missing high-order bits of the partial product with the result of the logical AND of the sign bit of the multiplicand and the corresponding multiplier bit, thereby obtaining an effective partial product whose bit width is equal to the target bit width.
9. The method for large-scale matrix multiplication modulo an integer power of 2 according to claim 1, characterized in that collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of the block matrix pair comprises: assigning one vector inner product operation to each element position in the result matrix of the multiplication of the block matrix pair; feeding, in a pipelined manner, the input vectors corresponding to the element positions into the vector inner product calculation hardware in consecutive calculation cycles, wherein, after the pipeline is filled, the vector inner product calculation hardware continuously outputs the modulo result of one element position in each subsequent calculation cycle; and writing the continuously output modulo results sequentially into the corresponding positions of the multiplication result matrix of the block matrix pair to obtain the modulo multiplication result of the block matrix pair.
10. A device for large-scale matrix multiplication modulo an integer power of 2, characterized by comprising: a parameter acquisition module, used to acquire a first matrix, a second matrix, and a modulus parameter, wherein the modulus parameter is used to determine a target bit width of the modulo operation; a preprocessing module, used to divide the first matrix and the second matrix into blocks to form a plurality of block matrix pairs; and a partial product generation module, used to calculate, for each of the block matrix pairs, a modulo multiplication result by performing a plurality of vector inner product operations, wherein a single vector inner product operation comprises: expanding partial products based on the plurality of vector element pairs participating in the vector inner product operation, and retaining, according to the target bit width, the effective partial products whose bit weights after expansion are lower than the target bit width; and obtaining the modulo result of the single vector inner product based on the effective partial products; collecting the modulo results of all vector inner products of the same block matrix pair to obtain the modulo multiplication result of that block matrix pair; and accumulating the modulo multiplication results of all block matrix pairs to obtain the final modulo multiplication result.
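For illustration only, the arithmetic recited in the claims can be modelled in software. The Python sketch below is a bit-level model under stated assumptions, not the claimed hardware: the function names (effective_partial_products, csa, compress, inner_product_mod, block_matmul_mod) and the block-size parameter bs are hypothetical, q stands for the target bit width, and the compression network is assumed to be built from 3:2 carry-save adders. It also simplifies two points: operands are reduced to their two's-complement residues modulo 2^q instead of using the sign-bit and value-bit split with carry compensation, and partial products are shifted to their weight before compression rather than being grouped by weight and aligned afterwards.

```python
from typing import List, Tuple

def effective_partial_products(a: int, b: int, q: int) -> List[int]:
    """Shift-and-add expansion of a*b that keeps only bits of weight < q."""
    mask = (1 << q) - 1
    a &= mask  # two's-complement residue mod 2^q (assumption: replaces sign/value split)
    b &= mask
    rows = []
    for shift in range(q):  # multiplier bits with shift >= q are never generated
        if (b >> shift) & 1:
            # keep the low (q - shift) bits of the multiplicand at weight 2^shift
            rows.append((a << shift) & mask)
    return rows

def csa(x: int, y: int, z: int, q: int) -> Tuple[int, int]:
    """3:2 carry-save adder; carry bits whose weight reaches q are dropped."""
    mask = (1 << q) - 1
    sum_vec = x ^ y ^ z
    carry_vec = ((x & y) | (x & z) | (y & z)) << 1  # left shift by one bit
    return sum_vec & mask, carry_vec & mask

def compress(rows: List[int], q: int) -> Tuple[int, int]:
    """Repeat grouping and carry-save compression until two vectors remain."""
    rows = list(rows)
    while len(rows) > 2:
        grouped = len(rows) - len(rows) % 3
        nxt = []
        for k in range(0, grouped, 3):
            nxt.extend(csa(rows[k], rows[k + 1], rows[k + 2], q))
        nxt.extend(rows[grouped:])  # ungrouped vectors pass to the next stage
        rows = nxt
    rows += [0] * (2 - len(rows))  # pad when fewer than two vectors exist
    return rows[0], rows[1]

def inner_product_mod(u: List[int], v: List[int], q: int) -> int:
    """Inner product of u and v modulo 2^q via truncated partial products."""
    rows = []
    for a, b in zip(u, v):
        rows.extend(effective_partial_products(a, b, q))
    s, c = compress(rows, q)
    return (s + c) & ((1 << q) - 1)  # carry-propagate add, keep the low q bits

def block_matmul_mod(A, B, q: int, bs: int):
    """Blocked product of A and B modulo 2^q, accumulating block results."""
    n, k, m = len(A), len(B), len(B[0])
    mask = (1 << q) - 1
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):  # one block matrix pair per (i0, j0, k0)
                for i in range(i0, min(i0 + bs, n)):
                    row = A[i][k0:k0 + bs]
                    for j in range(j0, min(j0 + bs, m)):
                        col = [B[t][j] for t in range(k0, min(k0 + bs, k))]
                        part = inner_product_mod(row, col, q)
                        C[i][j] = (C[i][j] + part) & mask
    return C
```

Calling block_matmul_mod(A, B, 16, bs) with integer matrices given as lists of rows corresponds to the modulo 2^16 truncation used in the FrodoKEM example of the background section. Because every discarded bit has a weight of at least 2^q, dropping it at any stage leaves the residue modulo 2^q unchanged, which is why the truncated carry-save network agrees with a full-width multiplication followed by a final reduction.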