A method and related apparatus for SHA-3 lightweight hardware implementation
By integrating the delayed ρ-shift mechanism and write-back operation, and combining it with the time-division multiplexing operation module, the problem of high resource consumption in traditional SHA-3 hardware implementation is solved, achieving more efficient use of hardware resources.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
- Filing Date
- 2026-05-06
- Publication Date
- 2026-06-19
Figure CN122247635A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of post-quantum cryptography, and more specifically, to a lightweight hardware implementation method and related apparatus for SHA-3.
Background Technology
[0002] With the continuous development of quantum computing technology, traditional public-key cryptography algorithms based on large integer factorization and discrete logarithm problems (such as RSA and ECC) are theoretically at risk of being cracked by quantum algorithms. Therefore, post-quantum cryptography (PQC) has gradually become an important research direction in the field of cryptography. Among the various proposed PQC algorithms, lattice-based cryptographic algorithms (such as Kyber and Dilithium) are considered one of the most promising candidates due to their relatively good security and efficiency.
[0003] In the implementation of these post-quantum cryptographic algorithms, hash functions are a crucial fundamental operator. For example, in algorithms such as Kyber and Dilithium, SHA-3 and its derivative function SHAKE are widely used in key stages such as random number generation, message digest calculation, and key derivation. Therefore, in actual hardware implementations, hash modules are often frequently invoked, and their resource consumption and performance directly impact the efficiency of the entire cryptographic system.
[0004] The SHA-3 algorithm is based on the Keccak structure. Its core computation process consists of five transformations: θ, ρ, π, χ, and ι, and requires multiple rounds of repeated computation. Because its internal state size reaches 1600 bits and complex data transformations are required in each round of computation, its hardware implementation often consumes significant logic and storage resources.
Summary of the Invention
[0005] In view of the above-mentioned problems in the existing technology, the purpose of this application is to provide a lightweight hardware implementation method and related device for SHA-3, which is beneficial to reduce the consumption of logic resources and storage resources.
[0006] The first aspect of this application provides a lightweight hardware implementation method for SHA-3, the method comprising: initializing the state matrix and absorbing the input data into the state matrix; and starting the round function calculation process, wherein, when the current round number is less than or equal to the preset round number, the target data is read from the state matrix, and the θ transformation, the χ transformation, the ι transformation, and a write-back operation fusing the ρ transformation and the π transformation are performed on the target data in sequence to obtain the transformed state matrix; and, when the current round number is greater than the preset round number, the hash result is calculated and output based on the transformed state matrix.
[0007] Optionally, the write-back operation of sequentially performing θ transformation, χ transformation, ι transformation, fused ρ transformation, and π transformation on the target data to obtain the transformed state matrix includes: The target data is sequentially transformed using the θ, χ, and ι transformations to obtain intermediate data; and the read storage address is transformed using the π transformation to obtain the target storage address. The offset is determined by looking up a table based on the target storage address; The intermediate data is transformed using the offset to obtain the transformed target data. The transformed target data is written back to the SRAM according to the target storage address to obtain the transformed state matrix.
[0008] Optionally, performing θ-transform, χ-transform, and ι-transform on the target data sequentially to obtain intermediate data includes: Perform a θ-transform on the target data to obtain the target data after column mixing and diffusion; The target data after column mixing and diffusion is subjected to χ transform to obtain the target data after nonlinear transformation; The target data after nonlinear transformation is subjected to ι transformation to obtain intermediate data.
[0009] Optionally, the θ transform and the ι transform are executed by the first operation module, and the π transform and the χ transform are executed by the second operation module. The first operation module and the second operation module work in a time-division multiplexing manner.
[0010] Optionally, the method further includes: Each lane of the state matrix is mapped to a storage address in SRAM, and each storage address corresponds to a group of 64 physical storage units; Reading target data from the state matrix includes: The target data is read from the SRAM according to the obtained storage address.
[0011] Optionally, the storage address of each lane in the SRAM can be calculated in the following way: Storage address = lane_index × 64 + bit_offset, lane_index = 5 × y + x, Where lane_index represents the index of the lane, x represents the column coordinate of the state matrix, y represents the row coordinate of the state matrix, and bit_offset represents the bit offset within the lane.
[0012] Optionally, when the current round is the first round, the obtained storage address is a preset initial storage address; when the current round is not the first round, the obtained storage address is the target storage address in the previous round.
[0013] A second aspect of this application provides a lightweight hardware implementation apparatus for SHA-3, the apparatus comprising: a data input unit, used to initialize the state matrix and absorb input data into the state matrix; and a round function calculation unit, used to initiate the round function calculation process, wherein, when the current round number is less than or equal to the preset round number, the target data is read from the state matrix, and the θ transformation, the χ transformation, the ι transformation, and a write-back operation fusing the ρ transformation and the π transformation are performed on the target data in sequence to obtain the transformed state matrix; and, when the current round number is greater than the preset round number, the hash result is calculated and output based on the transformed state matrix.
[0014] A third aspect of this application provides an electronic device, including: a processor and a memory; The processor is connected to a memory, wherein the memory is used to store computer programs and the processor is used to invoke the computer programs to execute the methods as described in the first aspect of the embodiments of this application.
[0015] A fourth aspect of this application provides a computer-readable storage medium storing a computer program, the computer program including program instructions, which, when executed by a processor, perform the method as described in the first aspect of this application.
[0016] Compared to traditional SHA-3 hardware implementations, this application firstly features a delayed ρ-shift mechanism. This means that shift operations are not performed immediately during intermediate computation phases, but rather through address mapping during the final write-back. This avoids performing numerous shift operations in each round of computation, reducing the number of shift logic circuits, lowering resource consumption such as lookup tables, and reducing data movement during the computation phase, thus effectively reducing hardware implementation complexity. Secondly, the write-back operation integrates ρ and π transformations, reducing one data movement and thus decreasing the number of memory accesses. Furthermore, since the shift operation is completed during the address calculation phase, no additional shift circuits are required. In summary, the lightweight SHA-3 hardware implementation method and related apparatus provided in this application are beneficial for reducing the consumption of logic and storage resources.
Attached Figure Description
[0017] Figure 1 shows a flowchart of a lightweight hardware implementation method for SHA-3 provided in one embodiment of this application; Figure 2 shows a flowchart of a round function calculation method provided in one embodiment of this application; Figure 3 shows a schematic diagram of the structure of a lightweight hardware implementation device for SHA-3 provided in one embodiment of this application; Figure 4 shows a schematic diagram of the structure of a computer device provided in one embodiment of this application.
Detailed Implementation
[0018] In the Keccak algorithm, the internal state consists of a 5×5×64-bit three-dimensional state matrix, containing a total of 1600 bits of data. This state matrix is typically represented as S[x][y][z], where x ∈ {0,1,2,3,4}, y ∈ {0,1,2,3,4}, and z ∈ {0,1,...,63}. In each round of computation, the algorithm needs to perform five operations sequentially on the state matrix: θ (Theta) transformation, ρ (Rho) transformation, π (Pi) transformation, χ (Chi) transformation, and ι (Iota) transformation. These operations together constitute the Keccak round function, which needs to be repeated 24 times to complete a full SHA-3 hash calculation.
[0019] In traditional hardware implementations, these transformations are typically performed sequentially, with intermediate results written back to registers or memory after each step and then read again by the next step. While this approach is structurally simple, it results in numerous intermediate read/write operations, increasing system latency and power consumption. Consider a specific lane as an example. With an initial storage location of (x=1, y=2), the target storage location after the π transformation is: (x', y') = (y, (2x + 3y) mod 5) = (2, (2×1 + 3×2) mod 5) = (2, 8 mod 5) = (2, 3). Assuming an offset of 5, the first round proceeds as follows:
Step 1 (θ): Read SRAM, perform θ, and write back to SRAM.
[0020] Step 2 (ρ): Read SRAM (1,2), shift the data left by 5 bits, and write it back to (1,2).
[0021] Step 3 (π): Read (1,2) and write it to (2,3).
[0022] Step 4 (χ): Read → Calculate → Write.
[0023] Step 5 (ι): Read → XOR round constant → Write.
[0024] Rounds two through twenty-four follow the same sequence as round one. Each round therefore requires five read-and-write passes over SRAM, which means frequent memory access and higher system power consumption. The ρ transformation requires a cyclic shift operation for each lane, which typically requires additional shift logic circuitry and thus increases lookup table resource consumption. The π transformation requires rearranging the lanes in the state matrix, which requires moving data. If the ρ and π steps are executed separately in the hardware implementation, two data movement operations result, increasing latency and power consumption.
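The coordinate arithmetic in the example above can be checked with a short Python sketch. The function name `pi_target` is hypothetical and introduced only for illustration; the mapping itself is the one stated in paragraph [0019].

```python
def pi_target(x: int, y: int) -> tuple[int, int]:
    """Return the lane position (x', y') to which the pi step moves lane (x, y)."""
    return y, (2 * x + 3 * y) % 5


# Lane (1, 2) from the worked example.
example_target = pi_target(1, 2)
print(example_target)

# pi only rearranges lanes: the 25 source positions map to 25 distinct targets.
targets = {pi_target(x, y) for x in range(5) for y in range(5)}
print(len(targets))
```

Because the mapping is a permutation of the 25 lane positions, it can be realized purely as an address change, which is what the fused write-back described later relies on.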
[0025] In view of the above-mentioned problems in the existing technology, the purpose of this application is to provide a lightweight hardware implementation method and related device for SHA-3, which is beneficial to reduce the consumption of logic resources and storage resources.
[0026] The present application will be further described below with reference to specific embodiments.
[0027] Please refer to Figure 1, which shows a flowchart of a lightweight hardware implementation method for SHA-3 provided in an embodiment of this application. The method includes the following steps: Step 10: Initialize the state matrix and absorb the input data into the state matrix; Step 20: Start the round function calculation process.
[0028] Initializing the state matrix involves setting all elements of the 5×5×64-bit three-dimensional state matrix to zero, or to the initial values specified by the algorithm. Absorbing input data into the state matrix involves dividing the input data into blocks whose size is determined by the algorithm, such as the rate r; each data block is then XORed with the first r bits of the state matrix.
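As a minimal sketch of the absorb step, the following Python fragment XORs one block into the first r bits of a zero-initialized state. It assumes a rate of 1088 bits (the SHA3-256 value), lanes ordered by index 5 × y + x, and little-endian lane encoding as in the SHA-3 standard; padding and the permutation between blocks are deliberately left out. The names `RATE_BYTES` and `absorb_block` are illustrative and not taken from this application.

```python
RATE_BYTES = 136  # assumption: SHA3-256, rate r = 1088 bits = 17 lanes


def absorb_block(state: list[int], block: bytes) -> None:
    """XOR one r-bit block into the first r bits of the 25-lane state (in place)."""
    if len(state) != 25 or len(block) != RATE_BYTES:
        raise ValueError("expected 25 lanes and one full rate-sized block")
    for i in range(RATE_BYTES // 8):
        # Each group of 8 bytes forms one 64-bit lane, least significant byte first.
        lane = int.from_bytes(block[8 * i : 8 * i + 8], "little")
        state[i] ^= lane


state = [0] * 25                   # initialization: all 1600 bits set to zero
block = bytes(range(RATE_BYTES))   # placeholder data standing in for a padded block
absorb_block(state, block)
```

Only the first 17 lanes are touched; the remaining 8 lanes (the capacity) are left unchanged by the absorb step.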
[0029] Furthermore, the method also includes: Each lane of the state matrix is mapped to a storage address in SRAM, and each storage address corresponds to a group of 64 physical storage units; Reading target data from the state matrix includes: The target data is read from the SRAM according to the obtained storage address.
[0030] The storage address of each lane in the SRAM can be calculated in the following way: Storage address = lane_index × 64 + bit_offset, lane_index = 5 × y + x, Where lane_index represents the index of the lane, x represents the column coordinate of the state matrix, y represents the row coordinate of the state matrix, and bit_offset represents the bit offset within the lane.
[0031] The 5×5×64-bit three-dimensional state matrix can be represented as a 5×5 lane matrix, with each lane containing 64 bits of data. Therefore, the entire state matrix can be represented as 25 64-bit data units. Using the bit-based addressing method described above, access to the two-dimensional lane matrix can be converted into linear address access, thereby simplifying the memory access logic.
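The addressing rule above can be written as a small Python sketch. The function names are hypothetical; the formulas are the ones given in paragraph [0030].

```python
LANE_BITS = 64  # each lane holds 64 bits


def lane_index(x: int, y: int) -> int:
    """Linear index of the lane at column x, row y."""
    return 5 * y + x


def storage_address(x: int, y: int, bit_offset: int = 0) -> int:
    """Bit-level storage address: lane_index * 64 + bit_offset."""
    if not (0 <= x < 5 and 0 <= y < 5 and 0 <= bit_offset < LANE_BITS):
        raise ValueError("coordinates or bit offset out of range")
    return lane_index(x, y) * LANE_BITS + bit_offset


print(storage_address(0, 0))       # first bit of the first lane
print(storage_address(1, 2))       # first bit of lane (1, 2)
print(storage_address(4, 4, 63))   # last bit of the last lane
```

The addresses cover the range 0 to 1599 exactly once, so the 1600-bit state occupies one contiguous block of memory.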
[0032] In traditional SHA-3 hardware implementations, the 1600-bit state is typically stored in a register array. While registers offer high access speed, they consume significant flip-flop resources, increasing hardware area. The mapping method described above offers the following advantages: First, it significantly reduces the number of flip-flops (FFs), thus reducing hardware area; second, SRAM has a larger storage capacity, making it more suitable for storing large amounts of state data; and third, a unified address mapping rule simplifies controller design. Therefore, SRAM state mapping becomes a crucial foundation for implementing lightweight SHA-3 hardware architectures.
[0033] Please refer to Figure 2, which shows a flowchart of a round function calculation method according to an embodiment of this application. The round function calculation method includes the following steps: Step 201: Determine whether the current round number is less than or equal to the preset round number; Step 202: When the current round number is less than or equal to the preset round number, read the target data from the state matrix, perform the θ transformation, the χ transformation, the ι transformation, and the write-back operation fusing the ρ transformation and the π transformation on the target data in sequence to obtain the transformed state matrix, and then return to step 201; Step 203: When the current round number is greater than the preset round number, calculate and output the hash result based on the transformed state matrix.
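The control flow of steps 201 to 203 can be sketched in Python as a simple counter loop. This is only an outline under stated assumptions: `run_rounds` and `round_step` are hypothetical names, and the round body is passed in as a callable so that the loop structure can be shown independently of the transformations.

```python
from typing import Callable, List


def run_rounds(
    state: List[int],
    round_step: Callable[[List[int], int], List[int]],
    preset_rounds: int = 24,
) -> List[int]:
    """Apply round_step while the current round number is <= preset_rounds."""
    current_round = 1
    while current_round <= preset_rounds:         # step 201: compare round numbers
        state = round_step(state, current_round)  # step 202: one full round
        current_round += 1
    return state                                  # step 203: state is ready for output


# Stand-in round body that only records which rounds were executed.
executed: List[int] = []


def dummy_round(state: List[int], round_number: int) -> List[int]:
    executed.append(round_number)
    return state


final_state = run_rounds([0] * 25, dummy_round)
print(len(executed))
```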
[0034] Specifically, when the current round is the first round, the obtained storage address is the preset initial storage address; when the current round is not the first round, the obtained storage address is the target storage address in the previous round.
[0035] Specifically, before the first round of permutation begins, data needs to be read from the state matrix. The address read at this time is a pre-defined initial address. The initial address is usually the address of the first lane in the state matrix, such as the address corresponding to coordinate (0,0) (lane_index = 5 × y + x = 0, so the storage address is 0). However, the preset initial storage address can also be customized by the designer, as long as it covers the entire state matrix.
[0036] After the first round of permutation begins, the position written in the previous round automatically becomes the starting point for the next round of reading. This eliminates the need to additionally record or recalculate the coordinate mapping of the current state matrix, simplifying the control logic.
[0037] The preset number of rounds is generally 24, but it can be set according to the specific algorithm. For example, if a longer hash value output is required (such as with the extendable-output function SHAKE), more rounds need to be executed. Here, the specific value of the preset number of rounds is not limited.
[0038] As can be seen, in step 202, compared with the traditional SHA-3 hardware implementation, this application firstly features a delayed ρ-shift mechanism. That is, the shift operation is not performed immediately during the intermediate calculation stage, but rather the logical shift is achieved through address mapping during the final write-back. This avoids performing a large number of shift operations in each round of calculation, reducing the number of shift logic circuits, reducing resource consumption such as lookup tables, and reducing data movement during the computation stage, thereby effectively reducing hardware implementation complexity. Secondly, the write-back operation integrates ρ-transformation and π-transformation, which can reduce one data movement, thereby reducing the number of memory accesses. Simultaneously, since the shift operation is completed during the address calculation stage, no additional shift circuit is required. In summary, adopting the lightweight SHA-3 hardware implementation method and related apparatus provided in this application is beneficial for reducing the consumption of logic and storage resources.
[0039] Step 203 is consistent with the traditional algorithm. When the round counter indicates that the number of completed rounds equals the preset number of rounds, such as 24, the 1600-bit state matrix has undergone all rounds of permutation and the internal data is sufficiently mixed, so the desired hash value can be read. For example, starting from coordinates (0,0) or a preset initial coordinate, each 64-bit lane is extracted sequentially in column-major or row-major order, the lanes are concatenated into a bit string, and the first n bits are taken as the final hash value (n depends on the specific algorithm, such as 256 bits for SHA3-256 and 512 bits for SHA3-512). If a longer output is needed (such as with the extendable-output function SHAKE), more rounds may need to be executed and read; this embodiment only describes the basic hash output, and other examples are not specifically provided.
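The read-out described above might look as follows in Python. This sketch assumes that lanes are stored in order of lane_index = 5 × y + x, that each lane is serialized least significant byte first (as in the SHA-3 standard), and that n is a multiple of 8; `extract_digest` is a hypothetical name.

```python
def extract_digest(state: list[int], n_bits: int) -> bytes:
    """Concatenate the 25 lanes and return the first n_bits as bytes."""
    if len(state) != 25 or n_bits % 8 != 0 or not 0 < n_bits <= 1600:
        raise ValueError("expected 25 lanes and a byte-aligned length up to 1600 bits")
    data = b"".join(lane.to_bytes(8, "little") for lane in state)
    return data[: n_bits // 8]


# Placeholder state: only lane (0, 0) is non-zero.
state = [0x0807060504030201] + [0] * 24
digest = extract_digest(state, 256)   # 256 bits, as for SHA3-256
print(digest.hex())
```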
[0040] Specifically, the write-back operation of sequentially performing θ transformation, χ transformation, ι transformation, fused ρ transformation, and π transformation on the target data to obtain the transformed state matrix includes: The target data is sequentially transformed using the θ, χ, and ι transformations to obtain intermediate data; and the read storage address is transformed using the π transformation to obtain the target storage address. The offset is determined by looking up a table based on the target storage address; The intermediate data is transformed using the offset to obtain the transformed target data. The transformed target data is written back to the SRAM according to the target storage address to obtain the transformed state matrix.
[0041] In Keccak, the number of rotation bits (ρ) for each lane is determined by the original coordinates (x, y) (using a fixed 25-entry table). However, in this scheme, since the π transformation has already mapped the coordinates to the new address, and we know the target memory address, we can pre-create a table indexed by the address, storing the number of rotation bits corresponding to the original coordinates. Using the obtained target address as an index, we look up a 6-bit (0~63) offset in this table. Then, we use intermediate data (64 bits) to perform a cyclic shift according to the offset obtained from the table lookup to obtain the updated target data. Finally, we write the updated target data to the target memory address in SRAM.
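The lookup-and-write-back step can be illustrated with the following Python sketch. It builds the offset table from the standard Keccak ρ offsets, re-indexed by the π target lane, and then performs one fused write-back. Several points are assumptions made for illustration: the SRAM is modeled as 25 lane-wide words rather than bit-addressed cells, the result is written into a separate destination buffer to sidestep overwrite hazards that this application does not discuss, and all function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def rotl64(value: int, r: int) -> int:
    """Cyclic left shift of a 64-bit word by r bits."""
    r %= 64
    return ((value << r) | (value >> (64 - r))) & MASK64


def build_offset_table() -> list[int]:
    """Table indexed by target lane index (5*y' + x'); each entry is the rho
    offset that belongs to the source lane mapped there by pi."""
    table = [0] * 25          # lane (0, 0) has offset 0 and maps to itself
    x, y = 1, 0
    for t in range(24):
        offset = ((t + 1) * (t + 2) // 2) % 64    # standard Keccak rho offset
        tx, ty = y, (2 * x + 3 * y) % 5           # pi target of source (x, y)
        table[5 * ty + tx] = offset
        x, y = tx, ty         # the rho walk follows the same coordinate map
    return table


def fused_write_back(dst: list[int], x: int, y: int, intermediate: int,
                     table: list[int]) -> int:
    """Rotate the intermediate lane by the looked-up offset and store it at
    the pi target address; returns the target lane index."""
    tx, ty = y, (2 * x + 3 * y) % 5
    target = 5 * ty + tx
    dst[target] = rotl64(intermediate, table[target])
    return target


table = build_offset_table()
dst = [0] * 25
target = fused_write_back(dst, 1, 2, 0x1, table)
print(target, table[target], hex(dst[target]))
```

Note that the standard offset for source lane (1, 2) is 10; the value 5 used in the earlier worked example is only an assumed figure.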
[0042] As can be seen, the use of the delayed ρ-shift mechanism ensures that the data in each lane maintains its original bit order during the calculation process, without actual shifting operations. When data needs to be written back to SRAM, the write address is adjusted according to the corresponding ρ offset, so that the data is logically shifted. For example, when a lane needs to perform an r-bit cyclic shift, the equivalent effect can be achieved by adjusting the write address offset. In this way, a large number of shift operations can be avoided in each round of calculation, thereby reducing the complexity of the logic circuit. In traditional implementations, ρ-shift and π-rearrangement are usually performed separately, thus requiring two data shift operations. This invention integrates ρ-shift and π-transformation into a single address remapping operation through a unified address calculation method. In this way, lane internal shift (ρ-transformation) and lane position rearrangement (π-operation) can be completed simultaneously in a single write operation. Through this fusion optimization method, one data shift can be reduced, thereby reducing the number of memory accesses. At the same time, since the shift operation is completed in the address calculation stage, no additional shift circuit is required.
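One possible reading of "adjusting the write address offset" is that, with bit-level addressing, bit z of a lane is written to bit position (z + r) mod 64 of the target lane. The Python sketch below models the memory as 1600 individual bits under that assumption and shows that the stored lane then equals a cyclic left shift by r, without any shifter acting on the data word. The function names and the bit-list memory model are illustrative only.

```python
def write_lane_with_offset(mem: list[int], lane_idx: int, value: int, r: int) -> None:
    """Write a 64-bit lane bit by bit, adding r (mod 64) to each bit address."""
    base = lane_idx * 64
    for z in range(64):
        mem[base + (z + r) % 64] = (value >> z) & 1


def read_lane(mem: list[int], lane_idx: int) -> int:
    """Read a 64-bit lane back from the bit-addressed memory."""
    base = lane_idx * 64
    return sum(mem[base + z] << z for z in range(64))


mem = [0] * 1600
write_lane_with_offset(mem, 17, 0x8000000000000001, 10)
stored = read_lane(mem, 17)
print(hex(stored))
```

Bit 0 lands at position 10 and bit 63 wraps around to position 9, which is exactly the result of rotating the word left by 10 bits.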
[0043] Specifically, the step of sequentially performing θ-transform, χ-transform, and ι-transform on the target data to obtain intermediate data includes: Perform a θ-transform on the target data to obtain the target data after column mixing and diffusion; The target data after column mixing and diffusion is subjected to χ transform to obtain the target data after nonlinear transformation; The target data after nonlinear transformation is subjected to ι transformation to obtain intermediate data.
[0044] The θ-transform is a linear diffusion step in Keccak. For each lane in the state matrix, its new value is equal to the original value XORed with the parity of the two adjacent columns. The data after column diffusion is still 64 bits, but each bit incorporates information from the entire column. The χ-transform is a non-linear operation in Keccak, acting on each row of the state matrix. For each lane, its new value is determined by three consecutive lanes in the same row. The ι-transform only XORs the round constant with the lane at position (0,0) in the state matrix. The round constant only affects the coordinate (0,0) and has no effect on lanes at other coordinates. Therefore, for lanes other than (0,0), the ι-transform directly passes the input value (equivalent to no operation). For the lane at position (0,0), it XORs a 64-bit round constant (different for each round).
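The three transformations described above can be sketched in Python on a 5×5 array of 64-bit lanes indexed as a[x][y]. The definitions follow the standard Keccak step mappings; the function names and the list-of-lists representation are assumptions made for illustration, not the hardware datapath of this application.

```python
MASK64 = (1 << 64) - 1


def rotl64(value: int, r: int) -> int:
    """Cyclic left shift of a 64-bit word by r bits."""
    r %= 64
    return ((value << r) | (value >> (64 - r))) & MASK64


def theta(a: list[list[int]]) -> list[list[int]]:
    """XOR each lane with the parities of the two adjacent columns."""
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    d = [c[(x - 1) % 5] ^ rotl64(c[(x + 1) % 5], 1) for x in range(5)]
    return [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]


def chi(a: list[list[int]]) -> list[list[int]]:
    """Nonlinear step: each lane combines three consecutive lanes of its row."""
    return [[a[x][y] ^ (~a[(x + 1) % 5][y] & MASK64 & a[(x + 2) % 5][y])
             for y in range(5)] for x in range(5)]


def iota(a: list[list[int]], round_constant: int) -> list[list[int]]:
    """XOR the round constant into lane (0, 0); all other lanes pass through."""
    out = [column[:] for column in a]
    out[0][0] ^= round_constant
    return out


# Single-bit input to make the diffusion of theta visible.
state = [[0] * 5 for _ in range(5)]
state[0][0] = 1
after_theta = theta(state)
intermediate = iota(chi(after_theta), 0x0000000000000001)  # first-round constant
```

With a single set bit in lane (0, 0), θ spreads it to every lane of columns 1 and 4, which shows the column diffusion in a small case.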
[0045] As can be seen, although the embodiments of this application also perform the above operations in sequence, compared with the traditional SHA-3 hardware implementation, except for the θ transform, other transforms no longer perform read operations; except for the write-back operation of the fused ρ transform and π transform, other operations no longer perform write-back operations; reducing five read and write-back operations to one can reduce the number of data moves and storage accesses, and reduce system latency and power consumption.
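The access-count claim can be made concrete with a small Python calculation. It uses only the figures given in the text: five read-and-write passes per round in the traditional flow, one in the fused flow, and 24 rounds. The constant names are illustrative, and the result counts passes, not clock cycles.

```python
ROUNDS = 24
TRADITIONAL_PASSES_PER_ROUND = 5   # theta, rho, pi, chi, iota each read and write
FUSED_PASSES_PER_ROUND = 1         # one read before theta, one fused write-back

traditional_passes = ROUNDS * TRADITIONAL_PASSES_PER_ROUND
fused_passes = ROUNDS * FUSED_PASSES_PER_ROUND
saved_passes = traditional_passes - fused_passes

print(traditional_passes, fused_passes, saved_passes)
```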
[0046] Specifically, the θ transform and the ι transform are executed by the first operation module, and the π transform and the χ transform are executed by the second operation module. The first operation module and the second operation module work in a time-division multiplexing manner.
[0047] In traditional parallel implementations, Keccak round functions typically require multiple computation modules to work simultaneously. For example, the θ operation requires column parity calculation, the χ operation involves nonlinear logic operations, and the π operation involves data rearrangement. If all computation modules exist simultaneously, a large amount of logic resources are required, thus increasing hardware area. To reduce resource consumption, this invention employs a Folded Architecture to optimize the round function design. During actual execution, the controller schedules the two modules to perform different computation tasks at different times, thereby achieving computation unit reuse. For example, the θ operation is performed in the first stage, the π operation in the second stage, the χ operation in the third stage, and the ι operation in the final stage. Through this time-sharing execution method, the same group of computation units can complete multiple computation steps in different stages, thereby reducing the demand for parallel hardware resources.
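The time-division schedule can be pictured as a small lookup table. The sketch below follows the stage order given in paragraph [0047] and the module assignment given in paragraph [0046]; it assumes one stage per scheduling slot, which this application does not specify, and the names `SCHEDULE` and `active_stage` are hypothetical.

```python
# (stage, module) pairs in the order given in the text.
SCHEDULE = [
    ("theta", "module_1"),
    ("pi", "module_2"),
    ("chi", "module_2"),
    ("iota", "module_1"),
]


def active_stage(slot: int) -> tuple[str, str]:
    """Return the (stage, module) pair that is active in a given time slot."""
    return SCHEDULE[slot % len(SCHEDULE)]


for slot in range(4):
    print(slot, *active_stage(slot))
```

Only one module is active in any slot, so two shared modules stand in for four dedicated ones.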
[0048] The main advantages of folded structures include: First, they can significantly reduce the number of computational units, thereby reducing LUT resource consumption. Second, they can reduce the overall circuit area. Third, they can implement the SHA-3 algorithm in resource-constrained devices.
[0049] Figure 3 shows a schematic diagram of a lightweight hardware implementation device for SHA-3 provided in one embodiment of this application. The device includes: a data input unit 301, used to initialize the state matrix and absorb input data into the state matrix; and a round function calculation unit 302, used to initiate the round function calculation process, wherein, when the current round number is less than or equal to the preset round number, the target data is read from the state matrix, and the θ transformation, the χ transformation, the ι transformation, and a write-back operation fusing the ρ transformation and the π transformation are performed on the target data in sequence to obtain the transformed state matrix; and, when the current round number is greater than the preset round number, the hash result is calculated and output based on the transformed state matrix.
[0050] Figure 4 shows the structure of a computer device according to an embodiment of this application, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the functions of the computer system of the SHA-3 lightweight hardware implementation method in any of the above embodiments.
[0051] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a computer, causes the computer to perform the functions of the computer system of the SHA-3 lightweight hardware implementation method in any of the above embodiments.
[0052] This application also provides a computer program product containing instructions that, when executed by a computer, cause the computer to perform the functions of the computer system of the SHA-3 lightweight hardware implementation method in any of the above embodiments.
[0053] It is understood that the specific examples in this application are only intended to help those skilled in the art better understand the implementation methods of this application, and are not intended to limit the scope of the invention.
[0054] It is understood that in the various embodiments of this application, the sequence number of each process does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiments of this application in any way.
[0055] It is understood that the various implementation methods described in this application can be implemented individually or in combination, and the implementation methods in this application are not limited in this respect.
[0056] Unless otherwise stated, all technical and scientific terms used in the embodiments of this application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The term "and/or" as used in this application includes any and all combinations of one or more of the associated listed items. The singular forms "a," "an," and "the" as used in the embodiments of this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.
[0057] It is understood that the processor in the embodiments of this application can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method embodiments can be completed by the integrated logic circuits in the processor's hardware or by instructions in software form. The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory; the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method.
[0058] It is understood that the memory in the embodiments of this application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. Specifically, non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may be random access memory (RAM). It should be noted that the memory in the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
[0059] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0060] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the aforementioned method implementations, and will not be repeated here.
[0061] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0062] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.
[0063] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0064] If a function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0065] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A lightweight hardware implementation method for SHA-3, characterized in that the method comprises: initializing a state matrix and absorbing input data into the state matrix; starting a round function calculation process; when the current round number is less than or equal to a preset round number, reading target data from the state matrix, and sequentially performing on the target data a θ transformation, a χ transformation, an ι transformation, and a write-back operation that fuses the ρ transformation and the π transformation, to obtain a transformed state matrix; and when the current round number is greater than the preset round number, calculating and outputting a hash result based on the transformed state matrix.
2. The method according to claim 1, characterized in that sequentially performing on the target data the θ transformation, the χ transformation, the ι transformation, and the write-back operation that fuses the ρ transformation and the π transformation, to obtain the transformed state matrix, comprises: sequentially applying the θ transformation, the χ transformation, and the ι transformation to the target data to obtain intermediate data; applying the π transformation to the read storage address to obtain a target storage address; determining an offset by table lookup based on the target storage address; transforming the intermediate data using the offset to obtain transformed target data; and writing the transformed target data back to the SRAM at the target storage address to obtain the transformed state matrix.
3. The method according to claim 2, characterized in that sequentially applying the θ transformation, the χ transformation, and the ι transformation to the target data to obtain the intermediate data comprises: applying the θ transformation to the target data to obtain target data after column mixing and diffusion; applying the χ transformation to the target data after column mixing and diffusion to obtain target data after nonlinear transformation; and applying the ι transformation to the target data after nonlinear transformation to obtain the intermediate data.
4. The method according to any one of claims 1-3, characterized in that the θ transformation and the ι transformation are executed by a first arithmetic module, the π transformation and the χ transformation are executed by a second arithmetic module, and the first arithmetic module and the second arithmetic module operate in a time-division multiplexed manner.
5. The method according to any one of claims 1-3, characterized in that the method further comprises: mapping each lane of the state matrix to a storage address in an SRAM, each storage address corresponding to a group of 64 physical storage units; and reading the target data from the state matrix comprises: reading the target data from the SRAM according to the obtained storage address.
6. The method according to claim 5, characterized in that the storage address of each lane in the SRAM can be calculated as follows: storage address = lane_index × 64 + bit_offset, with lane_index = 5 × y + x, where lane_index denotes the index of the lane, x denotes the column coordinate of the state matrix, y denotes the row coordinate of the state matrix, and bit_offset denotes the bit offset within the lane.
7. The method according to claim 5, characterized in that when the current round is the first round, the obtained storage address is a preset initial storage address; and when the current round is not the first round, the obtained storage address is the target storage address of the previous round.
8. A lightweight hardware implementation device for SHA-3, characterized in that the device comprises: a data input unit, used to initialize a state matrix and absorb input data into the state matrix; and a round function calculation unit, used to start a round function calculation process, wherein when the current round number is less than or equal to a preset round number, target data is read from the state matrix, and a θ transformation, a χ transformation, an ι transformation, and a write-back operation that fuses the ρ transformation and the π transformation are sequentially performed on the target data to obtain a transformed state matrix; and when the current round number is greater than the preset round number, a hash result is calculated and output based on the transformed state matrix.
9. An electronic device, characterized by comprising a processor and a memory, wherein the processor is connected to the memory, the memory is used to store a computer program, and the processor is used to invoke the computer program to perform the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-7.
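The address formula of claim 6 and the table-driven write-back of claim 2 can be illustrated with a small software model. The Python sketch below is not the claimed hardware and is offered only as an aid to reading. It assumes the standard FIPS 202 ρ offsets and the standard π mapping (x, y) → (y, (2x + 3y) mod 5), models the SRAM as a list of 25 lanes of 64 bits, and writes into a separate output buffer rather than in place. All function and table names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def lane_index(x, y):
    """Lane index as in claim 6: x is the column, y is the row."""
    return 5 * y + x


def storage_address(x, y, bit_offset=0):
    """Storage address as in claim 6: lane_index * 64 + bit_offset."""
    return lane_index(x, y) * 64 + bit_offset


def rotl64(value, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK64


def rho_offsets():
    """Assumed rho offsets r[x][y], generated as described in FIPS 202."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def pi_target(x, y):
    """Assumed pi mapping: the lane at (x, y) moves to (y, (2x + 3y) mod 5)."""
    return y, (2 * x + 3 * y) % 5


RHO = rho_offsets()

# Lookup table indexed by the TARGET lane index, so that the offset can be
# fetched from the address produced by the pi step (assumed reading of claim 2).
OFFSET_BY_TARGET = [0] * 25
for _x in range(5):
    for _y in range(5):
        _tx, _ty = pi_target(_x, _y)
        OFFSET_BY_TARGET[lane_index(_tx, _ty)] = RHO[_x][_y]


def fused_rho_pi(state):
    """Apply pi to the address, look up the offset, rotate, then write back."""
    out = [0] * 25  # separate buffer: a simplification of the in-place SRAM
    for y in range(5):
        for x in range(5):
            tx, ty = pi_target(x, y)
            target = lane_index(tx, ty)
            out[target] = rotl64(state[lane_index(x, y)], OFFSET_BY_TARGET[target])
    return out


def rho_then_pi(state):
    """Reference model: a separate rho pass followed by a separate pi pass."""
    rotated = [rotl64(state[lane_index(x, y)], RHO[x][y])
               for y in range(5) for x in range(5)]
    out = [0] * 25
    for y in range(5):
        for x in range(5):
            tx, ty = pi_target(x, y)
            out[lane_index(tx, ty)] = rotated[lane_index(x, y)]
    return out
```

Under these assumptions, `fused_rho_pi(state)` returns the same 25 lanes as `rho_then_pi(state)`, which suggests that a single table indexed by the target address is enough to fold the ρ shift into the write-back step. For the address formula, the lane at column 1, row 2 has `lane_index` 11, so bit 5 of that lane sits at address 11 × 64 + 5 = 709.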