A system and method for implementing a log softmax function in hardware

Through hardware module co-design and data flow optimization, this hardware implementation of the LogSoftmax function reduces storage resource consumption and improves computational efficiency on resource-constrained devices, addressing the high storage cost of traditional lookup-table methods.

CN120996121B · Active · Publication Date: 2026-06-12 · 58TH RES INST OF CETC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: 58TH RES INST OF CETC
Filing Date: 2025-10-23
Publication Date: 2026-06-12

IPC: G06N3/063; G06F7/575; G06N3/048

CPC: G06N3/063; G06N3/048; G06F7/575

AI Tagging

Application Domain

Digital data processing details; Physical realisation

Technical Efficacy Phrases

avoid idling; solve management problems

AI Technical Summary

Technical Problem

The hardware implementation of the LogSoftmax function suffers from high storage resource consumption and low computational efficiency in resource-constrained embedded devices. Traditional methods have failed to optimize the balance between storage efficiency and computational performance.

Method used

Through the collaborative design and data flow optimization of hardware modules, including the collaborative operation of control modules, data preprocessing modules, configuration data generation modules, LUT storage modules and computing modules, data flow and storage access are optimized. A hierarchical LUT structure and FIFO cache are adopted to support dynamic precision configuration and parallel computing. A state machine controls the efficient collaboration of each module.

Benefits of technology

It significantly reduces hardware storage resource consumption, improves computational efficiency, solves the problem of high storage resource consumption in traditional methods, and realizes an efficient hardware implementation of the LogSoftmax function.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application belongs to the technical field of artificial intelligence chip design, and particularly relates to a LogSoftmax function hardware implementation system and method thereof. The system comprises a control module, a data preprocessing module connected with the control module, a configuration data generation module connected with the control module, a LUT storage module connected with the control module and the configuration data generation module, and an operation module connected with the control module, the data preprocessing module and the LUT storage module. The data preprocessing module comprises a maximum value calculation circuit, a data parallel processing circuit and a precision conversion circuit. The configuration data generation module comprises a control flow analysis module, a data flow analysis module, a control information generation module, an exponential FIFO and a logarithmic FIFO. The LUT storage module is used for storing LUT tables of exponential values and logarithmic values. The method can significantly reduce storage resource consumption and improve computing efficiency.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence chip design technology, and specifically relates to a hardware implementation system and method for the LogSoftmax function. This system solves the technical problems of high storage resource consumption and low computational efficiency through the design of specific hardware modules.

Background Technology

[0002] In recent years, artificial intelligence (AI) technology has achieved leapfrog development driven by both scientific research innovation and industrial application. Core technologies such as image classification, target recognition, and tracking have been widely applied to scenarios including industrial inspection, medical imaging, and autonomous driving, providing a technological foundation for the intelligent transformation of industries. Deep neural networks (DNNs), as the core architecture supporting this transformation, utilize their modular design (including convolutional layers, pooling layers, activation layers, and fully connected layers) to achieve a complete process from feature extraction to decision output, demonstrating powerful pattern recognition capabilities.

[0003] In deep learning, activation functions are key components for achieving nonlinear representation in neural networks. Their role is to transform the linear output of neurons into nonlinear features, thereby endowing the network with the ability to fit complex functions. Common activation functions such as ReLU, Sigmoid, and Tanh are suitable for different task requirements in various scenarios. However, in classification tasks, especially multi-class classification problems, simple nonlinear mapping is insufficient to directly output probability distributions, leading to the need for probability normalization functions. LogSoftmax, as the logarithmic form of the Softmax function, is widely used in classification tasks, particularly in multi-class classification problems, to convert the output of neural networks into a logarithmic probability distribution. Its core function is to map the input vector to a probability space while avoiding numerical instability issues (such as overflow caused by exponential operations). Compared to directly calculating Softmax, LogSoftmax combines exponential and logarithmic operations to directly output logarithmic probabilities, significantly reducing computational complexity and improving numerical accuracy. Furthermore, LogSoftmax has a natural advantage in cross-entropy loss functions, as it can be directly combined with logarithmic probabilities, avoiding redundant calculations and thus accelerating model training and inference processes. With the increasing demand for edge computing and real-time inference, efficient hardware implementation of LogSoftmax has become a key challenge in improving model deployment performance.

[0005] The hardware implementation of LogSoftmax needs to balance mathematical characteristics with hardware resources. Its mathematical expression is:

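[0006] $\mathrm{LogSoftmax}(x_i) = \ln\left(\frac{\exp(x_i)}{\sum_{j=1}^{N}\exp(x_j)}\right) = x_i - \ln\left(\sum_{j=1}^{N}\exp(x_j)\right)$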

[0007] This formula involves exponential and logarithmic operations, requiring efficient algorithms to simplify the calculations. In LogSoftmax's hardware implementation, the lookup table method has become a key choice for edge device deployment due to its efficiency and real-time performance. By pre-storing the discrete values of the exponential and logarithmic functions and combining them with interpolation algorithms, the lookup table method can quickly complete complex calculations, significantly reducing computational latency, making it particularly suitable for scenarios with extremely high real-time requirements, such as autonomous driving and industrial control.
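As a worked step, and assuming the standard definition of LogSoftmax over an input vector $x = (x_1, \ldots, x_N)$ with $x_{max} = \max_j x_j$, subtracting the maximum leaves the result unchanged while keeping every exponent non-positive:

```latex
\begin{aligned}
\mathrm{LogSoftmax}(x_i)
  &= x_i - \ln\sum_{j=1}^{N} e^{x_j} \\
  &= x_i - \ln\Bigl(e^{x_{max}} \sum_{j=1}^{N} e^{x_j - x_{max}}\Bigr) \\
  &= (x_i - x_{max}) - \ln\sum_{j=1}^{N} e^{x_j - x_{max}}
\end{aligned}
```

Because $x_j - x_{max} \le 0$, each exponential lies in $(0, 1]$ and the sum lies in $[1, N]$, which bounds the ranges that exponential and logarithmic lookup tables have to cover.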

[0008] However, the storage overhead of lookup table methods limits their application in resource-constrained embedded devices, such as ultra-low-power sensors or micro AI chips, where the cost of storage resources may far exceed the consumption of computing resources. The hardware implementation of the LogSoftmax function typically relies on pure lookup tables, resulting in high storage overhead, making it difficult to deploy, especially in resource-constrained embedded devices. Traditional methods fail to deeply integrate with the internal hardware architecture, such as by not optimizing data flow scheduling, storage access patterns, or parallel computing units, thus hindering efficient execution at the hardware level. Optimizing storage efficiency while maintaining accuracy has become a key direction for further optimization of lookup table methods.

Summary of the Invention

[0009] The purpose of this invention is to provide a hardware implementation system and method for the LogSoftmax function. Compared with the lookup table method, this invention solves the problem of balancing storage efficiency and computing performance through the collaborative design of hardware modules and data flow optimization. This method can significantly reduce storage resource consumption and improve computing efficiency.

[0010] To address the aforementioned technical problems, this invention provides a hardware implementation system for the LogSoftmax function, comprising:

[0011] The control module is built around a state machine with five states: Idle (IDLE), Data Preprocessing (PPCS), Configuration Data Generation (GTCD), Configuration LUT (CLUT), and Operation Function Value (CCL). Based on the external enable signal and the LUT update requirement, it switches among these five states to coordinate the operation of the other hardware modules.

[0012] The data preprocessing module, connected to the control module, includes: a maximum value calculation circuit, a data parallel processing circuit, and a precision conversion circuit. The maximum value calculation circuit is used to calculate the maximum value $x_{max}$ of the input data $x$. The data parallel processing circuit is used to calculate the input variables $x_i - x_{max}$. The precision conversion circuit converts the input variables $x_i - x_{max}$ to the corresponding precision according to the control signal iof_sel output by the control module, and outputs a set of parallel, normalized and precision-converted data for subsequent LUT lookup and calculation.

[0013] A configuration data generation module, connected to the control module, includes: a control flow parsing module, a data flow parsing module, a control information generation module, an exponential FIFO, and a logarithmic FIFO. The control flow parsing module parses external configuration data according to the data format and generates control information from the configuration data of the LUT storage module. This control information includes address information and data length information. The data flow parsing module receives exponential and logarithmic values from the external data stream and writes them into the corresponding exponential and logarithmic FIFOs, respectively. The exponential and logarithmic FIFOs are used to cache the exponential and logarithmic values, respectively.

[0014] The LUT storage module, connected to the control module and the configuration data generation module, is used to store LUT tables for exponential and logarithmic values.

[0015] The computation module, connected to the control module, the data preprocessing module, and the LUT storage module, includes:

[0016] The first-level arithmetic unit consists of N parallel exponential arithmetic units. Each exponential arithmetic unit includes a fixed-point multiplier and a shifter. By performing exponential operations, it calculates the exponential values of N input variables with base e.

[0017] The second-level arithmetic unit, consisting of an accumulator, calculates the sum of the N exponential operation results in the first-level arithmetic unit by performing an accumulation operation.

[0018] The third-level arithmetic unit consists of a mantissa exponent detector, a fixed-point multiplier II, and an adder I. It calculates the base-e logarithm of the accumulated value by performing logarithmic operations.

[0019] The fourth-level arithmetic unit consists of two adders. It calculates the difference between the input variable and the logarithmic value by performing addition operations to obtain the calculation result.

[0020] The system optimizes data flow and storage access at the hardware level through the coordinated operation of various hardware modules, thereby reducing hardware storage resource consumption and improving computing efficiency.

[0021] Preferably, the exponentiation unit includes:

[0022] Fixed-point multiplier 1 uses fixed-point multiplication to calculate the product of the input variable $x_i - x_{max}$ and the constant $\log_2 e$, which yields the integer part and the fractional part of the product.

[0023] The shifter obtains, by table lookup, the LUT value corresponding to the fractional part, and then shifts the LUT value left by the integer part to obtain the final result of the exponentiation unit.

[0024] Preferably, the third-level arithmetic unit includes:

[0025] The mantissa exponent detector, which examines the accumulated value input from the second-level arithmetic unit to obtain the exponent and mantissa of the accumulated value;

[0026] Fixed-point multiplier 2, which uses fixed-point multiplication to calculate the product of the exponent of the accumulated value and the constant $\ln 2$;

[0027] Adder 1, which calculates the sum of the LUT value obtained by table lookup and the product, giving the final result of the logarithmic calculation performed by the third-level arithmetic unit.

[0028] Preferably, the state machine includes:

[0029] When in the IDLE state, upon receiving an external enable signal, the system jumps to the PPCS state for data preprocessing and simultaneously generates an enable signal for the data preprocessing module; otherwise, the system remains unchanged.

[0030] The data preprocessing state PPCS is used to control the data preprocessing module to preprocess the data. When the data preprocessing is completed, if an external signal indicates that the LUT needs to be updated, it jumps to the configuration data generation state GTCD and generates an enable signal for the configuration data generation module. If an external signal indicates that the LUT does not need to be updated, it jumps to the operation function value state CCL and generates an enable signal for the operation module. If no data processing completion signal is received, it remains unchanged.

[0031] The configuration data generation state GTCD is used to control the configuration data generation module to generate configuration data for the LUT storage module. When the configuration data is generated, it jumps to the configuration LUT state CLUT and generates the enable signal for the LUT storage module; otherwise, it remains unchanged.

[0032] The LUT configuration state CLUT is used to write LUT values to the LUT storage module. When all LUT values have been stored, it jumps to the operation function value state CCL and generates the enable signal of the operation module; otherwise, it remains unchanged.

[0033] The function value state CCL is used to control the calculation of function values by the calculation module. When the function value is calculated, it jumps to the idle state IDLE; otherwise, it remains unchanged.

[0034] This invention also provides a hardware implementation method for the LogSoftmax function, employing a hardware implementation system for the LogSoftmax function as described above, comprising:

[0035] Step S1: Preprocess the input data $x$ to obtain the input variables $x_i - x_{max}$ of the computation module, where $x_{max}$ is the maximum value in $x$ and $x_i$ is each element of the input data $x$; the preprocessing includes numerical stability handling and precision conversion operations;

[0036] Step S2: Based on the control signal from the control module, determine whether to update the LUT data; if yes, continue to step S3; otherwise, proceed to step S5.

[0037] Step S3: Parse the external configuration data stream and generate LUT-related configuration data;

[0038] Step S4: Based on the LUT-related configuration data generated in step S3, store the exponential and logarithmic values into the LUT table;

[0039] Step S5: Using the first-level arithmetic unit in the arithmetic module, calculate in parallel, in the N exponentiation units, the base-e exponential values $\exp(x_i - x_{max})$ of the input variables $x_i - x_{max}$;

[0040] Step S6: Using the second-level arithmetic unit in the arithmetic module, calculate the accumulated value $\mathrm{sum} = \sum_{i=1}^{N} \exp(x_i - x_{max})$ of the exponential values output by the N exponentiation units;

[0041] Step S7: Using the third-level arithmetic unit in the arithmetic module, calculate the logarithm $\ln(\mathrm{sum})$ of the accumulated value $\mathrm{sum}$;

[0042] Step S8: Using the fourth-level arithmetic unit in the arithmetic module, calculate the difference between the input variable $x_i - x_{max}$ and the logarithm $\ln(\mathrm{sum})$ to obtain the final result of the function, $x_i - x_{max} - \ln(\mathrm{sum})$.
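Steps S1 and S5 to S8 can be sketched as a floating-point reference model. This is a minimal Python sketch under stated assumptions, not the fixed-point circuit itself: the function name `log_softmax_reference` is hypothetical, the LUT update path of steps S2 to S4 is omitted, and `math.exp` and `math.log` stand in for the LUT-based arithmetic units.

```python
import math


def log_softmax_reference(x):
    """Floating-point model of steps S1 and S5-S8 (hypothetical helper)."""
    # S1: numerical-stability preprocessing, subtract the maximum value.
    x_max = max(x)
    z = [xi - x_max for xi in x]
    # S5: exponential of every preprocessed input (N parallel units in hardware).
    exps = [math.exp(zi) for zi in z]
    # S6: accumulate the N exponential values.
    total = sum(exps)
    # S7: natural logarithm of the accumulated value.
    ln_total = math.log(total)
    # S8: subtract the logarithm from each preprocessed input.
    return [zi - ln_total for zi in z]


if __name__ == "__main__":
    # Large inputs do not overflow because every exponent is <= 0.
    print(log_softmax_reference([1000.0, 1001.0, 1002.0]))
```

A model like this is mainly useful as a golden reference when checking the output of a fixed-point implementation.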

[0043] Compared with the prior art, the present invention has the following beneficial effects:

[0044] In the hardware implementation of this invention, the state machine of the control module schedules the operations of each hardware module. The data preprocessing module receives the input data stream, extracts the maximum value through its hardware maximum value calculation circuit, and performs fixed-point number conversion. This not only reduces storage resource consumption but also effectively avoids overflow of calculation results from the secondary arithmetic units in the arithmetic module. A hierarchical LUT structure and FIFO cache are adopted to optimize the storage access mode. The first-level arithmetic unit uses N parallel exponential arithmetic units, and a four-stage pipeline structure realizes deep parallelism of the computation task. State machine control ensures efficient collaboration among modules and avoids resource idleness. The configurable precision conversion circuit is controlled by the iof_sel signal, supporting a dynamic configuration mechanism for runtime LUT updates to address the difficulty of adapting to different precision requirements and data formats in traditional hardware implementations. The carefully designed state machine ensures correct timing of each module, the FIFO cache solves the speed matching of data production and consumption, and a clear completion signal mechanism ensures computational correctness to solve data stream management and synchronization problems. Through optimized exponential and logarithmic arithmetic unit design, a mantissa exponent detector ensures the accuracy of key calculations, and reasonable fixed-point number bit width allocation and shifting strategies solve the problem of easy accumulation of precision errors in hardware fixed-point number arithmetic. Compared to the traditional table lookup method, this invention can significantly reduce hardware storage resource consumption and improve computing efficiency.

Attached Figure Description

[0045] Figure 1 is a schematic diagram of the hardware circuit structure of the LogSoftmax function provided by the present invention.

[0046] Figure 2 is a schematic diagram of the computing module structure provided by the present invention.

[0047] Figure 3 is a schematic diagram of the exponent operation unit structure provided by the present invention.

[0048] Figure 4 is a schematic diagram of the third-level arithmetic unit structure provided by the present invention.

[0049] Figure 5 is the state transition diagram of the control module provided by the present invention.

[0050] Figure 6 is the circuit diagram of the data preprocessing module provided by the present invention.

[0051] Figure 7 is the circuit diagram of the configuration data generation module provided by the present invention.

[0052] Figure 8 is the function operation flowchart provided by the present invention.

Detailed Implementation

[0053] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of the present invention will become clearer from the following description. It should be noted that the drawings are all in a very simplified form and use non-precise proportions, and are only used to facilitate and clarify the illustration of the embodiments of the present invention.

[0054] As shown in Figure 1, this embodiment of the invention provides a hardware implementation system for the LogSoftmax function, including:

[0055] The control module is built around a state machine with five states: Idle (IDLE), Data Preprocessing (PPCS), Configuration Data Generation (GTCD), Configuration LUT (CLUT), and Operation Function Value (CCL). Based on the external enable signal and the LUT update requirement, it switches among these five states to coordinate the operation of the other hardware modules.

[0056] The data preprocessing module, connected to the control module, includes: a maximum value calculation circuit, a data parallel processing circuit, and a precision conversion circuit. The maximum value calculation circuit is used to calculate the maximum value $x_{max}$ of the input data $x$. The data parallel processing circuit is used to calculate the input variables $x_i - x_{max}$. The precision conversion circuit converts the input variables $x_i - x_{max}$ to the corresponding precision according to the control signal iof_sel output by the control module, and outputs a set of parallel, normalized and precision-converted data for subsequent LUT lookup and calculation.

[0057] A configuration data generation module, connected to the control module, includes: a control flow parsing module, a data flow parsing module, a control information generation module, an exponential FIFO, and a logarithmic FIFO. The control flow parsing module parses external configuration data according to the data format and generates control information from the configuration data of the LUT storage module. This control information includes address information and data length information. The data flow parsing module receives exponential and logarithmic values from the external data stream and writes them into the corresponding exponential and logarithmic FIFOs, respectively. The exponential and logarithmic FIFOs are used to cache the exponential and logarithmic values, respectively.

[0058] The LUT storage module, connected to the control module and the configuration data generation module, is used to store LUT tables for exponential and logarithmic values.

[0059] The computation module, connected to the control module, the data preprocessing module, and the LUT storage module, includes:

[0060] The first-level arithmetic unit consists of N parallel exponential arithmetic units. Each exponential arithmetic unit includes a fixed-point multiplier and a shifter. By performing exponential operations, it calculates the exponential values of N input variables with base e.

[0061] The second-level arithmetic unit, consisting of an accumulator, calculates the sum of the N exponential operation results in the first-level arithmetic unit by performing an accumulation operation.

[0062] The third-level arithmetic unit consists of a mantissa exponent detector, a fixed-point multiplier II, and an adder I. It calculates the base-e logarithm of the accumulated value by performing logarithmic operations.

[0063] The fourth-level arithmetic unit consists of two adders. It calculates the difference between the input variable and the logarithmic value by performing addition operations to obtain the calculation result.

[0064] The system optimizes data flow and storage access at the hardware level through the coordinated operation of various hardware modules, thereby reducing hardware storage resource consumption and improving computing efficiency.

[0065] The exponentiation unit includes:

[0066] Fixed-point multiplier 1 uses fixed-point multiplication to calculate the product of the input variable $x_i - x_{max}$ and the constant $\log_2 e$, which yields the integer part and the fractional part of the product.

[0067] The shifter obtains, by table lookup, the LUT value corresponding to the fractional part, and then shifts the LUT value left by the integer part to obtain the final result of the exponentiation unit.

[0068] The third-level arithmetic unit includes:

[0069] The mantissa exponent detector, which examines the accumulated value input from the second-level arithmetic unit to obtain the exponent and mantissa of the accumulated value;

[0070] Fixed-point multiplier 2, which uses fixed-point multiplication to calculate the product of the exponent of the accumulated value and the constant $\ln 2$;

[0071] Adder 1, which calculates the sum of the LUT value obtained by table lookup and the product, giving the final result of the logarithmic calculation performed by the third-level arithmetic unit.

[0072] As a further explanation of the embodiments of the present invention, the circuit structure of the arithmetic module is shown in Figure 2. It includes a first-level arithmetic unit, a second-level arithmetic unit, a third-level arithmetic unit, and a fourth-level arithmetic unit. The operation process includes an exponential operation stage, an accumulation stage, a logarithmic operation stage, and an addition stage. The exponential operation stage performs exponential operations through the first-level arithmetic unit to calculate the N base-e exponential values $\exp(x_1 - x_{max}), \ldots, \exp(x_N - x_{max})$. The accumulation stage performs accumulation operations through the second-level arithmetic unit to calculate the sum of the N exponential results, $\mathrm{sum}$. The logarithmic operation stage performs logarithmic operations through the third-level arithmetic unit to calculate the base-e logarithm $\ln(\mathrm{sum})$ of the accumulation result. The addition stage performs addition operations through the fourth-level arithmetic unit to calculate the difference between the input variable of the arithmetic module and the result of the third-level arithmetic unit, $x_i - x_{max} - \ln(\mathrm{sum})$, which is the result of the function operation.

[0073] The mathematical principle of the exponentiation unit in the first-level arithmetic unit is as follows. Let $(x_i - x_{max}) \cdot \log_2 e = p + q$, where $p$ and $q$ are respectively the integer part and the fractional part of $(x_i - x_{max}) \cdot \log_2 e$, with $q \in [0, 1)$; then $e^{x_i - x_{max}} = 2^{p + q} = 2^{p} \cdot 2^{q}$. Therefore, the result of the exponentiation unit can be obtained by shifting $2^{q}$. The structure of the exponentiation unit is shown in Figure 3. First, fixed-point multiplication is used to multiply $x_i - x_{max}$ by the constant $\log_2 e$ to get the integer part $p$ and the fractional part $q$ of the product. Then, the value $2^{q}$ is obtained by table lookup. Finally, the shifter shifts $2^{q}$ left by $p$ positions (a right shift when $p$ is negative), which gives the value of $2^{p} \cdot 2^{q}$, i.e. the result $e^{x_i - x_{max}}$. Clearly, under the same precision conditions, storing $2^{q}$ consumes fewer resources than storing $e^{x_i - x_{max}}$. Therefore, when implementing exponential operations, this method can reduce storage resources compared to the lookup table method.
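The multiply, look up, and shift sequence of the exponentiation unit can be sketched in integer arithmetic. This is a minimal sketch under stated assumptions: the bit widths `FRAC_BITS` and `LUT_BITS`, the table layout, and the function name `exp_unit` are hypothetical choices for illustration, since the text does not specify them.

```python
import math

FRAC_BITS = 12   # assumed number of fractional bits in the fixed-point format
LUT_BITS = 8     # assumed LUT address width: 2**8 entries covering q in [0, 1)

# Constant log2(e) in fixed point.
LOG2E_FX = round(math.log2(math.e) * (1 << FRAC_BITS))

# LUT holding 2**q for q = k / 2**LUT_BITS, stored with FRAC_BITS fractional bits.
EXP_LUT = [round(2 ** (k / (1 << LUT_BITS)) * (1 << FRAC_BITS))
           for k in range(1 << LUT_BITS)]


def exp_unit(z_fx: int) -> int:
    """Approximate exp(z) for a fixed-point z <= 0; returns a fixed-point value."""
    # Fixed-point multiply by log2(e); Python's >> floors, also for negatives.
    prod = (z_fx * LOG2E_FX) >> FRAC_BITS
    p = prod >> FRAC_BITS                    # integer part (floor), <= 0 here
    q = prod & ((1 << FRAC_BITS) - 1)        # fractional part in [0, 1)
    lut_val = EXP_LUT[q >> (FRAC_BITS - LUT_BITS)]   # 2**q from the table
    # Shift by the integer part: left for p >= 0, right for p < 0.
    return lut_val << p if p >= 0 else lut_val >> -p


if __name__ == "__main__":
    for z in (0.0, -0.5, -1.0, -3.7):
        approx = exp_unit(round(z * (1 << FRAC_BITS))) / (1 << FRAC_BITS)
        print(z, approx, math.exp(z))
```

The table only spans one octave of $2^{q}$, so its size depends on the chosen fractional resolution and not on the input range, which is the storage saving the paragraph describes.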

[0074] For any $\mathrm{sum} > 0$ there exists a real pair $(k, w)$ such that $\mathrm{sum} = k \cdot 2^{w}$, where $k \in [1, 2)$ and $w$ is an integer; then $\ln(\mathrm{sum}) = \ln k + w \cdot \ln 2$. Therefore, the third-level arithmetic unit can transform the calculation of $\ln(\mathrm{sum})$ into the calculation of $\ln k$ and the calculation of $w \cdot \ln 2$. The structure of the third-level arithmetic unit is shown in Figure 4. When performing logarithmic operations, $k$ and $w$ are first calculated using the mantissa exponent detector; then, fixed-point multiplication is used to calculate $w \cdot \ln 2$; at the same time, the value $\ln k$ is retrieved from the LUT; finally, the adder calculates $\ln k + w \cdot \ln 2$, which yields the logarithmic result. Clearly, under the same precision conditions, storing $\ln k$ consumes fewer resources than storing $\ln(\mathrm{sum})$. Therefore, when implementing logarithmic operations, this method can also reduce storage resources compared to the lookup table method.
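The mantissa and exponent split used for the logarithm can be sketched the same way. This is a minimal sketch under stated assumptions: the bit widths, the table layout, and the names `mantissa_exponent` and `log_unit` are hypothetical, and the leading-one position is found with Python's `int.bit_length` where hardware would use a priority encoder.

```python
import math

FRAC_BITS = 12   # assumed number of fractional bits in the fixed-point format
LUT_BITS = 8     # assumed LUT address width: 2**8 entries covering k in [1, 2)

# Constant ln(2) in fixed point.
LN2_FX = round(math.log(2) * (1 << FRAC_BITS))

# LUT holding ln(k) for k = 1 + j / 2**LUT_BITS, with FRAC_BITS fractional bits.
LN_LUT = [round(math.log(1 + j / (1 << LUT_BITS)) * (1 << FRAC_BITS))
          for j in range(1 << LUT_BITS)]


def mantissa_exponent(s_fx: int):
    """Split s = k * 2**w, k in [1, 2); return (w, LUT index for k)."""
    assert s_fx > 0
    msb = s_fx.bit_length() - 1          # position of the leading one
    w = msb - FRAC_BITS                  # integer exponent
    frac = s_fx - (1 << msb)             # bits below the leading one: k - 1
    # Keep the top LUT_BITS bits of the mantissa fraction as the table address.
    if msb >= LUT_BITS:
        idx = frac >> (msb - LUT_BITS)
    else:
        idx = frac << (LUT_BITS - msb)
    return w, idx


def log_unit(s_fx: int) -> int:
    """Approximate ln(s) for a fixed-point s > 0; returns a fixed-point value."""
    w, idx = mantissa_exponent(s_fx)
    # ln(s) = ln(k) + w * ln(2): one table read, one multiply, one add.
    return LN_LUT[idx] + w * LN2_FX


if __name__ == "__main__":
    for s in (1.0, 3.0, 7.9):
        approx = log_unit(round(s * (1 << FRAC_BITS))) / (1 << FRAC_BITS)
        print(s, approx, math.log(s))
```

In the LogSoftmax data path the accumulated value is at least 1, because the maximum element contributes $\exp(0) = 1$, so $w$ is non-negative there.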

[0075] The control module is implemented using a state machine, and its state transitions are shown in Figure 5. It includes the following five states:

[0076] IDLE: Idle state, the state the circuit is in when it is not working or the function result has been calculated. The state machine starts from the IDLE state, waits for an external enable signal, and jumps to the PPCS state upon receiving the external enable signal, while generating an enable signal for the data preprocessing module; otherwise, it remains in this state.

[0077] PPCS: Data Preprocessing State. This state controls the data preprocessing module to preprocess the data. After data preprocessing is complete, if an external signal indicates that the LUT needs to be updated, it jumps to the GTCD state and generates an enable signal for the configuration data generation module; if an external signal indicates that the LUT does not need to be updated, it jumps to the CCL state and generates an enable signal for the calculation module; if no data processing completion signal is received, it remains in this state.

[0078] GTCD: Configuration data generation state. This state controls the configuration data generation module to generate configuration data for the LUT storage module. Once the configuration data generation is complete, it transitions to the CLUT state and generates an enable signal for the LUT storage module; otherwise, it remains unchanged.

[0079] CLUT: LUT configuration state. This state is used to write LUT values to the LUT storage module. Once all LUT values have been stored, it jumps to the CCL state and generates an enable signal for the computation module; otherwise, it remains unchanged.

[0080] CCL: Function value state. This state controls the calculation of function values by the computation module. Once the function value calculation is complete, it jumps to the IDLE state; otherwise, it remains unchanged.
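The five states and their transitions can be sketched as a next-state function. This is a minimal behavioural sketch: the state names come from the text, while the signal names (`enable`, `prep_done`, `lut_update`, `cfg_done`, `lut_done`, `calc_done`) are hypothetical, and the enable signals generated for the downstream modules are left out.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()   # idle
    PPCS = auto()   # data preprocessing
    GTCD = auto()   # configuration data generation
    CLUT = auto()   # LUT configuration
    CCL = auto()    # function value calculation


def next_state(state, *, enable=False, prep_done=False, lut_update=False,
               cfg_done=False, lut_done=False, calc_done=False):
    """Return the next state; every state holds until its condition is met."""
    if state is State.IDLE:
        return State.PPCS if enable else State.IDLE
    if state is State.PPCS:
        if not prep_done:
            return State.PPCS
        # After preprocessing, branch on whether the LUT must be updated.
        return State.GTCD if lut_update else State.CCL
    if state is State.GTCD:
        return State.CLUT if cfg_done else State.GTCD
    if state is State.CLUT:
        return State.CCL if lut_done else State.CLUT
    # State.CCL: return to IDLE once the function value is calculated.
    return State.IDLE if calc_done else State.CCL


if __name__ == "__main__":
    s = State.IDLE
    for signals in ({"enable": True}, {"prep_done": True}, {"calc_done": True}):
        s = next_state(s, **signals)
        print(s.name)
```

When no LUT update is requested, the path is IDLE, PPCS, CCL, IDLE; the GTCD and CLUT states are visited only when the tables are rewritten.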

[0081] The data preprocessing module circuit subtracts the maximum value $x_{max}$ from each component of $x$ and uses the computation of $\exp(x_i - x_{max})$ to replace the computation of $\exp(x_i)$, which in turn enables the LUT to store values for $\exp(x_i - x_{max})$ instead of values for $\exp(x_i)$. This not only reduces storage resource consumption but also effectively prevents overflow of calculation results from the second-level arithmetic unit in the computing module. Its circuit structure is shown in Figure 6. It includes a maximum value calculation circuit, a data parallel processing circuit, and a precision conversion circuit. The maximum value calculation circuit is used to calculate the maximum value $x_{max}$ of $x$. The data parallel processing circuit is used to calculate $x_i - x_{max}$. The precision conversion circuit converts $x_i - x_{max}$ to a value with the corresponding precision according to the signal iof_sel.
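The three preprocessing circuits can be sketched as one function. This is a minimal sketch under stated assumptions: the text does not define the encoding of iof_sel, so the mapping from `iof_sel` to a number of fractional bits in `FRAC_BITS_BY_SEL` is purely hypothetical, as is the function name `preprocess`.

```python
# Hypothetical mapping from the iof_sel control value to fractional bits.
FRAC_BITS_BY_SEL = {0: 4, 1: 8, 2: 12}


def preprocess(x, iof_sel):
    """Return (fixed-point values of x_i - x_max, fractional bits used)."""
    frac_bits = FRAC_BITS_BY_SEL[iof_sel]
    # Maximum value calculation circuit.
    x_max = max(x)
    # Data parallel processing circuit: x_i - x_max for every element.
    shifted = [xi - x_max for xi in x]
    # Precision conversion circuit: quantise to the selected fixed-point format.
    return [round(v * (1 << frac_bits)) for v in shifted], frac_bits


if __name__ == "__main__":
    print(preprocess([1.0, 3.0, 2.5], iof_sel=1))
```

Every output is non-positive, so the downstream exponential values never exceed 1 regardless of the magnitude of the raw inputs.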

[0082] The LUT storage module cannot use external configuration information directly; the information must first be parsed by the configuration data generation module. The circuit structure of the configuration data generation module is shown in Figure 7 and includes a control flow parsing module, a data flow parsing module, a control information generation module, an exponential FIFO, and a logarithmic FIFO. The control flow parsing module parses the external configuration data according to the data format and generates the control information for the LUT storage module's configuration data, including address information and data length information. The data flow parsing module receives the exponential and logarithmic values from the external data stream and writes them into the corresponding FIFOs. The exponential FIFO and the logarithmic FIFO cache the exponential values and the logarithmic values, respectively.

[0083] The calculation process of the LogSoftmax function is shown in Figure 8 and includes the following steps:

[0084] Step S1: Preprocess the input data $x$ to obtain the input variables $x_i - x_{\max}$ of the computation module, where $x_{\max}$ is the maximum value in $x$ and $x_i$ is each element of the input data $x$; the preprocessing includes numerical stability handling and precision conversion operations;

[0085] Step S2: Based on the control signal from the control module, determine whether to update the LUT data; if yes, continue to step S3; otherwise, proceed to step S5.

[0086] Step S3: Parse the external configuration data stream and generate LUT-related configuration data;

[0087] Step S4: Based on the LUT-related configuration data generated in step S3, store the exponential and logarithmic values into the LUT table;

[0088] Step S5: Using the first-level arithmetic unit in the computation module, simultaneously calculate, in the N exponential arithmetic units, the base-e exponential values $e^{x_i - x_{\max}}$ of the input variables;

[0089] Step S6: Using the second-level arithmetic unit in the computation module, calculate the accumulated value $\sum_{j=1}^{N} e^{x_j - x_{\max}}$ of the results from the N exponential arithmetic units;

[0090] Step S7: Using the third-level arithmetic unit in the computation module, calculate the logarithm $\ln \sum_{j=1}^{N} e^{x_j - x_{\max}}$ of the accumulated value;

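[0091] Step S8: Using the fourth-level arithmetic unit in the computation module, calculate the difference between the input variable $x_i - x_{\max}$ and the logarithm $\ln \sum_{j=1}^{N} e^{x_j - x_{\max}}$ to obtain the final result of the function, $\mathrm{LogSoftmax}(x_i) = (x_i - x_{\max}) - \ln \sum_{j=1}^{N} e^{x_j - x_{\max}}$.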
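
The arithmetic in steps S1 and S5 to S8 can be modeled in software as follows. This is a floating-point sketch under stated assumptions, not the fixed-point circuit itself: it assumes the exponential unit multiplies by the constant log2(e) and looks up 2^f for the fractional part, and that the logarithm unit multiplies the detected exponent by ln 2 and looks up ln(m) for the mantissa. The 64-entry table size and all names (`EXP_LUT`, `LOG_LUT`, `exp_unit`, `log_unit`, `log_softmax_lut`) are hypothetical choices for illustration. The LUT update path (steps S2 to S4) is replaced by tables built once at import time.

```python
import math

LUT_BITS = 6                       # assumed LUT address width (hypothetical)
LUT_SIZE = 1 << LUT_BITS
# Exponential LUT: 2**f for fractional part f in [0, 1)
EXP_LUT = [2.0 ** (i / LUT_SIZE) for i in range(LUT_SIZE)]
# Logarithm LUT: ln(m) for mantissa m in [1, 2)
LOG_LUT = [math.log(1.0 + i / LUT_SIZE) for i in range(LUT_SIZE)]
LOG2_E = math.log2(math.e)         # assumed constant for the exponential unit
LN_2 = math.log(2.0)               # assumed constant for the logarithm unit


def exp_unit(z):
    """Approximate e**z as 2**k * 2**f, where z * log2(e) = k + f."""
    t = z * LOG2_E                 # multiplier (floating point in this sketch)
    k = math.floor(t)              # integer part
    f = t - k                      # fractional part
    idx = min(int(f * LUT_SIZE), LUT_SIZE - 1)   # clamp against rounding to 1.0
    return math.ldexp(EXP_LUT[idx], k)           # shift by the integer part


def log_unit(s):
    """Approximate ln(s) as e * ln(2) + ln(m), where s = m * 2**e, 1 <= m < 2."""
    m, e = math.frexp(s)           # frexp returns m in [0.5, 1)
    m, e = m * 2.0, e - 1          # renormalize so that m is in [1, 2)
    idx = int((m - 1.0) * LUT_SIZE)
    return e * LN_2 + LOG_LUT[idx]


def log_softmax_lut(x):
    x_max = max(x)
    z = [xi - x_max for xi in x]           # S1: subtract the maximum
    exps = [exp_unit(zi) for zi in z]      # S5: N exponential units
    total = sum(exps)                      # S6: accumulation
    log_total = log_unit(total)            # S7: logarithm of the sum
    return [zi - log_total for zi in z]    # S8: final difference
```

Both table lookups truncate their index, so the result carries a small error that shrinks as the tables grow; comparing `log_softmax_lut(x)` against a direct `math.log`/`math.exp` evaluation is a simple way to gauge the effect of a given table size.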

[0092] The above description is merely a description of preferred embodiments of the present invention and is not intended to limit the scope of the present invention in any way. Any changes or modifications made by those skilled in the art based on the above disclosure shall fall within the protection scope of the claims.

Claims

1. A system for hardware implementation of a LogSoftmax function, the system comprising:
a control module comprising a state machine with the following five states: idle (IDLE), data preprocessing (PPCS), configuration data generation (GTCD), LUT configuration (CLUT), and function value calculation (CCL); based on an external enable signal and LUT update requirements, the control module switches among the five states to control the coordinated operation of the other hardware modules;
a data preprocessing module, connected to the control module, comprising a maximum value calculation circuit, a data parallel processing circuit, and a precision conversion circuit; the maximum value calculation circuit calculates the maximum value $x_{\max}$ in the input data $x$; the data parallel processing circuit calculates the input variables $x_i - x_{\max}$; the precision conversion circuit converts the input variables $x_i - x_{\max}$ to the corresponding precision according to the control signal iof_sel output by the control module, and outputs a set of parallel, normalized, precision-converted data for subsequent LUT lookup and calculation;
a configuration data generation module, connected to the control module, comprising a control flow parsing module, a data flow parsing module, a control information generation module, an exponential FIFO, and a logarithmic FIFO; the control flow parsing module parses the external configuration data according to the data format and generates the control information for the LUT storage module's configuration data, the control information including address information and data length information; the data flow parsing module receives the exponential and logarithmic values from the external data stream and writes them into the exponential FIFO and the logarithmic FIFO, respectively; the exponential FIFO and the logarithmic FIFO cache the exponential values and the logarithmic values, respectively;
a LUT storage module, connected to the control module and the configuration data generation module, for storing the LUT tables of exponential and logarithmic values;
a computation module, connected to the control module, the data preprocessing module, and the LUT storage module, comprising:
a first-level arithmetic unit consisting of N parallel exponential arithmetic units, each including a fixed-point multiplier and a shifter, which calculates the base-e exponential values of the N input variables by performing exponential operations;
a second-level arithmetic unit consisting of an accumulator, which calculates the sum of the N exponential results of the first-level arithmetic unit by performing an accumulation operation;
a third-level arithmetic unit consisting of a mantissa exponent detector, fixed-point multiplier 2, and adder 1, which calculates the base-e logarithm of the accumulated value by performing a logarithmic operation;
a fourth-level arithmetic unit consisting of two adders, which calculates the difference between the input variable and the logarithmic value by performing addition operations to obtain the calculation result;
wherein the system optimizes data flow and storage access at the hardware level through the coordinated operation of the hardware modules, thereby reducing hardware storage resource consumption and improving computing efficiency.

2. The hardware implementation system for the LogSoftmax function as described in claim 1, characterized in that the exponential arithmetic unit includes: fixed-point multiplier 1, which uses fixed-point multiplication to calculate the product of the input variable and a constant, yielding an integer part and a fractional part; and a shifter, which obtains the LUT value corresponding to the fractional part by table lookup and then shifts that LUT value to the left by the integer part to obtain the final result of the exponential arithmetic unit.

3. The hardware implementation system for the LogSoftmax function as described in claim 1, characterized in that the third-level arithmetic unit includes: a mantissa exponent detector, which examines the accumulated value input from the second-level arithmetic unit to obtain the exponent and the mantissa of the accumulated value; fixed-point multiplier 2, which uses fixed-point multiplication to calculate the product of the exponent of the accumulated value and a constant; and adder 1, which calculates the sum of the LUT value obtained by table lookup and the product, yielding the final result of the logarithmic calculation performed by the third-level arithmetic unit.

4. The hardware implementation system for the LogSoftmax function as described in claim 1, characterized in that the state machine includes:
the idle state IDLE, in which, upon receiving an external enable signal, the state machine jumps to the data preprocessing state PPCS and simultaneously generates an enable signal for the data preprocessing module; otherwise, it remains in this state;
the data preprocessing state PPCS, which controls the data preprocessing module as it preprocesses the data; when preprocessing is complete and an external signal indicates that the LUT needs to be updated, the state machine jumps to the configuration data generation state GTCD and generates an enable signal for the configuration data generation module; if the external signal indicates that the LUT does not need to be updated, it jumps to the function value calculation state CCL and generates an enable signal for the computation module; if no preprocessing-complete signal is received, it remains in this state;
the configuration data generation state GTCD, which controls the configuration data generation module as it generates configuration data for the LUT storage module; when the configuration data has been generated, the state machine jumps to the LUT configuration state CLUT and generates an enable signal for the LUT storage module; otherwise, it remains in this state;
the LUT configuration state CLUT, which writes the LUT values to the LUT storage module; when all LUT values have been stored, the state machine jumps to the function value calculation state CCL and generates an enable signal for the computation module; otherwise, it remains in this state;
the function value calculation state CCL, which controls the computation module as it calculates the function values; when the calculation is complete, the state machine jumps to the idle state IDLE; otherwise, it remains in this state.

5. A hardware implementation method for the LogSoftmax function, employing a hardware implementation system for the LogSoftmax function as described in any one of claims 1 to 4, characterized in that the method includes:
Step S1: Preprocess the input data $x$ to obtain the input variables $x_i - x_{\max}$ of the computation module, where $x_{\max}$ is the maximum value in $x$ and $x_i$ is each element of the input data $x$; the preprocessing includes numerical stability handling and precision conversion operations;
Step S2: Based on the control signal from the control module, determine whether to update the LUT data; if yes, continue to step S3; otherwise, proceed to step S5;
Step S3: Parse the external configuration data stream and generate LUT-related configuration data;
Step S4: Based on the LUT-related configuration data generated in step S3, store the exponential and logarithmic values into the LUT table;
Step S5: Using the first-level arithmetic unit in the computation module, simultaneously calculate, in the N exponential arithmetic units, the base-e exponential values $e^{x_i - x_{\max}}$ of the input variables;
Step S6: Using the second-level arithmetic unit in the computation module, calculate the accumulated value $\sum_{j=1}^{N} e^{x_j - x_{\max}}$ of the results from the N exponential arithmetic units;
Step S7: Using the third-level arithmetic unit in the computation module, calculate the logarithm $\ln \sum_{j=1}^{N} e^{x_j - x_{\max}}$ of the accumulated value;
Step S8: Using the fourth-level arithmetic unit in the computation module, calculate the difference between the input variable $x_i - x_{\max}$ and the logarithm $\ln \sum_{j=1}^{N} e^{x_j - x_{\max}}$ to obtain the final result of the function.

Citation Information

Patent Citations

CN109308520A
GB202201358D0
