A Softmax operator splitting method for neural network inference chips

By applying normalized or non-normalized splitting on demand in a neural network inference chip, the method avoids numerical out-of-bounds errors at FP16 precision. It preserves computational accuracy at low cost, without depending on FP32 precision and without the redundant calculations that full normalization introduces.

CN122221920A · Pending · Publication Date: 2026-06-16 · JIUZHI (SUZHOU) INTELLIGENT TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: JIUZHI (SUZHOU) INTELLIGENT TECH CO LTD
Filing Date: 2026-03-17
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

In existing neural network inference chips that support only FP16 precision, numerical values are prone to going out of bounds when operators are split. Existing solutions either rely on FP32 precision, which raises cost and latency, or apply full normalization splitting to every operator, which reduces inference efficiency.
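The out-of-bounds behaviour can be reproduced on a CPU with NumPy's `float16` type. The sketch below is an illustration only, not the patent's implementation: the function names are hypothetical, and the two variants simply contrast a Softmax split without max subtraction against one with it. The largest finite FP16 value is 65504, so the exponential overflows once its input exceeds ln(65504), roughly 11.09.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0, largest finite FP16 value


def softmax_fp16_plain(x):
    """Non-normalized split (hypothetical name): exp, sum, divide, all in FP16."""
    x = np.asarray(x, dtype=np.float16)
    # Silence NumPy's overflow/invalid warnings; the inf/nan values still appear.
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        e = np.exp(x)                    # result stays float16
        s = e.sum(dtype=np.float16)
        return e / s


def softmax_fp16_normalized(x):
    """Normalized split (hypothetical name): subtract the max before exp."""
    x = np.asarray(x, dtype=np.float16)
    e = np.exp(x - x.max())              # largest exponent is exp(0) = 1
    s = e.sum(dtype=np.float16)
    return e / s


small = [1.0, 2.0, 3.0]     # exp stays far below FP16_MAX
large = [12.0, 0.0, -1.0]   # exp(12) exceeds FP16_MAX and becomes inf

plain_small = softmax_fp16_plain(small)
norm_small = softmax_fp16_normalized(small)
plain_large = softmax_fp16_plain(large)
norm_large = softmax_fp16_normalized(large)
```

For the small-range input both variants agree, so the extra max and subtract steps are redundant there. For the large-range input the plain variant produces `inf / inf`, which is NaN, while the normalized variant stays finite.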

Method used

The method collects statistics on the target data range of each layer operator in the neural network and determines whether the out-of-bounds condition is met. It then applies normalized or non-normalized splitting as needed and generates a configuration file that guides compilation. This avoids overflow and underflow of the exponential operation at FP16 precision and reduces redundant operations.
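A minimal sketch of such a range-analysis pass follows. Everything in it is an assumption made for illustration: the source does not give the exact out-of-bounds condition, the statistics collected, or the configuration file format. Here the condition is taken to be that the exponential of the observed maximum, summed over `n` elements, could exceed the FP16 maximum, or that the exponential of the observed minimum could fall below the smallest FP16 subnormal. The JSON layout and the names `out_of_bounds` and `build_config` are hypothetical.

```python
import json
import math

import numpy as np

LOG_FP16_MAX = math.log(65504.0)       # about 11.09
LOG_FP16_TINY = math.log(2.0 ** -24)   # about -16.64, smallest FP16 subnormal


def out_of_bounds(x_min, x_max, n):
    """Assumed rule: can exp(x) or its n-term sum leave the FP16 range?"""
    overflow = x_max + math.log(n) > LOG_FP16_MAX
    underflow = x_min < LOG_FP16_TINY
    return overflow or underflow


def build_config(calibration):
    """Turn per-layer calibration batches into an editable JSON config.

    calibration maps a layer name to a list of arrays whose last axis is
    the Softmax axis (hypothetical input format).
    """
    layers = {}
    for name, batches in calibration.items():
        x_min = min(float(np.min(b)) for b in batches)
        x_max = max(float(np.max(b)) for b in batches)
        n = batches[0].shape[-1]
        layers[name] = {
            "min": x_min,
            "max": x_max,
            "normalize": out_of_bounds(x_min, x_max, n),
        }
    return json.dumps({"layers": layers}, indent=2)


calibration = {
    "attn_0": [np.array([[0.5, 1.0, 2.0]])],    # narrow range
    "attn_1": [np.array([[3.0, 15.0, -2.0]])],  # exp(15) overflows FP16
}
config_text = build_config(calibration)
config = json.loads(config_text)
```

With these inputs, only `attn_1` is marked for normalized splitting; `attn_0` keeps the cheaper non-normalized form.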

Benefits of technology

The method balances accuracy and inference efficiency at FP16 precision. It removes the dependence on FP32 and reduces hardware cost and latency.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an operator splitting method for neural network inference chips. It belongs to the technical field of neural network inference chips and is suited to edge computing scenarios with strict real-time requirements. The technical problem addressed is that, in a neural network inference chip supporting only FP16 precision, numerical values are prone to overflow during operator splitting, and existing schemes either rely on FP32 precision, which increases cost and delay, or apply full normalization splitting, which reduces inference efficiency. In the technical scheme, a correction mechanism is added to the compiler. It collects statistics on the data ranges of each layer operator, such as the input data and the results of the exponential operation, and judges whether the overflow condition is met in order to mark each operator as requiring normalized or non-normalized splitting. A manually editable configuration file is output, and the compiler performs the corresponding splitting operation on each layer operator according to its mark, without promoting intermediate results to FP32 precision.
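The final step, in which the compiler splits each operator according to its mark, might look like the sketch below. It is illustrative only: the primitive operation names, the JSON layout, and the function `lower_softmax_layers` are hypothetical, and the two op sequences are the textbook decomposition of Softmax rather than anything taken from the application.

```python
import json

# Assumed primitive sequences a compiler could emit for one Softmax operator.
NORMALIZED_SPLIT = ("reduce_max", "sub", "exp", "reduce_sum", "div")
PLAIN_SPLIT = ("exp", "reduce_sum", "div")


def lower_softmax_layers(config_text):
    """Pick a split for every layer from the (hand-editable) config text."""
    config = json.loads(config_text)
    plan = {}
    for name, entry in config["layers"].items():
        plan[name] = NORMALIZED_SPLIT if entry["normalize"] else PLAIN_SPLIT
    return plan


CONFIG_TEXT = """
{
  "layers": {
    "attn_0": {"normalize": false},
    "attn_1": {"normalize": true}
  }
}
"""

plan = lower_softmax_layers(CONFIG_TEXT)

# A manual edit: an engineer forces normalization on attn_0 before compiling.
edited_text = CONFIG_TEXT.replace(
    '"attn_0": {"normalize": false}', '"attn_0": {"normalize": true}'
)
edited_plan = lower_softmax_layers(edited_text)
```

In this sketch a layer marked as non-normalized skips the `reduce_max` and `sub` steps, which is where the saving over full normalization would come from. Because the marks live in a plain text file, a per-layer decision can be overridden without rerunning the analysis.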