Mixed precision model inference acceleration method and system applied to large language model

By dynamically dividing high- and low-precision computational subgraphs into parallel executions within a large language model, the computational and storage pressure issues are resolved, improving inference speed and efficiency.

CN122242780A · Pending · Publication Date: 2026-06-19 · XINGFAN XINGQI (CHENGDU) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: XINGFAN XINGQI (CHENGDU) TECH CO LTD
Filing Date: 2026-05-22
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Large language models face computational and storage pressures during the inference phase. Existing methods fail to effectively utilize the differences in the precision sensitivity of computing nodes, resulting in wasted computing resources or decreased model performance.

Method used

By analyzing the sensitivity of numerical precision of each layer in the computation graph and the semantic complexity of the input text, the graph is dynamically divided into high-precision and low-precision computation subgraphs, and computational cores of different precision are called to perform parallel execution, generating a mixed bit-width forward propagation data stream.

Benefits of technology

While ensuring the accuracy of the inference results, it significantly reduces the amount of computation and storage required, and improves the inference speed and efficiency of large language models.

Abstract

This invention provides a method and system for accelerating mixed-precision model inference applied to large language models, belonging to the field of natural language processing technology. First, it acquires the input text sequence to be processed and the computational graph topology of a pre-trained large language model. Then, based on relevant attributes, it generates numerical representations of computational node units with bit-width compression tolerance labels and reconstructs the computational graph topology. Next, it calls different bit-width computational cores to perform parallel forward propagation processing to obtain an intermediate tensor set. Then, it performs cross-subgraph tensor interaction processing on the intermediate tensor set to generate a mixed-bit-width forward propagation data stream. Finally, it generates a text inference output sequence based on this data. This invention can reduce the computational load and storage requirements of inference, and improve inference efficiency.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and more specifically, to a method and system for accelerating inference using mixed-precision models applied to large language models.

Background Technology

[0002] Large language models, with their powerful language understanding and generation capabilities, have achieved remarkable results in numerous tasks such as machine translation, text summarization, and intelligent question answering. However, as the model size continues to increase, large language models face enormous computational and storage pressures during the inference phase. Traditional inference methods typically employ a uniform numerical representation bit width, meaning that all computational nodes in the model use data of the same precision for computation. For some computational nodes with low precision requirements, using high-precision numerical representations wastes computational resources and increases inference latency; while for some critical computational nodes, low-precision numerical representations may lead to a decline in model performance and affect the accuracy of inference results.

[0003] Furthermore, most existing methods focus on hardware-level optimization, such as using dedicated acceleration chips or optimizing hardware architecture, while neglecting optimization from the perspective of the model computation graph. These methods do not fully consider the importance of different computing nodes in the model and their different sensitivities to numerical precision, and cannot effectively reduce the computational load and storage requirements of inference while ensuring model performance.

Summary of the Invention

[0004] In view of the aforementioned problems, according to a first aspect, the present invention provides a mixed-precision model inference acceleration method applied to large language models, the method comprising:
Obtaining the text input sequence to be processed and the computation graph topology of the pre-trained large language model, wherein the computation graph topology includes multiple computation node units with initial numerical representation bit width attributes and data flow association edges between the computation node units.
Generating a numerical representation bit width compression tolerance label for each computation node unit, based on the gradient propagation path length attribute of the data flow association edges in the computation graph topology and the semantic complexity representation vector of the text input sequence to be processed, and reconstructing the computation graph topology into a first numerical representation bit width computation subgraph and a second numerical representation bit width computation subgraph based on the numerical representation bit width compression tolerance labels.
Calling the first numerical representation bit width operation core to perform forward propagation processing on the first numerical representation bit width computation subgraph, and simultaneously calling the second numerical representation bit width operation core to perform forward propagation processing on the second numerical representation bit width computation subgraph, so as to obtain the first numerical representation bit width intermediate tensor set and the second numerical representation bit width intermediate tensor set, wherein the bit width occupancy of the first numerical representation bit width operation core is greater than that of the second numerical representation bit width operation core.
Performing cross-subgraph tensor interaction processing on the first numerical representation bit width intermediate tensor set and the second numerical representation bit width intermediate tensor set, in which tensor units in the first intermediate tensor set are transmitted to the receiving computing node units in the second numerical representation bit width computation subgraph, and tensor units in the second intermediate tensor set are transmitted to the receiving computing node units in the first numerical representation bit width computation subgraph, thereby generating a hybrid bit width forward propagation data stream.
Generating the text inference output sequence based on the hybrid bit width forward propagation data stream and the output layer weight tensor of the pre-trained large language model.

[0005] Furthermore, this invention also provides a mixed-precision model inference acceleration system applied to large language models, comprising: a processor; and a machine-readable storage medium for storing machine-executable instructions of the processor; wherein the processor is configured to perform the above-described mixed-precision model inference acceleration method for large language models by executing the machine-executable instructions.

[0006] In summary, the method acquires the text input sequence to be processed and the computation graph topology of the pre-trained large language model, and generates a numerical representation bit width compression tolerance label for each computation node unit from the gradient propagation path length attribute of the data flow association edges in the computation graph topology and the semantic complexity representation vector of the text input sequence. This identifies how sensitive each computation node is to changes in numerical precision. Based on these labels, the computation graph topology is reconstructed into computation subgraphs with different numerical representation bit widths, so that computation tasks of different precisions execute in the corresponding subgraphs. Forward propagation of the two computation subgraphs is performed in parallel by calling operation cores of different numerical representation bit widths, which makes fuller use of hardware resources and improves computational efficiency. Cross-subgraph tensor interaction processing is then performed on the intermediate tensor sets generated by the two computation subgraphs to generate a hybrid bit width forward propagation data stream, so that information is passed between the computations of different precision. Finally, the text inference output sequence is generated based on the hybrid bit width forward propagation data stream and the output layer weight tensor of the pre-trained large language model. While ensuring the accuracy of the inference results, the computational load and storage requirements of inference are significantly reduced, and the inference speed and efficiency of the large language model are improved.

Attached Figure Description

[0007] Figure 1 is a schematic diagram of the execution flow of the mixed-precision model inference acceleration method for large language models provided by the present invention.

[0008] Figure 2 is a schematic diagram of exemplary hardware and software components of the mixed-precision model inference acceleration system for large language models provided by the present invention.

Detailed Implementation

[0009] Figure 1 is a flowchart illustrating a method for accelerating inference in a mixed-precision model applied to a large language model, provided by an embodiment of the present invention. A detailed description follows.

[0010] The mixed-precision model inference acceleration method for large language models provided in this application can be applied where large language models are deployed on resource-constrained edge computing devices (such as smartphones and embedded AI boxes) to accelerate inference for natural language processing tasks such as text generation, question answering, and code completion. Application scenarios include, but are not limited to: deploying medical large language models (such as models that assist in generating diagnostic reports) on portable medical devices; deploying educational large language models (such as models for intelligent Q&A and homework grading) on student learning tablets; and deploying e-commerce large language models (such as models that generate product recommendation reasons and customer service dialogues) in mobile shopping applications. The following explanation uses an e-commerce customer service dialogue scenario as an example; those skilled in the art will understand that the method also applies to other deployment scenarios, such as the medical and educational fields. The method analyzes the sensitivity of each layer in the computation graph to numerical precision loss and the semantic complexity of the input text, dynamically divides the computation graph into high-precision and low-precision computation subgraphs, and calls different computation cores to perform forward propagation on each. This significantly reduces the amount of computation and memory bandwidth while maintaining the quality of the generated text, and addresses the high latency and insufficient memory encountered when large language models run inference on edge devices.

[0011] Step S110: Obtain the text input sequence to be processed and the computation graph topology of the pre-trained large language model. The computation graph topology includes multiple computation node units with initial numerical representation bit width attributes and data flow association edges between the computation node units.

[0012] In e-commerce customer service dialogue scenarios, the text input sequence to be processed refers to a piece of natural language text input by the user, such as a customer service inquiry message sent by the user: "Is this garment too large? I want to return or exchange it." This text input sequence to be processed will be converted into a series of token identifiers by a word segmenter. A pre-trained large language model refers to a generative language model that has been trained on a large-scale e-commerce dialogue corpus. This generative language model can understand the user's inquiry intent and generate response text. The computational graph topology of this pre-trained large language model is a directed acyclic graph, where each computational node unit represents a specific mathematical operation. For example, the matrix multiplication node is responsible for performing the multiplication operation between the weight matrix and the input vector; the layer normalization node is responsible for adjusting the distribution of the input tensor to a standard distribution with a mean of 0 and a variance of 1; the attention computation node is responsible for calculating the attention weight distribution between the query vector and the key vector; and the activation function node is responsible for applying nonlinear transformations such as the GELU or ReLU functions to the input. Data flow association edges represent the connection relationship between the output port of one computational node unit and the input port of another computational node unit. Each computing node unit has an initial numerical representation bit width attribute after the model training is completed. It is usually in 32-bit floating-point format, that is, each weight value and each activation value occupies 32 bits of storage space.
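The graph structure described above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the patent's implementation: the class names `ComputeNode` and `ComputeGraph`, their fields, and the toy node names are all hypothetical and introduced here for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    name: str             # hypothetical node identifier
    op: str               # e.g. "matmul", "layernorm", "attention", "gelu"
    bit_width: int = 32   # initial bit width attribute (32-bit float in the text)


@dataclass
class ComputeGraph:
    nodes: dict = field(default_factory=dict)   # name -> ComputeNode
    edges: list = field(default_factory=list)   # (source, target) pairs

    def add_node(self, name, op, bit_width=32):
        self.nodes[name] = ComputeNode(name, op, bit_width)

    def add_edge(self, src, dst):
        # A data flow association edge links one node's output to another's input.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst))

    def successors(self, name):
        return [dst for src, dst in self.edges if src == name]


# Toy five-node chain; real models have many more nodes and branches.
graph = ComputeGraph()
for name, op in [("embed", "embedding"), ("ln1", "layernorm"),
                 ("attn", "attention"), ("ffn", "matmul"), ("act", "gelu")]:
    graph.add_node(name, op)
for src, dst in [("embed", "ln1"), ("ln1", "attn"), ("attn", "ffn"), ("ffn", "act")]:
    graph.add_edge(src, dst)
```

Storing edges as plain `(source, target)` pairs keeps the later steps (path counting and subgraph splitting) simple to express.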

[0013] Step S120: Based on the gradient propagation path length attribute of the data flow association edges in the computation graph topology and the semantic complexity representation vector of the text input sequence to be processed, generate a numerical representation bit width compression tolerance label for each computation node unit, and reconstruct the computation graph topology into a first numerical representation bit width computation subgraph and a second numerical representation bit width computation subgraph based on the numerical representation bit width compression tolerance label.

[0014] Step S121: Transform the text input sequence to be processed into an initial input embedding tensor with a sequence length dimension and a word embedding dimension, and load the initial input embedding tensor into the starting computation node unit of the computation graph topology.

[0015] The text input sequence is segmented into individual word units. For e-commerce customer service dialogue scenarios, the segmenter needs to be able to recognize specific words such as product names, size descriptions, and return / exchange intentions. By querying the word embedding table, each word unit is converted into a fixed-dimensional floating-point vector, resulting in an initial input embedding tensor with the shape (sequence length, word embedding dimension). This initial input embedding tensor is then used as input data and loaded into the first computational node unit of the computation graph topology, which is typically an embedding layer node.
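As a rough illustration of this step, the sketch below turns the example inquiry into a list-of-lists tensor of shape (sequence length, word embedding dimension). The whitespace tokenizer, the random embedding table, and the dimension of 8 are stand-in assumptions; a deployed model would use its own segmenter and trained embedding weights.

```python
import random


def tokenize(text):
    # Toy tokenizer: lowercases and splits off "?" and "." (an assumption,
    # standing in for the model's real word segmenter).
    return text.lower().replace("?", " ?").replace(".", " .").split()


def build_embedding_table(vocab, dim, seed=0):
    # Random vectors stand in for a trained word embedding table.
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


def embed(tokens, table, dim):
    unknown = [0.0] * dim  # fallback for out-of-vocabulary tokens
    return [table.get(tok, unknown) for tok in tokens]


EMBED_DIM = 8  # hypothetical word embedding dimension
tokens = tokenize("Is this garment too large? I want to return or exchange it.")
table = build_embedding_table(sorted(set(tokens)), EMBED_DIM)
embedding = embed(tokens, table, EMBED_DIM)
shape = (len(embedding), len(embedding[0]))  # (sequence length, embedding dim)
```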

[0016] Step S122: For the target computing node unit in the computing graph topology, count the total number of data flow associated edges that propagate from the starting computing node unit along the data flow associated edges to the target computing node unit, and use the total number of data flow associated edges as the gradient propagation path length attribute of the target computing node unit.

[0017] Starting from the initial computation node, a breadth-first search is performed along the data flow association edges. For each visited target computation node, the total number of data flow association edges traversed from the initial computation node to that target computation node is recorded. This total is the gradient propagation path length attribute of that target computation node. In the e-commerce customer service dialogue model, the gradient propagation path length is smaller in the embedding layer and the shallow transformer layers closer to the input, and larger in the deep transformer layers and output layers closer to the output. The gradient propagation path length reflects how strongly a layer's output influences the final output of the model: the larger the path length, the more layers the gradient passes through during backpropagation, and the impact of the layer's numerical accuracy on the final result may be amplified or attenuated.
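The breadth-first count can be sketched as follows. One assumption is made explicit: when several paths reach a node (for example through a residual connection), breadth-first search records the fewest edges, and the text does not say which path should be counted. The function name and the toy edge list are hypothetical.

```python
from collections import deque


def path_lengths(edges, start):
    """Count edges from `start` to every reachable node using BFS."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    lengths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in lengths:          # first visit gives the fewest edges
                lengths[nxt] = lengths[node] + 1
                queue.append(nxt)
    return lengths


# Toy graph with one residual edge (embed -> add1).
edges = [("embed", "ln1"), ("ln1", "attn"), ("attn", "add1"),
         ("embed", "add1"), ("add1", "mlp"), ("mlp", "out")]
lengths = path_lengths(edges, "embed")
max_length = max(lengths.values())  # L_max for the whole graph
```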

[0018] Step S123: Extract the frequency distribution data of rare word units contained in the text input sequence to be processed, and identify the maximum dependency distance value of cross-syntactic component dependency relationship chains in the text input sequence to be processed. Combine the frequency distribution data and the maximum dependency distance value into the semantic complexity representation vector.

[0019] The text input sequence is subjected to word frequency statistics to identify rare word units that appear below a preset threshold in the entire e-commerce dialogue corpus, such as specific product models "ABC123" or promotional event names "Double Eleven Big Promotion". The frequency distribution of these rare word units in the sequence is statistically analyzed to obtain a rare word unit frequency vector. Simultaneously, dependency parsing is performed on the text to identify modification relationships (such as subject-verb and verb-object relationships) between words in the sentence. A dependency tree is constructed, and the longest path length from the root node to the leaf node is found; this longest path length is the maximum dependency distance value. The rare word unit frequency vector and the maximum dependency distance value are concatenated to form a semantic complexity representation vector. Higher semantic complexity indicates that the input text contains more rare words or more complex syntactic structures, requiring the model to have higher numerical accuracy to maintain its ability to understand complex semantics.
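A minimal sketch of assembling the semantic complexity representation vector is shown below. It assumes the dependency parse is already available as a list of head indices (with -1 marking the root); no parser is included, and the corpus counts, threshold, and sample sentence are invented for illustration.

```python
from collections import Counter


def rare_word_frequencies(tokens, corpus_counts, threshold):
    # A token is "rare" if its corpus count is below the preset threshold.
    counts = Counter(tokens)
    rare = sorted(t for t in counts if corpus_counts.get(t, 0) < threshold)
    return [counts[t] / len(tokens) for t in rare]


def max_dependency_distance(heads):
    # heads[i] is the index of token i's head; -1 marks the root.
    def depth(i):
        steps = 0
        while heads[i] != -1:
            i = heads[i]
            steps += 1
        return steps

    return max(depth(i) for i in range(len(heads)))


def semantic_complexity_vector(tokens, heads, corpus_counts, threshold):
    # Concatenate the rare-word frequency vector with the longest
    # root-to-leaf path length of the dependency tree.
    freqs = rare_word_frequencies(tokens, corpus_counts, threshold)
    return freqs + [float(max_dependency_distance(heads))]


tokens = ["is", "model", "abc123", "on", "sale"]
heads = [-1, 0, 1, 4, 0]  # hand-assigned toy parse, not parser output
corpus_counts = {"is": 10000, "model": 500, "on": 9000, "sale": 800}
vector = semantic_complexity_vector(tokens, heads, corpus_counts, threshold=5)
```

Note that the vector length varies with the number of rare words found in this sketch; a fixed-length encoding would need an additional convention that the text does not specify.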

[0020] Step S124: Input the gradient propagation path length attribute of the target computing node unit into a preset first length response function to generate a first tolerance tendency value, wherein the output value of the first length response function is positively correlated with the value of the gradient propagation path length attribute; and input the maximum dependency distance value in the semantic complexity representation vector into a preset second length response function to generate a second tolerance tendency value, wherein the output value of the second length response function is positively correlated with the value of the maximum dependency distance value; add the first tolerance tendency value and the second tolerance tendency value to obtain the basic bit width compression tolerance value for the target computing node unit.

[0021] The first length response function is defined as f1(L) = L / L_max, where L is the gradient propagation path length attribute of the target computation node unit, and L_max is the maximum path length in the entire computation graph topology. The output value of this first length response function increases with L, indicating that layers closer to the output are more sensitive to quantization and have lower tolerance. The second length response function is defined as f2(D) = D / D_max, where D is the maximum dependency distance value, and D_max is the preset maximum possible dependency distance. The output value of this second length response function increases with D, indicating that the more complex the input text syntax, the higher the accuracy required by the model. Adding f1(L) and f2(D) yields the base bit-width compression tolerance value T_base = f1(L) + f2(D). The larger T_base is, the less suitable the computation node unit is for low bit-width compression.
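The two response functions and their sum translate directly into code. The sketch below follows the formulas as stated; the sample values for L, L_max, D, and D_max are made up for illustration.

```python
def f1(path_length, max_path_length):
    # First length response function: f1(L) = L / L_max
    return path_length / max_path_length


def f2(dependency_distance, max_dependency_distance):
    # Second length response function: f2(D) = D / D_max
    return dependency_distance / max_dependency_distance


def base_tolerance(path_length, max_path_length,
                   dependency_distance, max_dependency_distance):
    # T_base = f1(L) + f2(D); a larger value means the node is
    # less suitable for low bit-width compression.
    return (f1(path_length, max_path_length)
            + f2(dependency_distance, max_dependency_distance))


# Hypothetical node: 6 edges from the input in a 24-edge-deep graph,
# with an input whose dependency distance is 3 out of a preset maximum of 12.
t_base = base_tolerance(6, 24, 3, 12)
```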

[0022] Step S125: Extract the discrete distribution parameters of the historical activation tensor generated by the target computing node unit in the historical forward propagation record, compare the discrete distribution parameters of the historical activation tensor with the preset discrete comparison range, and generate a third tolerance tendency value for the target computing node unit.

[0023] Statistical information of the activation tensors output by the target computation node unit during multiple forward propagations is obtained from historical inference records. This includes the standard deviation of activation values σ_act and the range width of the activation value distribution R_act = max(act) - min(act). A range for comparing the degree of dispersion is set. If σ_act is greater than a preset first threshold or R_act is greater than a preset second threshold, the activation value distribution of the computation node unit is determined to be relatively discrete, which is prone to numerical overflow risk. Therefore, the third tolerance tendency value T_act is set to a larger value (e.g., 1.5), indicating that higher precision needs to be retained; otherwise, T_act is set to a smaller value (e.g., 0.5), indicating that lower precision is acceptable.
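The comparison against the two thresholds might look like the sketch below. The threshold values and sample activations are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
import statistics


def activation_tendency(activations, sigma_threshold, range_threshold,
                        high=1.5, low=0.5):
    """Return T_act from historical activation values of one node."""
    sigma_act = statistics.pstdev(activations)       # population std (assumed)
    range_act = max(activations) - min(activations)  # R_act = max - min
    if sigma_act > sigma_threshold or range_act > range_threshold:
        return high   # dispersed distribution: keep higher precision
    return low        # compact distribution: lower precision acceptable


# Hypothetical thresholds and activation samples.
compact = activation_tendency([0.10, 0.20, 0.15, 0.12],
                              sigma_threshold=1.0, range_threshold=4.0)
dispersed = activation_tendency([-6.0, 0.1, 0.2, 7.5],
                                sigma_threshold=1.0, range_threshold=4.0)
```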

[0024] Step S126: Multiply the base bit width compression tolerance value with the third tolerance tendency value to obtain the final bit width compression tolerance value of the target computing node unit. Based on the comparison result between the final bit width compression tolerance value and the preset tolerance judgment limit, determine the numerical representation bit width compression tolerance mark of the target computing node unit. The numerical representation bit width compression tolerance mark is used to identify the sensitivity category of the target computing node unit to numerical representation bit width compression processing.

[0025] The final bit-width compression tolerance value is calculated as T_final = T_base * T_act and compared with a preset tolerance judgment limit T_boundary. If T_final > T_boundary, the bit-width compression tolerance label of the target computing node unit is set to "High Sensitivity Category," indicating that the computing node unit is not suitable for low-bit-width compression and needs to retain 32-bit floating-point precision. If T_final ≤ T_boundary, the label is set to "Low Sensitivity Category," indicating that the computing node unit can accept low-bit-width compression (such as 8-bit or 4-bit integers).
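The labeling rule can be sketched in a few lines. The boundary value of 0.6 and the input values are illustrative assumptions; the text does not give a concrete T_boundary.

```python
HIGH = "High Sensitivity Category"
LOW = "Low Sensitivity Category"


def sensitivity_label(t_base, t_act, t_boundary):
    # T_final = T_base * T_act, compared against the preset limit.
    t_final = t_base * t_act
    return HIGH if t_final > t_boundary else LOW


# Hypothetical values: the same base tolerance, with dispersed
# versus compact activation statistics.
label_dispersed = sensitivity_label(0.5, 1.5, t_boundary=0.6)  # T_final = 0.75
label_compact = sensitivity_label(0.5, 0.5, t_boundary=0.6)    # T_final = 0.25
label_on_limit = sensitivity_label(1.0, 0.5, t_boundary=0.5)   # T_final = 0.5
```

A value exactly on the limit falls into the low sensitivity category, matching the "≤" in the rule above.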

[0026] Step S127: Traverse all computation node units in the computation graph topology and sequentially execute the steps of gradient propagation path length attribute statistics, semantic complexity representation vector combination, basic bit width compression tolerance value generation, and final bit width compression tolerance value calculation to obtain the numerical representation bit width compression tolerance label corresponding to each computation node unit.

[0027] Steps S122 to S126 are repeated for each computing node unit in the computation graph topology, so that every computing node unit is assigned a numerical representation bit width compression tolerance label (high sensitivity category or low sensitivity category).

[0028] Step S128: Based on the numerical representation bit width compression tolerance label, reconstruct the computation graph topology into a first numerical representation bit width computation subgraph and a second numerical representation bit width computation subgraph.

[0029] Step S1281: Obtain the numerical representation bit width compression tolerance tag bound to each computing node unit in the computing graph topology, and divide the computing node units corresponding to the numerical representation bit width compression tolerance tags with the same sensitivity category into the same computing node unit set.

[0030] For example, all computing node units labeled "High Sensitivity Category" are assigned to the first candidate set, and all computing node units labeled "Low Sensitivity Category" are assigned to the second candidate set.

[0031] Step S1282: Extract all computing node units bound to the numerical representation bit width compression tolerance marker with the first sensitivity category to form a first computing node unit set; at the same time, extract all computing node units bound to the numerical representation bit width compression tolerance marker with the second sensitivity category to form a second computing node unit set.

[0032] All computing node units in the first candidate set are determined as the first computing node unit set, and all computing node units in the second candidate set are determined as the second computing node unit set.

[0033] Step S1283: In the computation graph topology, retain the original data flow association edges between any two computation node units within the first set of computation node units, and remove the data flow association edges pointing from any one of the first set of computation node units to any one of the second set of computation node units, to obtain the initial first numerical representation bit width computation subgraph.

[0034] Copy the original computation graph topology. In the copied graph, retain all edges where both ends belong to the first set of computation node cells, and delete all edges pointing from nodes in the first set of computation node cells to nodes in the second set of computation node cells. After deleting the edges, the nodes in the first set of computation node cells and the retained edges constitute the initial first numerical representation bit-width computation subgraph.

[0035] Step S1284: In the computation graph topology, retain the original data flow association edges between any two computation node units within the second set of computation node units, and remove the data flow association edges pointing from any one of the computation node units in the second set of computation node units to any one of the computation node units in the first set of computation node units, to obtain the initial second numerical representation bit width computation subgraph.

[0036] Similarly, in the replicated graph, all edges where both ends belong to the second set of computational node units are retained, and all edges pointing from nodes in the second set of computational node units to nodes in the first set of computational node units are deleted. After deleting the edges, the nodes in the second set of computational node units and the retained edges constitute the initial second numerical representation bit-width computational subgraph.
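The partition and edge pruning of steps S1281 to S1284 can be sketched together. One simplification is assumed: each subgraph keeps only edges whose two endpoints are both in its own set, so cross-set edges in either direction are dropped from both subgraphs and returned separately for later use. The function name, the short labels "high" and "low", and the toy graph are hypothetical.

```python
def split_subgraphs(edges, labels):
    """Split a labeled graph into two node sets with their internal edges."""
    first_set = {n for n, label in labels.items() if label == "high"}
    second_set = {n for n, label in labels.items() if label == "low"}

    # Keep only edges whose two ends belong to the same set.
    first_edges = [(s, d) for s, d in edges
                   if s in first_set and d in first_set]
    second_edges = [(s, d) for s, d in edges
                    if s in second_set and d in second_set]
    # Edges that cross between the sets are removed from both subgraphs.
    cross_edges = [(s, d) for s, d in edges
                   if (s in first_set) != (d in first_set)]
    return (first_set, first_edges), (second_set, second_edges), cross_edges


edges = [("embed", "ln1"), ("ln1", "attn"), ("attn", "mlp"), ("mlp", "out")]
labels = {"embed": "low", "ln1": "high", "attn": "high",
          "mlp": "low", "out": "high"}
first, second, cross = split_subgraphs(edges, labels)
```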

[0037] Step S1285: In the initial first numerical representation bit width calculation subgraph, identify the first type of edge computing node units, namely those whose data flow association edges to the second computing node unit set have been removed, and add to each a virtual output port identifier pointing to the external subgraph; likewise, in the initial second numerical representation bit width calculation subgraph, identify the second type of edge computing node units, namely those whose data flow association edges to the first computing node unit set have been removed, and add to each a virtual output port identifier pointing to the external subgraph.

[0038] In the initial first numerical representation bit-width computation subgraph, identify nodes that originally had edges pointing to nodes in the second computation node unit set, but those edges have been deleted. These nodes are called first-type edge computation node units. Add a virtual output port identifier to each first-type edge computation node unit to indicate that the node's output tensor needs to be transmitted to the second numerical representation bit-width computation subgraph. Similarly, in the initial second numerical representation bit-width computation subgraph, identify nodes that originally had edges pointing to nodes in the first computation node unit set, but those edges have been deleted. These nodes are called second-type edge computation node units, and virtual output port identifiers are added to them.

[0039] Step S1286: In the initial first numerical representation bit width calculation subgraph, identify the first type of receiving computing node unit that receives data from the second computing node unit set, and add a virtual input port identifier for receiving external subgraph to the first type of receiving computing node unit; and in the initial second numerical representation bit width calculation subgraph, identify the second type of receiving computing node unit that receives data from the first computing node unit set, and add a virtual input port identifier for receiving external subgraph to the second type of receiving computing node unit.

[0040] In the initial first numerical representation bit-width computation subgraph, identify nodes that originally had edges pointing from nodes in the second computation node unit set, but those edges have been deleted. These nodes are called first-type receiving computation node units. Add a virtual input port identifier to each first-type receiving computation node unit to indicate that the node needs to receive tensors from the second numerical representation bit-width computation subgraph. Similarly, in the initial second numerical representation bit-width computation subgraph, identify second-type receiving computation node units and add virtual input port identifiers to them.
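The four boundary roles from steps S1285 and S1286 can be derived in one pass over the original edges, as in the sketch below. The role names used as dictionary keys and the toy graph are assumptions for illustration.

```python
def classify_boundary_nodes(edges, first_set, second_set):
    """Find nodes that need virtual output or input ports."""
    roles = {
        "first_edge": set(),        # in first subgraph, sends to second
        "second_edge": set(),       # in second subgraph, sends to first
        "first_receiving": set(),   # in first subgraph, receives from second
        "second_receiving": set(),  # in second subgraph, receives from first
    }
    for src, dst in edges:
        if src in first_set and dst in second_set:
            roles["first_edge"].add(src)
            roles["second_receiving"].add(dst)
        elif src in second_set and dst in first_set:
            roles["second_edge"].add(src)
            roles["first_receiving"].add(dst)
    return roles


edges = [("embed", "ln1"), ("ln1", "attn"), ("attn", "mlp"), ("mlp", "out")]
first_set = {"ln1", "attn", "out"}   # high sensitivity nodes
second_set = {"embed", "mlp"}        # low sensitivity nodes
roles = classify_boundary_nodes(edges, first_set, second_set)
```

A node can hold more than one role: here `mlp` both receives from the first subgraph and sends back to it.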

[0041] Step S1287: Based on the correspondence between the virtual output port identifier and the virtual input port identifier, construct a cross-subgraph data interaction mapping table connecting the initial first numerical representation bit width calculation subgraph and the initial second numerical representation bit width calculation subgraph. The cross-subgraph data interaction mapping table records the source port and target port pairing information of cross-subgraph tensor transmission.

[0042] For each deleted edge from a first-type edge computing node unit to a node in the second set of computing node units, a mapping record is established: the source port is the virtual output port identifier of the first-type edge computing node unit, and the target port is the virtual input port identifier of the second-type receiving computing node unit, that is, the target node in the second numerical representation bit-width computing subgraph. Deleted edges in the opposite direction, from second-type edge computing node units to first-type receiving computing node units, are recorded in the same way. All mapping records are collected to form the cross-subgraph data interaction mapping table.
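A mapping table of this kind could be represented as a list of source and target port pairs, one per deleted cross-subgraph edge. The port identifier format (`vout:` and `vin:` prefixes) and the record keys are hypothetical; the text does not define an encoding.

```python
def build_mapping_table(cross_edges):
    """Pair a virtual output port with a virtual input port per deleted edge."""
    table = []
    for src, dst in cross_edges:
        table.append({
            "source_port": f"vout:{src}",  # port on the sending edge node
            "target_port": f"vin:{dst}",   # port on the receiving node
        })
    return table


def targets_for(table, source_port):
    # Look up where a tensor leaving `source_port` must be delivered.
    return [rec["target_port"] for rec in table
            if rec["source_port"] == source_port]


# Deleted cross-subgraph edges from a toy graph, in both directions.
cross_edges = [("embed", "ln1"), ("attn", "mlp"), ("mlp", "out")]
mapping_table = build_mapping_table(cross_edges)
```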

[0043] Step S1288: Determine the initial first numerical representation bit width calculation subgraph after adding the virtual port identifier as the first numerical representation bit width calculation subgraph, and determine the initial second numerical representation bit width calculation subgraph after adding the virtual port identifier as the second numerical representation bit width calculation subgraph.

[0044] The resulting first numerical representation bit-width calculation subgraph will consist of high-sensitivity category nodes. This first numerical representation bit-width calculation subgraph will call high-precision computing cores (such as 16-bit or 32-bit floating-point cores) to perform forward propagation. The second numerical representation bit-width calculation subgraph will consist of low-sensitivity category nodes. This second numerical representation bit-width calculation subgraph will call low-precision computing cores (such as 8-bit integer cores) to perform forward propagation.

[0045] Step S130: Call the first numerical representation bit width operation core to perform forward propagation processing on the first numerical representation bit width calculation subgraph, and simultaneously call the second numerical representation bit width operation core to perform forward propagation processing on the second numerical representation bit width calculation subgraph to obtain the first numerical representation bit width intermediate tensor set and the second numerical representation bit width intermediate tensor set. The bit width occupancy of the first numerical representation bit width operation core is greater than the bit width occupancy of the second numerical representation bit width operation core.

[0046] Step S131: The first numerical representation bit width operation core and the second numerical representation bit width operation core start execution in parallel and synchronously, respectively acquiring the first subgraph adjacency relationship description data of the first numerical representation bit width calculation subgraph and the second subgraph adjacency relationship description data of the second numerical representation bit width calculation subgraph, and determining the topology execution order of the calculation node units in their respective calculation subgraphs based on their respective adjacency relationship description data.

[0047] The first numerical representation bit-width arithmetic core refers to a computing unit that supports high-precision numerical operations, such as a vector processing unit that supports 16-bit or 32-bit floating-point operations. The second numerical representation bit-width arithmetic core refers to a computing unit that supports low-precision numerical operations, such as a matrix multiplication acceleration unit that supports 8-bit integer operations. Physically, the two arithmetic cores can be independent hardware modules or different operating modes of the same hardware module. Both cores execute simultaneously, each reading the adjacency description data of its respective computational subgraph and determining the execution order of the computational node units within its subgraph using a topological sorting algorithm.
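As an illustration of how each core might derive its execution order from adjacency description data, the sketch below uses the topological sorter from the Python standard library. The adjacency format (node mapped to its successors) and the node names are assumptions made for this example; the patent text does not fix a data layout.

```python
from graphlib import TopologicalSorter

def execution_order(adjacency):
    """Return a topological order for a subgraph.

    `adjacency` maps each node to the list of nodes it feeds (its successors).
    """
    sorter = TopologicalSorter()
    for node, successors in adjacency.items():
        sorter.add(node)
        for succ in successors:
            sorter.add(succ, node)  # succ can only run after node
    return list(sorter.static_order())

# Hypothetical adjacency data for a small high-precision subgraph.
high_adjacency = {"embed": ["attn"], "attn": ["norm"], "norm": []}
order = execution_order(high_adjacency)
```

Each core would run this independently on its own subgraph, so neither order depends on the other subgraph's internal structure.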

[0048] Step S132: The first numerical representation bit width operation core transmits the first initial propagation tensor sequentially to each first subgraph calculation node unit in the first numerical representation bit width calculation subgraph according to the topological execution order of the first numerical representation bit width calculation subgraph. Each first subgraph calculation node unit performs a first numerical representation bit width multiplication and accumulation operation on the received first input tensor and the bound first weight parameter fragment to generate a first output tensor and pass it to the next first subgraph calculation node unit.

[0049] The first initial propagation tensor is the initial input embedding tensor generated in step S121. For each computation node unit in the first numerical representation bit-width computation subgraph, the first numerical representation bit-width computation core reads the weight parameter fragment bound to that node (stored in 32-bit floating-point format), reads the input tensor, performs multiplication and accumulation operations, and generates the output tensor. Due to the use of a high-precision computation core, the precision of 32-bit floating-point numbers is maintained during the computation process.

[0050] Step S133: While the first numerical representation bit-width multiplication and accumulation operation is being performed by the first numerical representation bit-width calculation core, the second numerical representation bit-width calculation core sequentially transmits the second initial propagation tensor to each second subgraph calculation node unit in the second numerical representation bit-width calculation subgraph according to the topological execution order of the second numerical representation bit-width calculation subgraph. Each second subgraph calculation node unit performs the second numerical representation bit-width multiplication and accumulation operation on the received second input tensor and the bound second weight parameter fragment to generate a second output tensor and pass it to the next second subgraph calculation node unit.

[0051] The second initial propagation tensor is also a copy of the initial input embedding tensor, but the second numerical representation bit-width computation core converts it from 32-bit floating-point format to 8-bit integer format before receiving it. For each computation node unit in the second numerical representation bit-width computation subgraph, the second numerical representation bit-width computation core reads the weight parameter fragment bound to that node (pre-quantized to 8-bit integer format), reads the input tensor (8-bit integer format), performs low-precision multiplication and accumulation operations, and generates the output tensor (the accumulation result maintains 32-bit integer precision to prevent overflow). The two computation cores execute in parallel without blocking each other.
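The two multiply-accumulate paths can be sketched as below. This is a simplified illustration under stated assumptions: 8-bit operands are taken as unsigned values in [0, 255], zero-point handling is omitted, and Python threads stand in for the two operation cores (they show non-blocking submission, not real hardware parallelism). The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def mac_fp(inputs, weights):
    """High-precision path: floating-point multiply-accumulate."""
    return sum(x * w for x, w in zip(inputs, weights))

def mac_int8(inputs, weights):
    """Low-precision path: 8-bit operands, accumulator kept within 32 bits."""
    acc = 0
    for x, w in zip(inputs, weights):
        if not (0 <= x <= 255 and 0 <= w <= 255):
            raise ValueError("operands must fit in 8 bits")
        acc += x * w
        if not INT32_MIN <= acc <= INT32_MAX:
            raise OverflowError("accumulator exceeded 32 bits")
    return acc

# Submit both paths without either one waiting for the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    high = pool.submit(mac_fp, [0.5, -1.25], [2.0, 4.0])
    low = pool.submit(mac_int8, [100, 255], [255, 2])
    high_out, low_out = high.result(), low.result()
```

A single product of two 8-bit operands already exceeds the 8-bit range, which is why the accumulator is kept at 32-bit precision.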

[0052] Step S134: During the process of the first numerical representation bit width calculation core traversing the first numerical representation bit width calculation subgraph, the first output tensor generated by the first boundary calculation node unit is copied and stored in the first boundary temporary storage area. The first boundary calculation node unit is the first subgraph calculation node unit marked as having a data flow direction associated edge pointing to the second numerical representation bit width calculation subgraph in the first subgraph adjacency relationship description data.

[0053] The first boundary computing node unit is the first type of edge computing node unit identified in step S1285. When the first numerical representation bit-width calculation core executes to this type of node, in addition to passing the output tensor to the next node in the subgraph, it also stores a copy of the output tensor in the first boundary temporary storage area (a shared memory area) for subsequent use by the second numerical representation bit-width calculation subgraph.

[0054] Step S135: During the process of the second numerical representation bit width calculation core traversing the second numerical representation bit width calculation subgraph, the second output tensor generated by the second boundary calculation node unit is copied and stored in the second boundary temporary storage area. The second boundary calculation node unit is the second subgraph calculation node unit marked as having a data flow direction associated edge pointing to the first numerical representation bit width calculation subgraph in the adjacency relationship description data of the second subgraph.

[0055] The second boundary computing node unit is the second type of edge computing node unit identified in step S1285. When the second numerical representation bit-width operation core executes to this type of node, in addition to passing the output tensor to the next node in the subgraph, it also stores a copy of the output tensor in the second boundary temporary storage area.

[0056] Step S136: After traversing all the first subgraph calculation node units in the first numerical representation bit width calculation subgraph, the first numerical representation bit width operation core aggregates and stores the first output tensors generated by each first subgraph calculation node unit as the first numerical representation bit width intermediate tensor set, and at the same time encapsulates all the tensors in the first boundary temporary storage area into the first cross-subgraph transmission data packet and pushes it to the shared data interaction area.

[0057] After the first numerical representation bit-width computation core completes the forward propagation of the entire first numerical representation bit-width computation subgraph, it collects the output tensors generated by all computation node units and organizes them into a first numerical representation bit-width intermediate tensor set according to the node order. At the same time, it packages the tensors in the first boundary temporary storage area and pushes them to the shared data interaction area, waiting for the second numerical representation bit-width computation core to read them.

[0058] Step S137: After traversing all the second subgraph calculation node units in the second numerical representation bit width calculation subgraph, the second numerical representation bit width operation core aggregates and stores the second output tensors generated by each second subgraph calculation node unit as the second numerical representation bit width intermediate tensor set. At the same time, it encapsulates all the tensors in the second boundary temporary storage area into a second cross-subgraph transmission data packet and pushes it to the shared data interaction area.

[0059] Similarly, after the second numerical representation bit width operation core completes the forward propagation of the entire second numerical representation bit width calculation subgraph, it collects all output tensors to form the second numerical representation bit width intermediate tensor set, and packages the tensors in the second boundary temporary storage area and pushes them to the shared data interaction area.

[0060] Step S140: Perform cross-subgraph tensor interaction processing on the first numerical representation bit width intermediate tensor set and the second numerical representation bit width intermediate tensor set: transmit tensor units in the first numerical representation bit width intermediate tensor set to the receiving computing node units in the second numerical representation bit width calculation subgraph, and transmit tensor units in the second numerical representation bit width intermediate tensor set to the receiving computing node units in the first numerical representation bit width calculation subgraph, thereby generating a mixed bit width forward propagation data stream.

[0061] Step S141: The second numerical representation bit-width operation core reads the first cross-subgraph transmission data packet from the shared data interaction area, parses out the first boundary output tensor and its corresponding first source port identifier carried in the first cross-subgraph transmission data packet, determines the first target port identifier paired with the first source port identifier and the corresponding first target receiving computing node unit according to the cross-subgraph data interaction mapping table, performs numerical representation bit-width boosting conversion on the first boundary output tensor and injects it into the input port of the first target receiving computing node unit. When the first target receiving computing node unit performs forward propagation, the second numerical representation bit-width operation core fuses the injected tensor with the original input tensor of the first target receiving computing node unit according to the input fusion method required by the computing node unit, and continues to propagate based on the fusion result.

[0062] When the second numerical representation bit-width computation core reaches the first type of receiving computing node unit, it first reads the corresponding first cross-subgraph transmission data packet from the shared data interaction area. The first boundary output tensor is in 32-bit floating-point format, while the second numerical representation bit-width computation subgraph uses 8-bit integer format. Therefore, a numerical representation bit-width boosting conversion needs to be performed. However, this "boosting" is actually a precision reduction adaptation conversion: converting the 32-bit floating-point number to an 8-bit integer. The conversion method is as follows: calculate the maximum and minimum values in the tensor, determine the quantization scaling factor and zero-point offset, and then perform quantization. The converted 8-bit integer tensor is injected into the input of the first target receiving computing node unit, concatenated or added with the other original input tensors of that node, and then the forward propagation continues.
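A minimal sketch of the receiving side for this direction (32-bit floating-point to 8-bit integer) follows. It assumes the mapping table is a dictionary from source port to target port and that a packet is a list of (source port, tensor) pairs; these layouts and the function names are hypothetical.

```python
def quantize_minmax(tensor, bits=8):
    """Quantize a float tensor using its own minimum and maximum values."""
    qmax = (1 << bits) - 1
    lo, hi = min(tensor), max(tensor)
    scale = (hi - lo) / qmax
    if scale == 0:          # constant tensor: any positive scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    q = [min(max(round(x / scale + zero_point), 0), qmax) for x in tensor]
    return q, scale, zero_point

def deliver(packet, mapping_table, inbox):
    """Route each boundary tensor to its target port after conversion."""
    for source_port, tensor in packet:
        target_port = mapping_table[source_port]
        inbox[target_port] = quantize_minmax(tensor)

# Hypothetical packet holding one boundary output tensor.
mapping_table = {"vout:attn_0": "vin:mlp_0"}
inbox = {}
deliver([("vout:attn_0", [-1.0, 0.0, 1.0, 3.0])], mapping_table, inbox)
q, scale, zero_point = inbox["vin:mlp_0"]
```

The scale and zero point travel with the quantized values so that the receiving node can interpret them consistently when fusing them with its other inputs.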

[0063] Step S142: The first numerical representation bit-width operation core reads the second cross-subgraph transmission data packet from the shared data interaction area, parses out the second boundary output tensor and its corresponding second source port identifier carried in the second cross-subgraph transmission data packet, determines the second target port identifier paired with the second source port identifier and the corresponding second target receiving computing node unit according to the cross-subgraph data interaction mapping table, performs numerical representation bit-width adaptation conversion on the second boundary output tensor and injects it into the input port of the second target receiving computing node unit. When the second target receiving computing node unit performs forward propagation, the first numerical representation bit-width operation core fuses the injected tensor with the original input tensor of the second target receiving computing node unit according to the input fusion method required by the computing node unit, and continues to propagate based on the fusion result.

[0064] When the first numerical representation bit-width arithmetic core reaches the second type of receiving computing node unit, it reads the corresponding second cross-subgraph transmission data packet from the shared data interaction area. The second boundary output tensor is in 8-bit integer format, while the first numerical representation bit-width computation subgraph uses 32-bit floating-point format. Therefore, a numerical representation bit-width adaptation conversion needs to be performed: the 8-bit integer is dequantized into a 32-bit floating-point number. The converted floating-point tensor is injected into the input of the second target receiving computing node unit, merged with the other input tensors of that node, and continues to propagate.

[0065] Step S143: During the forward propagation process of the remaining computation node units in the first and second numerical representation bit width calculation subgraphs, if subsequent boundary computation node units are encountered, the boundary output tensor parsing, numerical representation bit width conversion and fusion injection operations are repeated until all computation node units in the two computation subgraphs have completed the forward propagation process.

[0066] Forward propagation between two subgraphs may involve multiple layers of boundary interactions. Each time a boundary computation node is encountered, the cross-subgraph tensor transfer and transformation operations described above are repeated until all nodes in both subgraphs have been processed.

[0067] Step S144: After the first numerical representation bit-width operation core and the second numerical representation bit-width operation core have completed their respective computational subgraph traversal, they concatenate and combine the final output tensors output by the final computational node units in their respective computational subgraphs according to the original output hierarchy order of the computational graph topology to generate the hybrid bit-width forward propagation data stream.

[0068] The final outputs of the first and second numerical representation bit-width computation subgraphs are concatenated according to their output positions in the original computation graph topology to form a hybrid bit-width forward propagation data stream. This hybrid bit-width forward propagation data stream contains the complete results of the entire model's forward propagation, but different parts of it come from computation paths of varying precision.

[0069] Step S150: Generate a text inference output sequence based on the mixed bit-width forward propagation data stream and the output layer weight tensor of the pre-trained large language model.

[0070] Step S151: Obtain the terminal mixed bit-width output tensor formed by aggregating the final output component of the first numerical bit-width calculation subgraph and the final output component of the second numerical bit-width calculation subgraph.

[0071] Extract the terminal mixed bit-width output tensor from the mixed bit-width forward propagation data stream. This terminal mixed bit-width output tensor is the hidden state representation of the last layer output of the model, with a shape of (sequence length, hidden layer dimension).

[0072] Step S152: Before the aggregation of the end-mixed bit-width output tensor is formed, the final output component of the first numerical representation bit-width calculation subgraph is converted into the target output bit-width format, and the final output component of the second numerical representation bit-width calculation subgraph is converted into the target output bit-width format. Then, the aggregation is performed to obtain the target bit-width end-output tensor. The target bit-width end-output tensor is multiplied with the output layer weight tensor of the pre-trained large language model to obtain the original output score vector aligned with the vocabulary dimension. Each score element in the original output score vector corresponds to a candidate word unit in the vocabulary.

[0073] Before aggregation, the final output component of the first numerical representation bit-width computation subgraph is converted from 32-bit floating-point numbers to 16-bit floating-point numbers (target output bit-width), and the final output component of the second numerical representation bit-width computation subgraph is converted from 8-bit integers to 16-bit floating-point numbers. After aggregation, the target bit-width terminal output tensor H_last is obtained. H_last is then multiplied by the output layer weight tensor W_out to obtain the original output score vector logits = H_last * W_out^T, where the shape of logits is (sequence length, vocabulary size).
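The conversion to a common format and the projection logits = H_last * W_out^T can be illustrated with the sketch below. It assumes that the two components are concatenated along the hidden dimension and that the 8-bit component carries a scale and zero point; the numeric values and helper names are made up for the example.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dequantize(q, scale, zero_point):
    """Convert 8-bit integers back to floating-point values."""
    return [(v - zero_point) * scale for v in q]

def logits_from(h_last, w_out):
    """logits[t][v] = sum_k h_last[t][k] * w_out[v][k], i.e. H_last * W_out^T."""
    return [[sum(h * w for h, w in zip(row, w_row)) for w_row in w_out]
            for row in h_last]

high_part = [0.5, -1.0]                                        # 32-bit float component
low_part = dequantize([130, 126], scale=0.25, zero_point=128)  # 8-bit component
h_last = [[to_fp16(v) for v in high_part + low_part]]          # shape (1, 4)
w_out = [[1.0, 0.0, 1.0, 0.0],                                 # shape (2, 4): vocabulary of 2
         [0.0, 1.0, 0.0, 1.0]]
logits = logits_from(h_last, w_out)                            # shape (1, 2)
```

The result has one row per sequence position and one column per candidate word unit, matching the (sequence length, vocabulary size) shape stated above.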

[0074] Step S153: Perform normalization exponential operation on the original output score vector to generate a probability distribution vector in the vocabulary dimension. Based on the probability distribution vector, select the predicted word unit for the current inference time step and append the predicted word unit to the end of the generated word sequence. The probability distribution vector represents the probability measure of each candidate word unit as the prediction result of the current inference time step.

[0075] Applying the Softmax function to the original output score vector `logits` yields the probability distribution vector `p`, whose i-th element is `p_i = exp(logits_i) / Σ_j exp(logits_j)`. Using a greedy or Top-K sampling strategy, a word is selected from the vocabulary as the prediction result for the current time step, and its identifier is appended to the end of the generated word sequence. Here, `logits_i` refers to the original score corresponding to the i-th candidate word in the original output score vector, and `i` is the index in the vocabulary dimension, ranging from 0 to (number of candidate words - 1). `logits_j` refers to the original score corresponding to the j-th candidate word, and `j` ranges over the same indices. In the denominator of the Softmax function, `j` is the summation variable, iterating through all candidate words in the vocabulary.
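The Softmax step and the two selection strategies can be sketched as follows. Subtracting the maximum score before exponentiation is a common numerical-stability measure that leaves the result mathematically unchanged; it is an addition in this sketch, not something the text above specifies. The function names are illustrative.

```python
import math
import random

def softmax(logits):
    """p_i = exp(logits_i) / sum_j exp(logits_j), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Pick the index of the most probable candidate."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_sample(probs, k, rng):
    """Sample among the k most probable candidates, weighted by probability."""
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]

probs = softmax([1.0, 3.0, 2.0])
greedy_choice = greedy(probs)
sampled_choice = top_k_sample(probs, k=2, rng=random.Random(0))
```

Greedy selection is deterministic, while Top-K sampling restricts the randomness to the k highest-probability candidates.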

[0076] Step S154: Query the word embedding table according to the predicted lexical unit, obtain the word embedding representation vector corresponding to the predicted lexical unit, and feed the word embedding representation vector as the input embedding tensor of the next inference time step back to the starting computation node unit of the computation graph topology.

[0077] Using the identifier of the predicted term as an index, the corresponding embedding vector (32-bit floating-point format) is looked up from the word embedding table. This embedding vector is then used as the input for the next inference time step, reloaded into the starting computation node unit of the computation graph topology, and the next round of iteration generation begins.

[0078] Step S155: The updated generated lexical sequence is used as the new text input sequence to be processed. The computation graph topology reconstruction, forward propagation processing, cross-subgraph tensor interaction processing, and prediction lexical unit generation process are repeatedly triggered until the preset sequence generation termination condition is met. The sequence generation termination condition includes generating a preset sequence termination lexical unit or the length of the generated lexical sequence reaches a preset length limit.

[0079] Repeat steps S120 to S154, generating one lexical unit in each iteration. Generation stops when the generated lexical unit is a special termination lexical unit (such as <eos>) or when the sequence length reaches the preset maximum value.
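The iteration and its two termination conditions can be sketched as a simple loop. The callable `next_token` is a hypothetical stand-in for one full pass of steps S120 to S154; the toy predictor used here only serves to make the example runnable.

```python
def generate(prompt_ids, next_token, eos_id, max_len):
    """Append one predicted identifier per iteration until a stop condition holds."""
    ids = list(prompt_ids)
    while len(ids) < max_len:          # stop condition 2: length limit reached
        token = next_token(ids)
        ids.append(token)
        if token == eos_id:            # stop condition 1: termination token generated
            break
    return ids

# Toy predictor (assumption): always emits the previous identifier plus one.
toy_next = lambda ids: ids[-1] + 1

stopped_by_eos = generate([1], toy_next, eos_id=4, max_len=10)
stopped_by_length = generate([1], toy_next, eos_id=99, max_len=3)
```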

[0080] Step S156: After each generation of a predicted lexical unit, record the identifier of the predicted lexical unit in the vocabulary, and arrange the identifiers recorded at all inference time steps in the generation order to form an identifier sequence. Call the lexical decoder to process the identifier sequence, map each identifier in the identifier sequence back to the corresponding natural language lexical unit, and concatenate and combine the natural language lexical units in the order of the identifier sequence to obtain a continuous natural language text segment, which serves as the final text inference output sequence corresponding to the text input sequence to be processed.

[0081] Store the token identifiers generated at all time steps in sequence. After generation, call the token decoder to convert the identifier sequence into natural language text. For example, concatenate the identifiers such as "[CLS]", "this", "clothes", "size", "runs large", "suggestion", and "exchange" into the reply text "This piece of clothing runs large, we suggest exchanging it", which serves as the final output of the e-commerce customer service dialogue model.

[0082] Step S210: Before obtaining the input sequence of the text to be processed and the computation graph topology of the pre-trained large language model, static numerical representation bit width compression processing is performed on the initial weight parameter set contained in the pre-trained large language model to generate a compressed weight parameter set with compressed numerical representation bit width format.

[0083] Before loading the e-commerce customer service dialogue model onto the edge device, offline quantization is first performed on all weight parameters in the model. The initial set of weight parameters is stored in 32-bit floating-point format. Using a per-tensor asymmetric quantization method, the minimum value min_W and maximum value max_W of each weight tensor W are calculated. The quantization scaling factor is s = (max_W - min_W) / (2^B - 1), where B is the target quantization bit width (e.g., 8 bits), and the zero-point offset is z = round(-min_W / s). Each floating-point weight element w is quantized into an integer w_q = clamp(round(w / s + z), 0, 2^B - 1). The quantized integer weights and the quantization parameters (s, z) are stored together to obtain a compressed set of weight parameters.
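The per-tensor asymmetric quantization formulas above translate directly into the following sketch. The guard for a constant tensor (where max_W equals min_W) is an assumption added to avoid division by zero, and the weight values are made up.

```python
def quantize_weights(weights, bits=8):
    """Per-tensor asymmetric quantization: returns (w_q, s, z)."""
    qmax = (1 << bits) - 1                      # 2^B - 1
    w_min, w_max = min(weights), max(weights)
    s = (w_max - w_min) / qmax                  # scaling factor
    if s == 0:                                  # constant tensor (assumed handling)
        s = 1.0
    z = round(-w_min / s)                       # zero-point offset
    w_q = [min(max(round(w / s + z), 0), qmax) for w in weights]
    return w_q, s, z

# Made-up weight tensor spanning the range [-1.0, 1.55].
w_q, s, z = quantize_weights([-1.0, 0.0, 0.5, 1.55])
```

With this tensor the range is 2.55, so s is approximately 0.01 and z is 100: the minimum weight maps to 0 and the maximum to 255.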

[0084] Step S220: Based on the operation type attributes of each computation node unit in the computation graph topology of the pre-trained large language model, generate a corresponding weight bit width recovery trigger condition label for each computation node unit. The operation type attributes include dot product operation type and nonlinear activation operation type.

[0085] Traverse each computation node in the computation graph topology. If the node's operation type attribute is a dot product operation (such as matrix multiplication or convolution), which is sensitive to precision, generate a "restore required" weight bit width restoration trigger condition flag for this type of node, indicating that the compressed weights need to be restored to floating-point numbers during forward propagation. If the node's operation type attribute is a non-linear activation operation (such as ReLU, GELU, or Sigmoid), which is not sensitive to precision, generate a "no restoration required" weight bit width restoration trigger condition flag for this type of node, indicating that the compressed weights can be used directly for computation.
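The labeling rule can be sketched as a lookup on the operation type. The operation names and the choice to reject unlisted operation types are assumptions of this sketch; the text above only defines the dot-product and nonlinear-activation cases.

```python
RESTORE_REQUIRED = "restore required"
NO_RESTORE_REQUIRED = "no restoration required"

DOT_PRODUCT_OPS = {"matmul", "conv"}
ACTIVATION_OPS = {"relu", "gelu", "sigmoid"}

def label_nodes(op_types):
    """Map each node to its weight bit width restoration trigger condition flag."""
    flags = {}
    for node, op in op_types.items():
        if op in DOT_PRODUCT_OPS:
            flags[node] = RESTORE_REQUIRED
        elif op in ACTIVATION_OPS:
            flags[node] = NO_RESTORE_REQUIRED
        else:
            raise ValueError(f"no labeling rule for operation type {op!r}")
    return flags

# Hypothetical node-to-operation mapping.
flags = label_nodes({"q_proj": "matmul", "act": "gelu", "conv1": "conv"})
```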

[0086] Step S230: Store the compressed weight parameter set together with the weight bit width recovery trigger condition flag in the first dedicated storage area of the first numerical representation bit width operation core, and store the compressed weight parameter set together with the weight bit width recovery trigger condition flag in the second dedicated storage area of the second numerical representation bit width operation core.

[0087] The quantized set of compressed weight parameters and the recovery trigger condition flag corresponding to each weight are written to the dedicated storage area (such as cache or local memory) of the first numerical representation bit-width operation core. Simultaneously, a copy of the same data is written to the dedicated storage area of the second numerical representation bit-width operation core. Each operation core manages its own storage space independently.

[0088] Step S240: Before performing forward propagation processing on the target first subgraph computation node unit in the first numerical representation bit width computation subgraph, the first numerical representation bit width computation core reads the weight bit width recovery trigger condition flag corresponding to the target first subgraph computation node unit.

[0089] When the first numerical representation bit-width operation core is ready to execute a certain first subgraph calculation node unit, it first reads the weight bit-width recovery trigger condition flag corresponding to the node from the first dedicated storage area.

[0090] Step S250: If the weight bit width recovery trigger condition flag indicates that bit width recovery processing needs to be performed on the weight, the first numerical representation bit width operation core extracts the compressed weight fragment associated with the target first subgraph computing node unit from the first dedicated storage area, and performs numerical representation bit width recovery processing on the compressed weight fragment to obtain the restored original bit width weight fragment.

[0091] If the flag is "restore required", the first numerical representation bit-width operation core reads the compressed weight fragment and quantization parameters (s, z) corresponding to the node from the first dedicated storage area and performs the dequantization operation w_fp = (w_q - z) * s, restoring the 8-bit integer weight to a 32-bit floating-point weight.
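The dequantization formula w_fp = (w_q - z) * s can be sketched as below. The compressed fragment and its parameters are made-up values chosen for illustration.

```python
def dequantize_weights(w_q, s, z):
    """Restore floating-point weights from integers: w_fp = (w_q - z) * s."""
    return [(q - z) * s for q in w_q]

# Made-up compressed fragment with its stored quantization parameters.
w_q, s, z = [0, 100, 150, 255], 0.01, 100
w_fp = dequantize_weights(w_q, s, z)
```

Because z is an integer, it cancels out in a quantize-then-dequantize round trip, so any weight that was not clamped is recovered to within s / 2 of its original value.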

[0092] Step S260: The first numerical representation bit-width operation core uses the restored original bit-width weight fragment to participate in the forward propagation operation of the target first subgraph computation node unit, and releases the first dedicated storage area space occupied by the restored original bit-width weight fragment after completing the forward propagation operation.

[0093] Perform matrix multiplication or convolution operations using the restored 32-bit floating-point weights, and release the temporary storage space after the operation is complete.

[0094] Step S270: If the weight bit width recovery trigger condition flag indicates that bit width recovery processing of the weight is not required, then the first numerical representation bit width operation core directly extracts the compressed weight fragment associated with the target first subgraph computing node unit from the first dedicated storage area, and uses the compressed weight fragment to directly participate in the forward propagation operation processing of the target first subgraph computing node unit.

[0095] If the flag is "no restoration required", the first numerical representation bit-width operation core directly uses the compressed weights (8-bit integers) to perform the operation. For activation functions such as ReLU, the operation does not involve weights, and threshold comparison operations can be performed directly on the input tensor.

[0096] Step S280: Before performing forward propagation processing on the target second subgraph computation node unit in the second numerical representation bit-width computation subgraph, the second numerical representation bit-width computation core reads the weight bit-width recovery trigger condition flag corresponding to the target second subgraph computation node unit. If the weight bit-width recovery trigger condition flag indicates that bit-width recovery processing needs to be performed on the weight, the second numerical representation bit-width computation core extracts the compressed weight fragment associated with the target second subgraph computation node unit from the second dedicated storage area and performs numerical representation bit-width recovery processing on the compressed weight fragment to obtain the recovered original bit-width weight fragment. If the weight bit-width recovery trigger condition flag indicates that bit-width recovery processing does not need to be performed on the weight, the second numerical representation bit-width computation core directly extracts the compressed weight fragment associated with the target second subgraph computation node unit from the second dedicated storage area and uses the compressed weight fragment to directly participate in the forward propagation computation processing of the target second subgraph computation node unit.

[0097] The second numerical representation bit-width arithmetic core performs the same weight loading and recovery logic as the first numerical representation bit-width arithmetic core, but uses data in a second dedicated memory area. Since the second numerical representation bit-width arithmetic core itself is a low-precision arithmetic core, it will also perform dequantization for weights that "need to be recovered," but may choose to recover them as 16-bit floating-point numbers instead of 32-bit floating-point numbers to suit its hardware capabilities.

[0098] Step S310: During the forward propagation processing of the first numerical representation bit width calculation subgraph executed by the first numerical representation bit width calculation core, the activation value amplitude statistics and activation value fluctuation range statistics of the first subgraph activation tensor output by each first subgraph calculation node unit in the first numerical representation bit width calculation subgraph are continuously collected.

[0099] For each computation node unit in the first numerical representation bit-width computation subgraph, after the first numerical representation bit-width computation core completes the forward propagation of the node, it records the statistical characteristics of its output activation tensor, including: the maximum value max_act, the minimum value min_act, the mean value mean_act, and the standard deviation std_act of the activation value. The above information is stored in a circular buffer for subsequent activation value distribution analysis.
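A minimal sketch of the statistics collection follows, using a bounded double-ended queue as the circular buffer. The buffer capacity of 4 and the use of the population standard deviation are assumptions; the text above does not specify either.

```python
from collections import deque
import statistics

def activation_stats(tensor):
    """Summarize one node's output activation tensor (flattened to a list)."""
    return {
        "max_act": max(tensor),
        "min_act": min(tensor),
        "mean_act": statistics.fmean(tensor),
        "std_act": statistics.pstdev(tensor),  # population std (assumption)
    }

# Circular buffer: once full, the oldest record is dropped automatically.
history = deque(maxlen=4)
for step in range(6):
    history.append(activation_stats([1.0, 2.0, 3.0, 4.0 + step]))

latest = history[-1]
first_kept = history[0]
```

After six time steps only the four most recent records remain, which bounds the memory used per node regardless of how long generation runs.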

[0100] Step S320: During the forward propagation processing of the second numerical representation bit width calculation subgraph executed by the second numerical representation bit width calculation core, the activation value amplitude statistics and activation value fluctuation range statistics of the second subgraph activation tensor output by each second subgraph calculation node unit in the second numerical representation bit width calculation subgraph are continuously collected.

[0101] Similarly, for each computation node unit in the second numerical representation bit-width computation subgraph, the max_act, min_act, mean_act, and std_act of its output activation tensor are recorded.

[0102] Step S330: Input the activation value amplitude statistics and activation value fluctuation range statistics of the first subgraph activation tensor and the activation value amplitude statistics and activation value fluctuation range statistics of the second subgraph activation tensor into the activation value distribution comparison and analysis process to generate a numerical representation bit width dynamic adjustment instruction sequence for the computation graph topology.

[0103] The activation value distribution comparison and analysis process performs the following operations: For each computation node unit, compare the activation value statistics of its current inference time step with the historical statistics. If, for multiple consecutive time steps, the std_act of the output activation value of a node is consistently lower than the preset low fluctuation threshold (indicating a very stable activation value distribution), and the node currently belongs to the first numerical representation bit-width computation subgraph (high precision), then a bit-width reduction instruction is generated to migrate the node to the second numerical representation bit-width computation subgraph (low precision). Conversely, if, for multiple consecutive time steps, the max_act of the output activation value of a node is close to the preset upper limit or the min_act is close to the lower limit (indicating a risk of overflow), and the node currently belongs to the second numerical representation bit-width computation subgraph (low precision), then a bit-width increase instruction is generated to migrate the node to the first numerical representation bit-width computation subgraph (high precision).
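The comparison rule in this paragraph can be sketched as a small decision function. The thresholds (low_std, margin), the number of consecutive steps, and the 8-bit limits are illustrative assumptions; the text only states that such preset values exist.

```python
def adjustment_instruction(history, subgraph, low_std=0.05,
                           upper=127.0, lower=-128.0, margin=0.9, steps=3):
    """Return "decrease", "increase" or None for one computation node unit.

    history:  list of dicts with max_act, min_act, std_act (oldest first)
    subgraph: "first" (high precision) or "second" (low precision)
    """
    recent = history[-steps:]
    if len(recent) < steps:
        return None  # not enough consecutive time steps observed yet

    # Stable distribution on a high-precision node: migrate to low precision.
    if subgraph == "first" and all(h["std_act"] < low_std for h in recent):
        return "decrease"

    # Values close to the representable limits on a low-precision node:
    # migrate to high precision to avoid overflow.
    near_limit = all(
        h["max_act"] >= margin * upper or h["min_act"] <= margin * lower
        for h in recent
    )
    if subgraph == "second" and near_limit:
        return "increase"
    return None


stable = [{"max_act": 1.0, "min_act": 0.9, "std_act": 0.01}] * 3
hot = [{"max_act": 125.0, "min_act": -3.0, "std_act": 20.0}] * 3
```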

[0104] Step S340: The numerical representation bit width dynamic adjustment instruction sequence includes a bit width increase instruction and a bit width decrease instruction for the target computing node unit. The bit width increase instruction is used to migrate the target computing node unit from the second numerical representation bit width calculation subgraph to the first numerical representation bit width calculation subgraph, and the bit width decrease instruction is used to migrate the target computing node unit from the first numerical representation bit width calculation subgraph to the second numerical representation bit width calculation subgraph.

[0105] The bit-width increase instruction and the bit-width decrease instruction are inverse operations used to dynamically adjust the subgraph to which a computation node unit belongs.

[0106] Step S350: In response to the bit width increase instruction, locate the first target computing node unit to be migrated in the second numerical representation bit width calculation subgraph, and disconnect the data flow direction association edges between the first target computing node unit and other computing node units in the second numerical representation bit width calculation subgraph.

[0107] When a bit-width increase instruction is received, the node to be migrated specified by the instruction is first located in the second numerical representation bit-width calculation subgraph. All input and output edges of this node are deleted, temporarily isolating it from the second numerical representation bit-width calculation subgraph.

[0108] Step S360: In the first numerical representation bit width calculation subgraph, identify the upstream and downstream related computing node units that have an upstream and downstream dependency relationship with the first target computing node unit, insert the first target computing node unit into the data flow path between the upstream and downstream related computing node units, and reconstruct the data flow related edge connection.

[0109] Based on the original computation graph topology, determine the upstream and downstream nodes of this node in the first numerical representation bit-width computation subgraph (these nodes may already be in the first subgraph). Establish a path through this node between the upstream and downstream nodes, adding edges from the upstream node to this node, and edges from this node to the downstream node.

[0110] Step S370: Update the cross-subgraph data interaction mapping table, and modify the target port of the second cross-subgraph data transmission unit that originally pointed to the first target computing node unit to the new port identifier of the first target computing node unit in the first numerical representation bit width computing subgraph.

[0111] Because the first target computation node unit has changed its subgraph affiliation, data that previously needed to be transferred from the second subgraph to that node now needs to be transferred directly from within the first subgraph. Therefore, the cross-subgraph data interaction mapping table needs to be updated, relevant cross-subgraph mapping records need to be deleted, or port identifiers need to be modified.

[0112] Step S380: In response to the bit width decrease instruction, locate the second target computing node unit to be migrated in the first numerical representation bit width calculation subgraph, and disconnect the data flow direction association edges between the second target computing node unit and other computing node units in the first numerical representation bit width calculation subgraph; identify, in the second numerical representation bit width calculation subgraph, the upstream and downstream associated computing node units that have upstream and downstream dependencies on the second target computing node unit, insert the second target computing node unit into the data flow direction path between the upstream and downstream associated computing node units, and reconstruct the data flow direction association edge connection; update the cross-subgraph data interaction mapping table, and modify the target port of the first cross-subgraph data transmission unit that originally pointed to the second target computing node unit to the new port identifier of the second target computing node unit in the second numerical representation bit width calculation subgraph.

[0113] The processing flow of the bit-width decrease instruction is symmetrical to that of the bit-width increase instruction: it migrates the node from the high-precision subgraph to the low-precision subgraph and updates the cross-subgraph data interaction mapping table accordingly.
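The migration flow of steps S350 to S380 can be sketched with plain sets and dictionaries. This is a simplified illustration: subgraphs are modelled as dictionaries with a name, a node set and an edge set, and the cross-subgraph data interaction mapping table is reduced to a dictionary keyed by edge, without the port identifiers mentioned in the text. All names are hypothetical.

```python
def migrate_node(node, src, dst, original_edges, mapping_table):
    """Move `node` from subgraph `src` to subgraph `dst`.

    src, dst:       {"name": str, "nodes": set, "edges": set of (u, v)}
    original_edges: set of (u, v) edges of the full computation graph
    mapping_table:  {(u, v): {"from": name, "to": name}} for crossing edges
    """
    # Disconnect the node from its current subgraph.
    src["edges"] = {e for e in src["edges"] if node not in e}
    src["nodes"].discard(node)
    dst["nodes"].add(node)

    # Reconnect it using the original topology and refresh the mapping table.
    for u, v in original_edges:
        if node not in (u, v):
            continue
        if u in dst["nodes"] and v in dst["nodes"]:
            dst["edges"].add((u, v))          # now an internal edge
            mapping_table.pop((u, v), None)   # no longer crosses subgraphs
        else:
            mapping_table[(u, v)] = {
                "from": dst["name"] if u in dst["nodes"] else src["name"],
                "to": dst["name"] if v in dst["nodes"] else src["name"],
            }


# Chain a -> b -> c, with a in the first subgraph and b, c in the second.
edges = {("a", "b"), ("b", "c")}
first = {"name": "first", "nodes": {"a"}, "edges": set()}
second = {"name": "second", "nodes": {"b", "c"}, "edges": {("b", "c")}}
table = {("a", "b"): {"from": "first", "to": "second"}}

migrate_node("b", second, first, edges, table)  # bit-width increase for b
```

The same function handles the bit-width decrease direction by swapping the src and dst arguments.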

[0114] For example, the method may further include: step S410: after the first numerical representation bit-width operation core and the second numerical representation bit-width operation core complete a complete forward propagation process of the computation graph topology, the statistical distribution characteristics of the tensor values output by each computation node unit in the hybrid bit-width forward propagation data stream are obtained.

[0115] After the end of a complete text generation iteration (from input to output of a token), collect the output tensor statistics of all computing node units, including the maximum value, minimum value, mean, and standard deviation.

[0116] Step S420: Compare the statistical distribution characteristics of the tensor values with a preset numerical overflow determination boundary to identify the risk computing node units with numerical overflow risks in the mixed-bitwidth forward propagation data stream. The numerical overflow risks include numerical upper overflow status and numerical lower overflow status.

[0117] For the output tensor of each computing node unit, check whether its maximum value is close to the upper limit of the numerical representation bitwidth format (e.g., for 8-bit signed integers, the upper limit is 127). If it is close, there is a risk of numerical upper overflow. Check whether its minimum value is close to the lower limit (-128). If it is close, there is a risk of numerical lower overflow. At the same time, check whether there are too many extremely small values close to 0 (which may cause the gradient to vanish). If so, it is also marked as a risk of numerical lower overflow.
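The risk check above can be sketched as follows for the 8-bit signed case. The closeness margin, the magnitude treated as "extremely small", and the ratio treated as "too many" are illustrative assumptions, since the text does not give concrete values for them.

```python
def overflow_risk(values, lower=-128, upper=127,
                  margin=0.95, tiny=1e-3, tiny_ratio=0.5):
    """Return the set of risk labels ("upper", "lower") for one output tensor."""
    risks = set()
    if max(values) >= margin * upper:
        risks.add("upper")   # maximum is close to the upper limit
    if min(values) <= margin * lower:
        risks.add("lower")   # minimum is close to the lower limit
    # Count non-zero values whose magnitude is extremely small.
    near_zero = sum(1 for v in values if 0 < abs(v) < tiny)
    if near_zero / len(values) > tiny_ratio:
        risks.add("lower")   # too many values close to 0
    return risks
```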

[0118] Step S430: For the identified risk computing node units, insert a numerical scaling processing node unit in the numerical representation bitwidth computing subgraph to which the risk computing node unit belongs. The numerical scaling processing node unit is used to perform amplitude scaling adjustment on the tensor input to the risk computing node unit.

[0119] For a computing node unit with a risk of numerical upper overflow, insert a scaling node at its input end. This scaling node multiplies the input tensor by a scaling factor s (0 < s < 1) to compress the numerical range into a safe interval. For a computing node unit with a risk of numerical lower overflow, insert an amplification node to multiply the input tensor by a scaling factor greater than 1.

[0120] Step S440: During the forward propagation process of the risk computing node unit, the numerical scaling processing node unit multiplies the original input tensor input to the risk computing node unit by a preset scaling factor to obtain a scaled input tensor, and uses the scaled input tensor as the actual input of the risk computing node unit.

[0121] The scaling node performs the operation X_scaled = X * s, where s is a pre-computed scaling factor. For example, if the maximum value of the original input tensor is 200 and the upper limit of 8-bit integers is 127, then s = 127 / 200 = 0.635 is selected to compress the input range into [-127, 127].

[0122] Step S450: The risk computing node unit that has the risk of numerical overflow performs forward propagation operation based on the scaled input tensor to generate a scaled output tensor. Before transmitting the scaled output tensor to the downstream computing node unit, it performs inverse scaling on the scaled output tensor. The inverse scaling is to divide the scaled output tensor by the scaling factor to restore it to the original numerical range.

[0123] After the computing node completes the operation, it obtains the output tensor Y_scaled = f(X_scaled). Since the input was scaled, the output also needs to be restored accordingly: Y = Y_scaled / s, before it is passed to the downstream node.
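The scaling and inverse scaling of steps S440 and S450 can be sketched with the numbers from the example (maximum 200, 8-bit limit 127). The helper names are hypothetical. Note that dividing by s recovers the unscaled output exactly only when the node function f is linear in its input; the sketch uses such a function.

```python
def scaling_factor(max_abs, limit=127.0):
    """Scaling factor s that maps [-max_abs, max_abs] into [-limit, limit]."""
    return 1.0 if max_abs <= limit else limit / max_abs


def run_with_scaling(f, x, s):
    """Compute Y = f(X * s) / s for a node function f acting on a list."""
    x_scaled = [v * s for v in x]          # X_scaled = X * s
    y_scaled = f(x_scaled)                 # Y_scaled = f(X_scaled)
    return [v / s for v in y_scaled]       # Y = Y_scaled / s


def double(xs):
    """Stand-in linear node: multiplies every element by 2."""
    return [2.0 * v for v in xs]


s = scaling_factor(200.0)                  # 127 / 200 = 0.635
y = run_with_scaling(double, [200.0, -100.0], s)
```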

[0124] Step S460: Record the numerical scaling processing node unit insertion operation performed on the risk calculation node unit, and generate a numerical scaling processing record entry, which includes the identifier of the inserted scaling processing node unit and its corresponding scaling coefficient value.

[0125] Record the scaling node information (node ID, scaling factor s) for each insertion into a list so that it can be reused or adjusted in subsequent inference time steps.

[0126] Step S470: In subsequent inference time steps, if the change in the input sequence of the text to be processed causes the tensor numerical statistical distribution characteristics of the computation node units previously identified as having a risk of numerical overflow to return to the normal numerical range, then the previously inserted numerical scaling processing node units are located according to the numerical scaling processing record entries and removed from the computation subgraph.

[0127] In subsequent inference iterations, the output statistics of previously risky nodes are continuously monitored. If the value range of the node is within the safe range for multiple consecutive time steps, the previously inserted scaling node is removed, and the original computation path is restored.

[0128] Step S480: If a new computing node unit is detected to have a risk of numerical overflow in a subsequent inference time step, a new numerical scaling processing node unit is dynamically inserted into the numerical representation bit width calculation subgraph to which the new computing node unit belongs, based on the comparison result of the numerical overflow judgment boundary.

[0129] For newly emerging risk nodes, new scaling nodes are dynamically inserted according to steps S430 to S450.

[0130] Step S490: Synchronize the insertion and removal process of the numerical scaling processing node unit with the reconstruction process of the computation graph topology. Each time the computation graph topology reconstruction is performed, apply the accumulated numerical scaling processing record entries to the newly generated first numerical representation bit width computation subgraph and second numerical representation bit width computation subgraph.

[0131] When a change in the input text triggers a reconstruction of the computation graph topology, all currently accumulated numerical scaling records are applied to the newly generated subgraph to ensure the continuity of the scaling strategy.

[0132] Step S4100: During the process of generating the text inference output sequence, the statistical distribution characteristics of tensor values in the mixed bit-width forward propagation data stream are continuously monitored, and the distribution position and activation state of the numerical scaling processing node unit in the computation subgraph are dynamically adjusted according to the monitoring results.

[0133] Steps S410 to S490 are executed repeatedly to achieve online adaptive adjustment of the numerical scaling strategy.

[0134] For example, the method may further include: step S510: when the length of the text input sequence to be processed exceeds a preset sequence length division threshold, the text input sequence to be processed is divided into a preceding text segment unit and a following text segment unit with contextual relationship.

[0135] When a user enters a long e-commerce customer service inquiry text (e.g., more than 512 words), such as when the user describes a return or exchange request for multiple products, the text is segmented into multiple segments. The first segment contains the first 512 words as the preceding text segment unit, and the second segment contains the remaining words as the following text segment unit. The two segments are semantically continuous.
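A minimal sketch of the segmentation step, assuming the input has already been split into a list of word units and using the 512-unit threshold from the example. The function name and the (start, end, units) tuple layout are illustrative; the start and end indices correspond to the position information that step S530 later attaches to each segment.

```python
def split_into_segments(tokens, threshold=512):
    """Split a list of word units into consecutive segments.

    Returns a list of (start_index, end_index, units) tuples, where the
    indices are inclusive positions in the original sequence.
    """
    return [
        (start, min(start + threshold, len(tokens)) - 1,
         tokens[start:start + threshold])
        for start in range(0, len(tokens), threshold)
    ]


segments = split_into_segments(list(range(700)))
```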

[0136] Step S520: Call the first numerical representation bit-width operation core to perform hybrid bit-width forward propagation processing on the preceding text fragment unit based on the first numerical representation bit-width calculation subgraph and the second numerical representation bit-width calculation subgraph, and generate a preceding key-value cache tensor set corresponding to the preceding text fragment unit. The preceding key-value cache tensor set contains the key vector tensor and value vector tensor output by each word unit in the preceding text fragment unit at each layer of the computation graph topology.

[0137] Perform standard forward propagation on the preceding text fragment. In the self-attention computation of each layer, save the calculated key vector K and value vector V to form a key-value cache tensor set. For a model with L layers, this key-value cache tensor set contains L pairs of (K, V) tensors.

[0138] Step S530: Store the preceding key-value cache tensor set in a key-value cache dedicated storage space shared and accessed by the first numerical representation bit-width operation core and the second numerical representation bit-width operation core, and attach a fragment position offset identifier to the preceding key-value cache tensor set. The fragment position offset identifier is used to record the start and end position index information of the preceding text fragment unit in the original text input sequence to be processed.

[0139] The key-value cache of the preceding segment is written to the shared memory region, and the start position 0 and end position 511 of the segment in the original sequence are recorded. At the same time, the original position encoding of the words in the segment is recorded so that they can be correctly aligned when calculating attention for subsequent segments.

[0140] Step S540: When the second numerical representation bit width operation core performs hybrid bit width forward propagation processing based on the first numerical representation bit width calculation subgraph and the second numerical representation bit width calculation subgraph on the subsequent text segment unit, it reads the preceding key value cache tensor set and the segment position offset identifier through the key value cache dedicated storage space.

[0141] When processing the subsequent text fragment, the second numerical representation bit-width operation core first reads the key-value cache and the fragment position offset identifier of the preceding fragment from shared memory.

[0142] Step S550: The second numerical representation bit-width operation core, based on the fragment position offset identifier, concatenates the key vector tensor and value vector tensor in the preceding key-value cache tensor set in front of the current key vector tensor and current value vector tensor that the subsequent text fragment unit generates in the current inference layer, at the corresponding position offset, to generate the concatenated complete key vector tensor sequence and complete value vector tensor sequence.

[0143] In the self-attention calculation of each layer, the sequence length of the current key vector K_cur generated by the subsequent segment is L_cur, and the sequence length of the cached key vector K_prev of the preceding segment is L_prev. K_prev and K_cur are concatenated along the sequence dimension to obtain K_full = concat(K_prev, K_cur), with a total length of L_prev + L_cur. Similarly, V_full is obtained.

[0144] Step S560: During the forward propagation process of the subsequent computation node units in the computation graph topology, the second numerical representation bit-width operation core performs cross-segment attention weight calculation processing using the complete key vector tensor sequence and the complete value vector tensor sequence, so that each word unit in the subsequent text segment unit can pay attention to the semantic information carried by the word units in the preceding text segment unit according to the result of the cross-segment attention weight calculation processing.

[0145] During self-attention computation, for the i-th lexical unit in the subsequent segment, attention weights are calculated between its query vector Q_i and all key vectors in K_full, including the key vectors of the preceding segment. This allows the subsequent lexical unit to "see" the information of the preceding lexical unit, achieving cross-segment contextual understanding.
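The concatenation of step S550 and the cross-segment attention of step S560 can be sketched for a single query vector. This illustration assumes standard scaled dot-product attention with a softmax, uses plain Python lists, and omits causal masking among the word units of the subsequent segment; the tiny two-dimensional vectors are made up for the example.

```python
import math


def attend(q, k_full, v_full):
    """Attention of one query vector Q_i over all keys in K_full."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_full]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, v_full))
           for j in range(len(v_full[0]))]
    return weights, out


k_prev = [[1.0, 0.0], [0.0, 1.0]]            # cached keys, L_prev = 2
v_prev = [[1.0, 0.0], [0.0, 1.0]]
k_cur = [[1.0, 1.0]]                         # current segment, L_cur = 1
v_cur = [[2.0, 2.0]]

k_full = k_prev + k_cur                      # K_full = concat(K_prev, K_cur)
v_full = v_prev + v_cur                      # V_full = concat(V_prev, V_cur)

weights, out = attend([1.0, 0.0], k_full, v_full)
```

The first two entries of weights are the attention paid to the preceding segment, which is what lets the subsequent word unit use earlier context.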

[0146] Step S570: In the cross-segment attention weight calculation process, the second numerical representation bit width operation core identifies the first precision cache component from the first numerical representation bit width calculation subgraph and the second precision cache component from the second numerical representation bit width calculation subgraph in the preceding key value cache tensor set, and performs precision alignment processing on the first precision cache component and the second precision cache component to unify the numerical representation bit width format participating in the attention weight calculation.

[0147] Since the key-value caches of the preceding segments may come from computation paths of different precisions (partially from high-precision subgraphs and partially from low-precision subgraphs), they need to be unified in precision after concatenation. The first-precision cache component (high precision) is quantized to a low-precision format, or the second-precision cache component (low precision) is dequantized to a high-precision format, so that all tensors involved in the computation have the same numerical representation bit width.
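The precision alignment can be sketched with a simple symmetric 8-bit quantization scheme. The quantization scheme, the per-component scale, and the shared common_scale used when aligning to low precision are assumptions for illustration; the text does not specify how quantization is performed.

```python
def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers with a given scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(q, scale):
    """Map signed 8-bit integers back to floats."""
    return [x * scale for x in q]


def align_precision(components, target, common_scale=None):
    """Bring all cache components to one format.

    components: list of (fmt, data, scale); fmt is "high" (floats, scale
                unused) or "low" (int8 values with their scale)
    target:     "high" returns floats, "low" returns int8 at common_scale
    """
    floats = [
        list(data) if fmt == "high" else dequantize_int8(data, scale)
        for fmt, data, scale in components
    ]
    if target == "high":
        return floats
    return [quantize_int8(f, common_scale) for f in floats]


cache = [("high", [0.5, -1.0], None), ("low", [2, -4], 0.25)]
```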

[0148] Step S580: After completing all forward propagation processing of the subsequent text fragment unit, the second numerical representation bit-width operation core appends the set of subsequent key-value cache tensors corresponding to the subsequent text fragment unit to the key-value cache dedicated storage space, and updates the fragment position offset identifier to cover the overall start and end position index information of the preceding text fragment unit and the subsequent text fragment unit.

[0149] After processing the subsequent fragment, its own key-value cache is also stored in shared memory, and the fragment position offset identifier is updated to cover positions 0 to L_prev + L_cur - 1.

[0150] Step S590: If the text input sequence to be processed is divided into more than two text segment units, then according to the generation order of the text segment units, the first numerical representation bit-width operation core or the second numerical representation bit-width operation core sequentially performs the following processing on each subsequent text segment unit: reading the preceding key-value cache tensor set, concatenating the key vector tensor and value vector tensor, and calculating the cross-segment attention weight. After the processing is completed, the current key-value cache tensor set corresponding to the current text segment unit is appended to the key-value cache dedicated storage space.

[0151] For cases with more than two segments, repeat steps S540 to S580. After each segment is processed, its key-value cache is appended to shared memory for use by subsequent segments.

[0152] Step S5100: When generating a complete text inference output sequence for the text input sequence to be processed, all key-value cache tensor sets accumulated in the key-value cache dedicated storage space are merged and deduplicated according to the fragment position offset identifier to form a global key-value cache tensor set covering all word units of the text input sequence to be processed. The global key-value cache tensor set is used to support the retrieval of global context information of the text input sequence to be processed during the generation of the text inference output sequence.

[0153] When generating the response text, the key-value caches of all segments in shared memory are merged, duplicate boundary terms are removed, and a complete key-value cache covering the entire input sequence is formed. During generation, each newly generated term can directly use this global key-value cache to calculate its attention weight with all input terms.

[0154] Step S610: Before the first numerical representation bit-width operation core and the second numerical representation bit-width operation core perform synchronous forward propagation processing, obtain the numerical distribution sparsity statistics of the weight parameter tensors corresponding to each computing node unit in the computation graph topology. The numerical distribution sparsity statistics are used to characterize the proportion of elements in the weight parameter tensors whose numerical amplitude is lower than the preset sparsity judgment boundary.

[0155] For each weight parameter tensor W, count the number of elements n_zero whose absolute value is less than the preset sparsity decision boundary T_sparse (e.g., 0.05), and calculate the sparsity = n_zero / N, where N is the total number of elements in the weight tensor.

[0156] Step S620: Based on the numerical distribution sparsity statistics, generate a corresponding weight sparsity sensitivity label for each computing node unit in the computation graph topology. The weight sparsity sensitivity label is used to identify the sensitivity category of the computing node unit to the sparsification of the weight parameter tensor.

[0157] If the sparsity is greater than the preset high sparsity threshold (e.g., 0.7), the sparsity sensitivity of the node's weights is set to "sparse-friendly," indicating that the node's weights are highly sparse and sparsification will not significantly affect accuracy. If the sparsity is less than the preset low sparsity threshold (e.g., 0.3), it is set to "sparse-sensitive," indicating that the node's weights are relatively dense and sparsification may result in a loss of accuracy.
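The sparsity statistic of step S610 and the labelling of step S620 can be sketched as follows, using the example values T_sparse = 0.05, 0.7 and 0.3. The label "neutral" for values between the two thresholds is an assumption, since the text does not name that case.

```python
def sparsity(weights, t_sparse=0.05):
    """sparsity = n_zero / N for a flat list of weight values."""
    n_zero = sum(1 for w in weights if abs(w) < t_sparse)
    return n_zero / len(weights)


def sensitivity_label(s, high=0.7, low=0.3):
    """Map a sparsity value to a weight sparsity sensitivity label."""
    if s > high:
        return "sparse-friendly"
    if s < low:
        return "sparse-sensitive"
    return "neutral"  # assumption: the middle range is not named in the text
```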

[0158] Step S630: Extract the set of candidate sparse computing node units marked as insensitive to the sparsification of weight parameter tensors in the computation graph topology, and perform structured sparse compression on the weight parameter tensors of each computing node unit in the set of candidate sparse computing node units to generate sparse weight parameter tensors. The structured sparse compression process sets to zero the elements in the weight parameter tensors whose absolute values are lower than the preset sparsity decision boundary, and stores the non-zero elements and their position indices in a compressed format.

[0159] For nodes marked as "sparse-friendly", structured sparse compression is performed on their weights. Sparsification is performed in fixed-size blocks (e.g., 4x4) using either block sparsity or channel sparsity. Blocks where all elements are less than T_sparse are marked as zero blocks, and only the indices and values of non-zero blocks are stored.
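A minimal sketch of block-wise compression, assuming a weight matrix stored as a list of rows. The output layout (a dictionary of non-zero blocks keyed by block coordinates) is an illustrative choice, not a format given in the text. The demonstration uses 2x2 blocks to keep the matrix small.

```python
def compress_blocks(matrix, block=4, t_sparse=0.05):
    """Keep only the blocks that contain at least one element >= t_sparse
    in magnitude; all other blocks are treated as zero blocks."""
    rows, cols = len(matrix), len(matrix[0])
    stored = {}
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            tile = [row[bc:bc + block] for row in matrix[br:br + block]]
            if any(abs(v) >= t_sparse for r in tile for v in r):
                stored[(br // block, bc // block)] = tile
    return {"shape": (rows, cols), "block": block, "blocks": stored}


w = [
    [0.50, 0.01, 0.00, 0.02],
    [0.00, 0.30, 0.01, 0.00],
    [0.01, 0.00, 0.00, 0.01],
    [0.02, 0.00, 0.03, 0.00],
]
packed = compress_blocks(w, block=2)
```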

[0160] Step S640: When the first numerical representation bit width calculation subgraph is loaded in the first numerical representation bit width calculation core, the first subgraph sparse calculation node unit belonging to the candidate sparse calculation node unit set in the first numerical representation bit width calculation subgraph is identified, the sparse weight parameter tensor corresponding to the first subgraph sparse calculation node unit is loaded into the first dedicated storage area, and the non-sparse regular weight parameter tensor is loaded into the first dedicated storage area.

[0161] When the first numerical representation bit-width operation core loads weights, it checks whether the node belongs to the candidate sparse computing node unit set. If it does, the sparse weights are loaded; otherwise, the regular dense weights are loaded.

[0162] Step S650: When the second numerical representation bit width calculation subgraph is loaded in the second numerical representation bit width calculation core, the second subgraph sparse calculation node unit belonging to the candidate sparse calculation node unit set in the second numerical representation bit width calculation subgraph is identified, the sparse weight parameter tensor corresponding to the second subgraph sparse calculation node unit is loaded into the second dedicated storage area, and the non-sparse regular weight parameter tensor is loaded into the second dedicated storage area.

[0163] The second numerical representation bit-width operation core performs similar loading logic.

[0164] Step S660: When the first numerical representation bit-width operation core performs forward propagation processing on the first subgraph sparsification computation node unit, it performs multiplication and accumulation operations only on the components in the input tensor corresponding to the position index of the non-zero element, based on the non-zero elements and their position indices stored in the sparsification weight parameter tensor, and skips the component operations corresponding to the position index of the zero-value element.

[0165] For sparse weights, the first numerical representation bit-width operation core performs sparse matrix multiplication. It iterates through the stored non-zero element indices, loading and multiplying only the components at the corresponding positions in the input tensor and skipping zero-value positions. This reduces the number of multiplication operations and memory accesses, because zero-value positions are never loaded or multiplied.
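The skip-zero multiplication can be sketched as a matrix-vector product over stored non-zero entries. The coordinate-list layout of (row, column, value) triples is an assumption; the text only says that non-zero elements and their position indices are stored.

```python
def sparse_matvec(shape, entries, x):
    """Compute y = W x, touching only the stored non-zero weights.

    shape:   (rows, cols) of the dense weight matrix W
    entries: list of (row, col, value) for the non-zero elements of W
    x:       input vector of length cols
    """
    rows, _cols = shape
    y = [0.0] * rows
    for r, c, w in entries:
        y[r] += w * x[c]   # zero-value positions are never visited
    return y


# W = [[2, 0, 0],
#      [0, 0, 3]]
entries = [(0, 0, 2.0), (1, 2, 3.0)]
y = sparse_matvec((2, 3), entries, [1.0, 5.0, 7.0])
```

Here only two multiply-accumulate operations are performed instead of the six a dense product would need.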

[0166] Step S670: When the second numerical representation bit-width operation core performs forward propagation processing on the second subgraph sparsification computation node unit, it performs multiplication and accumulation operations only on the components in the input tensor corresponding to the position index of the non-zero element, based on the non-zero elements and their position indices stored in the sparsification weight parameter tensor, and skips the component operations corresponding to the position index of the zero-value element.

[0167] The second numerical representation bit-width operation core performs the same sparse computation logic.

[0168] Step S680: After the first numerical representation bit-width operation core and the second numerical representation bit-width operation core complete the forward propagation process, the first subgraph sparse output tensor generated by the first subgraph sparse computation node unit and the second subgraph sparse output tensor generated by the second subgraph sparse computation node unit are restored to a dense tensor format compatible with adjacent non-sparse computation node units.

[0169] The resulting tensor from sparse computation may be in sparse format (storing only non-zero values), while subsequent non-sparse computation nodes require dense tensors as input. Therefore, to convert a sparse output tensor into a dense tensor, create a dense tensor filled with all zeros and fill in the corresponding positions with non-zero values based on their indexes.
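The restoration to dense format can be sketched for a one-dimensional tensor as follows. The function name and the separate indices and values lists are illustrative assumptions about the sparse format.

```python
def to_dense(length, indices, values):
    """Create an all-zero tensor and fill in the stored non-zero values."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = to_dense(5, [1, 3], [2.5, -1.0])
```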

[0170] Step S690: The first subgraph sparse output tensor and the second subgraph sparse output tensor, which have been restored to dense tensor format, are respectively injected into the corresponding receiving computing node units in the cross-subgraph tensor interaction processing flow to participate in the generation of the mixed bit-width forward propagation data stream.

[0171] The restored dense tensor is passed to the downstream node to continue the forward propagation process.

[0172] Step S6100: During the reconstruction of the computation graph topology in the subsequent inference time step, the composition of the candidate sparse computation node unit set is dynamically adjusted according to the updated numerical distribution sparsity statistics, and the adjusted sparsity weight parameter tensor distribution state is applied to the newly generated first numerical representation bit-width computation subgraph and second numerical representation bit-width computation subgraph.

[0173] As inference progresses, the importance distribution of weights may change. Periodically recalculate the weight sparsity statistics, removing nodes with decreasing sparsity from the candidate set (stopping sparsification) and adding nodes with increasing sparsity to the candidate set (starting sparsification). Apply the updated sparsification strategy during the next computation graph reconstruction.

[0174] For example, the method may further include: step S710: before the first numerical representation bit-width operation core and the second numerical representation bit-width operation core perform synchronous forward propagation processing, construct a synchronous scheduling event sequence corresponding to the cross-subgraph data interaction mapping table, wherein each scheduling event unit in the synchronous scheduling event sequence is bound to a cross-subgraph tensor transfer operation.

[0175] Before proceeding with the forward propagation, the cross-subgraph data interaction mapping table is parsed. Each row of this table records a data transmission requirement from a source subgraph boundary node to a target subgraph receiving node. For each transmission record, a scheduling event unit is created. Each scheduling event unit contains the following fields: source node identifier (pointing to the boundary node that produces the tensor), target node identifier (pointing to the node that needs to receive the tensor), source subgraph identifier (first or second subgraph), target subgraph identifier, shape information of the transmitted tensor, and data dependency flags. All scheduling event units are arranged according to the execution order of the source nodes in their respective subgraphs, forming a synchronous scheduling event sequence. This synchronous scheduling event sequence is used to coordinate the execution progress of the two computational cores, ensuring that data is produced before it can be consumed.
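The scheduling event unit and the construction of the synchronous scheduling event sequence can be sketched as follows. The field names mirror the fields listed above, but the class name, the dictionary layout of the mapping table rows, and the exec_order lookup are hypothetical. Ordering events from different subgraphs by their position index alone is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class SchedulingEvent:
    source_node: str        # boundary node that produces the tensor
    target_node: str        # node that needs to receive the tensor
    source_subgraph: str    # "first" or "second"
    target_subgraph: str
    tensor_shape: tuple
    data_dependency: bool = True


def build_event_sequence(mapping_table, exec_order):
    """Create one event per mapping table row, ordered by the execution
    position of the source node inside its own subgraph."""
    events = [SchedulingEvent(**row) for row in mapping_table]
    return sorted(
        events,
        key=lambda e: exec_order[e.source_subgraph][e.source_node],
    )


table = [
    {"source_node": "n7", "target_node": "m2", "source_subgraph": "first",
     "target_subgraph": "second", "tensor_shape": (1, 8)},
    {"source_node": "m1", "target_node": "n3", "source_subgraph": "second",
     "target_subgraph": "first", "tensor_shape": (1, 8)},
]
exec_order = {"first": {"n7": 3}, "second": {"m1": 1}}
sequence = build_event_sequence(table, exec_order)
```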

[0176] Step S720: When the first numerical representation bit width operation core executes to the first type of edge computing node unit in the first numerical representation bit width calculation subgraph, it writes the first type of edge output tensor generated by the first type of edge computing node unit to the shared buffer storage area jointly accessed by the first numerical representation bit width operation core and the second numerical representation bit width operation core, and records the write completion status.

[0177] The first numerical representation bit-width operation core performs calculations according to the topological order of the first subgraph. When it reaches a first type of edge computing node unit (i.e., a node marked in step S1285 whose output needs to be passed to the second subgraph), the node generates an output tensor T_out after completing its forward propagation calculation. The operation core copies T_out to a specified location in the shared buffer storage area through a direct memory access channel. The shared buffer storage area is a physically contiguous memory space that both operation cores can access through memory mapping. After the write is completed, the operation core sets the write completion flag corresponding to the tensor to 1 in a shared state table (located in shared memory). The shared state table stores the write status of each cross-subgraph tensor, indexed by the source node identifier.
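A software stand-in for the shared buffer storage area and the shared state table might look like the sketch below. The class `SharedBuffer`, its slot layout, and the use of a single-precision `array` as the contiguous block are assumptions; a real system would use DMA transfers and memory-mapped hardware buffers, which are not modelled here.

```python
from array import array
from math import prod


class SharedBuffer:
    """One contiguous float buffer plus a write-completion table that is
    indexed by source node identifier (hypothetical layout)."""

    def __init__(self, slots):
        # slots: dict mapping source node id -> tensor shape
        self.layout = {}
        total = 0
        for node, shape in slots.items():
            size = prod(shape)
            self.layout[node] = (total, size)  # (offset, element count)
            total += size
        self.data = array("f", [0.0] * total)
        self.write_done = {node: 0 for node in slots}

    def write(self, node, flat_values):
        start, size = self.layout[node]
        if len(flat_values) != size:
            raise ValueError("tensor size does not match its slot")
        self.data[start:start + size] = array("f", flat_values)
        self.write_done[node] = 1  # record the write completion status

    def read(self, node):
        start, size = self.layout[node]
        return list(self.data[start:start + size])


buf = SharedBuffer({"a1": (2, 2), "b3": (3,)})
buf.write("a1", [1.0, 2.0, 0.5, 4.0])
print(buf.write_done, buf.read("a1"))
```

Setting the flag only after the copy finishes is what lets a reader treat a flag value of 1 as proof that the slot holds complete data.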

[0178] Step S730: After confirming that the first type of edge output tensor has been written, the first numerical representation bit width calculation core suspends the forward propagation process downstream of the first type of edge calculation node unit in the first numerical representation bit width calculation subgraph, and waits for the synchronization scheduling event unit associated with the first type of edge output tensor to be triggered.

[0179] After the write operation is complete, the first numerical representation bit-width operation core checks the synchronous scheduling event sequence. If the output tensor generated by the current edge node is an input required by a receiving node in the second subgraph, and the computation of that receiving node depends on this tensor, the operation core suspends execution of all computation nodes in the current subgraph that follow that edge node. The operation core then enters a low-power wait loop, continuously checking the trigger flags of the scheduling event units associated with the current edge output tensor. This wait loop does not consume computational resources and occupies only a small amount of control logic.

[0180] Step S740: When the second numerical representation bit width operation core executes to the second type of edge computing node unit in the second numerical representation bit width calculation subgraph, it writes the second type of edge output tensor generated by the second type of edge computing node unit to the shared buffer storage area and records the writing completion status.

[0181] The second numerical representation bit-width operation core performs calculations in parallel according to the topological order of the second subgraph. When it reaches a second type of edge computing node unit (i.e., a node marked in step S1285 whose output needs to be passed to the first subgraph), the node generates an output tensor T_out' after completing its forward propagation calculation. The operation core writes the output tensor to the corresponding position in the shared buffer storage area and sets the corresponding write completion flag to 1 in the shared state table.

[0182] Step S750: After confirming that the second type of edge output tensor has been written, the second numerical representation bit width calculation core suspends the forward propagation process downstream of the second type of edge calculation node unit in the second numerical representation bit width calculation subgraph, and waits for the synchronization scheduling event unit associated with the second type of edge output tensor to be triggered.

[0183] The second numerical representation bit-width operation core likewise suspends the execution of downstream nodes after writing and enters a waiting state until the corresponding scheduling event unit is triggered.

[0184] Step S760: When it is detected that all scheduling event units bound to the cross-subgraph transmission operation corresponding to the same pair of virtual output ports and virtual input ports and having a data dependency relationship have completed their write status, the synchronous scheduling controller broadcasts a continue execution instruction to the first numerical representation bit-width operation core and the second numerical representation bit-width operation core.

[0185] The synchronous scheduling controller is a lightweight hardware module or operating system-level service that continuously polls the write completion flags in the shared state table. For each cross-subgraph transfer operation (from an edge node of the first subgraph to a receiving node of the second subgraph, or vice versa), when the write completion flag of the source node is 1, the controller checks whether the transfer has any other dependencies (e.g., multiple source tensors that need to be concatenated before being sent to the target node). When all dependencies are met, the controller sends an inter-core interrupt signal to the first and second numerical representation bit-width operation cores, thereby broadcasting the continue execution instruction.
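The controller's readiness check can be sketched as a pure function over the shared state table. The event dictionaries with `source`, `target`, and `depends_on` keys are a hypothetical representation; the inter-core interrupt itself is not modelled.

```python
def ready_events(events, write_done):
    """Return the events whose source tensor and every additional
    dependency have their write completion flag set to 1."""
    ready = []
    for ev in events:
        needed = [ev["source"], *ev.get("depends_on", ())]
        if all(write_done.get(node, 0) == 1 for node in needed):
            ready.append(ev)
    return ready


events = [
    {"source": "a1", "target": "b2"},
    # b3 must be concatenated with b5 before it is sent to a4
    {"source": "b3", "target": "a4", "depends_on": ("b5",)},
]
write_done = {"a1": 1, "b3": 1, "b5": 0}
first_pass = [ev["target"] for ev in ready_events(events, write_done)]

write_done["b5"] = 1  # the last dependency has now been written
second_pass = [ev["target"] for ev in ready_events(events, write_done)]
print(first_pass, second_pass)
```

A polling controller would call such a check repeatedly and issue the continue execution instruction once the events for a given port pair appear in the returned list.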

[0186] Step S770: After receiving the continue execution instruction, the first numerical representation bit width calculation core releases the waiting state of the suspended forward propagation processing flow in the first numerical representation bit width calculation subgraph and continues to execute the forward propagation processing of the subsequent first subgraph calculation node units.

[0187] After the first numerical representation bit-width operation core receives the interrupt signal, it exits the wait loop, restores the program counter to the suspended position, and continues executing the computation node units located downstream of the edge node in the first subgraph. At this point, the second subgraph has already written the required input tensors to the shared buffer storage area, and the first subgraph can read this data directly.

[0188] Step S780: After receiving the continue execution instruction, the second numerical representation bit width calculation core releases the waiting state of the suspended forward propagation processing flow in the second numerical representation bit width calculation subgraph and continues to execute the forward propagation processing of the subsequent second subgraph calculation node units.

[0189] The second numerical representation bit-width operation core likewise resumes execution, in synchronization with the first core, and continues to process the downstream nodes in the second subgraph.

[0190] Step S790: During the forward propagation process, the first numerical representation bit-width operation core reads the second type of edge output tensor written by the second numerical representation bit-width operation core from the shared buffer storage area, and uses it as the input tensor of the corresponding first type of receiving computing node unit to participate in subsequent operations.

[0191] When the first numerical representation bit-width operation core reaches the first type of receiving computation node unit (i.e., the node that needs to receive data from the second subgraph), it first reads the corresponding second type edge output tensor from the shared buffer storage area. If the bit-width of the tensor's numerical representation (e.g., an 8-bit integer) does not match the bit-width used by the current first subgraph (e.g., a 32-bit floating-point number), the operation core calls a precision conversion function to convert the tensor to a matching format. Then, the converted tensor is used as one of the inputs to that node, participating in the forward propagation computation along with other inputs.
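The precision conversion function is not specified in the text. As one common possibility, the sketch below uses affine quantization with a per-tensor `scale` and `zero_point`; both parameters and the function names are assumptions made for illustration.

```python
def dequantize_int8(values, scale, zero_point=0):
    """Convert 8-bit integer codes to floating-point values."""
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError("value outside the signed 8-bit range")
    return [(v - zero_point) * scale for v in values]


def quantize_int8(values, scale, zero_point=0):
    """Convert floating-point values to 8-bit integer codes, saturating
    at the limits of the signed 8-bit range."""
    codes = []
    for x in values:
        q = round(x / scale) + zero_point
        codes.append(max(-128, min(127, q)))
    return codes


low_precision = [-128, 0, 64]
high_precision = dequantize_int8(low_precision, scale=0.5)
round_trip = quantize_int8(high_precision, scale=0.5)
print(high_precision, round_trip)
```

`dequantize_int8` corresponds to the direction described above (an 8-bit tensor entering the higher-precision subgraph); `quantize_int8` covers the opposite direction, where out-of-range values are clamped and precision is lost.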

[0192] Step S7100: During the forward propagation process, the second numerical representation bit-width operation core reads the first type of edge output tensor written by the first numerical representation bit-width operation core from the shared buffer storage area, and uses it as the input tensor of the corresponding second type of receiving computing node unit to participate in subsequent operations.

[0193] The second numerical representation bit-width operation core performs the symmetrical operation: it reads the first type of edge output tensor from the shared buffer storage area, performs any necessary precision conversion, and then uses it as input to the second type of receiving computation node unit. Through the above synchronization scheduling mechanism, the two operation cores execute in parallel efficiently and wait only at cross-subgraph data transmission points, thus avoiding deadlock and data race problems.
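The write, suspend, continue, and read sequence of steps S720 to S7100 can be imitated in software with two threads. In this sketch a `threading.Barrier` stands in for the synchronous scheduling controller and its continue execution broadcast, a plain dictionary stands in for the shared buffer storage area, and the node computations are placeholders; none of these choices come from the patent text.

```python
import threading

shared_buffer = {}               # stands in for the shared buffer storage area
barrier = threading.Barrier(2)   # stands in for the scheduling controller
results = {}


def run_core(name, own_key, peer_key, local_input, convert):
    # Placeholder for the nodes upstream of the edge computing node unit.
    edge_output = [x * 2 for x in local_input]
    # Write the edge output tensor to the shared buffer.
    shared_buffer[own_key] = edge_output
    # Suspend downstream processing until both cores have written.
    barrier.wait(timeout=5)
    # Read the peer tensor, convert its precision, and keep propagating.
    received = convert(shared_buffer[peer_key])
    results[name] = [a + b for a, b in zip(edge_output, received)]


high = threading.Thread(
    target=run_core,
    args=("high", "T_out", "T_out_prime", [1.0, 2.0],
          lambda t: [float(v) for v in t]),
)
low = threading.Thread(
    target=run_core,
    args=("low", "T_out_prime", "T_out", [3, 4],
          lambda t: [int(round(v)) for v in t]),
)
high.start()
low.start()
high.join(timeout=10)
low.join(timeout=10)
print(results)
```

Because each core writes before it waits and reads only after the barrier releases, neither core can read a slot that has not been written, and neither can wait on a write that the other core is blocked from performing.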

[0194] In one exemplary embodiment, a mixed-precision model inference acceleration system for large language models is provided. The system can be a terminal, a server, or the like, and its internal structure can be as shown in Figure 2. It includes a processor, memory, input/output interface, communication interface, display unit, and input device. The processor, memory, and input/output interface are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interface. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interface is used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, near-field communication, or other technologies. When the computer program is executed by the processor, it implements the mixed-precision model inference acceleration method applied to large language models. The display unit is used to form a visible image and can be a display screen, projection device, or virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device can be a touch layer covering the display screen; a button, trackball, or touchpad provided on the housing of the system; or an external keyboard, touchpad, or mouse.

[0195] It should be noted that, to simplify the description of the present invention and aid understanding of one or more embodiments of the invention, multiple features are sometimes grouped into a single embodiment, drawing, or description thereof in the foregoing description of the embodiments of the present invention.

Claims

1. A mixed-precision model inference acceleration method applied to a large language model, characterized in that the method includes: obtaining a text input sequence to be processed and the computation graph topology of a pre-trained large language model, wherein the computation graph topology includes multiple computation node units having an initial numerical representation bit width attribute and data flow association edges between the computation node units; based on the gradient propagation path length attribute of the data flow association edges in the computation graph topology and the semantic complexity representation vector of the text input sequence to be processed, generating a numerical representation bit width compression tolerance label for each computation node unit, and reconstructing the computation graph topology into a first numerical representation bit width computation subgraph and a second numerical representation bit width computation subgraph based on the numerical representation bit width compression tolerance label; calling a first numerical representation bit width operation core to perform forward propagation processing on the first numerical representation bit width computation subgraph, and simultaneously calling a second numerical representation bit width operation core to perform forward propagation processing on the second numerical representation bit width computation subgraph, so as to obtain a first numerical representation bit width intermediate tensor set and a second numerical representation bit width intermediate tensor set, wherein the bit width occupancy of the first numerical representation bit width operation core is greater than that of the second numerical representation bit width operation core; performing cross-subgraph tensor interaction processing on the first numerical representation bit width intermediate tensor set and the second numerical representation bit width intermediate tensor set, in which tensor units in the first numerical representation bit width intermediate tensor set are transmitted to the receiving computing node units in the second numerical representation bit width computation subgraph, and tensor units in the second numerical representation bit width intermediate tensor set are transmitted to the receiving computing node units in the first numerical representation bit width computation subgraph, thereby generating a mixed bit-width forward propagation data stream; and generating a text inference output sequence based on the mixed bit-width forward propagation data stream and the output layer weight tensor of the pre-trained large language model.

2. The mixed-precision model inference acceleration method applied to a large language model according to claim 1, characterized in that, The step of generating a numerical representation bit-width compression tolerance marker for each computation node unit based on the gradient propagation path length attribute of the data flow to the associated edges in the computation graph topology and the semantic complexity representation vector of the text input sequence to be processed includes: The text input sequence to be processed is transformed into an initial input embedding tensor with a sequence length dimension and a word embedding dimension, and the initial input embedding tensor is loaded into the starting computation node unit of the computation graph topology. For the target computing node unit in the computing graph topology, the total number of data flow-related edges that propagate from the starting computing node unit along the data flow-related edges to the target computing node unit is counted, and the total number of data flow-related edges is used as the gradient propagation path length attribute of the target computing node unit. Extract the frequency distribution data of rare word units contained in the text input sequence to be processed, and identify the maximum dependency distance value of cross-syntactic component dependency relationship chain in the text input sequence to be processed. Combine the frequency distribution data and the maximum dependency distance value to form the semantic complexity representation vector. 
The gradient propagation path length attribute of the target computing node unit is input into a preset first length response function to generate a first tolerance tendency value, the output value of the first length response function being positively correlated with the magnitude of the gradient propagation path length attribute; and the maximum dependency distance value in the semantic complexity representation vector is input into a preset second length response function to generate a second tolerance tendency value, the output value of the second length response function being positively correlated with the magnitude of the maximum dependency distance value; the first tolerance tendency value and the second tolerance tendency value are added together to obtain the basic bit width compression tolerance value for the target computing node unit; Extract the discreteness distribution parameters of the historical activation tensor generated by the target computing node unit in the historical forward propagation record, compare the discreteness distribution parameters of the historical activation tensor with the preset discreteness comparison range, and generate a third tolerance tendency value for the target computing node unit. The base bit width compression tolerance value is multiplied by the third tolerance tendency value to obtain the final bit width compression tolerance value of the target computing node unit. Based on the comparison result between the final bit width compression tolerance value and the preset tolerance judgment limit, the numerical representation bit width compression tolerance mark of the target computing node unit is determined. The numerical representation bit width compression tolerance mark is used to identify the sensitivity category of the target computing node unit to numerical representation bit width compression processing. 
Traverse all computation node units in the computation graph topology and sequentially execute the steps of gradient propagation path length attribute statistics, semantic complexity representation vector combination, basic bit width compression tolerance value generation, and final bit width compression tolerance value calculation to obtain the bit width compression tolerance label corresponding to each computation node unit.
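The arithmetic in claim 2 (two tolerance tendency values added, the sum multiplied by a third, and the product compared against a limit) can be illustrated with the sketch below. The linear response functions, the gain constants, the discreteness range, the reduced factor of 0.5, the judgment limit, and the rule that a higher value means "compressible" are all assumptions chosen for the example; the claim only requires that the two response functions be positively correlated with their inputs.

```python
def tolerance_label(path_length, max_dep_distance, activation_spread,
                    *, len_gain=0.1, dep_gain=0.05,
                    spread_range=(0.5, 2.0), limit=1.0):
    """Return (final tolerance value, label) for one computation node unit.

    All keyword defaults are hypothetical illustration values.
    """
    # First and second tolerance tendency values: increasing functions of
    # the gradient propagation path length and the maximum dependency
    # distance, respectively.
    first = len_gain * path_length
    second = dep_gain * max_dep_distance
    base = first + second

    # Third tendency value: derived by comparing the historical activation
    # discreteness with a preset comparison range.
    low, high = spread_range
    third = 1.0 if low <= activation_spread <= high else 0.5

    final = base * third
    label = "compressible" if final >= limit else "sensitive"
    return final, label


deep_node = tolerance_label(path_length=12, max_dep_distance=4,
                            activation_spread=1.0)
shallow_node = tolerance_label(path_length=2, max_dep_distance=4,
                               activation_spread=3.0)
print(deep_node[1], shallow_node[1])
```

Traversing every computation node unit with such a function would yield the per-node labels that drive the subgraph split.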

3. The method for accelerating inference using mixed-precision models applied to large language models according to claim 1, characterized in that, The process of reconstructing the computation graph topology into a first numerical representation bit-width computation subgraph and a second numerical representation bit-width computation subgraph based on the numerical representation bit-width compression tolerance marker includes: Obtain the numerical representation bit width compression tolerance tag bound to each computing node unit in the computing graph topology, and divide the computing node units corresponding to the numerical representation bit width compression tolerance tags with the same sensitivity category into the same computing node unit set; Extract all computing node units bound to the numerical representation bit width compression tolerance marker with the first sensitivity category to form a first computing node unit set; at the same time, extract all computing node units bound to the numerical representation bit width compression tolerance marker with the second sensitivity category to form a second computing node unit set. In the computation graph topology, the original data flow association edges between any two computation node units within the first set of computation node units are retained, and the data flow association edges pointing from any one of the first set of computation node units to any one of the second set of computation node units are removed, resulting in an initial first numerical representation bit width computation subgraph. 
In the computation graph topology, the original data flow association edges between any two computation node units within the second computing node unit set are retained, and the data flow association edges pointing from any computation node unit in the second computing node unit set to any computation node unit in the first computing node unit set are removed, resulting in an initial second numerical representation bit width computation subgraph. In the initial first numerical representation bit width computation subgraph, a first type of edge computing node unit, namely one whose data flow association edge to the second computing node unit set has been removed, is identified, and a virtual output port identifier pointing to the external subgraph is added to the first type of edge computing node unit; and in the initial second numerical representation bit width computation subgraph, a second type of edge computing node unit, namely one whose data flow association edge to the first computing node unit set has been removed, is identified, and a virtual output port identifier pointing to the external subgraph is added to the second type of edge computing node unit; In the initial first numerical representation bit width computation subgraph, a first type of receiving computing node unit that receives data from the second computing node unit set is identified, and a virtual input port identifier for receiving from the external subgraph is added to the first type of receiving computing node unit; and in the initial second numerical representation bit width computation subgraph, a second type of receiving computing node unit that receives data from the first computing node unit set is identified, and a virtual input port identifier for receiving from the external subgraph is added to the second type of receiving computing node unit; Based on the correspondence between the virtual output port identifiers and the virtual input port identifiers, a cross-subgraph data interaction mapping table is constructed to connect the initial first numerical representation bit width computation subgraph and the initial second numerical representation bit width computation subgraph. The cross-subgraph data interaction mapping table records the source port and target port pairing information of cross-subgraph tensor transmission. The initial first numerical representation bit width computation subgraph after the virtual port identifiers are added is determined as the first numerical representation bit width computation subgraph, and the initial second numerical representation bit width computation subgraph after the virtual port identifiers are added is determined as the second numerical representation bit width computation subgraph.
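The subgraph reconstruction in claim 3 can be sketched on a small edge list. The function `split_graph`, the `out:`/`in:` port naming scheme, and the node names are hypothetical, and the sketch simplifies the claim by treating every edge that crosses the two node sets as removed and recorded in the mapping table.

```python
def split_graph(edges, labels):
    """Split a directed graph into two subgraphs by node label.

    edges: list of (source, destination) node ids
    labels: dict mapping node id -> 1 (first set) or 2 (second set)
    Returns (first subgraph edges, second subgraph edges, mapping table).
    """
    subgraph_edges = {1: [], 2: []}
    mapping_table = []
    for src, dst in edges:
        if labels[src] == labels[dst]:
            # Edge inside one node set: retained in that subgraph.
            subgraph_edges[labels[src]].append((src, dst))
        else:
            # Crossing edge: replaced by a virtual output/input port pair.
            mapping_table.append({
                "source_port": f"out:{src}",
                "target_port": f"in:{dst}",
                "source_subgraph": labels[src],
                "target_subgraph": labels[dst],
            })
    return subgraph_edges[1], subgraph_edges[2], mapping_table


edges = [("embed", "attn"), ("attn", "mlp"), ("mlp", "norm"), ("norm", "head")]
labels = {"embed": 1, "attn": 1, "mlp": 2, "norm": 2, "head": 1}
first_sub, second_sub, table = split_graph(edges, labels)
print(first_sub, second_sub)
print([(row["source_port"], row["target_port"]) for row in table])
```

Here `attn` and `norm` play the role of edge computing node units, while `mlp` and `head` are receiving computing node units.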

4. The method for accelerating inference using mixed-precision models applied to large language models according to claim 1, characterized in that, The process involves calling the first numerical representation bit-width computation core to perform forward propagation processing on the first numerical representation bit-width computation subgraph, and simultaneously calling the second numerical representation bit-width computation core to perform forward propagation processing on the second numerical representation bit-width computation subgraph, thereby obtaining the first numerical representation bit-width intermediate tensor set and the second numerical representation bit-width intermediate tensor set, including: The first numerical representation bit width operation core and the second numerical representation bit width operation core start execution in parallel and synchronously, respectively acquiring the first subgraph adjacency relationship description data of the first numerical representation bit width calculation subgraph and the second subgraph adjacency relationship description data of the second numerical representation bit width calculation subgraph, and determining the topological execution order of the computing node units in their respective computing subgraphs based on their respective adjacency relationship description data; The first numerical representation bit width operation core, according to the topological execution order of the first numerical representation bit width calculation subgraph, sequentially inputs the first initial propagation tensor into each first subgraph calculation node unit in the first numerical representation bit width calculation subgraph. 
Each first subgraph calculation node unit performs the first numerical representation bit width multiplication and accumulation operation on the received first input tensor and the bound first weight parameter fragment to generate the first output tensor and pass it to the next first subgraph calculation node unit. While the first numerical representation bit-width operation core is performing the first numerical representation bit-width multiplication and accumulation operation, the second numerical representation bit-width operation core sequentially passes the second initial propagation tensor to each second subgraph operation node unit in the second numerical representation bit-width operation subgraph according to the topological execution order of the second numerical representation bit-width operation subgraph. Each second subgraph operation node unit performs the second numerical representation bit-width multiplication and accumulation operation on the received second input tensor and the bound second weight parameter fragment to generate a second output tensor and pass it to the next second subgraph operation node unit. During the process of the first numerical representation bit width operation core traversing the first numerical representation bit width calculation subgraph, the first output tensor generated by the first boundary calculation node unit is copied and stored in the first boundary temporary storage area. The first boundary calculation node unit is the first subgraph calculation node unit marked as having a data flow direction associated edge pointing to the second numerical representation bit width calculation subgraph in the first subgraph adjacency relationship description data. 
During the process of the second numerical representation bit width operation core traversing the second numerical representation bit width calculation subgraph, the second output tensor generated by the second boundary calculation node unit is copied and stored in the second boundary temporary storage area. The second boundary calculation node unit is the second subgraph calculation node unit marked as having a data flow direction associated edge pointing to the first numerical representation bit width calculation subgraph in the second subgraph adjacency relationship description data. After traversing all the first subgraph calculation node units in the first numerical representation bit width calculation subgraph, the first numerical representation bit width operation core aggregates and stores the first output tensors generated by each first subgraph calculation node unit as the first numerical representation bit width intermediate tensor set, and at the same time encapsulates all the tensors in the first boundary temporary storage area into the first cross-subgraph transmission data packet and pushes it to the shared data interaction area. After traversing all the second subgraph computation node units in the second numerical representation bit width computation subgraph, the second numerical representation bit width computation core aggregates and stores the second output tensors generated by each second subgraph computation node unit as the second numerical representation bit width intermediate tensor set. At the same time, it encapsulates all the tensors in the second boundary temporary storage area into a second cross-subgraph transmission data packet and pushes it to the shared data interaction area.

5. The method for accelerating mixed-precision model inference applied to large language models according to claim 1, characterized in that, The step of performing cross-subgraph tensor interaction processing on the first set of intermediate tensors representing bit width and the second set of intermediate tensors representing bit width, transmitting tensor units from the first set of intermediate tensors representing bit width to the receiving computation node unit in the second set of computational subgraphs representing bit width, and transmitting tensor units from the second set of intermediate tensors representing bit width to the receiving computation node unit in the first set of computational subgraphs representing bit width, generates a mixed bit width forward propagation data stream, including: The second numerical representation bit-width operation core reads the first cross-subgraph transmission data packet from the shared data interaction area, parses out the first boundary output tensor and its corresponding first source port identifier carried in the first cross-subgraph transmission data packet, determines the first target port identifier paired with the first source port identifier and the corresponding first target receiving computing node unit according to the cross-subgraph data interaction mapping table, performs numerical representation bit-width boosting conversion on the first boundary output tensor and injects it into the input port of the first target receiving computing node unit, and when the first target receiving computing node unit performs forward propagation, the second numerical representation bit-width operation core fuses the injected tensor with the original input tensor of the first target receiving computing node unit according to the input fusion method required by the computing node unit, and continues to propagate based on the fusion result; The first numerical representation bit-width operation core reads the second 
cross-subgraph transmission data packet from the shared data interaction area, parses out the second boundary output tensor and its corresponding second source port identifier carried in the second cross-subgraph transmission data packet, determines the second target port identifier paired with the second source port identifier and the corresponding second target receiving computing node unit according to the cross-subgraph data interaction mapping table, performs numerical representation bit-width adaptation conversion on the second boundary output tensor and injects it into the input port of the second target receiving computing node unit, and when the second target receiving computing node unit performs forward propagation, the first numerical representation bit-width operation core merges the injected tensor with the original input tensor of the second target receiving computing node unit according to the input fusion method required by the computing node unit, and continues to propagate based on the fusion result; During the forward propagation process of the remaining computation node units in the first and second numerical representation bit width computation subgraphs, if subsequent boundary computation node units are encountered, the boundary output tensor parsing, numerical representation bit width conversion and fusion injection operations are repeated until all computation node units in the two computation subgraphs have completed the forward propagation process. After the first and second numerical bit-width operation cores have completed their respective computational subgraph traversal, they concatenate and combine the final output tensors of the final computational node units in their respective computational subgraphs according to the original output hierarchy order of the computational graph topology to generate the hybrid bit-width forward propagation data stream.

6. The method for accelerating inference using mixed-precision models applied to large language models according to claim 1, characterized in that, The step of generating a text inference output sequence based on the mixed bit-width forward propagation data stream and the output layer weight tensor of the pre-trained large language model includes: Obtain the terminal mixed bit-width output tensor formed by aggregating the final output component of the first numerical bit-width calculation subgraph with the final output component of the second numerical bit-width calculation subgraph; Before the aggregation of the end-mixed bit-width output tensor is formed, the final output component of the first numerical bit-width calculation subgraph is converted into the target output bit-width format, and the final output component of the second numerical bit-width calculation subgraph is also converted into the target output bit-width format. Then, they are aggregated to obtain the target bit-width end-output tensor. The target bit-width end-output tensor is multiplied with the output layer weight tensor of the pre-trained large language model to obtain the original output score vector aligned with the vocabulary dimension. Each score element in the original output score vector corresponds to a candidate word unit in the vocabulary. The original output score vector is processed by a normalized exponential operation to generate a probability distribution vector in the vocabulary dimension. Based on the probability distribution vector, the predicted word unit for the current inference time step is selected and appended to the end of the generated word sequence. The probability distribution vector represents the probability measure of each candidate word unit as the prediction result of the current inference time step. 
The word embedding representation vector corresponding to the predicted word unit is obtained by querying the word embedding table based on the predicted word unit, and the word embedding representation vector is fed back to the starting computation node unit of the computation graph topology as the input embedding tensor of the next inference time step. The updated generated word sequence is used as a new text input sequence to be processed. The computation graph topology reconstruction, forward propagation processing, cross-subgraph tensor interaction processing and prediction word unit generation process are repeatedly triggered until the preset sequence generation termination condition is met. The sequence generation termination condition includes generating a preset sequence termination word unit or the length of the generated word sequence reaches a preset length limit. After each predicted lexical unit is generated, its identifier in the vocabulary is recorded. The identifiers recorded at all inference time steps are arranged in the order of generation to form an identifier sequence. The lexical decoder is called to process the identifier sequence, mapping each identifier in the identifier sequence back to its corresponding natural language lexical unit. The natural language lexical units are then concatenated and combined in the order of the identifier sequence to obtain a continuous natural language text segment, which serves as the final text inference output sequence corresponding to the text input sequence to be processed.
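The decoding loop in claim 6 (output projection, normalized exponential, selection, feedback, termination, decoding) can be sketched as follows. This is a minimal illustration under stated assumptions: `forward` is a hypothetical stand-in for the whole mixed bit-width forward pass, selection is greedy (the claim only says the word unit is selected based on the distribution), and the toy vocabulary, embedding table and weights are invented for the example.

```python
import math

def softmax(scores):
    """Normalized exponential over the vocabulary dimension."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, weight):
    """Multiply the terminal output tensor by the output layer weights.
    weight[i][j]: hidden dimension i, vocabulary entry j."""
    return [sum(h * weight[i][j] for i, h in enumerate(hidden))
            for j in range(len(weight[0]))]

def generate(forward, embed_table, weight, vocab, prompt_ids, eos_id, max_len):
    ids = list(prompt_ids)
    generated = []                       # identifiers recorded per time step
    while len(generated) < max_len:      # preset length limit
        hidden = forward([embed_table[i] for i in ids])
        probs = softmax(project(hidden, weight))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy (assumed)
        generated.append(next_id)
        ids.append(next_id)              # updated sequence becomes the new input
        if next_id == eos_id:            # preset termination word unit
            break
    # Lexical decoding: map identifiers back to word units and concatenate.
    return "".join(vocab[i] for i in generated)

# Toy setup: the "model" simply returns the last input embedding.
vocab = ["a", "b", "<eos>"]
embed_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}
weight = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
text = generate(lambda embs: embs[-1], embed_table, weight, vocab, [0], 2, 8)
```

In a real system the selection rule could equally be sampling or beam search; greedy argmax is used here only because it keeps the example deterministic.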

7. The method for accelerating mixed-precision model inference applied to large language models according to claim 1, characterized in that the method further includes: before obtaining the text input sequence to be processed and the computation graph topology of the pre-trained large language model, static numerical representation bit width compression processing is performed on the initial weight parameter set contained in the pre-trained large language model to generate a compressed weight parameter set in a compressed numerical representation bit width format. Based on the operation type attribute of each computation node unit in the computation graph topology of the pre-trained large language model, a corresponding weight bit width recovery trigger condition flag is generated for each computation node unit. The operation type attributes include a dot product operation type and a nonlinear activation operation type. The compressed weight parameter set and the weight bit width recovery trigger condition flags are stored together in the first dedicated storage area of the first numerical representation bit width operation core, and are likewise stored together in the second dedicated storage area of the second numerical representation bit width operation core; before performing forward propagation processing on the target first subgraph computation node unit in the first numerical representation bit width computation subgraph, the first numerical representation bit width operation core reads the weight bit width recovery trigger condition flag corresponding to the target first subgraph computation node unit.
If the weight bit width recovery trigger condition flag indicates that bit width recovery processing needs to be performed on the weight, the first numerical representation bit width operation core extracts the compressed weight fragment associated with the target first subgraph computation node unit from the first dedicated storage area, and performs numerical representation bit width recovery processing on the compressed weight fragment to obtain the recovered original bit width weight fragment. The first numerical representation bit width operation core uses the recovered original bit width weight fragment in the forward propagation operation of the target first subgraph computation node unit, and releases the first dedicated storage area space occupied by the recovered original bit width weight fragment after completing the forward propagation operation. If the weight bit width recovery trigger condition flag indicates that bit width recovery processing is not required for the weight, the first numerical representation bit width operation core directly extracts the compressed weight fragment associated with the target first subgraph computation node unit from the first dedicated storage area, and uses the compressed weight fragment directly in the forward propagation operation of the target first subgraph computation node unit; before performing forward propagation processing on the target second subgraph computation node unit in the second numerical representation bit width computation subgraph, the second numerical representation bit width operation core reads the weight bit width recovery trigger condition flag corresponding to the target second subgraph computation node unit.
If the weight bit width recovery trigger condition flag indicates that bit width recovery processing needs to be performed on the weight, the second numerical representation bit width operation core extracts the compressed weight fragment associated with the target second subgraph computation node unit from the second dedicated storage area, and performs numerical representation bit width recovery processing on the compressed weight fragment to obtain the recovered original bit width weight fragment. If the weight bit width recovery trigger condition flag indicates that bit width recovery processing is not required for the weight, the second numerical representation bit width operation core directly extracts the compressed weight fragment associated with the target second subgraph computation node unit from the second dedicated storage area, and uses the compressed weight fragment directly in the forward propagation operation of the target second subgraph computation node unit.
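The flag-driven weight handling in claim 7 might look roughly like the sketch below. Several things here are assumptions made for illustration: compression is modelled as symmetric integer quantization with a single `scale`, the `RECOVER_BY_OP` policy (which operation type requests recovery) is invented because the claim does not say which types set the flag, and a node's computation is reduced to a single weighted sum.

```python
def compress(weights, scale):
    """Static bit width compression, assumed to be 8-bit symmetric quantization."""
    return [max(-127, min(127, round(w / scale))) for w in weights]

def restore(fragment, scale):
    """Bit width recovery: turn integer codes back into floats."""
    return [q * scale for q in fragment]

# Hypothetical policy mapping operation type to the recovery trigger flag.
RECOVER_BY_OP = {"dot_product": False, "nonlinear_activation": True}

def run_node(node, storage, inputs):
    """Forward one node, recovering its weights only if the flag asks for it."""
    fragment = storage[node["name"]]          # compressed weight fragment
    if RECOVER_BY_OP[node["op"]]:
        weights = restore(fragment, node["scale"])
        out = sum(x * w for x, w in zip(inputs, weights))
        del weights                           # release the recovered copy
    else:
        # Use the compressed fragment directly and rescale the result once.
        out = sum(x * q for x, q in zip(inputs, fragment)) * node["scale"]
    return out

# Usage with a made-up node and storage area.
scale = 0.25
storage = {"n0": compress([0.5, -0.25], scale)}
node = {"name": "n0", "op": "dot_product", "scale": scale}
y = run_node(node, storage, [3.0, 2.0])
```

Keeping only the compressed fragment resident and materializing the full-width copy per node is what bounds peak memory in this picture; the recovered copy lives only for the duration of one node's forward step.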

8. The method for accelerating inference using mixed-precision models applied to large language models according to claim 3, characterized in that the method further includes: during the forward propagation processing of the first numerical representation bit width calculation subgraph by the first numerical representation bit width operation core, the activation value amplitude statistics and activation value fluctuation range statistics of the first subgraph activation tensor output by each first subgraph calculation node unit in the first numerical representation bit width calculation subgraph are continuously collected. During the forward propagation processing of the second numerical representation bit width calculation subgraph by the second numerical representation bit width operation core, the activation value amplitude statistics and activation value fluctuation range statistics of the second subgraph activation tensor output by each second subgraph calculation node unit in the second numerical representation bit width calculation subgraph are continuously collected. The activation value amplitude statistics and activation value fluctuation range statistics of the first subgraph activation tensor, together with those of the second subgraph activation tensor, are input into the activation value distribution comparison and analysis process to generate a numerical representation bit width dynamic adjustment instruction sequence for the computation graph topology. The numerical representation bit width dynamic adjustment instruction sequence includes bit width increase instructions and bit width decrease instructions for target computing node units.
The bit width increase instruction is used to migrate the target computing node unit from the second numerical representation bit width calculation subgraph to the first numerical representation bit width calculation subgraph, and the bit width decrease instruction is used to migrate the target computing node unit from the first numerical representation bit width calculation subgraph to the second numerical representation bit width calculation subgraph. In response to the instruction to increase bit width occupancy, the first target computing node unit to be migrated is located in the second numerical representation bit width calculation subgraph, and the data flow direction association edge connection between the first target computing node unit and other computing node units in the second numerical representation bit width calculation subgraph is disconnected. In the first numerical representation bit width calculation subgraph, identify the upstream and downstream related computing node units that have an upstream and downstream dependency relationship with the first target computing node unit, insert the first target computing node unit into the data flow path between the upstream and downstream related computing node units, and reconstruct the data flow related edge connection. The cross-subgraph data interaction mapping table is updated by modifying the target port of the second cross-subgraph data transmission unit, which originally pointed to the first target computing node unit, to the new port identifier of the first target computing node unit in the first numerical representation bit width computing subgraph. 
In response to the instruction to reduce bit width occupancy, the second target computing node unit to be migrated is located in the first numerical representation bit width calculation subgraph, and the data flow direction association edge connection between the second target computing node unit and other computing node units in the first numerical representation bit width calculation subgraph is disconnected. In the second numerical representation bit width calculation subgraph, identify the upstream and downstream related computing node units that have an upstream and downstream dependency relationship with the second target computing node unit, insert the second target computing node unit into the data flow path between the upstream and downstream related computing node units, and reconstruct the data flow related edge connection. The cross-subgraph data interaction mapping table is updated, and the target port of the first cross-subgraph data transmission unit that originally pointed to the second target computing node unit is modified to the new port identifier of the second target computing node unit in the second numerical representation bit-width computing subgraph.
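The statistics-driven migration in claim 8 can be sketched as below. The thresholds `raise_at` and `lower_at`, the rule of comparing the larger of amplitude and fluctuation range against them, and the port naming scheme are all assumptions; the claim only states that the statistics feed a comparison and analysis process. Edge rewiring inside each subgraph is left out, so the sketch covers just the membership change and the mapping table update.

```python
def collect_stats(activations):
    """Amplitude and fluctuation range of one node's activation tensor."""
    amplitude = max(abs(a) for a in activations)
    fluctuation = max(activations) - min(activations)
    return amplitude, fluctuation

def plan_adjustments(node_stats, high, low, raise_at, lower_at):
    """Build the bit width adjustment instruction sequence (assumed rule)."""
    instructions = []
    for name, (amplitude, fluctuation) in node_stats.items():
        spread = max(amplitude, fluctuation)
        if name in low and spread > raise_at:
            instructions.append(("increase", name))   # low -> high precision
        elif name in high and spread < lower_at:
            instructions.append(("decrease", name))   # high -> low precision
    return instructions

def apply_adjustments(instructions, high, low, mapping_table):
    """Move nodes between subgraphs and repoint cross-subgraph target ports."""
    for kind, name in instructions:
        if kind == "increase":
            src, dst, tag = low, high, "high"
        else:
            src, dst, tag = high, low, "low"
        src.remove(name)
        dst.add(name)
        for source_port, (_, target_node) in list(mapping_table.items()):
            if target_node == name:
                mapping_table[source_port] = (f"{tag}:{name}:in", name)

# Usage with made-up node names and thresholds.
high, low = {"n1"}, {"n2"}
mapping_table = {"p_out": ("low:n2:in", "n2")}
node_stats = {"n1": collect_stats([0.1, 0.2]), "n2": collect_stats([-8.0, 9.0])}
plan = plan_adjustments(node_stats, high, low, raise_at=10.0, lower_at=0.5)
apply_adjustments(plan, high, low, mapping_table)
```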

9. The method for accelerating inference in mixed-precision models applied to large language models according to claim 3, characterized in that, The method further includes: Before the first numerical representation bit-width operation core and the second numerical representation bit-width operation core perform synchronous forward propagation processing, a synchronous scheduling event sequence corresponding to the cross-subgraph data interaction mapping table is constructed, and each scheduling event unit in the synchronous scheduling event sequence is bound to a cross-subgraph tensor transfer operation. When the first numerical representation bit width operation core executes to the first type of edge computing node unit in the first numerical representation bit width calculation subgraph, it writes the first type of edge output tensor generated by the first type of edge computing node unit to the shared buffer storage area accessed by the first numerical representation bit width operation core and the second numerical representation bit width operation core, and records the write completion status. After confirming that the first type of edge output tensor has been written, the first numerical representation bit width operation core suspends the forward propagation process located downstream of the first type of edge computing node unit in the first numerical representation bit width calculation subgraph, and waits for the synchronization scheduling event unit associated with the first type of edge output tensor to be triggered. When the second numerical representation bit width operation core executes to the second type of edge computing node unit in the second numerical representation bit width calculation subgraph, it writes the second type of edge output tensor generated by the second type of edge computing node unit to the shared buffer storage area and records the writing completion status. 
After confirming that the second type of edge output tensor has been written, the second numerical representation bit width operation core suspends the forward propagation process located downstream of the second type of edge computing node unit in the second numerical representation bit width calculation subgraph, and waits for the synchronization scheduling event unit associated with the second type of edge output tensor to be triggered. When all scheduling event units bound to cross-subgraph transmission operations corresponding to the same pair of virtual output ports and virtual input ports and having data dependencies are detected to be ready for write completion, the synchronous scheduling controller broadcasts a continue execution instruction to the first numerical representation bit-width operation core and the second numerical representation bit-width operation core. After the first numerical representation bit width arithmetic core receives the continue execution instruction, it releases the waiting state of the suspended forward propagation processing flow in the first numerical representation bit width calculation subgraph and continues to execute the forward propagation processing of the subsequent first subgraph calculation node unit. 
After receiving the continue execution instruction, the second numerical representation bit width operation core releases the waiting state of the suspended forward propagation processing flow in the second numerical representation bit width calculation subgraph and continues to execute the forward propagation processing of the subsequent second subgraph calculation node unit; during the forward propagation process, the first numerical representation bit width operation core reads the second type of edge output tensor written by the second numerical representation bit width operation core from the shared buffer storage area, and uses it as the input tensor of the corresponding first type of receiving computing node unit to participate in subsequent operations. During the forward propagation process, the second numerical representation bit width operation core reads the first type of edge output tensor written by the first numerical representation bit width operation core from the shared buffer storage area and uses it as the input tensor of the corresponding second type of receiving computing node unit to participate in subsequent operations.
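The write, suspend, broadcast and resume handshake of claim 9 can be mimicked with two threads and standard-library events. This is only an analogy under assumptions: real operation cores would be hardware units or device streams, the `run_pair` function and the `+ 1` downstream computation are invented, and a single scheduling event pair is shown rather than a full event sequence.

```python
import threading

def run_pair(first_edge, second_edge, timeout=5.0):
    shared = {}                                   # shared buffer storage area
    written = {"first": threading.Event(),        # write completion status
               "second": threading.Event()}
    resume = threading.Event()                    # "continue execution" broadcast
    results = {}

    def core(name, peer, edge_tensor):
        shared[name] = edge_tensor                # write edge output tensor
        written[name].set()                       # record write completion
        if resume.wait(timeout):                  # suspend downstream processing
            # Resume: read the peer's edge tensor as this core's input.
            results[name] = [x + 1 for x in shared[peer]]

    def controller():
        # Broadcast only once every bound transfer reports write completion.
        if all(event.wait(timeout) for event in written.values()):
            resume.set()

    threads = [
        threading.Thread(target=core, args=("first", "second", first_edge)),
        threading.Thread(target=core, args=("second", "first", second_edge)),
        threading.Thread(target=controller),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

out = run_pair([1, 2], [10])
```

Because neither core reads the shared buffer until the controller has seen both writes, each one is guaranteed to find its peer's tensor in place when it resumes.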

10. A mixed-precision model inference acceleration system applied to large language models, characterized in that it includes: a processor; and a machine-readable storage medium for storing machine-executable instructions of the processor; wherein the processor is configured to execute the mixed-precision model inference acceleration method applied to large language models according to any one of claims 1 to 9 by executing the machine-executable instructions.