Method and system for dynamic allocation of computing power in intelligent model inference process
By dynamically allocating computing resources and generating value assessment indicators based on the uncertainty measurement of intermediate output data and data dependencies, the problem of improper allocation of computing resources during the reasoning process of artificial intelligence models is solved, and a balance between computing accuracy and efficiency is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING SHANGYUN DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies lack the fine perception and adaptability to the real-time state of the reasoning process in artificial intelligence models, leading to improper allocation of computing resources and affecting computational accuracy and efficiency.
The uncertainty measure of the computing unit is determined by obtaining the distribution dispersion and numerical stability of the intermediate output data. The value assessment index is generated by combining the data dependencies between computing units, computing resources are dynamically allocated, and the system switches to full computing accuracy when the accumulated error exceeds the tolerance value.
It achieves the goal of prioritizing the computational accuracy of key computational steps under limited computing resources, controlling the overall computational error within an acceptable range, and improving computational efficiency and the reliability of inference results.
Figure CN122242743A_ABST
Description
Technical Field
[0001] This invention relates to computing power allocation technology, and more particularly to a method and system for dynamic allocation of computing power during the reasoning process of intelligent models.
Background Technology
[0002] In the field of artificial intelligence model inference, especially when dealing with complex multi-level computational models, how to efficiently and rationally allocate limited computing resources is a core challenge.
[0003] Conventional static or rule-based dynamic allocation methods have significant limitations. The main drawback lies in their lack of fine-grained awareness and adaptability to the real-time state of the inference process. Since the resource allocation scheme is determined before task execution, it cannot be adjusted based on the dynamic characteristics of intermediate results generated during actual computation. Different input data can lead to significant differences in the distribution of intermediate features. A pre-defined fixed strategy may result in wasted computational power on some inputs, while on others, improper resource allocation may introduce excessive errors, affecting the final inference accuracy. Furthermore, conventional methods often treat individual computational units in isolation, ignoring the complex data dependencies between them. The output error of one computational unit may be amplified in subsequent calculations, and simple allocation rules struggle to assess and quantify the impact of this error propagation. This can lead to locally optimized allocation decisions having uncontrollable negative impacts on the overall task result, failing to achieve an optimal balance between overall computational accuracy and resource consumption.
Summary of the Invention
[0004] The present invention provides a method and system for dynamic allocation of computing power in the reasoning process of intelligent models, which can solve the problems in the prior art.
[0005] A first aspect of this invention provides a method for dynamic allocation of computing power during the inference process of an intelligent model, comprising:
[0006] Obtain the multi-level computational model corresponding to the task to be reasoned, wherein the multi-level computational model contains multiple computational units arranged in the order of execution;
[0007] During the inference process, for each completed computation unit, the intermediate output data is extracted. The uncertainty measure of the current computation unit is determined based on the distribution dispersion and numerical stability of the intermediate output data. The influence range of the current computation unit on subsequent computation units is determined based on the data dependency relationship between computation units. The uncertainty measure and the influence range are weighted and combined to generate a value assessment index.
[0008] The computational units to be executed are sorted according to the value assessment index and allocated computing resources. Computational units ranked within a preset top proportion are allocated to the first computing resource pool, which runs at full computational precision, while the lower-ranked computational units are allocated to the second computing resource pool, which runs at reduced computational precision. The processing capacity of the first computing resource pool is higher than that of the second computing resource pool.
[0009] The subsequent computing units are executed according to the computing power resource allocation results. During the execution process, the output error caused by the reduced precision is accumulated. When the accumulated error exceeds the preset tolerance value, the subsequent computing units are switched to full computing precision. After each computing unit is completed, the above intermediate output data extraction, value assessment, resource allocation and execution process is repeated until the inference task is completed.
[0010] The steps of determining the uncertainty measure of the current computing unit based on the distribution dispersion and numerical stability of intermediate output data, determining the influence range of the current computing unit on subsequent computing units based on the data dependencies between computing units, and generating a value assessment index by weighted combination of the uncertainty measure and the influence range include:
[0011] Obtain intermediate output data from completed computation units, calculate the standard deviation and gradient magnitude of the intermediate output data, and generate an uncertainty measure by weighted summation of the standard deviation and gradient magnitude.
[0012] Construct a computation graph for the multi-level computation model, where nodes in the computation graph correspond to computation units, and directed edges represent data dependencies between computation units;
[0013] Locate the current node corresponding to the current computing unit in the computation graph, perform a breadth-first traversal along the directed edges from the current node, record the level distance of each visited subsequent node, and obtain the set of reachable subsequent nodes.
[0014] Identify the output node corresponding to the final output layer of the multi-level computation model in the computation graph; for each subsequent node in the set of subsequent nodes, calculate the shortest path length from the subsequent node to the output node;
[0015] For each subsequent node, a distance influence factor is determined based on the hierarchical distance of that subsequent node. The distance influence factor is then multiplied by the inverse of the shortest path length from that subsequent node to the output node to generate the influence weight of that subsequent node.
[0016] The influence weights of all subsequent nodes in the set of subsequent nodes are summed and normalized to generate the influence range of the current computing unit on subsequent computing units.
[0017] The value assessment index is generated by multiplying the uncertainty measure by the influence range.
[0018] The steps for determining the distance influence factor based on the hierarchical distance of the subsequent node include:
[0019] The hierarchical distance of the subsequent node is used as the input of the exponential decay function to calculate the decay coefficient. The exponential decay function makes the decay coefficient of the subsequent node with a larger hierarchical distance smaller.
[0020] Obtain the type identifier of the computation unit corresponding to the subsequent node, and query the corresponding type weight coefficient from the preset type weight mapping table according to the type identifier; in the type weight mapping table, the convolution type computation unit corresponds to the first type weight coefficient, the attention type computation unit corresponds to the second type weight coefficient, and the fully connected type computation unit corresponds to the third type weight coefficient.
[0021] The distance influence factor is generated by multiplying the attenuation coefficient with the type weight coefficient.
[0022] The steps of sorting and allocating computing resources to subsequent computing units based on the value assessment indicators include:
[0023] All subsequent computational units to be executed are arranged in descending order of their corresponding value assessment indicators to generate an initial sorting sequence;
[0024] Traverse the initial sorting sequence, and for each computing unit, identify its direct predecessor computing unit according to the execution order relationship of the multi-level computing model; generate a priority allocation identifier for the current computing unit based on the resource pool allocation status and data transmission characteristics of the direct predecessor computing unit;
[0025] The initial sorting sequence is adjusted according to the priority allocation identifier, and the positions of the computing units with the priority allocation identifier are moved forward to generate the adjusted sorting sequence.
[0026] The allocation boundary position is calculated based on the total number of the adjusted sorted sequence and the preset ratio. The calculation units before the allocation boundary position are extracted to form a first allocation set, and the remaining calculation units are extracted to form a second allocation set.
[0027] The computing units in the first allocation set are set to full-width floating-point numbers and submitted to the first computing power resource pool, while the computing units in the second allocation set are set to reduced-width fixed-point numbers and submitted to the second computing power resource pool.
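The sorting and splitting steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `split_by_value`, the `ratio` parameter name, rounding the boundary up, and moving flagged units to the very front while otherwise keeping their relative order are all illustrative choices that the text does not specify.

```python
import math

def split_by_value(values, priority_flags, ratio):
    """Split pending units into a first (full-precision) and second (reduced-precision) set.

    values: dict mapping unit id -> value assessment index
    priority_flags: set of unit ids that carry a priority allocation identifier
    ratio: preset proportion of units sent to the first pool (assumed to lie in [0, 1])
    """
    # Initial sorting sequence: descending value assessment index.
    initial = sorted(values, key=lambda u: values[u], reverse=True)
    # Adjusted sequence: flagged units move forward (assumption: to the front),
    # and the relative order inside each group is preserved.
    adjusted = ([u for u in initial if u in priority_flags]
                + [u for u in initial if u not in priority_flags])
    # Allocation boundary from the sequence length and the preset ratio
    # (rounding up is an assumption).
    boundary = math.ceil(len(adjusted) * ratio)
    return adjusted[:boundary], adjusted[boundary:]

first_set, second_set = split_by_value(
    {"u1": 0.9, "u2": 0.2, "u3": 0.5, "u4": 0.1}, {"u2"}, ratio=0.5)
```

In this example the flagged unit `u2` is promoted ahead of `u1`, so the first allocation set is `["u2", "u1"]` and the second is `["u3", "u4"]`.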
[0028] The steps for generating a priority allocation identifier for the current computing unit based on the resource pool allocation status and data transmission characteristics of the direct predecessor computing unit include:
[0029] For each direct predecessor computing unit, query the resource pool allocation status of the direct predecessor computing unit, and extract the direct predecessor computing units that have been marked as allocated to the first computing power resource pool to form the direct predecessor subset of the first resource pool.
[0030] For each direct predecessor computing unit in the first resource pool direct predecessor subset, the predecessor weight is calculated based on the amount of intermediate data transmitted from the direct predecessor computing unit to the current computing unit.
[0031] Sum the predecessor weights of all direct predecessor computation units in the first resource pool's direct predecessor subset to generate a predecessor influence score;
[0032] The predecessor influence score is compared with a preset influence threshold. If the predecessor influence score is greater than the preset influence threshold, a priority allocation flag is added to the current computing unit.
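A minimal Python sketch of the priority-flag decision described above is given here for illustration. The function and parameter names are hypothetical, and using the raw transmitted data amount as the predecessor weight is an assumption; the text only states that the weight is calculated from that amount.

```python
def has_priority_flag(predecessors, pool_of, data_volume, threshold):
    """Decide whether the current unit receives a priority allocation flag.

    predecessors: ids of the direct predecessor units of the current unit
    pool_of: dict mapping unit id -> 1 (first pool) or 2 (second pool)
    data_volume: dict mapping predecessor id -> amount of intermediate data
                 it transmits to the current unit
    threshold: preset influence threshold
    """
    # Keep only direct predecessors already allocated to the first pool.
    first_pool_preds = [p for p in predecessors if pool_of.get(p) == 1]
    # Assumption: the predecessor weight equals the transmitted data amount.
    score = sum(data_volume[p] for p in first_pool_preds)
    # The flag is added only when the score strictly exceeds the threshold.
    return score > threshold

flag = has_priority_flag(
    ["a", "b", "c"], {"a": 1, "b": 2, "c": 1},
    {"a": 4.0, "b": 10.0, "c": 3.0}, threshold=6.0)
```

Here only `a` and `c` count, giving a predecessor influence score of 7.0, which exceeds the threshold of 6.0.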
[0033] The step of accumulating, during execution, the output error caused by reduced-precision execution includes:
[0034] For each computing unit that completes its computation in the second computing power resource pool using reduced-bit-width fixed-point numbers, obtain its actual output result and the corresponding theoretical full-precision output result, and calculate the numerical difference between the two to generate the unit output error;
[0035] Obtain the influence range value corresponding to the calculation unit, and multiply the unit output error with the influence range value to generate a weighted unit error;
[0036] The weighted unit error is added to the error accumulation sequence, which stores the weighted unit errors in the order of the execution completion time of the computing units;
[0037] For each weighted unit error in the error accumulation sequence, a time decay coefficient is calculated based on the time interval between the execution completion time of the calculation unit corresponding to the weighted unit error and the current time.
[0038] The accumulated error is generated by multiplying each weighted unit error in the error accumulation sequence by its corresponding time decay coefficient and then summing the results.
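The error accumulation steps above can be illustrated with the following Python sketch. Two points are assumptions, since the text does not fix them: the "numerical difference" is taken as the mean absolute difference between the two outputs, and the time decay coefficient is taken as exp(-decay_rate × elapsed time). All function and parameter names are hypothetical.

```python
import math

def weighted_unit_error(actual, reference, influence_range):
    """Unit output error scaled by the unit's influence range value."""
    # Assumption: numerical difference = mean absolute element-wise difference.
    diff = sum(abs(a - r) for a, r in zip(actual, reference)) / len(actual)
    return diff * influence_range

def accumulated_error(error_sequence, now, decay_rate):
    """Sum of time-decayed weighted unit errors.

    error_sequence: list of (completion_time, weighted_unit_error),
                    stored in order of completion time
    now: current time, in the same unit as completion_time
    decay_rate: rate of the assumed exponential time decay
    """
    total = 0.0
    for completed_at, weighted_error in error_sequence:
        # Older errors receive a smaller time decay coefficient.
        coeff = math.exp(-decay_rate * (now - completed_at))
        total += weighted_error * coeff
    return total

sequence = [(0.0, 0.2), (1.0, 0.1)]
total = accumulated_error(sequence, now=1.0, decay_rate=math.log(2))
```

With a decay rate of ln 2 per time unit, the error recorded one time unit ago is halved, so the accumulated error is 0.2 × 0.5 + 0.1 = 0.2. This value would then be compared with the preset tolerance value.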
[0039] The steps for switching subsequent calculation units to full calculation accuracy when the accumulated error exceeds a preset tolerance value include:
[0040] When the accumulated error exceeds a preset tolerance value, a precision switching process is triggered.
[0041] Traverse the error accumulation sequence; for each weighted unit error, extract the identifier of the corresponding calculation unit, obtain the influence range value of that calculation unit in the multi-level calculation model, and multiply the weighted unit error by the influence range value to generate the error contribution.
[0042] The calculation units are sorted in descending order of error contribution, and a predetermined number of calculation units with the highest error contribution are extracted to form a high error contribution calculation unit set.
[0043] For each calculation unit in the high error contribution calculation unit set, identify all of its subsequent calculation units in the execution order of the multi-level calculation model; these form the switching target calculation unit set;
[0044] For each computing unit in the set of target computing units to be switched, if the computing unit is currently allocated to the second computing power resource pool, then the computing precision parameter of the computing unit is switched from a reduced bit width fixed-point number to a full bit width floating-point number, and the computing unit is resubmitted to the first computing power resource pool.
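The precision switching process above might be sketched in Python as follows. This is an illustrative sketch only: the function name, the dictionary-based data structures, and the summing of contributions when a unit appears more than once in the sequence are assumptions rather than details given in the text.

```python
def switch_targets(error_sequence, influence_range, successors, pool_of, top_k):
    """Move successors of the highest-contributing units to the first pool.

    error_sequence: list of (unit_id, weighted_unit_error)
    influence_range: dict mapping unit id -> influence range value
    successors: dict mapping unit id -> units that follow it in execution order
    pool_of: dict mapping unit id -> 1 or 2; updated in place
    top_k: predetermined number of high-contribution units to select
    """
    # Error contribution = weighted unit error x influence range value.
    contribution = {}
    for unit, weighted_error in error_sequence:
        contribution[unit] = (contribution.get(unit, 0.0)
                              + weighted_error * influence_range[unit])
    # High error contribution set: top_k units by descending contribution.
    top = sorted(contribution, key=contribution.get, reverse=True)[:top_k]
    # Switching target set: every subsequent unit of the selected units.
    targets = set()
    for unit in top:
        targets.update(successors.get(unit, []))
    # Only units currently in the second pool are resubmitted to the first.
    switched = sorted(u for u in targets if pool_of.get(u) == 2)
    for u in switched:
        pool_of[u] = 1
    return switched

pools = {"c": 2, "d": 1}
switched = switch_targets(
    [("a", 0.3), ("b", 0.1)], {"a": 0.5, "b": 0.9},
    {"a": ["c", "d"], "b": ["d"]}, pools, top_k=1)
```

In this example unit `a` has the larger contribution (0.15 versus 0.09), so its successors `c` and `d` become switching targets, and only `c`, which was in the second pool, is switched.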
[0045] A second aspect of the present invention provides a dynamic computing power allocation system for intelligent model inference, comprising:
[0046] The model acquisition unit is used to acquire the multi-level computing model corresponding to the task to be reasoned, wherein the multi-level computing model contains multiple computing units arranged in the order of execution.
[0047] The value assessment unit is used to extract intermediate output data for each completed calculation unit during the execution of inference, determine the uncertainty measure of the current calculation unit based on the distribution dispersion and numerical stability of the intermediate output data, determine the influence range of the current calculation unit on subsequent calculation units based on the data dependency relationship between calculation units, and generate a value assessment index by weighting and combining the uncertainty measure and the influence range.
[0048] The resource allocation unit is used to sort the subsequent computing units to be executed according to the value assessment index and allocate computing resources. The computing units ranked within a preset top proportion are allocated to the first computing resource pool with full computing precision, and the lower-ranked computing units are allocated to the second computing resource pool with reduced computing precision. The processing capacity of the first computing resource pool is higher than that of the second computing resource pool.
[0049] The execution monitoring unit is used to execute subsequent calculation units according to the computing power resource allocation results. During the execution process, it accumulates the output error caused by the reduced precision execution. When the accumulated error exceeds the preset tolerance value, the subsequent calculation unit is switched to full calculation precision.
[0050] A third aspect of the present invention provides an electronic device, comprising:
[0051] A processor;
[0052] A memory for storing processor-executable instructions;
[0053] Wherein the processor is configured to invoke the instructions stored in the memory to execute the aforementioned method.
[0054] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.
[0055] This method allocates differentiated precision and computing power to subsequent computing units based on the ranking results of value assessment indicators. The dynamic allocation mechanism allows limited computing resources to be concentrated on the computational stages that have a greater impact on the final inference result.
[0056] During reduced-precision execution, the accumulated output error is continuously monitored. Once the accumulated error exceeds a preset tolerance value, subsequent calculation units are automatically switched back to full calculation precision. This ensures that while pursuing efficiency, the overall calculation error is strictly controlled within an acceptable range, guaranteeing the reliability of the inference results. This error feedback control mechanism achieves an adaptive balance between calculation precision and efficiency.
Attached Figure Description
[0057] Figure 1 is a flowchart illustrating the dynamic allocation of computing power during the intelligent model reasoning process in an embodiment of the present invention.
[0058] Figure 2 is a flowchart for dynamic resource allocation and accuracy optimization of computing units based on value assessment indicators.
Detailed Implementation
[0059] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0060] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes will not be repeated in some embodiments.
[0061] Figure 1 is a flowchart illustrating the dynamic allocation of computing power during the intelligent model reasoning process in an embodiment of the present invention. As shown in Figure 1, the method includes:
[0062] Obtain the multi-level computational model corresponding to the task to be reasoned, wherein the multi-level computational model contains multiple computational units arranged in the order of execution;
[0063] During the inference process, for each completed computation unit, the intermediate output data is extracted. The uncertainty measure of the current computation unit is determined based on the distribution dispersion and numerical stability of the intermediate output data. The influence range of the current computation unit on subsequent computation units is determined based on the data dependency relationship between computation units. The uncertainty measure and the influence range are weighted and combined to generate a value assessment index.
[0064] The computational units to be executed are sorted according to the value assessment index and allocated computing resources. Computational units ranked within a preset top proportion are allocated to the first computing resource pool, which runs at full computational precision, while the lower-ranked computational units are allocated to the second computing resource pool, which runs at reduced computational precision. The processing capacity of the first computing resource pool is higher than that of the second computing resource pool.
[0065] The subsequent computing units are executed according to the computing power resource allocation results. During the execution process, the output error caused by the reduced precision is accumulated. When the accumulated error exceeds the preset tolerance value, the subsequent computing units are switched to full computing precision.
[0066] After each computational unit is completed, the process of extracting intermediate output data, evaluating value, allocating resources, and executing is repeated until the reasoning task is completed.
[0067] In one optional implementation, the steps of determining the uncertainty measure of the current computing unit based on the distribution dispersion and numerical stability of intermediate output data, determining the influence range of the current computing unit on subsequent computing units based on the data dependencies between computing units, and generating a value assessment index by weighted combination of the uncertainty measure and the influence range include:
[0068] Obtain intermediate output data from completed computation units, calculate the standard deviation and gradient magnitude of the intermediate output data, and generate an uncertainty measure by weighted summation of the standard deviation and gradient magnitude.
[0069] Construct a computation graph for the multi-level computation model, where nodes in the computation graph correspond to computation units, and directed edges represent data dependencies between computation units;
[0070] Locate the current node corresponding to the current computing unit in the computation graph, perform a breadth-first traversal along the directed edges from the current node, record the level distance of each visited subsequent node, and obtain the set of reachable subsequent nodes.
[0071] Identify the output node corresponding to the final output layer of the multi-level computational model in the computation graph;
[0072] For each subsequent node in the set of subsequent nodes, calculate the shortest path length from that subsequent node to the output node;
[0073] For each subsequent node, a distance influence factor is determined based on the hierarchical distance of that subsequent node. The distance influence factor is then multiplied by the inverse of the shortest path length from that subsequent node to the output node to generate the influence weight of that subsequent node.
[0074] The influence weights of all subsequent nodes in the set of subsequent nodes are summed and normalized to generate the influence range of the current computing unit on subsequent computing units.
[0075] The value assessment index is generated by multiplying the uncertainty measure by the influence range.
[0076] For example, intermediate output data refers to the output that a computation unit produces on completing execution and that must be passed to subsequent computation units, such as the activation value tensor of a certain layer in a neural network, the feature map matrix of a convolutional layer, or the vector output of a fully connected layer. Statistical analysis is performed on this data: the intermediate output data is organized into a vector x = (x_1, x_2, ..., x_n), and the standard deviation σ of the vector is calculated. The standard deviation reflects the degree of fluctuation in the output data; a larger value indicates that the results produced by this computational unit have high dispersion. Simultaneously, the gradient information of this computational unit in the backpropagation path is extracted, and the L2 norm of the gradient vector is calculated as the gradient magnitude g. The gradient magnitude reflects the sensitivity of the computational unit to model parameter updates; a larger magnitude means that changes in the unit's output have a significant impact on the overall model. The uncertainty measure U = α·σ + β·g is obtained by weighting the standard deviation σ and the gradient magnitude g with preset weighting coefficients α and β, where α + β = 1. In this embodiment, the preset weight coefficient α is 0.6 and β is 0.4. This ratio is set on the basis of empirical statistics to balance data distribution characteristics against gradient sensitivity.
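As a worked illustration of the weighted sum described above, the following Python sketch computes an uncertainty measure from a flattened output vector and a gradient vector. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption; the default weights 0.6 and 0.4 are the values given in this embodiment.

```python
import math
import statistics

def uncertainty_measure(outputs, gradient, alpha=0.6, beta=0.4):
    """Weighted sum of the output standard deviation and the gradient L2 norm.

    outputs: intermediate output data flattened into a list of numbers
    gradient: gradient vector of the unit, flattened into a list of numbers
    alpha, beta: preset weights, expected to satisfy alpha + beta = 1
    """
    # Assumption: population standard deviation of the output vector.
    sigma = statistics.pstdev(outputs)
    # Gradient magnitude = L2 norm of the gradient vector.
    grad_magnitude = math.sqrt(sum(g * g for g in gradient))
    return alpha * sigma + beta * grad_magnitude

u = uncertainty_measure([1.0, 3.0], [3.0, 4.0])
```

For the outputs (1, 3) the standard deviation is 1, and for the gradient (3, 4) the L2 norm is 5, so the measure is 0.6 × 1 + 0.4 × 5 = 2.6.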
[0077] In the multi-level computational model, each computational unit is mapped to a node in the computational graph, with the node number corresponding to the execution order of the computational units. The data flow relationships between computational units are analyzed; when the output of computational unit A is used as the input of computational unit B, a directed edge is established between node A and node B. After mapping all computational units, a directed acyclic graph describing the computational dependencies of the entire model is formed.
[0078] In the computation graph, the node corresponding to the currently completed computation unit is identified as the starting node. A breadth-first traversal strategy is used to search for subsequent affected nodes. The level distance is defined as the minimum number of edges required to reach a subsequent node from the starting node; the level distance of the starting node itself is 0. A queue is initialized, and the starting node is enqueued, with its level distance marked as 0. A node is removed from the queue, and all outgoing edges pointing to its successor nodes are traversed. If a successor node has not been visited, it is enqueued, and its level distance is marked as the current node's level distance plus 1. This process is repeated until the queue is empty. All visited nodes and their level distances are recorded, forming a set of reachable subsequent nodes.
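The breadth-first traversal described above can be sketched in Python as follows. The adjacency-dictionary representation of the computation graph and the function name are assumptions made for illustration; the starting node is excluded from the returned set because it is not its own successor.

```python
from collections import deque

def reachable_successors(graph, start):
    """Return {node: level distance} for every node reachable from start.

    graph: dict mapping node -> list of successor nodes (directed edges)
    start: node of the computation unit that has just completed
    """
    level = {start: 0}            # the starting node has level distance 0
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in level:             # visit each node only once
                level[succ] = level[node] + 1
                queue.append(succ)
    # Drop the starting node so that only subsequent nodes remain.
    return {n: d for n, d in level.items() if n != start}

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
levels = reachable_successors(graph, "A")
```

For this small graph, `B` and `C` are at level distance 1 and `D` is at level distance 2.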
[0079] The computational unit corresponding to the final output layer of the multi-level computational model is identified, and the corresponding node in the computational graph is marked as the output node. For each node in the subsequent node set, Dijkstra's algorithm is used to calculate the shortest path length from that node to the output node. Specifically, with the node to be computed as the source node and the output node as the target node, the distance of the source node is initialized to 0, and the distances of the other nodes are set to infinity. The node with the smallest distance among the currently unvisited nodes is iteratively selected, and a relaxation operation is performed to update the distances of its adjacent nodes until the target node is visited. The distance value obtained at this point is the shortest path length. This length represents the minimum number of computational steps required to propagate from the computation result of the current node to the final model output.
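The shortest-path step above can be illustrated with a small Dijkstra sketch in Python using the standard `heapq` module. Treating every edge as having weight 1, so that the path length counts computational steps, is an assumption consistent with the description; the function name and graph representation are hypothetical.

```python
import heapq

def shortest_path_length(graph, source, target):
    """Length of the shortest directed path from source to target, or None.

    graph: dict mapping node -> list of successor nodes
    """
    dist = {source: 0}            # all other nodes are implicitly at infinity
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)   # unvisited node with smallest distance
        if node in visited:
            continue
        visited.add(node)
        if node == target:              # stop once the output node is visited
            return d
        for succ in graph.get(node, ()):
            nd = d + 1                  # assumption: each edge costs one step
            if nd < dist.get(succ, float("inf")):   # relaxation
                dist[succ] = nd
                heapq.heappush(heap, (nd, succ))
    return None                         # target is not reachable

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
length = shortest_path_length(graph, "B", "E")
```

Here the path from `B` to the output node `E` passes through `D`, giving a length of 2.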
[0080] Calculate the influence weight for each subsequent node. The distance influence factor is based on the node's hierarchical distance *d* and takes the exponential decay form exp(-λd), where λ is the decay rate parameter; the greater the distance, the smaller the influence factor. Obtain the shortest path length *l* from the node to the output node, and calculate its reciprocal 1 / l as the output contribution; the shorter the path, the greater the node's direct contribution to the final output. Multiply the distance influence factor by the output contribution to obtain the influence weight w = exp(-λd)·(1 / l) for the subsequent node.
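The influence weight formula above can be written as a short Python function. This is a sketch under assumptions: the function name is hypothetical, the default λ of 0.3 is only an example value, and the case *l* = 0 (the node is itself the output node) is not covered by the text, so the sketch simply rejects it.

```python
import math

def influence_weight(level_distance, path_length, decay_rate=0.3):
    """Influence weight w = exp(-lambda * d) * (1 / l) of a subsequent node.

    level_distance: hierarchical distance d from the current node
    path_length: shortest path length l from the node to the output node
    decay_rate: lambda in the exponential decay form (example value)
    """
    if path_length <= 0:
        # Assumption: l = 0 is not handled by the described formula.
        raise ValueError("path_length must be positive")
    distance_factor = math.exp(-decay_rate * level_distance)
    output_contribution = 1.0 / path_length
    return distance_factor * output_contribution

w_near = influence_weight(level_distance=1, path_length=2)
w_far = influence_weight(level_distance=3, path_length=2)
```

With the same path length, the node at level distance 1 receives a larger weight than the node at level distance 3, as the decay form requires.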
[0081] The total weight W_total is obtained by summing the influence weights of all nodes in the subsequent node set. The influence weight of each node is then normalized by dividing it by the total weight. The normalized weights are then summed to generate the influence range R of the current computing unit on subsequent computing units. This value is between 0 and 1, reflecting the overall influence strength of the current unit's output on the subsequent computing chain.
[0082] The uncertainty measure U is multiplied by the influence range R to generate the value assessment index V = U × R. This index jointly reflects the uncertainty of the current computing unit's output and the breadth of its impact on subsequent calculations. The larger the value, the more critical the execution quality of the computing unit is to the overall inference result, and the more computing resources should be allocated to ensure its computational accuracy.
[0083] This invention can accurately quantify the contribution of each computing unit to the inference result, avoiding the waste of computing power caused by the equal allocation of resources to all computing units in traditional methods. This ensures that key computing units have sufficient computing accuracy, while secondary computing units save resources by using reduced precision operations. This significantly improves computing efficiency while ensuring the accuracy of model inference.
[0084] In one optional implementation, the step of determining the distance influence factor based on the hierarchical distance of the subsequent node includes:
[0085] The hierarchical distance of the subsequent node is used as the input of the exponential decay function to calculate the decay coefficient. The exponential decay function makes the decay coefficient of the subsequent node with a larger hierarchical distance smaller.
[0086] Obtain the type identifier of the computing unit corresponding to the subsequent node, and query the corresponding type weight coefficient from the preset type weight mapping table based on the type identifier;
[0087] In the type weight mapping table, the convolution type calculation unit corresponds to the first type weight coefficient, the attention type calculation unit corresponds to the second type weight coefficient, and the fully connected type calculation unit corresponds to the third type weight coefficient.
[0088] The distance influence factor is generated by multiplying the attenuation coefficient with the type weight coefficient.
[0089] For example, when determining the distance influence factor, the corresponding distance influence factor is calculated separately for each subsequent node identified in the dependency graph constructed for the current computing unit. Specifically, if the current computing unit is at level k in a multi-level computing model, and a subsequent node is at level m, the level distance is defined as d = m − k. In a directed acyclic graph structure, since computing units are executed in hierarchical order, a node at level m must be on the subsequent path of a node at level k. The shortest edge count obtained by graph traversal is equal to the level difference m − k. Therefore, the two definitions yield the same calculation results in practical applications and there is no conflict. This level distance represents the number of intermediate computing layers required to reach the subsequent node from the current computing unit.
[0090] The hierarchical distance *d* is input into the exponential decay function exp(-λ·d), where λ is the decay rate parameter, typically set between 0.1 and 0.5. For example, with λ = 0.3, the decay coefficient is exp(-0.3×1)≈0.74 when *d* is 1, exp(-0.3×3)≈0.41 when *d* is 3, exp(-0.3×5)≈0.22 when *d* is 5, and exp(-0.3×8)≈0.09 when *d* is 8. A hierarchical distance of 1 indicates that the subsequent node is the next adjacent computational unit, and the decay coefficient is then at its largest (about 0.74 for λ = 0.3). As *d* grows, the coefficient falls quickly: with λ = 0.3 it is about 0.22 at *d* = 5 and drops below 0.1 once *d* reaches 8. This mechanism reflects the objective law that the direct influence of the output characteristics of the current computational unit on distant subsequent nodes gradually weakens after multiple layers of propagation. The exponential form exp(-λ·d) is a standard decay model well known in the art, the range of the parameter λ from 0.1 to 0.5 is stated explicitly, and the numerical examples above illustrate the decay effect at different distances. Those skilled in the art can select an appropriate value of λ and implement the technical solution based on the actual number of model layers and the desired decay rate.
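The numerical examples for λ = 0.3 can be reproduced with a few lines of Python. The function name is hypothetical; the values of *d* are the ones used in the paragraph above.

```python
import math

def decay_coefficient(level_distance, decay_rate=0.3):
    """Exponential decay coefficient exp(-lambda * d)."""
    return math.exp(-decay_rate * level_distance)

# Decay coefficients for d = 1, 3, 5, 8, rounded to two decimals.
coefficients = {d: round(decay_coefficient(d), 2) for d in (1, 3, 5, 8)}
```

The resulting dictionary is `{1: 0.74, 3: 0.41, 5: 0.22, 8: 0.09}`.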
[0091] The computation unit type identifier is obtained by parsing the configuration file or network structure definition of the multi-level computation model. This identifier explicitly indicates whether the computation unit belongs to a convolution operation, an attention mechanism operation, or a fully connected operation. Based on the obtained type identifier, a pre-established type weight mapping table is accessed to perform a query operation. The type weight mapping table adopts a key-value pair storage structure, using the computation unit type identifier as the index key and the corresponding type weight coefficient as the stored value.
[0092] For convolutional computation units, which perform local feature extraction with a limited receptive field and strong spatial locality of output features, the corresponding first-type weight coefficients are set to values between 0.6 and 0.8. For attention-type computation units, which establish global dependencies through attention weight matrices, they can capture long-distance feature associations, have a wide impact on subsequent layers, and offer flexible weight allocation; therefore, the second-type weight coefficients are set to higher values between 1.0 and 1.2. For fully connected computation units, whose computation involves a globally weighted summation of input features, the output dimension is usually reduced, and their impact on subsequent layers is moderate; the third-type weight coefficients are set to 0.7 to 0.9.
[0093] After obtaining the type weight coefficients, a product operation is performed to multiply the attenuation coefficients by the type weight coefficients. This operation integrates the distance attenuation effect with the inherent influence characteristics of the computation unit type, generating a comprehensive distance influence factor. This distance influence factor typically ranges from 0 to 1.2. When subsequent nodes are close to the current computation unit level and belong to the attention type, the distance influence factor can reach a high value close to 1; when subsequent nodes are far away and belong to the convolution type, the distance influence factor decreases to below 0.1. By calculating the distance influence factors for all subsequent nodes and performing a weighted sum, the influence range of the current computation unit on the overall subsequent computation process can be quantified, serving as a key component of the influence range indicator in the calculation of the value assessment indicator.
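As a hedged illustration of the product described above, the following Python sketch combines the decay coefficient with a key-value type weight table. The table keys and the specific weights (0.7, 1.1, 0.8) are assumed midpoints of the ranges given in the text, and the type is taken to be that of the subsequent node; none of these names come from the application.

```python
import math

# Assumed midpoints of the ranges stated in the text (illustrative only).
TYPE_WEIGHTS = {
    "convolution": 0.7,       # stated range 0.6 to 0.8
    "attention": 1.1,         # stated range 1.0 to 1.2
    "fully_connected": 0.8,   # stated range 0.7 to 0.9
}

def distance_influence_factor(current_level, successor_level, unit_type, decay_rate=0.3):
    """Multiply the exponential decay coefficient by the type weight coefficient."""
    d = successor_level - current_level  # level distance d = m - k
    if d <= 0:
        raise ValueError("successor must be at a later level")
    return math.exp(-decay_rate * d) * TYPE_WEIGHTS[unit_type]

near_attention = distance_influence_factor(2, 3, "attention")
far_convolution = distance_influence_factor(0, 8, "convolution")
print(near_attention, far_convolution)
```

Under these assumed weights, an adjacent attention node keeps most of its influence while a convolution node eight levels away falls below 0.1, matching the qualitative behaviour described in the text.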
[0094] This invention overcomes the limitations of traditional methods that rely solely on hierarchical distance for influence assessment by introducing a weighting coefficient for the type of computational unit to differentiate the distance attenuation factor.
[0095] In one optional implementation, the step of sorting the subsequent computing units to be executed and allocating computing resources according to the value assessment index includes:
[0096] All subsequent computational units to be executed are arranged in descending order of their corresponding value assessment indicators to generate an initial sorting sequence;
[0097] Traverse the initial sorting sequence, and for each computing unit, identify its direct predecessor computing unit according to the execution order relationship of the multi-level computing model;
[0098] Based on the resource pool allocation status and data transmission characteristics of the direct predecessor computing unit, a priority allocation identifier is generated for the current computing unit.
[0099] The initial sorting sequence is adjusted according to the priority allocation identifier, and the positions of the computing units with the priority allocation identifier are moved forward to generate the adjusted sorting sequence.
[0100] The allocation boundary position is calculated based on the total number of computing units in the adjusted sorting sequence and a preset ratio. The computing units before the allocation boundary position are extracted to form a first allocation set, and the remaining computing units form a second allocation set.
[0101] The computing units in the first allocation set are set to full-width floating-point numbers and submitted to the first computing power resource pool, while the computing units in the second allocation set are set to reduced-width fixed-point numbers and submitted to the second computing power resource pool.
[0102] With reference to Figure 2, a flowchart of the dynamic resource allocation and accuracy optimization of computational units based on value assessment indicators is described below. For example, the method collects the identifiers of all computational units currently awaiting execution in the inference task, reads the value assessment indicator for each computational unit, and arranges the computational units in descending order of value to form an initial sorting sequence. Computational units with higher value assessment indicators are placed at the beginning of the sequence, reflecting their greater potential contribution to the accuracy of the inference results.
[0103] After obtaining the initial sorting sequence, a priority adjustment operation based on data transmission characteristics is performed. Each computing unit in the initial sorting sequence is traversed, the topology of the multi-level computing model is parsed, the input data source node of the current computing unit is extracted, and the set of direct predecessor computing units providing the input data is determined. For each direct predecessor computing unit, its currently allocated resource pool type is queried to determine whether the predecessor unit is allocated to the first or second computing resource pool. When the predecessor computing unit is allocated to the first computing resource pool, if the current computing unit is also allocated to the same resource pool, the data transmission between them occurs within the same resource pool, resulting in low data transmission latency and no need for precision conversion. Based on this transmission characteristic, a priority allocation flag is added to the current computing unit; the flag value can be set to a Boolean flag. If the predecessor unit is located in the second computing resource pool and the current unit is allocated to the first resource pool, a format conversion from a reduced-width fixed-point number to a full-width floating-point number needs to be performed during data transmission. This conversion operation introduces additional overhead; in this case, no priority allocation flag is set for the current computing unit.
[0104] The initial sorting sequence is adjusted based on the generated priority allocation identifiers. The computational units carrying priority allocation identifiers are traversed in the sequence, and their positions are moved forward. The specific movement distance is determined by the identifier strength. Identifier strength is defined as the proportion of the direct predecessor nodes of the current computational unit that have been allocated to the same target resource pool, relative to the total number of direct predecessor nodes. Specifically, if the current computational unit has 5 direct predecessor units, 3 of which have been allocated to the first computing power resource pool, and the current unit will also be allocated to that resource pool, then the identifier strength is calculated as 3 ÷ 5 = 0.6. Higher identifier strength indicates better data locality, and a larger movement distance is allowed. In this embodiment, an identifier strength of 1.0 moves the unit forward 5 positions, an identifier strength of 0.6 to 0.8 moves the unit forward 3 positions, an identifier strength of 0.4 to 0.6 moves the unit forward 1 position, and an identifier strength below 0.4 does not move the unit. The moved sequence retains the basic sorting logic of the value assessment indicators while also considering the principle of data locality, allowing computational units with better data transmission paths to receive higher allocation priority, thus forming the adjusted sorting sequence.
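The forward-move rule can be sketched in Python as follows. The function names are hypothetical. Two points are assumptions made only so that the sketch is total: a strength exactly equal to 0.6 is placed in the 3-position band, and strengths between 0.8 and 1.0, which the embodiment does not list, are also treated as 3 positions.

```python
def identifier_strength(predecessor_pools, target_pool):
    """Share of direct predecessors already allocated to the target pool."""
    if not predecessor_pools:
        return 0.0
    same = sum(1 for pool in predecessor_pools if pool == target_pool)
    return same / len(predecessor_pools)

def forward_shift(strength):
    """Map identifier strength to a number of positions to move forward."""
    if strength >= 1.0:
        return 5
    if strength >= 0.6:   # assumption: also covers the unlisted (0.8, 1.0) band
        return 3
    if strength >= 0.4:
        return 1
    return 0

def move_forward(sequence, unit, positions):
    """Return a copy of sequence with unit moved forward, clamped at the front."""
    result = list(sequence)
    index = result.index(unit)
    result.insert(max(0, index - positions), result.pop(index))
    return result

strength = identifier_strength(["first"] * 3 + ["second"] * 2, "first")
print(strength, forward_shift(strength))
print(move_forward(["a", "b", "c", "d", "e", "f"], "e", forward_shift(strength)))
```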
[0105] When determining a specific resource allocation scheme, the total number of computing units contained in the adjusted sorted sequence is counted. This number is multiplied by a preset ratio and rounded down to obtain the index value of the allocation boundary position. For example, if the preset ratio is 30% and the total number of computing units is 20, then the boundary position index is 6. All computing units within the index value range are extracted from the beginning of the sequence to form the first allocation set, and the remaining computing units are assigned to the second allocation set.
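A minimal sketch of the boundary calculation, assuming the preset ratio is supplied as an integer percentage so that the round-down step uses exact integer arithmetic; the function name split_allocation is illustrative.

```python
def split_allocation(adjusted_sequence, preset_ratio_percent):
    """Split the adjusted sequence into first and second allocation sets."""
    # Integer floor division implements "multiply by the ratio and round down".
    boundary = len(adjusted_sequence) * preset_ratio_percent // 100
    return adjusted_sequence[:boundary], adjusted_sequence[boundary:]

units = [f"unit_{i}" for i in range(20)]
first_set, second_set = split_allocation(units, 30)
print(len(first_set), len(second_set))
```

With 20 units and a 30% ratio the first set holds the 6 leading units, as in the example above.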
[0106] For the computing units in the first allocation set, their computation parameters are configured in full-width floating-point format, typically using 32-bit floating-point representation, to ensure numerical precision during computation. Execution requests from these computing units are submitted to the task scheduling queue of the first computing resource pool, where high-performance computing cores in this pool are responsible for execution. For the computing units in the second allocation set, their computation parameters are converted to a reduced-width fixed-point format, which can use 8-bit or 16-bit fixed-point representation, sacrificing some numerical precision for increased computational speed. Execution requests from these units are submitted to the second computing resource pool, where they are processed by computing units with higher energy efficiency, achieving differentiated allocation of computing resources during inference.
[0107] This invention introduces a priority adjustment mechanism based on data transmission characteristics to further optimize resource allocation strategies while maintaining value assessment ranking. This method prioritizes allocating computational units with close data dependencies to the same resource pool, avoiding format conversion overhead and transmission latency caused by cross-resource pool data transmission, and reducing the number of data moves during inference.
[0108] In one optional implementation, the step of generating a priority allocation identifier for the current computing unit based on the resource pool allocation status and data transmission characteristics of the direct predecessor computing unit includes:
[0109] For each direct predecessor computing unit, query the resource pool allocation status of the direct predecessor computing unit, and extract the direct predecessor computing units that have been marked as allocated to the first computing power resource pool to form the direct predecessor subset of the first resource pool.
[0110] For each direct predecessor computing unit in the first resource pool direct predecessor subset, the predecessor weight is calculated based on the amount of intermediate data transmitted from the direct predecessor computing unit to the current computing unit.
[0111] Sum the predecessor weights of all direct predecessor computation units in the first resource pool's direct predecessor subset to generate a predecessor influence score;
[0112] The predecessor influence score is compared with a preset influence threshold. If the predecessor influence score is greater than the preset influence threshold, a priority allocation flag is added to the current computing unit.
[0113] For example, after calculating the value assessment index for each computing unit, in order to further optimize the dynamic allocation effect of computing resources, it is necessary to analyze the resource pool allocation relationship between the current computing unit and its direct predecessor computing units. When a large number of direct predecessor computing units of a certain computing unit are concentrated in the first computing resource pool, allocating the computing unit to the first computing resource pool can significantly reduce the data transmission overhead between different resource pools.
[0114] For a current computing unit awaiting allocation, all its direct predecessor computing units are first retrieved from the topology of the multi-level computing model. A direct predecessor computing unit refers to a computing unit in the computing graph that has a direct data dependency on the current computing unit and whose execution order precedes the current computing unit. The node identifiers of these direct predecessor computing units are obtained by traversing the edge information of the computing graph. The resource pool allocation status is stored as an enumeration in the "resource_pool_type" attribute field of the computing unit object. This enumeration type has two values: "POOL_HIGH_PRECISION" indicates that it has been allocated to the first computing resource pool, and "POOL_LOW_PRECISION" indicates that it has been allocated to the second computing resource pool. For computing units that have not yet completed resource allocation, this field has a value of "POOL_UNASSIGNED," indicating an unallocated status. This status enumeration value can be directly obtained by calling the getResourcePoolType() method of the computing unit object.
[0115] For each direct predecessor computing unit, query its resource pool allocation status tag determined in the previous round of resource allocation. This tag is stored in the computing unit's metadata attributes and contains the allocation result of the first or second computing resource pool. All direct predecessor computing units marked as belonging to the first computing resource pool are filtered out to form a subset of direct predecessors for the first resource pool. This subset reflects the distribution of high-priority computing paths upstream of the current computing unit.
[0116] For each direct predecessor computation unit in the first resource pool's direct predecessor subset, its contribution to data transmission in the current computation unit needs to be quantified. The "tensor_shape" attribute of each edge is read from the edge object of the computation graph. This attribute stores the size of each dimension of the intermediate data tensor as an integer array; for example, [64, 128, 256] represents a three-dimensional tensor. The total number of tensor elements is obtained by multiplying all the dimension values in this array; in the example above, this is 64 × 128 × 256 = 2097152. The total number of elements in this tensor is calculated as the data volume. This data volume is divided by the total amount of all input data received by the current computation unit to obtain the predecessor weight of this direct predecessor computation unit. This weight value ranges from 0 to 1, representing the proportion of the output data of this direct predecessor computation unit in the input of the current computation unit.
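The data-volume and predecessor-weight calculation can be illustrated with a short Python sketch. The mapping from predecessor identifier to tensor_shape and the function names are assumptions introduced for the example.

```python
from math import prod

def element_count(tensor_shape):
    """Total number of tensor elements: the product of all dimension sizes."""
    return prod(tensor_shape)

def predecessor_weights(edge_shapes):
    """Map each predecessor id to its share of the current unit's total input volume."""
    volumes = {uid: element_count(shape) for uid, shape in edge_shapes.items()}
    total = sum(volumes.values())
    return {uid: volume / total for uid, volume in volumes.items()}

print(element_count([64, 128, 256]))
print(predecessor_weights({"p1": [64, 128, 256], "p2": [64, 128, 256], "p3": [128, 128, 256]}))
```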
[0117] The predecessor weights of all direct predecessor computing units in the first resource pool's direct predecessor subset are summed to generate a predecessor influence score. This score reflects the proportion of input data received by the current computing unit from the first computing resource pool to its total input data. When the predecessor influence score is close to 1, it indicates that the main input data of the current computing unit comes from the first computing resource pool; when the predecessor influence score is close to 0, it indicates that its input data mainly comes from the second computing resource pool or the initial input data.
[0118] The calculated predecessor influence score is compared numerically with a preset influence threshold. The preset influence threshold is set according to the actual application scenario, with a typical value range of 0.6 to 0.8. When the predecessor influence score is greater than the preset influence threshold, it is determined that the current computing unit has a strong data dependency relationship with the first computing power resource pool. A numerical priority allocation gain parameter "pool_affinity_boost" is added to the metadata of the current computing unit. The value is the difference between the predecessor influence score and the preset influence threshold, with a value range of 0 to 0.4. This gain parameter is used in the subsequent sorting adjustment stage of computing power resource allocation. Specifically, based on the initial sorting sequence, for computing units carrying this gain parameter, their sorting position is moved forward by an additional floor(pool_affinity_boost × 10) positions, where floor represents the floor operation. For example, when the gain parameter is 0.25, it moves forward 2 positions; when the gain parameter is 0.15, it moves forward 1 position. This identifier serves as an additional constraint during subsequent allocation of computing resources, ensuring that the computing unit is preferentially allocated to the first computing resource pool when the value assessment indicators are similar, thereby reducing data transfer delays across resource pools and improving the overall execution efficiency of inference.
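The threshold comparison and the resulting forward move can be sketched as below. The pool label and the pool_affinity_boost name follow the text; the function names and the dictionary-based inputs are assumptions. Because the boost is a floating-point difference, values very close to a multiple of 0.1 may round down by one position in practice.

```python
import math

def predecessor_influence_score(weights, pools):
    """Sum the weights of predecessors allocated to the first resource pool."""
    return sum(w for uid, w in weights.items()
               if pools.get(uid) == "POOL_HIGH_PRECISION")

def pool_affinity_boost(score, threshold):
    """Return score - threshold when the score exceeds the threshold, else None."""
    return score - threshold if score > threshold else None

def extra_forward_positions(boost):
    """floor(pool_affinity_boost * 10), as described in the text."""
    return math.floor(boost * 10)

weights = {"p1": 0.5, "p2": 0.375, "p3": 0.125}
pools = {"p1": "POOL_HIGH_PRECISION", "p2": "POOL_HIGH_PRECISION", "p3": "POOL_LOW_PRECISION"}
score = predecessor_influence_score(weights, pools)
print(score, pool_affinity_boost(score, 0.75))
print(extra_forward_positions(0.25), extra_forward_positions(0.15))
```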
[0119] This invention introduces data locality constraints based on value assessment by quantitatively analyzing the correlation strength between the computing unit and its predecessor node's resource pool. This effectively avoids the frequent data format conversion and cross-pool transmission overhead caused by the dispersion of predecessor nodes in different resource pools for high-value computing units.
[0120] In one optional implementation, the step of accumulating the output error resulting from reduced precision execution during the execution process includes:
[0121] For each computing unit that completes its computation in the second computing power resource pool using reduced-bit-width fixed-point numbers, obtain its actual output result and the corresponding theoretical full-precision output result, and calculate the numerical difference between the two to generate the unit output error;
[0122] Obtain the influence range value corresponding to the calculation unit, and multiply the unit output error with the influence range value to generate a weighted unit error;
[0123] The weighted unit error is added to the error accumulation sequence, which stores the weighted unit errors in the order of the execution completion time of the computing units;
[0124] For each weighted unit error in the error accumulation sequence, a time decay coefficient is calculated based on the time interval between the execution completion time of the calculation unit corresponding to the weighted unit error and the current time.
[0125] All weighted unit errors in the error accumulation sequence are multiplied by their corresponding time decay coefficients, and the results are summed to generate the cumulative error.
[0126] For example, performing reduced-precision calculations during model inference introduces computational errors. To prevent the accumulation of errors from affecting the final inference result, it is necessary to cumulatively track the output errors generated by the reduced-precision execution.
[0127] For each computing unit that completes its operation in the second computing resource pool using reduced-bit-width fixed-point numbers, a two-layer error evaluation mechanism is used to obtain the unit output error.
[0128] The first layer is a theoretical error estimation mechanism that performs a comprehensive and rapid evaluation of all computational units undergoing precision reduction. Based on the theoretical error model of bit-width quantization, for the quantization process of reducing from 32-bit floating-point numbers to 8-bit fixed-point numbers, the upper bound of the theoretical quantization error is calculated based on the quantization step size. The quantization step size is defined as the dynamic range of the value divided by the quantization level; for an 8-bit fixed-point number, the quantization level is 2 to the power of 8, or 256. Specifically, if the dynamic range of the output value of a computational unit is [-10, 10], then the quantization step size is 20 ÷ 256 ≈ 0.078, and the theoretical maximum quantization error is half of the quantization step size, or 0.039. This theoretical error serves as a preliminary estimate of the unit's output error, with minimal computational overhead, allowing for real-time evaluation of all precision-reduced units.
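The first-layer estimate reduces to a few arithmetic steps, sketched here in Python. The function name and its parameters are illustrative assumptions.

```python
def quantization_error_bound(range_min, range_max, bit_width=8):
    """Upper bound of the quantization error: half of the quantization step size."""
    levels = 2 ** bit_width                  # 256 levels for 8-bit fixed point
    step = (range_max - range_min) / levels  # dynamic range divided by level count
    return step / 2

bound = quantization_error_bound(-10.0, 10.0, bit_width=8)
print(round(bound, 3))
```

For the dynamic range [-10, 10] this gives a step of 0.078125 and a bound of about 0.039, in line with the figures above.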
[0129] The second layer is a sampling and verification mechanism. A subset of computational units is selected for precise error measurement according to a preset sampling rate, typically set to 10%. For each selected computational unit, after it completes the reduced-precision execution, the same input data is resubmitted to the first computing resource pool. The unit is then re-executed using a full-width floating-point format to obtain a full-precision output result as a theoretical reference. The precise unit output error is calculated using the Euclidean distance (L2 norm). The specific steps are as follows: First, the reduced-precision actual output vector and the full-precision theoretical output vector are converted to a unified numerical space (both converted to 32-bit floating-point format). The actual output vector is denoted as y_actual = (a₁, a₂, ..., aₙ) and the theoretical output vector as y_theory = (t₁, t₂, ..., tₙ). The element-wise difference between the two is calculated as δᵢ = aᵢ - tᵢ. Then, the L2 norm of the difference vector is calculated as the unit output error E_unit = sqrt(Σ(δᵢ²)). For example, for a feature vector with an output dimension of 512, the actual output is an 8-bit fixed-point vector with elements [0.321, 0.458, ...] after conversion, and the theoretical output is a 32-bit floating-point vector with elements [0.318, 0.462, ...]. After calculating the difference vector, its L2 norm is 0.042; that is, the exact unit output error of this calculation unit is 0.042.
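A minimal Python sketch of the L2-norm error measurement, assuming both output vectors have already been converted to the same floating-point representation; the function name unit_output_error is illustrative.

```python
import math

def unit_output_error(actual, theory):
    """Euclidean (L2) distance between the actual and the full-precision output."""
    if len(actual) != len(theory):
        raise ValueError("output vectors must have the same dimension")
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(actual, theory)))

# Two-element excerpt of the vectors quoted in the text (the full vectors are not given).
print(unit_output_error([0.321, 0.458], [0.318, 0.462]))
```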
[0130] The precise error obtained from actual sampling is used to calibrate the first-level theoretical error estimation model. A calibration coefficient is maintained, initially set to 1.0. After each sampling verification, the ratio of the measured error to the theoretical estimation error is calculated, and the calibration coefficient is updated using an exponential moving average method. The update formula is: Calibration coefficient_new = 0.9 × Calibration coefficient_old + 0.1 × (Measured error ÷ Theoretical estimation error). For the 90% of computational units that were not sampled, their unit output error is the theoretical estimation error multiplied by the current calibration coefficient.
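The calibration update is an exponential moving average and can be sketched as follows. The function names and the example inputs are assumptions; the smoothing weights 0.9 and 0.1 follow the update formula above.

```python
def update_calibration(old_coefficient, measured_error, theoretical_error, alpha=0.1):
    """new = (1 - alpha) * old + alpha * (measured / theoretical)."""
    return (1 - alpha) * old_coefficient + alpha * (measured_error / theoretical_error)

def estimated_unit_error(theoretical_error, calibration_coefficient):
    """Error assigned to an unsampled unit: theoretical estimate times the coefficient."""
    return theoretical_error * calibration_coefficient

coefficient = 1.0  # initial value stated in the text
coefficient = update_calibration(coefficient, measured_error=0.05, theoretical_error=0.04)
print(coefficient, estimated_unit_error(0.04, coefficient))
```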
[0131] Because different computational units play different roles in the model, the degree to which their errors affect subsequent calculations varies. The influence range value calculated for each computational unit during the value assessment phase is obtained; this is the influence range indicator described above. It is calculated by summing and normalizing the influence weights of all subsequent nodes in the subsequent node set. The value is between 0 and 1, reflecting the overall influence strength of the current computational unit's output on the subsequent computation chain, and it quantifies the strength of the data dependency relationship between the current computational unit and subsequent calculations. The unit output error is multiplied by the influence range value to generate a weighted unit error, which amplifies the error weight of computational units with a wide influence range and relatively reduces the error weight of computational units with a narrow influence range. Specifically, if the influence range value of a convolutional layer is 0.15 and its unit output error is 0.032, then the weighted unit error of this computational unit is 0.032 × 0.15 = 0.0048. If the influence range value of an attention layer is 0.85 and its unit output error is 0.028, then the weighted unit error is 0.028 × 0.85 = 0.0238. This value is significantly higher than that of the aforementioned convolutional layer, reflecting that the attention layer error has a more extensive impact on subsequent calculations.
[0132] Maintain an error accumulation sequence data structure to store weighted unit errors. This sequence records the error of each weighted unit and its corresponding timestamp in the order of the computation unit's execution completion time. The timestamp is represented by the global execution sequence number of the computation unit; that is, the first executed computation unit has a sequence number of 1, the second has 2, and so on. Whenever a computation unit completes execution in the second computing resource pool, its weighted unit error along with its execution sequence number is appended to the end of the sequence. This sequence is implemented using a first-in, first-out (FIFO) queue structure. Each element in the queue is a tuple (weighted unit error, execution sequence number), allowing for rapid appending of new elements and retrieval of historical error data. To prevent the queue from growing indefinitely, a maximum queue length of 100 is set; when the number of elements in the queue exceeds 100, the oldest element is automatically deleted.
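A bounded FIFO of this kind maps directly onto Python's collections.deque with a maxlen argument, which discards the oldest element automatically once the limit is reached. The helper names below are illustrative assumptions.

```python
from collections import deque

MAX_QUEUE_LENGTH = 100  # maximum queue length stated in the text

def new_error_sequence():
    """Create an empty bounded FIFO of (weighted unit error, execution sequence number)."""
    return deque(maxlen=MAX_QUEUE_LENGTH)

def record_weighted_error(sequence, unit_output_error, influence_range, execution_seq):
    """Weight the unit error by its influence range and append it to the sequence."""
    weighted = unit_output_error * influence_range
    sequence.append((weighted, execution_seq))
    return weighted

sequence = new_error_sequence()
for seq_no in range(1, 106):
    record_weighted_error(sequence, 0.01, 0.5, seq_no)
print(len(sequence), sequence[0][1])
```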
[0133] Considering that the impact of errors in early computation units gradually diminishes as the inference process progresses due to numerical changes in intermediate layers, a time decay mechanism needs to be introduced. For each weighted unit error in the error accumulation sequence, the difference between the execution sequence number of the computation unit corresponding to that error and the sequence number of the currently executed computation unit is calculated. This difference is the time interval Δt, representing the number of computation units passed. The time decay coefficient is calculated based on the time interval using the exponential decay function decay=exp(-μ·Δt), where μ is the decay rate parameter. The decay rate μ is typically set between 0.1 and 0.3; in this embodiment, it is set to 0.2. The specific calculation example is as follows: Currently, the execution has reached the 50th calculation unit. The cumulative error sequence stores the weighted unit error of the 45th calculation unit, which is 0.0048. Therefore, the time interval Δt = 50 - 45 = 5, and the attenuation coefficient is calculated as exp(-0.2 × 5) = exp(-1.0) ≈ 0.368. The effective error of this weighted unit error after time attenuation is 0.0048 × 0.368 ≈ 0.00177. If the sequence also stores the weighted unit error of the 40th calculation unit, which is 0.0238, then the time interval Δt = 50 - 40 = 10, the attenuation coefficient is exp(-0.2 × 10) = exp(-2.0) ≈ 0.135, and the effective error is 0.0238 × 0.135 ≈ 0.00321. If the weighted unit error of the 35th computational unit is stored as 0.0156, then the time interval Δt = 15, the attenuation coefficient is exp(-0.2×15) = exp(-3.0) ≈ 0.050, and the effective error is 0.0156×0.050 ≈ 0.00078. 
This mechanism attenuates the weight of an error produced 10 computational units earlier to 13.5% of its original value, and that of an error produced 15 computational units earlier to 5.0%, reflecting the natural attenuation of early error influence over time. The exponential decay function exp(-μ·Δt) is a well-known standard time decay model in the art. The range of the attenuation rate parameter μ from 0.1 to 0.3 has been clearly defined. The attenuation effect under different time intervals is clearly illustrated through the specific numerical examples above. Those skilled in the art can select an appropriate value of μ and implement this technical solution according to the layer depth and error propagation characteristics of the model.
[0134] The error accumulation sequence is traversed, and each weighted unit error is multiplied by its corresponding time decay coefficient to obtain the decayed error. The decayed errors are then summed to obtain the cumulative error E_cumulative = Σ(weighted unit error ᵢ × decay coefficient ᵢ). Continuing with the previous example, when the 50th computation unit is executed, the error accumulation sequence stores the weighted unit errors of computation units 35, 40, and 45. The effective errors after time decay are 0.00078, 0.00321, and 0.00177, respectively. The cumulative error is calculated as E_cumulative = 0.00078 + 0.00321 + 0.00177 = 0.00576. This cumulative error comprehensively reflects the overall error level generated by the reduced precision execution at the current moment.
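The decayed sum can be checked with a short Python sketch, assuming the sequence holds (weighted unit error, execution sequence number) tuples and μ = 0.2 as in this embodiment; the function names are illustrative.

```python
import math

def decayed_errors(entries, current_seq, decay_rate=0.2):
    """Apply exp(-decay_rate * delta_t) to each (weighted error, sequence number) entry."""
    return [error * math.exp(-decay_rate * (current_seq - seq)) for error, seq in entries]

def cumulative_error(entries, current_seq, decay_rate=0.2):
    """Sum of all decayed weighted unit errors."""
    return sum(decayed_errors(entries, current_seq, decay_rate))

entries = [(0.0156, 35), (0.0238, 40), (0.0048, 45)]
print([round(e, 5) for e in decayed_errors(entries, current_seq=50)])
print(round(cumulative_error(entries, current_seq=50), 5))
```

The per-entry values differ from the hand calculation only in the last digit, because the text rounds each decay coefficient to three decimals before multiplying.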
[0135] The accumulated error is compared with a preset tolerance value. The preset tolerance value is set according to the accuracy requirements of the inference task. For image classification tasks with high accuracy requirements, the tolerance value can be set to 0.01; for object detection tasks with relatively relaxed accuracy requirements, the tolerance value can be set to 0.02 to 0.03. When the accumulated error exceeds the tolerance value, a precision switching mechanism is triggered, generating a precision switching signal and transmitting it to the computing resource scheduling module. This switches subsequent computing units from reduced precision mode to full precision mode, thereby preventing further error propagation and unacceptable inference results. Continuing with the previous example, if the preset tolerance value is set to 0.01, the current cumulative error of 0.00576 has not yet exceeded the tolerance value, so the current precision allocation strategy continues to be executed. If computing units 51 and 52 are subsequently both executed in the second computing power resource pool and generate weighted unit errors of 0.0032 and 0.0041 respectively, then, as a simplified estimate that treats both new errors as undecayed (Δt=0, decay coefficient 1.0) and keeps the earlier decayed total unchanged, the cumulative error is updated to 0.00576+0.0032+0.0041=0.01306, which exceeds the tolerance value of 0.01 and triggers the precision switching mechanism.
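As a cross-check on the tolerance comparison, the following Python sketch recomputes the cumulative error at unit 52 with the time decay applied to every stored entry, assuming μ = 0.2 and the same five entries. The function names are illustrative. Under these assumptions the total is lower than the simple sum of 0.01306 but still above the 0.01 tolerance, so the switching signal would be raised in either case.

```python
import math

def cumulative_error(entries, current_seq, decay_rate=0.2):
    """Sum of weighted unit errors, each decayed by exp(-decay_rate * delta_t)."""
    return sum(error * math.exp(-decay_rate * (current_seq - seq))
               for error, seq in entries)

def exceeds_tolerance(entries, current_seq, tolerance, decay_rate=0.2):
    """True when the decayed cumulative error is above the preset tolerance."""
    return cumulative_error(entries, current_seq, decay_rate) > tolerance

entries = [(0.0156, 35), (0.0238, 40), (0.0048, 45), (0.0032, 51), (0.0041, 52)]
total = cumulative_error(entries, current_seq=52)
print(round(total, 5), exceeds_tolerance(entries, 52, tolerance=0.01))
```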
[0136] After precision switching, one of the following two strategies can be adopted. Strategy 1 is a global switch, which switches all subsequent computational units to full precision and submits them to the first computing resource pool for execution, ensuring the highest accuracy of the inference results. Strategy 2 is a selective switch, which switches only the subsequent dependent computational units of the computational units with the highest contribution in the error accumulation sequence to full precision, while the remaining computational units continue to execute at reduced precision, achieving a balance between accuracy assurance and computing power conservation.
[0137] This invention can accurately identify key error sources that significantly affect inference results. Compared with traditional fixed threshold error monitoring methods, this method considers the spatial propagation range and time decay characteristics of errors, making the calculation of cumulative errors more consistent with the actual laws of neural network error propagation. It can maximize the reduction of the proportion of precision calculations while ensuring inference accuracy, and significantly improve the utilization efficiency of computing resources.
[0138] In one optional implementation, the step of switching subsequent calculation units to full calculation accuracy when the accumulated error exceeds a preset tolerance value includes:
[0139] When the accumulated error exceeds a preset tolerance value, a precision switching process is triggered.
[0140] Traverse the error accumulation sequence; for each weighted unit error, extract the identifier of the corresponding calculation unit, obtain the influence range value of that calculation unit in the multi-level calculation model, and multiply the weighted unit error by the influence range value to generate the error contribution.
[0141] All calculation units are sorted in descending order of error contribution, and a predetermined number of calculation units with the highest error contribution are extracted to form a set of calculation units with high error contribution.
[0142] For each computation unit in the high error contribution computation unit set, identify all of its subsequent computation units according to the execution order relationship of the multi-level computation model, and form a switching target computation unit set;
[0143] For each computing unit in the set of target computing units to be switched, if the computing unit is currently allocated to the second computing power resource pool, then the computing precision parameter of the computing unit is switched from a reduced bit width fixed-point number to a full bit width floating-point number, and the computing unit is resubmitted to the first computing power resource pool.
[0144] For example, during inference execution, the output errors generated by each computational unit performing reduced precision execution in the second computing resource pool are continuously monitored. Based on the aforementioned cumulative error calculation method, the cumulative error value is maintained in real time, taking into account unit output error, influence range weighting, and time decay effects.
[0145] When the accumulated error exceeds a preset tolerance value, the accuracy recovery mechanism is activated. The preset tolerance value is typically set based on the accuracy requirements of the inference task. For image classification tasks with high accuracy requirements, it can be set to 0.01; for object detection tasks with relatively lenient accuracy requirements, it can be set to 0.02 to 0.03; and for video processing tasks with high real-time requirements, it can be appropriately relaxed to 0.05. This tolerance value needs to strike a balance between accuracy assurance and computational efficiency based on the specific application scenario. Taking image classification tasks as an example, if the preset tolerance value is 0.01, when the calculated accumulated error reaches 0.01306, it is determined that the accumulated error has exceeded the acceptable range, and the accuracy recovery mechanism is immediately triggered.
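As a non-limiting illustration, the monitoring and trigger logic described above can be sketched in Python. The class name `ErrorMonitor`, its method names, and the sample error values are hypothetical and are not part of the original disclosure; the decay form exp(-μ·Δt) with μ = 0.2 and the tolerance of 0.01 are taken from the description, with Δt measured in execution sequence numbers.

```python
import math
from dataclasses import dataclass, field

MU = 0.2  # decay rate parameter given in the description


@dataclass
class ErrorMonitor:
    """Hypothetical helper: tracks weighted unit errors of reduced-precision units."""
    tolerance: float
    # Each record is (weighted unit error, execution sequence number).
    records: list = field(default_factory=list)

    def record(self, unit_output_error: float, influence_range: float, seq_no: int) -> None:
        # Weighted unit error = unit output error x influence range value.
        self.records.append((unit_output_error * influence_range, seq_no))

    def cumulative_error(self, current_seq: int) -> float:
        # Each record is discounted by exp(-MU * dt), dt measured in execution steps.
        return sum(w * math.exp(-MU * (current_seq - s)) for w, s in self.records)

    def exceeded(self, current_seq: int) -> bool:
        return self.cumulative_error(current_seq) > self.tolerance


monitor = ErrorMonitor(tolerance=0.01)   # image classification setting from the text
monitor.record(unit_output_error=0.010, influence_range=0.5, seq_no=10)
print(monitor.exceeded(current_seq=10))  # a single small error stays under tolerance
monitor.record(unit_output_error=0.020, influence_range=0.5, seq_no=11)
print(monitor.exceeded(current_seq=11))  # accumulated error now crosses 0.01
```

In this sketch the cumulative error is recomputed from the stored records on each check, which keeps the decay weighting exact at the cost of a pass over the sequence; an incremental update would be an alternative design.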
[0146] The algorithm iterates through each element in the error accumulation sequence, where each element is stored as a two-element tuple in the format (weighted unit error, execution sequence number). For each weighted unit error record in the sequence, the actual contribution of the corresponding computational unit to the cumulative error is calculated. The identifier of the computational unit is obtained from its execution sequence number, and the influence range value R calculated for this unit during the value assessment phase is then retrieved using the identifier. This influence range value quantifies the strength of the computational unit's impact on the subsequent computation chain, and its value lies between 0 and 1. The time decay coefficient of this computational unit is calculated from the difference Δt between the current execution sequence number and the execution sequence number of this computational unit, using the exponential decay function decay = exp(-μ·Δt) with the decay rate parameter μ set to 0.2. The weighted unit error, the influence range value, and the time decay coefficient are multiplied together to generate the error contribution of this computational unit: Error Contribution = Weighted Unit Error × Influence Range × Time Decay Coefficient. This error contribution reflects the actual impact, at the current moment, of the error generated by this computational unit on the overall inference result, taking into account the error magnitude, the influence range, and the time decay.
[0147] For example: Suppose that execution has currently reached the 50th calculation unit, and that the error accumulation sequence records a weighted unit error of 0.0156 with an influence range of 0.25 for the 35th calculation unit, 0.0238 with an influence range of 0.85 for the 40th calculation unit, 0.0048 with an influence range of 0.15 for the 45th calculation unit, 0.0032 with an influence range of 0.62 for the 48th calculation unit, and 0.0041 with an influence range of 0.78 for the 49th calculation unit. The error contribution of each calculation unit is calculated as follows: for the 35th unit, the interval Δt = 15, the time decay coefficient = exp(-0.2 × 15) ≈ 0.050, and the error contribution = 0.0156 × 0.25 × 0.050 ≈ 0.000195; for the 40th unit, Δt = 10, the time decay coefficient = exp(-0.2 × 10) ≈ 0.135, and the error contribution = 0.0238 × 0.85 × 0.135 ≈ 0.00273; for the 45th unit, Δt = 5, the time decay coefficient = exp(-0.2 × 5) ≈ 0.368, and the error contribution = 0.0048 × 0.15 × 0.368 ≈ 0.000265; for the 48th unit, Δt = 2, the time decay coefficient = exp(-0.2 × 2) ≈ 0.670, and the error contribution = 0.0032 × 0.62 × 0.670 ≈ 0.00133; for the 49th unit, Δt = 1, the time decay coefficient = exp(-0.2 × 1) ≈ 0.819, and the error contribution = 0.0041 × 0.78 × 0.819 ≈ 0.00262.
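The arithmetic of this example can be checked with a short Python sketch. The function name `error_contribution` and the tuple layout of `records` are illustrative assumptions; the input values, the current position (unit 50), and μ = 0.2 come from the example. Because the text rounds the time decay coefficient to three decimals before multiplying, the unrounded results may differ from the quoted figures in the last digit.

```python
import math

MU = 0.2          # decay rate parameter from the description
CURRENT_SEQ = 50  # execution has reached the 50th calculation unit

# (execution sequence number, weighted unit error, influence range value)
records = [
    (35, 0.0156, 0.25),
    (40, 0.0238, 0.85),
    (45, 0.0048, 0.15),
    (48, 0.0032, 0.62),
    (49, 0.0041, 0.78),
]


def error_contribution(weighted_error, influence_range, seq_no,
                       current_seq=CURRENT_SEQ, mu=MU):
    """Weighted unit error x influence range x exp(-mu * dt)."""
    decay = math.exp(-mu * (current_seq - seq_no))
    return weighted_error * influence_range * decay


contributions = {seq: error_contribution(w, r, seq) for seq, w, r in records}
for seq, value in contributions.items():
    print(f"unit {seq}: {value:.6f}")
```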
[0148] After calculating the error contribution of all computational units, they are sorted in descending order of their numerical values, resulting in the following ranking: Unit 40 (0.00273), Unit 49 (0.00262), Unit 48 (0.00133), Unit 45 (0.000265), and Unit 35 (0.000195). The top N computational units with the highest error contribution are selected from this ranking. The value of N can be set to 3 to 5, depending on the model size and the length of the cumulative error sequence. For small- to medium-sized models (less than 50 layers), N can be set to 3; for large-scale models (more than 100 layers), N can be set to 5. In this example, the top three computational units, namely Units 40, 49, and 48, are selected, and their identifiers are stored in a set of high-error-contribution computational units. This set identifies the computational units primarily responsible for the current cumulative error, whose errors have the most significant impact on the inference results.
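A minimal sketch of this selection step follows, using the rounded error contributions quoted above. The function name `top_n_contributors` is hypothetical; the choice N = 3 follows the example.

```python
def top_n_contributors(contributions, n):
    """Return the identifiers of the n units with the largest error contribution."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [unit_id for unit_id, _ in ranked[:n]]


# Error contributions from the worked example, keyed by unit number.
contributions = {35: 0.000195, 40: 0.00273, 45: 0.000265, 48: 0.00133, 49: 0.00262}

high_error_units = top_n_contributors(contributions, n=3)
print(high_error_units)
```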
[0149] Based on the topology of the multi-level computational model, the subsequent dependencies of the computational units in the set of high-error-contribution computational units are analyzed. The topology of the multi-level computational model is represented by a directed acyclic graph (DAG), where nodes are computational units and edges represent data dependencies. For each computational unit in the set, a breadth-first traversal is executed to find all subsequent computational units that are executed after it and have direct or indirect data dependencies on it. The specific implementation steps are as follows: Initialize an empty switching candidate set and a visited-flag array. For each computational unit identifier in the set of high-error-contribution computational units, add it to the queue of nodes to be visited as a starting point of the traversal. Take a computational unit node from the queue and query all direct successor nodes of that node, i.e., nodes to which there is a directed edge from that node in the topology. For each successor node, check whether it has been marked as visited in the visited-flag array. If not, add its identifier to the switching candidate set, mark it as visited, and add the successor node to the queue so that its own successors are searched in turn. Repeat the above process until the queue is empty. After all computational units in the set of high-error-contribution computational units have been traversed, the switching candidate set contains all downstream computational units affected by the high-error units.
[0150] Continuing with the previous example, assume the following dependencies in the model topology for the three high error contribution calculation units No. 40, No. 49, and No. 48: the direct successors of calculation unit No. 40 are No. 43 and No. 46; the direct successors of calculation unit No. 49 are No. 51 and No. 52; the direct successors of calculation unit No. 48 are No. 51 and No. 53; the successor of No. 43 is No. 50; the successor of No. 46 is No. 52; the successors of No. 51 are No. 55 and No. 56; the successors of No. 52 are No. 56 and No. 58; and the successor of No. 53 is No. 57. Perform a breadth-first traversal: Starting from number 40, find successors 43 and 46, continue traversing to find numbers 50 and 52; starting from number 49, find successors 51 and 52, continue traversing to find numbers 55, 56, and 58; starting from number 48, find successors 51 and 53, continue traversing to find numbers 55, 56, and 57. After removing duplicates, the candidate set contains 10 subsequent calculation units: {43, 46, 50, 51, 52, 53, 55, 56, 57, 58}.
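The traversal in this example can be reproduced with the following Python sketch. The adjacency map `successors` encodes exactly the dependencies listed above; the function name `downstream_units` is an assumption, and a single `visited` set plays the role of both the visited-flag array and the switching candidate set, since every newly visited node is also a candidate.

```python
from collections import deque

# Direct successors of each calculation unit, as listed in the example.
successors = {
    40: [43, 46],
    49: [51, 52],
    48: [51, 53],
    43: [50],
    46: [52],
    51: [55, 56],
    52: [56, 58],
    53: [57],
}


def downstream_units(successors, start_units):
    """Breadth-first search collecting every unit reachable from the start units."""
    visited = set()  # doubles as the switching candidate set
    for start in start_units:
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in successors.get(node, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
    return visited


candidates = downstream_units(successors, [40, 49, 48])
print(sorted(candidates))
```

The starting units themselves are only included if they are reachable from another starting unit, which does not occur in this example.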
[0151] Since the candidate set for switching contains a large number of subsequent computation units, switching all of them to full precision would significantly increase computational power consumption, so further screening is required. For each computation unit in the candidate set for switching, its current resource allocation status and execution status are checked. The screening criteria include: the computation unit is currently allocated to the second computing power resource pool; the computation unit uses a fixed-point representation with reduced bit width; and the computation unit has not yet started execution or is waiting in the execution queue of the second computing power resource pool. Only computation units that meet the above three conditions are added to the final set of target computation units for switching. Computation units that have already completed execution do not need to be switched, and computation units that have already executed in the first computing power resource pool do not need to be reassigned. In the previous example, suppose that after checking, it is found that units 43, 46, and 50 have already been executed in the first computing power resource pool and do not need to be switched; unit 51 is being executed in the second resource pool but has not yet completed and can be switched; units 52, 55, and 56 are in the execution queue of the second resource pool and can be switched; and units 53, 57, and 58 have completed execution and do not need to be switched. The final target computing unit set includes four computing units: {51, 52, 55, 56}.
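The screening step can be sketched as a simple filter. All type and field names here (`Pool`, `Status`, `UnitState`, `select_switch_targets`) are hypothetical. Two assumptions are made loudly: following the worked example, a unit that is running in the second pool but not yet complete is treated as switchable; and units 53, 57, and 58, whose pool is not stated in the example, are modeled as completed in the second pool.

```python
from dataclasses import dataclass
from enum import Enum


class Pool(Enum):
    FIRST = "first"    # full precision
    SECOND = "second"  # reduced precision


class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


@dataclass(frozen=True)
class UnitState:
    pool: Pool
    reduced_precision: bool
    status: Status


def select_switch_targets(candidates, states):
    """Keep units in the second pool, at reduced precision, that have not finished."""
    return {
        uid for uid in candidates
        if states[uid].pool is Pool.SECOND
        and states[uid].reduced_precision
        and states[uid].status is not Status.DONE
    }


done_first = UnitState(Pool.FIRST, False, Status.DONE)
done_second = UnitState(Pool.SECOND, True, Status.DONE)    # assumed pool, see lead-in
queued_second = UnitState(Pool.SECOND, True, Status.QUEUED)

states = {
    43: done_first, 46: done_first, 50: done_first,
    51: UnitState(Pool.SECOND, True, Status.RUNNING),
    52: queued_second, 55: queued_second, 56: queued_second,
    53: done_second, 57: done_second, 58: done_second,
}

targets = select_switch_targets(states.keys(), states)
print(sorted(targets))
```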
[0152] Iterate through each computational unit identifier in the target computational unit set and perform a precision restoration operation for each computational unit. The precision restoration operation comprises four steps. The first step modifies the numerical representation format of the computational unit. The numerical format parameter currently used by the computational unit is queried. If the current representation is an 8-bit fixed-point number (1 sign bit, 4 integer bits, and 3 fractional bits), the numerical representation format parameter is changed to the 32-bit floating-point format (IEEE 754 single-precision floating-point number, with 1 sign bit, 8 exponent bits, and 23 mantissa bits). At the same time, the data type identifier of the input/output data buffers associated with the computational unit is updated from INT8 to FLOAT32. The second step modifies the computation kernel parameters. The execution of a computation unit depends on a pre-compiled computation kernel, and kernels of different precision differ in their numerical operation instructions. The low-precision kernel identifier currently bound to the computation unit, such as "conv2d_int8_kernel", is queried and replaced with the corresponding full-precision kernel identifier, "conv2d_fp32_kernel". The kernel call parameter list, including the input tensor pointers, output tensor pointers, and weight parameter pointers, is updated so that the data types pointed to by all pointers match the 32-bit floating-point format. The third step adjusts the resource pool allocation. The scheduling record of the computation unit is removed from the execution queue of the second computing resource pool. If the computation unit is currently executing on a computing device in the second computing resource pool and has not yet completed, an interrupt signal is sent to terminate its current execution and release the occupied computing resources. The computation unit is then resubmitted to the high-priority queue of the first computing resource pool. The high-priority queue uses a priority scheduling strategy so that computation units with restored precision obtain computing resources from the first computing resource pool first, which avoids increasing the overall inference latency through excessive waiting time. The fourth step synchronizes the status update. In the global computing unit status table, the status identifier of the computing unit is updated from "Second resource pool - low precision execution" to "First resource pool - full precision execution". The timestamp of the status transition is recorded for subsequent performance analysis and optimization. Subsequent computing units that depend on the output of this computing unit are notified that their input data will come from a full-precision execution result, so that they can adjust their input data format accordingly.
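The four steps can be illustrated with the following simplified Python sketch. The `ComputeUnit` record, the queue representation, the `interrupt` callback, and the string substitution used to derive the full-precision kernel identifier are all assumptions made for illustration; the sketch does not model real device kernels, tensor pointers, or the notification of dependent units.

```python
from dataclasses import dataclass


@dataclass
class ComputeUnit:
    """Hypothetical scheduling record for one computation unit."""
    unit_id: int
    dtype: str    # "INT8" or "FLOAT32"
    kernel: str   # e.g. "conv2d_int8_kernel"
    status: str
    running: bool = False


def restore_precision(unit, second_pool_queue, first_pool_hp_queue,
                      interrupt=lambda u: None):
    # Step 1: switch the numerical format and buffer data type.
    if unit.dtype == "INT8":
        unit.dtype = "FLOAT32"
    # Step 2: rebind to the full-precision kernel (naming scheme assumed).
    unit.kernel = unit.kernel.replace("_int8_", "_fp32_")
    # Step 3: leave the second pool, interrupting if mid-execution,
    # then join the high-priority queue of the first pool.
    if unit.unit_id in second_pool_queue:
        second_pool_queue.remove(unit.unit_id)
    if unit.running:
        interrupt(unit)
        unit.running = False
    first_pool_hp_queue.append(unit.unit_id)
    # Step 4: synchronize the status table entry.
    unit.status = "First resource pool - full precision execution"
    return unit


second_queue = [52, 55, 56]
first_hp_queue = []
unit_51 = ComputeUnit(51, "INT8", "matmul_int8_kernel",
                      "Second resource pool - low precision execution", running=True)
restore_precision(unit_51, second_queue, first_hp_queue)
print(unit_51.kernel, unit_51.dtype, first_hp_queue)
```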
[0153] Taking the aforementioned target set {51, 52, 55, 56} as an example, computing unit 51 is currently executing in the second resource pool. An interrupt signal is sent to terminate its execution, its numerical format is switched from INT8 to FLOAT32, and its computation kernel is switched from "matmul_int8_kernel" to "matmul_fp32_kernel". It is then resubmitted to the high-priority queue of the first resource pool. Computing units 52, 55, and 56 are removed directly from the pending execution queue of the second resource pool and, after their format is switched, are submitted to the first resource pool.
[0154] After the precision switch is complete, the error monitoring state needs to be reset to establish a new error monitoring baseline for subsequent inference execution. The current error accumulation sequence is cleared: all (weighted unit error, execution sequence number) two-element records stored in the sequence are deleted, and the memory space occupied by the sequence is released. The cumulative error value is reset to zero, indicating that error accumulation restarts from the current moment. The calibration coefficient is reset to its initial value of 1.0, restoring the theoretical error estimation model to its default state. The trigger time of this precision recovery event, the cumulative error value at the time of triggering, the list of high error contribution calculation unit identifiers, and the number of target calculation units switched are recorded in the log for offline analysis and optimization of the precision control strategy.
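A minimal sketch of the reset step is given below. The `MonitorState` container, its field names, and the layout of the log entry are assumptions; the sketch only mirrors the four actions named above (log the event, clear the sequence, zero the cumulative error, reset the calibration coefficient to 1.0).

```python
import time
from dataclasses import dataclass, field


@dataclass
class MonitorState:
    """Hypothetical container for the error monitoring state."""
    records: list = field(default_factory=list)  # (weighted unit error, seq no)
    cumulative_error: float = 0.0
    calibration: float = 1.0
    log: list = field(default_factory=list)


def reset_after_switch(state, high_error_units, switched_count, now=None):
    # Log the recovery event before the values it describes are cleared.
    state.log.append({
        "trigger_time": time.time() if now is None else now,
        "cumulative_error": state.cumulative_error,
        "high_error_units": list(high_error_units),
        "switched_count": switched_count,
    })
    state.records.clear()         # drop all stored records
    state.cumulative_error = 0.0  # accumulation restarts from now
    state.calibration = 1.0       # back to the default estimation model


state = MonitorState(records=[(0.0238, 40), (0.0041, 49)],
                     cumulative_error=0.01306, calibration=1.2)
reset_after_switch(state, high_error_units=[40, 49, 48], switched_count=4, now=0.0)
print(state.cumulative_error, state.calibration, len(state.log))
```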
[0155] After the state is reset, subsequent inference calculations continue. For computation units that have switched to full precision, their outputs after execution in the first computing resource pool are no longer included in the error accumulation sequence, because full precision execution does not introduce quantization errors. For subsequent computation units that are still executing in the second computing resource pool with reduced precision, the unit output error and weighted unit error are calculated according to the previous method and added to the new error accumulation sequence. The error monitoring and precision recovery mechanism is executed cyclically until the entire inference task is completed.
[0156] Compared to the global precision switching strategy, the precision recovery mechanism of this invention has a significant advantage in saving computing power.
[0157] A second aspect of this invention provides a system for dynamic allocation of computing power during the inference process of an intelligent model, including:
[0158] The model acquisition unit is used to acquire the multi-level computing model corresponding to the task to be reasoned, wherein the multi-level computing model contains multiple computing units arranged in the order of execution.
[0159] The value assessment unit is used to extract intermediate output data for each completed calculation unit during the execution of inference, determine the uncertainty measure of the current calculation unit based on the distribution dispersion and numerical stability of the intermediate output data, determine the influence range of the current calculation unit on subsequent calculation units based on the data dependency relationship between calculation units, and generate a value assessment index by weighting and combining the uncertainty measure and the influence range.
[0160] The resource allocation unit is used to sort the subsequent computing units to be executed according to the value assessment index and allocate computing resources. The computing units ranked within a preset top proportion are allocated to the first computing resource pool with full computing precision, and the lower-ranked computing units are allocated to the second computing resource pool with reduced computing precision. The processing capacity of the first computing resource pool is higher than that of the second computing resource pool.
[0161] The execution monitoring unit is used to execute subsequent calculation units according to the computing power resource allocation results. During the execution process, it accumulates the output error caused by the reduced precision execution. When the accumulated error exceeds the preset tolerance value, the subsequent calculation unit is switched to full calculation precision.
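As a non-limiting illustration of the allocation performed by the resource allocation unit, the following Python sketch sorts pending units by their value assessment index and splits them at a boundary derived from a preset ratio. The function name `split_by_value`, the sample index values, the ratio of 0.5, and the use of rounding up to obtain the boundary are all assumptions; the description does not fix the ratio or the rounding rule.

```python
import math


def split_by_value(value_index, top_ratio):
    """Rank units by value assessment index and split them into two pools."""
    ranked = sorted(value_index, key=value_index.get, reverse=True)
    boundary = math.ceil(len(ranked) * top_ratio)  # rounding rule is an assumption
    first_pool = ranked[:boundary]    # full precision
    second_pool = ranked[boundary:]   # reduced precision
    return first_pool, second_pool


# Hypothetical value assessment indices for four pending units.
value_index = {"u1": 0.9, "u2": 0.2, "u3": 0.6, "u4": 0.4}
first_pool, second_pool = split_by_value(value_index, top_ratio=0.5)
print(first_pool, second_pool)
```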
[0162] A third aspect of the present invention provides an electronic device, comprising:
[0163] A processor;
[0164] A memory used to store processor-executable instructions;
[0165] The processor is configured to invoke instructions stored in the memory to execute the aforementioned method.
[0166] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.
[0167] This invention can be a method, apparatus, system, and / or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of the invention.
[0168] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for dynamic allocation of computing power in intelligent model inference processes, characterized in that it comprises: Obtain the multi-level computational model corresponding to the task to be reasoned, wherein the multi-level computational model contains multiple computational units arranged in the order of execution; During the inference process, for each completed computation unit, the intermediate output data is extracted. The uncertainty measure of the current computation unit is determined based on the distribution dispersion and numerical stability of the intermediate output data. The influence range of the current computation unit on subsequent computation units is determined based on the data dependency relationship between computation units. The uncertainty measure and the influence range are weighted and combined to generate a value assessment index. The subsequent computational units to be executed are sorted and computing resources are allocated according to the value assessment index. The computational units ranked within a preset top proportion are allocated to the first computing resource pool with full computational precision, while the lower-ranked computational units are allocated to the second computing resource pool with reduced computational precision. The processing capacity of the first computing resource pool is higher than that of the second computing resource pool. The subsequent computing units are executed according to the computing power resource allocation results. During the execution process, the output error caused by reduced-precision execution is accumulated. When the accumulated error exceeds the preset tolerance value, the subsequent computing units are switched to full computing precision.
2. The method according to claim 1, characterized in that, The steps for determining the scope of influence of the current computing unit on subsequent computing units based on the data dependencies between computing units include: Construct a computation graph for the multi-level computation model, where nodes in the computation graph correspond to computation units, and directed edges represent data dependencies between computation units; Locate the current node corresponding to the current computing unit in the computation graph, perform a breadth-first traversal along the directed edges from the current node, record the level distance of each visited subsequent node, and obtain the set of reachable subsequent nodes. Identify the output node corresponding to the final output layer of the multi-level computation model in the computation graph; for each subsequent node in the set of subsequent nodes, calculate the shortest path length from the subsequent node to the output node; For each subsequent node, a distance influence factor is determined based on the hierarchical distance of that subsequent node. The distance influence factor is then multiplied by the inverse of the shortest path length from that subsequent node to the output node to generate an influence weight. The influence weights of all subsequent nodes in the set of subsequent nodes are summed and normalized to generate the influence range of the current computing unit on subsequent computing units.
3. The method according to claim 2, characterized in that, The steps for determining the distance influence factor based on the hierarchical distance of the subsequent node include: The hierarchical distance of the subsequent node is used as the input of the exponential decay function to calculate the decay coefficient. The exponential decay function makes the decay coefficient of the subsequent node with a larger hierarchical distance smaller. Obtain the type identifier of the computation unit corresponding to the subsequent node, and query the corresponding type weight coefficient from the preset type weight mapping table according to the type identifier; in the type weight mapping table, the convolution type computation unit corresponds to the first type weight coefficient, the attention type computation unit corresponds to the second type weight coefficient, and the fully connected type computation unit corresponds to the third type weight coefficient. The distance influence factor is generated by multiplying the attenuation coefficient with the type weight coefficient.
4. The method according to claim 1, characterized in that, The steps of sorting and allocating computing resources to subsequent computing units based on the value assessment indicators include: All subsequent computational units to be executed are arranged in descending order of their corresponding value assessment indicators to generate an initial sorting sequence; Traverse the initial sorting sequence, and for each computing unit, identify its direct predecessor computing unit according to the execution order relationship of the multi-level computing model; generate a priority allocation identifier for the current computing unit based on the resource pool allocation status and data transmission characteristics of the direct predecessor computing unit; The initial sorting sequence is adjusted according to the priority allocation identifier, and the positions of the computing units with the priority allocation identifier are moved forward to generate the adjusted sorting sequence. The allocation boundary position is calculated based on the total number of the adjusted sorted sequence and the preset ratio. The calculation units before the allocation boundary position are extracted to form a first allocation set, and the remaining calculation units are extracted to form a second allocation set. The computing units in the first allocation set are set to full-width floating-point numbers and submitted to the first computing power resource pool, while the computing units in the second allocation set are set to reduced-width fixed-point numbers and submitted to the second computing power resource pool.
5. The method according to claim 4, characterized in that, The steps for generating a priority allocation identifier for the current computing unit based on the resource pool allocation status and data transmission characteristics of the direct predecessor computing unit include: For each direct predecessor computing unit, query the resource pool allocation status of the direct predecessor computing unit, and extract the direct predecessor computing units that have been marked as allocated to the first computing power resource pool to form the direct predecessor subset of the first resource pool. For each direct predecessor computing unit in the first resource pool direct predecessor subset, the predecessor weight is calculated based on the amount of intermediate data transmitted from the direct predecessor computing unit to the current computing unit. Sum the predecessor weights of all direct predecessor computation units in the first resource pool's direct predecessor subset to generate a predecessor influence score; The predecessor influence score is compared with a preset influence threshold. If the predecessor influence score is greater than the preset influence threshold, a priority allocation flag is added to the current computing unit.
6. The method according to claim 1, characterized in that, The step of accumulating, during the execution process, the output error caused by reduced-precision execution includes: For each computing unit that completes its computation in the second computing power resource pool using reduced bit-width fixed-point numbers, obtain its actual output result and the corresponding theoretical full-precision output result, and calculate the numerical difference between the two to generate the unit output error; Obtain the influence range value corresponding to the calculation unit, and multiply the unit output error by the influence range value to generate a weighted unit error; The weighted unit error is added to the error accumulation sequence, which stores the weighted unit errors in the order of the execution completion time of the computing units; For each weighted unit error in the error accumulation sequence, a time decay coefficient is calculated based on the time interval between the execution completion time of the calculation unit corresponding to the weighted unit error and the current time. The cumulative error is generated by multiplying each weighted unit error in the error accumulation sequence by its corresponding time decay coefficient and then summing the results.
7. The method according to claim 6, characterized in that, The step of switching subsequent calculation units to full calculation precision when the accumulated error exceeds a preset tolerance value includes: When the accumulated error exceeds a preset tolerance value, a precision switching process is triggered. Traverse the error accumulation sequence; for each weighted unit error, extract the corresponding calculation unit identifier, obtain the influence range value of that calculation unit in the multi-level calculation model, and multiply the weighted unit error by the influence range value to generate the error contribution. All calculation units are sorted in descending order of error contribution, and a predetermined number of calculation units with the highest error contribution are extracted to form a set of calculation units with high error contribution. For each computation unit in the high error contribution computation unit set, identify all of its subsequent computation units according to the execution order relationship of the multi-level computation model, and form a switching target computation unit set; For each computing unit in the set of target computing units to be switched, if the computing unit is currently allocated to the second computing power resource pool, the computing precision parameter of the computing unit is switched from a reduced bit-width fixed-point number to a full bit-width floating-point number, and the computing unit is then resubmitted to the first computing power resource pool.
8. A dynamic computing power allocation system for the intelligent model reasoning process, used to implement the method of any one of claims 1-7, characterized in that it comprises: The model acquisition unit is used to acquire the multi-level computing model corresponding to the task to be reasoned, wherein the multi-level computing model contains multiple computing units arranged in the order of execution. The value assessment unit is used to extract intermediate output data for each completed calculation unit during the execution of inference, determine the uncertainty measure of the current calculation unit based on the distribution dispersion and numerical stability of the intermediate output data, determine the influence range of the current calculation unit on subsequent calculation units based on the data dependency relationship between calculation units, and generate a value assessment index by weighting and combining the uncertainty measure and the influence range. The resource allocation unit is used to sort the subsequent computing units to be executed according to the value assessment index and allocate computing resources. The computing units ranked within a preset top proportion are allocated to the first computing resource pool with full computing precision, and the lower-ranked computing units are allocated to the second computing resource pool with reduced computing precision. The processing capacity of the first computing resource pool is higher than that of the second computing resource pool. The execution monitoring unit is used to execute subsequent calculation units according to the computing power resource allocation results. During the execution process, it accumulates the output error caused by the reduced precision execution. When the accumulated error exceeds the preset tolerance value, the subsequent calculation units are switched to full calculation precision.
9. An electronic device, characterized in that it comprises: a processor; and a memory used to store processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer program instructions stored thereon, characterized in that, When the computer program instructions are executed by the processor, they implement the method described in any one of claims 1 to 7.