Mapping method for neural network computing in heterogeneous environment
By dynamically adjusting operator mapping in a heterogeneous environment, and combining topological convergence and performance degradation gradient, the pipeline blocking problem of neural network computation graph is solved, improving the computational efficiency and throughput of heterogeneous clusters and reducing system deployment costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAN HANGYU CHUANGTONG EQUIP MFG CO LTD
- Filing Date
- 2026-05-25
- Publication Date
- 2026-06-19
AI Technical Summary
In heterogeneous environments, the dimensionality mismatch between the topological semantics of the neural network computation graph and the hardware execution state causes transient blockages in the inference pipeline, and existing technologies have not been able to effectively solve this problem.
By analyzing the neural network computation graph, target operator nodes are identified, topological convergence and tensor asynchronous retention integral are calculated, and the operator mapping relationship is dynamically adjusted by combining the transient execution deviation rate of heterogeneous nodes. The task allocation is optimized by utilizing the synchronous waiting period and performance decay gradient.
It eliminates pipeline stalls caused by misalignment between computational model semantics and hardware physical characteristics, improves the computational efficiency and throughput of heterogeneous clusters, avoids nonlinear fluctuations in hardware performance, and reduces system deployment costs.
Figure CN122242603A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of neural network computing technology, and in particular relates to a mapping method for neural network computing in a heterogeneous environment.
Background Technology
[0002] Currently, neural network computation graphs are distributed and executed in heterogeneous computing clusters containing general-purpose processors, graphics processing units, and dedicated neural processing units. As the scale of model parameters increases, the efficiency of mapping computational tasks to physical hardware directly determines the system's inference latency and throughput. Current technologies typically use a static slicing strategy to allocate operators to specific hardware based on the theoretical computational load of each operator in the computation graph and the peak computing power of computing nodes under ideal conditions. This approach assumes that hardware resources have constant processing power in the load balancing model and determines the task allocation scheme accordingly.
[0003] However, in large-scale heterogeneous environments, the execution state of computing nodes is highly dynamic and uncertain. Because each computing unit has a different temperature control mechanism, the chip's main frequency is transiently reduced during peak load periods. Furthermore, data transmission between nodes is limited by bus bandwidth contention. The convergence operators commonly found in neural network model topologies have a strong synchronous dependency on multi-path input tensors. If a key operator on a convergence path is mapped to a node whose performance is degrading, the output tensors of the predecessor branches linger asynchronously in the target node's memory. This semantic misalignment between the model's logical structure and the hardware's physical characteristics not only triggers physical backpressure on the memory space but also causes unpredictable blockages in the cluster-wide inference pipeline, reducing the actual utilization of computing resources. To alleviate these problems, the industry has tried to balance processing speed by adding redundant computing nodes or increasing the global synchronization cache capacity. However, these methods only compensate at the level of hardware scale; they fail to address the deep mismatch between the topological semantics of the computation graph and the evolution of node execution state, and the high cost of the additional hardware significantly increases the deployment cost of the system. For example, Chinese invention patent application CN119621269A discloses an operator scheduling method, device, electronic device, storage medium and program product. It constructs a feature vector by extracting static, dynamic and topological features of the operator, and feeds it together with the processor state vector into a scheduling model to determine the operator scheduling mode.
Although that scheduling scheme considers logical feature matching, it still treats the hardware as a physical entity with constant response characteristics. It only establishes the relationship between features and latency through statistical models, without reaching the evolution mechanism of the hardware's physical state. The nonlinear performance fluctuation caused by temperature-controlled frequency reduction of heterogeneous nodes exhibits lag and physical evolution characteristics. The existing technology lacks an accurate characterization of the phase relationship between the asynchronous waiting gap caused by the model's topological semantics and the hardware's temperature control recovery cycle. As a result, the scheduling decision cannot use the model's inherent gaps to cover the hardware performance cliff, and system throughput remains limited by the transient mismatch of hardware physical characteristics.
[0004] Therefore, the technical problem to be solved by this invention is how to combine the topological convergence characteristics of the neural network computation graph with the transient execution deviation rate of heterogeneous nodes to establish a mechanism that dynamically adjusts the operator mapping relationship according to the hardware's physical evolution trend, so as to eliminate pipeline stagnation caused by the misalignment between the semantics of the computation model and the characteristics of the physical carrier.
Summary of the Invention
[0005] This invention aims to solve the problem of transient blocking in the inference pipeline caused by dimensional mismatch between the topological semantics of the computation graph and the execution state of heterogeneous hardware.
[0006] In this technical solution, a mapping method for neural network computation in a heterogeneous environment includes the following steps:
[0007] Step 101: Parse the neural network computation graph to be mapped to identify the target operator node; obtain the total input tensor volume and single-path output tensor volume of the target operator node, and calculate the ratio of the total input tensor volume to the single-path output tensor volume to determine the topological convergence degree; calculate the time deviation with which each predecessor branch of the target operator node arrives at the computing resource unit, and determine the tensor asynchronous retention integral by multiplying the time deviation by the tensor volume that arrives early.
[0008] Step 102: Obtain the actual processing delay of the computing resource unit when processing the operator task, and calculate the ratio of the actual processing delay to the preset theoretical processing delay to obtain the transient operating deviation rate; compute the first derivative of the time series of the transient operating deviation rate to generate the performance degradation gradient.
[0009] Step 103: Select operator nodes whose topological convergence degree and tensor asynchronous retention integral exceed preset limits; traverse candidate computing resource units that have a negative performance degradation gradient and are in the temperature control adjustment recovery period; use the synchronization waiting period determined by the time deviation as a timing mask to map the selected operator nodes to the candidate computing resource units, and issue a mapping instruction set containing the mapping binding relationship.
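The selection logic of Step 103 can be sketched roughly as follows. This is an illustrative sketch only: the class names, the `recovery_ms` field, the reading of the timing mask as "remaining recovery time fits inside the synchronization waiting period", and the tie-break on the most negative gradient are all assumptions, not details stated in the text.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    convergence: float         # topological convergence degree (Step 101)
    retention_integral: float  # tensor asynchronous retention integral (Step 101)
    sync_wait_ms: float        # synchronization waiting period from branch time deviation

@dataclass
class Unit:
    name: str
    gradient: float            # performance degradation gradient (Step 102)
    recovering: bool           # in the temperature control adjustment recovery period
    recovery_ms: float = 0.0   # hypothetical: estimated time left until recovered

def map_operators(ops, units, conv_limit, integral_limit):
    """Return {operator name: unit name} bindings (a stand-in for the mapping instruction set)."""
    bindings = {}
    # Candidates: negative gradient AND in the recovery period, as Step 103 requires.
    candidates = [u for u in units if u.gradient < 0 and u.recovering]
    for op in ops:
        if op.convergence <= conv_limit or op.retention_integral <= integral_limit:
            continue  # only operators exceeding both preset limits are selected
        # Assumed timing-mask test: the unit's recovery must fit in the operator's wait.
        fitting = [u for u in candidates if u.recovery_ms <= op.sync_wait_ms]
        if fitting:
            best = min(fitting, key=lambda u: u.gradient)  # assumed tie-break
            bindings[op.name] = best.name
    return bindings
```

For instance, an operator with convergence 32 and a 15.4 ms waiting period would be bound to a recovering unit whose assumed remaining recovery time is 10 ms, while a unit with a positive gradient is never considered.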
[0010] Preferably, step 101 extracts the tensor asynchronous retention integral, including: obtaining the time domain difference between the longest predecessor branch time consumption and the shortest predecessor branch time consumption of the target operator node, and determining the product of the time domain difference and the tensor volume that arrives ahead of time as the tensor asynchronous retention integral.
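A minimal sketch of the two Step 101 quantities, following the definitions given above. The function names and the units (MB for volume, ms for time) are assumptions chosen for illustration.

```python
def topological_convergence(input_volumes, output_volume):
    """Ratio of total input tensor volume to single-path output tensor volume."""
    if output_volume <= 0:
        raise ValueError("output_volume must be positive")
    return sum(input_volumes) / output_volume

def retention_integral(branch_times_ms, early_volume):
    """(longest - shortest predecessor branch time) x tensor volume that arrives early."""
    wait_ms = max(branch_times_ms) - min(branch_times_ms)
    return wait_ms * early_volume
```

With hypothetical inputs of 64, 32 and 32 MB feeding a 4 MB output, `topological_convergence([64, 32, 32], 4)` gives 32.0; branch times of 10, 25 and 12.5 ms with 8 MB arriving early give a retention integral of 120.0 MB·ms.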
[0011] Preferably, generating the performance degradation gradient in step 102 includes: periodically collecting the processing feedback time of the computing resource unit and the hardware junction temperature parameters; fitting the computing output evolution curve based on the change trajectory of the transient operating deviation rate of the computing resource unit within a preset historical window, and determining the slope of the computing output evolution curve as the performance degradation gradient.
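One possible way to obtain the slope described here is an ordinary least-squares line fit over the history window. The text only says that a curve is fitted and its slope taken, so the choice of a straight-line least-squares fit, the function names, and the uniform sampling period are assumptions.

```python
def deviation_rate(actual_delay, theoretical_delay):
    """Transient operating deviation rate: actual / theoretical processing delay."""
    if theoretical_delay <= 0:
        raise ValueError("theoretical_delay must be positive")
    return actual_delay / theoretical_delay

def degradation_gradient(deviation_rates, period_ms):
    """Least-squares slope of deviation rate versus time over a history window."""
    n = len(deviation_rates)
    if n < 2:
        raise ValueError("need at least two samples")
    times = [i * period_ms for i in range(n)]  # assumed uniform sampling
    t_mean = sum(times) / n
    r_mean = sum(deviation_rates) / n
    num = sum((t - t_mean) * (r - r_mean) for t, r in zip(times, deviation_rates))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den
```

A positive result means the unit is slowing down relative to its theoretical delay; a negative result means it is recovering.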
[0012] Preferably, the process of extracting the performance degradation gradient also includes the following sub-steps: introducing an exponential smoothing processing mechanism that includes historical degradation factors to reduce noise in the transient operating deviation rate; and combining the characteristic parameters of the hardware architecture to which the computing resource unit belongs to perform gain correction on the processed transient operating deviation rate to compensate for the data feedback lag caused by the difference in clock adjustment mechanism between different computing resource units.
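The smoothing and correction sub-steps might look like the sketch below. It assumes the standard exponential smoothing recursion (weight `alpha` on the current sample), seeds the filter with the first sample, and models the gain correction as a single per-architecture multiplier; the multiplier form and the name `gain` are hypothetical, since the text does not specify how the correction is applied.

```python
def smooth_and_correct(raw_rates, alpha, gain=1.0):
    """Exponentially smooth raw deviation rates, then apply an assumed per-architecture gain."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not raw_rates:
        return []
    smoothed = []
    prev = raw_rates[0]  # assumption: seed the filter with the first sample
    for r in raw_rates:
        prev = alpha * r + (1.0 - alpha) * prev  # current sample weighted by alpha
        smoothed.append(gain * prev)             # hypothetical gain correction
    return smoothed
```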
[0013] Preferably, in step 103, mapping operator nodes to candidate computing resource units includes: calculating the real-time processing entropy value of each computing resource unit, wherein the real-time processing entropy value is determined by a weighted sum of the memory access bandwidth utilization rate of the computing resource unit and the waiting queue depth of the computing channel; constructing a mapping evaluation index that includes tensor asynchronous retention integral, performance decay gradient and real-time processing entropy value; and classifying and ranking each computing resource unit according to the value of the mapping evaluation index.
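A rough sketch of the entropy value and ranking follows. The entropy is the weighted sum the text describes, with queue depth scaled to [0, 1] (assuming a minimum depth of 0). The mapping evaluation index is only said to include the three quantities, so the linear combination used here, the coefficient names, and the convention that a lower index is a better target are all assumptions.

```python
def processing_entropy(bw_util, queue_depth, max_queue_depth, w_bw=0.5):
    """Weighted sum of bandwidth utilization and normalized queue depth, both in [0, 1]."""
    q_norm = queue_depth / max_queue_depth if max_queue_depth else 0.0
    return w_bw * bw_util + (1.0 - w_bw) * q_norm

def rank_units(units, integral_norm, k_integral=1.0, k_entropy=1.0, w_grad=1.0):
    """Sort unit names by an ASSUMED linear index; lower is treated as a better target."""
    def index(u):
        entropy = processing_entropy(u["bw_util"], u["queue"], u["max_queue"])
        return k_integral * integral_norm + w_grad * u["gradient"] + k_entropy * entropy
    return sorted(units, key=lambda name: index(units[name]))
```

Under these assumptions, a recovering and lightly loaded unit ranks ahead of a degrading, congested one.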
[0014] Preferably, the mapping process in step 103 further includes the following sub-steps: step 1031, monitoring the bandwidth occupancy rate of the internal bus of the heterogeneous computing resource pool; step 1032, when the real-time processing entropy value of the target computing resource unit exceeds a preset threshold and the bandwidth occupancy rate reaches a preset pressure threshold, suppressing the flow of operators with high topological convergence characteristics to the target computing resource unit.
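Steps 1031 and 1032 amount to a gate evaluated before dispatch. A minimal sketch, in which the function name and all three default threshold values are hypothetical placeholders:

```python
def allow_dispatch(op_convergence, unit_entropy, bus_occupancy,
                   conv_limit=8.0, entropy_limit=0.8, bus_limit=0.9):
    """Return False when a high-convergence operator should be kept off the unit."""
    # Step 1032 condition: entropy exceeds its threshold AND bus occupancy reaches the pressure threshold.
    overloaded = unit_entropy > entropy_limit and bus_occupancy >= bus_limit
    # Only operators with high topological convergence are suppressed.
    return not (overloaded and op_convergence > conv_limit)
```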
[0015] Preferably, the method further includes the following steps: Step 1041, establishing a global tensor throughput feedback link to capture the overall computing performance of the mapped heterogeneous computing resource pool; Step 1042, dynamically correcting the weight coefficient of the performance degradation gradient in Step 102 based on the degree of convergence between the overall computing performance and the sum of the theoretical physical peak values of the hardware.
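The text does not give the correction rule used in Step 1042, so the following is only one plausible shape for it: a proportional update that raises the gradient weight when measured throughput falls short of the summed theoretical peak. The update rule, `rate`, and the clamping bounds are all assumptions.

```python
def corrected_weight(weight, measured_throughput, peak_sum,
                     rate=0.1, w_min=0.0, w_max=10.0):
    """Adjust the gradient weight in proportion to the throughput shortfall (assumed rule)."""
    if peak_sum <= 0:
        raise ValueError("peak_sum must be positive")
    shortfall = 1.0 - measured_throughput / peak_sum  # 0 when throughput equals the peak sum
    updated = weight * (1.0 + rate * shortfall)
    return min(w_max, max(w_min, updated))            # keep the weight in a bounded range
```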
[0016] Preferably, the method further includes: parsing the logical dependency depth between operators in the neural network model to be processed, and constructing an operator mapping priority queue by combining the tensor asynchronous retention integral.
[0017] Preferably, the operator mapping priority queue is arranged in the following order: each operator node is associated with the computing resource unit determined based on the temporal mask, in order of increasing logical dependency depth.
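The priority queue can be sketched with a binary heap keyed on dependency depth. Ordering by increasing depth follows the text; using a larger retention integral to break ties between operators of equal depth is an assumption about how the integral is "combined".

```python
import heapq

def build_priority_queue(operators):
    """operators: iterable of (name, dependency_depth, retention_integral) tuples."""
    # Negate the integral so that, at equal depth, the larger integral pops first (assumed tie-break).
    heap = [(depth, -integral, name) for name, depth, integral in operators]
    heapq.heapify(heap)
    return heap

def drain(heap):
    """Pop every entry and return operator names in mapping order."""
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order
```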
[0018] Compared with existing technologies, the mapping method for neural network computation in heterogeneous environments of this invention has the following advantages:
[0019] 1. In mapping neural network computation, the topological convergence degree is determined from the ratio of the in-degree to the out-degree of operator nodes. Combined with the transient execution deviation rate, which measures how far a computing unit deviates from its ideal throughput capacity, this deeply aligns the semantics of the neural network computation graph with the physical execution state of heterogeneous hardware. It prevents computing units in a performance degradation state from taking on convergence operators with high-dimensional tensor stacking characteristics, and eliminates cluster-level synchronization waiting latency caused by critical path convergence.
[0020] 2. By analyzing the arrival time differences of multiple predecessor branches at specific operators, the tensor asynchronous retention integral is determined. Combined with the degradation gradient of the underlying physical response capability of heterogeneous computing units, the task distribution decision can anticipate the evolution trend of node performance, prevent the feature data awaiting processing from generating physical backpressure in the memory of nodes experiencing transient frequency reduction or bus contention, and ensure the continuity of data flow and the availability of memory access bandwidth within the heterogeneous cluster.
[0021] 3. By using the inherent branch synchronization time difference of the neural network computation graph as a physical timing mask, computing units in the temperature control adjustment recovery period are dynamically associated with operators that have specific asynchronous retention characteristics. The hardware's physical state repair cycle and the unavoidable waiting time caused by the model topology thus coincide in the time domain, improving the actual conversion efficiency of heterogeneous computing resources under extreme inference conditions without changing the hardware's physical architecture.
Attached Figure Description
[0022] Figure 1 is a flowchart of the neural network operator feature parsing and dynamic mapping process in a heterogeneous environment according to the present invention;
[0023] Figure 2 is a diagram of the logical architecture and multidimensional feedback control of the neural network computation mapping system of the present invention.
Detailed Implementation
[0024] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.
[0025] It should be noted that all directional and positional terms used in this invention, such as: up, down, left, right, front, back, vertical, horizontal, inner, outer, top, bottom, transverse, longitudinal, center, etc., are only used to explain the relative positional relationship and connection between components in a specific state (as shown in the accompanying drawings). They are only for the convenience of describing this invention and do not require that this invention be constructed and operated in a specific orientation. Therefore, they should not be construed as limiting this invention. In addition, the descriptions of "first," "second," etc., in this invention are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated.
[0026] In the description of this invention, unless otherwise explicitly specified and limited, the terms installation, connection, and linking should be interpreted broadly. For example, they can refer to fixed connections, detachable connections, or integral connections; they can refer to mechanical connections; they can refer to direct connections or indirect connections through an intermediate medium; they can refer to the internal connection of two components. For those skilled in the art, the specific meaning of the above terms in this invention can be understood according to the specific circumstances.
[0027] In the description of this specification, references to the terms "an embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example, and the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0028] As shown in Figure 1, a mapping method for neural network computation in a heterogeneous environment includes the following steps:
[0029] Step 101: Parse the neural network computation graph to be mapped to identify the target operator node; obtain the total input tensor volume and single-path output tensor volume of the target operator node, and calculate the ratio of the total input tensor volume to the single-path output tensor volume to determine the topological convergence degree; calculate the time deviation with which each predecessor branch of the target operator node arrives at the computing resource unit, and determine the tensor asynchronous retention integral by multiplying the time deviation by the tensor volume that arrives early.
[0030] Step 102: Obtain the actual processing delay of the computing resource unit when processing the operator task, and calculate the ratio of the actual processing delay to the preset theoretical processing delay to obtain the transient operating deviation rate; compute the first derivative of the time series of the transient operating deviation rate to generate the performance degradation gradient.
[0031] Step 103: Select operator nodes whose topological convergence degree and tensor asynchronous retention integral exceed preset limits; traverse candidate computing resource units that have a negative performance degradation gradient and are in the temperature control adjustment recovery period; use the synchronization waiting period determined by the time deviation as a timing mask to map the selected operator nodes to the candidate computing resource units, and issue a mapping instruction set containing the mapping binding relationship.
[0032] Preferably, step 101 extracts the tensor asynchronous retention integral, including: obtaining the time domain difference between the longest predecessor branch time consumption and the shortest predecessor branch time consumption of the target operator node, and determining the product of the time domain difference and the tensor volume that arrives ahead of time as the tensor asynchronous retention integral.
[0033] Preferably, generating the performance degradation gradient in step 102 includes: periodically collecting the processing feedback time of the computing resource unit and the hardware junction temperature parameters; fitting the computing output evolution curve based on the change trajectory of the transient operating deviation rate of the computing resource unit within a preset historical window, and determining the slope of the computing output evolution curve as the performance degradation gradient.
[0034] Preferably, the process of extracting the performance degradation gradient also includes the following sub-steps: introducing an exponential smoothing processing mechanism that includes historical degradation factors to reduce noise in the transient operating deviation rate; and combining the characteristic parameters of the hardware architecture to which the computing resource unit belongs to perform gain correction on the processed transient operating deviation rate to compensate for the data feedback lag caused by the difference in clock adjustment mechanism between different computing resource units.
[0035] Preferably, in step 103, mapping operator nodes to candidate computing resource units includes: calculating the real-time processing entropy value of each computing resource unit, wherein the real-time processing entropy value is determined by a weighted sum of the memory access bandwidth utilization rate of the computing resource unit and the waiting queue depth of the computing channel; constructing a mapping evaluation index that includes tensor asynchronous retention integral, performance decay gradient and real-time processing entropy value; and classifying and ranking each computing resource unit according to the value of the mapping evaluation index.
[0036] Preferably, the mapping process in step 103 further includes the following sub-steps: step 1031, monitoring the bandwidth occupancy rate of the internal bus of the heterogeneous computing resource pool; step 1032, when the real-time processing entropy value of the target computing resource unit exceeds a preset threshold and the bandwidth occupancy rate reaches a preset pressure threshold, suppressing the flow of operators with high topological convergence characteristics to the target computing resource unit.
[0037] Preferably, the method further includes the following steps: Step 1041, establishing a global tensor throughput feedback link to capture the overall computing performance of the mapped heterogeneous computing resource pool; Step 1042, dynamically correcting the weight coefficient of the performance degradation gradient in Step 102 based on the degree of convergence between the overall computing performance and the sum of the theoretical physical peak values of the hardware.
[0038] Preferably, the method further includes: parsing the logical dependency depth between operators in the neural network model to be processed, and constructing an operator mapping priority queue by combining the tensor asynchronous retention integral.
[0039] Preferably, the operator mapping priority queue is arranged in the following order: each operator node is associated with the computing resource unit determined based on the temporal mask, in order of increasing logical dependency depth.
[0040] Example 1: As shown in Figure 2, in application scenarios that deploy high-concurrency cloud computing node clusters, the clusters include heterogeneous computing resource units such as general-purpose processors, graphics processing units, and dedicated neural processing units. The mapping method for neural network computing in a heterogeneous environment claimed in this invention parses the neural network computation graph to be mapped and extracts the topological convergence degree and tensor asynchronous retention integral of the operator nodes. For each target operator node with a multi-input single-output structure in the computation graph, the total input tensor volume and the single-path output tensor volume of the target operator node are obtained, where the total input tensor volume is calculated by summing the tensor volumes of all predecessor branch inputs of the target operator node. The ratio of the total input tensor volume to the single-path output tensor volume determines the topological convergence degree. At the same time, the time deviation with which each predecessor branch of the target operator node arrives at the computing resource unit along the critical path is calculated, and the tensor asynchronous retention integral is determined by accumulating the product of the time deviation and the volume of the tensor that arrives early on the corresponding branch. This integral characterizes the residence of tensors in node memory caused by differences in computational load and transmission bandwidth.
[0041] During system operation, in order to capture performance fluctuations caused by bus contention or temperature control adjustments in heterogeneous computing resource units, the system periodically acquires the actual processing delay of each computing resource unit when processing operator tasks, and obtains the theoretical processing delay pre-stored in memory. The ratio of the actual processing delay to the theoretical processing delay gives the transient operating deviation rate. To eliminate feedback lag caused by the hardware clock adjustment mechanism, a constant smoothing coefficient is introduced and combined with the smoothed deviation value of the previous sampling period to determine the current transient operating deviation rate: the current value is the smoothing coefficient multiplied by the transient operating deviation rate of the current cycle, plus one minus the smoothing coefficient multiplied by the smoothed deviation value calculated in the previous sampling period. The smoothing coefficient is preset and ranges from 0 to 1. The system then calculates the first derivative of the time series of the transient operating deviation rate to generate the performance degradation gradient. Following discrete-time signal processing, solving the first derivative is equivalent to calculating the finite difference quotient of the transient operating deviation rate between adjacent sampling periods. The system sets an initial weighting coefficient for the performance degradation gradient, which serves as the baseline for dynamic correction in the subsequent global feedback link. To construct a dimensionless, unified decision benchmark, the system obtains the memory access bandwidth utilization and the waiting queue depth of each computing resource unit, and applies an extreme value (min-max) normalization algorithm to map the waiting queue depth, which takes discrete natural-number values, onto a ratio range that matches the memory access bandwidth utilization.
The two are then added in a preset ratio to determine the dimensionless real-time processing entropy value. The preset ratio is determined by the throughput bottleneck characteristics of the underlying architecture of each heterogeneous computing resource unit. For computing cores with limited on-chip cache capacity but dense floating-point operation units, the system assigns a higher weight to memory access bandwidth utilization so that the scheduling strategy tends to avoid bandwidth congestion. For cores with ample shared memory bandwidth but short instruction scheduling queues, a higher weight is assigned to waiting queue depth. This architecture-matched preset ratio is parsed directly from the hardware configuration description file loaded during device initialization, ensuring that the real-time processing entropy value accurately reflects the real load pressure under a specific architecture. The system uses the normalized tensor asynchronous retention integral, the performance degradation gradient carrying its initial weight coefficient, and the real-time processing entropy value to calculate the mapping evaluation index. The mapping evaluation index is computed by a formula that outputs a single scalar decision value and contains two fixed sensitivity constants. To determine these constants, during the cluster initialization phase the system injects standard test operators of pure memory-access intensity and pure computation intensity into nodes of the same type within the heterogeneous cluster under no-load test conditions. It records the slope of the memory bandwidth decrease caused by tensor residency and the magnitude of the processing latency increase caused by queue congestion. By least-squares fitting of the two independent performance degradation curves, the average of the partial derivatives of each with respect to the overall throughput decrease is extracted.
This determines the relative impact weights of the tensor asynchronous retention integral and the real-time processing entropy value on global performance. The fixed proportion values determined through this offline calibration are then assigned to the two sensitivity constants, which eliminates the subjective bias of empirical assignment. The performance degradation gradient characterizes whether the computing resource unit is in a performance degradation interval or in a physical state repair period.
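As a compact restatement of the quantities described in Example 1, the relations can be written as below. All symbols are assumptions introduced here for readability: $C_v$ is the topological convergence degree of operator $v$, $I_v$ its tensor asynchronous retention integral, $r_t$ the raw and $\hat r_t$ the smoothed transient operating deviation rate, $g_t$ the performance degradation gradient, $\tilde q$ the normalized queue depth, $u$ the memory access bandwidth utilization and $H$ the real-time processing entropy value. The weighted-sum form of $H$ follows the description of adding the two terms in a preset ratio. The mapping evaluation index is left out because its exact form is not stated.

```latex
\begin{aligned}
C_v &= \frac{\sum_{i \in \mathrm{pred}(v)} V^{\mathrm{in}}_{i}}{V^{\mathrm{out}}_{v}}
  && \text{topological convergence degree} \\
I_v &= \sum_{i \in \mathrm{pred}(v)} \Delta t_i \, V^{\mathrm{early}}_{i}
  && \text{tensor asynchronous retention integral} \\
r_t &= \frac{T^{\mathrm{act}}_{t}}{T^{\mathrm{theo}}}
  && \text{transient operating deviation rate} \\
\hat r_t &= \alpha \, r_t + (1-\alpha)\, \hat r_{t-1}
  && \text{exponential smoothing} \\
g_t &= \frac{\hat r_t - \hat r_{t-1}}{\Delta t_s}
  && \text{finite-difference gradient, sampling period } \Delta t_s \\
\tilde q &= \frac{q - q_{\min}}{q_{\max} - q_{\min}}
  && \text{min-max normalized queue depth} \\
H &= w_u \, u + w_q \, \tilde q
  && \text{real-time processing entropy value}
\end{aligned}
```

A negative $g_t$ corresponds to a unit in its repair period and a positive $g_t$ to a unit whose performance is degrading.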
[0042] When distributing mapping tasks, the system selects target operator nodes whose topological convergence degree and tensor asynchronous retention integral exceed preset limits. The asynchronous waiting period formed by the synchronization time difference of the predecessor branches is used as a timing mask to direct the selected target operator nodes to computing resource units with negative performance degradation gradients. This allows computing resource units in the performance recovery period to carry operator tasks with specific asynchronous retention characteristics. With this cooperative logic, the hardware's physical state repair cycle coincides in the time domain with the asynchronous waiting gap caused by the model topology. This not only eliminates the memory access bandwidth blockage caused by abnormal stacking of tensors in memory, but also improves the actual conversion efficiency of heterogeneous computing resources under extreme conditions by converting pipeline bubble time into hardware state self-adjustment time. As a result, the global tensor throughput of the entire heterogeneous cluster approaches the linear sum of the theoretical physical peaks of the hardware.
[0043] Example 2: To verify the stability of the mapping method for neural network computing in a heterogeneous environment under high load conditions, a heterogeneous computing cluster was built, consisting of 4 general-purpose processor nodes, 8 graphics processing unit nodes, and 4 dedicated neural processing unit nodes. The computing nodes are interconnected via an external device interconnect bus with a bandwidth of 32 GB/s. A deep Transformer neural network model with 1024 hidden layers was used as the object to be mapped, and its computation graph exhibits multi-path convergence characteristics. The task completion latency of each computing resource unit was collected through the system kernel monitoring interface. To simulate complex interference in a real industrial production environment, random bus scheduling jitter with a root mean square value of 1.2 ms was deliberately introduced into the experimental environment, and periodic fluctuations in ambient temperature were set to trigger temperature-controlled frequency reduction protection in each heterogeneous node. In determining the smoothing coefficient, the technical consideration is to balance the system's sensitivity to performance fluctuations against its ability to suppress sampling noise. As the coefficient approaches the upper limit of its range, the system detects sudden changes in hardware performance more readily but becomes more susceptible to scheduling oscillations caused by high-frequency bus contention noise; as it approaches the lower limit, the system filters sampling glitches better, but its state awareness lags in the time domain. The temperature control repair cycle of computing units is typically in the range of 100 ms to 500 ms, while the sampling cycle is set to 10 ms.
To capture the performance degradation trend within 5 sampling cycles while keeping the state curve smooth, the experimental group applied the above trade-off and set the smoothing coefficient to 0.25.
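One way to sanity-check this choice, assuming the standard exponential smoothing recursion in which the coefficient $\alpha$ weights the current sample and the raw deviation rate undergoes a step change: the fraction of the step tracked after $n$ cycles is $1 - (1-\alpha)^n$. The step-input assumption and the symbols are introduced here for illustration.

```latex
1 - (1-\alpha)^{n} \;=\; 1 - 0.75^{5} \;\approx\; 0.763,
\qquad
n \, \Delta t_s \;=\; 5 \times 10\,\mathrm{ms} \;=\; 50\,\mathrm{ms}
```

Under these assumptions the smoothed value follows about 76% of a sudden change within 50 ms, which is shorter than the 100 ms to 500 ms repair cycle, while each individual noisy sample contributes only a quarter of its deviation.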
[0044] The experiment established three comparative sample groups: the experimental group, control group A, and control group B. The experimental group employed the mapping method claimed in this invention; control group A employed a static mapping method based on the current idle rate of nodes; and control group B employed a dynamic mapping method that balances load according to the real-time processing latency of nodes. To construct a gradient verification system, the topological convergence degree of the target operator nodes was adjusted by modifying the model parallelism so that it varies within the range of 5 to 50, and the global tensor throughput and memory access blocking frequency of the system were monitored under the different convergence pressures. When the model runs to a specific multi-input single-output operator node, the total input tensor volume of that node is obtained as 128 MB and the single-channel output tensor volume as 4 MB; the topological convergence degree of this operator node, calculated according to the formula as the ratio of the two, is 32. Because the read latency of each predecessor branch differs in the distributed storage environment, the maximum latency difference between the branches reaching the convergence point is calculated to be 15.4 ms, i.e., the synchronization waiting period is 15.4 ms. Combined with the preloaded tensor volume of each branch, the tensor asynchronous retention integral is determined. At this time, the system observes that the transient operation deviation rate of graphics processing unit node No. 1 rises from 1.05 to 1.42 and that its first derivative, i.e., the performance degradation gradient, is positive, indicating that the node is in a period of performance degradation caused by temperature-controlled frequency reduction; correspondingly, the performance degradation gradient of graphics processing unit node No. 3 is negative, indicating that the node is in a period of performance recovery after physical repair.
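The two quantities used in this experiment can be checked with a few lines of arithmetic. The sketch below computes the topological convergence degree as the ratio of total input tensor volume to single-channel output tensor volume, and the synchronization waiting period as the largest latency difference between predecessor branches. The function names and the per-branch latency values are hypothetical; only the 128 MB, 4 MB and 15.4 ms figures come from the text.

```python
from typing import Sequence


def topological_convergence(total_input_mb: float, single_output_mb: float) -> float:
    """Ratio of total input tensor volume to single-channel output tensor volume."""
    if single_output_mb <= 0:
        raise ValueError("single-channel output volume must be positive")
    return total_input_mb / single_output_mb


def sync_wait_period(branch_latencies_ms: Sequence[float]) -> float:
    """Largest latency difference between branches reaching the convergence point."""
    return max(branch_latencies_ms) - min(branch_latencies_ms)


if __name__ == "__main__":
    print(topological_convergence(128.0, 4.0))  # 32.0
    # Hypothetical branch latencies whose spread is about 15.4 ms.
    print(sync_wait_period([20.0, 27.5, 35.4]))
```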
[0045] When processing the above operator with a topological convergence degree of 32, control group A ignored the asynchronous congestion feature and dispatched the large tensor to node 1, which had the highest memory idle rate. Because this node was in a temperature-controlled protection period and its memory access bandwidth was limited by asynchronous data stacking, its bus utilization instantly reached 98.5% saturation, causing a performance stall of 42.5 ms. Although control group B identified the performance degradation of node 1, it did not exploit the synchronization waiting period and redirected tasks to the heavily loaded node 2, so global throughput fell from the initial 1200 TPS to 750 TPS. In contrast, the experimental group used the 15.4 ms synchronization waiting period as a timing mask and mapped the high-convergence operator to node 3, whose performance degradation gradient is negative. The global tensor throughput of the experimental group remained at 1150 TPS, and the memory access blocking frequency was reduced by 65.2% compared with control group A. The experimental data reveal a nonlinear trend in performance as the topological convergence degree changes. When the topological convergence degree is between 5 and 35, the system throughput of the experimental group is positively correlated with the hardware physical peak value, indicating that the tensor asynchronous retention integral is converted into physical repair time. When the topological convergence degree exceeds 40, the physical memory bandwidth reaches its limit, throughput growth slows and enters a saturation region, and further increasing the complexity of the computation graph yields no additional performance gain. When the transient operation deviation rate of a critical node exceeds the degradation inflection point of 2.5, the system predicts the performance degradation gradient and thereby avoids the cluster logic deadlock caused by local overheating. These experimental results confirm the effect of the present invention, which is achieved through deep collaboration between topological semantics and physical state.
[0046] Example 3: In a continuously running heterogeneous computing task scheduling environment, the heterogeneous computing resource unit includes a stream processor with an independent frequency adjustment mechanism. For neural network inference requests with dynamic branch prediction characteristics, the system records, at the instruction issuance level, the system clock stamp at which each predecessor branch tensor arrives at the memory buffer of the computing resource unit. The clock stamp of each branch is subtracted from the reference clock stamp corresponding to the longest predecessor branch time consumption, yielding a discrete time deviation vector. The tensor asynchronous retention integral is then determined by discrete summation logic, which satisfies: $S = \sum_{i=1}^{n} V_i \cdot \Delta t_i$, where $n$ is the number of predecessor branches, $V_i$ is the volume of the $i$-th branch tensor, and $\Delta t_i$ is the discrete time deviation of the corresponding branch.
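The discrete summation described above can be sketched as follows: each branch's arrival stamp is subtracted from the stamp of the slowest branch, and the resulting deviation is weighted by that branch's tensor volume. The function name, the units (milliseconds and megabytes) and the sample values are assumptions chosen for illustration.

```python
from typing import Sequence


def async_retention_integral(arrival_ms: Sequence[float],
                             volumes_mb: Sequence[float]) -> float:
    """Sum over branches of tensor volume times time spent waiting in the buffer."""
    if not arrival_ms or len(arrival_ms) != len(volumes_mb):
        raise ValueError("need one arrival stamp and one volume per branch")
    # Reference stamp: arrival of the longest (slowest) predecessor branch.
    reference = max(arrival_ms)
    # Discrete time deviation vector: how far ahead of the reference each branch arrived.
    deviations = [reference - t for t in arrival_ms]
    return sum(v * dt for v, dt in zip(volumes_mb, deviations))


if __name__ == "__main__":
    # Three hypothetical branches; the last one is the slowest and contributes zero.
    print(async_retention_integral([0.0, 5.0, 10.0], [32.0, 32.0, 64.0]))  # 480.0
```

The slowest branch always contributes zero, so the result measures only the volume that arrived early and had to be held.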
[0047] When monitoring the processing performance of computing resource units, the system calculates, for each sampling period, the discrete residual between the actual processing delay and the theoretical processing delay, and calculates the second-order variance of the discrete residual sequence within the sliding time window. The ratio of this second-order variance to the preset system environmental noise benchmark is calculated, and the smoothing coefficient α is adjusted online according to this ratio. When the second-order variance increases, the system decreases the step value of the smoothing coefficient α, preserving the physical monotonicity of the transient operation deviation rate data. During step adjustment, the system uses a preset initial smoothing coefficient as the adjustment base, extracts the relative offset between the ratio and the threshold constant 1, and uses the reciprocal of this relative offset as a contraction multiplier. This multiplier is directly multiplied by the smoothing coefficient of the current cycle to obtain the updated step value for the next cycle. This inverse proportional decay mapping ensures that when low-level high-frequency noise causes severe oscillations in the second-order variance, the smoothing coefficient α shrinks rapidly along a mathematically determined convergence trajectory, strengthening the damping effect of the filter channel on random spikes. The system establishes a global tensor throughput feedback link, collects the current overall computing performance of the heterogeneous computing resource pool, and calculates the ratio parameter between the overall computing performance and the sum of the theoretical physical peak values of the hardware. When the ratio parameter is lower than the set performance approach threshold, the system proportionally increases the aforementioned weight coefficient of the performance degradation gradient based on the difference between the ratio parameter and the constant 1, so that the mapping logic guides subsequent distribution away from nodes whose hardware performance is degrading. The specific dynamic correction logic follows a proportional-integral adjustment mechanism: the system takes the difference between the ratio parameter calculated in the current cycle and the constant 1 as the performance deviation, multiplies this performance deviation by a preset baseline penalty coefficient, and accumulates the result onto the weight coefficient of the previous scheduling cycle. If the ratio parameter fails to recover to near the performance approach threshold for three consecutive scheduling cycles, the system introduces an exponential gain factor on top of this linear increase to amplify the weight coefficient nonlinearly, ensuring that when overall cluster throughput collapses, task flow to degraded nodes is blocked with very high priority. The system also calculates the differential change between the smoothed deviation values of adjacent sampling periods and compares the absolute value of this differential change with a preset hardware temperature control response hysteresis threshold. If the absolute value is less than the threshold, the system forces the performance degradation gradient of the current cycle to zero, thereby constructing a logic dead zone that filters out performance jumps caused by small voltage fluctuations.
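The weight correction described in this paragraph can be sketched as a small stateful controller. The following is only one possible reading: the performance deviation is taken as 1 minus the ratio parameter so that it is positive, and the exponential gain is modelled as a multiplication by exp(gain_rate × number of cycles beyond the second). The class name, the default coefficient values and the exact form of the gain factor are all assumptions, since the text does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class GradientWeightController:
    weight: float = 1.0              # weight coefficient of the degradation gradient
    penalty: float = 0.5             # baseline penalty coefficient (assumed value)
    approach_threshold: float = 0.9  # performance approach threshold (assumed value)
    gain_rate: float = 0.5           # rate of the exponential gain factor (assumed form)
    low_cycles: int = 0              # consecutive cycles below the threshold

    def update(self, throughput: float, peak_sum: float) -> float:
        """Run one scheduling cycle and return the updated weight coefficient."""
        ratio = throughput / peak_sum
        if ratio >= self.approach_threshold:
            self.low_cycles = 0      # performance recovered, keep the weight as is
            return self.weight
        self.low_cycles += 1
        deviation = 1.0 - ratio      # performance deviation from the constant 1
        # Accumulate the penalised deviation onto the previous cycle's weight.
        self.weight += self.penalty * deviation
        if self.low_cycles >= 3:
            # Assumed nonlinear amplification after three consecutive low cycles.
            self.weight *= math.exp(self.gain_rate * (self.low_cycles - 2))
        return self.weight


if __name__ == "__main__":
    ctrl = GradientWeightController()
    for cycle in range(4):
        print(cycle, ctrl.update(throughput=750.0, peak_sum=1200.0))
```

Under these assumptions the first two low cycles raise the weight linearly, and the third and later cycles raise it much faster, which matches the intent of blocking task flow to degraded nodes when throughput does not recover.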
[0048] When a computing resource unit triggers hardware frequency reduction because its junction temperature exceeds a safe threshold, the frequency multiplication ratio output by its underlying registers decreases, causing the actual processing delay to drift nonlinearly, and the system captures the transient operation deviation rate breaking through the degradation boundary point. The scheduling logic reads the physical status register of the stream processor to confirm that the stream processor has entered the frequency closed-loop adjustment state. Combining this with the trend of the performance degradation gradient, computing resource units in the main frequency recovery phase are identified as nodes in the physical state repair period. The system does not rely on a single mathematical judgment based on the negative value of the first derivative; it also establishes a multi-dimensional hardware state cross-verification mechanism. When the performance degradation gradient is observed to change from positive to negative in the current control cycle, the on-chip temperature sensor and the frequency multiplier status indicator of the corresponding computing unit are polled simultaneously. When the chip junction temperature shows a monotonically decreasing trend and the value of the hardware frequency multiplier register starts to rise stepwise from its trough, the system judges the latency reduction at this time to be temperature control recovery dominated by physical heat dissipation. This rules out, at the physical level, false performance recovery caused by an accidental increase in cache hit rate or transient thread suspension, ensuring that the mathematical representation of latency is bound to the underlying thermodynamic state.
The system maps target operator nodes with a high topological convergence degree to nodes in the physical state repair period, and uses the asynchronous waiting period that the tensors need to accumulate the tensor asynchronous retention integral in the memory buffer to offset the physical response delay required for the hardware frequency to recover from the down-frequency state to the rated state. This allows heterogeneous clusters, under a given heat dissipation hardware configuration, to improve tensor processing density per unit power consumption by aligning the phase of the model topology semantics with the hardware physical cycle. In actual high-concurrency inference pipelines, the asynchronous waiting period generated by a single forward propagation cannot directly cover a temperature control lag on the order of hundreds of milliseconds. The system therefore performs micro-batch reorganization and continuous redirection, in the time dimension, of multiple high-convergence operators belonging to different inference requests, so that the discrete asynchronous waiting gaps of multiple independent requests form a continuous temporal shielding band on the target computing unit in its recovery period. This multi-task gap splicing mechanism accumulates millisecond-level microscopic software gap delays up to the macroscopic physical repair scale of hundreds of milliseconds, thereby substantially absorbing and masking the long physical heat dissipation cycle of a single node without adding waiting overhead.
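The cross-verification rule in this example (a gradient sign change confirmed by falling junction temperature and a frequency multiplier rising from its trough) can be sketched as a predicate over short sensor windows. The function names, the window representation as plain lists, and the exact interpretation of "rising stepwise from the trough" (non-decreasing after the window minimum and ending above it) are assumptions for this sketch.

```python
from typing import Sequence


def strictly_decreasing(values: Sequence[float]) -> bool:
    """True if every sample is lower than the one before it."""
    return len(values) > 1 and all(b < a for a, b in zip(values, values[1:]))


def rising_from_trough(multipliers: Sequence[float]) -> bool:
    """True if the series never falls after its minimum and ends above it."""
    if len(multipliers) < 2:
        return False
    trough = min(range(len(multipliers)), key=multipliers.__getitem__)
    tail = multipliers[trough:]
    steps_up = all(b >= a for a, b in zip(tail, tail[1:]))
    return steps_up and tail[-1] > tail[0]


def is_thermal_recovery(prev_gradient: float, gradient: float,
                        junction_temps_c: Sequence[float],
                        freq_multipliers: Sequence[float]) -> bool:
    """Accept a recovery only when the latency trend and both hardware signals agree."""
    sign_flip = prev_gradient > 0 and gradient < 0
    return (sign_flip
            and strictly_decreasing(junction_temps_c)
            and rising_from_trough(freq_multipliers))


if __name__ == "__main__":
    # Hypothetical readings: cooling chip, multiplier stepping back up.
    print(is_thermal_recovery(0.2, -0.1, [92.0, 90.0, 87.0], [8, 8, 9, 10]))  # True
    # Latency improved but the multiplier is flat, e.g. a cache-hit artefact.
    print(is_thermal_recovery(0.2, -0.1, [92.0, 90.0, 87.0], [8, 8, 8]))      # False
```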
[0049] Example 4: In a heterogeneous computing node cluster environment in the pre-deployment stage, the system disconnects the dynamic frequency adjustment signal from the external device interconnection bus data request, constructs a steady-state calibration condition without bus contention, extracts all operators in the neural network to be mapped, and classifies them into a computationally intensive operator set and a bandwidth-intensive operator set based on the ratio of the per-invocation floating-point operation volume to the tensor volume. For these two types of operators, multiple sets of tensor data with increasing dimensionality are prepared under the steady-state calibration condition and input into each computing resource unit for 1000 loop operations. The underlying system clock counter is read to record the physical instruction cycle time of each operation. These times are accumulated and divided by the number of loops to obtain the arithmetic mean, which is written into non-volatile memory and fixed as the theoretical processing delay of the corresponding combination of operator and specific hardware. This theoretical processing delay serves as the static reference base for subsequently calculating the transient operation deviation rate.
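The calibration loop above amounts to timing an operator repeatedly and storing the mean. The sketch below shows that idea with Python's `time.perf_counter_ns` standing in for the underlying system clock counter; the injectable `clock_ns` parameter, the function names and the classification threshold are assumptions added for illustration, and persisting the result to non-volatile memory is omitted.

```python
import time
from typing import Callable


def calibrate_theoretical_delay(run_operator: Callable[[], None],
                                loops: int = 1000,
                                clock_ns: Callable[[], int] = time.perf_counter_ns) -> float:
    """Arithmetic mean of the per-run time, in nanoseconds, over `loops` runs."""
    if loops <= 0:
        raise ValueError("loops must be positive")
    total_ns = 0
    for _ in range(loops):
        start = clock_ns()
        run_operator()
        total_ns += clock_ns() - start
    return total_ns / loops


def classify_operator(flops: float, tensor_bytes: float, threshold: float) -> str:
    """Split operators by the ratio of floating-point work to tensor volume."""
    # Assumption: a single threshold on the ratio separates the two sets.
    intensity = flops / tensor_bytes
    return "compute_intensive" if intensity >= threshold else "bandwidth_intensive"


if __name__ == "__main__":
    # A stand-in workload; a real calibration would launch the operator on the device.
    mean_ns = calibrate_theoretical_delay(lambda: sum(range(1000)))
    print(f"theoretical delay: {mean_ns:.0f} ns")
    print(classify_operator(flops=2e9, tensor_bytes=1e6, threshold=100.0))
```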
[0050] When a newly added heterogeneous computing resource unit is physically connected to the heterogeneous computing node cluster, the system continuously sends idle heartbeat operators to the new unit, synchronously reads the underlying temperature sensor values and the frequency multiplication ratio register status, and plots the evolution response curve of the actual processing delay as the chip junction temperature increases. The system calculates the slope difference between adjacent temperature sampling data points on this curve. A gentle temperature control interval in which the absolute value of the slope difference is less than 5% is defined as a logic decision dead zone related to hardware thermophysical hysteresis. When subsequently outputting neural network computation mapping instructions, the system calculates the numerical difference between the transient operation deviation rate and the smoothed deviation value of adjacent sampling periods and compares this difference with the quantization boundary threshold of the logic decision dead zone. When the numerical difference falls within the logic decision dead zone, the system determines that the current delay increment is caused by the superposition of underlying power supply voltage ripple and sensor thermal noise. The performance degradation gradient of the current control cycle is then set to 0, eliminating false frequency reduction scheduling signals caused by micro-random interference and maintaining the physical determinism with which operator tasks are directionally mapped to computing resource units that are actually in the physical performance recovery period.
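The dead-zone gating in this paragraph reduces to a single comparison. A minimal sketch, assuming the raw gradient has already been computed elsewhere and that the dead zone is expressed as one symmetric threshold on the difference between the deviation rate and its smoothed value (the function and parameter names are hypothetical):

```python
def apply_dead_zone(raw_gradient: float,
                    deviation_rate: float,
                    smoothed_value: float,
                    dead_zone_threshold: float) -> float:
    """Zero the gradient when the change is small enough to be ripple or sensor noise."""
    difference = deviation_rate - smoothed_value
    if abs(difference) < dead_zone_threshold:
        return 0.0  # inside the logic decision dead zone
    return raw_gradient


if __name__ == "__main__":
    # A small fluctuation is suppressed; a larger jump passes through unchanged.
    print(apply_dead_zone(0.02, 1.06, 1.05, dead_zone_threshold=0.05))  # 0.0
    print(apply_dead_zone(0.30, 1.42, 1.05, dead_zone_threshold=0.05))  # 0.3
```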
[0051] Example 5: In the initialization and debugging scenario before a heterogeneous computing cluster undertakes neural network inference services, physical differences in the production batches and heat dissipation packaging of heterogeneous computing resource units lead to adaptation blind spots in the preset judgment parameters. This application provides an offline parameter calibration method for the hardware temperature control response hysteresis threshold and the system environmental noise benchmark. The scheduling system continuously injects floating-point multiply-accumulate test operators with progressively increasing computational density into the graphics processing unit node to be calibrated, raising the junction temperature of the underlying chip. At the same time, the temperature sensor values of the graphics processing unit node are read and the corresponding transient operation deviation rate is recorded. In the time series analysis, when the temperature sensor reading exceeds the hardware's rated frequency reduction trigger line, the system extracts the transient operation deviation rate sequence within 10 clock cycles before and after the frequency reduction control signal is issued and calculates the variation amplitude of the second-order variance within the abrupt frequency reduction interval. The variation amplitude samples obtained from multiple step-up temperature tests are pooled to construct a Gaussian probability distribution model, and the lower limit of a specific confidence interval of this model is taken as the system environmental noise benchmark. Simultaneously, the system tracks the time interval between the moment the frequency reduction control signal is issued and the moment the actual transient operation deviation rate begins to exhibit a monotonically decreasing trend. The time intervals extracted from multiple tests form a response delay sequence, and the arithmetic mean of this sequence is taken as the hardware temperature control response hysteresis threshold.
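The two offline calibration outputs can be sketched with the Python standard library. The example fits a normal distribution to the variance-amplitude samples and takes the lower bound of a central confidence interval as the noise benchmark, then averages the response delays to obtain the hysteresis threshold. The 95% confidence level, the function names and the sample values are assumptions; the text only refers to "a specific confidence interval".

```python
from statistics import NormalDist, fmean
from typing import Sequence


def noise_benchmark(variance_amplitudes: Sequence[float],
                    confidence: float = 0.95) -> float:
    """Lower limit of the central confidence interval of a fitted Gaussian model."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # Requires at least two samples with non-zero spread.
    model = NormalDist.from_samples(variance_amplitudes)
    return model.inv_cdf((1.0 - confidence) / 2.0)


def hysteresis_threshold(response_delays_ms: Sequence[float]) -> float:
    """Arithmetic mean of the response delay sequence."""
    return fmean(response_delays_ms)


if __name__ == "__main__":
    amplitudes = [0.8, 1.0, 1.2]      # hypothetical samples from step-up temperature tests
    delays_ms = [100.0, 120.0, 140.0]  # hypothetical response delays
    print(noise_benchmark(amplitudes))
    print(hysteresis_threshold(delays_ms))  # 120.0
```

Choosing the lower bound makes the benchmark conservative: a variance ratio above 1 then indicates noise clearly above what the quietest calibrated runs showed.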
[0052] The system writes the determined system environmental noise benchmark and hardware temperature control response hysteresis threshold into the controller's non-volatile memory, where they serve as the calibrated input parameters for adjusting the smoothing coefficient and for defining the logic decision dead zone in the online mapping task, so that the decision boundary on which the mapping model depends is adapted to the intrinsic response characteristics of the specific physical hardware. This allows the task distribution instructions of heterogeneous computing resource units to be based on quantized measurement data. The asynchronous waiting period determined with reference to the hardware temperature control response hysteresis threshold is used to offset the physical response delay required for the hardware main frequency to recover from the down-frequency state to the rated state. Through phase alignment between the model topology semantics and the hardware physical cycle, the heterogeneous cluster can thus improve tensor processing density per unit power consumption without adding redundant heat dissipation hardware.
[0053] The embodiments of this application have been described above with reference to the accompanying drawings. Unless otherwise specified, the embodiments and features in the embodiments of this application can be combined with each other. This application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit of this application and the scope of protection of this invention, and all of these forms are within the protection scope of this application.
Claims
1. A mapping method for neural network computation in a heterogeneous environment, characterized in that it includes the following steps: Step 101: parse the neural network computation graph to be mapped to identify the target operator node; obtain the total input tensor volume and the single-channel output tensor volume of the target operator node, and calculate the ratio of the total input tensor volume to the single-channel output tensor volume to determine the topological convergence degree; calculate the time deviation of each predecessor branch of the target operator node arriving at the computing resource unit, and determine the integral of the product of the time deviation and the tensor volume arriving ahead of time as the tensor asynchronous retention integral. Step 102: obtain the actual processing delay of the computing resource unit for processing the operator task, and calculate the ratio of the actual processing delay to the preset theoretical processing delay to obtain the transient operation deviation rate; solve the first derivative of the time series of the transient operation deviation rate to generate the performance degradation gradient. Step 103: select operator nodes whose topological convergence degree and tensor asynchronous retention integral exceed preset limits; traverse candidate computing resource units that have a negative performance degradation gradient and are in the temperature control adjustment recovery period; use the synchronization waiting period determined by the time deviation as a timing mask to map the selected operator nodes to the candidate computing resource units, and issue a mapping instruction set containing the mapping binding relationship.
2. The mapping method for neural network computation in a heterogeneous environment according to claim 1, characterized in that extracting the tensor asynchronous retention integral in Step 101 includes: obtaining the time domain difference between the longest predecessor branch time consumption and the shortest predecessor branch time consumption of the target operator node, and determining the product of the time domain difference and the tensor volume that arrives early as the tensor asynchronous retention integral.
3. The mapping method for neural network computation in a heterogeneous environment according to claim 1, characterized in that generating the performance degradation gradient in Step 102 includes: periodically collecting the processing feedback time of the computing resource unit and the hardware junction temperature parameters; fitting the computing output evolution curve based on the change trajectory of the transient operation deviation rate of the computing resource unit within a preset historical window, and determining the slope of the computing output evolution curve as the performance degradation gradient.
4. The mapping method for neural network computation in a heterogeneous environment according to claim 3, characterized in that the process of extracting the performance degradation gradient further includes the following sub-steps: introducing an exponential smoothing mechanism that includes historical degradation factors to reduce noise in the transient operation deviation rate; and, in combination with the characteristic parameters of the hardware architecture to which the computing resource unit belongs, performing gain correction on the processed transient operation deviation rate to compensate for the data feedback lag caused by differences in clock adjustment mechanisms between computing resource units.
5. The mapping method for neural network computation in a heterogeneous environment according to claim 1, characterized in that mapping operator nodes to candidate computing resource units in Step 103 includes: calculating the real-time processing entropy value of each computing resource unit, the real-time processing entropy value being determined by a weighted sum of the memory access bandwidth utilization rate of the computing resource unit and the waiting queue depth of the computing channel; constructing a mapping evaluation index that includes the tensor asynchronous retention integral, the performance degradation gradient and the real-time processing entropy value; and ranking the computing resource units according to the value of the mapping evaluation index.
6. The mapping method for neural network computation in a heterogeneous environment according to claim 5, characterized in that the mapping process in Step 103 further includes the following sub-steps: Step 1031: monitoring the bandwidth occupancy rate of the internal bus of the heterogeneous computing resource pool; Step 1032: when the real-time processing entropy value of the target computing resource unit exceeds the preset threshold and the bandwidth occupancy rate reaches the preset pressure threshold, suppressing the flow of operators with high topological convergence characteristics to the target computing resource unit.
7. The mapping method for neural network computation in a heterogeneous environment according to claim 1, characterized in that the method further includes the following steps: Step 1041: establishing a global tensor throughput feedback link to capture the overall computing performance of the mapped heterogeneous computing resource pool; Step 1042: dynamically adjusting the weight coefficient of the performance degradation gradient in Step 102 based on how closely the overall computing performance approaches the sum of the theoretical physical peak values of the hardware.
8. The mapping method for neural network computation in a heterogeneous environment according to claim 1, characterized in that the method further includes: parsing the logical dependency depth between operators in the neural network model to be processed, and constructing an operator mapping priority queue in combination with the tensor asynchronous retention integral.
9. The mapping method for neural network computation in a heterogeneous environment according to claim 8, characterized in that the operator mapping priority queue is arranged according to the logical dependency depth from shallow to deep, and each operator node is associated with the computing resource unit determined based on the timing mask.