Acceleration method and apparatus for EPS calculation, server, and storage medium

By converting the circuit verification task into a directed acyclic graph and calculating the ratio of edges to nodes, and using the ratio threshold to determine the processing method, the problem of insufficient hardware resource utilization in EPS calculation is solved, and efficient calculation is achieved under the condition of limited hardware resources.

CN122287488A · Pending · Publication Date: 2026-06-26 · BEIJING NEW ENERGY VEHICLE TECH INNOVATION CENT CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: BEIJING NEW ENERGY VEHICLE TECH INNOVATION CENT CO LTD
Filing Date: 2026-02-10
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

As the scale of integrated circuit design expands, the workload of EPS calculation grows exponentially, leading to low efficiency of traditional CPU serial processing and insufficient utilization of hardware resources such as GPUs, which in turn causes a sharp decrease in algorithm running efficiency.

Method used

By converting the circuit verification task into a directed acyclic graph, calculating the ratio of edges to nodes, and determining whether to use the SIMD instruction set for data-level parallel processing or decompose the task based on the ratio threshold, a task-level data dependency graph is generated, and a subtask queue matching the computing processing unit is generated, thus balancing data-level parallel efficiency and task-level load balancing.

Benefits of technology

Maximize throughput and avoid idle or excessive consumption of computing resources when managing complex dependencies, given limited hardware resources. Improve computing efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses an accelerated EPS calculation method, apparatus, server, and storage medium, comprising: calculating the ratio of edges to nodes in a directed acyclic graph of a circuit verification task; comparing the ratio with a ratio threshold, and when the ratio is less than the ratio threshold, performing data-level parallel processing using a SIMD instruction set; when the ratio is greater than the ratio threshold, decomposing the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until a subgraph constraint condition is met; calculating the subgraph ratio of edges to nodes in each decomposed subgraph, generating a task-level data dependency graph for the decomposed subgraphs based on dependencies; and comparing the subgraph ratio with a ratio threshold, and when the ratio is less than the ratio threshold, performing data-level parallel processing using a SIMD instruction set; otherwise, generating a subtask queue matching the computational processing unit based on the decomposed subgraphs and the task-level data dependency graph.

Description

Technical Field

[0001] This invention relates to the field of EPS calculation technology, and in particular to an EPS calculation acceleration method, apparatus, server, and storage medium.

Background Technology

[0002] As the scale of integrated circuit design continues to expand, combinational equivalence checking has become a critical step in the chip verification process. The EPS (Exact Probability-based Simulation) algorithm, which simulates circuits exactly using signal probabilities, shows a significant advantage over traditional SAT solvers on circuits with high XOR density.

[0003] In the process of realizing this invention, the inventors identified the following technical problems: the probability calculation workload grows exponentially with circuit scale, and traditional serial processing on a CPU is inefficient; the computation flow is long and its pattern is heterogeneous (a mixture of logical judgment and probability operations), so hardware resources such as GPUs are underutilized, which sharply reduces the running efficiency of the algorithm.

Summary of the Invention

[0004] This invention provides a method, apparatus, server, and storage medium for accelerating EPS calculation, in order to solve the technical problem of low operating efficiency in EPS calculation in the prior art.

[0005] In a first aspect, embodiments of the present invention provide a method for accelerating EPS calculation, comprising: calculating the ratio of edges to nodes in the directed acyclic graph of a circuit verification task; comparing the ratio with a ratio threshold, and when the ratio is less than the ratio threshold, performing data-level parallel processing using the SIMD instruction set; when the ratio is greater than the ratio threshold, decomposing the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until the subgraph constraint condition is met; calculating the subgraph ratio of edges to nodes in each decomposed subgraph, and generating a task-level data dependency graph for the decomposed subgraphs based on their dependencies; and comparing the subgraph ratio with the ratio threshold, and when the subgraph ratio is less than the ratio threshold, performing data-level parallel processing using the SIMD instruction set; otherwise, generating a subtask queue matching the computational processing units based on the decomposed subgraphs and the task-level data dependency graph.

[0006] In a second aspect, embodiments of the present invention also provide an acceleration device for EPS calculation, comprising: a calculation module, used to calculate the ratio of edges to nodes in the directed acyclic graph of a circuit verification task; a decomposition module, used to compare the ratio with a ratio threshold, to perform data-level parallel processing using the SIMD instruction set when the ratio is less than the ratio threshold, and, when the ratio is greater than the ratio threshold, to decompose the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until the subgraph constraint condition is met; a generation module, used to calculate the subgraph ratio of edges to nodes in each decomposed subgraph and to generate a task-level data dependency graph for the decomposed subgraphs based on their dependencies; a comparison module, used to compare the subgraph ratio with the ratio threshold and to perform data-level parallel processing using the SIMD instruction set when the subgraph ratio is less than the ratio threshold; and a queue generation module, used to generate, when the subgraph ratio is not less than the ratio threshold, a subtask queue matching the computational processing units based on the decomposed subgraphs and the task-level data dependency graph.

[0007] In a third aspect, embodiments of the present invention also provide a server, comprising: one or more processors; and a storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the EPS calculation acceleration method provided in the above embodiments.

[0008] In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the EPS calculation acceleration method provided in the above embodiments.

[0009] The EPS calculation acceleration method, apparatus, server, and storage medium provided in this invention calculate the edge-to-node ratio in the directed acyclic graph of a circuit verification task. The ratio is compared to a ratio threshold. If the ratio is less than the threshold, data-level parallel processing is performed using a SIMD instruction set. If the ratio is greater than the threshold, the directed acyclic graph of the task is decomposed to obtain decomposed subgraphs until a subgraph constraint condition is met. The edge-to-node ratio of each decomposed subgraph is calculated, and a task-level data dependency graph is generated based on the dependencies of the decomposed subgraphs. The subgraph ratio is then compared to the ratio threshold. If the subgraph ratio is less than the threshold, data-level parallel processing is performed using a SIMD instruction set; otherwise, a subtask queue matching the computing processing units is generated based on the decomposed subgraphs and the task-level data dependency graph. By converting the circuit verification task into a directed acyclic graph and calculating the edge-to-node ratio, the sparsity and dependency complexity of the task are characterized. This ratio is then compared with a preset threshold to determine suitability for data-level parallel processing. If the dependency complexity is too high for data-level parallel processing, the graph is decomposed level by level until the subgraph constraint is met. This approach balances data-level parallel efficiency with task-level load balancing, maximizing throughput under limited hardware resources. It keeps computing resources from sitting idle while avoiding excessive consumption on managing complex dependencies.

Brief Description of the Drawings

[0010] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

Figure 1 is a flowchart of the EPS calculation acceleration method provided in Embodiment 1 of the present invention;
Figure 2 is a schematic diagram of the heterogeneous acceleration hardware in the EPS calculation acceleration method provided in Embodiment 1 of the present invention;
Figure 3 is a flowchart of the EPS calculation acceleration method provided in Embodiment 2 of the present invention;
Figure 4 is a schematic diagram of the structure of the EPS calculation acceleration device provided in Embodiment 3 of the present invention;
Figure 5 is a schematic diagram of the server structure provided in Embodiment 3 of the present invention.

Detailed Description of the Embodiments

[0011] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, the accompanying drawings show only the parts relevant to the present invention, and not all of the structures.

Embodiment 1

[0012] Figure 1 is a flowchart of the EPS calculation acceleration method provided in Embodiment 1 of the present invention. This embodiment is applicable to accelerating EPS calculation. The method can be executed by an EPS calculation acceleration device, and specifically includes the following steps: Step 110: Calculate the ratio of edges to nodes in the directed acyclic graph of the circuit verification task.

[0013] In the field of circuit equivalence verification within Electronic Design Automation (EDA), Exact Probability-based Simulation (EPS) is a combinational equivalence verification technique that uses signal probabilities for exact analysis. It efficiently detects whether two combinational circuits are functionally equivalent by substituting signal probabilities for the specific logic values (0 or 1) used in simulation and combining this with exact probability calculation. The signal probability is the probability that a signal takes the logic value 1 over the input space.
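
To make the notion of a signal probability concrete, the following is a minimal sketch that propagates probabilities through a small gate list. It assumes that the inputs of every gate are statistically independent, which does not hold in general (for example with reconvergent fan-out), so it illustrates the quantity being computed rather than the exact EPS procedure. The gate-list format and the function name `propagate_probabilities` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical gate format: (output signal, gate type, input signals),
# listed in topological order.
Gate = Tuple[str, str, List[str]]


def propagate_probabilities(inputs: Dict[str, float], gates: List[Gate]) -> Dict[str, float]:
    """Return P(signal = 1) for every signal, ASSUMING independent gate inputs."""
    prob = dict(inputs)
    for out, kind, ins in gates:
        if kind == "NOT":
            prob[out] = 1.0 - prob[ins[0]]
        elif kind == "AND":
            prob[out] = prob[ins[0]] * prob[ins[1]]
        elif kind == "XOR":
            a, b = prob[ins[0]], prob[ins[1]]
            prob[out] = a + b - 2.0 * a * b
        else:
            raise ValueError(f"unsupported gate type: {kind}")
    return prob


example = propagate_probabilities(
    {"a": 0.5, "b": 0.5},
    [("n1", "AND", ["a", "b"]), ("n2", "XOR", ["a", "b"]), ("n3", "NOT", ["n1"])],
)
```

With both inputs at probability 0.5, the AND output has probability 0.25, the XOR output 0.5, and the inverted AND output 0.75.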

[0014] In this embodiment, for the two circuits requiring combinational equivalence verification (circuit A under verification and reference circuit B), their netlists or gate-level description files can be read. Each logic operator (such as an AND gate, XOR gate, or NOT gate) is abstracted as a node in the graph, and signal lines or logic connections are abstracted as directed edges, thereby establishing a directed acyclic graph (DAG) from the inputs to the outputs. Loops are detected automatically during graph construction; if feedback paths exist, cutting at registers or time-frame unrolling is used to ensure that the graph is acyclic. The resulting DAG accurately describes the logical dependencies of each signal in the circuit and forms the basis for the subsequent parallelism-type decision and task partitioning.
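
The graph construction and loop check described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the gate-list format is hypothetical, primary inputs are treated as source nodes alongside the gates, and acyclicity is checked with Kahn's algorithm (the source does not say which loop-detection method is used).

```python
from collections import defaultdict, deque
from typing import List, Set, Tuple

Gate = Tuple[str, str, List[str]]  # hypothetical: (output, gate type, inputs)
Edge = Tuple[str, str]


def build_graph(gates: List[Gate]) -> Tuple[Set[str], Set[Edge]]:
    """Turn a gate list into nodes and directed edges (driver -> driven gate)."""
    nodes: Set[str] = set()
    edges: Set[Edge] = set()
    for out, _kind, ins in gates:
        nodes.add(out)
        for src in ins:
            nodes.add(src)
            edges.add((src, out))
    return nodes, edges


def is_acyclic(nodes: Set[str], edges: Set[Edge]) -> bool:
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(nodes)


comb_nodes, comb_edges = build_graph([("n1", "AND", ["a", "b"]), ("n2", "XOR", ["n1", "b"])])
loop_nodes, loop_edges = build_graph([("x", "AND", ["y"]), ("y", "NOT", ["x"])])
```

A graph that fails the check would then be handed to whatever register-cutting or unrolling step the flow uses before the ratio is computed.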

[0015] Accordingly, decomposing the directed acyclic graph (DAG) of the circuit verification task includes: reading the node and edge information of the DAG, where the node information includes each node's computational cost and fan-in and fan-out characteristics, and the edge information includes the data volume and dependency carried on each edge; calculating a coupling degree between computing units from the dependency strength and the data volume, and selecting nodes whose coupling degree exceeds a preset coupling degree threshold as seed nodes of the initial subgraphs; expanding outward from each seed node, merging adjacent nodes with strong dependencies or heavy data interaction into the current subgraph; evaluating the cutting cost of each edge that would need to be cut, and if it is greater than a preset cutting cost threshold, assigning the connected node to the current subgraph; and determining from the edge information whether a loop structure, grouped operator, or continuous branch chain is present, and if so, assigning the nodes connected by the corresponding edges to the current subgraph.

[0016] The goal of the above decomposition is to aggregate tightly coupled, communication-intensive nodes into the same subgraph and to avoid cutting critical dependencies that are costly to cut, while still meeting the thresholds for the number of nodes and edges. First, the entire task DAG is parsed to obtain the attributes of each node, and the coupling degree between all adjacent node pairs in the graph is calculated. Node pairs whose coupling degree exceeds a preset threshold are considered "strongly coupled", and the nodes in these pairs are marked as seed nodes. Seed nodes are the core and starting point for subsequent subgraph growth, which ensures that each subgraph is built outward from its most tightly coupled part. The algorithm starts from each seed node and attempts to merge its neighboring nodes, prioritizing nodes with dense communication and strong dependencies so as to form a subgraph with good computational locality. When expanding the subgraph boundary, if a potential cut edge is encountered, the algorithm evaluates the cost of cutting it. If the cost is greater than a preset threshold, cutting the edge is not worthwhile, and the node on the other side of the edge is merged into the current subgraph instead. This proactively avoids partitions that would generate significant communication overhead. Furthermore, the algorithm identifies specific, known, and efficient computational patterns in the graph, particularly those corresponding to loop structures, grouped operators, or continuous branch chains, and ensures they are not split apart. This preserves the program's high-level semantics and parallel processing capabilities.
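
The seed-and-grow procedure can be sketched as below. This is a simplified illustration, not the patented algorithm: per-edge coupling degrees and cutting costs are taken as given inputs, a single hypothetical `max_nodes` cap stands in for the subgraph constraint, and the pattern-preservation step (loop structures, grouped operators, branch chains) is omitted.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def grow_subgraphs(nodes: List[str],
                   coupling: Dict[Edge, float],
                   cut_cost: Dict[Edge, float],
                   coupling_threshold: float,
                   cut_cost_threshold: float,
                   max_nodes: int) -> List[Set[str]]:
    """Grow subgraphs from strongly coupled seed nodes (simplified sketch)."""
    neighbors: Dict[str, List[Tuple[str, Edge]]] = {n: [] for n in nodes}
    for edge in coupling:
        u, v = edge
        neighbors[u].append((v, edge))
        neighbors[v].append((u, edge))

    # Seeds: endpoints of strongly coupled edges, strongest pairs first.
    seeds: List[str] = []
    for (u, v), c in sorted(coupling.items(), key=lambda kv: -kv[1]):
        if c > coupling_threshold:
            for n in (u, v):
                if n not in seeds:
                    seeds.append(n)

    assigned: Set[str] = set()
    subgraphs: List[Set[str]] = []
    for start in seeds + [n for n in nodes if n not in seeds]:
        if start in assigned:
            continue
        current = {start}
        assigned.add(start)
        frontier = [start]
        while frontier and len(current) < max_nodes:
            u = frontier.pop()
            for v, edge in neighbors[u]:
                if v in assigned or len(current) >= max_nodes:
                    continue
                # Merge when the pair is strongly coupled or the edge is too costly to cut.
                if coupling[edge] > coupling_threshold or cut_cost[edge] > cut_cost_threshold:
                    current.add(v)
                    assigned.add(v)
                    frontier.append(v)
        subgraphs.append(current)
    return subgraphs


parts = grow_subgraphs(
    nodes=["a", "b", "c", "d"],
    coupling={("a", "b"): 0.9, ("b", "c"): 0.2, ("c", "d"): 0.1},
    cut_cost={("a", "b"): 10.0, ("b", "c"): 50.0, ("c", "d"): 1.0},
    coupling_threshold=0.5, cut_cost_threshold=20.0, max_nodes=3,
)
```

In the example, `a` and `b` are seeds because their edge is strongly coupled, `c` is pulled in because the edge `b`-`c` is too expensive to cut, and `d` ends up in its own subgraph.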

[0017] Furthermore, evaluating the cutting cost of an edge to be cut includes: obtaining the amount of data transmitted each time the edge is executed; obtaining the frequency with which the edge is triggered over the whole computation; and evaluating the cutting cost of the edge based on the amount of data and the frequency. In this way, the overhead of cutting an edge is modeled quantitatively, taking into account not only the size of a single data transmission but also the total overhead of that transmission over the entire execution. By not cutting high-cost edges, the algorithm keeps sufficient computation inside each generated subgraph while reducing the communication burden between subgraphs, thus maintaining a good overall computation-to-communication ratio.
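
As a small illustration of this cost model, the sketch below assumes the cutting cost is the product of the data volume per transfer and the trigger count. The source only states that the cost is based on both quantities, so the product form and the function names are assumptions.

```python
def cut_cost(bytes_per_transfer: int, trigger_count: int) -> int:
    """Assumed model: total bytes that would cross the cut over the whole run."""
    return bytes_per_transfer * trigger_count


def keep_edge_internal(bytes_per_transfer: int, trigger_count: int, cost_threshold: int) -> bool:
    """True when the edge is too costly to cut, so both ends stay in one subgraph."""
    return cut_cost(bytes_per_transfer, trigger_count) > cost_threshold
```

For example, an edge carrying 64 bytes that fires 1000 times has an assumed cost of 64000 and would be kept internal under a threshold of 10000, whereas an edge carrying 8 bytes that fires 10 times would be a cheap cut.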

[0018] Accordingly, calculating the ratio of edges to nodes in the directed acyclic graph of the circuit verification task may include calculating it in the following manner:

$$R = \frac{|E|}{|V|}$$

where $E$ is the set of edges in the directed acyclic graph of the task, $V$ is the set of nodes in the graph, and $R$ is the ratio. As can be seen from the above formula, the ratio reflects the sparsity and dependency complexity of the graph. It can be used as the basis for deciding whether the current task is suited to fully parallel processing or to block-wise processing.

[0019] Step 120: Compare the ratio with a ratio threshold. If the ratio is less than the ratio threshold, use the SIMD instruction set for data-level parallel processing. If the ratio is greater than the ratio threshold, decompose the directed acyclic graph of the circuit verification task into subgraphs until the subgraph constraint condition is met.

[0020] For example, the appropriate data-level parallelism or task-level parallelism mode can be determined based on the characteristics of the formed DAG structure. The decision-maker calculates the ratio R of the number of edges to the number of nodes in the DAG. When R is less than 0.3, it indicates that the dependencies between nodes are weak and the logical structure is regular, making it suitable for data-level parallelism using the SIMD instruction set. When R is greater than or equal to 0.3, it indicates that there are many dependencies and the structure is complex, requiring task-level parallelism. The decision result will directly affect the subsequent computation scheduling path.
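
The decision rule in this paragraph is simple enough to write down directly. The sketch below uses the example threshold of 0.3 from the text; the function names and the string labels `"simd"` and `"task"` are hypothetical.

```python
def edge_node_ratio(num_edges: int, num_nodes: int) -> float:
    """R = number of edges / number of nodes of the DAG."""
    if num_nodes <= 0:
        raise ValueError("the graph must contain at least one node")
    return num_edges / num_nodes


def choose_mode(num_edges: int, num_nodes: int, threshold: float = 0.3) -> str:
    """Return 'simd' for data-level parallelism, 'task' for task-level parallelism."""
    return "simd" if edge_node_ratio(num_edges, num_nodes) < threshold else "task"
```

A graph with 2 edges and 10 nodes (R = 0.2) would be routed to SIMD processing, while one with 12 edges and 10 nodes (R = 1.2) would be routed to task-level decomposition.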

[0021] When the ratio is determined to be greater than the ratio threshold, the directed acyclic graph (DAG) can be decomposed to further clarify its internal dependencies, so that data-level parallel tasks can be extracted and computational efficiency further improved. Through continued decomposition, data-level parallel tasks and task-level parallel tasks are separated. In this embodiment, the DAG of the circuit verification task can be decomposed to obtain subgraphs until the subgraph constraint condition is met. The subgraph constraint condition may include: the number of nodes in the subgraph is less than a preset node number threshold; and the number of edges in the subgraph is less than a preset edge number threshold. The node number threshold bounds the amount of computation in each subtask. This ensures that a single subtask is not so large that it runs for too long on a single computing unit and blocks the entire pipeline, and it also avoids excessive consumption of on-chip storage. The edge number threshold bounds the dependency complexity within each subtask. A subgraph with fewer edges has simpler data dependencies between its internal operations, making it easier to pipeline efficiently on hardware or to process in the SIMD mode that may be enabled in subsequent steps.
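
The "decompose until the constraint is met" loop can be sketched as follows. The constraint check follows the two thresholds named above. The splitting step is a deliberately naive placeholder (bisecting a topological order) and is not the coupling-based decomposition of this embodiment; it is used only to show the termination condition. All names are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def meets_constraints(nodes: Set[str], edges: List[Edge], max_nodes: int, max_edges: int) -> bool:
    """Subgraph constraint: fewer than max_nodes nodes and fewer than max_edges internal edges."""
    internal = sum(1 for u, v in edges if u in nodes and v in nodes)
    return len(nodes) < max_nodes and internal < max_edges


def decompose(order: List[str], edges: List[Edge], max_nodes: int, max_edges: int) -> List[List[str]]:
    """Recursively split a topologically ordered node list until every part is small enough.

    Placeholder strategy: bisect the topological order.
    """
    if len(order) <= 1 or meets_constraints(set(order), edges, max_nodes, max_edges):
        return [order]
    mid = len(order) // 2
    return (decompose(order[:mid], edges, max_nodes, max_edges)
            + decompose(order[mid:], edges, max_nodes, max_edges))


pieces = decompose(["a", "b", "c", "d"],
                   [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")],
                   max_nodes=3, max_edges=3)
```

Here the four-node graph violates the node threshold and is split once; each half has two nodes and one internal edge, so recursion stops.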

[0022] Step 130: Calculate the ratio of edges to nodes in each decomposed subgraph, and generate a task-level data dependency graph for the decomposed subgraph based on the dependency relationship.

[0023] For example, the ratio of edges to nodes in the subgraph can still be calculated using the method described above. The ratio is then used to determine whether it is a data-level parallel task. If it is not a data-level parallel task, a task-level data dependency graph needs to be generated based on the dependencies.

[0024] For example, generating a task-level data dependency graph for the decomposed subgraph based on dependencies may include: traversing all cross-subgraph signals; if the production node of a signal is located in the first subgraph and the consumption node is located in the second subgraph, then adding a directed edge from the first subgraph to the second subgraph in the task-level data dependency graph; removing duplicate edges and organizing them in topological order to form a directed acyclic graph representing subtask dependencies.

[0025] For a cross-subgraph signal, if its producer node belongs to subgraph G1 and its consumer node belongs to subgraph G2, then a directed edge from subgraph G1 to subgraph G2 is added to the task-level data dependency graph. This directed edge signifies that subtask G2 depends on the output of subtask G1. During scheduling, G2 must wait for G1 to complete its computation and produce the required data before it can begin execution. Multiple cross-subgraph signals may exist between two subgraphs. For example, two different nodes in subgraph G1 might produce data for two nodes in subgraph G2. This would initially create two edges from G1 to G2; merging these duplicate edges into one makes the dependency graph clearer and reduces the number of edges. After merging, it still correctly represents G2's dependence on G1. During task scheduling, it's only necessary to know that G2 needs to wait for G1 to complete; the exact number of data signals is irrelevant. Topological sorting of the task-level data dependency graph yields one or more valid task execution sequences.
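
The construction of the task-level data dependency graph can be sketched as below. The mapping `owner` from each node to its subgraph and the function names are hypothetical; duplicate edges are merged by storing predecessors in a set, and the topological order is obtained with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def build_task_graph(signal_edges: List[Tuple[str, str]],
                     owner: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each subgraph to the set of subgraphs whose outputs it consumes."""
    deps: Dict[str, Set[str]] = {sg: set() for sg in set(owner.values())}
    for producer, consumer in signal_edges:
        src, dst = owner[producer], owner[consumer]
        if src != dst:            # only cross-subgraph signals create dependencies
            deps[dst].add(src)    # the set merges duplicate edges
    return deps


def schedule_order(deps: Dict[str, Set[str]]) -> List[str]:
    """One valid execution order of the subtasks."""
    return list(TopologicalSorter(deps).static_order())


owner = {"a": "G1", "b": "G1", "c": "G2", "d": "G2", "e": "G3"}
task_deps = build_task_graph([("a", "c"), ("b", "d"), ("c", "e"), ("a", "b")], owner)
```

In the example, two signals cross from G1 to G2 but yield a single dependency, and the edge `a`-`b` is ignored because both ends belong to G1.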

[0026] Step 140: Compare the subgraph ratio with a ratio threshold. If the ratio is less than the ratio threshold, perform data-level parallel processing using the SIMD instruction set; otherwise, generate a subtask queue that matches the computational processing unit based on the decomposed subgraph and the task-level data dependency graph.

[0027] A low ratio indicates that the subgraph has low computational density, complex data dependencies, or irregular control flow. Forcibly splitting it into multiple independent subtasks (task-level parallelism) may be counterproductive due to the significant overhead of task creation, scheduling, and synchronization. However, it may still contain regular, vectorizable loops or array operations. In this case, the optimal strategy is not to split the task, but to treat the subgraph as an atomic kernel and assign it to a single computational core. A high ratio indicates that the subgraph is a computationally intensive task with extremely high parallelism. It contains a large amount of work that can be completely parallelized independently. In this case, using only single-core SIMD is insufficient; multiple computational processing units are needed to complete the task collaboratively to achieve task-level parallelism. Figure 2 is a schematic diagram of the heterogeneous acceleration hardware in the EPS calculation acceleration method provided in Embodiment 1 of the present invention. Referring to Figure 2, by employing multiple computing units in the aforementioned heterogeneous acceleration hardware, subtasks can be executed in parallel or in dependency order.
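
A minimal sketch of Step 140 follows, assuming that subgraphs below the threshold are collected for SIMD processing and that the remaining subgraphs are distributed round-robin, in topological order, over per-unit queues. The round-robin policy, the function name `plan_execution`, and the data layout are assumptions; the source only requires that the queue match the computational processing units.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def plan_execution(ratios: Dict[str, float],
                   deps: Dict[str, Set[str]],
                   num_units: int,
                   threshold: float = 0.3) -> Tuple[List[str], List[List[str]]]:
    """Split subgraphs into a SIMD list and per-unit subtask queues."""
    simd: List[str] = []
    queues: List[List[str]] = [[] for _ in range(num_units)]
    next_unit = 0
    for subgraph in TopologicalSorter(deps).static_order():
        if ratios[subgraph] < threshold:
            simd.append(subgraph)
        else:
            queues[next_unit].append(subgraph)
            next_unit = (next_unit + 1) % num_units
    return simd, queues


simd_part, unit_queues = plan_execution(
    ratios={"G1": 0.2, "G2": 1.5, "G3": 1.1},
    deps={"G1": set(), "G2": {"G1"}, "G3": {"G2"}},
    num_units=2,
)
```

The queues only fix the placement of subtasks; at run time each subtask still has to wait for the predecessors recorded in the dependency graph.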

[0028] This embodiment calculates the edge-to-node ratio in a directed acyclic graph (DAG) of a circuit verification task. The ratio is compared to a ratio threshold. If the ratio is less than the threshold, data-level parallel processing is performed using a SIMD instruction set. If the ratio is greater than the threshold, the DAG is decomposed into subgraphs until a subgraph constraint is met. The edge-to-node ratio of each subgraph is calculated, and a task-level data dependency graph is generated based on the dependencies of the subgraphs. The subgraph ratio is then compared to a ratio threshold. If the ratio is less than the threshold, data-level parallel processing is performed using a SIMD instruction set; otherwise, a subtask queue matching the computational processing unit is generated based on the subgraphs and the task-level data dependency graph. By converting the circuit verification task into a directed acyclic graph and calculating the edge-to-node ratio, the sparsity and dependency complexity of the task are characterized. This ratio is then compared with a preset threshold to determine suitability for parallel processing. If the dependency complexity is too high for parallel processing, the graph is decomposed level by level until the subgraph constraint is met. This approach balances data-level parallel efficiency with task-level load balancing, maximizing throughput under limited hardware resources. Simultaneously, it ensures that computing resources are not idle while also avoiding excessive consumption on managing complex dependencies.

Embodiment 2

[0029] Figure 3 is a flowchart of the EPS calculation acceleration method provided in Embodiment 2 of the present invention. This embodiment is an optimization based on the above embodiment. The method may further include the following steps: determining parallel tasks according to the directed acyclic graph; assigning each parallel task to a different processing unit of the acceleration array; using the sensor of each processing unit to obtain its current real-time load, the load including computing utilization and bandwidth usage; when there is a computing unit for which the absolute value of the difference between its computing utilization and the average computing utilization is greater than a preset difference threshold, obtaining the task currently running on that computing unit; traversing all potential target computing units and calculating a migration gain ratio for each; setting an initial migration target computing unit according to the migration gain ratio, and verifying whether migrating to the initial migration target computing unit satisfies the global bandwidth constraint; and when the global bandwidth constraint is satisfied, performing the task migration.

[0030] Referring to Figure 3, the method for accelerating EPS calculation includes: Step 210: Calculate the ratio of edges to nodes in the directed acyclic graph of the circuit verification task; compare the ratio with a ratio threshold; if the ratio is less than the ratio threshold, perform data-level parallel processing using the SIMD instruction set; if the ratio is greater than the ratio threshold, decompose the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until the subgraph constraint condition is met.

[0031] Step 220: Calculate the ratio of edges to nodes in each decomposed subgraph, and generate a task-level data dependency graph for the decomposed subgraph based on the dependency relationship.

[0032] Step 230: Compare the subgraph ratio with a ratio threshold. If the ratio is less than the ratio threshold, perform data-level parallel processing using the SIMD instruction set; otherwise, generate a subtask queue that matches the computational processing unit based on the decomposed subgraph and the task-level data dependency graph.

[0033] Step 240: Determine parallel tasks based on the directed acyclic graph; for each parallel task, assign the parallel task to a different processing unit of the acceleration array.

[0034] In a task-level data dependency graph, each node represents a subtask, and each directed edge represents a dependency relationship. Identifying tasks that can run in parallel and assigning them to idle units as soon as they become ready keeps processing units from sitting idle.
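
One way to identify tasks that can run in parallel is to group the subtasks into waves whose dependencies are already satisfied. The sketch below does this with `graphlib.TopologicalSorter` (Python 3.9 or later); the wave-based grouping and the function name are assumptions, since the source does not prescribe how parallel tasks are found.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into waves; tasks in the same wave have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave finished to release successors
    return waves


waves = parallel_waves({"G1": set(), "G2": {"G1"}, "G3": {"G1"}, "G4": {"G2", "G3"}})
```

In the example, G2 and G3 depend only on G1, so they form one wave and can be assigned to two different processing units at the same time.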

[0035] Step 250: Use the sensor of each processing unit to obtain the current real-time load status, the load including: computing utilization and bandwidth usage; when there is a computing unit where the absolute value of the difference between computing utilization and the average computing utilization level is greater than a preset difference threshold, obtain the task currently running in the computing unit.

[0036] By continuously monitoring the real-time status of each processing unit using the above method, load imbalances or performance bottlenecks can be detected in a timely manner, and a decision-making basis can be provided for subsequent scheduling interventions (such as task migration and resource reallocation).

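[0037] The current average utilization is calculated over all processing units of the same type, $\bar{U} = \frac{1}{N}\sum_{i=1}^{N} U_i$, where $U_i$ is the computing utilization of processing unit $i$ and $N$ is the number of units of that type. For each processing unit $i$, its deviation $\Delta_i = U_i - \bar{U}$ is calculated. It is then checked whether there exists a unit $k$ whose deviation satisfies $|\Delta_k| > \theta$, where $\theta$ is the preset difference threshold.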
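
The deviation check can be sketched as follows. The unit names and the function name `find_abnormal_units` are hypothetical; the rule itself (absolute deviation from the average utilization greater than a preset threshold) is the one stated in Step 250.

```python
from typing import Dict


def find_abnormal_units(utilization: Dict[str, float], diff_threshold: float) -> Dict[str, float]:
    """Return the signed deviation of every unit whose utilization is too far from the mean."""
    mean = sum(utilization.values()) / len(utilization)
    return {
        unit: value - mean
        for unit, value in utilization.items()
        if abs(value - mean) > diff_threshold
    }


abnormal = find_abnormal_units({"pe0": 0.9, "pe1": 0.5, "pe2": 0.5, "pe3": 0.1}, diff_threshold=0.3)
```

A positive deviation marks an overloaded unit and a negative deviation an underused one; in the example `pe0` and `pe3` are flagged while `pe1` and `pe2` are not.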

[0038] When a unit's utilization exceeds the average by more than the preset difference threshold, the computing unit is overloaded and may become a system bottleneck. The tasks running on it may be computationally intensive or may be suffering from problems such as a low cache hit rate, keeping the unit under sustained heavy load. This can lead to overheating and frequency throttling, and can delay the completion of the entire job. When a unit's utilization falls below the average by more than the preset difference threshold, the computing unit is idle or inefficient, which wastes hardware resources and lowers overall system throughput. Both types of computing units are defined as abnormal units. Detailed information about the tasks currently running on these units can be obtained, including: the task ID and descriptor (used to uniquely identify the task); task attributes such as estimated computational load, memory access pattern, and priority; task status such as running time, memory usage, and intermediate data generated; and the corresponding dependencies.

[0039] Step 260: Traverse all potential target computing units and calculate a migration gain ratio for each; set an initial migration target computing unit according to the migration gain ratio, and verify whether migrating to the initial migration target computing unit satisfies the global bandwidth constraint. If the global bandwidth constraint is satisfied, perform the task migration.

[0040] For each potential target computing unit j, the scheduler needs to estimate the gain ratio of migrating the task from the source unit src to j. If the task is stalled on src waiting for data, it is assumed that migrating it to a less loaded target j with more available memory bandwidth will raise its computing utilization to near the average level of j. In addition, the scheduler needs to consider the size of the task state data to be migrated, D_migrate, the effective available transmission bandwidth between the source and target units, the cost of stopping the task on the source unit and starting it on the target unit, and any synchronization overhead. From these, the scheduler calculates the benefit per unit of migration cost. All potential target units can then be sorted from highest to lowest benefit per unit of migration cost to determine the best single migration target from a pure benefit-cost perspective. The bandwidth required by the planned migration is added to the current system bandwidth usage, and the scheduler checks whether the total exceeds the theoretical peak bandwidth of the interconnect network or memory controller, or a safety threshold set to prevent congestion. If the global bandwidth constraint is satisfied, the task migration is performed.
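
The target selection can be sketched as below. Everything about the data model is an assumption made for illustration: the `Candidate` fields, the cost model (state transfer time plus a fixed stop/start/synchronization overhead), and a single scalar bandwidth limit. The source describes these quantities but gives no formula for the gain ratio.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    unit: str
    expected_gain: float    # assumed: estimated time saved by running on this unit (s)
    state_bytes: float      # D_migrate, size of the task state to move (bytes)
    link_bandwidth: float   # effective bandwidth from the source to this unit (bytes/s)
    fixed_overhead: float   # assumed: stop + start + synchronization cost (s)


def gain_ratio(c: Candidate) -> float:
    """Assumed model: benefit per unit of migration cost."""
    cost = c.state_bytes / c.link_bandwidth + c.fixed_overhead
    return c.expected_gain / cost


def pick_target(candidates: List[Candidate], current_bw_usage: float,
                migration_bw: float, bw_limit: float) -> Optional[str]:
    """Choose the candidate with the highest gain ratio, then apply the global bandwidth check."""
    if not candidates:
        return None
    best = max(candidates, key=gain_ratio)
    if current_bw_usage + migration_bw > bw_limit:
        return None  # migrating now would violate the global bandwidth constraint
    return best.unit


cands = [
    Candidate("A", expected_gain=2.0, state_bytes=1e6, link_bandwidth=1e6, fixed_overhead=1.0),
    Candidate("B", expected_gain=3.0, state_bytes=1e6, link_bandwidth=2e6, fixed_overhead=0.5),
]
```

With these numbers, candidate B has a gain ratio of 3.0 against 1.0 for A, so B is chosen as long as the added migration traffic keeps total bandwidth usage within the limit.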

[0041] This embodiment adds the following steps: determining parallel tasks based on the directed acyclic graph; assigning each parallel task to a different processing unit of the acceleration array; obtaining the current real-time load using the sensor of each processing unit, the load including computing utilization and bandwidth usage; when there is a computing unit for which the absolute value of the difference between its computing utilization and the average computing utilization is greater than a preset difference threshold, obtaining the task currently running on that computing unit; traversing all potential target computing units and calculating a migration gain ratio for each; setting an initial migration target computing unit based on the migration gain ratio, and verifying whether migrating to the initial migration target computing unit satisfies the global bandwidth constraint; and when the global bandwidth constraint is satisfied, performing the task migration. The gain ratio characterizes the benefit gained from migrating a computing task from its current unit to another unit relative to the cost of doing so. By choosing the migration with the highest gain ratio, the acceleration array achieves fine-grained and flexible scheduling at runtime.

[0042] In a preferred embodiment of this example, the method may further include the following step: after all subtasks have been computed, the probabilities calculated by all computing units are aggregated, and the output probabilities of the circuit under verification and the reference circuit under the same input distribution are compared to obtain an equivalence judgment result. In this way each subtask is computed by the computing units, and all computation results are aggregated and compared to obtain the final result.
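A minimal sketch of the aggregation and comparison step follows. The source does not specify how partial results are combined or what tolerance is used, so everything here is an assumption: each computing unit is assumed to return partial output probabilities as a mapping from output label to probability, partial results are assumed to be combined by summation, and equivalence is assumed to mean that the two aggregated distributions agree within a tolerance `tol`. The names `aggregate` and `equivalent` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def aggregate(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Sum the partial output probabilities reported by all computing units
    (summation is an assumed aggregation rule)."""
    total: Dict[str, float] = defaultdict(float)
    for part in partials:
        for output, probability in part.items():
            total[output] += probability
    return dict(total)


def equivalent(verified: Dict[str, float], reference: Dict[str, float],
               tol: float = 1e-9) -> bool:
    """Assumed criterion: every output probability matches within tol."""
    outputs = set(verified) | set(reference)
    return all(abs(verified.get(o, 0.0) - reference.get(o, 0.0)) <= tol
               for o in outputs)
```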

[0043] Embodiment 3: Figure 4 is a schematic diagram of the EPS calculation acceleration device provided in Embodiment 3 of the present invention. Referring to Figure 4, the EPS calculation acceleration device includes: the calculation module 310, used to calculate the ratio of edges to nodes in the directed acyclic graph of the circuit verification task; the decomposition module 320, used to compare the ratio with a ratio threshold, to perform data-level parallel processing using the SIMD instruction set when the ratio is less than the ratio threshold, and, when the ratio is greater than the ratio threshold, to decompose the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until the subgraph constraint condition is met; the generation module 330, used to calculate the subgraph ratio of edges to nodes in each decomposed subgraph and to generate a task-level data dependency graph for the decomposed subgraphs based on the dependency relationships; the comparison module 340, used to compare the subgraph ratio with the ratio threshold and to perform data-level parallel processing using the SIMD instruction set when the subgraph ratio is less than the ratio threshold; and the queue generation module 350, used to generate a subtask queue that matches the computational processing unit based on the decomposed subgraphs and the task-level data dependency graph when the subgraph ratio is not less than the ratio threshold.

[0044] The EPS calculation acceleration device provided in this embodiment calculates the ratio of edges to nodes in the directed acyclic graph of a circuit verification task; compares the ratio with a ratio threshold; if the ratio is less than the ratio threshold, performs data-level parallel processing using a SIMD instruction set; if the ratio is greater than the ratio threshold, decomposes the directed acyclic graph of the task into subgraphs until a subgraph constraint condition is met; calculates the subgraph ratio of edges to nodes for each subgraph, and generates a task-level data dependency graph for the subgraphs based on dependencies; compares the subgraph ratio with the ratio threshold; if the subgraph ratio is less than the ratio threshold, performs data-level parallel processing using a SIMD instruction set; otherwise, generates a subtask queue matching the computing processing unit based on the subgraphs and the task-level data dependency graph. By converting the circuit verification task into a directed acyclic graph and calculating the edge-to-node ratio, the sparsity and dependency complexity of the task are characterized. This ratio is then compared with a preset threshold to determine suitability for data-level parallel processing. If the dependency complexity is too high for data-level parallel processing, the graph is decomposed level by level until the subgraph constraint is met. This approach balances data-level parallel efficiency with task-level load balancing, maximizing throughput under limited hardware resources. It keeps computing resources from sitting idle while also avoiding excessive consumption on managing complex dependencies.
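The decision flow summarized above can be sketched as follows. This is an illustrative outline only: the graph representation, the function names `edge_node_ratio` and `plan`, and the `decompose` callback are assumptions, and the actual decomposition procedure is not reproduced here. The source only specifies the cases "less than" and "greater than" the threshold; treating a ratio equal to the threshold like the "greater than" case is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumed representation: a graph is a pair (nodes, directed edges).
Graph = Tuple[Set[str], Set[Tuple[str, str]]]


def edge_node_ratio(graph: Graph) -> float:
    """Number of edges divided by number of nodes."""
    nodes, edges = graph
    return len(edges) / len(nodes)


def plan(graph: Graph, threshold: float,
         decompose: Callable[[Graph], List[Graph]]) -> Dict[str, List[Graph]]:
    """Route a task graph to SIMD processing or to the subtask queue."""
    if edge_node_ratio(graph) < threshold:
        # Sparse dependencies: process the whole graph with data-level parallelism.
        return {"simd": [graph], "queued": []}
    simd: List[Graph] = []
    queued: List[Graph] = []
    # decompose() is assumed to return subgraphs that already satisfy
    # the subgraph constraint condition.
    for sub in decompose(graph):
        if edge_node_ratio(sub) < threshold:
            simd.append(sub)
        else:
            queued.append(sub)  # scheduled via the task-level dependency graph
    return {"simd": simd, "queued": queued}
```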

[0045] Based on the above embodiments, the calculation module includes a calculation unit that performs the calculation in the following manner: $\text{ratio} = |E| / |V|$, where E represents the set of edges in the directed acyclic graph of the task, V represents the set of nodes in the graph, and ratio is the resulting edge-to-node ratio.

[0046] Based on the above embodiments, the subgraph constraints include: The number of nodes in the subgraph is less than a preset node count threshold; The number of edges in the subgraph is less than a preset edge count threshold.

[0047] Based on the above embodiments, the decomposition module includes: The reading unit is used to read the information of nodes and edges of a directed acyclic graph. The information of nodes and edges includes the computational cost of nodes, fan-in and fan-out characteristics, and the amount of data and dependencies on edges. The coupling degree calculation unit is used to calculate the coupling degree between units based on the dependency strength and data volume, and to use nodes with coupling degree exceeding a preset coupling degree threshold as seed nodes of the initial subgraph. The merging unit is used to expand outward from the seed node and merge adjacent nodes that have strong dependencies or high data interaction into the current subgraph. The evaluation unit is used to evaluate the cutting computation cost of the edge that needs to be cut. If the cost is greater than the preset cutting computation cost threshold, the connected node will be included in the current subgraph. The assignment unit is used to determine whether there is a corresponding loop structure, grouped operator or continuous branch chain based on the information of the edge. If there is, the connection node corresponding to the edge is assigned to the current subgraph.
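The seed selection and outward merging performed by the coupling degree calculation unit and the merging unit can be sketched as below. Several simplifications are assumed and are not taken from the source: each edge carries a single precomputed coupling score (standing in for the combination of dependency strength and data volume), a node's coupling degree is the sum of the scores of its incident edges, a neighbor is merged when the connecting edge's score reaches `merge_threshold`, and growth stops at `max_nodes` (standing in for the node count constraint). The cut-cost evaluation and the loop, grouped-operator, and branch-chain rules are omitted. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def grow_subgraphs(edge_coupling: Dict[Edge, float], seed_threshold: float,
                   merge_threshold: float, max_nodes: int) -> List[Set[str]]:
    """Grow subgraphs outward from strongly coupled seed nodes."""
    neighbors: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (u, v), score in edge_coupling.items():
        neighbors[u][v] = score
        neighbors[v][u] = score
    # Assumed node coupling degree: sum of incident edge scores.
    node_coupling = {n: sum(adj.values()) for n, adj in neighbors.items()}
    seeds = sorted((n for n, c in node_coupling.items() if c > seed_threshold),
                   key=lambda n: (-node_coupling[n], n))
    assigned: Set[str] = set()
    subgraphs: List[Set[str]] = []
    for seed in seeds:
        if seed in assigned:
            continue  # already absorbed by an earlier subgraph
        current = {seed}
        assigned.add(seed)
        frontier = [seed]
        while frontier and len(current) < max_nodes:
            node = frontier.pop()
            for nb, score in sorted(neighbors[node].items()):
                if len(current) >= max_nodes:
                    break
                if nb in assigned or score < merge_threshold:
                    continue
                current.add(nb)
                assigned.add(nb)
                frontier.append(nb)
        subgraphs.append(current)
    return subgraphs
```

Nodes that are neither seeds nor merged into a subgraph are left unassigned by this sketch; a complete decomposition would need a rule for them.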

[0048] Based on the above embodiments, the evaluation unit further includes: The first acquisition subunit is used to acquire the amount of data transmitted for each edge that needs to be cut. The second acquisition subunit is used to acquire the frequency of the edges to be cut during the overall calculation process; An evaluation subunit is used to evaluate the computational cost of cutting the edge that needs to be cut based on the amount and frequency of data.
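As a small illustration of the evaluation subunit, the sketch below assumes that the cutting computation cost is the product of the data volume transmitted over the edge and the number of times the edge is triggered during the overall calculation. The source only states that the cost is based on these two quantities, so the product form and the names `cut_cost` and `should_absorb` are assumptions.

```python
def cut_cost(data_volume: float, trigger_count: int) -> float:
    """Assumed cost model: data volume per transfer times trigger frequency."""
    return data_volume * trigger_count


def should_absorb(data_volume: float, trigger_count: int,
                  cost_threshold: float) -> bool:
    """True when cutting the edge is too expensive, so the connected node
    is included in the current subgraph instead."""
    return cut_cost(data_volume, trigger_count) > cost_threshold
```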

[0049] Based on the above embodiments, the generation module includes: The traversal unit is used to traverse all cross-subgraph signals. If the producer node of a signal is located in the first subgraph and the consumer node is located in the second subgraph, then a directed edge from the first subgraph to the second subgraph is added to the task-level data dependency graph. The forming unit is used to remove duplicate edges and organize the subgraphs in topological order to form a directed acyclic graph representing the dependencies between subtasks.
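The construction of the task-level data dependency graph can be sketched with the Python standard library. The function name `build_task_graph`, the representation of a signal as a (producer node, consumer node) pair, and the `subgraph_of` mapping are assumptions for illustration; `graphlib.TopologicalSorter` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple


def build_task_graph(signals: Iterable[Tuple[str, str]],
                     subgraph_of: Dict[str, int]
                     ) -> Tuple[Set[Tuple[int, int]], List[int]]:
    """Return the deduplicated subgraph-level edges and a topological order."""
    edges: Set[Tuple[int, int]] = set()  # a set removes duplicate edges
    for producer, consumer in signals:
        src, dst = subgraph_of[producer], subgraph_of[consumer]
        if src != dst:  # only cross-subgraph signals create dependencies
            edges.add((src, dst))
    # TopologicalSorter expects a mapping from each node to its predecessors.
    predecessors: Dict[int, Set[int]] = {sg: set() for sg in subgraph_of.values()}
    for src, dst in edges:
        predecessors[dst].add(src)
    # static_order() raises graphlib.CycleError if the subtasks are not acyclic.
    order = list(TopologicalSorter(predecessors).static_order())
    return edges, order
```

The returned order is one valid execution order of the subtasks; subtasks with no path between them in the dependency graph can run in parallel.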

[0050] Based on the above embodiments, the device further includes: A parallel task determination module is used to determine parallel tasks based on the directed acyclic graph; The allocation module is used to assign each parallel task to a different processing unit in the acceleration array. A sensing module is used to obtain the current real-time load status using the sensors of each processing unit, the load including: computing utilization and bandwidth usage; The task acquisition module is used to acquire the task currently running in a computing unit when there is a computing unit whose absolute value of the difference between its computing utilization rate and the average computing utilization rate is greater than a preset difference threshold. The preliminary calculation module is used to perform a preliminary calculation of the computational migration gain ratio for each potential target computing unit when traversing all potential target computing units. The setting module is used to set the initial migration destination calculation unit according to the migration gain ratio, and to verify whether the initial migration destination calculation unit meets the global bandwidth constraint during migration. The migration module is used to migrate tasks when global bandwidth constraints are met.

[0051] Based on the above embodiments, the device further includes: The aggregation module is used to collect the probabilities calculated by all computing units after all subtasks have been completed, compare the output probabilities of the verification circuit and the reference circuit under the same input distribution, and obtain the equivalence judgment result. The EPS calculation acceleration device provided in the embodiments of the present invention can execute the EPS calculation acceleration method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method.

[0052] Embodiment 4: Figure 5 is a schematic diagram of the structure of a server provided in Embodiment 4 of the present invention. Figure 5 shows a block diagram of an exemplary server 12 suitable for implementing embodiments of the present invention. The server 12 shown in Figure 5 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of the present invention.

[0053] As shown in Figure 5, server 12 is presented in the form of a general-purpose computing server. The components of server 12 may include, but are not limited to: one or more processors or processing units 16, memory 28, and bus 18 connecting different system components (including memory 28 and processing unit 16).

[0054] Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

[0055] Server 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by server 12, including volatile and non-volatile media, removable and non-removable media.

[0056] Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache 32. Server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in Figure 5; usually referred to as a "hard drive"). Although not shown in Figure 5, a disk drive for reading and writing to a removable non-volatile disk (e.g., a "floppy disk") and an optical disk drive for reading and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 via one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present invention.

[0057] A program / utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. Program modules 42 typically perform the functions and / or methods described in the embodiments of the present invention.

[0058] Server 12 can also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, display 24, etc.), with one or more devices that enable users to interact with server 12, and/or with any device (e.g., network card, modem, etc.) that enables server 12 to communicate with one or more other computing devices. This communication can be performed via input/output (I/O) interface 22. Furthermore, server 12 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or public networks such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with other modules of server 12 via bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules can be used in conjunction with server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0059] Processing unit 16 executes various functional applications and data processing by running programs stored in memory 28, such as implementing the EPS calculation acceleration method provided in the embodiments of the present invention.

Embodiment 5: This embodiment of the present invention also provides a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform an EPS calculation acceleration method as described in any of the above embodiments.

[0060] The computer storage medium of this invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0061] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, capable of sending, propagating, or transmitting programs for use by or in connection with an instruction execution system, apparatus, or device.

[0062] Program code contained on a computer-readable medium may be transmitted using any suitable medium, including, but not limited to, wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.

[0063] Computer program code for performing the operations of this invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0064] Note that the above description is merely a preferred embodiment of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the appended claims.

Claims

1. An acceleration method for EPS calculation, characterized by comprising: calculating the ratio of edges to nodes in a directed acyclic graph of a circuit verification task; comparing the ratio with a ratio threshold, and when the ratio is less than the ratio threshold, performing data-level parallel processing using the SIMD instruction set; when the ratio is greater than the ratio threshold, decomposing the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until the subgraph constraint condition is met; calculating the subgraph ratio of edges to nodes in each decomposed subgraph, and generating a task-level data dependency graph for the decomposed subgraphs based on the dependency relationships; and comparing the subgraph ratio with the ratio threshold, and when the subgraph ratio is less than the ratio threshold, performing data-level parallel processing using the SIMD instruction set; otherwise, generating a subtask queue matching the computational processing unit based on the decomposed subgraphs and the task-level data dependency graph.

2. The method of claim 1, wherein calculating the ratio of edges to nodes in the directed acyclic graph of the circuit verification task includes: performing the calculation in the following manner: $\text{ratio} = |E| / |V|$, where E represents the set of edges in the directed acyclic graph of the task, V represents the set of nodes in the graph, and ratio is the resulting edge-to-node ratio.

3. The method of claim 1, wherein, The subgraph constraints include: The number of nodes in the subgraph is less than a preset node count threshold; The number of edges in the subgraph is less than a preset edge count threshold.

4. The method of claim 1, wherein decomposing the directed acyclic graph of the circuit verification task includes: reading the information of nodes and edges in the directed acyclic graph, including the computational cost and fan-in/fan-out characteristics of the nodes, and the amount of data and dependencies on the edges; calculating the coupling degree between units based on the dependency strength and the amount of data, and using nodes whose coupling degree exceeds a preset coupling degree threshold as seed nodes of the initial subgraph; expanding outward from the seed node, and merging adjacent nodes that have strong dependencies or high data interaction with it into the current subgraph; evaluating the cutting computation cost of the edges that need to be cut, and if the cost exceeds a preset cutting computation cost threshold, including the connected nodes in the current subgraph; and determining, based on the information of the edge, whether there is a corresponding loop structure, grouped operator, or continuous branch chain, and if so, including the connection node corresponding to the edge in the current subgraph; wherein evaluating the cutting computation cost of the edges that need to be cut includes: obtaining the amount of data transmitted for each edge that needs to be cut; obtaining the frequency at which the edge to be cut is triggered during the overall calculation process; and evaluating the cutting computation cost of the edge based on the amount of data and the frequency.

5. The method of claim 4, wherein, The process of generating a task-level data dependency graph based on the dependencies of the decomposed subgraph includes: Traverse all cross-subgraph signals. If the producer node of a signal is located in the first subgraph and the consumer node is located in the second subgraph, then add a directed edge from the first subgraph to the second subgraph in the task-level data dependency graph. Remove duplicate edges and organize them in topological order to form a directed acyclic graph representing the dependencies between subtasks.

6. The method of claim 5, wherein, The method further includes: Parallel tasks are determined based on the directed acyclic graph. For each parallel task, the parallel task is assigned to a different processing unit in the acceleration array; The current real-time load status is obtained using the sensor of each processing unit, and the load includes: computing utilization and bandwidth usage; When there is a computing unit whose absolute value of the difference between its computing utilization rate and the average computing utilization rate is greater than a preset difference threshold, the task currently running in that computing unit is obtained. When traversing all potential target computing units, the gain ratio of computational migration for each potential target computing unit is initially calculated; The initial migration destination calculation unit is set according to the migration gain ratio, and it is verified whether the initial migration destination calculation unit meets the global bandwidth constraint during migration. Task migration is performed when global bandwidth constraints are met.

7. The method of claim 6, wherein the method further includes: after all subtasks are computed, aggregating the probabilities calculated by all computing units, and comparing the output probabilities of the circuit under verification and the reference circuit under the same input distribution to obtain the equivalence judgment result.

8. An apparatus for accelerating EPS calculation, characterized by comprising: a calculation module, used to calculate the ratio of edges to nodes in the directed acyclic graph of a circuit verification task; a decomposition module, used to compare the ratio with a ratio threshold, to perform data-level parallel processing using the SIMD instruction set when the ratio is less than the ratio threshold, and, when the ratio is greater than the ratio threshold, to decompose the directed acyclic graph of the circuit verification task to obtain decomposed subgraphs until the subgraph constraint condition is met; a generation module, used to calculate the subgraph ratio of edges to nodes in each decomposed subgraph and to generate a task-level data dependency graph for the decomposed subgraphs based on the dependency relationships; a comparison module, used to compare the subgraph ratio with the ratio threshold and to perform data-level parallel processing using the SIMD instruction set when the subgraph ratio is less than the ratio threshold; and a queue generation module, used to generate a subtask queue that matches the computational processing unit based on the decomposed subgraphs and the task-level data dependency graph when the subgraph ratio is not less than the ratio threshold.

9. A server, characterized by comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for accelerating EPS calculation as described in any one of claims 1-7.

10. A storage medium containing computer-executable instructions, wherein: The computer-executable instructions, when executed by a computer processor, are used to perform the accelerated EPS calculation method as described in any one of claims 1-7.