A dynamic voltage drop solving method and system based on variable dependency-driven parallelization

By rearranging and decomposing the sparse matrix, and parallelizing the solution process of the lower and upper triangular matrices, the problem of low parallel computing efficiency in chip dynamic voltage drop analysis is solved, achieving efficient, balanced parallel solution and accurate results.

CN122240983A · Pending · Publication Date: 2026-06-19 · SHANGHAI LIXIN SOFTWARE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI LIXIN SOFTWARE TECH CO LTD
Filing Date
2026-04-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing techniques for chip dynamic voltage drop analysis suffer from strong serial dependencies in the triangular solve stage, shallow dependency analysis in parallel methods, unbalanced task partitioning, and inefficient repeated solves, all of which limit parallel computing efficiency.

Method used

A dynamic voltage drop solution method driven by variable dependencies is adopted. The sparse matrix is rearranged and LU-decomposed; dependency resolution, block partitioning, granularity optimization, and parallel solution are then performed on the lower and upper triangular matrices respectively, parallelizing the forward and backward substitution processes.

Benefits of technology

It significantly improves the execution efficiency of the triangular solve stage, optimizes task partitioning and parallel load balancing, suits repeated-solve application scenarios, reduces the overhead of repeated computation, and preserves the accuracy and numerical stability of the solution results.



Abstract

This invention proposes a dynamic voltage drop solution method and system based on variable dependency-driven parallelization, comprising: in the process of dynamic voltage drop analysis of a chip, performing AMD rearrangement and LU decomposition on the sparse matrix, and then performing dependency resolution, block partitioning, granularity optimization, and parallel solution on the forward substitution process of the lower triangular matrix and the backward substitution process of the upper triangular matrix. The invention further includes the following steps: Step S1: Performing AMD rearrangement and LU decomposition on the sparse matrix, including AMD rearrangement preprocessing and LU decomposition of the sparse matrix; Step S2: Performing dependency resolution, block partitioning, granularity optimization, and parallel solution on the forward substitution of the lower triangular matrix; Step S3: Performing dependency resolution, block partitioning, granularity optimization, and parallel solution on the backward substitution process of the upper triangular matrix.

Description

Technical Field

[0001] This invention proposes a dynamic voltage drop solution method and system based on variable dependency-driven parallelization, which relates to the interdisciplinary fields of electronic design automation and high-performance numerical computing.

Background Technology

[0002] In chip dynamic voltage drop analysis, power integrity analysis, and circuit simulation, the core computational step involves solving a large-scale sparse linear system of equations. This type of equation system is characterized by high matrix dimension, irregular distribution of non-zero elements, sparse overall structure, and dynamic updates of the excitation vector on the right-hand side with each simulation iteration. Its solution efficiency directly determines the overall simulation process's execution time.

[0003] While existing technologies can reduce the cost of repeated decomposition by reusing matrix rearrangement and LU decomposition results, the following technical shortcomings still exist in typical application scenarios such as chip dynamic voltage drop analysis where the matrix structure is fixed and the right-hand side terms are updated frequently.

[0004] First, the traditional triangular system solution process has strong serial dependencies. In both forward substitution and backward substitution, each variable depends on the results of the variables solved before it. Conventional implementations therefore cannot fully exploit the parallel computing power of multi-core processors, making the triangular solve stage the performance bottleneck of the overall direct method.

[0005] Secondly, existing parallel methods mostly employ coarse-grained task partitioning, resulting in insufficient depth of analysis of variable dependencies. Overly coarse task partitioning leads to long sequential dependency chains within a single task, limiting parallel efficiency; overly fine partitioning causes a surge in the number of tasks, incurring additional overhead in scheduling, memory access, data extraction, and result reconstruction, ultimately reducing the overall acceleration effect.

[0006] Third, existing methods lack a unified optimization mechanism for balancing task size. The non-zero structure of sparse matrices is highly irregular, and the computational cost of different subproblems varies significantly. If the task block size is not finely controlled, processor core loads become imbalanced and some cores sit idle, making stable parallel load balancing impossible.

[0007] Fourth, in dynamic voltage drop simulation, the same coefficient matrix must serve repeated solves across multiple time points, multiple excitations, and multiple iteration steps. Existing technologies reuse only the decomposition results, while the forward and backward substitution processes after decomposition remain inefficiently organized. No integrated technical solution has been formed that uses variable dependency relationships for task partitioning, block optimization, subproblem extraction, parallel solution, and global result reconstruction.

Summary of the Invention

[0008] In view of this, in order to fill the gaps and deficiencies in the existing technology, this invention proposes a dynamic voltage drop solution method and system based on variable dependency-driven parallelization to solve the problems in the background technology.

[0009] This invention proposes a dynamic voltage drop solution method and system based on variable dependency-driven parallelization, including the following:

[0010] According to a first aspect of the present invention, the present invention proposes a dynamic voltage drop solution method based on variable dependency-driven parallelization, characterized in that, in the process of dynamic voltage drop analysis of the chip, the sparse matrix is rearranged and LU-decomposed, and then the forward substitution process of the lower triangular matrix and the backward substitution process of the upper triangular matrix are respectively subjected to dependency resolution, block partitioning, granularity optimization and parallel solution.

[0011] Furthermore, the dynamic voltage drop solution method further includes the following steps:

[0012] Step S1: Perform rearrangement and LU decomposition on the sparse matrix, including preprocessing the sparse matrix rearrangement and performing LU decomposition on the sparse matrix.

[0013] Step S2: Perform dependency resolution, block partitioning, granularity optimization, and parallel solution on the forward substitution of the lower triangular matrix, including initial partitioning of the elimination tree of the lower triangular matrix; double partitioning and greedy merging of the lower triangular matrix blocks; extraction of the lower triangular submatrices and excitation vectors; multi-core parallel solution of the lower triangular sub-equations; and reconstruction of the global intermediate solution vector.

[0014] Step S3: Perform dependency resolution, block partitioning, granularity optimization, and parallel solution on the backward substitution process of the upper triangular matrix, including initial partitioning of the elimination tree of the upper triangular matrix; double partitioning and greedy merging of the upper triangular matrix blocks; extraction of the upper triangular submatrices and intermediate subvectors; multi-core parallel solution of the upper triangular sub-equations; reconstruction of the global solution vector; and iterative excitation vector update and loop solution.

[0015] Further, step S1 includes the following:

[0016] Step S11: Sparse matrix rearrangement preprocessing includes:

[0017] Obtain the symmetric positive definite large-scale sparse coefficient matrix A of the linear system Ax=b to be solved. To suppress fill-in during the subsequent LU decomposition and improve numerical stability, the rows and columns of A are reordered synchronously using the approximate minimum degree (AMD) ordering algorithm, which generates the rearrangement index p; the reordered matrix is A1=A(p, p).

[0018] After rearrangement, the non-zero elements of the matrix cluster towards the main diagonal, and its sparse topology remains constant in subsequent iterations. Only the right-hand excitation vector b is dynamically updated with each iteration step.

[0019] Step S12: Performing LU decomposition on the sparse matrix includes:

[0020] Perform LU decomposition on the rearranged and optimized sparse matrix A1, decomposing it into the product of a lower triangular matrix L and an upper triangular matrix U. The decomposition formula is as follows:

[0021] A1=LU;

[0022] The decomposition result, where A1=LU, can be reused in all iteration steps.
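As an illustration of the reorder-factor-reuse flow in step S1, the following minimal Python sketch applies a given rearrangement index p as A1=A(p, p), factors A1 once, and reuses L and U for several right-hand sides. It is a sketch under stated assumptions, not the patented implementation: the matrix is small and dense, the LU routine omits pivoting (acceptable for a symmetric positive definite input), the index p is supplied by hand instead of being computed by AMD, and the names lu_decompose, factor, and solve are hypothetical.

```python
def lu_decompose(a):
    """Doolittle LU without pivoting; returns dense (L, U) with unit-diagonal L."""
    n = len(a)
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    u = [[float(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            f = u[i][k] / u[k][k]
            l[i][k] = f
            for j in range(k, n):
                u[i][j] -= f * u[k][j]
    return l, u


def forward_sub(l, b):
    """Solve L r = b from the first row down."""
    r = []
    for i in range(len(b)):
        r.append((b[i] - sum(l[i][j] * r[j] for j in range(i))) / l[i][i])
    return r


def backward_sub(u, r):
    """Solve U x = r from the last row up."""
    n = len(r)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (r[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x


def factor(a, p):
    """Build A1 = A(p, p) and factor it once."""
    a1 = [[a[pi][pj] for pj in p] for pi in p]
    return lu_decompose(a1)


def solve(l, u, p, b):
    """Solve A x = b using the stored factors of A(p, p)."""
    y = backward_sub(u, forward_sub(l, [b[pi] for pi in p]))
    x = [0.0] * len(b)
    for k, pi in enumerate(p):
        x[pi] = y[k]  # undo the reordering
    return x


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
p = [2, 0, 1]          # hand-picked ordering, assumption for illustration
L, U = factor(A, p)    # done once
for b in ([1.0, 2.0, 3.0], [0.0, -1.0, 5.0]):
    x = solve(L, U, p, b)  # only the substitutions are repeated
```

Only solve is called inside the loop, which mirrors the statement that the decomposition result is reused while the right-hand side changes.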

[0023] Further, step S2 includes the following:

[0024] Step S21: Perform the initial partitioning of the elimination tree for the lower triangular matrix, including:

[0025] Construct an elimination tree dependency graph based on the non-zero sparsity structure of the lower triangular matrix L: use the matrix row and column indices as nodes and the off-diagonal non-zero elements as the dependency edges between nodes; traverse from the leaf nodes of the elimination tree up to the root node, extract the dependency set of each node, and divide the matrix L into several initial blocks CL, each corresponding to a connected subtree of the elimination tree.
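The tree construction in step S21 can be sketched as follows. The sketch assumes the standard elimination tree definition for a factor with symmetric structure, in which the parent of column j is the row index of the first off-diagonal non-zero in column j of L. Grouping nodes by tree root is one plausible reading of "initial blocks"; the source does not fix the exact rule. The function names and the 6x6 pattern are hypothetical.

```python
def elimination_tree(n, below):
    """Return parent[j] for each column j, or -1 if j is a root.

    below[j] lists the row indices i > j where L[i][j] is non-zero.
    """
    return [min(below.get(j, ()), default=-1) for j in range(n)]


def initial_blocks(parent):
    """Group nodes by the root of the tree they belong to."""
    n = len(parent)
    root_of = list(range(n))
    # A parent always has a larger index than its child, so walking from the
    # highest index down means the parent's root is known before the child's.
    for j in reversed(range(n)):
        if parent[j] != -1:
            root_of[j] = root_of[parent[j]]
    blocks = {}
    for j in range(n):
        blocks.setdefault(root_of[j], []).append(j)
    return list(blocks.values())


# Hypothetical 6x6 pattern: rows 2 and 5 each depend on two earlier rows.
below = {0: [2], 1: [2], 3: [5], 4: [5]}
parent = elimination_tree(6, below)
print(parent)                  # [2, 2, -1, 5, 5, -1]
print(initial_blocks(parent))  # [[0, 1, 2], [3, 4, 5]]
```

The two blocks share no dependency edge, so their forward substitutions can proceed independently.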

[0026] Step S22: Perform double partitioning and greedy merging of the lower triangular matrix, including:

[0027] For the initial blocks CL obtained in step S21, an upper threshold smax on the block size is set; a larger smax yields fewer subproblems, each of larger computational scale, and a smaller smax yields more, smaller subproblems.

[0028] Furthermore, the first partitioning is performed: subtrees with high dependency and poor parallelism at the root node are divided into serial computing modules, and the remaining subtrees are divided into parallel computing modules, achieving a precise division of serial and parallel computing workloads. Then, the second partitioning is performed on the parallel computing modules: all parallelizable subtrees are traversed, and the target subtree with the largest node size is selected. When its number of nodes exceeds the threshold smax, the root node of the subtree is marked as a scheduling node and added to the thread task array. At the same time, its child nodes are traversed and the total number of nodes in the subtree is counted. When the size of the subtree corresponding to the child node is greater than smax, the splitting operation is repeated. When it is less than smax, the child node is set as the new root node, until the size of all parallelizable subtrees does not exceed smax, completing the double partitioning.
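The splitting rule can be sketched in Python as below. This is a simplified interpretation under stated assumptions: the tree is given as a parent array in which every parent has a larger index than its children, subtree size is measured in node count, and the two partitioning passes are folded into one top-down walk. The source selects the largest subtree first; the visiting order used here differs but produces the same final split. The name split_by_smax is hypothetical.

```python
def subtree_sizes(parent):
    """Node count of the subtree rooted at each node."""
    size = [1] * len(parent)
    for j in range(len(parent)):  # children are visited before their parent
        if parent[j] != -1:
            size[parent[j]] += size[j]
    return size


def split_by_smax(parent, smax):
    """Return (scheduling_nodes, block_roots) for a size threshold smax."""
    n = len(parent)
    size = subtree_sizes(parent)
    children = [[] for _ in range(n)]
    roots = []
    for j, p in enumerate(parent):
        (children[p] if p != -1 else roots).append(j)
    scheduling, block_roots = [], []
    stack = list(roots)
    while stack:
        v = stack.pop()
        if size[v] > smax:
            scheduling.append(v)       # too large: keep the root, split below it
            stack.extend(children[v])
        else:
            block_roots.append(v)      # whole subtree becomes one parallel block
    return sorted(scheduling), sorted(block_roots)


# Hypothetical 7-node tree: root 6 with two 3-node subtrees rooted at 2 and 5.
parent = [2, 2, 6, 5, 5, 6, -1]
print(split_by_smax(parent, 3))  # ([6], [2, 5])
print(split_by_smax(parent, 2))  # ([2, 5, 6], [0, 1, 3, 4])
```

Lowering smax from 3 to 2 turns two blocks of three nodes into four single-node blocks, which illustrates the trade-off between block count and block size described above.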

[0029] Furthermore, a bottom-up greedy merging algorithm is performed based on the dual partitioning results: traversing all blocks, and ensuring no dependencies between blocks, gradually merging micro-blocks with fewer nodes than the lower threshold smin, until all block sizes are no less than smin, generating an optimized block set subL; where a smaller smin value results in more blocks retained after merging and higher parallelism; conversely, a larger smin value results in fewer blocks but a higher computational cost per block.

[0030] Furthermore, the parallel granularity requirements are adaptively set according to the system scheduling efficiency; dual partitioning achieves parallel computing load balancing, and greedy merging reduces parallel scheduling overhead. The combination of the two achieves the optimal configuration of the block granularity.
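A minimal sketch of the greedy merging step follows. It assumes that the input blocks are mutually independent (for example, disjoint subtrees), so any group of them can be solved as a single task. The handling of a leftover group smaller than smin is not specified in the source; attaching it to the smallest existing block is an assumption made here. The name greedy_merge is hypothetical.

```python
def greedy_merge(blocks, smin):
    """Merge blocks smaller than smin until every block has at least smin nodes."""
    merged = [b for b in blocks if len(b) >= smin]
    small = sorted((b for b in blocks if len(b) < smin), key=len)
    current = []
    for b in small:
        current = current + b
        if len(current) >= smin:
            merged.append(sorted(current))
            current = []
    if current:  # leftover group that is still below smin
        if merged:
            k = min(range(len(merged)), key=lambda i: len(merged[i]))
            merged[k] = sorted(merged[k] + current)
        else:
            merged.append(sorted(current))
    return merged


blocks = [[0], [1], [3], [4], [7, 8, 9, 10]]
print(greedy_merge(blocks, 2))  # [[7, 8, 9, 10], [0, 1], [3, 4]]
print(greedy_merge(blocks, 3))  # [[7, 8, 9, 10], [0, 1, 3, 4]]
```

Raising smin from 2 to 3 reduces the block count from three to two, matching the statement that a larger smin gives fewer, heavier blocks.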

[0031] Furthermore, step S2 also includes the following:

[0032] Step S23: Extracting the lower triangular submatrices and excitation vectors includes:

[0033] Based on the optimized block set subL from step S22, the rows and columns corresponding to each block are extracted from the lower triangular matrix L to generate a set of mutually independent lower triangular submatrices subL_k. At the same time, the matching components are extracted from the initial excitation vector b^0 to generate the excitation subvector set subb_k. The extracted components in the original vector b^0 are set to zero so that the subvectors share no data, which sets up independent forward substitution subproblems.

[0034] Step S24: Perform multi-core parallel solution of the lower triangular sub-equations, including:

[0035] Each subL_k is matched one-to-one with its subb_k to construct several independent lower triangular sub-equations:

[0036] subL_k * subr_k = subb_k;

[0037] The forward substitution solves are distributed across the cores of a multi-core processor for parallel execution. Each core solves its sub-equations independently and records its subtask time; the maximum subtask time is taken as the time metric for the parallel solve. The result is the intermediate solution subvector set subr_k;

[0038] Step S25: Reconstructing the global intermediate solution vector includes:

[0039] Based on the block index mapping from step S22, each component of subr_k is mapped to its position in the global intermediate solution vector r according to the original matrix dimension, and the vector is reconstructed by component-wise accumulation. The reconstruction process records the time spent on subvector mapping to quantify memory access performance. The result is the complete intermediate solution vector r, which completes the parallel solution of Lr=b.
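Steps S23 to S25 can be illustrated with the following Python sketch. It makes several assumptions that the source does not state: L is stored as one dictionary per row, each block is closed under dependencies (every row in a block refers only to columns of the same block), and the remaining scheduling nodes are solved serially after the blocks. A thread pool is used only to show the task structure; CPython threads do not speed up this pure-Python arithmetic. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_block(rows, b, nodes):
    """Forward substitution restricted to one block of row indices."""
    r = {}
    for i in sorted(nodes):
        acc = b[i]
        for j, v in rows[i].items():
            if j != i:
                acc -= v * r[j]  # KeyError here means the block is not closed
        r[i] = acc / rows[i][i]
    return r


def parallel_forward(rows, b, blocks, serial):
    """Solve L r = b: independent blocks in parallel, then the serial nodes."""
    r = [0.0] * len(b)
    with ThreadPoolExecutor() as pool:
        for part in pool.map(lambda nodes: solve_block(rows, b, nodes), blocks):
            for i, v in part.items():
                r[i] = v  # scatter the sub-vector back to global positions
    for i in sorted(serial):
        acc = b[i] - sum(v * r[j] for j, v in rows[i].items() if j != i)
        r[i] = acc / rows[i][i]
    return r


# Hypothetical 7x7 lower triangular matrix, one dict per row.
rows = [
    {0: 2.0}, {1: 2.0}, {0: 1.0, 1: 1.0, 2: 3.0},
    {3: 2.0}, {4: 2.0}, {3: 1.0, 4: 1.0, 5: 3.0},
    {2: 1.0, 5: 1.0, 6: 4.0},
]
b = [2.0, 4.0, 9.0, 2.0, 2.0, 8.0, 11.0]
r = parallel_forward(rows, b, blocks=[[0, 1, 2], [3, 4, 5]], serial=[6])
print(r)  # [1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.75]
```

The result equals what a plain row-by-row forward substitution produces, since each row still sees exactly the same earlier values.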

[0040] Further, step S3 includes the following:

[0041] Step S31: Initial partitioning of the elimination tree for the upper triangular matrix includes:

[0042] Constructing a reverse elimination tree dependency graph based on the non-zero sparse structure of the upper triangular matrix U: using the matrix row and column indices as nodes, and non-diagonal non-zero elements as the dependency edges between nodes, adapting to the dependency characteristic of high-index nodes pointing to low-index nodes;

[0043] Furthermore, traverse from the leaf nodes of the reverse elimination tree to the root node, extract the dependency set of each node, and divide the matrix U into several initial blocks CU, each block corresponding to a connected subtree of the reverse elimination tree;

[0044] Step S32: Perform double partitioning and greedy merging of the upper triangular matrix, including:

[0045] For the initial CU block, the block size threshold smax set in step S2 is used. First, the first partitioning is performed: subtrees with high dependency and poor parallelism at the root node of the reverse elimination tree are divided into serial computing modules, and the remaining subtrees are divided into parallel computing modules, achieving precise separation of serial and parallel computing volume in the upper triangular solution process. Then, the second partitioning is performed on the parallel computing modules: all parallelizable subtrees are traversed, and the target subtree with the largest node size is selected. When its number of nodes exceeds smax, the root node of the subtree is marked as a scheduling node and added to the thread task array. At the same time, its child nodes are traversed and the total number of nodes in the subtree is counted. When the size of the subtree corresponding to the child node is still greater than smax, the splitting operation is repeated. If it is less than smax, the child node is set as the new root node, until the size of all parallelizable subtrees does not exceed smax, completing the double partitioning.

[0046] Furthermore, based on the results of the dual partitioning, a bottom-up greedy merging algorithm is performed, including: traversing all blocks, and merging micro-blocks with fewer than smin nodes gradually while ensuring that there are no dependencies between blocks, until all blocks are no smaller than smin, generating an optimized block set subU; the dual partitioning achieves load balancing for the upper triangular parallel computing.

[0047] Greedy merging is used to eliminate the scheduling overhead of micro-blocks.

[0048] The combination of dual partitioning and greedy merging ensures a high degree of matching between the granularity of the upper and lower triangular matrices, guaranteeing the synergy and efficiency of the overall parallel solution.

[0049] Furthermore, step S3 also includes the following:

[0050] Step S33: Extracting the upper triangular submatrix and intermediate subvector includes:

[0051] Based on the optimized block set subU from step S32, the rows and columns corresponding to each block are extracted from matrix U to generate the upper triangular submatrix set subU_s. At the same time, the matching components are extracted from the global intermediate solution vector r to generate the intermediate subvector set subr_s, and the extracted components in the original vector r are set to zero so that each subproblem is completely independent;

[0052] Step S34: Perform multi-core parallel solution of the upper triangular sub-equations, including:

[0053] Each subU_s is matched with its subr_s to construct independent upper triangular sub-equations:

[0054] subU_s * subx_s = subr_s;

[0055] The backward substitution solves are distributed across the cores of a multi-core processor for parallel execution. Each core solves its sub-equations independently and records the time taken by each subtask, yielding the solution subvector set subx_s.

[0056] Furthermore, step S3 also includes the following:

[0057] Step S35: Reconstructing the global solution vector x includes:

[0058] Based on the block index from step S32, each component of subx_s is mapped to its position in the global solution vector x. The vector is reconstructed by component-wise accumulation, yielding the complete solution vector x.
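The backward pass can be sketched in the same style. In backward substitution the dependencies point the other way: a row depends on rows with higher indices. The sketch therefore solves the scheduling nodes near the root first and then hands their values to the blocks, which run in parallel. How ancestor values reach the blocks is not spelled out in the source, so this ordering is an assumption, as are the row-dictionary storage and all function names. The thread pool only illustrates the task structure.

```python
from concurrent.futures import ThreadPoolExecutor


def backward_block(rows, r, nodes, known):
    """Backward substitution on one block, given already-solved ancestor values."""
    x = dict(known)
    for i in sorted(nodes, reverse=True):
        acc = r[i] - sum(v * x[j] for j, v in rows[i].items() if j != i)
        x[i] = acc / rows[i][i]
    return {i: x[i] for i in nodes}


def parallel_backward(rows, r, blocks, serial):
    """Solve U x = r: serial top-of-tree nodes first, then blocks in parallel."""
    known = {}
    for i in sorted(serial, reverse=True):
        acc = r[i] - sum(v * known[j] for j, v in rows[i].items() if j != i)
        known[i] = acc / rows[i][i]
    x = [0.0] * len(r)
    for i, v in known.items():
        x[i] = v
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda nodes: backward_block(rows, r, nodes, known), blocks)
        for part in parts:
            for i, v in part.items():
                x[i] = v  # scatter the sub-vector back to global positions
    return x


# Hypothetical 7x7 upper triangular matrix, one dict per row.
rows = [
    {0: 2.0, 2: 1.0}, {1: 2.0, 2: 1.0}, {2: 3.0, 6: 1.0},
    {3: 2.0, 5: 1.0}, {4: 2.0, 5: 1.0}, {5: 3.0, 6: 1.0},
    {6: 4.0},
]
r = [3.0, 3.0, 4.0, 3.0, 3.0, 4.0, 4.0]
x = parallel_backward(rows, r, blocks=[[0, 1, 2], [3, 4, 5]], serial=[6])
print(x)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```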

[0059] Step S36: Iterative excitation vector update and loop solution includes:

[0060] When the solution vector x meets the preset accuracy requirement, the final solution is output; when the accuracy requirement is not met, the excitation vector is updated based on the chip simulation physical model.

[0061] ;

[0062] where t is the iteration number. The updated b^{t+1} is taken as the new input and the procedure returns to step S23, repeating the parallel solution process until the residual meets the accuracy threshold, thus achieving iterative convergence;

[0063] Furthermore, during this iterative update and repeated solution process, the matrix rearrangement index, LU decomposition results, and block parallel strategy remain unchanged. There is no need to repeatedly perform matrix rearrangement and decomposition operations. Only the subproblem extraction, parallel solution, and result reconstruction need to be performed quickly for the dynamically updated right-hand side.
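The outer loop of step S36 can be sketched as follows. The excitation update rule is not reproduced in this text, so update_rhs is a placeholder that the caller must supply, and the stopping test (largest change between successive solutions below tol) is an assumption. The solve argument stands for the fixed forward and backward substitution pipeline built once from the stored factors and blocks. All names are hypothetical.

```python
def iterate(solve, update_rhs, b0, tol=1e-9, max_steps=100):
    """Repeat solve/update until successive solutions differ by less than tol."""
    b, x_prev = list(b0), None
    for t in range(max_steps):
        x = solve(b)  # reuses the fixed ordering, factors, and block layout
        if x_prev is not None and max(abs(a - c) for a, c in zip(x, x_prev)) < tol:
            return x, t
        x_prev, b = x, update_rhs(x, t)
    raise RuntimeError("no convergence within max_steps")


# Toy stand-ins: a 1x1 "system" 2*x = b and a made-up excitation update.
b0 = [3.0]
x, steps = iterate(
    solve=lambda b: [bi / 2.0 for bi in b],
    update_rhs=lambda x, t: [b0[0] + 0.5 * x[0]],
    b0=b0,
)
```

In this toy case the iteration settles near x = 2, the fixed point of x = (3 + 0.5x) / 2. Nothing inside the loop touches the factorization, which is the point of the reuse described above.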

[0064] According to a second aspect of the present invention, the present invention proposes a dynamic voltage drop solution system based on variable dependency-driven parallelization, comprising an electronic device, wherein the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the dynamic voltage drop solution method based on variable dependency-driven parallelization described in any of the foregoing.

[0065] According to a third aspect of the present invention, the present invention proposes a dynamic voltage drop solution system based on variable dependency-driven parallelization, comprising a computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the dynamic voltage drop solution method based on variable dependency-driven parallelization described in any of the foregoing.

[0066] The present invention has the following advantages:

[0067] This invention significantly improves the execution efficiency of the triangular solve stage: after the sparse LU decomposition is completed, this application performs block-parallel processing of the forward and backward substitution processes based on variable dependencies, decomposing the traditionally serial solution process into multiple independent subproblems. This shortens the total time of the triangular solve stage and improves computational efficiency in a multi-core environment.

[0068] This invention optimizes task partitioning and parallel load balancing: This application proposes a strategy that combines initial partitioning, dual partitioning, and greedy merging to finely control the size of task blocks, avoid the problem of parallel efficiency decay caused by task blocks being too large or too small, achieve balanced task allocation among processor cores, and reduce thread waiting and scheduling overhead.

[0069] This invention is highly adaptable to repetitive solution applications: in dynamic voltage drop simulation, time-domain analysis, and iterative calculations, the coefficient matrix remains fixed while the right-hand side is continuously updated. The matrix rearrangement results, LU decomposition results, and block relationships proposed in this application can all be reused throughout the entire process. Subsequently, only parallel solution and result reconstruction need to be performed on the new right-hand side, significantly reducing the overhead of repetitive calculations and improving the overall solution throughput.

[0070] This invention ensures the accuracy and numerical stability of the solution results: this application builds on the direct method framework, without changing the mathematical solution logic of the original equation system or introducing additional iteration errors. Through parallel solution of subproblems and global result reconstruction, the obtained solution is completely consistent with the results of standard LU forward and backward substitution, meeting the stringent requirements of chip simulation for numerical accuracy and stability.

Attached Figure Description

[0071] Figure 1 is a schematic diagram of the steps of the present invention.

[0072] Figure 2 is a schematic diagram of the original sparse coefficient matrix of the present invention.

[0073] Figure 3 is a schematic diagram of the sparse matrix structure after AMD rearrangement according to the present invention.

[0074] Figure 4 is a schematic diagram of the variable-dependency-driven parallel solution process for dynamic voltage drop according to the present invention.

[0075] Figure 5 is a schematic diagram comparing the computational performance of the method proposed in this invention with that of the traditional direct solution method.

Detailed Implementation

[0076] The technical solution of the present invention will now be described in detail with reference to the accompanying drawings.

[0077] It should be noted that the following detailed description is illustrative and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0078] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments of the present invention; as used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise; furthermore, it should be understood that when the terms “comprising” and / or “including” are used in this specification, they indicate the presence of features, steps, operations, devices, components and / or combinations thereof.

[0079] As shown in Figures 1 to 5, this invention proposes a dynamic voltage drop solution method and system based on variable dependency-driven parallelization, including the following:

[0080] According to a first aspect of the present invention, the present invention proposes a dynamic voltage drop solution method based on variable dependency-driven parallelization, characterized in that, in the process of dynamic voltage drop analysis of the chip, AMD rearrangement and LU decomposition are performed on the sparse matrix, and then dependency resolution, block partitioning, granularity optimization and parallel solution are performed on the forward substitution process of the lower triangular matrix and the backward substitution process of the upper triangular matrix, respectively.

[0081] Furthermore, in one embodiment of the present invention, the dynamic voltage drop solution method further includes the following steps:

[0082] Step S1: Perform AMD rearrangement and LU decomposition on the sparse matrix, including AMD rearrangement preprocessing of the sparse matrix and LU decomposition of the sparse matrix;

[0083] Step S2: Perform dependency resolution, block partitioning, granularity optimization, and parallel solution on the forward substitution of the lower triangular matrix, including initial partitioning of the lower triangular matrix using an elimination tree; double partitioning and greedy merging of the lower triangular matrix blocks; extraction of the lower triangular submatrices and excitation vectors; multi-core parallel solution of the lower triangular sub-equations; and reconstruction of the global intermediate solution vector.

[0084] Step S3: Perform dependency resolution, block partitioning, granularity optimization, and parallel solution on the backward substitution process of the upper triangular matrix, including initial partitioning of the elimination tree of the upper triangular matrix; double partitioning and greedy merging of the upper triangular matrix blocks; extraction of the upper triangular submatrices and intermediate subvectors; multi-core parallel solution of the upper triangular sub-equations; reconstruction of the global solution vector; and iterative excitation vector update and loop solution.

[0085] In one embodiment of the present invention, step S1 includes the following:

[0086] Step S11: Sparse matrix AMD rearrangement preprocessing includes:

[0087] Obtain the symmetric positive definite large-scale sparse coefficient matrix A of the linear system Ax=b to be solved. To suppress fill-in during the subsequent LU decomposition and improve numerical stability, the rows and columns of A are reordered synchronously using the approximate minimum degree (AMD) ordering algorithm, which generates the rearrangement index p; the reordered matrix is A1=A(p, p).

[0088] After rearrangement, the non-zero elements of the matrix cluster towards the main diagonal, and its sparse topology remains constant in subsequent iterations. Only the right-hand excitation vector b is dynamically updated with each iteration step.

[0089] Step S12: Performing LU decomposition on the sparse matrix includes:

[0090] Perform LU decomposition on the rearranged and optimized sparse matrix A1, decomposing it into the product of a lower triangular matrix L and an upper triangular matrix U. The decomposition formula is as follows:

[0091] A1=LU.

[0092] Furthermore, in one embodiment of the present invention, tests showed that when the original unordered matrix A0 was LU-decomposed directly, the resulting lower triangular matrix L contained as many as 7.05 million non-zero elements; when the matrix was first reordered by AMD and then LU-decomposed, the number of non-zero elements in L fell below 500,000, showing that fill-in was significantly suppressed.
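The effect of ordering on fill-in can be reproduced on a toy pattern. The following Python sketch counts fill-in symbolically on the adjacency graph of a symmetric matrix and compares the natural order with a greedy minimum-degree order. It uses an exact minimum-degree rule as a simplified stand-in for AMD, and the arrow-shaped 5x5 pattern is a hypothetical example unrelated to the test matrix reported above. The function names are hypothetical.

```python
def count_fill(adj, order):
    """Count fill-in edges created when eliminating nodes in the given order."""
    g = {v: set(nb) for v, nb in adj.items()}
    fill = 0
    for v in order:
        nbrs = sorted(g.pop(v))
        for u in nbrs:
            g[u].discard(v)
        # Eliminating v couples all of its remaining neighbours pairwise.
        for k, a in enumerate(nbrs):
            for c in nbrs[k + 1:]:
                if c not in g[a]:
                    g[a].add(c)
                    g[c].add(a)
                    fill += 1
    return fill


def min_degree_order(adj):
    """Greedy ordering: always eliminate a node of smallest current degree."""
    g = {v: set(nb) for v, nb in adj.items()}
    order = []
    while g:
        v = min(g, key=lambda k: (len(g[k]), k))
        nbrs = g.pop(v)
        for u in nbrs:
            g[u].discard(v)
            g[u] |= nbrs - {u}
        order.append(v)
    return order


# Arrow pattern: node 0 is coupled to every other node.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(count_fill(adj, [0, 1, 2, 3, 4]))        # 6
print(min_degree_order(adj))                   # [1, 2, 3, 0, 4]
print(count_fill(adj, min_degree_order(adj)))  # 0
```

Eliminating the hub first couples its four neighbours pairwise (six new entries), while eliminating it late creates none.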

[0093] Furthermore, in one embodiment of the present invention, since the topological structure of the sparse matrix is fixed, the decomposition result can be reused in all iteration steps, avoiding the computational overhead of repeated decomposition and providing basic support for subsequent parallel solutions.

[0094] Furthermore, in one embodiment of the present invention, step S2 includes the following:

[0095] Step S21: Perform the initial partitioning of the elimination tree for the lower triangular matrix, including:

[0096] Construct an elimination tree dependency graph based on the non-zero sparsity structure of the lower triangular matrix L: use the matrix row and column indices as nodes, and the off-diagonal non-zero elements as the dependency edges between nodes (the solution of row i depends on the results of rows j < i); traverse from the leaf nodes of the elimination tree up to the root node, extract the dependency set of each node, and divide the matrix L into several initial blocks CL, each corresponding to a connected subtree of the elimination tree;

[0097] Step S22: Perform double partitioning and greedy merging of the lower triangular matrix, including:

[0098] For the initial blocks CL obtained in step S21, set an upper threshold smax on the block size (the threshold is calculated adaptively from the total matrix dimension and the number of processor cores, for example total number of nodes / 128). The larger the value of smax, the fewer subproblems the partitioning produces and the larger the computational scale of each; the smaller the value, the more and smaller the subproblems. In practical applications, this parameter must be chosen by jointly considering the matrix size, the number of processor cores, and the parallel scheduling overhead. First, perform the first partitioning: assign the subtrees at the root node that have high dependency and poor parallelism to serial computation modules, and assign the remaining subtrees to parallel computation modules, achieving a precise separation of serial and parallel computational loads. Then, perform the second partitioning on the parallel computation modules: all parallelizable subtrees are traversed, and the target subtree with the largest node count is selected. When the number of nodes in the target subtree exceeds the threshold smax, the root node of that subtree is marked as a scheduling node and added to the thread task array. At the same time, its child nodes are traversed and the total number of nodes in each child's subtree is counted. When the subtree of a child node is larger than smax, the splitting operation is repeated; when it is not larger than smax, the child node is set as a new root node. This continues until no parallelizable subtree exceeds smax, completing the double partitioning.

[0099] Based on the results of the double partitioning, a bottom-up greedy merging algorithm is performed: all blocks are traversed, and micro-blocks with fewer nodes than the lower threshold smin are merged, while ensuring that no dependencies exist between blocks, until the size of every block is no less than smin, generating the optimized block set subL. The smaller the value of smin, the more blocks are retained after merging and the higher the parallelism; the larger the value of smin, the fewer the blocks and the greater the computational cost of a single block, which reduces the additional overhead of thread scheduling and data interaction. In practical applications, smin is set adaptively according to the parallel granularity requirements and the system scheduling efficiency. The double partitioning balances the parallel computing load, and the greedy merging reduces parallel scheduling overhead; together they tune the block granularity.
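
The merging step might look like the following Python sketch. It assumes the input blocks are already mutually independent index lists, so that the union of any two of them is still independent of the remaining blocks; the function name `greedy_merge` and the tie-breaking rules are assumptions made only for illustration.

```python
def greedy_merge(blocks, smin):
    """Merge blocks with fewer than smin nodes until every block has at
    least smin nodes (when the total node count allows it)."""
    large = [sorted(b) for b in blocks if len(b) >= smin]
    small = sorted((sorted(b) for b in blocks if len(b) < smin), key=len)

    merged, current = [], []
    for block in small:                 # smallest blocks are absorbed first
        current.extend(block)
        if len(current) >= smin:        # the accumulated block is large enough
            merged.append(sorted(current))
            current = []

    if current:                         # leftover nodes that never reached smin
        if merged:
            merged[-1] = sorted(merged[-1] + current)
        elif large:
            k = min(range(len(large)), key=lambda idx: len(large[idx]))
            large[k] = sorted(large[k] + current)
        else:
            merged.append(sorted(current))
    return large + merged


blocks = [[0], [1], [3, 4], [6, 7, 8, 9]]
print(greedy_merge(blocks, smin=2))  # [[3, 4], [6, 7, 8, 9], [0, 1]]
```

Raising `smin` to 3 in this example folds `[3, 4]` into the merged block as well, leaving two blocks of four nodes each.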

[0100] Furthermore, in one embodiment of the present invention, step S2 further includes the following:

[0101] Step S23: Extracting the lower triangular submatrices and excitation sub-vectors includes:

[0102] Based on the optimized block set subL from step S22, the rows and columns corresponding to each block are extracted from the lower triangular matrix L to generate a set of mutually independent lower triangular submatrices subL_k. At the same time, the matching components are extracted from the initial excitation vector b^0 to generate the set of excitation sub-vectors subb_k. The extracted components in the original vector b^0 are then set to zero to ensure that there is no data overlap between the sub-vectors, and independent forward substitution subproblems are constructed.

[0103] Step S24: Perform multi-core parallel solution of the lower triangular sub-equations, including:

[0104] Each subL_k is matched one-to-one with subb_k to construct several independent lower triangular sub-equation systems:

[0105] subL_k * subr_k = subb_k;

[0106] The forward substitution is distributed across different cores of a multi-core processor for parallel execution. Each core independently solves its sub-equation system and records the time taken by the subtask. The maximum subtask time is used as the time metric of the parallel solution, and the set of intermediate solution sub-vectors subr_k is finally obtained.
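
Steps S23 and S24 can be sketched in Python as follows, assuming a small dense matrix stored as nested lists and blocks given as index lists with no coupling between them. The helper names are hypothetical. A thread pool is used only to show the task structure: in CPython, pure-Python arithmetic does not run concurrently across threads, so a process pool or a native kernel would be needed for real speedup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def forward_substitution(L, b):
    """Solve L r = b for a lower triangular L with non-zero diagonal."""
    r = [0.0] * len(b)
    for i in range(len(b)):
        partial = sum(L[i][j] * r[j] for j in range(i))
        r[i] = (b[i] - partial) / L[i][i]
    return r


def extract(L, b, idx):
    """Pull the rows/columns in idx out of L and the matching entries of b."""
    sub_l = [[L[i][j] for j in idx] for i in idx]
    sub_b = [b[i] for i in idx]
    return sub_l, sub_b


def timed_solve(task):
    """Solve one sub-equation system and record its elapsed time."""
    sub_l, sub_b = task
    start = time.perf_counter()
    sub_r = forward_substitution(sub_l, sub_b)
    return sub_r, time.perf_counter() - start


def solve_lower_blocks(L, b, blocks):
    tasks = [extract(L, b, idx) for idx in blocks]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(timed_solve, tasks))
    sub_solutions = [sub_r for sub_r, _ in results]
    parallel_time = max(elapsed for _, elapsed in results)  # slowest subtask
    return sub_solutions, parallel_time


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 4.0, 0.0],
     [0.0, 0.0, 2.0, 1.0]]
b = [2.0, 3.0, 4.0, 5.0]
sub_r, t_parallel = solve_lower_blocks(L, b, [[0, 1], [2, 3]])
print(sub_r)  # [[1.0, 2.0], [1.0, 3.0]]
```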

[0107] Step S25: Reconstructing the global intermediate solution vector includes:

[0108] Based on the block index mapping from step S22, each component of subr_k is mapped to the corresponding position of the global intermediate solution vector r according to the original matrix dimension, and the vector reconstruction is completed by accumulating the components. The reconstruction process records the time consumed by sub-vector mapping in order to quantify memory access performance, and finally yields the complete intermediate solution vector r, completing the parallel solution of Lr=b.
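
The reconstruction amounts to a scatter-accumulate, shown here as a short Python sketch. It assumes each block is stored as the list of global indices it was extracted from; the function name `reconstruct` is hypothetical.

```python
import time


def reconstruct(n, blocks, sub_vectors):
    """Scatter each sub-vector back to its global positions by accumulation.
    Returns the global vector and the time spent on the mapping."""
    start = time.perf_counter()
    r = [0.0] * n
    for idx, sub in zip(blocks, sub_vectors):
        for position, value in zip(idx, sub):
            r[position] += value        # index sets are disjoint, so no clash
    return r, time.perf_counter() - start


r, t_map = reconstruct(4, [[0, 3], [1, 2]], [[1.0, 4.0], [2.0, 3.0]])
print(r)  # [1.0, 2.0, 3.0, 4.0]
```

Because the blocks do not overlap, accumulation and plain assignment give the same result; accumulation is kept here only to mirror the wording of the step.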

[0109] Furthermore, in one embodiment of the present invention, step S3 includes the following:

[0110] Step S31: Initial partitioning of the elimination tree for the upper triangular matrix includes:

[0111] A reverse elimination tree dependency graph is constructed based on the non-zero sparsity structure of the upper triangular matrix U: the matrix row and column indices are used as nodes, and the off-diagonal non-zero elements are used as the computational dependency edges between nodes (the solution of row i depends on the results of rows j > i), reflecting the dependency characteristic of high-index nodes pointing to low-index nodes. Traversing from the leaf nodes of the reverse elimination tree toward the root node, the dependency set of each node is extracted, and the matrix U is divided into several initial blocks CU, each block corresponding to a connected subtree of the reverse elimination tree;

[0112] Step S32: Perform double partitioning and greedy merging of the upper triangular matrix, including:

[0113] For the initial CU block, the block size threshold smax set in step S2 is used. First, the first partitioning is performed: subtrees with high dependency and poor parallelism at the root node of the reverse elimination tree are divided into serial computing modules, and the remaining subtrees are divided into parallel computing modules, achieving precise separation of serial and parallel computing volume in the upper triangular solution process. Then, the second partitioning is performed on the parallel computing modules: all parallelizable subtrees are traversed, and the target subtree with the largest node size is selected. When its number of nodes exceeds smax, the root node of the subtree is marked as a scheduling node and added to the thread task array. At the same time, its child nodes are traversed and the total number of nodes in the subtree is counted. When the size of the subtree corresponding to the child node is still greater than smax, the splitting operation is repeated. If it is less than smax, the child node is set as the new root node, until the size of all parallelizable subtrees does not exceed smax, completing the double partitioning.

[0114] Based on the results of the dual partitioning, a bottom-up greedy merging algorithm is performed: traversing all blocks, and merging micro-blocks with fewer than smin nodes while ensuring no dependencies between blocks, until all block sizes are no less than smin, generating an optimized block set subU; the dual partitioning achieves load balancing for upper triangular parallel computing.

[0115] Greedy merging is used to eliminate the scheduling overhead of micro-blocks. The combination of double partitioning and greedy merging ensures a high degree of matching between the granularity of the upper and lower triangular matrices, guaranteeing the synergy and efficiency of the overall parallel solution.

[0116] Furthermore, in one embodiment of the present invention, step S3 further includes the following:

[0117] Step S33: Extracting the upper triangular submatrix and intermediate subvector includes:

[0118] Based on the optimized block set subU from step S32, the rows and columns corresponding to each block are extracted from matrix U to generate the set of upper triangular submatrices subU_s. At the same time, the matching components are extracted from the global intermediate solution vector r to generate the set of intermediate sub-vectors subr_s, and the extracted components in the original vector r are set to zero to ensure that each subproblem is completely independent.

[0119] Step S34: Perform multi-core parallel solution of the upper triangular sub-equations, including:

[0120] Each subU_s is matched with subr_s to construct independent upper triangular sub-equation systems:

[0121] subU_s * subx_s = subr_s;

[0122] The backward substitution is distributed across the cores of a multi-core processor for parallel execution. Each core independently solves its sub-equation system and records the time taken by the subtask, finally yielding the set of solution sub-vectors subx_s.
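
The upper triangular counterpart can be sketched the same way in Python. The sketch assumes a small dense U stored as nested lists and independent index blocks; the names are hypothetical, and the thread pool again only illustrates how the subtasks are dispatched.

```python
from concurrent.futures import ThreadPoolExecutor


def backward_substitution(U, r):
    """Solve U x = r for an upper triangular U with non-zero diagonal."""
    n = len(r)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):      # last row first: row i needs rows j > i
        partial = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (r[i] - partial) / U[i][i]
    return x


def solve_upper_blocks(U, r, blocks):
    def solve_one(idx):
        sub_u = [[U[i][j] for j in idx] for i in idx]
        sub_r = [r[i] for i in idx]
        return backward_substitution(sub_u, sub_r)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(solve_one, blocks))


U = [[2.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 4.0, 2.0],
     [0.0, 0.0, 0.0, 1.0]]
r = [4.0, 2.0, 8.0, 1.0]
print(solve_upper_blocks(U, r, [[0, 1], [2, 3]]))  # [[1.0, 2.0], [1.5, 1.0]]
```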

[0123] Furthermore, in one embodiment of the present invention, step S3 further includes the following:

[0124] Step S35: Reconstructing the global solution vector x includes:

[0125] Based on the block index from step S32, each component of subx_s is mapped to the corresponding position of the global solution vector x. The vector is reconstructed by accumulating the components, and the complete solution vector x is finally obtained.

[0126] Step S36: Iterative excitation vector update and cyclic solution includes:

[0127] When the solution vector x meets the preset accuracy requirement, the final solution is output; when the accuracy requirement is not met, the excitation vector is updated based on the chip simulation physical model.

[0128] ;

[0129] where t is the iteration number. The updated b^{t+1} is taken as the new input, and the process returns to step S23 to repeat the parallel solution until the residual meets the accuracy threshold, achieving iterative convergence. During this iterative update and repeated solution, the matrix rearrangement index, the LU decomposition results, and the block parallel strategy remain unchanged, so the matrix rearrangement and decomposition need not be repeated. Only the subproblem extraction, parallel solution, and result reconstruction are performed for the dynamically updated right-hand side, which improves the overall efficiency of the iterative dynamic voltage drop analysis while preserving simulation accuracy.
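
The reuse of the factorization across iterations can be illustrated with the Python sketch below. The excitation update formula is not reproduced in this text, so `update_rhs` is a purely hypothetical placeholder, and the stopping test on the change between successive solutions is likewise an assumption standing in for the accuracy requirement. The LU routine is a plain Doolittle factorization without pivoting, adequate for the small symmetric positive definite example used here.

```python
def lu_factor(A):
    """Doolittle LU without pivoting: A = L U with unit diagonal in L."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def lu_solve(L, U, b):
    n = len(b)
    r = [0.0] * n
    for i in range(n):                              # forward: L r = b
        r[i] = b[i] - sum(L[i][j] * r[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                  # backward: U x = r
        x[i] = (r[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


def iterate(A, b0, update_rhs, tol=1e-10, max_iter=100):
    L, U = lu_factor(A)          # factor once, reuse in every iteration
    b, x_prev = list(b0), None
    for t in range(max_iter):
        x = lu_solve(L, U, b)    # only the right-hand side changes
        if x_prev is not None and max(abs(p - q) for p, q in zip(x, x_prev)) < tol:
            return x, t
        x_prev = x
        b = update_rhs(b0, x, t)
    return x, max_iter


def update_rhs(b0, x, t):
    """Hypothetical update rule, for illustration only."""
    return [bi + 0.1 * xi for bi, xi in zip(b0, x)]


A = [[4.0, 1.0], [1.0, 3.0]]
x, iterations = iterate(A, [1.0, 2.0], update_rhs)
```

In a full implementation the call to `lu_solve` would be replaced by the block extraction, parallel solution, and reconstruction of steps S23 to S35, with the block partition computed once alongside the factorization.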

[0130] According to a second aspect of the present invention, the present invention proposes a dynamic voltage drop solution system based on variable dependency-driven parallelization, comprising an electronic device, wherein the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements a dynamic voltage drop solution method based on variable dependency-driven parallelization as described in any one of the present invention.

[0131] According to a third aspect of the present invention, the present invention proposes a dynamic voltage drop solution system based on variable dependency-driven parallelization, comprising a computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements a dynamic voltage drop solution method based on variable dependency-driven parallelization as described in any one of the present invention.

[0132] In one embodiment of the present invention, as shown in Figure 2 and Figure 3, Figure 2 is a schematic diagram of the original sparse coefficient matrix of the present invention, and Figure 3 is a schematic diagram of the sparse matrix structure optimized by AMD rearrangement according to the present invention. As can be seen from the figures, the method of the present invention significantly improves the sparse matrix structure.

[0133] In one embodiment of the present invention, Figure 4 is a schematic diagram comparing the computational performance of the proposed variable dependency-driven parallel solution method with the traditional direct solution method. All test data were produced and collected with MATLAB programs in the same computer hardware environment. The traditional direct solution method uses the standard direct solution process of sparse LU decomposition followed by serial forward substitution and serial backward substitution to completely solve the same large-scale sparse linear equation system, and the overall serial time is recorded as the performance data. In contrast, the parallel solution method of the present invention adopts a block-based parallel solution strategy driven by variable dependency, which decomposes the original large-scale problem into multiple subproblems with no data dependencies between them and solves them in parallel in a multi-core processor environment. The longest time among all subtasks is used as the overall parallel solution time. The two methods use the same sparse coefficient matrix, right-hand-side excitation vector, and solution accuracy conditions; the only difference is whether the solution process is serial or parallel, so that the comparison reflects the computational efficiency improvement of the present invention over the traditional method.

[0134] In one embodiment of the present invention, as shown in Figure 5, six sets of comparative experiments were conducted to compare the time consumption of the parallel solution and the direct solution. In all six, the parallel solution took less time than the direct solution.

[0135] The above are preferred embodiments of the present invention. Any changes made to the technical solution of the present invention that do not exceed the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims

1. A dynamic voltage drop solving method based on variable dependency-driven parallelization, characterized in that, in the process of dynamic voltage drop analysis of a chip, rearrangement and LU decomposition are performed on the sparse matrix, and then dependency analysis, block division, granularity optimization and parallel solving are performed respectively on the forward substitution process of the lower triangular matrix and the backward substitution process of the upper triangular matrix.

2. The method of claim 1, wherein the dynamic voltage drop solving method further comprises the following steps: Step S1: rearranging and LU decomposing the sparse matrix, including rearrangement preprocessing of the sparse matrix and LU decomposition of the sparse matrix; Step S2: performing dependency analysis, block division, granularity optimization and parallel solving on the forward substitution of the lower triangular matrix, including initial division of the elimination tree of the lower triangular matrix, double division and greedy merging of the lower triangular matrix blocks, extraction of lower triangular sub-matrices and excitation sub-vectors, multi-core parallel solving of lower triangular sub-equation systems, and reconstruction of the global intermediate solution vector; Step S3: performing dependency analysis, block division, granularity optimization and parallel solving on the backward substitution process of the upper triangular matrix, including initial division of the elimination tree of the upper triangular matrix, double division and greedy merging of the upper triangular matrix blocks, extraction of upper triangular sub-matrices and intermediate sub-vectors, multi-core parallel solving of upper triangular sub-equation systems, reconstruction of the global solution vector, and iterative excitation vector updating and cyclic solving.

3. The method of claim 2, wherein step S1 includes the following: Step S11: sparse matrix rearrangement preprocessing includes: obtaining a symmetric positive definite large-scale sparse coefficient matrix A of a system of linear equations Ax=b to be solved; in order to suppress the growth of fill-in elements in the subsequent LU decomposition and improve numerical stability, an approximate minimum degree (AMD) ordering algorithm is used to perform synchronous row and column rearrangement on the matrix A, generating a rearrangement index p, and the rearranged matrix A1=A(p,p) is obtained according to the index; the non-zero elements of the rearranged matrix are concentrated near the main diagonal, and the sparse topological structure remains constant in subsequent iterative solving, with only the right-hand-side excitation vector b being dynamically updated at each iteration step; Step S12: performing LU decomposition on the sparse matrix includes: performing LU decomposition on the rearranged sparse matrix A1 to decompose it into the product of a lower triangular matrix L and an upper triangular matrix U, the decomposition formula being A1=LU; wherein the decomposition result A1=LU can be reused in all iteration steps.

4. The method of claim 3, wherein step S2 includes the following: Step S21: performing initial division of the elimination tree of the lower triangular matrix includes: based on the non-zero sparsity structure of the lower triangular matrix L, an elimination tree dependency graph is constructed, taking the matrix row and column indices as nodes and the off-diagonal non-zero elements as the computational dependency edges between nodes; the nodes are traversed from the leaf nodes toward the root node, the dependency set of each node is extracted, and the matrix L is divided into several initial blocks CL, each block corresponding to a connected subtree of the elimination tree; Step S22: double division and greedy merging of the lower triangular matrix blocks includes: for the initial blocks CL obtained in step S21, a block size upper threshold smax is set; the larger the value of smax, the fewer the subproblems obtained by division and the larger the computational scale of a single subproblem; conversely, the smaller the value of smax, the more the subproblems and the smaller the size of a single block; the first partitioning is performed: subtrees with high dependency and poor parallelism at the root node are assigned to the serial computing module, and the remaining subtrees are assigned to the parallel computing module, separating the serial and parallel computing workloads; the second partitioning is then performed on the parallel computing module: all parallelizable subtrees are traversed, and the target subtree with the largest number of nodes is selected; when its number of nodes exceeds the threshold smax, the root node of the subtree is marked as a scheduling node and added to the thread task array, its child nodes are traversed, and the total number of nodes in each child subtree is counted; when the subtree corresponding to a child node is larger than smax, the splitting operation is repeated; otherwise, the child node is set as a new root node, until no parallelizable subtree exceeds smax, completing the double partitioning; a bottom-up greedy merging algorithm is then performed based on the double partitioning results: all blocks are traversed and, while ensuring no dependencies between blocks, micro-blocks with fewer nodes than the lower threshold smin are gradually merged until all block sizes are no less than smin, generating an optimized block set subL; a smaller smin value results in more blocks retained after merging and higher parallelism; conversely, a larger smin value results in fewer blocks but a higher computational cost per block; smin is set adaptively according to the parallel granularity requirements and the system scheduling efficiency; the double partitioning balances the parallel computing load, the greedy merging reduces parallel scheduling overhead, and the combination of the two tunes the block granularity.

5. The method of claim 4, wherein step S2 also includes the following: Step S23: extracting the lower triangular submatrices and excitation sub-vectors includes: based on the optimized block set subL of step S22, the rows and columns corresponding to each block are extracted from the lower triangular matrix L to generate a set of independent lower triangular submatrices subL_k; the matching components are extracted synchronously from the initial excitation vector b^0 to generate a set of excitation sub-vectors subb_k, and the extracted components in the original vector b^0 are set to zero to ensure that the sub-vectors have no data overlap, constructing independent forward substitution subproblems; Step S24: performing multi-core parallel solution of the lower triangular sub-equation systems includes: each subL_k is matched one-to-one with subb_k to construct several independent lower triangular sub-equation systems: subL_k * subr_k = subb_k; the forward substitution is distributed to different cores of a multi-core processor for parallel execution; each core independently solves a sub-equation system and records the time consumed by the subtask, the maximum time consumed is taken as the time metric of the parallel solution, and a set of intermediate solution sub-vectors subr_k is finally obtained; Step S25: reconstructing the global intermediate solution vector includes: according to the block index mapping of step S22, each component of subr_k is mapped to the corresponding position of the global intermediate solution vector r according to the original matrix dimension, and the vector reconstruction is completed through component accumulation; the time consumed by sub-vector mapping is recorded during reconstruction to quantitatively evaluate memory access performance, and the complete intermediate solution vector r is finally obtained, completing the parallel solution of Lr=b.

6. The method of claim 5, wherein, Step S3 includes the following: Step S31: Initial partitioning of the elimination tree for the upper triangular matrix includes: Constructing a reverse elimination tree dependency graph based on the non-zero sparse structure of the upper triangular matrix U: using the matrix row and column indices as nodes, and non-diagonal non-zero elements as the dependency edges between nodes, adapting to the dependency characteristic of high-index nodes pointing to low-index nodes; Furthermore, traverse from the leaf nodes of the reverse elimination tree to the root node, extract the dependency set of each node, and divide the matrix U into several initial blocks CU, each block corresponding to a connected subtree of the reverse elimination tree; Step S32: Perform double partitioning and greedy merging of the upper triangular matrix, including: For the initial CU block, the block size threshold smax set in step S2 is used. First, the first partitioning is performed: subtrees with high dependency and poor parallelism at the root node of the reverse elimination tree are divided into serial computing modules, and the remaining subtrees are divided into parallel computing modules, achieving precise separation of serial and parallel computing volume in the upper triangular solution process. Then, the second partitioning is performed on the parallel computing modules: all parallelizable subtrees are traversed, and the target subtree with the largest node size is selected. When its number of nodes exceeds smax, the root node of the subtree is marked as a scheduling node and added to the thread task array. At the same time, its child nodes are traversed and the total number of nodes in the subtree is counted. When the size of the subtree corresponding to the child node is still greater than smax, the splitting operation is repeated. 
If it is less than smax, the child node is set as the new root node, until the size of all parallelizable subtrees does not exceed smax, completing the double partitioning. Furthermore, based on the results of the dual partitioning, a bottom-up greedy merging algorithm is performed, including: traversing all blocks, and merging micro-blocks with fewer than smin nodes gradually while ensuring that there are no dependencies between blocks, until all blocks are no smaller than smin, generating an optimized block set subU; the dual partitioning achieves load balancing for the upper triangular parallel computing. Greedy merging is used to eliminate the scheduling overhead of micro-blocks. The combination of dual partitioning and greedy merging ensures a high degree of matching between the granularity of the upper and lower triangular matrices, guaranteeing the synergy and efficiency of the overall parallel solution.

7. The method of claim 6, wherein step S3 also includes the following: Step S33: extracting the upper triangular submatrices and intermediate sub-vectors includes: based on the optimized block set subU of step S32, the rows and columns corresponding to each block are extracted from matrix U to generate a set of upper triangular submatrices subU_s; the matching components are extracted from the global intermediate solution vector r to generate a set of intermediate sub-vectors subr_s, and the extracted components in the original vector r are set to zero to ensure that each subproblem is completely independent; Step S34: performing multi-core parallel solution of the upper triangular sub-equation systems includes: each subU_s is matched with subr_s to construct independent upper triangular sub-equation systems: subU_s * subx_s = subr_s; the backward substitution is distributed to the multi-core processor for parallel execution, wherein each core independently solves its sub-equation system and records the time consumed by the subtask, and a set of solution sub-vectors subx_s is finally obtained.

8. The method of claim 7, wherein step S3 also includes the following: Step S35: reconstructing the global solution vector x includes: according to the block index of step S32, each component of subx_s is mapped to the corresponding position of the global solution vector x, the vector reconstruction is completed by component accumulation, and the complete solution vector x is finally obtained; Step S36: iterative excitation vector update and cyclic solution includes: when the solution vector x meets the preset accuracy requirement, the final solution is output; when the accuracy requirement is not met, the excitation vector is updated based on the chip simulation physical model; where t is the iteration number, and the updated b^{t+1} is taken as the new input, returning to step S23 to repeat the parallel solving process until the residual meets the accuracy threshold, achieving iterative convergence; during this iterative update and repeated solution process, the matrix rearrangement index, the LU decomposition results, and the block parallel strategy remain unchanged, so that the matrix rearrangement and decomposition operations need not be repeated, and only the subproblem extraction, parallel solution, and result reconstruction are performed for the dynamically updated right-hand side.

9. A dynamic voltage drop solving system based on variable dependency-driven parallelization, comprising an electronic device, wherein the electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the dynamic voltage drop solving method based on variable dependency-driven parallelization according to any one of claims 1 to 8.

10. A dynamic voltage drop solving system based on variable dependency-driven parallelization, comprising a computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the dynamic voltage drop solving method based on variable dependency-driven parallelization according to any one of claims 1 to 8.