Sunway many-core PCG optimization method and system for shallow water equations

By optimizing data partitioning and parallel processing, the performance limitations of the PCG algorithm on the Sunway supercomputer were resolved, enabling efficient computation of the diagonal preconditioned PCG algorithm on the Sunway 26010pro processor, thus improving the computation speed and performance of shallow water equations.

CN117707785B · Active · Publication Date: 2026-06-19 · SHANDONG COMP SCI CENT (NAT SUPERCOMP CENT IN JINAN) +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANDONG COMP SCI CENT (NAT SUPERCOMP CENT IN JINAN)
Filing Date: 2023-12-28
Publication Date: 2026-06-19

Application Information

IPC: G06F9/50; G06F13/28; G06F13/40

AI Technical Summary

⚠Technical Problem

In existing technologies, the diagonally preconditioned PCG algorithm cannot fully utilize the Sunway 26010pro processor on the Sunway supercomputer: data interaction is slow and data transfers are frequent, which leads to low computational efficiency for shallow water equations.

⚗Method used

The diagonal preconditioning PCG algorithm is optimized by employing row-based data partitioning, two-level parallelism, partitioned DMA transfer, shared memory regions, and communication optimization techniques. The coefficient matrix is read through row-compressed storage format, data is transferred using DMA, and parallel computation and data-level parallel processing are implemented to reduce communication waiting time.

🎯Benefits of technology

The calculation speed and performance of shallow water equations on the Sunway supercomputer have been improved, making full use of the computing cores and memory access capabilities of the Sunway 26010pro processor and reducing the calculation waiting time.

Abstract

This invention proposes a many-core optimization method and system for PCG on the Sunway supercomputer for shallow water equations, relating to the field of data processing technology. The method includes reading the coefficient matrix and right-hand side terms, setting basic conditions; in the management core, dividing the coefficient matrix into data blocks based on entire rows, uniformly dividing it along the row direction to obtain block data; uniformly dividing the LDM of the computing core into two partitions, transmitting the block data to the computing core multiple times, with the two partitions implementing the transmission and computation processes in parallel; summing the data calculated by each computing core within the same core group, placing it in a shared memory area, and then summing it again by a designated computing core to compare the residuals and determine whether the residual reduction requirement is met. This invention uses a row-based partitioning method, two-level parallelism, and communication avoidance to accelerate computation, providing an efficient implementation of diagonally preconditioned PCG on the Sunway supercomputer for shallow water equations.

Description

Technical Field

[0001] This invention belongs to the field of data processing technology, and in particular relates to the PCG Shenwei many-core optimization method and system for shallow water equations.

Background Technology

[0002] Shallow water equations are a set of nonlinear partial differential equations describing the evolution of water waves when the water depth is relatively small compared to the wavelength. They are commonly used to study coastal and ocean dynamics, river flow, and wave propagation. In practical problems, these equations are usually solved numerically because analytical solutions are often difficult to obtain. Numerical solutions to shallow water equations involve discretizing space and time to simulate the evolution of water waves using computers.

[0003] The preconditioned conjugate gradient (PCG) algorithm has long been a dominant method in the field of iterative linear system solvers due to its efficiency and versatility. The PCG method is widely used in shallow water equation calculations because of its advantages such as small storage requirements, simple iterative format, and the ability to complete the iterative process using only the objective function value and its gradient.

[0004] In the PCG algorithm, preconditioning is a technique used to improve algorithm performance. By appropriately selecting preconditions, the condition number of the problem is reduced, thereby accelerating the convergence speed of the algorithm. A precondition is a matrix used to transform the original problem, thus improving convergence performance. This matrix is typically chosen as an approximate inverse matrix to transform the original problem into a more solvable form. The goal of preconditioning is to reduce the condition number of the problem, enabling the algorithm to converge to a solution faster.

[0005] Common preconditioners include diagonal preconditioning, incomplete factorization (ILU) preconditioning, and algebraic multigrid (AMG) preconditioning. The core idea of diagonal preconditioning is to approximate the original matrix by its diagonal, whose inverse is trivial to apply, thereby simplifying the problem. Diagonal preconditioning has a relatively low computational cost because it only involves the diagonal elements of the original matrix, which makes it attractive for large-scale problems and a frequent choice in simple PCG algorithms.
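As an illustration of why the cost is low, diagonal (Jacobi) preconditioning is commonly realized by taking M = diag(A), so that applying the preconditioner is one division per unknown: z_i = r_i / a_ii. The sketch below is a minimal example under that assumption. The function names and the 3x3 matrix are hypothetical and are not taken from the patent.

```python
def csr_diagonal(row_ptr, col_idx, values):
    """Return the diagonal of a square matrix stored in row-compressed form."""
    n = len(row_ptr) - 1
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                diag[i] = values[k]
    return diag


def apply_diagonal_preconditioner(diag, r):
    """Compute z = M^{-1} r with M = diag(A), i.e. z_i = r_i / a_ii."""
    return [ri / di for ri, di in zip(r, diag)]


# Hypothetical 3x3 example: A = [[4, 1, 0], [1, 3, 0], [0, 0, 2]]
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 0, 1, 2]
values = [4.0, 1.0, 1.0, 3.0, 2.0]

diag = csr_diagonal(row_ptr, col_idx, values)
z = apply_diagonal_preconditioner(diag, [8.0, 9.0, 4.0])
```

Only the diagonal entries are read, so the preconditioner needs one extra vector of storage and no factorization.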

[0006] In shallow water equations, continuous physical quantities need to be discretized to facilitate numerical calculations by computers. Since water simulations typically involve large-scale grids, the discrete matrix is usually sparse, making it well-suited for solving using PCG.

[0007] China's latest independently developed Sunway supercomputer is a key platform for achieving exascale computing. It is equipped with the Sunway 26010pro processor, which adopts a heterogeneous many-core architecture. Each processor consists of six core groups, each including a Management Processing Element (MPE, the management core), an 8x8 array of Computing Processing Elements (CPEs, the computing cores), and 16GB of main memory. Each computing core has 256KB of high-speed Local Data Memory (LDM). The computing cores can use Direct Memory Access (DMA) to read and write data in main memory. Core groups are connected via a ring-shaped network on chip (NoC).

[0008] The inventors discovered that although the new generation of Sunway supercomputers offers outstanding performance, the uniqueness of its architecture and platform means that diagonally preconditioned PCG cannot fully utilize the Sunway 26010pro on the new machine. For example, when running on the Sunway architecture, diagonally preconditioned PCG cannot make use of the Sunway acceleration cores, suffers from slow data interaction, and requires frequent data transfers. These limitations hinder the rapid computation of shallow water equations on the Sunway machine.

Summary of the Invention

[0009] To overcome the shortcomings of the prior art, this invention provides a PCG many-core optimization method and system for shallow water equations. It uses row-based partitioning, two-level parallelism, and communication avoidance to accelerate the calculation speed. It also provides an efficient implementation of diagonal preconditioned PCG on the Sunway supercomputer for shallow water equations.

[0010] To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions:

[0011] The first aspect of this invention provides a PCG Shenwei many-core optimization method for shallow water equations.

[0012] The PCG Shenwei many-core optimization method for shallow water equations includes the following steps:

[0013] Step 1: Read the coefficient matrix of the computationally intensive part of the program in row-compressed storage format, and read in the right-hand side. Set the basic conditions, including the residual reduction requirement and the maximum number of iterations.

[0014] Step 2: In the management core, partition the coefficient matrix by whole rows, dividing it evenly along the row direction to obtain block data.

[0015] Step 3: Divide the LDM of each computing core evenly into two partitions, further divide the block data so that each piece fits within a single partition, and transmit the block data to the computing core in multiple transfers. The two partitions carry out transfer and computation in parallel.

[0016] Step 4: Create a shared memory region. Each computing core in the same core group sums the data it has calculated and places the result in the shared memory region. A designated computing core then sums these results to obtain the residual and compares it with the residual reduction requirement. If the requirement is not met, Steps 2 to 4 are repeated until the residual meets the requirement, yielding the solution of the linear system.
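The storage format of Step 1 and the whole-row blocks of Step 2 can be illustrated with a small sketch. It assumes that "row-compressed storage" refers to the usual compressed sparse row (CSR) layout with three arrays. The helper names, the 5x5 matrix, and the two row ranges are hypothetical and chosen only for illustration.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix into CSR arrays (row_ptr, col_idx, values)."""
    row_ptr, col_idx, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(values))
    return row_ptr, col_idx, values


def csr_row_block(row_ptr, col_idx, values, start, end):
    """Extract whole rows [start, end) as a self-contained CSR block."""
    lo, hi = row_ptr[start], row_ptr[end]
    local_ptr = [p - lo for p in row_ptr[start:end + 1]]  # re-base offsets to 0
    return local_ptr, col_idx[lo:hi], values[lo:hi]


# Hypothetical tridiagonal matrix, typical of a 1D discretization.
dense = [
    [2, -1, 0, 0, 0],
    [-1, 2, -1, 0, 0],
    [0, -1, 2, -1, 0],
    [0, 0, -1, 2, -1],
    [0, 0, 0, -1, 2],
]
row_ptr, col_idx, values = dense_to_csr(dense)

# Two whole-row blocks: rows 0..2 and rows 3..4.
row_ranges = [(0, 3), (3, 5)]
blocks = [csr_row_block(row_ptr, col_idx, values, s, e) for s, e in row_ranges]
```

Because a block never splits a row, each block's nonzeros are one contiguous slice of `col_idx` and `values`, which is the property that makes a single long contiguous transfer per block possible.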

[0017] Optionally, the calculation process can be analyzed and hotspots can be located by combining the swprof tool with manual instrumentation to find the computationally intensive parts of the program.

[0018] Optionally, when the computing core retrieves block data:

[0019] First, the computational core uses the DMA channel to obtain the total number of rows in the coefficient matrix and the starting address of the data;

[0020] Each computing core obtains the starting index size through the starting index calculation formula, and uses DMA to acquire data based on the data's starting address plus the calculated starting index size.

[0021] The formula for calculating the starting index is:

[0022] (formula not reproduced in the source text)

[0023] Where ROWS is the total number of rows in the coefficient matrix; Num_cpe is the number of computing cores enabled; ID_cpe is the index number of the current computing core (the 64 computing cores have index numbers ranging from 0 to 63); MIN is the minimum function; and MOD is the modulo function, which gives the remainder of ROWS divided by Num_cpe.
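The formula itself is missing from this text, so the following is only a plausible reading built from the symbols defined above, not the patent's confirmed expression. A common uniform row partition that uses exactly these quantities is start = ID_cpe × floor(ROWS / Num_cpe) + MIN(ID_cpe, MOD(ROWS, Num_cpe)), which gives the first MOD(ROWS, Num_cpe) cores one extra row each. The function names below are hypothetical.

```python
def start_index(rows, num_cpe, id_cpe):
    """Assumed starting row for core id_cpe under a uniform row partition."""
    return id_cpe * (rows // num_cpe) + min(id_cpe, rows % num_cpe)


def row_range(rows, num_cpe, id_cpe):
    """Half-open row range [start, end) owned by core id_cpe."""
    start = start_index(rows, num_cpe, id_cpe)
    end = start_index(rows, num_cpe, id_cpe + 1)  # next core's start
    return start, end


# Example: 1000 rows over 64 computing cores (indices 0..63).
ranges = [row_range(1000, 64, cid) for cid in range(64)]
sizes = [end - start for start, end in ranges]
```

With 1000 rows and 64 cores, 1000 = 64 × 15 + 40, so under this assumed formula cores 0 to 39 receive 16 rows and cores 40 to 63 receive 15 rows, and the ranges tile the matrix without gaps or overlap.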

[0024] Optionally, when the computing core retrieves data from main memory, it uses the dma_get and dma_put interfaces to transfer contiguous long data in memory through the DMA engine.

[0025] Optionally, the amount of data transferred to the computing core each time is less than the capacity of a single partition space established in the LDM space. The two partitions are transferred alternately to achieve parallel computing and data transfer within a single computing core.

[0026] Optionally, in the management core, multiple ordinary data types can be merged into a single long data type, and a single instruction statement can be used to operate on the long data type, achieving data-level parallelism of single instruction stream and multiple data streams.

[0027] Optionally, in data-level parallelism, first set a threshold N and a long data length L, and use conditional judgments and loop processing to construct a complete data structure:

[0028] When the amount of data to be calculated exceeds the threshold N, a merging operation is performed, and the data is transformed by iterative judgment.

[0029] When the amount of data to be calculated reaches the threshold N but is less than the long data length L, fill the empty spaces with 0 to reach the long data length L, and then transform the data.

[0030] If the amount of data to be calculated is less than the threshold N, no conversion is performed.

[0031] The second aspect of this invention provides a PCG Shenwei many-core optimization system for shallow water equations.

[0032] The PCG Shenwei many-core optimization system for shallow water equations includes:

[0033] The reading module is configured to read the coefficient matrix of the computationally intensive algorithm part of the program in a row-compressed storage format, read in the right-hand side, and set basic conditions, including residual descent requirements and maximum number of iterations.

[0034] The data partitioning module is configured to: in the management core, partition the coefficient matrix based on the entire row, and evenly divide it into blocks in the row direction to obtain block data;

[0035] The partitioned parallel module is configured to: evenly divide the LDM of the computing core into two partitions, further divide the block data to meet the space of a single partition, and transmit the block data to the computing core in multiple times, with the two partitions performing the transmission and computing processes in parallel;

[0036] The iterative judgment module is configured to: create a shared memory region; have each computing core in the same core group sum the data it has calculated and place the result in the shared memory region; have the designated computing core sum these results to obtain the residual; and compare the residual with the set residual reduction requirement to determine whether the condition is met. If not, processing is repeated from the data partitioning module through the iterative judgment module until the residual meets the residual reduction requirement, yielding the solution of the linear system.

[0037] A third aspect of the present invention provides a computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the steps of the PCG Shenwei many-core optimization method for shallow water equations as described in the first aspect of the present invention.

[0038] A fourth aspect of the present invention provides an electronic device including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps in the PCG Shenwei many-core optimization method for shallow water equations as described in the first aspect of the present invention.

[0039] The above one or more technical solutions have the following beneficial effects:

[0040] This invention proposes a PCG many-core optimization method and system for shallow water equations, which sequentially optimizes data partitioning, memory access, partition buffering, data-level parallelism, and communication in the PCG iterative algorithm. It provides a reliable optimization method for the development of the next-generation Sunway supercomputer platform.

[0041] The task partitioning of the PCG algorithm computation data can fully utilize the chip performance of the Shenwei 26010pro processor while taking into account the data-level parallelism capability. Using a row-based partitioning method can improve the computational performance of sparse matrices.

[0042] Memory access optimization: DMA technology is used to transfer contiguous data in memory, avoiding frequent discrete memory accesses and improving memory access performance;

[0043] Using data-level parallelism can increase running speed while reducing computation instructions;

[0044] Partition buffer optimization: Simultaneously using memory access units and computing units to achieve parallel data transmission and computation, with communication time and computation time masking each other;

[0045] Communication optimization: Reuse data during iteration to reduce data communication and computation waiting time.

[0046] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Description of the Drawings

[0047] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0048] Figure 1 is a flowchart of the method in the first embodiment.

[0049] Figure 2 is a schematic diagram of data partitioning and transmission in the first embodiment.

[0050] Figure 3 is a schematic diagram of single instruction stream, multiple data stream (SIMD) processing in the first embodiment.

[0051] Figure 4 is a data flow diagram of the communication optimization process in the first embodiment.

[0052] Figure 5 shows the acceleration effect of the first embodiment.

Detailed Implementation

[0053] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0054] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.

[0055] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0056] Example 1

[0057] This embodiment discloses the PCG Shenwei many-core optimization method for shallow water equations.

[0058] As shown in Figure 1, the PCG Shenwei many-core optimization method for shallow water equations includes the following steps:

[0059] Step 1: Since the discrete matrix of the shallow water equations is usually sparse, the coefficient matrix of the computationally intensive part of the program is read in row-compressed storage format, and the right-hand side is read in. Basic conditions are set, including the residual reduction requirement and the maximum number of iterations.

[0060] Step 2: In the management core, partition the coefficient matrix by whole rows, dividing it evenly along the row direction to obtain block data.

[0061] Step 3: Divide the LDM of each computing core evenly into two partitions, further divide the block data so that each piece fits within a single partition, and transmit the block data to the computing core in multiple transfers. The two partitions carry out transfer and computation in parallel.

[0062] Step 4: Create a shared memory region. Each computing core in the same core group sums the data it has calculated and places the result in the shared memory region. A designated computing core then sums these results to obtain the residual and compares it with the residual reduction requirement. If the requirement is not met, Steps 2 to 4 are repeated until the residual meets the requirement, yielding the solution of the linear system.

[0063] The residual is the sum of the data calculated by each computational core within the core group.

[0064] Shallow water equations typically need to be discretized into a system of linear algebraic equations. For large systems of linear algebraic equations, traditional direct solution methods may be limited by computational cost. PCG is a commonly used choice. As an iterative solution method, PCG improves the efficiency of the solution by gradually approximating the solution in each iteration step.

[0065] This invention provides a diagonal preconditioning PCG method based on the Sunway supercomputer architecture. Under the Sunway architecture, diagonal preconditioning PCG suffers from problems such as inability to utilize the Sunway acceleration core, slow data interaction, and the need for frequent data transfers. The optimized method can fully utilize the computing performance of the Sunway supercomputer by optimizing data partitioning, memory access, and communication performance to improve computing speed and reduce waiting time.

[0066] The improved diagonal preconditioning PCG algorithm of this invention can fully utilize the computing cores of the Sunway supercomputer, rationally allocating the solution task to the computing cores and accelerating the solution process through the powerful computing cores. During optimization, appropriate tools are first used for program analysis, employing a combination of swprof and manual instrumentation to analyze the computational process and locate hotspots. In computationally intensive parts of the program, the data is rationally partitioned to facilitate parallelism. Based on the specific mathematical process, the diagonal preconditioning PCG algorithm can be decomposed into operations such as sparse matrix-vector multiplication, element-wise vector multiplication, vector addition, and residual calculation.
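The decomposition into sparse matrix-vector multiplication, element-wise vector multiplication, vector addition, and residual calculation can be seen in a plain serial version of the algorithm. The sketch below is a textbook diagonally preconditioned conjugate gradient loop on CSR arrays, not the optimized Sunway implementation. The function names, the relative-residual stopping rule, and the 5x5 test system are assumptions made for illustration.

```python
import math


def spmv(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diagonal_pcg(row_ptr, col_idx, values, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; returns (x, iterations)."""
    n = len(b)
    inv_diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                inv_diag[i] = 1.0 / values[k]
    x = [0.0] * n
    r = list(b)                                   # r = b - A*0
    b_norm = math.sqrt(dot(b, b)) or 1.0
    if math.sqrt(dot(r, r)) <= tol * b_norm:
        return x, 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # element-wise multiply
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        q = spmv(row_ptr, col_idx, values, p)     # SpMV
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # vector addition
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:  # residual check
            return x, it
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical 5x5 tridiagonal system whose exact solution is [1, 2, 3, 4, 5].
row_ptr = [0, 2, 5, 8, 11, 13]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
b = [0.0, 0.0, 0.0, 0.0, 6.0]
x, iterations = diagonal_pcg(row_ptr, col_idx, values, b)
```

Each iteration contains one SpMV, two dot products (reductions), and three vector updates, which is why the optimizations that follow target row partitioning, data transfer, vectorization, and reduction separately.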

[0067] For the specific computational behavior, optimization methods can be categorized into the following schemes:

[0068] 1. Use row-based data partitioning

[0069] To accelerate the program in parallel, row-based data partitioning is used to leverage data-level parallelism in sparse matrix-vector multiplication operations. In a coefficient matrix with an uneven distribution of non-zero elements, rows with more non-zero elements are easier to access using data-level parallelism, thus balancing the computation time of different partitions and reducing the waiting time of computation units. Whole-row data partitioning reduces data exchange in matrix-vector operations compared to non-whole-row partitioning.

[0070] We use a uniform row partitioning method for matrix division. The computing core uses a DMA channel to obtain the total number of rows in the matrix and the starting address of the data. Each computing core obtains its starting index using the starting index calculation formula, where ROWS is the total number of rows in the matrix; Num_cpe is the number of computing cores enabled; ID_cpe is the index number of the current computing core (the 64 computing cores have index numbers ranging from 0 to 63); MIN is the minimum function; and MOD is the modulo function, which gives the remainder of ROWS divided by Num_cpe. The data is then acquired using DMA, based on the data's starting address plus the calculated starting index.

[0071] Starting index: (formula not reproduced in the source text)

[0072] 2. Use the partitioned DMA method for data transfer.

[0073] When the computing core retrieves data from main memory, it uses the dma_get and dma_put interfaces to transfer contiguous long data in memory via the DMA engine, reducing latency caused by discrete memory access. Combined with a partitioned buffering strategy, the LDM is divided into two distinct regions, allowing each region to perform computation and DMA transfer processes simultaneously.

[0074] As shown in Figure 2, after the data is divided into blocks along the row direction, each block is further subdivided so that every piece is smaller than the capacity of a single partition established in the LDM. The data allocated to each computing core is therefore transmitted in multiple transfers. Alternating transfers between the two partitions achieve parallel computation and data transfer within a single computing core.

[0075] Using a partitioned buffer strategy can mask communication time with computation time, further reducing computation wait time when accessing memory. After using DMA to put data into the LDM, the data access process is changed from direct access to main memory to access to the local LDM, and intermediate computation variables are directly established in the LDM, reducing the main memory access overhead in computation.
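The alternating use of two partitions can be sketched as a double-buffering schedule. The example below only imitates the pattern in Python: a worker thread stands in for the DMA engine, two fixed-size lists stand in for the LDM partitions, and a sum of squares stands in for the real computation. None of these names come from the Sunway programming interface, and Python threads do not deliver the real overlap that dedicated DMA hardware provides.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(main_memory, start, size, buffer):
    """Stand-in for a DMA get: copy one contiguous chunk into a local buffer."""
    chunk = main_memory[start:start + size]
    buffer[:len(chunk)] = chunk
    return len(chunk)


def double_buffered_sum_of_squares(main_memory, partition_size):
    """Process main_memory chunk by chunk while the next chunk is being fetched."""
    buffers = [[0.0] * partition_size, [0.0] * partition_size]
    total = 0.0
    n = len(main_memory)
    if n == 0:
        return total
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, main_memory, 0, partition_size, buffers[0])
        current, offset = 0, 0
        while offset < n:
            count = pending.result()          # wait until the current partition is full
            next_offset = offset + partition_size
            if next_offset < n:
                # Start filling the other partition before computing on this one.
                pending = dma.submit(fetch, main_memory, next_offset,
                                     partition_size, buffers[1 - current])
            total += sum(v * v for v in buffers[current][:count])
            current = 1 - current             # swap roles of the two partitions
            offset = next_offset
    return total


data = [float(i) for i in range(10)]
result = double_buffered_sum_of_squares(data, partition_size=4)
```

A partition is only refilled after the computation on it has finished, so the transfer in flight and the computation never touch the same buffer.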

[0076] 3. Use a multi-level parallel strategy

[0077] Thread-level parallelism and data-level parallelism are used together for acceleration. Thread-level parallelism is achieved by launching the computing cores through the `athread` interface from the management core. On top of this, Single Instruction Multiple Data (SIMD) processing is used for data-level acceleration. SIMD combines multiple ordinary data elements into a single long data type, so that a single instruction operating on the long data type effectively operates on multiple ordinary data elements simultaneously, as shown in Figure 3.

[0078] When using data parallelism in the SpMV process, a threshold is set. When the amount of data to be computed exceeds the threshold, the data is vectorized manually: ordinary double values are converted into the long vector type doublev8, with a loop and conditional checks deciding how each group of values is converted.

[0079] For data with more than 8 entries, the first 8 entries will be converted before calculation;

[0080] For data that reaches the threshold but is less than 8, fill the empty spaces with 0 to reach the length of 8, and then convert the data.

[0081] If the amount of data to be calculated is less than the threshold, no transformation is used.

[0082] All calculations utilize conditional statements and loop processing to construct complete data structures, thereby accelerating the calculation speed.
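The three cases above can be sketched as a packing routine followed by a lane-wise multiply-accumulate. This is one possible reading of the rules, applied to the remainder left after full groups of 8 have been formed. The names `pack_row` and `simd_style_dot` and the threshold value 4 are hypothetical, and Python lists merely imitate the eight lanes of doublev8.

```python
LANES = 8  # a doublev8 vector holds eight doubles


def pack_row(vals, threshold):
    """Return (vector_groups, scalar_tail) for one row's values.

    Full groups of 8 are always vectorized. A remainder of at least
    `threshold` values is zero-padded to 8; a shorter remainder stays scalar.
    Assumes 1 <= threshold <= LANES.
    """
    groups = []
    i = 0
    while len(vals) - i >= LANES:
        groups.append(vals[i:i + LANES])
        i += LANES
    rest = vals[i:]
    if len(rest) >= threshold:
        groups.append(rest + [0.0] * (LANES - len(rest)))
        rest = []
    return groups, rest


def simd_style_dot(a, b, threshold=4):
    """Dot product that accumulates lane by lane over the packed groups."""
    groups_a, tail_a = pack_row(a, threshold)
    groups_b, tail_b = pack_row(b, threshold)
    acc = [0.0] * LANES
    for va, vb in zip(groups_a, groups_b):
        acc = [s + x * y for s, x, y in zip(acc, va, vb)]  # one "vector" multiply-add
    return sum(acc) + sum(x * y for x, y in zip(tail_a, tail_b))


thirteen = [float(i) for i in range(1, 14)]
packed = pack_row(thirteen, threshold=4)
short = pack_row([1.0, 2.0, 3.0], threshold=4)
value = simd_style_dot([float(i) for i in range(1, 12)], [2.0] * 11)
```

Zero padding is safe for a dot product because the padded lanes contribute nothing to the sum, while skipping vectorization for very short remainders avoids paying the conversion cost for little benefit.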

[0083] 4. Communication optimization

[0084] During PCG iteration, reduction operations are used frequently: the data already distributed to the computing cores must be computed locally and then merged. This invention designs a data reduction method based on the LDM (Local Data Memory). By logically merging a portion of the LDM of every computing core in the core group, a shared memory area is created that all computing cores can access with low latency. After each computing core sums its own data, the result is placed into the shared memory area. Once all cores have finished, one computing core sums the results and compares the residual.

[0085] Optimizing communication in this way avoids data communication with main memory, which speeds up program execution and reduces the time spent waiting for data. The data flow in memory during this process is shown in Figure 4.
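The two-stage reduction described in this section can be sketched serially as follows. A plain list with one slot per core stands in for the shared region formed from the LDMs, and a loop stands in for the computing cores running in parallel. The function names, the even slicing of the vector, and the use of the Euclidean norm for the residual are assumptions made for illustration.

```python
import math


def local_partial_sums(r, num_cores):
    """Stage 1: each core reduces its own contiguous slice of r to one number."""
    base, extra = divmod(len(r), num_cores)
    shared = [0.0] * num_cores          # stand-in for the shared memory region
    start = 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        shared[core] = sum(v * v for v in r[start:start + size])
        start += size
    return shared


def designated_core_reduce(shared, tolerance):
    """Stage 2: one core sums the per-core results and tests convergence."""
    residual = math.sqrt(sum(shared))
    return residual, residual <= tolerance


shared = local_partial_sums([1.0, 2.0, 2.0, 4.0, 0.0, 0.0], num_cores=4)
residual, converged = designated_core_reduce(shared, tolerance=1e-6)
```

Only one number per core crosses into the shared region, so the cost of the merge is proportional to the number of cores rather than to the vector length.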

[0086] 5. Other optimizations

[0087] Multiply-add optimization: When updating the solution vector and calculating residuals, the multiplication and addition are combined into a single statement by changing the calculation order and merging calculations. Fused multiply-add (FMA) is enabled at compile time, so that the separate multiplication and addition instructions are merged into a single multiply-add instruction, making full use of the hardware multiply-add unit to accelerate the calculation.

[0088] Loop unrolling optimization: Copy the code of the loop body multiple times and arrange it in order, and then adjust the loop termination condition accordingly to reduce the control overhead of the loop structure.

[0089] Compute core cache prefetch optimization: Configure a portion of the idle space in LDM as cache space. When the communication channel is not in use, the data to be used later is prefetched from main memory into the compute core cache. By prefetching instructions and data into the cache, communication time is reduced.
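The source-level shape of the first two optimizations in this list, a merged multiply-add statement and a loop body unrolled four times, can be shown on the residual update r = r - alpha * q. This is only a sketch: the function name and the unroll factor of 4 are hypothetical, and Python itself does not emit FMA instructions, so the example shows the code structure that a compiler with FMA enabled could exploit rather than any speedup.

```python
def update_residual_unrolled(r, q, alpha):
    """In-place r[i] = r[i] - alpha * q[i], with the loop body unrolled 4x."""
    n = len(r)
    limit = n - n % 4                    # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        # Each line is a single multiply-add expression.
        r[i] = r[i] - alpha * q[i]
        r[i + 1] = r[i + 1] - alpha * q[i + 1]
        r[i + 2] = r[i + 2] - alpha * q[i + 2]
        r[i + 3] = r[i + 3] - alpha * q[i + 3]
        i += 4                           # one loop test per four elements
    for j in range(limit, n):            # remainder loop for the last n % 4 items
        r[j] = r[j] - alpha * q[j]
    return r


updated = update_residual_unrolled([10.0] * 6, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2.0)
```

The remainder loop is what "adjusting the loop termination condition" amounts to in practice: the unrolled loop stops at a multiple of the unroll factor and the leftover elements are handled separately.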

[0090] Experimental verification:

[0091] As shown in Figure 5, the Shenwei 26010pro processor was used to test computational examples at three different scales. In the single-core-group test, the optimized algorithm made full use of the processor's performance compared with the original algorithm. Runtime analysis shows that the average speedup ratio reached 28.1.

[0092] The optimized diagonal preconditioning PCG algorithm of this invention can fully utilize the performance of the computing core and dedicated vector register unit, multiply-accumulate unit and communication unit in the Shenwei 26010pro processor to accelerate the calculation speed of shallow water equations.

[0093] Example 2

[0094] This embodiment discloses the PCG Shenwei many-core optimization system for shallow water equations.

[0095] The PCG Shenwei many-core optimization system for shallow water equations includes:

[0096] The reading module is configured to read the coefficient matrix of the computationally intensive algorithm part of the program in a row-compressed storage format, read in the right-hand side, and set basic conditions, including residual descent requirements and maximum number of iterations.

[0097] The data partitioning module is configured to: in the management core, partition the coefficient matrix based on the entire row, and evenly divide it into blocks in the row direction to obtain block data;

[0098] The partitioned parallel module is configured to: evenly divide the LDM of the computing core into two partitions, further divide the block data to meet the space of a single partition, and transmit the block data to the computing core in multiple times, with the two partitions performing the transmission and computing processes in parallel;

[0099] The iterative judgment module is configured to: create a shared memory region; have each computing core in the same core group sum the data it has calculated and place the result in the shared memory region; have the designated computing core sum these results to obtain the residual; and compare the residual with the set residual reduction requirement to determine whether the condition is met. If not, processing is repeated from the data partitioning module through the iterative judgment module until the residual meets the residual reduction requirement, yielding the solution of the linear system.

[0100] Example 3

[0101] The purpose of this embodiment is to provide a computer-readable storage medium.

[0102] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps in the PCG Shenwei many-core optimization method for shallow water equations as described in Embodiment 1 of this disclosure.

[0103] Example 4

[0104] The purpose of this embodiment is to provide an electronic device.

[0105] An electronic device includes a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps in the PCG Shenwei many-core optimization method for shallow water equations as described in Embodiment 1 of this disclosure.

[0106] The steps and methods involved in the apparatuses of Embodiments 2, 3, and 4 above correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.

[0107] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.

[0108] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Various modifications or variations that those skilled in the art can make without creative effort, based on the technical solutions of the present invention, remain within the scope of protection of the present invention.

Claims

1. A PCG Sunway many-core optimization method for shallow water equations, characterized in that it includes the following steps:
Step 1: Read the coefficient matrix of the computationally intensive algorithm part of the program in row-compressed storage format, and read in the right-hand side. Set the basic conditions, including the residual reduction requirement and the maximum number of iterations.
Step 2: In the management core, partition the coefficient matrix by whole rows, dividing it evenly into blocks along the row direction to obtain block data.
Step 3: Divide the LDM of the computing core evenly into two partitions, further divide the block data so that each piece fits within a single partition, and transmit the block data to the computing core in multiple transfers. The two partitions perform the transmission and computing processes in parallel.
Step 4: Create a shared memory region. Sum the data calculated by each computing core in the same core group and place the sums in the shared memory region. The specified computing core sums these values to obtain the residual. Compare the residual with the set residual reduction requirement to determine whether the condition is met. If not, repeat step 2 to step 4 until the residual obtained after an iteration meets the residual reduction requirement, and obtain the solution to the linear equation system.
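For orientation, the numerical core that steps 1 to 4 accelerate can be sketched as a serial diagonally preconditioned conjugate gradient solver. The sketch assumes that "row-compressed storage format" corresponds to the usual compressed sparse row layout, here named `indptr`, `indices`, and `data`; those names, the function names, and the relative stopping test are illustrative assumptions, and none of the Sunway-specific partitioning or DMA handling is modelled.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a matrix A stored row by row in compressed form."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y

def csr_diagonal(indptr, indices, data):
    """Extract the main diagonal, used as the preconditioner."""
    n = len(indptr) - 1
    d = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                d[i] = data[k]
    return d

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def pcg_diag(indptr, indices, data, b, rel_tol=1e-10, max_iter=100):
    """Solve A x = b (A symmetric positive definite); returns (x, iterations)."""
    n = len(b)
    diag = csr_diagonal(indptr, indices, data)
    x = [0.0] * n
    r = list(b)                                  # r = b - A @ 0
    r0_norm = dot(r, r) ** 0.5
    if r0_norm == 0.0:
        return x, 0
    z = [r[i] / diag[i] for i in range(n)]       # apply diagonal preconditioner
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        ap = csr_matvec(indptr, indices, data, p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= rel_tol * r0_norm:  # residual reduction check
            return x, it
        z = [r[i] / diag[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter
```

The matrix-vector product and the dot products are the operations that the claimed method distributes by rows across the computing cores.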

2. The PCG Sunway many-core optimization method for shallow water equations as described in claim 1, characterized in that the calculation process and hotspots are analyzed and identified by combining the swprof tool with manual instrumentation, thereby finding the computationally intensive parts of the program.

3. The PCG Sunway many-core optimization method for shallow water equations of claim 1, wherein, when a computing core retrieves block data: first, the computing core uses the DMA channel to obtain the total number of rows in the coefficient matrix and the starting address of the data; each computing core then obtains its starting index through the starting index calculation formula, and uses DMA to acquire data from the data's starting address plus the calculated starting index. The starting index formula is expressed in terms of the following quantities: ROWS is the total number of rows in the coefficient matrix; Num_cpe is the number of computing cores enabled; ID_cpe is the index number of the current computing core, with the 64 computing cores numbered from 0 to 63; MIN is the minimum function; and MOD is the modulo function, which returns the remainder of ROWS divided by Num_cpe.
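The starting index formula itself does not appear in the text of claim 3. One widely used balanced row partition that is built from exactly the listed quantities is start = ID_cpe * floor(ROWS / Num_cpe) + MIN(ID_cpe, MOD(ROWS, Num_cpe)); the sketch below assumes this form, which may differ from the formula in the original filing.

```python
def start_index(rows, num_cpe, id_cpe):
    """Assumed balanced partition: the first (rows % num_cpe) cores get one extra row."""
    return id_cpe * (rows // num_cpe) + min(id_cpe, rows % num_cpe)

def row_range(rows, num_cpe, id_cpe):
    """Half-open row interval [start, end) owned by core id_cpe."""
    start = start_index(rows, num_cpe, id_cpe)
    end = start_index(rows, num_cpe, id_cpe + 1)  # next core's start; equals rows for the last core
    return start, end

# Example: 10 rows over 4 cores
ranges = [row_range(10, 4, cid) for cid in range(4)]
print(ranges)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Under this assumed formula the block sizes differ by at most one row, and each core can compute its own offset from ROWS, Num_cpe, and ID_cpe alone, without further communication.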

4. The PCG Sunway many-core optimization method for shallow water equations of claim 1, wherein, when the computing core retrieves data from main memory, it uses the dma_get and dma_put interfaces to transfer long runs of data that are contiguous in memory through the DMA engine.

5. The PCG Sunway many-core optimization method for shallow water equations of claim 1, wherein the amount of data transferred to the computing core in each chunk is less than the capacity of a single partition established in the LDM space, and the two partitions are used alternately for transfers so that computation and data transfer proceed in parallel within a single computing core.
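The alternating use of the two LDM partitions is a double-buffering pattern, which can be modelled as follows. The sketch is sequential, so the overlap of transfer and computation is only represented by the order of events in `trace`; the function names, the byte sizes, and the `kernel` callback are hypothetical, and a list slice stands in for an asynchronous DMA transfer.

```python
def chunk_bounds(n_items, item_bytes, ldm_bytes):
    """Split n_items into chunks that are strictly smaller than half the LDM."""
    partition_bytes = ldm_bytes // 2
    max_items = (partition_bytes - 1) // item_bytes
    if max_items < 1:
        raise ValueError("partition too small for a single item")
    return [(lo, min(lo + max_items, n_items))
            for lo in range(0, n_items, max_items)]

def process_block(block, item_bytes, ldm_bytes, kernel):
    """Apply kernel to every item, alternating between two buffers."""
    chunks = chunk_bounds(len(block), item_bytes, ldm_bytes)
    buffers = [None, None]   # stand-ins for the two LDM partitions
    results, trace = [], []
    if not chunks:
        return results, trace
    lo, hi = chunks[0]
    buffers[0] = block[lo:hi]            # prologue: load the first chunk
    for k in range(len(chunks)):
        cur, nxt = k % 2, (k + 1) % 2
        if k + 1 < len(chunks):
            lo, hi = chunks[k + 1]
            buffers[nxt] = block[lo:hi]  # on hardware: start DMA for the next chunk
            trace.append(("load", k + 1, nxt))
        results.extend(kernel(v) for v in buffers[cur])  # compute on the current chunk
        trace.append(("compute", k, cur))
    return results, trace

results, trace = process_block(list(range(10)), 8, 64, lambda v: 2 * v)
```

With 8-byte items and a 64-byte LDM in this toy setting, each partition holds 32 bytes, so each chunk carries at most three items and the compute steps alternate between buffer 0 and buffer 1.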

6. The PCG Sunway many-core optimization method for shallow water equations of claim 1, wherein, in the management core, multiple values of an ordinary data type are merged into a long data type, and a single instruction statement operates on the long data type, realizing single-instruction, multiple-data (SIMD) data-level parallelism.

7. The PCG Sunway many-core optimization method for shallow water equations of claim 6, wherein, in data-level parallelism, a threshold N and a long data length L are first set, and conditional statements and loop processing are used to construct a complete data structure: when the amount of data to be calculated exceeds the threshold N, a merging operation is performed and the data is converted through iterative judgment; when the amount of data to be calculated reaches the threshold N but is less than the long data length L, the empty positions are filled with 0 to reach the long data length L, and the data is then converted; if the amount of data to be calculated is less than the threshold N, no conversion is performed.
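One possible reading of the threshold rule in claim 7 is sketched below: full groups of L values are merged into long data, a final group of at least N but fewer than L values is padded with zeros, and a final group of fewer than N values is left unconverted. The function name `pack_for_simd`, the requirement that N not exceed L, and the use of Python lists in place of hardware vector registers are assumptions made for illustration.

```python
def pack_for_simd(values, threshold_n, lane_len_l):
    """Return (vectors, scalars): packed groups of length L and leftover scalars."""
    if not 0 < threshold_n <= lane_len_l:
        raise ValueError("expected 0 < N <= L")
    vectors = []
    pos = 0
    n = len(values)
    # Merge full groups of L values while enough data remains.
    while n - pos >= lane_len_l:
        vectors.append(list(values[pos:pos + lane_len_l]))
        pos += lane_len_l
    remaining = n - pos
    # Tail of at least N (but fewer than L) values: pad with zeros up to L.
    if remaining >= threshold_n:
        vectors.append(list(values[pos:]) + [0.0] * (lane_len_l - remaining))
        pos = n
    # Tail of fewer than N values: leave as ordinary scalars.
    scalars = list(values[pos:])
    return vectors, scalars

vecs_a, rest_a = pack_for_simd([float(i) for i in range(1, 12)], 3, 4)
vecs_b, rest_b = pack_for_simd([float(i) for i in range(1, 11)], 3, 4)
```

Zero padding keeps sums and dot products unchanged, which is why a padded tail can be processed with the same vector instruction as a full group.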

8. A PCG Sunway many-core optimization system for shallow water equations, characterized in that it includes:
a reading module, configured to read the coefficient matrix of the computationally intensive algorithm part of the program in row-compressed storage format, read in the right-hand side, and set basic conditions, including the residual reduction requirement and the maximum number of iterations;
a data partitioning module, configured to: in the management core, partition the coefficient matrix by whole rows, dividing it evenly into blocks along the row direction to obtain block data;
a partitioned parallel module, configured to: evenly divide the LDM of the computing core into two partitions, further divide the block data so that each piece fits within a single partition, and transmit the block data to the computing core in multiple transfers, with the two partitions performing the transmission and computing processes in parallel;
an iterative judgment module, configured to: create a shared memory region, sum the data calculated by each computing core in the same core group and place the sums in the shared memory region, have the specified computing core sum these values to obtain the residual, and compare the residual with the set residual reduction requirement to determine whether the condition is met; if not, the process returns to the data partitioning module and repeats through the iterative judgment module until the residual obtained after an iteration meets the residual reduction requirement, and the solution of the linear equation system is obtained.

9. A computer-readable storage medium having a program stored thereon, characterized in that, when executed by a processor, the program implements the steps of the PCG Sunway many-core optimization method for shallow water equations as described in any one of claims 1-7.

10. An electronic device comprising a memory, a processor, and a program stored in the memory and capable of running on the processor, characterized in that, when the processor executes the program, it implements the steps of the PCG Sunway many-core optimization method for shallow water equations as described in any one of claims 1-7.