A GPU optimization method and system based on heterogeneous regional decomposition of sea wave model

By combining heterogeneous region decomposition and dynamic task partitioning with page-locked (pinned) memory and multi-stream asynchronous parallelism, the method addresses insufficient GPU support and high data transfer overhead in wave simulation, overlaps wave model computation with communication, and improves computing performance on heterogeneous platforms.

CN122240316A · Pending · Publication Date: 2026-06-19 · INST OF SOFTWARE - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: INST OF SOFTWARE - CHINESE ACAD OF SCI
Filing Date: 2026-03-20
Publication Date: 2026-06-19

Application Information

IPC: G06F9/50; G06T1/20

AI Tagging

Application Domain

Resource allocation; Processor architectures/configuration

Technology Topics

Computational science; Sea waves

AI Technical Summary

Technical Problem

Existing wave numerical simulation techniques lack support for heterogeneous GPU devices, have low computational density, suffer from unbalanced load due to fixed region decomposition strategies, and have high data transfer overhead between CPU and GPU, making it difficult to meet the real-time requirements of high-resolution wave simulation.

Method used

A heterogeneous region decomposition method dynamically divides the computing region into inner ring, outer ring, and halo sub-regions. Page-locked (pinned) memory and multi-stream asynchronous parallelism are used to overlap computation with communication, while performance indicators are monitored in real time and the partitioning and task binding are adjusted dynamically.

Benefits of technology

It significantly improves the computational efficiency and real-time performance of high-resolution wave model simulation on heterogeneous platforms, reduces data transfer overhead, and achieves balanced resource utilization and higher computational throughput.

Abstract

This invention discloses a GPU optimization method and system for wave models based on heterogeneous region decomposition, belonging to the field of scientific computing technology. To address the load imbalance and high communication overhead of existing technologies in heterogeneous environments, the invention determines the mesh partitioning and time step based on initial parameters and resource information; performs a secondary partitioning based on resource performance characteristics to obtain sub-regions with different computational dependencies; maps and binds these sub-regions to CPUs or GPUs according to communication capabilities and computational density; configures page-locked (pinned) memory and multi-stream parameters so that computation and communication run asynchronously in parallel; and dynamically adjusts the partitioning range and binding relationships based on real-time performance monitoring. The invention achieves a high degree of overlap between computation and communication, eliminates computational bottlenecks on heterogeneous platforms, and significantly improves the parallel speedup of wave models.

Description

Technical Field

[0001] This invention belongs to the field of scientific computing technology, and specifically relates to a GPU optimization method and system for wave models based on heterogeneous region decomposition.

Background Technology

[0002] In existing numerical wave calculations, performance optimization typically relies on homogeneous optimization or on OpenACC, that is, one of the following two approaches. First, homogeneous load balancing: to balance the load among processes, existing optimization methods usually apply different region decomposition strategies within a homogeneous environment. Second, OpenACC directive-based optimization: adding simple directives allows the code to be modified quickly and the program accelerated. Developers only need to add OpenACC directives before and after the loops to be optimized.

[0003] However, existing technologies have the following drawbacks:
1. Lack of heterogeneous computing support: Most existing wave numerical simulation technologies only support homogeneous computing devices (such as CPU clusters) and have insufficient support for heterogeneous devices such as GPUs, making it difficult to meet the real-time requirements of high-resolution wave simulation.
2. Low computational density: Simple GPU implementations typically have low computational density and are dominated by memory-intensive computation, so they cannot fully utilize the parallel computing capabilities of heterogeneous devices such as GPUs, which limits further improvements in computational efficiency.
3. Fixed region decomposition strategy: Existing technologies use fixed region decomposition strategies designed for homogeneous devices, which cannot adapt to the characteristics of heterogeneous computing devices and result in unbalanced computational loads and low resource utilization.
4. High data transfer overhead: In heterogeneous computing environments, the data transfer overhead between CPUs and GPUs is significant, and existing technologies lack efficient data transfer optimization mechanisms, further reducing overall computational performance.

Summary of the Invention

[0004] The purpose of this invention is to address the technical problems in existing wave numerical simulations, such as insufficient support for heterogeneous GPU devices, low computational density, load imbalance caused by fixed region decomposition strategies, and high data transfer overhead between CPU and GPU. This invention proposes a GPU optimization method and system for wave models based on heterogeneous region decomposition. By using dynamic task partitioning and computation and communication hiding techniques, this invention effectively eliminates load imbalance and transmission delay, significantly improving the computational efficiency and real-time performance of high-resolution wave model simulations on heterogeneous platforms.

[0005] To achieve the above objectives, the present invention adopts the following technical solution.

[0006] A GPU optimization method for wave models based on heterogeneous region decomposition includes the following steps:
determining the mesh generation result and the time step based on the initial parameters of the wave model and the computational resource information of the heterogeneous platform;
dividing the mesh generation result according to the performance characteristics and load capacity given by the computational resource information, to obtain multiple sub-regions with different computational dependency attributes;
mapping and binding the different sub-regions to the corresponding CPUs or GPUs based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions;
configuring page-locked memory and multi-stream asynchronous parallel parameters for the wave model calculation, and using these parameters to execute the calculation and communication tasks of each sub-region, so that calculation and communication are processed in parallel;
monitoring the performance metrics of the CPU and GPU in real time during the calculation, and dynamically adjusting the division range of the sub-regions and the task binding relationship according to these metrics until they reach the preset optimization target.

[0007] Furthermore, determining the mesh generation result and the time step based on the initial parameters of the wave model and the computational resource information of the heterogeneous platform includes: inputting the initial wave state, topography, wind field, wave energy packet velocity, and background current velocity; defining the wave calculation area and setting the spatial resolution; calculating the time step from the grid topology of the wave calculation region, the background current velocity, and the wave energy packet velocity, according to the CFL condition used as the stability and convergence criterion; obtaining the number of CPU cores, clock speed, and instruction set through processor identification tools, and the number and type of GPUs through the GPU driver interface; and evaluating the theoretical computing power of the CPU as the product of the number of cores, the single-core clock speed, and the number of floating-point operations per cycle, and obtaining the network bandwidth, communication latency, and data transmission rate of the heterogeneous platform.

[0008] Furthermore, calculating the time step according to the CFL condition used as the stability and convergence criterion specifically includes: determining the minimum spatial spacing of the grid points of the wave model; obtaining the vector sum of the wave energy packet group velocity and the background current velocity, and calculating the maximum magnitude of this vector sum; and determining the upper limit of the time step from the ratio of the minimum spatial spacing to the maximum magnitude.
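The time-step bound described in [0008] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `cfl_time_step`, the `courant` safety factor, and the sample velocities are assumptions introduced here, and velocities are treated as 2D Cartesian vectors in metres per second.

```python
import math


def cfl_time_step(min_spacing_m, group_velocities, current_velocities, courant=1.0):
    """Upper bound on the time step: dt <= courant * dx_min / max|c_g + u|.

    group_velocities and current_velocities are equal-length sequences of
    (x, y) vectors, one pair per grid point. `courant` is a hypothetical
    safety factor; the source only states the ratio itself.
    """
    max_speed = max(
        math.hypot(cg[0] + u[0], cg[1] + u[1])
        for cg, u in zip(group_velocities, current_velocities)
    )
    if max_speed == 0.0:
        raise ValueError("all velocities are zero, so the CFL bound is undefined")
    return courant * min_spacing_m / max_speed


# Illustrative values only: 10 km minimum spacing, two grid points.
dt_no_current = cfl_time_step(
    10_000.0,
    group_velocities=[(6.0, 0.0), (3.0, 4.0)],
    current_velocities=[(0.0, 0.0), (0.0, 0.0)],
)
dt_with_current = cfl_time_step(
    10_000.0,
    group_velocities=[(6.0, 0.0), (3.0, 4.0)],
    current_velocities=[(0.0, 0.0), (3.0, 4.0)],
)
print(dt_no_current, dt_with_current)
```

A background current aligned with the group velocity raises the maximum propagation speed and therefore tightens the bound, which is why the vector sum rather than the group velocity alone is used.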

[0009] Furthermore, the grid partitioning result is divided according to the performance characteristics and load capacity of the computing resource information to obtain multiple sub-regions with different computing dependency attributes, including: Based on the CPU's logical control characteristics, multi-level cache structure characteristics, and GPU's parallel computing characteristics, and combined with the number of grid points that each computing resource can handle, the grid partitioning results are initially decomposed. Based on whether grid points participate in inter-process communication or source function computation, a heterogeneous region decomposition strategy is executed to further divide the initially decomposed region into an inner ring region that performs only computation tasks, an outer ring region that performs both computation and communication tasks, and a halo region that performs only communication tasks.

[0010] Furthermore, a heterogeneous region decomposition strategy is implemented, specifically including: Identify the set of grid points located at the edge of the computation region that need to exchange boundary data with adjacent processes, and delineate them as the outer ring region; The remaining grid points within the calculation area, excluding the outer ring area, are designated as the inner ring area; A buffer for receiving data from adjacent processes is established outside the outer ring region, and this buffer is designated as the halo region.
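The inner ring, outer ring, and halo classification in [0010] can be illustrated for a single rectangular block. The sketch below assumes a local block of `nx` by `ny` owned points with neighbouring processes on all four sides and a stencil reach of `width` points; the function name `classify_points` and these parameters are illustrative and do not come from the source.

```python
from collections import Counter


def classify_points(nx, ny, width=1):
    """Label every point of the local (nx + 2w) x (ny + 2w) array.

    "halo"  : buffer points outside the owned block that receive neighbour data
    "outer" : owned points within `width` of the block edge (compute + communicate)
    "inner" : remaining owned points (compute only)
    """
    rows, cols = nx + 2 * width, ny + 2 * width
    labels = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Distance from this point to the nearest edge of the local array.
            d = min(i, j, rows - 1 - i, cols - 1 - j)
            if d < width:
                row.append("halo")
            elif d < 2 * width:
                row.append("outer")
            else:
                row.append("inner")
        labels.append(row)
    return labels


labels = classify_points(4, 4, width=1)
counts = Counter(label for row in labels for label in row)
print(counts)
```

For a 4 by 4 owned block with a one-point stencil, only the central 2 by 2 points are free of communication dependencies, which shows how quickly the inner ring shrinks on small blocks.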

[0011] Furthermore, based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions, different sub-regions are mapped and bound to corresponding CPUs or GPUs, including: Acquire heterogeneous platform communication capabilities, including message passing interface communication bandwidth, peripheral component interconnection standard bandwidth, and data transmission rate; Sub-regions with computational density higher than a preset threshold are mapped and bound to the GPU, while data processing sub-regions and logical judgment sub-regions with computational density lower than a preset threshold are mapped and bound to the CPU core.

[0012] Furthermore, sub-regions with computational density exceeding a preset threshold are mapped and bound to the GPU, specifically including: Calculate the number of floating-point operations and branch instructions required per unit grid point in each sub-region; The ratio of the number of floating-point operations to the number of branch instructions is calculated as a computational density evaluation value; Sub-regions whose computational density evaluation value is greater than a preset ratio coefficient are identified as high-load computing tasks and bound to the GPU.
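The density-based binding rule in [0012] reduces to a ratio and a threshold comparison. The sketch below is one possible reading; the `SubRegion` type, the operation counts, and the threshold value are hypothetical, and treating a branch-free sub-region as infinitely dense is an assumption made here to avoid division by zero.

```python
from dataclasses import dataclass


@dataclass
class SubRegion:
    name: str
    flops_per_point: float      # floating-point operations per grid point
    branches_per_point: float   # branch instructions per grid point


def density(region):
    """Computational density = floating-point operations / branch instructions."""
    if region.branches_per_point == 0:
        return float("inf")  # assumption: branch-free work counts as maximally dense
    return region.flops_per_point / region.branches_per_point


def bind(regions, ratio_threshold):
    """Bind sub-regions above the threshold to the GPU, the rest to the CPU."""
    return {
        r.name: ("GPU" if density(r) > ratio_threshold else "CPU")
        for r in regions
    }


# Hypothetical per-point operation counts, for illustration only.
regions = [
    SubRegion("inner", flops_per_point=400.0, branches_per_point=4.0),
    SubRegion("outer", flops_per_point=120.0, branches_per_point=20.0),
]
binding = bind(regions, ratio_threshold=10.0)
print(binding)
```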

[0013] Furthermore, configuring page-locked memory and multi-stream asynchronous parallel parameters for the wave model calculation, and executing the calculation and communication tasks of each sub-region using these parameters so that calculation and communication are processed in parallel, includes: before the wave model calculation starts, allocating page-locked (non-pageable) memory space in host memory; configuring multiple CUDA asynchronous streams and distributing the processing tasks of the inner ring region, outer ring region, and halo region to different asynchronous streams; and using the asynchronous streams to drive the GPU's computational tasks while asynchronous data transfers between the page-locked memory space and GPU device memory proceed over the PCIe bus.

[0014] Furthermore, monitoring the performance metrics of the CPU and GPU in real time during the computation, and dynamically adjusting the sub-region division range and task binding relationship according to these metrics until they reach the preset optimization target, includes: obtaining real-time CPU and GPU utilization data during the computation through the average utilization measurement interface; analyzing the real-time utilization data to determine whether insufficient memory capacity is causing frequent swapping, or whether excessive communication latency is causing the computation to wait; and, for computing devices whose utilization is below the preset target, readjusting the region division and re-executing the task binding.

[0015] A GPU optimization system for wave models based on heterogeneous region decomposition includes the following modules:
a resource initialization configuration module, used to determine the mesh generation result and time step based on the initial parameters of the wave model and the computing resource information of the heterogeneous platform;
a heterogeneous region dynamic partitioning module, used to divide the mesh generation result according to the performance characteristics and load capacity given by the computing resource information, and obtain multiple sub-regions with different computational dependency attributes;
a sub-region task mapping module, used to map and bind different sub-regions to the corresponding CPUs or GPUs based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions;
an asynchronous computation and communication optimization module, used to configure page-locked memory and multi-stream asynchronous parallel parameters for the wave model computation, and to execute the computation and communication tasks of each sub-region using these parameters so that computation and communication are processed in parallel;
a performance monitoring and feedback optimization module, used to monitor the performance indicators of the CPU and GPU in real time during the calculation, and to dynamically adjust the division range of the sub-regions and the task binding relationship according to these indicators until they reach the preset optimization target.

[0016] The present invention has achieved the following beneficial effects.

[0017] 1. This invention employs a heterogeneous region decomposition and dynamic task partitioning method. Based on the logical control capabilities of the CPU and the parallel computing characteristics of the GPU, the computing region is divided into inner ring, outer ring, and halo sub-regions and then differentiated and bound. This achieves a balanced distribution of the load in the heterogeneous system and effectively solves the problem of low resource utilization caused by the traditional fixed decomposition strategy.

[0018] 2. This invention utilizes page-locked memory and multi-stream asynchronous parallelism. By configuring multiple CUDA asynchronous streams, inner ring computation is overlapped with outer ring communication and data access, so that communication is hidden behind computation. This significantly reduces the data transfer overhead between the CPU and GPU and improves overall computational throughput.

[0019] 3. This invention uses a real-time performance indicator monitoring and dynamic feedback adjustment mechanism to identify computing bottlenecks or memory swapping anomalies in real time by utilizing CPU and GPU utilization measurement APIs, and dynamically reconstructs the region partitioning scheme accordingly, ensuring that the computing process is always in an optimal resource configuration state.

[0020] 4. This invention supports collaborative scheduling optimization for heterogeneous devices such as CPUs and GPUs, fully leveraging the hardware potential of heterogeneous clusters. Experiments show that, under the same hardware configuration, a significant speedup improvement can be achieved compared to the baseline, meeting the stringent real-time requirements of large-scale, high-resolution ocean wave simulation.

Attached Figure Description

[0021] Figure 1 is a flowchart of the GPU optimization method for wave models based on heterogeneous region decomposition in the embodiment. Figure 2 is a block diagram of the GPU optimization system for wave models based on heterogeneous region decomposition in the embodiment.

Detailed Implementation

[0022] To make the various technical features, advantages, or effects of the present invention more apparent and understandable, detailed descriptions are provided below through embodiments.

[0023] This invention provides a GPU optimization method for wave models based on heterogeneous region decomposition. Its processing flow is shown in Figure 1, and the specific steps are as follows.
Step S1: Determine the mesh generation result and time step based on the initial parameters of the wave model and the computational resource information of the heterogeneous platform.

[0024] Step S11: Initial condition input and mesh generation.

[0025] Based on the computational requirements of the wave model, including computational stability and convergence requirements, initial conditions such as initial wave state, topography, wind field, wave energy packet velocity, and background current velocity are input. The wave computation region is defined, and a spatial resolution is selected to mesh the region. Based on the stability and convergence criteria in numerical analysis, namely the CFL (Courant-Friedrichs-Lewy) condition, the time step is calculated using the topological relationships of the mesh points, as well as the background current velocity and wave energy packet velocity.

[0026] Step S12: Identification and performance evaluation of computing resources.

[0027] Identify and configure the available computing resources, including central processing units (CPUs) and graphics processing units (GPUs). Use the `lscpu` command to identify the number of CPU cores, the clock speed, and the instruction set; use GPU tools such as `nvidia-smi`, or the CUDA (Compute Unified Device Architecture) runtime API function `cudaGetDeviceProperties`, to identify the number and type of GPUs and obtain GPU device information. Calculate the theoretical computing power of the CPU and GPU respectively, and obtain the network bandwidth, communication latency, and data transfer rate of the heterogeneous platform.

[0028] In an optional embodiment of the present invention, the process of initializing the wave model calculation and configuring computing resources includes: inputting the initial wave state, topography, wind field, wave energy packet velocity, and background current velocity according to the calculation stability and convergence requirements of the wave model; delineating the wave calculation area and performing grid division; calculating the time step based on the CFL condition using the topological relationship of the grid points, the background current velocity, and the wave energy packet velocity; obtaining the number of CPU cores, clock speed, and instruction set through the lscpu command; obtaining GPU device information through nvidia-smi; and evaluating the theoretical computing power of the CPU according to the following formula.

[0029] FLOPS = number of cores × single-core clock frequency × floating-point operations per cycle, where FLOPS is the number of floating-point operations per second.
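As a worked instance of the formula in [0029], the short sketch below evaluates the product for one set of inputs. The core count, clock frequency, and per-cycle operation count are illustrative placeholders chosen here, not measured properties of any processor named in this document.

```python
def theoretical_peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak = cores x single-core clock (Hz) x FLOPs per cycle."""
    return cores * clock_hz * flops_per_cycle


# Illustrative inputs only: 64 cores at 2.6 GHz, 8 FLOPs per cycle per core.
peak = theoretical_peak_flops(cores=64, clock_hz=2.6e9, flops_per_cycle=8)
print(f"{peak / 1e12:.4f} TFLOPS")
```

The result is an upper bound that ignores memory bandwidth and instruction mix, so it is best used to compare devices relative to each other rather than to predict achieved performance.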

[0030] In other embodiments of the present invention, the theoretical computing power of the GPU can also be obtained by querying white papers published by GPU manufacturers.

[0031] Step S2: Divide the grid partitioning result according to the performance characteristics and load capacity of the computing resource information to obtain multiple sub-regions with different computing dependency attributes.

[0032] Step S21: Dynamically calculate the initial division of the region.

[0033] Based on the computational region determined in step S1, which is based on latitude, longitude, and land-sea physical characteristics, the computational region is dynamically divided according to the performance characteristics and load capacity of different computing resources, combined with the overall computational requirements of the wave pattern. Performance characteristics include the architectural differences between CPUs and GPUs. CPUs excel at logic control and complex computational tasks, possessing a multi-level cache structure, including L1, L2, and L3 caches; GPUs are optimized for parallel computing and have stronger intensive computing capabilities. Load capacity refers to the number of grid points that each computing device can handle.

[0034] Step S22: Secondary subdivision of the sub-region.

[0035] Based on the computational characteristics of CPUs and GPUs, a heterogeneous region decomposition strategy is adopted to further divide each computational region into multiple sub-regions with different computational dependencies, matching the computational and communication characteristics of heterogeneous platforms. The propagation computation is a stencil computation, so it requires information from neighbouring processes. Depending on whether grid points participate in communication or in source function computation, the computational region is divided into an inner ring region that only computes, an outer ring region that both computes and communicates, and a halo region that only communicates. By accounting for the different computational and communication requirements of these sub-regions, task allocation is optimized and load balancing is achieved.

[0036] In an optional embodiment of the present invention, the process of dynamic computing region partitioning and load balancing includes: taking into account the characteristics of CPUs having fewer cores, strong single-core performance and good logic control, and GPUs having more cores and being suitable for parallel computing of the same type of data-intensive computing, the computing grid determined in step S1 is initially divided according to load capacity; then a heterogeneous region decomposition strategy is executed to further refine the grid into inner ring region, outer ring region and halo region, and to provide a foundation for subsequent parallel computing on heterogeneous platforms by utilizing the different computing dependency attributes of each sub-region.

[0037] In other embodiments of the present invention, the further decomposition of the computational domain in step S1 can also be carried out by dynamically adjusting the boundary range according to the computational dependency of different sub-regions.

[0038] Step S3: Based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions, map and bind the different sub-regions to the corresponding CPUs or GPUs.

[0039] Step S31: Task mapping and resource binding.

[0040] Based on the current computing resource load and the various communication capabilities of the heterogeneous platform, including Message Passing Interface (MPI) communication bandwidth, PCIe bandwidth, and data transfer rate, the allocation of computing tasks is dynamically adjusted. The different sub-regions defined in step S22 are bound to computing resources, specifically by prioritizing the allocation and binding of high-density task sub-regions to the GPU, while prioritizing the allocation and binding of lower-density data processing and logical judgment task sub-regions to CPU cores.

[0041] Step S32, Task scheduling and bottleneck optimization.

[0042] Tasks are allocated based on estimated computation time, and then assigned to matching devices according to their characteristics. More computing resources are allocated to bottlenecks encountered during execution to ensure that the computational and communication needs of different sub-regions are met. The CPU, utilizing its cache structure and support for complex control instructions, handles low-load tasks or tasks involving frequent communication and data transfer; the GPU, leveraging its parallel computing optimization design, efficiently completes intensive computational tasks involving large amounts of similar data.

[0043] In an optional embodiment of the present invention, the process of dynamic computing task allocation and resource optimization includes: combining the sub-region features obtained by secondary subdivision, mapping high-load computing tasks to GPUs and low-load or complex logic tasks to CPUs; during execution, if a specific computing task is identified as a bottleneck in the overall pipeline, the number of CPU cores allocated to that task is dynamically increased to achieve deep matching between computing tasks and heterogeneous resources.

[0044] In other embodiments of the present invention, the dynamic allocation of computing tasks can also be achieved by predicting the workload of subsequent iteration tasks based on historical execution time statistics.

[0045] Step S4: Configure page-locked memory and multi-stream asynchronous parallel parameters for the wave model calculation, and use these parameters to execute the calculation and communication tasks of each sub-region, so that calculation and communication are processed in parallel.

[0046] Step S41, configure parallel multi-stream technology.

[0047] Before the computation begins, parallel multi-streaming techniques are configured for wave mode computation, specifically including the application of pinned memory and CUDA multi-streaming. Pinned memory ensures the stability of data transmission, and multiple CUDA streams are set up to achieve hybrid computation.

[0048] Step S42, computation and communication hiding optimization.

[0049] By leveraging CUDA multistreaming technology, computational tasks can be executed asynchronously and concurrently, with necessary data communication occurring during the process, thereby achieving overlap and hiding of computation and communication. Simultaneously, data transmission and computational communication strategies are optimized, including implementing asynchronous data access and concurrent task execution, ensuring that data transmission and communication can proceed in parallel with the computation process, thus improving the overall utilization rate of heterogeneous computing resources.

[0050] In an optional embodiment of the present invention, the process of configuring parallel multi-stream technology and optimizing computation and communication includes: allocating page-locked (non-pageable) host memory, and using CUDA multi-stream technology to assign the propagation computation task of the wave model and the boundary data communication task of the outer ring region to different streams for asynchronous execution, so that the CPU and GPU can exchange data over the PCIe bus while the GPU is processing computational tasks.
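The benefit of the overlap described in Step S4 can be estimated with a simple timing model. The sketch below is a back-of-the-envelope model rather than GPU code, and it rests on stated assumptions: the halo transfer runs concurrently with inner ring computation, and outer ring computation starts only after both have finished. The function names and timings are hypothetical. The source does not name the specific CUDA calls; in a CUDA implementation this pattern is typically built from `cudaHostAlloc`, `cudaStreamCreate`, and `cudaMemcpyAsync`.

```python
def step_time_serial(t_inner, t_outer, t_transfer):
    """One time step with transfer and computation executed back to back."""
    return t_inner + t_outer + t_transfer


def step_time_overlapped(t_inner, t_outer, t_transfer):
    """One time step with the halo transfer hidden behind inner ring compute.

    Assumption: the outer ring update waits for both the inner ring kernel
    and the halo transfer, so the step costs max(inner, transfer) + outer.
    """
    return max(t_inner, t_transfer) + t_outer


def hidden_fraction(t_inner, t_outer, t_transfer):
    """Share of the transfer time removed from the critical path."""
    saved = step_time_serial(t_inner, t_outer, t_transfer) - step_time_overlapped(
        t_inner, t_outer, t_transfer
    )
    return saved / t_transfer


# Hypothetical timings in milliseconds.
print(step_time_serial(8.0, 2.0, 5.0), step_time_overlapped(8.0, 2.0, 5.0))
print(hidden_fraction(8.0, 2.0, 5.0), hidden_fraction(8.0, 2.0, 12.0))
```

Under this model the transfer is fully hidden only while it is shorter than the inner ring computation, which is one reason the size of the inner ring matters when the partition is adjusted.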

[0051] In other embodiments of the present invention, the optimization of computing and communication can also be achieved by using a method based on remote direct memory access (RDMA) technology to directly perform asynchronous data transmission between heterogeneous nodes.

[0052] Step S5: During the calculation process, monitor the performance indicators of the CPU and GPU in real time, and dynamically adjust the division range of the sub-region and the task binding relationship according to the performance indicators until the performance indicators reach the preset optimization target.

[0053] Step S51: Real-time performance metric monitoring.

[0054] During computation, performance metrics, including computation time, resource utilization, and memory usage, are monitored in real time. Average utilization data is obtained by calling the CPU and GPU average utilization measurement APIs. Memory usage is analyzed to determine if there is frequent data exchange due to insufficient memory capacity, and computation time is assessed to determine if there is excessive computation or communication latency, in order to identify problems with low computing resource utilization and analyze their causes.

[0055] Step S52: Dynamically adjust resource allocation.

[0056] If the utilization rate of computing resources is found to be below the preset optimization target, an adjustment mechanism is triggered, returning to step S2 to redistribute the computing regions, and then to step S3 to redistribute tasks. Through this closed-loop feedback process, the load distribution and task mapping of sub-regions are continuously optimized until the utilization rate of computing resources reaches the expected optimization target, thereby ensuring that the resources of the heterogeneous platform are effectively utilized.
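The closed-loop adjustment in Step S52 can be sketched as a small rebalancing loop. The version below is an illustration under several assumptions introduced here: each device's utilization is taken as its busy time divided by the slowest device's time, processing rates are known constants (in practice they would be measured each round), and repartitioning assigns grid points in proportion to those rates. None of the function names or numbers come from the source.

```python
def utilization(points, rates):
    """Per-device utilization: busy time relative to the slowest device."""
    times = [p / r for p, r in zip(points, rates)]
    step = max(times)
    return [t / step for t in times]


def rebalance(points, rates):
    """Reassign grid points in proportion to each device's processing rate."""
    total = sum(points)
    total_rate = sum(rates)
    new_points = [int(total * r / total_rate) for r in rates]
    new_points[-1] += total - sum(new_points)  # keep the total point count fixed
    return new_points


def tune(points, rates, target=0.9, max_rounds=10):
    """Repeat the partition and binding steps until every device meets the target."""
    for _ in range(max_rounds):
        if min(utilization(points, rates)) >= target:
            break
        points = rebalance(points, rates)
    return points


# Hypothetical setup: a CPU and a GPU start with equal shares of 1000 points,
# but the GPU processes points four times faster.
balanced = tune(points=[500, 500], rates=[100.0, 400.0])
print(balanced, utilization(balanced, [100.0, 400.0]))
```

The `max_rounds` cap is a design choice made for this sketch so that the loop terminates even if the target cannot be reached with the available devices.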

[0057] In an optional embodiment of the present invention, the process of performance monitoring and resource optimization includes: using a performance monitoring tool to record in real time the GPU kernel function execution time and MPI communication time during the operation of the wave mode; if it is found that the computing utilization of a certain GPU node is lower than a set threshold, then by returning to step S2 to adjust the regional grid size of the node, and re-executing the task binding in step S3.

[0058] In other embodiments of the present invention, the optimization and adjustment of computing resources may also adopt a strategy of dynamically adjusting the task load weight based on the thermal power consumption status or energy efficiency ratio of each computing device.
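One possible reading of this alternative strategy is sketched below in Python. It assumes that the load weight of each device is proportional to its energy-efficiency ratio (sustained GFLOP/s per watt), and that a device drawing more than a power cap is derated proportionally. The function name, the cap, and the device figures are hypothetical illustrations, not measurements of any real processor.

```python
def load_weights(devices, power_cap_w=None):
    """Map device name -> share of the task load.

    devices: name -> (sustained_gflops, power_draw_w). Values are assumed inputs.
    """
    scores = {}
    for name, (gflops, watts) in devices.items():
        score = gflops / watts  # energy-efficiency ratio, GFLOP/s per watt
        if power_cap_w is not None and watts > power_cap_w:
            # Hypothetical thermal rule: derate devices running above the cap.
            score *= power_cap_w / watts
        scores[name] = score
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

if __name__ == "__main__":
    devices = {"cpu": (100.0, 100.0), "gpu": (900.0, 300.0)}  # made-up numbers
    print(load_weights(devices))
    print(load_weights(devices, power_cap_w=150.0))
```

The resulting weights could replace the throughput-proportional shares used when sub-regions are re-divided in step S2.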

[0059] This invention also provides a GPU optimization system for a wave model based on heterogeneous region decomposition. As shown in Figure 2, it includes the following modules: a resource initialization configuration module, used to determine the mesh generation result and time step based on the initial parameters of the wave model and the computing resource information of the heterogeneous platform; a heterogeneous region dynamic partitioning module, used to partition the mesh generation result according to the performance characteristics and load capacity indicated by the computing resource information, obtaining multiple sub-regions with different computational dependency attributes; a sub-region task mapping module, used to map and bind different sub-regions to corresponding CPUs or GPUs based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions; an asynchronous computation and communication optimization module, used to configure page-locked (pinned) memory and multi-stream asynchronous parallel parameters for the wave model computation, and to execute the computation and communication tasks of each sub-region using the multi-stream asynchronous parallel parameters so that computation and communication proceed in parallel; and a performance monitoring and feedback optimization module, used to monitor the performance indicators of the CPUs and GPUs in real time during computation, and to dynamically adjust the partition extent of the sub-regions and the task binding relationship according to the performance indicators until the performance indicators reach the preset optimization target.

[0060] Experimental test: The test metric is the speedup achieved after optimizing the wave model. The test case uses a computational region spanning -70° to 70° in latitude and -180° to 180° in longitude at a resolution of 5°×5°, simulating a 5-day wave model run. The experiment was conducted on a heterogeneous cluster in which each compute node has 2 Kunpeng 920 processors and 4 A100 GPUs; the performance baseline is the same test case run in parallel on 8 Kunpeng 920 processors. Experimental results show that the GPU-ported version of the wave model, using 8 GPUs, achieves a 20x speedup compared to the baseline. Applying the optimization method proposed in this invention on top of that version, with the same 8-GPU configuration, further increases computational performance by approximately 1.05 times, that is, to roughly 2.05 times that of the GPU-ported version, ultimately achieving a speedup of over 40x compared to the baseline.

[0061] Although the present invention has been disclosed above with reference to embodiments, it is not intended to limit the present invention. Appropriate modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention should be covered within the protection scope of the present invention, which is defined by the claims.

Claims

1. A GPU optimization method for an ocean wave model based on heterogeneous region decomposition, characterized in that it includes the following steps: determining the mesh generation result and time step based on the initial parameters of the wave model and the computing resource information of the heterogeneous platform; partitioning the mesh generation result according to the performance characteristics and load capacity indicated by the computing resource information to obtain multiple sub-regions with different computational dependency attributes; mapping and binding different sub-regions to corresponding CPUs or GPUs based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions; configuring page-locked (pinned) memory and multi-stream asynchronous parallel parameters for the wave model computation, and executing the computation and communication tasks of each sub-region using the multi-stream asynchronous parallel parameters, so that computation and communication proceed in parallel; and monitoring the performance metrics of the CPUs and GPUs in real time during computation, and dynamically adjusting the partition extent of the sub-regions and the task binding relationship according to the performance metrics until the performance metrics reach the preset optimization target.

2. The method as described in claim 1, characterized in that determining the mesh generation result and time step based on the initial parameters of the wave model and the computing resource information of the heterogeneous platform includes: inputting the initial wave state, topography, wind field, wave energy packet group velocity, and background current velocity; defining the wave calculation area and setting the spatial resolution; calculating the time step from the grid topology of the wave calculation area, the background current velocity, and the wave energy packet group velocity according to the CFL condition, subject to stability and convergence checks; obtaining the number of CPU cores, clock speed, and instruction set through a processor instruction identification tool, and obtaining the number and type of GPUs through the GPU driver interface; and evaluating the theoretical computing power of the CPU as the product of the number of cores, the single-core clock speed, and the number of floating-point operations per cycle, and obtaining the network bandwidth, communication latency, and data transmission rate of the heterogeneous platform.

3. The method as described in claim 2, characterized in that calculating the time step according to the CFL condition, subject to stability and convergence checks, specifically includes: determining the minimum spatial spacing between grid points of the wave model; obtaining the vector sum of the wave energy packet group velocity and the background current velocity, and calculating the maximum magnitude of that vector sum; and determining the upper limit of the time step from the ratio of the minimum spatial spacing to the maximum magnitude.

4. The method as described in claim 1, characterized in that partitioning the mesh generation result according to the performance characteristics and load capacity indicated by the computing resource information to obtain multiple sub-regions with different computational dependency attributes includes: performing an initial decomposition of the mesh generation result based on the CPU's logic-control and multi-level cache characteristics and the GPU's parallel computing characteristics, combined with the number of grid points that each computing resource can handle; and executing a heterogeneous region decomposition strategy, according to whether grid points participate in inter-process communication or source function computation, to further divide each initially decomposed region into an inner ring region that performs only computation tasks, an outer ring region that performs both computation and communication tasks, and a halo region that performs only communication tasks.

5. The method as described in claim 4, characterized in that the heterogeneous region decomposition strategy includes: identifying the set of grid points at the edge of the computation region that need to exchange boundary data with adjacent processes, and designating this set as the outer ring region; designating the remaining grid points within the computation region, excluding the outer ring region, as the inner ring region; and establishing, outside the outer ring region, a buffer for receiving data from adjacent processes, and designating this buffer as the halo region.

6. The method as described in claim 1, characterized in that mapping and binding different sub-regions to corresponding CPUs or GPUs based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions includes: acquiring the communication capabilities of the heterogeneous platform, including Message Passing Interface (MPI) communication bandwidth, Peripheral Component Interconnect Express (PCIe) bandwidth, and data transmission rate; and mapping and binding sub-regions whose computational density is higher than a preset threshold to the GPU, while mapping and binding data-processing sub-regions and logic-judgment sub-regions whose computational density is lower than the preset threshold to CPU cores.

7. The method as described in claim 6, characterized in that mapping and binding sub-regions whose computational density exceeds a preset threshold to the GPU specifically includes: counting the floating-point operations and branch instructions required per grid point in each sub-region; calculating the ratio of the number of floating-point operations to the number of branch instructions as a computational density evaluation value; and identifying sub-regions whose computational density evaluation value is greater than a preset ratio coefficient as high-load computing tasks and binding them to the GPU.

8. The method as described in claim 4 or 5, characterized in that configuring page-locked (pinned) memory and multi-stream asynchronous parallel parameters for the wave model computation, and executing the computation and communication tasks of each sub-region using the multi-stream asynchronous parallel parameters so that computation and communication proceed in parallel, includes: before the wave model computation starts, allocating page-locked (non-pageable) memory space in host memory; configuring multiple CUDA asynchronous streams and distributing the processing tasks of the inner ring region, outer ring region, and halo region to different asynchronous streams; and, while the asynchronous streams drive the GPU to perform computation tasks, performing asynchronous data transfer between the page-locked memory space and GPU device memory over the PCIe bus.

9. The method as described in claim 1, characterized in that monitoring the performance metrics of the CPUs and GPUs in real time during computation, and dynamically adjusting the partition extent of the sub-regions and the task binding relationship according to the performance metrics until the performance metrics reach the preset optimization target, includes: obtaining real-time CPU and GPU utilization data during computation through the average utilization measurement interface; analyzing, based on the real-time utilization data, whether insufficient memory capacity is causing frequent data exchange or excessive communication latency is causing computation to wait; and, for computing devices whose utilization is below the preset target, re-triggering the adjustment of the region partition and re-executing the task binding.

10. A GPU optimization system for a wave model based on heterogeneous region decomposition, characterized in that it includes the following modules: a resource initialization configuration module, used to determine the mesh generation result and time step based on the initial parameters of the wave model and the computing resource information of the heterogeneous platform; a heterogeneous region dynamic partitioning module, used to partition the mesh generation result according to the performance characteristics and load capacity indicated by the computing resource information, obtaining multiple sub-regions with different computational dependency attributes; a sub-region task mapping module, used to map and bind different sub-regions to corresponding CPUs or GPUs based on the communication capabilities of the heterogeneous platform and the computational dependency attributes of the sub-regions; an asynchronous computation and communication optimization module, used to configure page-locked (pinned) memory and multi-stream asynchronous parallel parameters for the wave model computation, and to execute the computation and communication tasks of each sub-region using the multi-stream asynchronous parallel parameters so that computation and communication proceed in parallel; and a performance monitoring and feedback optimization module, used to monitor the performance indicators of the CPUs and GPUs in real time during computation, and to dynamically adjust the partition extent of the sub-regions and the task binding relationship according to the performance indicators until the performance indicators reach the preset optimization target.
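For illustration only, three of the quantities recited in the claims can be sketched in Python: the time-step upper limit of claim 3, the inner ring / outer ring / halo classification of claim 5, and the computational density value of claim 7. This is a simplified sketch and not the claimed implementation. It assumes a rectangular local block with neighbouring processes on all four sides, a halo width of one point, and two-component velocity vectors; all function names and sample values are hypothetical.

```python
import math

def cfl_time_step_limit(min_spacing, group_velocities, currents):
    """Claim 3: minimum grid spacing divided by the maximum |c_g + U|."""
    max_speed = max(
        math.hypot(cg[0] + u[0], cg[1] + u[1])
        for cg, u in zip(group_velocities, currents)
    )
    return min_spacing / max_speed

def decompose(nx, ny, halo=1):
    """Claim 5: classify points of an nx-by-ny block plus its surrounding buffer.

    Assumes every edge of the block borders another process.
    """
    inner, outer, halo_pts = [], [], []
    for i in range(-halo, nx + halo):
        for j in range(-halo, ny + halo):
            if i < 0 or j < 0 or i >= nx or j >= ny:
                halo_pts.append((i, j))   # receive buffer outside the block
            elif i < halo or j < halo or i >= nx - halo or j >= ny - halo:
                outer.append((i, j))      # edge points exchanged with neighbours
            else:
                inner.append((i, j))      # computation only
    return inner, outer, halo_pts

def computational_density(flops_per_point, branches_per_point):
    """Claim 7: floating-point operations per branch instruction."""
    # Guard against division by zero for branch-free kernels (an assumption).
    return flops_per_point / max(branches_per_point, 1)

def bind(density, ratio_coefficient):
    """Bind to the GPU when density exceeds the preset ratio coefficient."""
    return "gpu" if density > ratio_coefficient else "cpu"

if __name__ == "__main__":
    dt_max = cfl_time_step_limit(1000.0, [(3.0, 4.0), (6.0, 8.0)], [(0.0, 0.0), (0.0, 0.0)])
    inner, outer, halo_pts = decompose(4, 4)
    print(dt_max, len(inner), len(outer), len(halo_pts))
    print(bind(computational_density(200, 4), ratio_coefficient=10.0))
```

Separating inner ring points from outer ring points is what allows the inner ring computation to proceed while outer ring and halo data are still being exchanged.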