An MPI communication optimization method based on pre-sampling and adaptive parameter estimation
The method collects MPI communication characteristics during a pre-sampling stage, calculates optimal parameters from them, and injects those parameters to optimize the MPI communication process. This addresses the problem that traditional MPI tuning tools cannot reflect real application behavior, and thereby improves communication efficiency and resource utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional MPI communication tools cannot accurately reflect real-world application behavior, resulting in limited communication efficiency. Too many small messages cause queue congestion, and large messages are not effectively fragmented, failing to fully utilize bandwidth.
The communication behavior information of the MPI application is collected during the pre-sampling phase, and the optimal communication parameters, including the optimal eager threshold and the optimal message fragment length, are calculated and injected into the configuration of the MPI application during the formal execution phase to optimize the communication process.
It effectively reduces communication overhead and improves the overall operating efficiency and resource utilization of parallel applications.
Figure CN122240352A_ABST
Description
Technical Field
[0001] This invention belongs to the field of high-performance computing (HPC) communication optimization, and specifically relates to an MPI communication optimization method based on presampling and adaptive parameter estimation.
Background Technology
[0002] In the fields of modern scientific computing and engineering simulation, the scale of problems is increasingly massive, and the computational complexity is increasing dramatically. From global climate simulation and astrophysical research to the analysis of complex flow mechanisms, the demand for computing resources has far exceeded the processing capacity of a single computer. To address this challenge, parallel computing technology has emerged and become a core means of solving large-scale computational problems. The core idea of parallel computing is to decompose a large computational task into multiple smaller subtasks, distribute them across multiple computing units for simultaneous execution, and finally aggregate the results through collaboration, thereby significantly shortening the solution time. Message Passing Interface (MPI), due to its high flexibility, portability, and excellent performance, has become a reliable tool for implementing large-scale parallel computing on distributed memory systems (such as high-performance computing clusters). It allows processes running on multiple processors (or computing nodes) to collaborate by sending and receiving messages. Each process has its own independent address space, and data exchange and synchronization between processes must be accomplished through explicit MPI calls (such as MPI_Send, MPI_Recv).
[0003] With the evolution of computing hardware, especially the rise of general-purpose graphics processing units (GPUs) in scientific computing, traditional computing architectures have evolved into heterogeneous models in which CPUs and GPUs work together. MPI has evolved accordingly, integrating closely with accelerator computing models (such as NVIDIA's CUDA) to form the hybrid parallel programming paradigm of "MPI + X". In this paradigm, MPI is responsible for process-level communication and task coordination between computing nodes, while "X" (such as CUDA) is responsible for fine-grained thread-level parallel computing within a node, utilizing the GPU's many-core architecture. This division of labor leverages the advantages of each component: GPUs focus on computationally intensive tasks, providing extremely high floating-point throughput, while MPI "glues" multiple GPUs into a unified, massive virtual computer, enabling cross-node data exchange and global synchronization. The capability of combining MPI and GPUs is demonstrated in computational fluid dynamics applications such as GPUSPH. GPUSPH is free and open-source software based on the Smoothed Particle Hydrodynamics (SPH) method, designed specifically for GPU acceleration and used to simulate complex fluid phenomena. Simulating a large-scale scenario, such as a dam collapse, requires tracking hundreds of millions or even billions of fluid particles. This is where MPI plays a crucial role. The simulation domain is spatially decomposed into multiple subdomains, and each MPI process (typically corresponding to one GPU) is responsible for calculating the forces and positions of all particles within one subdomain. At each time step, a process first calculates the forces and positions of the particles within its subdomain using CUDA kernels. Subsequently, boundary particles migrate between neighboring subdomains, and processes need to use point-to-point MPI communication to send and receive the data of these "ghost particles" or "halo particles" to ensure the continuity of force calculations at the boundaries.
[0004] Frequent data exchange between nodes results in massive communication volumes, and the communication efficiency of MPI parallel systems is limited by the mismatch between default parameters and real-world application characteristics. An excessive number of small messages can easily lead to eager queue congestion; large messages, without effective fragmentation, fail to fully utilize bandwidth. Traditional communication tuning tools use fixed benchmarks rather than real-world application behavior, and therefore cannot accurately reflect the communication patterns of simulation programs such as GPUSPH.
Summary of the Invention
[0005] To address the aforementioned problems in the prior art, this invention proposes an MPI communication optimization method based on presampling and adaptive parameter estimation, comprising:
[0006] S1: Start the MPI application and run it within a preset number of time steps to form the presampling phase;
[0007] S2. Collect communication behavior information of the MPI application during the pre-sampling stage, and calculate communication characteristics based on the communication behavior information;
[0008] S3: Calculate the optimal communication parameters based on communication characteristics; the optimal communication parameters include: the optimal eager threshold and the optimal message fragment length;
[0009] S4: Inject the optimal communication parameters into the MPI application configuration, resume the MPI application's operation, and the MPI application enters the formal execution phase;
[0010] S5: Merge the data generated in the pre-sampling phase and the formal execution phase to obtain the complete application execution result.
[0011] The beneficial effects of this invention are:
[0012] This invention proposes an MPI communication optimization method based on presampling and adaptive parameter estimation. PMPI technology is used to collect real communication characteristics during the presampling stage, the optimal communication parameters are adaptively calculated from those characteristics and injected, and the application is restored through its checkpoint mechanism without recalculation. This tunes the MPI communication parameters, effectively reducing communication overhead and improving the overall operating efficiency and resource utilization of parallel applications.
Attached Figure Description
[0013] Figure 1 is a flowchart of an MPI communication optimization method based on presampling and adaptive parameter estimation according to the present invention.
[0014] Figure 2 is a schematic diagram of the structure of an MPI communication optimization method based on presampling and adaptive parameter estimation according to the present invention.
Detailed Implementation
[0015] To make the technical solutions in the embodiments of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only for explaining the present invention and are not intended to limit the present invention.
[0016] As shown in Figure 1 and Figure 2, this embodiment of the invention provides an MPI communication optimization method based on presampling and adaptive parameter estimation, including:
[0017] S1. Start the MPI application in presampling mode and run it within a preset number of time steps to form the presampling phase;
[0018] Specifically, the MPI application is started in presampling mode as follows:
[0019] A presampling start parameter, the number of sampling time steps, and a save path for the sampling data are added to the MPI application, so that the application correctly enters the presampling stage and saves its intermediate results.
[0020] S2. Collect communication behavior information of the MPI application during the pre-sampling stage, and calculate communication characteristics based on the communication behavior information;
[0021] PMPI technology is used to intercept MPI communication functions called in MPI applications and calculate communication characteristics such as message size distribution, message concurrency, inter-node round-trip latency, and effective bandwidth.
[0022] Specifically, the basic process of the pre-sampling operation phase is as follows: the computing node execution module is responsible for performing simulation calculations according to the established rules; the PMPI interception and acquisition module intercepts the MPI communication functions called in the MPI application, collects the data and puts it into the buffer; the acquisition and statistics module is responsible for collecting network-related data, performing structured statistics on the data in the buffer, and outputting a communication statistics file, which includes communication characteristics such as message size distribution, message concurrency, inter-node round-trip delay and effective bandwidth.
[0023] The specific process of the compute node execution module includes:
[0024] The MPI application starts MPI processes on multiple computing nodes. Each process executes the computational logic normally to ensure that the communication mode can reflect the actual simulation behavior. It runs the time steps specified in the presampling phase. Each process saves the checkpoint file and simulation data of the presampling phase. The checkpoint file contains complete information about the particles within the time step, including position and velocity, which is used for subsequent formal simulation startup and complete simulation data statistics.
[0025] The specific process of PMPI interception and collection module processing includes:
[0026] The MPI communication functions called in the application are wrapped using the MPI standard's Profiling Interface (PMPI). Specifically, each MPI interface function (such as MPI_Send, MPI_Recv, etc.) has a corresponding version with the PMPI_ prefix (such as PMPI_Send, PMPI_Recv, etc.). Taking MPI_Send as an example, the execution flow of its wrapper function is as follows:
[0027] The wrapper function first calls the corresponding PMPI_Send interface to perform the actual MPI communication operation; after the PMPI_Send call returns successfully, the data acquisition logic is then executed, thus ensuring that data statistics are performed only after a valid MPI communication call is completed.
[0028] In the data acquisition logic, the acquired communication events include the following information: sender rank, receiver rank, message size, message type (distinguished by the type of MPI function called), timestamp, and sequence number (used for subsequent send / receive event pairing), and the acquired information is stored in a dedicated buffer.
[0029] Preferably, the above-described wrapping method is also applicable to other MPI communication functions (such as MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Bcast, etc.) to achieve interception and data collection of all target MPI communication functions in the application.
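The real PMPI mechanism is a C-level link-time interposition, which is not reproduced here. The ordering logic of the wrapper (perform the real call first, record an event only if it succeeded) can nevertheless be sketched in Python. In this sketch `ProfiledSend`, `real_send`, and the event dictionary keys are hypothetical names, `real_send` merely stands in for `PMPI_Send`, and a return code of 0 plays the role of `MPI_SUCCESS`.

```python
import time


class ProfiledSend:
    """Call-then-record wrapper around a send routine (illustrative only)."""

    def __init__(self, real_send, my_rank, clock=time.perf_counter):
        self.real_send = real_send  # stand-in for PMPI_Send
        self.my_rank = my_rank      # rank of the calling process
        self.clock = clock          # timestamp source
        self.events = []            # dedicated buffer of communication events
        self.next_seq = 0           # sequence number for later send/receive pairing

    def __call__(self, payload, dest):
        # Perform the actual communication first.
        rc = self.real_send(payload, dest)
        # Record an event only after the call has returned successfully.
        if rc == 0:
            self.events.append({
                "sender": self.my_rank,
                "receiver": dest,
                "size": len(payload),
                "type": "MPI_Send",
                "timestamp": self.clock(),
                "seq": self.next_seq,
            })
            self.next_seq += 1
        return rc
```

A failed call leaves the buffer untouched, so the statistics computed later only see communication that actually took place.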
[0030] The specific process of data collection and statistics module includes:
[0031] The acquisition and statistics module reads the communication events that the PMPI interception and acquisition module has temporarily stored in the buffer, aggregates them, and computes structured statistics to form the communication features used in the subsequent parameter estimation stage. Specifically, statistical processing is triggered when the pre-sampling stage ends or when a preset trigger condition is met (such as reaching a preset number of time steps, a threshold on the number of sampled events, or a threshold on the number of iteration steps).
[0032] Preferably, to ensure data consistency, a buffer freeze operation can be performed before the statistics are triggered, that is, to suspend the writing of new events, and to make all processes enter a unified statistical preparation state through MPI barrier synchronization (MPI_Barrier).
[0033] In this embodiment, the message size distribution is statistically analyzed by constructing a histogram using a bucketing method. Specifically, messages from each communication event are mapped to preset bucket intervals based on their message size (preferably using logarithmic scale bucketing, for example, with boundaries of 2KB, 8KB, 32KB, 64KB, 128KB, 512KB, 2MB, etc.). The number of messages in each bucket interval is counted to obtain the local histogram representation result, i.e., the message size distribution.
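The bucketing step can be sketched as follows, using the example boundaries from the paragraph above. The function name `size_histogram` and the convention that each bucket includes its upper boundary are assumptions made for illustration; the source does not say which side of a boundary is inclusive.

```python
from bisect import bisect_left

KB = 1024
# Example logarithmic-scale boundaries: 2KB, 8KB, 32KB, 64KB, 128KB, 512KB, 2MB.
BOUNDS = [2 * KB, 8 * KB, 32 * KB, 64 * KB, 128 * KB, 512 * KB, 2 * KB * KB]


def size_histogram(sizes, bounds=BOUNDS):
    """Count messages per bucket; bucket i holds sizes in (bounds[i-1], bounds[i]].

    The final bucket collects every message larger than the last boundary.
    """
    counts = [0] * (len(bounds) + 1)
    for size in sizes:
        counts[bisect_left(bounds, size)] += 1
    return counts
```

Each process would produce one such local histogram; combining them across processes is a separate reduction step that this sketch leaves out.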
[0034] The process of calculating message concurrency includes: dividing the presampling stage into multiple time windows; pairing communication events using sequence numbers to obtain multiple synchronous point-to-point communication events, where the two communication events of a synchronous point-to-point communication event are a sending communication event and a receiving communication event (for example, MPI_Send paired with the corresponding MPI_Recv, or MPI_Isend paired with the MPI_Wait that completes it); for each synchronous point-to-point communication event, if the timestamps of its two communication events are not in the same time window, marking the corresponding message as being in an incomplete state; counting the number of messages in an incomplete state in each time window; and taking the maximum or average of the counts over all time windows as the message concurrency k. Preferably, to reduce the impact of instantaneous spikes, truncated-mean or median filtering can be used.
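A minimal sketch of the window-based count, taking the maximum over windows. The source does not state which window an incomplete message is charged to; this sketch assumes it is counted in every window from the one containing the send up to, but not including, the one containing the receive. The function name and the `(send_ts, recv_ts)` input layout are also assumptions.

```python
from collections import Counter


def message_concurrency(pairs, window):
    """Estimate concurrency k from paired (send_ts, recv_ts) timestamps.

    `window` is the time-window length in the same unit as the timestamps.
    Returns the largest number of incomplete messages seen in any window.
    """
    incomplete = Counter()
    for send_ts, recv_ts in pairs:
        w_send = int(send_ts // window)
        w_recv = int(recv_ts // window)
        # A pair whose two events fall in the same window adds nothing.
        for w in range(w_send, w_recv):
            incomplete[w] += 1
    return max(incomplete.values()) if incomplete else 0
```

Replacing the final `max` with a mean or a median over windows would give the smoothed variants mentioned in the text.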
[0035] The round-trip time (RTT) calculation process includes: pairing communication events using sequence numbers to obtain multiple synchronous point-to-point communication events; calculating the difference between the timestamps of the sending and receiving communication events (i.e., the sending completion timestamp recorded by the sender and the receiving completion timestamp recorded by the receiver) in each synchronous point-to-point communication event as the estimated round-trip time; and taking the median of the estimated round-trip times of all synchronous point-to-point communication events as the RTT.
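The pairing and median steps might look like the sketch below. The event layout (dictionaries with `seq`, `kind`, and `ts` keys) and both function names are hypothetical, and the sketch assumes that timestamps recorded on different ranks are comparable, which in practice requires synchronized clocks.

```python
from statistics import median


def pair_by_seq(events):
    """Match each send event with the receive event that has the same sequence number."""
    sends, recvs = {}, {}
    for event in events:
        target = sends if event["kind"] == "send" else recvs
        target[event["seq"]] = event
    return [(sends[s], recvs[s]) for s in sorted(sends) if s in recvs]


def estimate_rtt(events):
    """Median of (receive timestamp - send timestamp) over all matched pairs."""
    deltas = [recv["ts"] - send["ts"] for send, recv in pair_by_seq(events)]
    return median(deltas) if deltas else None
```

Unmatched events, such as a send whose receive was not captured before the buffer was frozen, are simply ignored.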
[0036] The calculation process for effective bandwidth includes: pairing communication events using sequence numbers to obtain multiple synchronous point-to-point communication events; selecting the synchronous point-to-point communication events whose message size is greater than a preset threshold (e.g., 32KB); calculating the difference between the timestamps of the sending and receiving communication events in each selected synchronous point-to-point communication event as the transmission duration; calculating the ratio of the message size to the transmission duration of each selected event to obtain its single-transmission bandwidth; and taking the median of the single-transmission bandwidths of all selected events as the effective bandwidth, to avoid underestimation caused by instantaneous congestion.
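A short sketch of the bandwidth estimate, assuming the events have already been paired into `(size_bytes, send_ts, recv_ts)` tuples. The function name and tuple layout are illustrative; the 32KB default mirrors the example threshold in the text.

```python
from statistics import median


def effective_bandwidth(transfers, min_size=32 * 1024):
    """Median single-transfer bandwidth (bytes per time unit) over large messages.

    Only transfers larger than `min_size` with a positive duration are used.
    """
    rates = [
        size / (recv_ts - send_ts)
        for size, send_ts, recv_ts in transfers
        if size > min_size and recv_ts > send_ts
    ]
    return median(rates) if rates else None
```

Returning `None` when no large transfer was sampled lets the caller fall back to a dedicated network test, as the following paragraph allows.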
[0037] Optionally, dedicated network testing (such as point-to-point ping-pong or one-way large message transmission) can still be retained as a supplement or alternative to provide a cleaner network profile when business communication samples are insufficient.
[0038] The communication statistics file uses an easily parsed structured format (such as JSON or key-value text); optionally, the file is written by the rank 0 process to a shared file system path for external scheduling scripts to read.
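As one possible shape for such a file, the sketch below writes and reads the four communication characteristics as JSON. The key names and both function names are hypothetical; the source only requires an easily parsed structured format.

```python
import json


def write_comm_stats(path, histogram, k, rtt, bw):
    """Write the communication characteristics to a JSON file."""
    stats = {
        "message_size_histogram": histogram,
        "message_concurrency": k,
        "rtt_seconds": rtt,
        "effective_bandwidth_bytes_per_second": bw,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stats, handle, indent=2)


def read_comm_stats(path):
    """Load the communication characteristics back as a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

In an MPI run, only the rank 0 process would call `write_comm_stats`, with `path` pointing at a location on the shared file system.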
[0039] S3: Calculate the optimal communication parameters for the MPI application based on communication characteristics; the optimal communication parameters include: the optimal eager threshold and the optimal message fragment length;
[0040] The eager threshold and message fragment length are core parameters affecting communication performance, especially related to short / long message transmission strategies, memory usage, and latency.
[0041] In this embodiment, step S3 is the optimal parameter calculation stage. Its purpose is to perform feature analysis, network profiling and adaptive threshold decision-making based on the communication characteristics (including message size distribution, message concurrency k, round-trip time RTT and effective bandwidth BW) recorded in the communication statistics information file generated in step S2, to obtain the optimal communication parameters, form the optimal communication parameter set, and output it for use by external scheduling scripts.
[0042] S31. Read the communication statistics file output in step S2 and parse it to obtain the following key statistical results: message size distribution, message concurrency k, round-trip time (RTT), and effective bandwidth (BW).
[0043] S32. Calculate the bandwidth-delay product (BDP), expressed as follows:
[0044] $$\mathrm{BDP} = \mathrm{BW} \times \mathrm{RTT}$$
[0045] The BDP characterizes the amount of data that a link can hold in transit within one round-trip time.
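As a worked illustration, take the BDP to be the product of the effective bandwidth and the round-trip time (its standard definition). The values below are made up for the example and are not measurements from this document: an effective bandwidth of 1.25 GB/s and an RTT of 100 microseconds.

```latex
\mathrm{BDP} = \mathrm{BW} \times \mathrm{RTT}
             = \left(1.25 \times 10^{9}\ \mathrm{B/s}\right) \times \left(1.0 \times 10^{-4}\ \mathrm{s}\right)
             = 1.25 \times 10^{5}\ \mathrm{B} \approx 122\ \mathrm{KiB}
```

Under these assumed values, roughly 122 KiB of data can be in flight on the link at any moment.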
[0046] S33. Calculate the optimal eager threshold, expressed as follows:
[0047] $$T_{\text{eager}} = \mathrm{clip}\left(\alpha \cdot \mathrm{BDP},\ \mathrm{LowerBound},\ \mathrm{UpperBound}\right)$$
[0048] where $\alpha$ is an empirical proportionality coefficient, preferably ranging from 0.3 to 0.8, and LowerBound and UpperBound are the lower and upper bounds of the threshold, respectively, used to suppress performance degradation caused by abnormal sampling or extreme parameters; preferably, LowerBound = … and UpperBound = … (in this embodiment, … and …, to adapt to TCP BTL limitations in typical Ethernet environments). The function $\mathrm{clip}$ is implemented as follows: if the calculated value is less than LowerBound, LowerBound is taken; if it is greater than UpperBound, UpperBound is taken; otherwise, the original value is taken.
[0049] The empirical proportionality coefficient $\alpha$ is calculated as follows: the proportion of small messages is calculated based on the message size distribution, and $\alpha$ is derived from that proportion. Small messages are those smaller than the message threshold; the default message threshold for MPI is 64KB.
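A minimal sketch of this step, assuming the threshold is the coefficient times the BDP, clamped between the two bounds. The function name is hypothetical, and the coefficient and both bounds are supplied by the caller because their concrete values are not given here.

```python
def eager_threshold(bw, rtt, alpha, lower_bound, upper_bound):
    """Clamp alpha * BDP to [lower_bound, upper_bound] and return whole bytes.

    bw is in bytes per second, rtt in seconds; alpha, lower_bound and
    upper_bound are caller-supplied (assumed inputs, not values from the source).
    """
    bdp = bw * rtt                      # bandwidth-delay product in bytes
    raw = alpha * bdp                   # unclamped candidate threshold
    clamped = min(max(raw, lower_bound), upper_bound)
    return int(round(clamped))
```

For instance, `eager_threshold(1e6, 0.125, 0.5, 4096, 65536)` yields 62500 bytes, where the bounds 4096 and 65536 are illustrative only.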
[0050] S34. Calculate the optimal message fragment length, expressed as follows:
[0051] $$L_{\text{frag}} = \mathrm{align}_{\mathrm{MSS}}\left(\frac{\mathrm{BDP}}{k}\right)$$
[0052] where k is the message concurrency estimated during the presampling stage; in this embodiment, if k < 1, k = 1 is conservatively chosen. MSS is the maximum segment size of the link, preferably obtained by querying a system interface (for example, on Linux, by obtaining the MTU through ip link and subtracting the IP/TCP header overhead; typical values are 1460 bytes for the standard MTU of 1500 and 8960 bytes for a jumbo-frame MTU of 9000); optionally, if it cannot be queried, a default value (such as 1460 bytes) is used. The function $\mathrm{align}_{\mathrm{MSS}}(\cdot)$ rounds its argument to a multiple of MSS, either downwards or to the nearest multiple (preferably downwards), so that fragments match link segments and fragmentation is reduced.
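A minimal sketch of this step, under one loudly stated assumption: the fragment length is taken to be the per-message share of the BDP (BDP divided by k), rounded down to a multiple of MSS. That reading is inferred from the symbol definitions and is not confirmed by the source. The function name and the floor of one MSS, which avoids a zero-length fragment, are also additions of this sketch.

```python
def fragment_length(bdp_bytes, k, mss=1460):
    """Per-message share of the BDP, rounded down to a multiple of MSS.

    ASSUMPTION: the BDP / k form is an inferred reading, not the source's formula.
    """
    k = max(k, 1)                        # conservative floor: k < 1 is treated as 1
    share = bdp_bytes / k                # bytes in flight per concurrent message
    aligned = int(share // mss) * mss    # round down to a multiple of MSS
    return max(aligned, mss)             # sketch-only guard against a zero length
```

With a BDP of 125000 bytes, k = 4, and the default MSS of 1460, the result is 30660 bytes (21 segments).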
[0053] S35. Combine the optimal eager threshold and the optimal message fragment length to obtain the optimal parameter set. The parameter set is output in key-value pair form for subsequent steps to be read by external scripts and injected into the MPI runtime environment.
[0054] S4. Inject the optimal communication parameters into the configuration of the MPI application, resume the operation of the MPI application, and enter the formal execution phase;
[0055] Specifically, it includes:
[0056] S41. Construct MPI startup parameters for the formal execution phase, and inject the optimal eager threshold and message fragment length calculated in step S3 into the MPI runtime parameter configuration.
[0057] Specifically, the communication module is configured via startup parameters using the Modular Component Architecture (MCA) interface provided by the MPI runtime. Preferably, the eager threshold is set via the MCA parameter `btl_tcp_eager_limit`, and the message fragment size is set via the MCA parameter `btl_tcp_rdma_pipeline_send_length`. These parameters are appended to the MPI startup parameters as key-value pairs, enabling the MPI runtime to load and apply the optimal communication parameters during the initialization phase.
[0058] S42. Construct a complete MPI application startup command and resume running the MPI application through the checkpoint mechanism supported by the application (the checkpoint file saved in step S2).
[0059] Specifically, the startup command, while keeping the application executable file, number of processes, and input dataset unchanged, updates only the runtime parameter configurations related to MPI communication, in conjunction with the checkpoint file generated during the presampling phase.
[0060] Preferably, the resumption of the MPI application is triggered by an external scheduling script. By calling the MPI startup command (such as mpirun) and specifying the checkpoint file path and optimal communication parameters, each MPI process loads checkpoint data during the initialization phase and restores to the calculation state at the end of the presampling phase, thereby continuing to execute the subsequent calculation process.
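An external scheduling script could assemble the restart command along these lines. The two MCA parameter names come from the text above and `--mca <name> <value>` is the usual Open MPI form. The function name, the application path, and its `--resume` option are hypothetical placeholders for whatever checkpoint interface the application actually provides.

```python
def build_mpirun_args(num_procs, app_argv, eager_limit, frag_len):
    """Build an argument list that injects the tuned parameters as MCA key-value pairs."""
    return [
        "mpirun", "-np", str(num_procs),
        "--mca", "btl_tcp_eager_limit", str(eager_limit),
        "--mca", "btl_tcp_rdma_pipeline_send_length", str(frag_len),
        *app_argv,  # executable, process-independent inputs, checkpoint path
    ]


# Hypothetical application command line; only the MCA settings differ from
# the presampling launch.
args = build_mpirun_args(4, ["./app", "--resume", "ckpt_dir"], 62500, 30660)
```

Passing the list to `subprocess.run(args)` would launch the formal execution phase without going through a shell.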
[0061] S5: Merge the data generated in the pre-sampling phase and the formal execution phase to obtain the complete application execution result.
[0062] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for MPI communication optimization based on pre-sampling and adaptive parameter estimation, characterized in that it comprises: S1: Start the MPI application and run it for a preset number of time steps to form the presampling phase; S2: Collect communication behavior information of the MPI application during the pre-sampling stage, and calculate communication characteristics based on the communication behavior information; S3: Calculate the optimal communication parameters based on the communication characteristics; the optimal communication parameters include: the optimal eager threshold and the optimal message fragment length; S4: Inject the optimal communication parameters into the MPI application configuration, resume the MPI application's operation, and the MPI application enters the formal execution phase; S5: Merge the data generated in the pre-sampling phase and the formal execution phase to obtain the complete application execution result.
2. The MPI communication optimization method based on pre-sampling and adaptive parameter estimation according to claim 1, characterized in that the collection of communication behavior information of the MPI application during the pre-sampling phase includes: intercepting the MPI communication functions called by the MPI application using PMPI technology to obtain multiple communication events, i.e., the communication behavior information; where each communication event includes: sender, receiver, message size, message type, timestamp, and sequence number.
3. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 2, characterized in that the communication characteristics include: message concurrency, effective bandwidth, round-trip time (RTT), and message size distribution.
4. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 3, characterized in that the process of calculating the message size distribution includes: mapping the message of each communication event to preset bucket intervals according to message size, and counting the number of messages in each bucket interval to obtain the message size distribution.
5. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 3, characterized in that the process of calculating message concurrency includes: dividing the presampling stage into multiple time windows; pairing communication events using sequence numbers to obtain multiple synchronous point-to-point communication events; for each synchronous point-to-point communication event, if the timestamps of its two communication events are not in the same time window, the message corresponding to the synchronous point-to-point communication event is in an incomplete state; counting the number of messages in an incomplete state in each time window, and taking the maximum of the counts over all time windows as the message concurrency.
6. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 3, characterized in that the round-trip time (RTT) calculation process includes: pairing communication events using sequence numbers to obtain multiple synchronous point-to-point communication events; calculating the difference between the timestamps of the two communication events in each synchronous point-to-point communication event as the estimated round-trip time; and taking the median of the estimated round-trip times of all synchronous point-to-point communication events as the RTT.
7. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 3, characterized in that the calculation process for effective bandwidth includes: pairing communication events using sequence numbers to obtain multiple synchronous point-to-point communication events; selecting the synchronous point-to-point communication events whose message size is greater than a preset threshold, and calculating the difference between the timestamps of the two communication events in each selected synchronous point-to-point communication event as the transmission duration; calculating the ratio of message size to transmission duration for each selected event to obtain its single-transmission bandwidth; and taking the median of the single-transmission bandwidths of all selected events as the effective bandwidth.
8. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 3, characterized in that the optimal eager threshold is calculated as: $$\mathrm{BDP} = \mathrm{BW} \times \mathrm{RTT}$$; $$T_{\text{eager}} = \mathrm{clip}\left(\alpha \cdot \mathrm{BDP},\ \mathrm{LowerBound},\ \mathrm{UpperBound}\right)$$; where $\alpha$ is the empirical proportionality coefficient, BDP is the bandwidth-delay product, BW is the effective bandwidth, RTT is the round-trip time, LowerBound and UpperBound are the lower and upper bounds of the eager threshold, respectively, and $\mathrm{clip}$ denotes the function that truncates a value to the lower and upper bounds.
9. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 8, characterized in that the empirical proportionality coefficient is calculated as follows: the proportion of small messages is calculated based on the message size distribution, and the empirical proportionality coefficient $\alpha$ is derived from that proportion.
10. The MPI communication optimization method based on presampling and adaptive parameter estimation according to claim 8, characterized in that the optimal message fragment length is: $$L_{\text{frag}} = \mathrm{align}_{\mathrm{MSS}}\left(\frac{\mathrm{BDP}}{k}\right)$$; where k is the message concurrency, MSS is the maximum segment size of the link, and $\mathrm{align}_{\mathrm{MSS}}(x)$ denotes a rounding function that aligns x to MSS.