Method and apparatus for thermodynamic data processing in supercomputing systems based on memory mapping

By employing memory mapping technology and delayed synchronous parallelism (SSP) in the domestic supercomputing system, three-dimensional thermodynamic data is mapped to the kernel cache, solving the memory limitation problem, achieving efficient data processing and storage, and significantly improving computational efficiency and accuracy.

CN120406842B · Active · Publication Date: 2026-06-30 · HUNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUNAN UNIV
Filing Date
2025-04-22
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In a domestically developed supercomputing environment, when three-dimensional thermodynamic data is generated with the traditional MPI method, each computing node must store a complete three-dimensional array. The data size is therefore limited by node memory: data beyond the MB level cannot be generated, which results in memory bottlenecks and low computing efficiency.

Method used

A memory-mapping-based thermodynamic data processing method for supercomputing systems is adopted. By mapping data files to the kernel cache based on the array's starting address and one-dimensional index, the process optimizes data transmission and computation by combining memory mapping technology and delayed synchronous parallelism (SSP).

Benefits of technology

It effectively overcomes the memory bottleneck of generating ultra-large-scale data, reduces the number of I/O accesses, improves computing and storage efficiency, significantly optimizes data management and access, reduces computing time to 20% of the original, reduces CPU load, and increases system throughput.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to a method, apparatus, device, and storage medium for thermodynamic data processing in a supercomputing system based on memory mapping. A process uses the array's starting address and one-dimensional index in a user buffer to perform memory mapping calls on the corresponding portions of the data file within the one-dimensional array. This maps the data file to the kernel cache, which shares the user buffer, and data is then transferred between the kernel cache and hardware devices based on the mapped data file. Because the memory mapping calls are made on portions of the data file based on the array's starting address and one-dimensional index, memory congestion is avoided. This overcomes the memory bottleneck of generating ultra-large-scale data, reduces I/O access frequency, optimizes data management and access, and improves computational and storage efficiency.

Description

Technical Field

[0001] This application relates to the field of computer equipment technology, and in particular to a method and apparatus for processing thermodynamic data in a supercomputing system based on memory mapping.

Background Technology

[0002] Thermodynamic 3D data refers to high-dimensional datasets used to describe the distribution of thermodynamic state variables (such as temperature, pressure, density, internal energy, entropy, etc.) in three-dimensional space. This type of data typically originates from high-precision numerical simulations, experimental measurements, or theoretical calculations, and is widely used in research fields such as heat conduction analysis, computational fluid dynamics (CFD), and phase change process modeling. Due to increasing computational demands and improved experimental resolution, the size of thermodynamic 3D data often reaches the terabyte (TB) level, requiring efficient data storage and processing technologies.

[0003] Common data storage formats include HDF5, NetCDF, and custom binary formats, chosen to ensure high-throughput data read and write performance. For data processing, technologies such as MPI (Message Passing Interface) parallel computing and CUDA (Compute Unified Device Architecture) acceleration are typically used to optimize the efficiency of large-scale data access, computation, and visualization. In domestic supercomputing environments, when three-dimensional thermodynamic data is generated with the traditional MPI method, each computing node needs to create a complete three-dimensional array containing boundary temperatures. During traversal, the next temperature at each point is calculated from the temperatures of its six neighbors until the field converges to the external isothermal state. However, because each node needs to store a complete three-dimensional array, the data size is limited by memory: at most MB-level data can be generated, and further expansion leads to memory overflow. Overcoming the memory bottleneck in generating ultra-large-scale data is a problem that urgently needs to be solved.

Summary of the Invention

[0004] Therefore, it is necessary to provide a memory-mapped method, apparatus, device, and storage medium for thermodynamic data processing in supercomputing systems that can overcome the memory bottleneck of generating ultra-large-scale data, in order to address the above problems.

[0005] The first aspect of this application provides a method for processing thermodynamic data in a supercomputing system based on memory mapping, including:

[0006] Obtain the one-dimensional array obtained from the conversion of three-dimensional thermodynamic data in the user buffer, as well as the array's starting address and one-dimensional index;

[0007] The process uses the array's starting address and one-dimensional index to perform memory mapping calls on the corresponding portion of the data file in the one-dimensional array, mapping the data file to the kernel cache; the kernel cache shares the user buffer.

[0008] Data is transferred between the kernel cache and hardware devices based on the data files mapped in the kernel cache.

[0009] In one embodiment, the process using the data file performs a memory mapping call on the corresponding portion of the data file in the one-dimensional array based on the array's starting address and the one-dimensional index, mapping the data file to the kernel cache, including:

[0010] Based on the array's starting address, the one-dimensional index, and the preset process processing volume, determine the corresponding data file to be converted in the one-dimensional array;

[0011] The data file to be converted is divided into multiple data blocks according to the preset buffer capacity. Each data block is converted serially, and the converted data file is mapped to the kernel cache.

[0012] In one embodiment, the buffer capacity is the buffer capacity of the input buffer and the output buffer; the step of sequentially converting each of the data blocks serially and mapping the converted data file to the kernel cache includes:

[0013] Based on the one-dimensional index, a sliding window strategy is used to analyze the boundary data and internal data of the current data block in the input buffer. The boundary data amplitude is assigned to the corresponding position in the output buffer. After reading the adjacent data of the internal data, the corresponding position of the output is calculated using a thermodynamic calculation formula. The data file obtained from the output buffer is then mapped to the kernel cache.

[0014] In one embodiment, the memory mapping triggers a page table lookup when a page fault occurs via a request paging mechanism, dynamically loads the required file pages into the kernel cache, and dynamically swaps out low-priority pages when the kernel cache overflows using a page replacement algorithm.

[0015] In one embodiment, there are multiple processes. Each process determines the corresponding data file to be converted in the one-dimensional array based on the array's starting address, the one-dimensional index, and the offset. Each process performs memory mapping calls on the corresponding data file to be converted in parallel, mapping the data file to the kernel cache area. The offset is determined based on the process's processing capacity.

[0016] In one embodiment, the method further includes:

[0017] When the difference in computation steps between different processes reaches the preset maximum computation step threshold, a delayed synchronous parallel mechanism is used to adjust each process, and an error compensation mechanism is introduced to compensate for the data.

[0018] In one embodiment, the introduction of an error compensation mechanism for data compensation includes: after each process reads neighboring data from its internal data, it obtains the approximate true value of the current data and neighboring data at the same moment using an error compensation formula, and then writes the obtained data and the new calculation step composite data into the corresponding position of the buffer according to the thermodynamic calculation formula.

[0019] In one embodiment, the error compensation mechanism uses the following thermodynamic calculation formula:

[0020]

[0021] Here, $\kappa$, $\Delta t$, $h_x$, $h_y$, and $h_z$ are constants; $\theta^{n}[i][j][k]$ represents the temperature at position (i, j, k) at time n, and $\theta^{n+1}[i][j][k]$ represents the temperature at position (i, j, k) at time n+1.

[0022] In one embodiment, the error compensation mechanism uses the following error compensation formula:

[0023] $\theta^{n+m}[i][j][k] = \alpha^{m} \cdot \theta^{n}[i][j][k]$

[0024] $\alpha = W_x \cdot (2 + dx) + W_y \cdot (2 + dy) + W_z \cdot (2 + dz)$

[0025]

[0026] Here, $\alpha$ is a constant; $\theta^{n+m}[i][j][k]$ represents the temperature at point (i, j, k) at time n+m; and m is the time difference before and after compensation.
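The compensation rule of paragraph [0023] can be illustrated with a few lines of Python. This is a minimal sketch that assumes the formula reads as a power law, with the temperature m steps later approximated by multiplying the known value by α raised to the power m. The function name `compensate` and the numeric values are hypothetical, and α is passed in directly because the weights in paragraph [0024] are not defined in this text.

```python
def compensate(theta_n, alpha, m):
    """Approximate the temperature m steps after time n.

    Assumes the reading theta^(n+m) = alpha**m * theta^n of paragraph [0023].
    alpha is supplied by the caller; how it is derived is not modelled here.
    """
    return (alpha ** m) * theta_n


# A neighbour that lags 3 steps behind, with an illustrative alpha of 0.5.
estimate = compensate(80.0, 0.5, 3)
```

With m = 0 the value is returned unchanged, which matches the case where the two processes are at the same computation step.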

[0027] A second aspect of this application provides a memory-mapped supercomputing system thermodynamic data processing device, comprising:

[0028] The data acquisition module is used to acquire a one-dimensional array obtained by converting three-dimensional thermodynamic data in the user buffer, as well as the array starting address and one-dimensional index of the one-dimensional array;

[0029] The data mapping module is used by a process to perform memory mapping calls on the corresponding part of the data file in the one-dimensional array based on the array's starting address and one-dimensional index, mapping the data file to the kernel cache area; the kernel cache area shares the user buffer.

[0030] The data transmission module is used to transmit data between hardware devices based on the data files mapped in the kernel cache.

[0031] A third aspect of this application provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method.

[0032] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method.

[0033] The aforementioned memory-mapped supercomputing system thermodynamic data processing method, apparatus, device, and storage medium use a process to perform memory mapping calls on the corresponding portions of the data file within a one-dimensional array, based on the array's starting address and one-dimensional index in the user buffer. This maps the data file to the kernel cache, which shares the user buffer, and data is then transferred between the kernel cache and hardware devices based on the mapped data file. Because the memory mapping calls are made on portions of the data file within the one-dimensional array based on the array's starting address and one-dimensional index, memory congestion is avoided. This overcomes the memory bottleneck of generating ultra-large-scale data, reduces I/O access frequency, optimizes data management and access, and improves computational and storage efficiency.

Description of the Drawings

[0034] Figure 1 is a flowchart of a memory-mapped supercomputing system thermodynamic data processing method in one embodiment;

[0035] Figure 2 is a schematic diagram of the thermodynamic 3D data composition in the X, Y, and Z dimensions in one embodiment;

[0036] Figure 3 is a schematic diagram illustrating the process of converting three-dimensional data into one-dimensional data in one embodiment;

[0037] Figure 4 is a schematic diagram of partial file mapping in one embodiment;

[0038] Figure 5 is a schematic diagram of the mmap I/O operation process in one embodiment;

[0039] Figure 6 is a flowchart, in one embodiment, of a process performing memory mapping calls on the corresponding part of the data file in the one-dimensional array based on the array's starting address and one-dimensional index, thus mapping the data file to the kernel cache area;

[0040] Figure 7 is a schematic diagram of the process for generating thermodynamic data based on memory mapping and buffers in a single computation step and single process in one embodiment;

[0041] Figure 8 is a schematic diagram illustrating the total time composition of thermodynamic data generation under the BSP framework in one embodiment;

[0042] Figure 9 is a schematic diagram illustrating the total time composition of thermodynamic data generation under the SSP framework in one embodiment;

[0043] Figure 10 is a schematic diagram comparing the thermodynamic data generation time of mmap + SSP + error compensation with that of mmap alone in one embodiment;

[0044] Figure 11 is a schematic diagram of the process for generating thermodynamic data based on memory mapping + SSP + error compensation in a single computation step and single process in one embodiment;

[0045] Figure 12 is a structural block diagram of a memory-mapped supercomputing system thermodynamic data processing device in one embodiment;

[0046] Figure 13 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

[0047] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0048] In domestic supercomputing environments, due to limited memory on a single node, using MPI parallel computing combined with buffers generates a large number of I/O (input/output) requests. To efficiently generate and store ultra-large-scale 3D thermodynamic data, this application provides a memory-mapped method for processing thermodynamic data in supercomputing systems. In this method, a process performs memory mapping calls on the corresponding portions of the data file in a one-dimensional array based on the array's starting address and one-dimensional index in the user buffer. This maps the data file to the kernel cache, which shares the user buffer, and data is then transferred between the kernel cache and hardware devices based on the mapped data file. Because the memory mapping calls are made on portions of the data file in the converted one-dimensional array based on the array's starting address and one-dimensional index, memory congestion is avoided. This overcomes the memory bottleneck of generating ultra-large-scale data, reduces the number of I/O accesses, optimizes data management and access, and improves computational and storage efficiency.

[0049] In one embodiment, as shown in Figure 1, a method for processing thermodynamic data in a supercomputing system based on memory mapping is provided, including:

[0050] Step S110: Obtain the one-dimensional array obtained from the conversion of the three-dimensional thermodynamic data in the user buffer, as well as the array's starting address and one-dimensional index.

[0051] The array's starting address (mapped_data_input/output) is the first address of the one-dimensional array stored in the user buffer and is used for subsequent data reading. Each data point in the one-dimensional array corresponds to a one-dimensional index (Index_one_dimension), which is used to determine whether the data point is boundary data or internal data. As shown in Figure 2, the three-dimensional thermodynamic data is based on a regular cubic grid, where each grid point corresponds to a physical location in space and stores the temperature value at that point. Typically, the data for these grid points is stored as single-precision or double-precision floating-point numbers. In the example shown in Figure 2, the grid consists of 2 × 4 × 2 = 16 cells, each containing a temperature data value. Since the calculation of each data point depends on its six adjacent points, generating data of size x * y * z requires 6 * x * y * z data accesses, and disk I/O latency is significant. Traditional I/O modes frequently switch between user mode and kernel mode, limiting computational efficiency. On the Tianhe next-generation supercomputing system, the traditional method uses 100 nodes and 1600 cores to generate 5 TB of data. Due to the limited memory of a single node, using MPI parallel computing combined with buffers generates a large number of I/O requests, so that each computation step takes 13 hours, severely restricting the speed of generating large-scale data.

[0052] To overcome the memory bottleneck of generating ultra-large-scale data, a buffer mechanism is used to store the data in one dimension, and memory usage is controlled by a block-based streaming input method, thereby avoiding memory explosion and enabling the program to run stably. Figure 3 is a schematic diagram of converting 3D data to 1D storage. Taking a 3*3*3 data block as an example, the numbers 0-26 represent the 1D indices of the corresponding points, arranged in the Z direction first, then the Y direction, and finally the X direction. Except for the data at index 13, which is internal data, all other points are boundary data (constant room temperature). If the 3D data to be generated has size x_size*y_size*z_size, then, taking the boundary conditions into account, the size of the data actually calculated is x_glo*y_glo*z_glo, where x_glo=x_size+2, y_glo=y_size+2, and z_glo=z_size+2. Therefore, the formula for converting the 3D array index data[i][j][k] to the 1D index Index_one_dimension is:

[0053] Index_one_dimension = k + j*z_glo + i*z_glo*y_glo

[0054] For example, the first data element, data[0][0][0], corresponds to the one-dimensional index 0, and the last data element, data[2][2][2], corresponds to 26. Similarly, one-dimensional indices can be converted back to three-dimensional array indices.
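The index conversion in both directions can be sketched as follows. The function names `to_1d` and `to_3d` are hypothetical; the arithmetic follows the formula of paragraph [0053], with z varying fastest, then y, then x.

```python
def to_1d(i, j, k, y_glo, z_glo):
    """Flatten the 3D index (i, j, k): k + j*z_glo + i*z_glo*y_glo."""
    return k + j * z_glo + i * z_glo * y_glo


def to_3d(index, y_glo, z_glo):
    """Invert to_1d and recover (i, j, k) from a one-dimensional index."""
    i, rest = divmod(index, z_glo * y_glo)
    j, k = divmod(rest, z_glo)
    return i, j, k


# The 3*3*3 example: first element, last element, and the single interior point.
first = to_1d(0, 0, 0, 3, 3)
last = to_1d(2, 2, 2, 3, 3)
centre = to_3d(13, 3, 3)
```

Note that x_glo is not needed for either conversion; it only bounds the valid range of i.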

[0055] The one-dimensional array `data` stores index information: `data[0]` stores the x-dimensional index, `data[1]` stores the y-dimensional index, and `data[2]` stores the z-dimensional index. By converting the one-dimensional index back to three dimensions, it is easy to determine whether a data point is boundary data. With indices starting at 0, as in the 3*3*3 example, a point is boundary data if `data[0]` equals 0 or `x_glo - 1`, or `data[1]` equals 0 or `y_glo - 1`, or `data[2]` equals 0 or `z_glo - 1`; such a point is put directly into the output buffer without further processing. Otherwise, the data of the six points adjacent to it in three-dimensional space are combined using the thermodynamic formula, and the result is written to the output buffer. In the one-dimensional index, the distances to these six neighbors are ±z_glo*y_glo, ±z_glo, and ±1, respectively.
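The boundary test and the six neighbor offsets can be sketched in Python. This is an illustration under the assumption of zero-based indices, so the last index along x is `x_glo - 1` (which is what makes index 13 the only interior point of the 3*3*3 example). The names `is_boundary` and `neighbor_indices` are hypothetical.

```python
def is_boundary(index, x_glo, y_glo, z_glo):
    """True if the point at this one-dimensional index lies on a grid face."""
    i, rest = divmod(index, y_glo * z_glo)
    j, k = divmod(rest, z_glo)
    return (i in (0, x_glo - 1)
            or j in (0, y_glo - 1)
            or k in (0, z_glo - 1))


def neighbor_indices(index, y_glo, z_glo):
    """One-dimensional indices of the six neighbours of an interior point."""
    plane = z_glo * y_glo            # stride of one step along x
    return [index - plane, index + plane,    # x - 1, x + 1
            index - z_glo, index + z_glo,    # y - 1, y + 1
            index - 1, index + 1]            # z - 1, z + 1


interior = [n for n in range(27) if not is_boundary(n, 3, 3, 3)]
around_centre = neighbor_indices(13, 3, 3)
```

`neighbor_indices` is only meaningful for interior points; for a boundary point some of the returned indices would wrap into an unrelated row or fall outside the array.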

[0056] Step S120: The process uses the array's starting address and one-dimensional index to perform memory mapping calls on the corresponding part of the data file in the one-dimensional array, mapping the data file to the kernel cache area.

[0057] The kernel cache shares the user buffer. mmap (memory mapping) refers to directly mapping a memory region in user space to kernel space, enabling data sharing between user and kernel modes. Memory mapping is a highly efficient I/O mechanism based on the operating system's virtual memory management, providing a transparent and high-performance file access method. Through the mmap() system call, file contents are logically mapped directly into the process's address space, allowing the process to read and write file data as if accessing memory, without explicitly calling read() and write() for data transfer. This mechanism greatly improves I/O processing efficiency and is especially suitable for large-scale data computation and high-performance computing (HPC) scenarios. As shown in Figure 4, after successful memory mapping, user modifications to the memory region are directly synchronized to the kernel space, and vice versa, thus avoiding the data copying overhead of the traditional read/write method. In scenarios requiring frequent large-scale data transfers, memory mapping can significantly improve data access efficiency and reduce system overhead.

[0058] With this memory mapping in place, processes can directly read and write the mapped memory region, which suits reading and writing the three-dimensional data flattened to one dimension as described earlier. The system automatically writes dirty pages in the page cache back to the disk file, thus completing file operations without explicit system calls such as read/write. Furthermore, memory mapping not only improves single-process I/O efficiency but also supports multiple processes sharing the same file mapping region. Kernel modifications to this mapping region are directly synchronized to the user space of all processes, which supports efficient file sharing between different processes and enables efficient inter-process communication (IPC). This makes it particularly suitable for distributed and parallel computing applications, such as data exchange on HPC systems.
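Reading and writing a file through a mapping, without read() or write() calls on the data itself, can be sketched with Python's standard `mmap` module. This is only an illustration of the mechanism on a tiny temporary file; the file name `thermo.bin` and the values are hypothetical, and the explicit `flush()` is there to make the write-back deterministic for the demonstration.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("d")  # bytes per double-precision value

# A small file of 8 doubles stands in for the large data file.
path = os.path.join(tempfile.mkdtemp(), "thermo.bin")
with open(path, "wb") as f:
    f.write(struct.pack("8d", *([20.0] * 8)))

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:          # length 0 maps the whole file
        before = struct.unpack_from("d", mapped, 3 * ITEM)[0]   # plain memory read
        struct.pack_into("d", mapped, 3 * ITEM, 37.5)           # plain memory write
        mapped.flush()                                # write dirty pages back now

# Re-read the file the ordinary way to confirm the change reached it.
with open(path, "rb") as f:
    on_disk = struct.unpack("8d", f.read())
```

After the mapping is closed, `on_disk[3]` holds the value written through memory, while the other entries are unchanged.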

[0059] Specifically, memory mapping technology can map extremely large files to memory because it relies on virtual memory technology at its core. The addressing capabilities of a 64-bit operating system can cover terabytes of data. Even if physical memory cannot fully accommodate it, demand paging can trigger page table lookups during page faults, loading pages from secondary storage into memory. This dynamically loads the required file pages into memory instead of loading the entire file at once, significantly reducing I / O overhead. This allows even terabyte-sized files to be accessed efficiently without consuming excessive physical memory. Furthermore, page replacement algorithms dynamically swap out low-priority pages when memory overflows, ensuring stable system operation while providing near-memory access speeds.
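Mapping only a window of a file, rather than the whole file, can be sketched as below. Python's `mmap` requires the file offset to be a multiple of `mmap.ALLOCATIONGRANULARITY`, so the sketch rounds the start down to an aligned offset and returns the remaining byte shift. The helper name `map_window` and the test file are hypothetical.

```python
import mmap
import os
import struct
import tempfile
from array import array

ITEM = struct.calcsize("d")
GRAN = mmap.ALLOCATIONGRANULARITY   # mmap offsets must be multiples of this


def map_window(fileobj, first_item, item_count):
    """Map only the doubles [first_item, first_item + item_count).

    Returns (mapping, delta), where delta is the byte position of
    first_item inside the mapping after rounding down to GRAN.
    """
    start = first_item * ITEM
    aligned = (start // GRAN) * GRAN
    delta = start - aligned
    length = delta + item_count * ITEM
    mapped = mmap.mmap(fileobj.fileno(), length, offset=aligned,
                       access=mmap.ACCESS_READ)
    return mapped, delta


# Build a file spanning three allocation units; item n holds the value n.
n_items = 3 * GRAN // ITEM
path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(path, "wb") as f:
    array("d", range(n_items)).tofile(f)

first_item = n_items // 2 + 5
with open(path, "rb") as f:
    mapped, delta = map_window(f, first_item, 4)
    window = struct.unpack_from("4d", mapped, delta)
    mapped.close()
```

Only the pages actually touched inside the window are loaded by the operating system, which is the demand-paging behavior the paragraph above relies on.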

[0060] In one embodiment, there are multiple processes. Each process determines the corresponding data file to be converted in the one-dimensional array based on the array's starting address, one-dimensional index, and offset. Each process performs memory mapping calls on the corresponding data file to be converted in parallel, mapping the data file to the kernel cache.

[0061] The offset is determined by the per-process processing volume, which in turn is determined by the total size of the one-dimensional array and the number of processes. For example, if the total size of the one-dimensional array is M and the number of processes is N, then the processing volume of a single process is M/N, and the offsets of the starting addresses of the data read by the N processes are 0, M/N, 2M/N, ..., (N-1)M/N, respectively. That is, the first process is responsible for memory mapping calls on the data file from "array starting address" to "array starting address + M/N - 1", the second process for the data file from "array starting address + M/N" to "array starting address + 2M/N - 1", the third process for the data file from "array starting address + 2M/N" to "array starting address + 3M/N - 1", and so on. The data file of the one-dimensional array is evenly distributed among the processes for memory mapping calls, and the processes execute in parallel, improving data processing efficiency. Furthermore, within each process, the corresponding data file to be converted can be divided into multiple data blocks based on the size of the internal cache. The internal cache is then used to perform memory mapping calls on each data block in sequence, thus avoiding memory blockage due to excessive data size.
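The even split among processes can be sketched as a small helper that returns one half-open item range per process. The name `partition` is hypothetical, and handing the remainder to the lowest ranks when M is not divisible by N is an assumption of this sketch; the text only describes the evenly divisible case.

```python
def partition(total_items, num_procs):
    """Return one [start, end) item range per process.

    When total_items is not a multiple of num_procs, the first
    (total_items % num_procs) ranks each take one extra item.
    """
    base, extra = divmod(total_items, num_procs)
    ranges = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


even = partition(12, 3)     # M = 12, N = 3: offsets 0, 4, 8
uneven = partition(10, 3)   # remainder spread over the first rank
```

In an MPI program each rank would typically pick its own entry, for example `partition(M, size)[rank]`, and map only that part of the file.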

[0062] It is understood that in other embodiments, the data file of the one-dimensional array may not be evenly distributed among the processes. Instead, the processing volume of each process may be allocated according to actual needs, and memory-mapped calls may be executed in parallel by multiple processes.

[0063] Step S130: Data is transferred between the kernel cache and the hardware device based on the data file mapped in the kernel cache. Since the kernel cache shares the user buffer, the CPU can directly transfer the mapped data file from the kernel cache to the hardware device for storage, eliminating the need for multiple data reads and writes from the user buffer to move the data file to the kernel cache, thus reducing the number of I/O accesses.

[0064] Specifically, as shown in Figure 5, the hardware device is a hard disk, and the kernel cache area uses a dirty page cache. In traditional I/O mechanisms, the data transfer path is: hard disk (DMA) → kernel buffer (CPU) → user buffer (CPU). In Linux's cached I/O system, data must first be copied from the disk to the kernel space buffer, managed by the Page Cache, and then copied from the kernel buffer to the application address space for user-mode numerical computation. During this process, data undergoes multiple copies between the application address space and the cache, leading to a significant increase in CPU and memory overhead. In ultra-large-scale data processing scenarios, the I/O bottleneck is particularly prominent, severely impacting computational efficiency.

[0065] Under the mmap mechanism, the data transfer path is optimized to: disk (DMA) → kernel buffer, shared with the user buffer (CPU). First, only one mmap() system call is needed to map the file into the process address space. Subsequent access is treated like a memory operation, eliminating the need for frequent read/write calls and thus avoiding frequent switching between user and kernel modes. Second, because the system kernel buffer and user buffer are shared, mmap removes one copy operation from the system cache to the application cache when accessing disk data. This is equivalent to disk data being directly transferred to the user buffer via DMA, reducing CPU overhead, improving data access efficiency, and significantly optimizing system performance.

[0066] mmap offers significant advantages in large-scale data computation. First, it reduces I/O bottlenecks and improves computational efficiency. In ultra-large data computing tasks, the limited memory of each node leads to frequent data copying and system calls under traditional I/O methods, which easily become a performance bottleneck. Using mmap() significantly reduces the CPU and memory overhead of data transfer and increases the throughput of computing nodes. Second, it is suitable for high-performance parallel computing. In domestic supercomputing environments, large-scale data often needs to be processed in parallel across multiple computing nodes. Combining mmap() with MPI parallel programming reduces data exchange overhead between processes, improves data loading speed, and optimizes multi-node data sharing mechanisms. Third, it supports efficient one-dimensional storage of three-dimensional data. In scientific computing and visualization rendering, three-dimensional data typically needs to be flattened into one-dimensional arrays for storage and computation. mmap() can map very large files to a contiguous virtual address space, making the access mode of three-dimensional data closer to memory operations, improving data locality, and optimizing caching performance.

[0067] In one embodiment, as shown in Figure 6, step S120 includes steps S122 and S124.

[0068] Step S122: Determine the corresponding data files to be converted in the one-dimensional array based on the array's starting address, one-dimensional index, and preset process processing capacity. For ease of understanding, taking the first process as an example, the total size of the one-dimensional array is M, and the number of processes is N. Based on the array's starting address, one-dimensional index, and preset process processing capacity, the data files to be converted for the first process can be determined as the data files from "array starting address" to "array starting address + M / N - 1" in the one-dimensional array.

[0069] Step S124: Divide the data file to be converted into multiple data blocks according to the preset buffer capacity, convert each data block serially in sequence, and map the converted data file to the kernel cache. For example, as shown in Figure 7, the buffer capacity n is the capacity of both the input buffer buffer0 and the output buffer buffer1. Taking the first process as an example, the data file to be converted is divided into multiple data blocks according to the buffer capacity n. Each time, one data block in the input buffer buffer0 is converted and the converted data is output to the output buffer buffer1; then the next data block in the input buffer buffer0 is converted. Following this serial conversion process, each process divides its file mapping part (old) of the large one-dimensional array (old) in external storage into data blocks and converts them sequentially, finally obtaining the corresponding file mapping part (new). The file mapping parts (new) obtained by all processes in parallel are merged into the large one-dimensional data (new) in external storage. That is, the converted file of the one-dimensional array is obtained in the user buffer and synchronously mapped to the kernel cache, so that data can be transferred between the kernel cache and the hardware device based on the mapped data file.
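Dividing one process's share into blocks of at most n items can be sketched with a generator. The name `blocks` is hypothetical, and letting the final block be shorter than the buffer capacity is an assumption of this sketch.

```python
def blocks(start, end, capacity):
    """Yield [block_start, block_end) ranges of at most `capacity` items
    covering [start, end), in the order they would be converted."""
    for block_start in range(start, end, capacity):
        yield block_start, min(block_start + capacity, end)


# A process responsible for items 0..9 with a buffer capacity of n = 4.
schedule = list(blocks(0, 10, 4))
```

Each yielded range corresponds to one fill of the input buffer followed by one conversion into the output buffer.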

[0070] Further, in step S124, each data block is sequentially converted serially, and the converted data file is mapped to the kernel cache. This includes: analyzing the boundary data and internal data of the current data block in the input buffer using a sliding window strategy based on a one-dimensional index, setting the boundary data magnitude to the corresponding position in the output buffer, and after reading the adjacent data of the internal data, calculating the corresponding position of the output buffer using a thermodynamic calculation formula, and mapping the converted data file from the output buffer to the kernel cache.

[0071] Continuing to refer to Figure 7, both reading and writing operations are performed on the one-dimensional array `mapped_data_read / write`, treating large files in external storage as if they were arrays in memory. First, the entire three-dimensional thermodynamic data is logically mapped to a one-dimensional array, returning the array's starting address (`mapped_data_input / output`). A single calculation step requires only one `mmap` call to map the file to the process's address space; subsequent access is treated like memory operations, eliminating the need for frequent `read / write` calls and avoiding frequent switching between user and kernel modes. Each process uses a dedicated one-dimensional index `index_one_dimension` for boundary condition checks, so that internal data is processed by direct access from memory instead of multiple reads from external storage, improving read efficiency. The input buffer `buffer0` and output buffer `buffer1` employ a sliding window strategy (`buffer0 / 1 = mapped_data_input / output + index_one_dimension`) so that the window is adjusted dynamically at the end of the traversal, preventing memory overflow.

[0072] Taking the first process as an example, the capacity of the input buffer buffer0 and the output buffer buffer1 is n. The process uses the one-dimensional index `index_one_dimension` to determine whether a data item is boundary data. It iterates over the elements of the data block in buffer0 from 0 to n-1, incrementing `index_one_dimension` by 1 on each iteration, and converts `index_one_dimension` to a three-dimensional index to decide whether the data at that position is boundary data. Boundary data is copied directly to the corresponding position in the output buffer buffer1. Internal data is handled differently. For example, in Figure 7 the data corresponding to the one-dimensional indices i-2 to j (highlighted in red) are internal data. For the one-dimensional index i-2, the neighboring data at `index_one_dimension` ± z_glo*y_glo, ± z_glo, and ± 1 (relative to the current internal data index) are read from the one-dimensional array returned by mmap, the result is calculated using the thermodynamic formula, and it is written to position i-2 of the output buffer buffer1. The same applies to the other internal data. Because buffer1 lies within the mmap mapping, modifications to it are propagated directly to external storage.
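The index arithmetic described here can be sketched as follows. The neighbor offsets ± z_glo*y_glo, ± z_glo, and ± 1 suggest a layout in which `index = (x * y_glo + y) * z_glo + z`; that layout, the extent name `x_glo`, and the function names are assumptions made for illustration.

```python
def to_3d(index, y_glo, z_glo):
    """Convert a one-dimensional index to (x, y, z) under the assumed layout."""
    x, rem = divmod(index, y_glo * z_glo)
    y, z = divmod(rem, z_glo)
    return x, y, z


def is_boundary(index, x_glo, y_glo, z_glo):
    """A point is boundary data if any coordinate lies on a face of the grid."""
    x, y, z = to_3d(index, y_glo, z_glo)
    return x in (0, x_glo - 1) or y in (0, y_glo - 1) or z in (0, z_glo - 1)


def neighbor_indices(index, y_glo, z_glo):
    """One-dimensional indices of the six face neighbors of an internal point."""
    plane = y_glo * z_glo
    return [index - plane, index + plane,
            index - z_glo, index + z_glo,
            index - 1, index + 1]


# Center of a 3 x 3 x 3 grid: (1, 1, 1) -> index 13.
print(to_3d(13, 3, 3))             # (1, 1, 1)
print(is_boundary(13, 3, 3, 3))    # False
print(is_boundary(0, 3, 3, 3))     # True
print(neighbor_indices(13, 3, 3))  # [4, 22, 10, 16, 12, 14]
```

`neighbor_indices` is only meaningful for internal points; for boundary points the offsets would wrap into unrelated rows or planes, which is why the boundary check comes first.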

[0073] In one embodiment, the method further includes: when the difference in computation steps between different processes reaches a preset maximum computation step threshold, using a delayed synchronous parallel mechanism to adjust each process, and introducing an error compensation mechanism to compensate for data.

[0074] As shown in Figure 8, the total time to generate thermodynamic data consists of the single-step computation time (Tc) and the synchronization barrier time (Tb). Under the Bulk Synchronous Parallel (BSP) framework, the single-step computation time Tc is determined by the bottleneck effect: global progress is limited by the slowest process (e.g., Rank2 among Rank0 to Rank3). When many processes are allocated, for example tens of thousands of cores, the communication cost in the synchronization barrier time Tb cannot be ignored. Tb originates from the communication overhead incurred when all processes align their states through explicit synchronization barriers (such as MPI_Barrier) after each computation step. To address this performance bottleneck, this application first uses memory mapping (mmap) to improve the local computation efficiency of the slowest process, reducing its single-step time Tc by reducing disk I / O operations. In addition, a delayed synchronous parallel (Stale Synchronous Parallel, SSP) mechanism is introduced, which allows a controllable step lag between processes and thereby significantly reduces the number of synchronization barriers triggered and the time Tb. At the same time, Tc is further reduced by merging high-frequency mmap operations. The SSP framework also embeds an error compensation algorithm to offset the potential accuracy loss of asynchronous computation. Through the joint optimization of mmap and SSP, the system obtains a second level of acceleration on top of memory mapping, balancing computational efficiency and convergence accuracy, and is particularly suitable for distributed thermodynamic modeling in heterogeneous hardware environments.

[0075] Stale Synchronous Parallel (SSP) is a parallel computing paradigm originating from distributed machine learning training. At the level of the synchronization mechanism, it provides a flexible coordination model that lies between Bulk Synchronous Parallel (BSP) and Asynchronous Parallel (ASP). The core of SSP is a maximum staleness threshold: the maximum computation step threshold (Threshold) strictly bounds the number of computation steps by which any worker node may run ahead of the globally slowest node. When the difference between a node's progress and that of the slowest node reaches Threshold, the node enters a blocking state until the lagging node catches up and the difference falls back below Threshold.

[0076] In particular, SSP exhibits compatibility with the classic model: when threshold = 0, SSP degenerates into a strict BSP mode, requiring all nodes to align according to the synchronization barrier; when threshold → +∞, SSP degenerates into an ASP mode, completely eliminating the waiting overhead between nodes.
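The blocking rule and its two limiting cases can be sketched in a few lines of Python. The function name `may_advance` and the inclusive comparison are assumptions: the inclusive form is chosen here so that a threshold of 0 reproduces BSP without blocking the slowest process itself, whereas a strict reading of "reaches the threshold" would use `<` instead.

```python
import math


def may_advance(my_step, all_steps, threshold):
    """Return True if a worker that has completed `my_step` steps may start another.

    Assumption: a worker may run at most `threshold` completed steps ahead
    of the slowest worker before it has to wait.
    """
    return my_step - min(all_steps) <= threshold


steps = [5, 3, 4, 3]  # completed steps of four workers
print([may_advance(s, steps, 1) for s in steps])         # [False, True, True, True]
print([may_advance(s, steps, 0) for s in steps])         # [False, True, False, True]
print([may_advance(s, steps, math.inf) for s in steps])  # [True, True, True, True]
```

With a threshold of 0 only the workers tied with the slowest one proceed, which is the BSP behavior; with an infinite threshold nobody ever waits, which is the ASP behavior.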

[0077] Figure 9 illustrates the time-series decomposition of thermodynamic data generation under the SSP framework. By dynamically balancing computational efficiency against consistency error, this mechanism significantly mitigates the negative impact of straggler (lagging) nodes on overall throughput in long-running tasks. When the current process is ahead of the slowest process by Threshold computation steps, it must wait until the slowest process catches up. It can be seen that, apart from the final synchronization, the SSP framework does not require the frequent synchronization of the BSP framework, which greatly reduces Tb.

[0078] As can be seen from Figure 7, if N computation steps are required, N mmap calls are needed. Each call is essentially still an I / O operation, and its time overhead cannot be ignored. However, by using the SSP framework together with the error compensation mechanism, the N mmap calls can be reduced to a single mmap call, which greatly reduces the Tc_i of each process and thereby reduces Tc.

[0079] Although the SSP framework can improve computation speed, an unavoidable problem is that the processes compute asynchronously. In computation step S, the current point needs data from its six adjacent points in three dimensions. However, the data at these six points may have been calculated in computation step S-m by a slower process, or in computation step S+m by a faster process (where m is any integer value). Directly combining data from different time steps would produce a large precision offset and incorrect data. Therefore, an error compensation mechanism is needed to ensure the accuracy of the data.

[0080] Specifically, introducing the error compensation mechanism to compensate the data includes: after each process reads the neighboring data of its internal data, it uses the error compensation formula to obtain the approximate true values of the current data and the neighboring data at the same moment, then applies the thermodynamic calculation formula and writes the resulting data, merged with the new calculation step, into the corresponding position of the buffer.

[0081] The thermodynamic calculation formula used in the error compensation mechanism is as follows:

[0082]

[0083] Here, κ, Δt, $h_x$, $h_y$, and $h_z$ are constants. $\theta^{n}[i][j][k]$ represents the temperature at time n at position (i, j, k), and $\theta^{n+1}[i][j][k]$ represents the temperature at position (i, j, k) at time n+1. Likewise, $\theta^{n}[i+1][j][k]$, $\theta^{n}[i-1][j][k]$, $\theta^{n}[i][j+1][k]$, $\theta^{n}[i][j-1][k]$, $\theta^{n}[i][j][k+1]$, and $\theta^{n}[i][j][k-1]$ represent the temperatures at time n at positions (i+1, j, k), (i-1, j, k), (i, j+1, k), (i, j-1, k), (i, j, k+1), and (i, j, k-1), respectively.

[0084] Due to spatial continuity, when the space is discretized finely enough, the temperatures of adjacent points can be approximated as equal, i.e., $\theta^{n}[i][j][k] \approx \theta^{n}[i-1][j][k] \approx \theta^{n}[i+1][j][k] \approx \theta^{n}[i][j-1][k] \approx \theta^{n}[i][j+1][k] \approx \theta^{n}[i][j][k-1] \approx \theta^{n}[i][j][k+1]$. Under this approximation, the above expression can be simplified to the following form:

[0085] $\theta^{n+1}[i][j][k] = \alpha \cdot \theta^{n}[i][j][k]$

[0086] Where the constant α:

[0087] $\alpha = W_x \cdot (2 + dx) + W_y \cdot (2 + dy) + W_z \cdot (2 + dz)$

[0088]

[0089] Furthermore, the error compensation formula used by the error compensation mechanism can be obtained as follows:

[0090] $\theta^{n+m}[i][j][k] = \alpha^{m} \cdot \theta^{n}[i][j][k]$

[0091] Here, $\theta^{n+m}[i][j][k]$ represents the temperature at position (i, j, k) at time n+m, and m is the time difference before and after compensation. This formula makes it possible to convert the values of the six data points adjacent to the current position, whatever computation step each was produced in, into values at the same time (the same computation step S) as the current position data. When a suitable maximum computation step threshold (Threshold) is selected within the SSP framework, the absolute value of m does not become too large, which preserves the accuracy and precision of the data. The above describes the principle of error compensation.
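A minimal Python sketch of this compensation step follows. The function name `compensate` and the value of α are illustrative assumptions; in the application α is derived from the discretization constants, and the value 0.5 below is chosen only so the arithmetic is exact.

```python
def compensate(value, from_step, to_step, alpha):
    """Estimate the value at `to_step` from one stored at `from_step`.

    Applies theta^(n+m) = alpha**m * theta^n with m = to_step - from_step.
    m may be negative when the neighbor comes from a faster process.
    """
    m = to_step - from_step
    return value * alpha ** m


ALPHA = 0.5  # illustrative value only, not a physical constant

# Neighbor computed by a slower process (step 3), needed at step 5.
print(compensate(8.0, from_step=3, to_step=5, alpha=ALPHA))  # 2.0
# Neighbor computed by a faster process (step 5), needed at step 3.
print(compensate(8.0, from_step=5, to_step=3, alpha=ALPHA))  # 32.0
```

Because the correction grows as `alpha ** m`, keeping the SSP threshold small keeps |m| small and limits how far the estimate can drift from the true value.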

[0092] Next, for each point it is necessary to store both the floating-point temperature data f and the integer calculation step s. The floating-point data is a 32-bit float and the integer data is a 32-bit int. For storage they are merged into one 64-bit value, with the float occupying the high 32 bits and the int occupying the low 32 bits; for calculation the value is decomposed again. This ensures that a process reads and modifies the data and its calculation step together, avoiding read-write inconsistency. It also follows that the file needs to be initialized only once, that is, the mmap mapping is performed only once. After that, data can be read from and written to this file, and N calculation steps no longer require N mmap mappings. Although merging and decomposing the data takes time, it is done entirely in memory; compared with the external-storage I / O time of mmap it is almost negligible, so Tc is greatly reduced.
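The 64-bit layout described above can be sketched in Python with the standard `struct` module. The function names `pack_cell` and `unpack_cell` are hypothetical, and the sketch only demonstrates the bit layout; it does not model how a compiled implementation would load and store the 64-bit word.

```python
import struct


def pack_cell(temperature, step):
    """Merge a float32 and an int32 into one 64-bit integer.

    The float's bit pattern occupies the high 32 bits, the step the low 32 bits.
    """
    (f_bits,) = struct.unpack("<I", struct.pack("<f", temperature))
    (s_bits,) = struct.unpack("<I", struct.pack("<i", step))
    return (f_bits << 32) | s_bits


def unpack_cell(cell):
    """Split a 64-bit cell back into (temperature, step)."""
    (temperature,) = struct.unpack("<f", struct.pack("<I", cell >> 32))
    (step,) = struct.unpack("<i", struct.pack("<I", cell & 0xFFFFFFFF))
    return temperature, step


cell = pack_cell(1.5, 7)
print(hex(cell))          # 0x3fc0000000000007
print(unpack_cell(cell))  # (1.5, 7)
```

Temperatures that are not exactly representable as float32 are rounded when packed, so a round trip returns the nearest float32 value rather than the original Python float.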

[0093] Figure 10 compares the thermodynamic data generation time with the mmap + SSP + error compensation mechanism against the time with mmap alone. The mmap + SSP + error compensation method incurs only one global barrier and one mmap initialization call, whereas the original mmap method requires N global barriers and N mmap calls for N calculation steps. Therefore, the mmap + SSP + error compensation method is superior.
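The comparison can be summarized with a simplified cost model. This model is an assumption added for illustration and is not given in the application: $T_{map}$ is a hypothetical symbol for the cost of one mmap call, $T_c'$ is the per-step computation time excluding mapping, $T_b$ is the cost of one global barrier, and any time spent waiting at the SSP threshold is ignored.

$$T_{\text{mmap}} \approx N \,(T_{map} + T_c' + T_b), \qquad T_{\text{mmap+SSP}} \approx T_{map} + N \, T_c' + T_b$$

Under these assumptions the saving is $(N-1)(T_{map} + T_b)$, which grows with the number of steps N and with the barrier cost at large process counts.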

[0094] Therefore, the scheme of Figure 7 is optimized as shown in Figure 11. The boundary data corresponding to the white one-dimensional indices are not processed. For the internal data corresponding to the red one-dimensional indices, taking the one-dimensional index i-2 as an example, the neighboring data at `index_one_dimension` ± z_glo*y_glo, ± z_glo, and ± 1 (relative to the current internal data index) are read from the one-dimensional array returned by mmap. Each 64-bit value is decomposed into its temperature data and its number of calculation steps. The error compensation formula is used to obtain the approximate true values of the current data and the neighboring data at the same time, and the new value is then calculated from these approximate true values using the thermodynamic calculation formula. The resulting data and the new calculation step are merged into a 64-bit value and written to position i-2 of the output buffer buffer1. The same applies to the other internal data.

[0095] In the context of domestic supercomputing, the use of memory-mapped files avoids frequent switching between user mode and kernel mode (frequent I / O) caused by the small memory of each node, reduces the copying between system cache and application cache, realizes efficient interaction between user area and system kernel, and provides a way for processes to communicate with each other through shared memory. It is suitable for parallel programming, so that the file is regarded as part of memory, and each process can read and write the file as if it were accessing ordinary memory, thus realizing efficient large-scale parallel data transmission.

[0096] Simultaneously, a delayed synchronous parallel mechanism (SSP) is introduced, allowing for controllable step delays between processes. This significantly reduces the number of synchronization barrier triggers and the time consumption (Tb). Furthermore, Tc is further compressed by merging high-frequency mmap operations. In addition, an error compensation algorithm is embedded within the SSP framework to offset potential accuracy losses from asynchronous computation. Finally, through the collaborative optimization of mmap and SSP, the system achieves secondary acceleration based on memory mapping technology, balancing computational efficiency and convergence accuracy.

[0097] To verify the performance improvement effect of memory mapping technology on domestic supercomputers, a massive data generation test was conducted on the Tianhe next-generation supercomputing system. The experimental parameters are as follows:

[0098] Computing resources: 100 compute nodes (1600 CPU cores in total)

[0099] Data size: 5TB

[0100] Total calculation steps: 100

[0101] Before optimization (traditional read() / write() method): 1300 hours elapsed.

[0102] After optimization (using mmap() + SSP (Threshold = 10) + error compensation mechanism): time taken 250 hours.

[0103] Meanwhile, the average error of its data was tested and found to be 0.024556%, with negligible loss of accuracy, which is within an acceptable range.

[0104] The experiment shows that, by combining memory mapping with the SSP + error compensation mechanism, computation time is reduced to about 20% of the original (250 hours versus 1300 hours, roughly 19%), I / O efficiency is significantly improved, CPU load is reduced, and system throughput is increased, providing a more efficient solution for large-scale data computation. This also alleviates the bottleneck caused by the small per-node memory of domestically produced supercomputers, which otherwise forces frequent I / O operations and synchronization barriers.

[0105] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0106] Based on the same inventive concept, this application also provides a memory-mapped supercomputing system thermodynamic data processing device for implementing the memory-mapped supercomputing system thermodynamic data processing method described above. The solution provided by this device is similar to the solution described in the above method; therefore, the specific limitations in one or more embodiments of the memory-mapped supercomputing system thermodynamic data processing device provided below can be found in the limitations of the memory-mapped supercomputing system thermodynamic data processing method described above, and will not be repeated here.

[0107] In one embodiment, as shown in Figure 12, a memory-mapped supercomputing system thermodynamic data processing device is also provided, comprising: a data acquisition module 110, a data mapping module 120, and a data transmission module 130, wherein:

[0108] The data acquisition module 110 is used to acquire a one-dimensional array obtained by converting three-dimensional thermodynamic data in the user buffer, as well as the array's starting address and one-dimensional index.

[0109] The data mapping module 120 is used by the process to perform memory mapping calls on the corresponding part of the data file in the one-dimensional array based on the array's starting address and one-dimensional index, and to map the data file to the kernel cache area; the kernel cache area shares the user buffer.

[0110] The data transmission module 130 is used to transmit data between the kernel cache and hardware devices based on the data files mapped in the kernel cache.

[0111] In one embodiment, the data mapping module 120 determines the corresponding data file to be converted in the one-dimensional array based on the array's starting address, one-dimensional index, and preset process processing capacity; divides the data file to be converted into multiple data blocks according to the preset buffer capacity; sequentially converts each data block serially; and maps the converted data file to the kernel cache area.

[0112] In one embodiment, the data mapping module 120 uses a sliding window strategy based on the one-dimensional index to distinguish the boundary data from the internal data of the current data block in the input buffer, assigns the boundary data values to the corresponding positions in the output buffer, and, after reading the adjacent data of the internal data, computes the value for the corresponding position of the output buffer using the thermodynamic calculation formula, and maps the data file obtained from the output buffer to the kernel cache.

[0113] In one embodiment, the device further includes a synchronization compensation module, which is used to adjust each process using a delayed synchronization parallel mechanism and introduce an error compensation mechanism to compensate data when the difference in computation steps between different processes reaches a preset maximum computation step threshold.

[0114] The modules in the aforementioned memory-mapped supercomputing system thermodynamic data processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the computer device, or stored in the computer device's memory as software, so that the processor can call and execute the corresponding operations of each module.

[0115] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 13. The computer device includes a processor, memory, input / output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input / output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input / output interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input / output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements a memory-mapped supercomputing system thermodynamic data processing method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0116] Those skilled in the art will understand that the structure shown in Figure 13 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0117] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.

[0118] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps in the above method embodiments.

[0119] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0120] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0121] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0122] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A memory mapping based supercomputing system thermodynamic data processing method, characterized in that it comprises: obtaining the one-dimensional array obtained by converting three-dimensional thermodynamic data in the user buffer, as well as the array's starting address and one-dimensional index; a process performing, based on the array's starting address and the one-dimensional index, memory mapping calls on the corresponding portion of the data file in the one-dimensional array, so as to map the data file to the kernel cache, wherein the kernel cache shares the user buffer; transferring data between the kernel cache and hardware devices based on the data file mapped in the kernel cache; and, when the difference in computation steps between different processes reaches the preset maximum computation step threshold, adjusting each process using a delayed synchronous parallel mechanism and introducing an error compensation mechanism to compensate the data; wherein introducing the error compensation mechanism to compensate the data includes: after each process reads the neighboring data of its internal data, obtaining the approximate true values of the current data and the neighboring data at the same moment using an error compensation formula, and then, according to the thermodynamic calculation formula, writing the obtained data, merged with the new calculation step, into the corresponding position in the buffer; the error compensation formula used by the error compensation mechanism is: ; ; ; where κ, Δt, $h_x$, $h_y$, and $h_z$ are constants; $\theta^{n}[i][j][k]$ denotes the temperature at time n at position (i, j, k); α is a constant; $\theta^{n+m}[i][j][k]$ denotes the temperature at time n+m at position (i, j, k), which is used for compensation; and m denotes the time difference before and after the compensation.

2. The method according to claim 1, characterized in that the process performing, based on the array's starting address and the one-dimensional index, memory mapping calls on the corresponding portion of the data file in the one-dimensional array, so as to map the data file to the kernel cache, includes: determining the corresponding data file to be converted in the one-dimensional array based on the array's starting address, the one-dimensional index, and the preset process processing volume; dividing the data file to be converted into multiple data blocks according to the preset buffer capacity; and converting the data blocks serially, one after another, and mapping the converted data file to the kernel cache.

3. The method according to claim 2, characterized in that the buffer capacity is the capacity of the input buffer and the output buffer; and the step of converting the data blocks serially, one after another, and mapping the converted data file to the kernel cache includes: using a sliding window strategy based on the one-dimensional index to distinguish the boundary data from the internal data of the current data block in the input buffer; assigning the boundary data values to the corresponding positions in the output buffer; after reading the adjacent data of the internal data, computing the value for the corresponding position of the output buffer using a thermodynamic calculation formula; and mapping the data file obtained from the output buffer to the kernel cache.

4. The method according to claim 1, characterized in that the memory mapping uses a demand paging mechanism that triggers a page table lookup when a page fault occurs, dynamically loads the required file pages into the kernel cache, and uses a page replacement algorithm to dynamically swap out low-priority pages when the kernel cache overflows.

5. The method according to any one of claims 1-4, characterized in that there are multiple processes; each process determines the corresponding data file to be converted in the one-dimensional array based on the array's starting address, the one-dimensional index, and an offset; the processes perform memory mapping calls on their corresponding data files to be converted in parallel, mapping the data files to the kernel cache area; and the offset is determined based on the process's processing capacity.

6. The method according to claim 5, characterized in that the thermodynamic calculation formula used in the error compensation mechanism is: ; where $\theta^{n+1}[i][j][k]$ denotes the temperature at time n+1 at position (i, j, k).

7. A memory-mapped supercomputing system thermodynamic data processing device, characterized in that it is used to implement the method according to any one of claims 1-6 and comprises: a data acquisition module, used to acquire a one-dimensional array obtained by converting three-dimensional thermodynamic data in the user buffer, as well as the array starting address and one-dimensional index of the one-dimensional array; a data mapping module, used by a process to perform memory mapping calls on the corresponding part of the data file in the one-dimensional array based on the array's starting address and one-dimensional index, mapping the data file to the kernel cache area, wherein the kernel cache area shares the user buffer; and a data transmission module, used to transmit data between the kernel cache and hardware devices based on the data files mapped in the kernel cache.