Data processing method and apparatus, and computing system and computing device cluster
By loading data from the low-bandwidth first storage layer to the high-bandwidth second storage layer according to the global access order in the computing system, and executing tasks in memory, the data loading latency problem caused by limited cache space is solved, and task execution efficiency is improved.
Patent Information
- Authority / Receiving Office: WO · WO
- Patent Type: Applications
- Current Assignee / Owner: HUAWEI TECH CO LTD
- Filing Date: 2025-08-26
- Publication Date: 2026-07-02
AI Technical Summary
When the computing system performs AI model training tasks, the limited cache space leads to long data loading delays, which affects the efficiency of task execution.
The access information of the training task is acquired, data is loaded from the first storage layer to the second storage layer according to the global access order, and the task is executed on data loaded into the memory of the computing node. Because the second storage layer has high bandwidth, data is loaded quickly and its storage space is freed up promptly to load other data.
It effectively reduces data loading latency, improves the task execution efficiency of the computing system, and saves time reading data from the low-bandwidth storage layer.
Description
Data processing method and apparatus, computing system, and computing device cluster
[0001] This application claims priority to Chinese Patent Application No. 202411976168.6, filed on December 27, 2024, entitled “Data Processing Method, Apparatus, Computing System and Computing Device Cluster”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of cloud computing technology, and in particular to a data processing method, apparatus, computing system, and computing device cluster.
Background Technology
[0003] With the rapid development of artificial intelligence (AI) technology, the scale of data processed by computing systems when performing tasks is getting larger and larger. However, due to the limited bandwidth of the storage medium where the data is stored, the latency of the computing system loading data is getting higher and higher.
[0004] In related technologies, data preloading is used to reduce data loading latency. For example, during the training of an AI model, certain data may be frequently accessed, so this data can be preloaded into a cache. Based on this, the computing system can directly load the cached data into memory to execute the task, without having to repeatedly load this data from the disk.
[0005] However, in the above methods, frequently accessed data will occupy cache space for a long time, and the storage capacity provided by the cache is very limited, making it difficult for the computing system to preload other data. As a result, when the computing system executes tasks, it still needs to spend a long time reading other data from the disk, which affects the efficiency of task execution.
Summary of the Invention
[0006] This application provides a data processing method, apparatus, computing system, and computing device cluster, which can effectively reduce the data loading latency when the computing system performs training tasks.
[0007] In a first aspect, a data processing method is provided, applied to data loading scenarios during training tasks in the AI field. The method is executed by a computing system that includes a first computing node and a second computing node. Each computing node is equipped with an accelerator card and memory. The first and second computing nodes are used to execute a training task of a model. The method includes:
[0008] Obtain the first access information for the training task. The first access information is used to indicate the global access order of the training task data during the execution of the training task.
[0009] During the execution of the training task, based on the first access information, multiple first data are loaded from the first storage layer to the second storage layer, and the data of the training task includes multiple first data.
[0010] The first target data from multiple first data in the second storage layer is loaded into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform training tasks based on the first target data in the memory of the first computing node.
[0011] The second target data from multiple first data in the second storage layer is loaded into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform training tasks based on the second target data in the memory of the second computing node.
[0012] The aforementioned model is used to perform inference on input data and output inference results. Its application areas include text inference, image inference, video inference, audio inference, and so on, which this application does not limit. In the above method, while the computing system executes the model training task through multiple computing nodes, multiple pieces of data of the training task are preloaded from the first storage layer to the second storage layer according to the global access order of the training task data. When any computing node needs to access target data among those pieces of data to execute the training task, the target data is loaded from the second storage layer into the memory of that computing node. Because the second storage layer is filled from the first storage layer in the global access order, data that has been loaded into the second storage layer is accessed soon after it arrives, so its storage space can be released in a timely manner to load other data. The computing system therefore does not need to spend a long time reading data from the first storage layer, which effectively reduces data loading latency.
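Illustratively, the flow of the first aspect can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of this application: the names `run_epoch`, `first_layer`, and `capacity` are hypothetical, each piece of data is assumed to appear once in the global access order, the second storage layer is modeled as a dictionary bounded by an item count, and positions in the global access order are assumed to be assigned to computing nodes round-robin.

```python
def run_epoch(global_order, first_layer, capacity, num_nodes):
    """Preload in global access order, let nodes consume, free space after use.

    Assumptions (not from the application): unique keys in global_order,
    capacity >= 1 counted in items, round-robin assignment of positions to nodes.
    """
    second_layer = {}                       # bounded, high-bandwidth tier
    node_memory = [[] for _ in range(num_nodes)]
    next_to_preload = 0
    for pos, key in enumerate(global_order):
        # Online preloading: fill the second layer ahead of the consumers.
        while next_to_preload < len(global_order) and len(second_layer) < capacity:
            k = global_order[next_to_preload]
            second_layer[k] = first_layer[k]
            next_to_preload += 1
        # The node that owns this position loads its target data into memory;
        # pop() also releases the slot so later data can be preloaded.
        node_memory[pos % num_nodes].append(second_layer.pop(key))
    return node_memory
```

Because data leaves the second storage layer in the same order in which it was preloaded, the sketch never needs more than `capacity` slots, which mirrors the timely release of storage space described above.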
[0013] In one possible implementation, the method further includes:
[0014] Before executing the training task, based on the first access information, multiple second data are loaded from the first storage layer to the second storage layer until a first condition is met. The access order of the multiple second data is prior to the access order of the multiple first data. The training task data includes multiple second data. The first condition refers to the amount of data in the second storage layer reaching a first threshold. Alternatively, the first condition refers to the multiple second data including user-specified data from the training task data.
[0015] Before loading the first target data into the memory of the first computing node, the third target data among the multiple second data in the second storage layer is loaded into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform training tasks based on the third target data in the memory of the first computing node.
[0016] Before loading the second target data into the memory of the second computing node, the fourth target data among the multiple second data in the second storage layer is loaded into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform training tasks based on the fourth target data in the memory of the second computing node.
[0017] In the above manner, the data loaded from the first storage layer to the second storage layer by the computing system before executing the training task is referred to as the second data. This data can also be understood as offline pre-loaded data, or data loaded by the computing system through the offline data pre-loading function. By setting the aforementioned first condition, the computing system loads a portion of the training task data from the first storage layer to the second storage layer in advance, according to the global access order of the training task. This fully utilizes the idle resources before the training task is executed, saving the time required for the computing system to load this portion of data from the first storage layer to the second storage layer after the training task starts, thereby effectively reducing data loading latency.
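Illustratively, the two alternative forms of the first condition can be sketched as follows. This is a hedged sketch: the function name `offline_preload`, the `sizes` mapping, and the byte-based threshold are assumptions introduced for illustration, and the application does not specify how the threshold is measured.

```python
def offline_preload(global_order, sizes, threshold_bytes, required=frozenset()):
    """Return the prefix of the global access order to load before training.

    If `required` is empty, stop once the loaded volume reaches threshold_bytes.
    Otherwise, stop as soon as every user-specified item has been loaded.
    """
    loaded, used, pending = [], 0, set(required)
    for key in global_order:
        if required and not pending:
            break                      # all user-specified data is loaded
        if not required and used >= threshold_bytes:
            break                      # data volume reached the first threshold
        loaded.append(key)
        used += sizes[key]
        pending.discard(key)
    return loaded
```

In both modes the loaded items form a prefix of the global access order, so the data that the training task accesses first is the data that is already in the second storage layer when the task starts.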
[0018] In one possible implementation, when multiple threads load multiple pieces of second data from a first storage layer to a second storage layer, wherein any two threads are responsible for loading different data, the method further includes: after a first condition is met, storing an index to indicate data that has been loaded into the second storage layer during the execution of a training task; and loading multiple pieces of first data from the first storage layer to the second storage layer based on first access information, including: starting with the access order of the data indicated by the index in the first access information, loading multiple pieces of first data from the first storage layer to the second storage layer.
[0019] By employing the above method, the computing system runs multiple threads to load the multiple pieces of second data from the first storage layer to the second storage layer, which improves system resource utilization and reduces data loading latency. Furthermore, because the threads run at different speeds, gaps may exist between the data loaded by slow threads and the data loaded by fast threads; that is, the data preloaded offline into the second storage layer may not be contiguous in the access order. This could lead to omissions in data loading during subsequent task execution. Therefore, by storing an index, the computing system can indicate which data has already been loaded into the second storage layer when the training task is executed, ensuring that no data is missed and avoiding additional overhead in the subsequent data loading process.
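Illustratively, one way to picture the gap problem and the role of the index is sketched below. The layout of the index is not given in this passage, so the sketch makes an assumption: the index is taken to be the first position in the global access order that has not been loaded, and the set of loaded positions is kept alongside it. The function names are hypothetical.

```python
def contiguous_index(loaded_positions):
    """Smallest position in the global access order that is not yet loaded."""
    idx = 0
    while idx in loaded_positions:
        idx += 1
    return idx

def online_positions(total, index, loaded_positions):
    """Positions still to load, in global access order, starting at the index."""
    return [p for p in range(index, total) if p not in loaded_positions]

# Example: a fast thread loaded positions 0, 2, 4, 6 and a slow thread
# loaded only 1 and 3 before the first condition was met.
loaded = {0, 2, 4, 6} | {1, 3}
index = contiguous_index(loaded)              # position 5 is the first gap
remaining = online_positions(8, index, loaded)
```

Starting online preloading from the index, rather than from the highest loaded position, is what prevents position 5 in this example from being skipped.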
[0020] In one possible implementation, the method further includes: after loading the first target data in the second storage layer into the memory of the first computing node through the first computing node, generating a data deletion request if the first target data meets a second condition, the data deletion request being used to instruct the first target data to be deleted from the second storage layer; wherein the second condition refers to the first target data not being accessed within a first time period; or, the second condition refers to the number of times the first target data is accessed reaching a second threshold.
[0021] The above process means that during the execution of the training task, after each data required for the training task is loaded from the second storage layer into memory, the computing node promptly determines whether the data can be deleted from the second storage layer. If it can be deleted, a data deletion request is generated to release the storage space of the second storage layer in a timely manner.
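Illustratively, the count-based form of the second condition can be sketched as follows. This is a sketch under assumptions: the class name `DeletionTracker`, the per-item thresholds, and the dictionary shape of the deletion request are all hypothetical and are not defined by this application.

```python
from collections import Counter

class DeletionTracker:
    """Emit a deletion request once an item has been loaded often enough."""

    def __init__(self, thresholds):
        self.thresholds = thresholds   # key -> second threshold (assumed per item)
        self.counts = Counter()

    def on_loaded(self, key):
        """Record one load into node memory; return a request or None."""
        self.counts[key] += 1
        if self.counts[key] >= self.thresholds[key]:
            # Hypothetical request format: instructs the second storage
            # layer to delete this item.
            return {"op": "delete", "key": key}
        return None
```

A computing node would call `on_loaded` right after copying an item into its memory and forward any returned request to the second storage layer.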
[0022] In one possible implementation, the method further includes: obtaining second access information for the training task, the second access information being used to indicate the local access order of the training task data by the accelerator cards on each computing node during the parallel execution of the training task by multiple accelerator cards, the second access information being generated based on the first access information;
[0023] Loading the first target data from multiple first data in the second storage layer into the memory of the first computing node via the first computing node includes: loading the first target data determined based on the second access information in the second storage layer into the memory of the first computing node via the first computing node.
[0024] The above process means that during the parallel execution of tasks by multiple accelerator cards, each accelerator card loads the data that needs to be accessed from the second storage layer into memory for processing according to the local access order required by its respective task. Since the computing system has preloaded the relevant data of the task into the second storage layer based on the global access information, each accelerator card can quickly load the required data from the second storage layer based on its own local access order, thereby effectively reducing the data loading latency.
[0025] In one possible implementation, the method further includes: during the process of loading first target data determined based on second access information from the second storage layer into the memory of the first computing node through the first computing node, loading at least one third data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node, wherein the access order of the at least one third data is after the access order of the first target data.
[0026] The above process means that while the computing node is loading the data currently required by the training task from the second storage layer into memory, it also prefetches data that the training task will access later from the second storage layer into memory. This saves the time the computing node would otherwise spend loading that data when it is needed, thereby further reducing the data loading latency.
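Illustratively, prefetching along the local access order can be sketched as follows. The function name `load_with_prefetch` and the `depth` parameter are assumptions for illustration; the application only states that at least one piece of third data, later in the access order than the first target data, is loaded as well.

```python
def load_with_prefetch(local_order, pos, second_layer, memory, depth=2):
    """Load the item at `pos` and prefetch up to `depth` later items.

    `second_layer` and `memory` are plain dicts standing in for the second
    storage layer and the computing node's memory (an assumption).
    """
    window = local_order[pos : pos + 1 + depth]
    for key in window:
        # Skip items already in memory or not yet preloaded into the fast tier.
        if key not in memory and key in second_layer:
            memory[key] = second_layer[key]
    # Raises KeyError if the target itself was not available in the fast tier.
    return memory[local_order[pos]]
```

With `depth=2`, a request for the first item also brings the next two items of the local access order into memory, so later requests for them are served without another trip to the second storage layer.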
[0027] In one possible implementation, the method further includes: deleting fourth data of the training task that satisfies a third condition from the second storage layer, wherein the third condition means that the fourth data will not be accessed during a second time period; or, the third condition means that the number of times the fourth data is accessed reaches a third threshold.
[0028] The above method can promptly free up storage space in the second storage layer, improving resource utilization. It should be understood that deleting data from the second storage layer when it will not be accessed for a period of time promptly evicts "cold data," preventing it from occupying storage space in the second storage layer for extended periods. Similarly, deleting data from the second storage layer when the number of times it has been accessed reaches a threshold promptly evicts data that is no longer needed for task execution, preventing this data from occupying storage space and thus reducing wasted space.
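Illustratively, the time-based form of the third condition can be evaluated by looking ahead in the global access order. The sketch below assumes that the second time period is measured as a number of upcoming positions in the global access order; the application does not state the unit, and the names `evictable`, `cursor`, and `window` are hypothetical.

```python
def evictable(resident, global_order, cursor, window):
    """Keys in the second layer that do not appear in the next `window` accesses."""
    upcoming = set(global_order[cursor : cursor + window])
    return sorted(k for k in resident if k not in upcoming)

# Example: "a" is accessed again at position 5, but that is outside the
# two-access window starting at position 3, so it counts as cold data here.
order = ["a", "b", "a", "c", "d", "a"]
cold = evictable({"a", "b", "c"}, order, cursor=3, window=2)
```

A larger window keeps items such as "a" resident until their next access, at the cost of holding more data in the second storage layer.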
[0029] In one possible implementation, the step of deleting the fourth data from the second storage layer is performed during the process of loading the training task data from the first storage layer to the second storage layer; or, the step of deleting the fourth data from the second storage layer is performed during the process of loading the training task data from the second storage layer into the memory of the computing node.
[0030] The above process means that during the entire training task, the computing system promptly determines whether there is data in the second storage layer that can be deleted, thereby releasing the storage space of the second storage layer in a timely manner and improving resource utilization.
[0031] In a second aspect, a data processing apparatus is provided for use in a computing system, the computing system including a first computing node and a second computing node, each computing node being equipped with an accelerator card and memory, the first computing node and the second computing node being used to perform a training task of a model, the apparatus comprising:
[0032] The parsing module is used to obtain the first access information of the training task. The first access information is used to indicate the global access order of the training task data during the execution of the training task.
[0033] The online data preloading module is used to load multiple first data from the first storage layer to the second storage layer based on the first access information during the execution of the training task. The data of the training task includes multiple first data.
[0034] The front-end data loading module is used to load the first target data from multiple first data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform training tasks based on the first target data in the memory of the first computing node; and to load the second target data from multiple first data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform training tasks based on the second target data in the memory of the second computing node.
[0035] In one possible implementation, the device further includes:
[0036] The offline data preloading module is used to load multiple second data from the first storage layer to the second storage layer based on the first access information before executing the training task, until a first condition is met. The access order of the multiple second data is before the access order of the multiple first data. The data of the training task includes multiple second data. The first condition means that the amount of data in the second storage layer reaches a first threshold; or, the first condition means that the multiple second data includes user-specified data in the data of the training task.
[0037] The front-end data loading module is also used for:
[0038] Before loading the first target data into the memory of the first computing node, the third target data among the multiple second data in the second storage layer is loaded into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform training tasks based on the third target data in the memory of the first computing node.
[0039] Before loading the second target data into the memory of the second computing node, the fourth target data among the multiple second data in the second storage layer is loaded into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform training tasks based on the fourth target data in the memory of the second computing node.
[0040] In one possible implementation, when multiple threads are running to load multiple pieces of second data from a first storage layer to a second storage layer, wherein any two threads are responsible for loading different data, the apparatus further includes:
[0041] A storage module is used to store an index after the first condition is met. This index is used to indicate the data that has been loaded into the second storage layer during the execution of the training task.
[0042] An online data preloading module is used to load multiple first data from the first storage layer to the second storage layer, starting with the access order of the data indicated by the index in the first access information.
[0043] In one possible implementation, the device further includes:
[0044] The generation module is used to generate a data deletion request after the first target data in the second storage layer is loaded into the memory of the first computing node through the first computing node, and the first target data meets the second condition. The data deletion request is used to indicate that the first target data is deleted from the second storage layer.
[0045] The second condition refers to the first target data not being accessed within the first time period; or, the second condition refers to the number of times the first target data is accessed reaching the second threshold.
[0046] In one possible implementation, the parsing module is further configured to obtain second access information for the training task. The second access information is used to indicate the local access order of the training task data by the accelerator cards on each computing node during the parallel execution of the training task by multiple accelerator cards. The second access information is generated based on the first access information.
[0047] The front-end data loading module is used to load the first target data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node.
[0048] In one possible implementation, the device further includes:
[0049] The data prefetching module is used to load at least one third data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node, wherein the access order of the at least one third data is after the access order of the first target data.
[0050] In one possible implementation, the device further includes:
[0051] The deletion module is used to delete the fourth data of the training task that meets the third condition from the second storage layer. The third condition means that the fourth data will not be accessed within the second time period; or, the third condition means that the number of times the fourth data is accessed reaches the third threshold.
[0052] In one possible implementation, the deletion module is used to delete the fourth data from the second storage layer during the process of loading training task data from the first storage layer to the second storage layer; or, during the process of loading training task data from the second storage layer to the memory of the computing node, the fourth data is deleted from the second storage layer.
[0053] In a third aspect, a computing system is provided, which includes multiple computing nodes, each of which is equipped with an accelerator card and memory. The multiple computing nodes are used to perform training tasks of a model. The computing system is used to implement the data processing method provided by the first aspect or any possible implementation of the first aspect.
[0054] In a fourth aspect, a computing device cluster is provided, including at least one computing device, each computing device including a processor and memory;
[0055] The processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, to cause the computing device cluster to implement the data processing method provided by the first aspect or any possible implementation of the first aspect.
[0056] In a fifth aspect, a computer program product containing instructions is provided, which, when executed by a computing device cluster, causes the computing device cluster to implement the data processing method provided by the first aspect or any possible implementation thereof.
[0057] In a sixth aspect, a computer-readable storage medium is provided, including computer program instructions that, when executed by a cluster of computing devices, enable the cluster of computing devices to implement the data processing method provided by the first aspect or any possible implementation thereof.
[0058] Based on the implementation methods provided in the above aspects, this application can be further combined to provide more implementation methods.
Attached Figure Description
[0059] Figure 1 is a schematic diagram of an implementation environment provided in an embodiment of this application;
[0060] Figure 2 is a schematic diagram of the principle of a data processing method provided in an embodiment of this application;
[0061] Figure 3 is a flowchart of a data processing method provided in an embodiment of this application;
[0062] Figure 4 is a schematic diagram of an offline data preloading process provided in an embodiment of this application;
[0063] Figure 5 is a schematic diagram of an index provided in an embodiment of this application;
[0064] Figure 6 is a schematic diagram of an online data preloading process provided in an embodiment of this application;
[0065] Figure 7 is a schematic diagram of a data loading process provided in an embodiment of this application;
[0066] Figure 8 is a schematic diagram of a data prefetching process provided in an embodiment of this application;
[0067] Figure 9 is a schematic diagram of a data eviction strategy during online data preloading provided in an embodiment of this application;
[0068] Figure 10 is a schematic diagram of a data eviction strategy during data loading provided in an embodiment of this application;
[0069] Figure 11 is a schematic diagram of a data loading function provided in an embodiment of this application;
[0070] Figure 12 is a schematic diagram of the structure of a data processing device provided in an embodiment of this application;
[0071] Figure 13 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;
[0072] Figure 14 is a schematic diagram of a computing device cluster provided in an embodiment of this application;
[0073] Figure 15 is a schematic diagram of a possible implementation of a computing device cluster provided in an embodiment of this application.
Detailed Implementation
[0074] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be further described in detail below with reference to the accompanying drawings. It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the access information and data of training tasks involved in this application are obtained under fully authorized conditions.
[0075] To facilitate understanding, the key terms and concepts involved in this application will be explained below.
[0076] Deep learning is a type of machine learning technology based on deep neural network algorithms. Its main characteristic is the use of multiple nonlinear transformations to process and analyze data. It is widely used in AI fields such as perception and decision-making, including image and speech recognition, natural language translation, and computer games.
[0077] Parallel strategies in AI model training refer to breaking down the training task into multiple subtasks and distributing them across multiple nodes (such as accelerator cards) for execution. This fully utilizes parallel computing resources, significantly shortening training time and accelerating model convergence. Related technologies include model parallelism (or tensor parallelism), data parallelism, sequence parallelism (SP), and mixture of experts (MoE) parallelism, among others. The core idea behind these parallel strategies is to divide the model weights or activation values among multiple accelerator cards, allowing each accelerator card to handle a portion of the model's computation. This approach not only overcomes the memory capacity limitations of a single accelerator card but also supports the training of larger models.
[0078] An accelerator card, also known as an acceleration device, accelerator, or acceleration chip, is a type of specialized hardware device or computer system designed to accelerate computation in AI scenarios. In the embodiments of this application, accelerator cards may include, for example, graphics processing units (GPUs), neural network processing units (NPUs), intelligent processing units (IPUs), tensor processing units (TPUs), or domain-specific architecture (DSA) chips, and are not limited to these.
[0079] A file system is a way to store and organize data files. Typically, a file consists of data and metadata. Data is the actual content of the file, while metadata contains descriptive information such as filename, size, creation time, modification time, and access permissions. The concept of a "file" organizes data in a computer, grouping data used for the same purpose into different types of files according to the structural requirements of different applications.
[0080] A distributed file system is a file system that distributes data across multiple storage nodes. It improves system access performance through the concurrency capabilities of the multiple nodes and increases the overall system capacity through the capacity of the multiple nodes. Storage nodes can be, for example, physical servers.
[0081] A global shuffle list (GSL) is used to indicate the global access order of data in a training task. For example, in a distributed training scenario for AI models, the training task is launched in parallel by multiple accelerator cards. Each accelerator card has its own local access order to the data. The global access order is the union of these local access orders, representing the overall order of data access for the training task. Illustratively, taking the example of training task data stored in a file system, during the execution of the training task, a list file is pre-generated based on the global access order. Each item in this list file represents a file path, indicating the storage location of the data. The order of these items is the global access order of the files.
[0082] A local shuffle list (LSL) is a subset of the global shuffle list (GSL). For example, in a distributed AI model training scenario, the local access order of data by each accelerator card is that card's LSL. Illustratively, taking training task data stored in a file system as an example, when the GSL is generated, a list file is also generated for each accelerator card based on the GSL to indicate the order in which that accelerator card accesses the data. Each entry is a file path, which indicates the storage location of the data. The order of these entries is the local access order of the files.
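Illustratively, the relationship between a GSL and the per-card LSLs can be sketched as follows. This is a sketch under assumptions: the seeded shuffle, the round-robin split, and the example paths are hypothetical, since the application does not specify how the GSL is produced or how it is partitioned among accelerator cards.

```python
import random

def make_gsl(paths, seed):
    """Shuffle file paths into a global access order, one entry per path."""
    order = list(paths)
    random.Random(seed).shuffle(order)   # seeded so every node derives the same GSL
    return order

def make_lsls(gsl, num_cards):
    """Split the GSL into per-card LSLs; round-robin is an assumed policy."""
    return [gsl[card::num_cards] for card in range(num_cards)]

# Hypothetical dataset paths used only for illustration.
paths = [f"/dataset/sample_{i}.bin" for i in range(8)]
gsl = make_gsl(paths, seed=0)
lsls = make_lsls(gsl, num_cards=2)
```

Under this split, each LSL preserves the relative order of its entries in the GSL, and the LSLs together cover the GSL exactly once, which is consistent with the description of the global access order as the union of the local access orders.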
[0083] Garbage collection (GC) is an automatic memory management mechanism used to identify and reclaim unused storage space to make room for new data. Related technologies include GC mechanisms such as reference counting, mark-and-sweep, copying, and generational collection, among others.
[0084] The application scenarios and implementation environment of this application are described below.
[0085] This application can be applied to data loading scenarios during AI training tasks. Currently, the scale of data processed by computing systems when performing AI model training tasks is increasing, but due to the limited bandwidth of the storage medium, the latency of data loading by the computing system is becoming increasingly high. Therefore, this application provides a data processing method that can effectively reduce the latency of data loading by the computing system and improve task execution efficiency.
[0086] The implementation environment of this application will be described below with reference to Figure 1. Figure 1 is a schematic diagram of an implementation environment provided by an embodiment of this application. As shown in Figure 1, the implementation environment includes a computing system 101, a first storage layer 102, and a second storage layer 103. The computing system 101 can access the first storage layer 102 and the second storage layer 103 through a wired network or a wireless network. The bandwidth of the first storage layer 102 is less than the bandwidth of the second storage layer 103, that is, the data transmission performance of the second storage layer 103 is higher than that of the first storage layer 102.
[0087] The computing system 101 has distributed computing capabilities for performing model training tasks. The computing system 101 includes multiple computing nodes. For any given computing node, the computing node includes a host and an accelerator card. The host and the accelerator card communicate with each other, for example, through a peripheral component interconnect express (PCIe) link. That is, the host and the accelerator card exchange data via a PCIe link.
[0088] Illustratively, a host is an integrated device or module that includes a central processing unit (CPU), memory, and other related components, and works in conjunction with an accelerator card to perform various complex computing functions. In this embodiment, the host controls the accelerator card to execute model training tasks. The host's memory is also the memory of the computing node, or the memory of the computing system 101. The memory may be, for example, dynamic random access memory (DRAM) or double data rate (DDR) memory, and this application is not limited to these. Accelerator cards may include, for example, NPUs, GPUs, IPUs, TPUs, or DSA chips, and this application is not limited to these. Accelerator cards may use high bandwidth memory (HBM) for data storage, and this application does not limit this.
[0089] It should be noted that this application does not limit the number of computing nodes or the number of hosts and accelerator cards within a computing node. For any computing node, the number of hosts and accelerator cards within that node can be one or more. When there are multiple accelerator cards in a computing node, the accelerator cards can communicate with one another through a high-speed interconnect link (or chip bus), enabling different accelerator cards to quickly access each other's memory and achieve efficient data transfer between them. For example, the high-speed interconnect link can be a compute express link (CXL), a universal chiplet interconnect express (UCIe), a cache coherent interconnect for accelerators (CCIX), etc., and this application is not limited to these.
[0090] Furthermore, the architecture of the computing system 101 shown in the figure is merely illustrative and does not constitute a limitation of this application. For any computing node, the host and accelerator card in that node can be integrated into a single physical device or configured separately. In other embodiments, the computing system 101 can be deployed on a cloud platform. A cloud platform, short for cloud computing platform, refers to a service based on hardware and software resources that provides computing, networking, and storage capabilities. Through the network "cloud," massive amounts of data are processed and analyzed remotely before being returned to the user, featuring large-scale, distributed, virtualized, highly available, scalable, on-demand service, and secure characteristics. A cloud platform can achieve rapid deployment and release of configurable computing resources with relatively low management costs or low interaction complexity between users and service providers. Illustratively, a cloud platform is a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. Accordingly, the computing nodes in the computing system 101 are, for example, virtual machine instances, container instances, etc., and this application does not limit them.
[0091] In this embodiment, the computing system 101 provides a data loading function for training tasks, which is implemented by a host in the computing system 101. For example, if the computing system 101 includes multiple computing nodes, the host of each computing node is used to implement the data loading function; or, if the computing system includes a single computing node and the computing node includes a host and multiple accelerator cards, the host of the computing node is used to implement the data loading function.
[0092] Illustratively, the computing system 101 loads training task-related data layer by layer into its memory to execute the training task by accessing the first storage layer 102 and the second storage layer 103. Layer-by-layer loading means that the computing system 101 first loads the training task-related data (partial or complete data) from the first storage layer 102 into the second storage layer 103, and then loads the training task-related data from the second storage layer 103 into its memory to execute the training task. Since the bandwidth of the first storage layer 102 is less than that of the second storage layer 103, this layer-by-layer data loading method loads the training task data from the lower-bandwidth storage layer to the higher-bandwidth storage layer (here, "high bandwidth" and "low bandwidth" refer to the relative bandwidth of the first and second storage layers). Thus, during the execution of the training task, the computing system 101 can utilize the high bandwidth of the second storage layer 103 to quickly read data from the second storage layer 103 and load it into its memory, thereby reducing data loading latency. For example, for any computing node, when the computing node needs to access certain data to perform a training task, the data can be quickly loaded from the second storage layer into the memory of the computing node, so that the accelerator card on the computing node can perform the training task based on the data in the memory of the computing node. That is, the computing node does not need to spend time reading data from the first storage layer, which effectively reduces the data loading latency.
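The layer-by-layer loading described above can be sketched as follows. This is only a minimal illustration: the three dictionaries stand in for the first storage layer, the second storage layer, and the memory of a computing node, and the names `stage_to_second_layer` and `load_to_memory` are hypothetical and do not come from this application.

```python
# Hypothetical in-memory stand-ins for the three tiers; all names are illustrative.
first_layer = {f"video_{i}": f"payload_{i}".encode() for i in range(6)}  # lower bandwidth
second_layer = {}  # higher-bandwidth cache layer
node_memory = {}   # memory of one computing node


def stage_to_second_layer(keys):
    """Copy the given items from the first storage layer into the second."""
    for key in keys:
        if key not in second_layer:
            second_layer[key] = first_layer[key]


def load_to_memory(key):
    """Load one item from the second storage layer into node memory."""
    if key not in second_layer:
        raise KeyError(f"{key} has not been staged in the second storage layer")
    node_memory[key] = second_layer[key]
    return node_memory[key]


# Stage the items in their access order, then read them only from the second layer.
access_order = ["video_3", "video_0", "video_5"]
stage_to_second_layer(access_order)
loaded = [load_to_memory(key) for key in access_order]
```

In this sketch the training path never reads from `first_layer` directly, which mirrors the point that the computing node does not spend time reading from the lower-bandwidth layer.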
[0093] The first storage layer 102 is used to store data for training tasks, such as all training data required to perform the training task, or the full set of training data, or even a portion of the training data required to perform the training task. For example, assuming a model is used to perform inference on videos to obtain inference results, the training data for this model's training task includes 1000 videos, which are stored in the first storage layer 102. In some embodiments, the first storage layer 102 is an object-based storage system (OBS). For example, the storage medium of the OBS may be a persistent storage medium such as a hard disk drive (HDD), but this application does not limit this to a specific type.
[0094] The second storage layer 103 serves as a cache layer between the first storage layer 102 and the memory of the computing system 101, employing a distributed storage architecture to store training task data. For example, based on the execution progress of the training task, it stores the data that the computing system 101 will access during the execution of the training task, so that the computing nodes in the computing system 101 can subsequently load data from the second storage layer 103 into their respective memory. Illustratively, the second storage layer 103 and the memory of the computing nodes in the computing system 101 together constitute a two-level cache acceleration architecture. The second storage layer 103, as the first-level cache layer, supports data import from the first storage layer 102 and provides high bandwidth. The memory of the computing nodes in the computing system 101, as the second-level cache layer, provides millisecond-level data read latency. Taking a training task with 1,000 videos as an example, the computing system 101 can load 500 videos from the first storage layer 102 to the second storage layer 103 according to the execution progress of the training task. After each computing node loads these 500 videos from the second storage layer 103 into the memory of the computing node to execute the training task, the remaining 500 videos are then loaded from the first storage layer 102 to the second storage layer 103. It should be noted that this is only an example and does not constitute a limitation of this application.
[0095] In some embodiments, the second storage layer 103 is a file system, such as a scalable file storage service, which supports OBS linkage and automatic import and export, and can provide a bandwidth of 60 GB/s (this is only an example and does not constitute a limitation of this application). Furthermore, this file system supports sharing among multiple computing nodes and can expand storage capacity according to business needs. Illustratively, the storage medium of the second storage layer 103 is, for example, a persistent storage medium such as a solid-state drive (SSD), or a disk array composed of multiple disks, which is not limited in this application. Take as an example the case in which the first storage layer 102 uses HDDs and the second storage layer 103 uses SSDs. The data access performance of an HDD is lower than that of an SSD. In addition, the second storage layer 103 uses a distributed storage architecture to store data, for example by dividing the data into multiple data blocks and distributing these data blocks across different SSDs, which achieves redundant data storage and high-performance data access. Therefore, the second storage layer 103 can provide a bandwidth greater than that of the first storage layer 102.
[0096] In some embodiments, the computing system 101 is further configured to provide data management functions for the first storage layer 102 and the second storage layer 103. For example, the computing system 101 uses these data management functions to control data transfer between any two of the computing system 101, the second storage layer 103, and the first storage layer 102. Or, for example, the computing system 101 uses these data management functions to delete data from the second storage layer 103 to free up storage space in the second storage layer 103, and so on.
[0097] In some embodiments, the aforementioned wireless or wired networks use standard communication technologies and/or protocols. Networks include, but are not limited to, Transmission Control Protocol/Internet Protocol (TCP/IP) networks in data center networks and RDMA networks such as RoCE networks, InfiniBand (IB) networks, storage area networks (SANs), local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), mobile, wired, or wireless networks, private networks, virtual private networks, or any combination thereof. In some implementations, technologies and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), etc., are used to represent data exchanged over the network. In addition, conventional encryption technologies such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) can be used to encrypt all or part of the link. In other embodiments, custom and/or dedicated data communication technologies can be used to replace or supplement the aforementioned data communication technologies.
[0098] The data processing methods provided in this application are described below.
[0099] For ease of understanding, the principle of the data processing method provided in this application is first introduced with reference to Figure 2 and in conjunction with the implementation environment described above. Figure 2 is a schematic diagram of the principle of a data processing method provided in an embodiment of this application. As shown in Figure 2, the data processing method is executed by a computing system that can access a first storage layer and a second storage layer. The first storage layer is, for example, an object storage system (OBS), and the second storage layer is, for example, a file system. It should be understood that the figures are merely illustrative examples. In practical applications, other types of storage systems can be used to implement the functions of the first or second storage layer, and this application does not limit this.
[0100] Taking the distributed training task of an AI model performed by the computing system as an example, the computing system provides data loading functionality for this training task. Illustratively, the data loading functionality includes an online preloader and a dataloader. In some embodiments, the data loading functionality further includes one or more of an offline preloader and a prefetcher.
[0101] In this embodiment, the data related to the training task executed by the computing system has an access order. That is, during the execution of the training task, the computing system accesses the training task data in a certain access order to process the data. For the distributed training task shown in Figure 2, the training task is executed in parallel by accelerator cards on multiple computing nodes in the computing system. The figure uses N accelerator cards as an example, where N is a positive integer. Before executing the training task, the computing system obtains the access information of the training task, which indicates the access order of the training task data during the execution of the training task. Illustratively, the access information for the training task includes first access information and second access information. The first access information is used to indicate the global access order of the training task data during the execution of the training task. The first access information is, for example, GSL. The second access information is used to indicate the local access order of the training task data by the accelerator cards on each computing node during the parallel execution of the training task by multiple accelerator cards. The second access information is, for example, LSL. One process corresponds to one accelerator card, and the process is also the training process. Taking the LSL corresponding to accelerator card 1 as an example, it is represented in the figure as "card 1: LSL 1". The LSLs corresponding to the other accelerator cards are similar and will not be described again.
[0102] Before executing a training task, the computing system can use an offline data preloading function to load a portion of the training task's data from the first storage layer to the second storage layer according to the first access information (GSL) of the training task. This fully utilizes idle resources before the training task execution to reduce data loading latency. It should be noted that the offline data preloading function is optional and is not limited in this application. In some embodiments, the computing system maintains an index through the offline data preloading function to indicate the data already loaded into the second storage layer, for subsequent reading when the online data preloading function is initiated.
[0103] When a training task starts, the computing system initializes the training dataset. Using an online data preloading function, it loads the data to be accessed by the computing system next from the first storage layer to the second storage layer in real time, according to the first access information (GSL) and the execution progress of the training task. This fully utilizes the bandwidth of the second storage layer to reduce data loading latency. In some embodiments, the computing system loads data based on the index maintained by the offline data preloading function. Illustratively, starting from the position in the access order of the first access information that is indicated by the index, the subsequent data is loaded from the first storage layer to the second storage layer.
[0104] During the training process, the computing system uses a foreground data loading function. Each computing node loads data from the second storage layer into its memory according to the second access information (LSL) and the training task's execution progress. Because the bandwidth of the second storage layer is greater than that of the first storage layer, the computing system can quickly read data from the second storage layer to execute the task, effectively reducing data loading latency. For example, the computing system can access data step-by-step according to the LSL.
[0105] In some embodiments, during the execution of training tasks via the foreground data loading function, the computing system can also utilize a data prefetching function. This allows each computing node to pre-load data it will subsequently access from the second storage layer into its memory, based on the second access information and execution progress of the training task, thereby further reducing data loading latency in later steps. Accordingly, for any computing node, during the data loading process for executing the training task, it first determines whether the data is in memory. If it is, it reads the data from memory; otherwise, it loads the data from the second storage layer into memory. It should be noted that the data prefetching function is optional and is not limited in this application.
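The memory-first lookup combined with prefetching can be sketched as follows. The class `NodeLoader`, its fields, and the prefetch depth are hypothetical names chosen for illustration; the prefetch here runs synchronously for simplicity, whereas a real prefetcher would more likely run in the background.

```python
class NodeLoader:
    """Hypothetical per-node loader: read from memory first, second layer on a miss."""

    def __init__(self, second_layer, local_order, prefetch_depth=2):
        self.second_layer = second_layer    # mapping: key -> data
        self.local_order = local_order      # local access order (LSL) of this node
        self.prefetch_depth = prefetch_depth
        self.memory = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, step):
        """Pull the items needed by the next few steps into memory ahead of time."""
        upcoming = self.local_order[step + 1: step + 1 + self.prefetch_depth]
        for key in upcoming:
            if key not in self.memory:
                self.memory[key] = self.second_layer[key]

    def get(self, step):
        """Return the item for this step, then prefetch for the following steps."""
        key = self.local_order[step]
        if key in self.memory:
            self.hits += 1
        else:
            self.misses += 1
            self.memory[key] = self.second_layer[key]
        self.prefetch(step)
        return self.memory[key]


second_layer = {f"v{i}": i for i in range(5)}
loader = NodeLoader(second_layer, ["v0", "v1", "v2", "v3", "v4"])
results = [loader.get(step) for step in range(5)]
```

In this run only the first step misses memory; every later step finds its item already prefetched.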
[0106] Typically, large-scale model training requires massive amounts of training data and high-speed data loading. In related technologies, the bandwidth provided by petabyte-scale data storage media is often insufficient to meet the demands of high-speed parallel training, and such large-scale data cannot be fully stored in high-speed storage systems, making large-scale data loading a bottleneck. Based on the data processing method provided in this application, a high-performance data loading function (or data loader) is offered for the computing system executing training tasks. This not only improves data loading bandwidth and reduces data loading latency but also reduces peak bandwidth requirements and long-tail latency during task execution. Furthermore, taking the object storage system OBS and the file system as the first and second storage layers, respectively, as an example, it achieves accelerated loading of petabyte-scale data using a terabyte-scale storage system. OBS can serve as the underlying data storage source for the file system, allowing large amounts of raw data to be stored first, achieving long-term, low-cost preservation of massive amounts of data. When high-speed processing and access to this data are required, the file system can quickly read data from OBS, reducing data access latency and improving data processing efficiency. For example, in the field of AI, OBS can store large-scale training data and model files, providing a data foundation for AI training; the file system can accelerate data reading and processing during training, improving training efficiency.
[0107] The flow of the data processing method provided in this application will be described below with reference to Figure 3.
[0108] Figure 3 is a flowchart of a data processing method provided in an embodiment of this application. As shown in Figure 3, the method is executed by a computing system. Taking as an example a computing system that includes a first computing node and a second computing node, both of which are used to execute the training task of the model, the method includes the following steps 301 to 305.
[0109] 301. The computing system obtains the first access information of the training task. The first access information is used to indicate the global access order of the training task data during the execution of the training task.
[0110] In this embodiment, the training task is a distributed training task. During the subsequent execution of the training task, the computing system can read and process the data according to the global access order indicated by the first access information. The first access information can also be understood as a kind of global access information. This application does not limit the type of data for the training task; for example, the data may be text, images, videos, audio, etc.
[0111] In some embodiments, the computing system creates multiple processes to execute the training task in parallel, with one process corresponding to one accelerator card. Illustratively, the computing system also obtains second access information for the training task, which is generated based on the first access information. The second access information indicates the local access order of the training task data by the accelerator cards on each computing node during the parallel execution of the training task using multiple accelerator cards; the second access information can also be understood as a type of local access information. In some embodiments, the second access information for the training task is generated by the computing system based on the first access information, for example, based on the first access information and the number of accelerator cards in the computing system. Alternatively, the second access information for the training task may be provided by the user; this application does not limit this.
[0112] The following example illustrates the access information for training tasks. Illustratively, the computing system can employ parallel techniques such as data parallelism, model parallelism, pipeline parallelism, or optimizer parallelism to partition model parameters and datasets across various accelerator cards in the computing system to accelerate training. Multiple processes created by the computing system correspond to multiple accelerator cards; that is, one process corresponds to one accelerator card. The first access information is, for example, GSL, and the second access information is, for example, LSL. Taking a video processing model as an example, the computing system can package the training task data into multiple files (file format, for example, zip; this is merely an example and does not constitute a limitation of this application). One zip file contains 1000 videos, and each accelerator card reads one video from one zip file per step. Each zip file is read in parallel by multiple accelerator cards. Based on this, the computing system obtains the first access information GSL and maps the GSL to different accelerator cards, generating the local access order corresponding to each accelerator card, which is also the second access information LSL. The local access order corresponding to each accelerator card can be represented as LSL_i. The order in which each accelerator card reads the zip file and the video in the zip file is in accordance with LSL_i, where i is an integer representing any accelerator card.
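One possible way to map a GSL onto per-card LSLs is sketched below. The strided assignment (card i takes items i, i + N, i + 2N, ...) is an assumption made for illustration; this application does not specify the mapping rule. The function name `split_gsl` and the file name `shard_0.zip` are likewise hypothetical.

```python
def split_gsl(gsl, num_cards):
    """Return one local access list (LSL_i) per accelerator card.

    Assumed rule: at step s, card i reads the global item at position s * num_cards + i.
    """
    return [gsl[i::num_cards] for i in range(num_cards)]


# Global access order: (zip file, video index) pairs; one zip is shared by all cards.
gsl = [("shard_0.zip", video) for video in range(8)]
lsls = split_gsl(gsl, num_cards=4)

# Interleaving the LSLs step by step reproduces the global order.
rebuilt = [item for step in zip(*lsls) for item in step]
```

With this rule, all cards read from the same zip file during a step, which is consistent with each zip file being read in parallel by multiple accelerator cards.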
[0113] 302. Before executing the training task, the computing system loads multiple second data from the first storage layer to the second storage layer based on the first access information until the first condition is met. The data for the training task includes multiple second data.
[0114] In this embodiment, as described above for the implementation environment, the bandwidth of the first storage layer is less than that of the second storage layer. The implementations of the first and second storage layers are described above and will not be repeated here. Illustratively, the first storage layer stores training task data. Before executing the training task, the computing system loads multiple pieces of second data from the first storage layer to the second storage layer according to the first access information of the training task until the first condition is met. Here, the data loaded from the first storage layer to the second storage layer by the computing system before executing the training task is referred to as second data. This data can also be understood as offline preloaded data, or data loaded by the computing system through the offline data preloading function. That is, before executing the training task, the computing system utilizes its idle resources and loads some data from the first storage layer to the second storage layer in advance according to the global access order of the data required for executing the training task, thereby accelerating the data loading process in the early stages of the training task execution.
[0115] In some embodiments, the first condition refers to the amount of data in the second storage layer reaching a first threshold. The first threshold is a preset threshold that can be set according to business needs or provided by the user. For example, the first threshold is 85% of the total capacity of the second storage layer, or 10TB, etc., which is not limited in this application. In other embodiments, the data in the second storage layer satisfying the first condition means that the data includes user-specified data from the training task data, that is, the user can specify the data that needs to be preloaded offline by the computing system in advance. For example, the user-specified data refers to the first 100 data points obtained in the training task data arranged in access order, which is not limited in this application. It should be understood that the description of the first condition here is only illustrative and does not constitute a limitation of this application. In practical applications, it can be set according to needs. For example, the first condition can also be set to the offline preloading time reaching a preset time, etc. By setting the first condition mentioned above, the computing system loads a portion of the training task data from the first storage layer to the second storage layer in advance, according to the global access order of the training task, before executing the training task. This makes full use of the idle resources before the training task is executed and saves the time for the computing system to load this portion of data from the first storage layer to the second storage layer after the training task starts, thereby effectively reducing the data loading latency.
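The alternative forms of the first condition can be sketched as a single stop rule. The class `FirstCondition` and its field names are hypothetical, and treating the three limits as "whichever is reached first" is an assumption for illustration; the application presents them as separate options.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstCondition:
    """Hypothetical stop rule for offline preloading; any configured limit ends it."""
    capacity_bytes: int
    fill_ratio: Optional[float] = 0.85     # e.g. 85% of the second layer's capacity
    max_items: Optional[int] = None        # e.g. the first 100 items in access order
    max_seconds: Optional[float] = None    # e.g. a preset preloading time

    def met(self, used_bytes, items_loaded, started_at, now=None):
        """Return True once any configured limit has been reached."""
        now = time.monotonic() if now is None else now
        if self.fill_ratio is not None and used_bytes >= self.fill_ratio * self.capacity_bytes:
            return True
        if self.max_items is not None and items_loaded >= self.max_items:
            return True
        if self.max_seconds is not None and now - started_at >= self.max_seconds:
            return True
        return False


by_volume = FirstCondition(capacity_bytes=1000)
by_count = FirstCondition(capacity_bytes=1000, fill_ratio=None, max_items=100)
by_time = FirstCondition(capacity_bytes=1000, fill_ratio=None, max_seconds=60.0)
```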
[0116] In some embodiments, the computing system runs multiple threads to load the multiple pieces of second data from the first storage layer to the second storage layer, improving system resource utilization and reducing data loading latency. In this case, after determining that the first condition is met, the computing system stores an index. Any two threads are responsible for loading different data, and the index stored by the computing system is the maximum value among the indices of multiple consecutive pieces of data in the second storage layer. The index is used, during the execution of the training task, to indicate the data that has been loaded into the second storage layer. Illustratively, after determining that the first condition is met, the computing system identifies multiple consecutive pieces of data in the second storage layer and stores the maximum value among the indices of these pieces of data, using this index to indicate the data loaded into the second storage layer. For example, the computing system stores the index in memory as an intermediate file. In some embodiments, the computing system updates the stored index based on the loading progress of the multiple pieces of second data; that is, the computing system can update the stored index in real time based on the progress of the second data loaded by multiple threads, which is not limited in this application. It should be understood that, because different threads run at different speeds, there may be data gaps between low-speed threads and high-speed threads. That is, the data preloaded offline by the computing system to the second storage layer may not be continuous, which may lead to omissions in data loading during subsequent task execution. Therefore, storing the above index makes it possible to indicate the data that has been loaded into the second storage layer during the execution of the training task, ensuring that no data loading is missed and avoiding additional overhead for subsequent data loading processes.
[0117] The following example, using a distributed training task of an AI model, illustrates the process of the computing system loading second data. Referring to Figure 4, which is a schematic diagram of an offline data preloading process provided in an embodiment of this application, the computing system loads a portion of data (i.e., multiple sets of second data) from the first storage layer to the second storage layer during idle periods in the resource pool, according to the global access order indicated by the first access information (GSL). This utilizes the high bandwidth of the second storage layer and the pre-known global access order to accelerate data loading in the early stages of training. For example, the offline data preloading function of the computing system can be provided to the user via a shell script. The user can use it after configuring the environment of the second storage layer and the first GSL file. It should be understood that the distributed training task of an AI model involves multiple iterations (also called multiple epochs). An epoch refers to the process of training the entire training dataset completely through the model once. The first GSL file represents the global access order of the training data in the training dataset during the first epoch.
[0118] Illustratively, the computing system initializes an offline preloader object. This object maintains a thread pool with a configurable number of threads (e.g., threads t0 to tn, where n is an integer). Multiple threads read data items from the GSL in parallel and preload the data. For example, multiple threads send data preload requests to the second storage layer through data management components (e.g., software toolsets), causing the second storage layer to load the data indicated by the preload request from the first storage layer. Furthermore, the reading order of the multiple threads can be determined by thread identifiers (IDs), ensuring high concurrency while maintaining order; any two threads are responsible for loading different data. It should be understood that if some data already exists in the second storage layer, the computing system will skip loading that data to avoid unnecessary overhead. As described above, the offline preloading process of the computing system terminates when the first condition is met. Therefore, taking as an example the case in which the first condition is that the amount of data in the second storage layer reaches a first threshold, the computing system monitors the amount of data in the second storage layer in real time. If the first condition is met, preloading stops and the index is stored. Specifically, a continuous maximum cursor (last index) is maintained and written to an intermediate file for subsequent training tasks to read, serving as the starting cursor for the training task. During this process, the computing system updates the stored index in real time based on the progress of data loaded by multiple threads; that is, the index is updated and stored in real time in the form of index = n+1.
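A thread-pool preloader of this kind can be sketched as follows. The function `offline_preload`, the strided use of thread IDs (thread t handles GSL positions t, t + n, t + 2n, ...), and the item-count stop rule are all assumptions for illustration; dictionaries stand in for the two storage layers, and the copy is done under a lock only to keep the sketch simple.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def offline_preload(gsl, first_layer, second_layer, num_threads=4, max_items=None):
    """Preload GSL items into the second layer with a pool of worker threads."""
    lock = threading.Lock()
    stop = threading.Event()

    def worker(thread_id):
        # Thread IDs fix the reading order: disjoint, strided slices of the GSL.
        for idx in range(thread_id, len(gsl), num_threads):
            key = gsl[idx]
            with lock:
                if stop.is_set():
                    return                      # stop condition already met
                if key not in second_layer:     # skip data that is already present
                    second_layer[key] = first_layer[key]
                if max_items is not None and len(second_layer) >= max_items:
                    stop.set()

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, range(num_threads)))  # wait for all workers


gsl = [f"item_{i}" for i in range(20)]
first_layer = {key: key.upper() for key in gsl}

full_cache = {}
offline_preload(gsl, first_layer, full_cache)

partial_cache = {}
offline_preload(gsl, first_layer, partial_cache, max_items=5)
```

When the stop rule fires, which items have been loaded depends on thread timing, so the loaded positions may not be contiguous in GSL order.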
[0119] Additionally, please refer to Figure 5 for the above-mentioned index. Figure 5 is a schematic diagram of an index provided in an embodiment of this application. As shown in Figure 5, multiple threads T0 to T3 read data items in GSL in parallel and perform data preloading. Due to the different speeds of different threads, there may be data gaps between low-speed threads and high-speed threads, that is, the data preloaded into the second storage layer is not continuous. By storing the index "last index = 11", it can be ensured that all data before this index has been successfully loaded into the second storage layer. It should be understood that if the index of the last data is stored, data 12, 13, 16, and 17 may be missed when subsequent training tasks start, bringing additional overhead to the subsequent data loading process.
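The continuous maximum cursor can be computed as sketched below. The function name `last_contiguous_index`, zero-based indexing, and the exact set of loaded indices are assumptions; the set is chosen to be consistent with the gaps named above (12, 13, 16, and 17 missing).

```python
def last_contiguous_index(loaded_indices, start=0):
    """Largest k such that every index in start..k is loaded (start - 1 if none)."""
    loaded = set(loaded_indices)
    cursor = start - 1
    while cursor + 1 in loaded:
        cursor += 1
    return cursor


# Indices 0..11 are fully loaded; faster threads have also loaded 14, 15, and 18.
loaded = list(range(0, 12)) + [14, 15, 18]
last_index = last_contiguous_index(loaded)

# Items that would be skipped if the largest loaded index (18) were stored instead.
missed = [i for i in range(last_index + 1, max(loaded) + 1) if i not in loaded]
```

Storing the contiguous cursor rather than the largest loaded index means a later stage can resume from `last_index + 1` and simply skip items that are already present.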
[0120] In some embodiments, the computing system may not perform step 302 as described above. That is, the offline data preloading function involved in step 302 is an optional function of the computing system. The computing system may perform step 303 as described below after performing step 301. This application does not limit this.
[0121] 303. During the execution of the training task, the computing system loads multiple first data from the first storage layer to the second storage layer based on the first access information. The data of the training task includes multiple first data.
[0122] In this embodiment, in response to the execution request of a training task, the computing system loads multiple sets of first data, determined based on first access information, from the first storage layer to the second storage layer during the execution of the training task. This application refers to the data loaded from the first storage layer to the second storage layer during the execution of the training task as "first data." This data can also be understood as online data pre-loaded data, or data loaded by the computing system through the online data pre-loading function. This process means that during the execution of the training task, the computing system loads the data it will access next from the first storage layer to the second storage layer in real time, according to the first access information and the execution progress of the training task, thereby fully utilizing the bandwidth of the second storage layer to reduce data loading latency. Furthermore, since the first data is determined based on the first access information, the first data loaded into the second storage layer is retrieved promptly and does not occupy the storage space of the second storage layer for an extended period. This allows the computing system to release the storage space of the second storage layer in a timely manner to load other data, thus eliminating the need for the computing system to spend time reading data from the first storage layer and effectively reducing data loading latency. In other words, during the execution of the training task, the computing system utilizes its online preloading function to load some data from the first storage layer to the second storage layer in advance, according to the global access order of the data required for the training task, thereby accelerating the data loading process during the execution of the training task.
[0123] In some embodiments, the computing system loads multiple first data sets of the training task from the first storage layer to the second storage layer until the amount of data in the second storage layer meets a fourth threshold. The fourth threshold is a preset threshold that can be set according to business needs. For example, the fourth threshold could be 95% of the total capacity of the second storage layer, or 20TB, etc. This application does not limit this to any particular threshold. It should be understood that the storage space of the second storage layer is limited. By setting the fourth threshold, the computing system avoids continuing to load data when the second storage layer is full, thus preventing resource waste.
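The threshold check can be sketched as follows. This is only an illustration under stated assumptions: the function name `preload_until_threshold`, the use of per-item sizes, and the choice to stop before the threshold would be exceeded are all hypothetical, and the 95% ratio is the example value given above.

```python
def preload_until_threshold(item_sizes, capacity, used=0, threshold_ratio=0.95):
    """Return (loaded indices, final usage), stopping before usage would exceed the threshold."""
    limit = capacity * threshold_ratio  # fourth threshold, expressed as a share of capacity
    loaded = []
    for index, size in enumerate(item_sizes):
        if used + size > limit:
            break  # loading this item would push the layer past the threshold
        used += size
        loaded.append(index)
    return loaded, used


# Capacity 100 units, threshold 95 units: the fourth 30-unit item is not loaded.
print(preload_until_threshold([30, 30, 30, 30], capacity=100))  # ([0, 1, 2], 90)
```

In the described system, loading would resume once deleted data frees space in the second storage layer; the sketch shows only the stop check.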
[0124] In some embodiments, in conjunction with the aforementioned step 302, it can be understood that before executing the training task, the computing system can load multiple second data of the training task from the first storage layer to the second storage layer based on the first access information. That is, before the computing system loads multiple first data from the first storage layer to the second storage layer, multiple second data of the training task are loaded from the first storage layer to the second storage layer. Accordingly, in this step, the access order of the multiple second data is before the access order of the multiple first data.
[0125] In some embodiments, in conjunction with the aforementioned step 302, when the computing system loads multiple second data from the first storage layer to the second storage layer through multiple threads, the computing system stores an index to indicate the data that has been loaded into the second storage layer after determining that the first condition is met. Accordingly, in this step, the computing system reads the index and loads multiple first data from the first storage layer to the second storage layer, starting with the access order of the data indicated by the index in the first access information.
[0126] The following example, using a distributed training task of an AI model, illustrates the process of the computing system loading the first data. Referring to Figure 6, which is a schematic diagram of an online data preloading process provided in an embodiment of this application, as shown in Figure 6, during the execution of the training task, the computing system loads multiple pieces of the first data of the training task from the first storage layer to the second storage layer according to the global access order indicated by the first access information and the execution progress of the training task.
[0127] Illustratively, the computing system creates a class named "online preloader," which starts together with the user's training process and accompanies the entire training process as a daemon. It continuously preloads subsequent data from the first storage layer to the second storage layer by parsing the first access information (GSL). It should be understood that online data preloading can be started in parallel for multiple accelerator cards, with one accelerator card corresponding to one training process and one online preloader. Furthermore, multiple threads can be run for each accelerator card to perform data preloading, thereby fully utilizing computing resources in large-scale cluster training. For example, the threads can send data preloading requests to the second storage layer through the relevant components of the data management function, causing the second storage layer to load the data indicated by each data preloading request from the first storage layer. It should be noted that the binding relationship between the processes, threads, and the online data preloading class shown in Figure 6 does not mean that online data preloading is performed by the accelerator card.
[0128] Furthermore, as shown in Figure 6, when the online data preloading function of each accelerator card is activated, a common intermediate file is read. This intermediate file, created by the offline data preloading function, is used to store indexes. When the online data preloading function is activated, the index in the intermediate file is recorded as the start index, i.e., the starting cursor for online data preloading, avoiding duplicate data preloading. In addition, to ensure the order and non-repetition of parallel data preloading, the computing system can assign uniformly incrementing thread IDs to multiple threads of multiple processes and maintain the total number of threads in the computing system used to perform online data preloading. For example, when reading data items in GSL, each thread reads the data item whose index modulo the total number of threads equals the thread ID. For instance, the computing system runs threads 0, 1, 2, and 3 to read data items 0-7. Thread 0 reads data items 0 and 4, thread 1 reads data items 1 and 5, thread 2 reads data items 2 and 6, and thread 3 reads data items 3 and 7. It can be seen that this method ensures the order and non-repetition of parallel data preloading.
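The modulo-based assignment described above can be sketched as follows. The function name `items_for_thread` and its parameters are hypothetical; `start_index` stands for the starting cursor read from the intermediate file, and `end_index` is an exclusive upper bound introduced only for the demonstration.

```python
def items_for_thread(thread_id, total_threads, start_index, end_index):
    """Indices in [start_index, end_index) whose index modulo total_threads equals thread_id."""
    return [i for i in range(start_index, end_index) if i % total_threads == thread_id]


# Four threads reading data items 0-7, as in the example above.
for thread_id in range(4):
    print(thread_id, items_for_thread(thread_id, 4, 0, 8))

# Resuming from a stored start index of 12: no item before the cursor is revisited.
print(items_for_thread(0, 4, 12, 20))  # [12, 16]
```

Because every index has exactly one remainder modulo the total thread count, the per-thread lists are disjoint and together cover the whole range, which is what gives ordered, non-repeating parallel preloading.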
[0129] The above steps 301 to 303 describe the data preloading process involved before and during the execution of the training task by the computing system. In some embodiments, the computing system may omit step 302; that is, step 303 is executed after step 301 to load multiple first data of the training task from the first storage layer to the second storage layer. This application does not limit this. The following steps 304 and 305, using the first and second computing nodes as examples, describe the data loading involved during the execution of the training task by each computing node in the computing system.
[0130] 304. The computing system loads the first target data from multiple first data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform training tasks based on the first target data in the memory of the first computing node.
[0131] In this embodiment of the application, the first target data is the data that the first computing node needs to access during the execution of the training task. This process is that the computing system uses the high bandwidth of the second storage layer to quickly load the data required by the first computing node to execute the training task into the memory of the first computing node, thereby reducing the data loading latency.
[0132] In some embodiments, the computing system loads the first target data, determined based on the second access information, from the second storage layer into the memory of the first computing node via the first computing node. That is, during the parallel execution of training tasks by multiple accelerator cards, each accelerator card, according to its local access order required for executing its training task, has the computing node load the data that the accelerator card needs to access from the second storage layer into memory for processing. Since the computing system has preloaded the relevant data for the training task into the second storage layer based on the first access information, when the first computing node needs to access the first target data from multiple first data sets to execute the training task, it can utilize the high bandwidth of the second storage layer to quickly load the first target data, thereby effectively reducing data loading latency.
[0133] In some embodiments, during the process of loading first target data determined based on second access information from the second storage layer into the memory of the first computing node via the first computing node, the computing system also loads at least one third data determined based on the second access information from the second storage layer into the memory of the first computing node via the first computing node. The access order of the at least one third data is after the access order of the first target data. The number of third data can be one or more, and this application does not limit this. Based on this, when the first computing node needs to access third data, it can read the third data from memory to execute the training task, without spending time loading the third data from the second storage layer. This process means that during the process of the computing system loading the data required for executing the training task from the second storage layer into memory via the computing node, it prefetches the data required for the next training task from the second storage layer into memory, saving the computing node the time to load this data from the second storage layer into memory, thereby further reducing data loading latency.
[0134] Furthermore, based on the aforementioned step 302, the computing system can load multiple second data from the first storage layer to the second storage layer based on the first access information before executing the training task. Correspondingly, before executing this step 304, the computing system loads the third target data from the multiple second data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node executes the training task based on the third target data in the memory of the first computing node. The access order of the third target data precedes the access order of the first target data.
[0135] 305. The computing system loads the second target data from multiple first data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform training tasks based on the second target data in the memory of the second computing node.
[0136] In this embodiment, the second target data is the data that the second computing node needs to access during the execution of the training task. This process involves the computing system utilizing the high bandwidth of the second storage layer to quickly load the data required by the second computing node to perform the training task into the memory of the second computing node, thereby reducing data loading latency. Step 305 is similar to step 304 above and will not be repeated. Furthermore, based on the aforementioned step 302, the computing system can load multiple second data from the first storage layer to the second storage layer based on the first access information before executing the training task. Accordingly, before executing this step 305, the computing system loads the fourth target data from the multiple second data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node executes the training task based on the fourth target data in the memory of the second computing node. The access order of the fourth target data precedes the access order of the second target data.
[0137] Steps 304 and 305 above, taking the first computing node and the second computing node as examples, describe how each computing node loads data when the computing nodes in the computing system execute the training task in parallel. It should be understood that the computing system may also include more computing nodes, which can share the data in the second storage layer, thereby improving the execution efficiency of the entire training task.
[0138] The following example, using a distributed training task for an AI model, illustrates the process of a computing system loading arbitrary target data to execute a training task. Referring to Figure 7, which is a schematic diagram of a data loading process provided in an embodiment of this application, as shown in Figure 7, during the execution of a training task, the computing system loads the target data of the training task from the second storage layer into the memory of any computing node according to the local access order indicated by the second access information and the execution progress of the training task. This allows the accelerator card on the computing node to read the target data from memory and process it.
[0139] Illustratively, the computing system creates a data loading class (dataloader) and a dataset class (dataset). The dataset calls the function "get_data" to parse the LSL and read data step by step according to the access order indicated by the LSL. The dataloader drives the dataset to execute "get_data" by calling the function "next". When calling "next", the dataloader can specify the number of training samples (batch), so that multiple data items are loaded at a time. Each call to "next" passes an auto-incrementing index, indicating that "get_data" needs to retrieve the data at position `index` in the LSL.
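The relationship between the dataloader and the dataset can be sketched as below. Only the names "get_data" and "next" come from the description; the class layout, the `read_fn` callback standing in for the actual read path, and the `batch_size` parameter are assumptions made for illustration.

```python
class Dataset:
    def __init__(self, lsl, read_fn):
        self.lsl = lsl          # local access order: list of data identifiers
        self.read_fn = read_fn  # hypothetical reader for one identifier

    def get_data(self, index):
        # Retrieve the data at position `index` in the LSL.
        return self.read_fn(self.lsl[index])


class DataLoader:
    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size
        self.index = 0  # auto-incrementing position in the LSL

    def next(self):
        """Return up to batch_size items, advancing the index after each one."""
        batch = []
        while len(batch) < self.batch_size and self.index < len(self.dataset.lsl):
            batch.append(self.dataset.get_data(self.index))
            self.index += 1
        return batch


lsl = ["a.bin", "c.bin", "a.bin", "d.bin", "b.bin"]
loader = DataLoader(Dataset(lsl, str.upper), batch_size=2)
print(loader.next())  # ['A.BIN', 'C.BIN']
print(loader.next())  # ['A.BIN', 'D.BIN']
print(loader.next())  # ['B.BIN']
```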
[0140] Additionally, refer to Figure 8, which is a schematic diagram of a data prefetching process provided in an embodiment of this application. Based on the data prefetching function, the computing system uses any computing node to load the data that the computing node will access next from the second storage layer into the memory of that computing node in advance, according to the second access information and the execution progress of the training task. That is, while the computing node loads data step by step through the dataloader, the data for the next few steps is prefetched from the second storage layer into memory. This process can be completed concurrently by multiple threads, so that the data for the subsequent one or more steps can be read directly from memory, greatly reducing data loading latency. Accordingly, when the computing node loads data through the data loading function and the data is not found in memory, the data is loaded from the second storage layer into memory. For example, when the computing node loads the data at index i (where i is an integer) from the second storage layer into memory, multiple threads prefetch the data at index i+1, index i+2, and index i+3 from the second storage layer into memory.
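A minimal sketch of the prefetching behavior follows, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The class name `PrefetchingReader`, the `read_from_second_layer` callback, and the default depth of three are assumptions; the depth matches the i+1, i+2, i+3 example above.

```python
from concurrent.futures import ThreadPoolExecutor


class PrefetchingReader:
    def __init__(self, lsl, read_from_second_layer, depth=3, workers=3):
        self.lsl = lsl
        self.read = read_from_second_layer  # hypothetical reader for the second storage layer
        self.depth = depth
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.pending = {}  # index -> Future holding data prefetched into memory

    def _prefetch(self, index):
        if index < len(self.lsl) and index not in self.pending:
            self.pending[index] = self.pool.submit(self.read, self.lsl[index])

    def get(self, index):
        # Schedule the next `depth` indices before serving the current one.
        for ahead in range(index + 1, index + 1 + self.depth):
            self._prefetch(ahead)
        future = self.pending.pop(index, None)
        if future is not None:
            return future.result()         # hit: the data was prefetched
        return self.read(self.lsl[index])  # miss: load from the second storage layer

    def close(self):
        self.pool.shutdown(wait=True)


reader = PrefetchingReader(["a", "b", "c", "d", "e"], str.upper)
print([reader.get(i) for i in range(5)])  # ['A', 'B', 'C', 'D', 'E']
reader.close()
```

In this sketch only the first access is a miss; every later index has already been submitted to the thread pool by the time it is requested.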
[0141] The above steps 301 to 305 describe the data loading process before and during the execution of the training task by the computing system. It should be understood that the online data preloading function involved in step 303 and the data loading function involved in steps 304 and 305 can be executed synchronously.
[0142] In some embodiments, as can be seen from the foregoing description of the computing system, the computing system provided in this application is also used to provide data management functions. Based on this, the computing system can promptly delete data from the second storage layer based on the execution progress of the task to free up the storage space of the second storage layer, so that the computing system can promptly load other data required for the execution of the task from the first storage layer to the second storage layer, thereby reducing the data loading latency.
[0143] The following section uses the first target data as an example to introduce how the computing system deletes accessed data from the second storage layer.
[0144] Illustratively, after the computing system loads the first target data from the second storage layer into the memory of the first computing node via the first computing node, it generates a data deletion request if the first target data meets a second condition. The data deletion request instructs the deletion of the first target data from the second storage layer. The second condition is either that the first target data is not accessed within a first time period, or that the number of times the first target data has been accessed reaches a second threshold. The first time period is a preset time period, and the second threshold is a preset threshold. Both can be set according to business needs or provided by the user; this application does not limit this. For example, the first time period can be set to 10 minutes, and the second threshold can be set to the total number of times the first target data needs to be accessed to execute the training task. In other words, during the execution of the training task, each time the computing system loads data required for the training task from the second storage layer into the memory of a computing node via that computing node, it promptly determines, based on the execution progress of the training task, whether the data can be deleted from the second storage layer, and generates a data deletion request accordingly.
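The second condition can be sketched as a simple predicate. Everything here is illustrative: the function name and parameters are hypothetical, the time-based branch is modeled as elapsed idle time in seconds, and combining the two alternatives with a logical OR is an assumption, since the description presents them as alternative definitions of the condition.

```python
def meets_second_condition(access_count, last_access_time, now,
                           second_threshold, first_time_period):
    """True if the data has been idle for first_time_period or accessed second_threshold times."""
    idle_too_long = (now - last_access_time) >= first_time_period
    accessed_enough = access_count >= second_threshold
    return idle_too_long or accessed_enough


TEN_MINUTES = 600  # example first time period from the text, in seconds

# Accessed all 3 required times: a deletion request can be generated.
print(meets_second_condition(3, last_access_time=0, now=100,
                             second_threshold=3, first_time_period=TEN_MINUTES))  # True
# Accessed once, 100 seconds ago: keep the data in the second storage layer.
print(meets_second_condition(1, last_access_time=0, now=100,
                             second_threshold=3, first_time_period=TEN_MINUTES))  # False
```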
[0145] In some embodiments, when a data deletion request is generated, the computing system executes the request, deleting the first target data from the second storage layer, thereby promptly releasing the storage space of the second storage layer and improving resource utilization. It should be understood that if the first target data will not be accessed within a first time period, deleting it from the second storage layer can promptly release the storage space occupied by the first target data, or in other words, promptly discard cold data, preventing cold data from occupying the storage space of the second storage layer for an extended period. Similarly, if the number of accesses to the first target data reaches a second threshold, deleting it from the second storage layer can promptly release the storage space occupied by the first target data, or in other words, promptly discard data no longer needed for training tasks, preventing this data from occupying the storage space of the second storage layer, thereby reducing space waste.
[0146] In some embodiments, the computing system executes tasks in parallel through multiple processes, with one process corresponding to one accelerator card. Each process loads data based on its own second access information. In this case, each process can generate a data deletion request. Based on this, when multiple processes generate data deletion requests that all indicate the deletion of the first target data, the computing system executes the data deletion request based on whether the number of accesses to the first target data has reached a target threshold. Illustratively, if the number of accesses to the first target data reaches the target threshold, the data deletion request is executed, and the first target data is deleted from the second storage layer; if the number of accesses to the first target data does not reach the target threshold, the data deletion request is not executed. Here, the target threshold is a preset threshold, such as the number of processes executing the training task. That is, if a piece of data has been accessed by all processes executing the training task, the data is promptly evicted to reduce space waste and cache thrashing.
[0147] Based on the above, taking the first target data as an example, the method of deleting accessed data from the second storage layer by the computing system has been introduced. In some embodiments, the computing system can release the storage space of the second storage layer in a timely manner throughout the entire process of executing the training task. For example, the computing system deletes the fourth data of the training task that meets the third condition from the second storage layer. The third condition means that the fourth data will not be accessed within a second time period; or, the third condition means that the number of times the fourth data is accessed reaches a third threshold. The second time period and the third threshold are preset time periods and thresholds, respectively. Both the second time period and the third threshold can be set according to business needs, and this application does not limit them. For example, the second time period can be the same as or different from the aforementioned first time period, and the third threshold can be the same as or different from the aforementioned second threshold. In other words, the computing system promptly determines whether there is data to be deleted in the second storage layer during the entire process of executing the training task, thereby releasing the storage space of the second storage layer in a timely manner and improving resource utilization.
[0148] Illustratively, the step of deleting the fourth data from the second storage layer can be performed while the training task data is being loaded from the first storage layer to the second storage layer, that is, during the execution of step 303 above. Alternatively, the step of deleting the fourth data from the second storage layer can be performed while the training task data in the second storage layer is being loaded into the memory of a computing node, that is, during the execution of step 304 or step 305 above. This application does not limit this.
[0149] In some embodiments, the computing system maintains a message queue for storing data deletion requests. These requests can be initiated by the computing system during step 303, or during step 304 or 305; this application does not limit this. Illustratively, the computing system polls the message queue and executes each data deletion request by deleting the data indicated by the request from the second storage layer, that is, by evicting that data. For example, the data deletion request may be a garbage collection (GC) request, which the computing system executes using reference counting; this application does not limit this.
[0150] The following example, using the distributed training task of an AI model, illustrates how the computing system deletes data from the second storage layer during the entire training process.
[0151] Figure 9 is a schematic diagram of a data eviction strategy during online data preloading provided in an embodiment of this application. As shown in Figure 9, the computing system creates an online preloader class to implement online data preloading, and a dataloader class to implement data loading. The online preloader and dataloader jointly maintain a message queue, which is used to store data deletion requests, such as garbage collection (GC) requests. During the execution of a training task by the computing system, after data is loaded from the second storage layer into the memory of any computing node to execute the training task, a data deletion request for that data is sent to the message queue. The computing system creates a poller thread to poll the message queue, and when a data deletion request is received, the poller thread sends the data deletion request to the data management component. Upon receiving a data deletion request, this component does not immediately evict the data. Internally, it maintains an inter-card reference count, "ref_map_across," which records, for each piece of data, how many accelerator cards' corresponding online preloaders have issued deletion requests. If the number reaches a user-configured threshold (e.g., the third threshold mentioned above), the data is evicted, freeing up storage space for the online preloader to load new data. It should be understood that each accelerator card's corresponding LSL is a subset of the GSL, scattered within the same range rather than distributed randomly. There may be data overlap between accelerator cards, and the access distances at which different accelerator cards access the same data usually do not differ greatly (access distance refers to the distance between a file and the first file in the LSL). This mechanism ensures that the time a piece of data waits for access from each accelerator card is relatively uniform. Therefore, by adhering to the strategy of evicting data only after all accelerator cards have completed access, cache thrashing can be minimized without unnecessarily occupying storage space in the second storage layer.
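The message queue, poller thread, and inter-card reference count can be sketched together as below, using Python's standard `queue` and `threading` modules. Only the name "ref_map_across" comes from the description; the function `run_poller`, the `evict` callback standing in for the data management component, and the use of `None` as a shutdown marker are assumptions.

```python
import queue
import threading
from collections import Counter


def run_poller(requests, ref_map_across, threshold, evict, stop_token=None):
    """Consume deletion requests; evict a data item once `threshold` requests have arrived."""
    while True:
        data_id = requests.get()  # blocks until a request is available
        if data_id is stop_token:
            break
        ref_map_across[data_id] += 1
        if ref_map_across[data_id] >= threshold:
            del ref_map_across[data_id]
            evict(data_id)  # hypothetical call into the data management component


requests = queue.Queue()
ref_map_across = Counter()
evicted = []
poller = threading.Thread(target=run_poller,
                          args=(requests, ref_map_across, 2, evicted.append))
poller.start()

# Two accelerator cards: a.bin is released by both, b.bin by only one.
for data_id in ["a.bin", "b.bin", "a.bin"]:
    requests.put(data_id)
requests.put(None)  # shutdown marker for the demonstration
poller.join()
print(evicted)  # ['a.bin']
```

With the threshold set to the number of accelerator cards, an item stays in the second storage layer until every card has finished with it, which matches the strategy of evicting only after all cards have completed access.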
[0152] Figure 10 is a schematic diagram of a data eviction strategy during data loading provided in an embodiment of this application. As shown in Figure 10, the computing system creates an online preloader class to implement online data preloading, and a dataloader class to implement data loading. For any process, the computing system maintains, through the dataloader, an in-card reference count "ref_map_inner" of <data file - number of accessed cards>. During class initialization, the system scans the LSL to determine the number of times each data item in the LSL is accessed by the accelerator card corresponding to the process, and determines on this basis whether a data deletion request needs to be issued. If the number reaches a user-preconfigured threshold (e.g., the second threshold involved when deleting the first target data mentioned above), a data deletion request is sent to the message queue that is jointly maintained with the online preloader. The computing system processes the data deletion requests in the message queue in the same way as described for Figure 9 above, which will not be repeated here.
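A minimal sketch of the in-card reference count follows. Only the name "ref_map_inner" comes from the description; the class `InCardRefCounter`, the `on_access` method, and the choice of the scanned per-item access count as the threshold are assumptions (the last mirrors the earlier example in which the second threshold is the total number of times the data needs to be accessed).

```python
from collections import Counter


class InCardRefCounter:
    def __init__(self, lsl):
        # Scan the LSL once: how many times will this process access each item?
        self.expected = Counter(lsl)
        self.ref_map_inner = Counter()

    def on_access(self, data_id):
        """Record one access; return True when a deletion request should be issued."""
        self.ref_map_inner[data_id] += 1
        return self.ref_map_inner[data_id] >= self.expected[data_id]


lsl = ["a.bin", "b.bin", "a.bin", "c.bin"]
counter = InCardRefCounter(lsl)
print([counter.on_access(data_id) for data_id in lsl])  # [False, True, True, True]
```

The first access to a.bin does not trigger a request because the scan shows that this process will read it again later.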
[0153] The following examples, using different data organization formats and the aforementioned data eviction strategies, illustrate how a computing system deletes data from the second storage layer. Illustratively, the data organization formats include, for example, tiled data or packaged data.
[0154] Tiled data refers to storing each data point as an independent file. One data item in the LSL corresponds to one data point, and a single training session requires reading the entire data set. Based on this, the computing system calls the "get_data" function to parse the LSL and read the data file path according to the access order indicated by the LSL. Based on this file path, it first checks whether the data is already in the computing node's memory (for example, because it was prefetched by the data prefetching function). If so, the data is read from memory; otherwise, it is read from the second storage layer. After the data is read, the reference count for that data is updated in the in-card reference count. When the reference count reaches a user-configured threshold, a data deletion request is issued.
[0155] Packaged data refers to storing multiple data items together. For example, large-scale datasets can be packaged uncompressed using zip-style data packets, with multiple data items stored in a single zip file. In this case, a data item in the LSL includes not only the file path of the data packet but also the index "file_index" within that data packet. The computing system calls the "get_data" function to read from the data packet without decompression. Each read does not read the entire data packet; instead, it reads the data at the specified index based on the metadata within the data packet. Thus, the size of one I/O operation is equal to the size of one data item. The remaining handling is the same as for tiled data: the computing system first determines whether the data is already in the memory of the computing node. If so, it reads the data from memory; otherwise, it reads the data from the second storage layer. After the data is read, the reference count of the data is updated in the in-card reference count. When the reference count reaches a user-configured threshold, a data deletion request is issued.
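Reading one item from an uncompressed package can be sketched with Python's standard `zipfile` module, which locates a member through the archive's central directory and reads only that member. The function name `read_packed_item` and the member names are hypothetical, and interpreting "file_index" as the member's position in the archive listing is an assumption.

```python
import io
import zipfile


def read_packed_item(package, file_index):
    """Read one member of a zip package by its position in the archive listing."""
    with zipfile.ZipFile(package) as zf:
        member = zf.infolist()[file_index]  # metadata only; no member data is read yet
        return zf.read(member)              # reads just this member's bytes


# Build a small uncompressed (ZIP_STORED) package in memory for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("sample_0.bin", b"zero")
    zf.writestr("sample_1.bin", b"one")
    zf.writestr("sample_2.bin", b"two")

print(read_packed_item(buffer, 1))  # b'one'
```

Because the members are stored without compression, the bytes read for one item are the item itself, so the payload read per item is about the size of that item rather than the size of the package.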
[0156] It should be noted that the execution information, such as the first condition, the second condition, and the third condition, involved in the above data processing method can all be configured by the user. Illustratively, the computing system receives a configuration request from the user for a training task. This configuration request is used to configure the execution information of the training task; wherein, the execution information includes at least one of the following: a first condition required to load data from the first storage layer to the second storage layer, a second condition required to delete the first target data from the second storage layer, and a third condition required to delete the fourth data from the second storage layer.
[0157] In summary, this application provides a data processing method for a computing system capable of accessing both a first and a second storage layer to load data required for executing a training task. Before executing the training task, the computing system uses an offline data preloading function to load a portion of the training task's data from the first storage layer to the second storage layer according to the global access order indicated by the first access information of the training task, thus fully utilizing idle resources before the training task execution to reduce data loading latency. During the execution of the training task, an online data preloading function loads the data that the computing system will access next from the first storage layer to the second storage layer in real time according to the first access information and execution progress of the training task, thereby fully utilizing the bandwidth of the second storage layer to reduce data loading latency. Furthermore, for any computing node executing the training task, when the computing node needs to access certain data to execute the training task, the data in the second storage layer is loaded into the computing node's memory using the data loading function. Since the bandwidth of the second storage layer is greater than that of the first storage layer, the computing node can quickly read data from the second storage layer to execute the task, effectively reducing data loading latency. Furthermore, during the execution of the training task, the computing node can also use the data prefetching function to prefetch the data required for the next training task from the second storage layer into memory while loading the data required for the training task from the second storage layer into memory. This saves the computing node the time to load this part of the data from the second storage layer into memory, thereby further reducing the data loading latency.
[0158] Referring to Figure 11, the data loading function (or data loader) provided by the above-described computing system will be illustrated below. Figure 11 is a schematic diagram of a data loading function provided in an embodiment of this application. As shown in Figure 11, the data loading function provided by the computing system includes offline data preloading, online data preloading, data loading, and data prefetching. For example, the data loading function, as a software toolkit component, is installed in a server and run by a host in the server. This server or a server cluster consisting of multiple servers can be configured as the above-described computing system.
[0159] Taking the distributed training task of an AI model as an example, before the training task is executed (e.g., during any idle resource period), offline data preloading is used to parse the first access information (GSL) of the training task. Multiple threads read data items from the GSL in parallel and preload the data, maintaining a continuous maximum cursor (last index) written to an intermediate file for subsequent training task startup. This index indicates the data already loaded into the second storage layer. During the training task execution, online data preloading is used to parse the first access information (GSL) of the training task (which is updated in real-time according to the training rounds). Multiple accelerator cards are started in parallel, each capable of running multiple threads for data preloading. During this process, a message queue is polled to promptly evict data based on reference count. Furthermore, for any computing node executing the training task, data loading is used to parse the second access information (LSL) and read data step-by-step according to the access order indicated by the LSL. Data prefetching is also used to prefetch at least one step of data into memory, reducing subsequent data loading latency. It should be understood that the aforementioned online data preloading and data prefetching can run continuously in the background, and users are unaware of the underlying caching architecture.
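The continuous maximum cursor used by offline data preloading can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the GSL is modeled as a plain list, `load_item` is a hypothetical callback that copies one data item from the first storage layer to the second storage layer, and writing the cursor to the intermediate file is omitted. Threads may finish out of order, so the cursor only advances over positions whose predecessors have all been loaded.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ContiguousCursor:
    """Tracks the largest index n such that GSL positions 0..n are all loaded."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = set()      # positions that finished out of order
        self.last_index = -1    # -1 means nothing contiguous is loaded yet

    def mark_loaded(self, position: int) -> None:
        with self._lock:
            self._done.add(position)
            # Advance over every position that has now become contiguous.
            while self.last_index + 1 in self._done:
                self.last_index += 1
                self._done.remove(self.last_index)


def offline_preload(gsl, load_item, num_threads=4):
    """Load every GSL item with a thread pool and return the final cursor."""
    cursor = ContiguousCursor()

    def worker(position: int) -> None:
        load_item(gsl[position])       # hypothetical copy into the second layer
        cursor.mark_loaded(position)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, range(len(gsl))))
    return cursor.last_index
```

If positions 0 and 2 are loaded but position 1 is not, the cursor stays at 0; it moves to 2 only after position 1 completes, so the stored index never overstates what is available.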
[0160] Furthermore, since the computing system of this application provides offline and online data preloading capabilities, the "dataset" class created by the computing system through the data loading function can directly read the corresponding data from the second storage layer. Thus, user initialization of the dataset is relatively simple. For example, the user-defined dataset class can inherit from the class provided in this application and import the relevant configuration file, calling the parent class's "get_data" function in the "getitem" function. In this way, users do not need to handle complex input / output (IO) operations, such as how to read data from the storage layer and how to handle various details during the data reading process. The parent class's "get_data" function has already handled all these IO-related matters, greatly simplifying the process of user-defined dataset creation. After initializing the dataset, the dataset is passed to the "dataloader" class for further data processing in preparation for subsequent tasks. Furthermore, the "dataloader" is highly compatible with deep learning frameworks such as PyTorch. It eliminates the need to pass a sampler; all input parameters for the PyTorch "dataloader" can be passed to the "dataloader" provided in this application and executed correctly, such as batch size and data mapping. Therefore, the software toolkit components provided in this application are easy to operate, highly compatible, and facilitate efficient and convenient data processing for users, serving subsequent higher-level task requirements such as model training.
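The inheritance pattern described above might look like the following Python sketch. It assumes that "getitem" refers to Python's `__getitem__` method. The parent class `CachedDataset`, its constructor, and the dictionary standing in for the second storage layer are hypothetical; only the idea of calling the parent's `get_data` from the user's "getitem" function comes from the description.

```python
class CachedDataset:
    """Hypothetical stand-in for the parent class provided by the toolkit."""

    def __init__(self, config: dict) -> None:
        # "cache" maps an item key to bytes already in the second storage layer.
        self._cache = config["cache"]

    def __len__(self) -> int:
        return len(self._cache)

    def get_data(self, key):
        # All IO details would be hidden here; this sketch reads from a dict.
        return self._cache[key]


class MyDataset(CachedDataset):
    """User-defined dataset: the user writes only the decoding logic."""

    def __init__(self, config: dict, keys: list) -> None:
        super().__init__(config)
        self._keys = keys

    def __getitem__(self, index: int):
        raw = self.get_data(self._keys[index])   # parent class handles the IO
        return raw.decode("utf-8")
```

For example, `MyDataset({"cache": {"a": b"x", "b": b"y"}}, ["b", "a"])[0]` returns the decoded item stored under key "b".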
[0161] Figure 12 is a schematic diagram of a data processing device provided in an embodiment of this application. This device is applied to a computing system, which includes a first computing node and a second computing node. Each computing node is equipped with an accelerator card and memory. The first and second computing nodes are used to execute model training tasks. As shown in Figure 12, the device includes a parsing module 1201, an online data preloading module 1202, and a front-end data loading module 1203. Illustratively, the parsing module 1201, the online data preloading module 1202, and the front-end data loading module 1203 run on one or more processors in the computing system.
[0162] The parsing module 1201 is used to obtain the first access information of the training task. The first access information is used to indicate the global access order of the training task data during the execution of the training task.
[0163] The online data preloading module 1202 is used to load multiple first data from the first storage layer to the second storage layer based on the first access information during the execution of the training task. The data of the training task includes multiple first data.
[0164] The front-end data loading module 1203 is used to load the first target data from multiple first data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform training tasks based on the first target data in the memory of the first computing node; and to load the second target data from multiple first data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform training tasks based on the second target data in the memory of the second computing node.
[0165] In one possible implementation, the device further includes:
[0166] The offline data preloading module is used to load multiple second data from the first storage layer to the second storage layer based on the first access information before executing the training task, until a first condition is met. The access order of the multiple second data is before the access order of the multiple first data. The data of the training task includes multiple second data. The first condition means that the amount of data in the second storage layer reaches a first threshold; or, the first condition means that the multiple second data includes user-specified data in the data of the training task.
[0167] The front-end data loading module 1203 is further configured to, before loading the first target data into the memory of the first computing node through the first computing node, load the third target data among the multiple second data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node can perform a training task based on the third target data in the memory of the first computing node; and before loading the second target data into the memory of the second computing node through the second computing node, load the fourth target data among the multiple second data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node can perform a training task based on the fourth target data in the memory of the second computing node.
[0168] In one possible implementation, when multiple threads are running to load multiple pieces of second data from a first storage layer to a second storage layer, wherein any two threads are responsible for loading different data, the apparatus further includes:
[0169] A storage module is used to store an index after the first condition is met. This index is used to indicate the data that has been loaded into the second storage layer during the execution of the training task.
[0170] The online data preloading module 1202 is used to load multiple first data from the first storage layer to the second storage layer, starting with the access order of the data indicated by the index in the first access information.
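The handover from offline to online preloading through the stored index can be sketched as follows. This Python fragment is illustrative only: the JSON file format, the function names, and the convention that -1 means "nothing preloaded" are assumptions rather than details given in this application.

```python
import json
from pathlib import Path


def save_index(path: Path, last_index: int) -> None:
    """Persist the cursor after the first condition is met."""
    path.write_text(json.dumps({"last_index": last_index}))


def load_index(path: Path) -> int:
    """Return the stored cursor, or -1 if no offline preloading took place."""
    if not path.exists():
        return -1
    return json.loads(path.read_text())["last_index"]


def items_to_preload(gsl: list, last_index: int) -> list:
    """Online preloading starts right after the data indicated by the index."""
    return gsl[last_index + 1:]
```

With a five-item access order and a stored index of 2, online preloading would start at the fourth item, so data already placed in the second storage layer is not loaded again.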
[0171] In one possible implementation, the device further includes:
[0172] The generation module is used to generate a data deletion request after the first target data in the second storage layer is loaded into the memory of the first computing node through the first computing node, and the first target data meets the second condition. The data deletion request is used to indicate that the first target data is deleted from the second storage layer.
[0173] The second condition refers to the first target data not being accessed within the first time period; or, the second condition refers to the number of times the first target data is accessed reaching the second threshold.
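One way to picture the access-count variant of the second condition is the following Python sketch. The class name, the request format, and the use of an in-process queue in place of the message queue are assumptions made for illustration; the second storage layer is modeled as a dictionary.

```python
import queue
from collections import Counter


class DeletionRequester:
    """Counts loads into memory and emits a deletion request at the threshold."""

    def __init__(self, threshold: int, requests: queue.Queue) -> None:
        self._threshold = threshold      # the "second threshold"
        self._counts = Counter()
        self._requests = requests

    def on_loaded_into_memory(self, key: str) -> None:
        self._counts[key] += 1
        if self._counts[key] == self._threshold:
            # Hypothetical request format; it names the data to delete.
            self._requests.put({"op": "delete", "key": key})


def drain(requests: queue.Queue, second_layer: dict) -> list:
    """Poll the queue and delete the requested data from the second layer."""
    deleted = []
    while True:
        try:
            request = requests.get_nowait()
        except queue.Empty:
            return deleted
        second_layer.pop(request["key"], None)
        deleted.append(request["key"])
```

With a threshold of 2, the first load of an item produces no request, and the second load produces a request that removes the item when the queue is next polled.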
[0174] In one possible implementation, the parsing module 1201 is further configured to obtain second access information of the training task. The second access information is used to indicate the local access order of the data of the training task by the accelerator cards on each computing node during the parallel execution of the training task by multiple accelerator cards. The second access information is generated based on the first access information.
[0175] The front-end data loading module 1203 is used to load the first target data determined based on the second access information in the second storage layer into the memory of the first computing node through the first computing node.
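How the second access information is generated from the first access information is not detailed in this passage. The following Python sketch shows one plausible derivation, assuming that each training step consumes one batch per accelerator card and that batches are assigned to cards in rank order; the function name and parameters are hypothetical.

```python
def local_access_order(gsl: list, rank: int, world_size: int, batch_size: int) -> list:
    """Derive one card's step-by-step access order from the global order.

    Assumption: step s covers gsl[s * world_size * batch_size : ...], and the
    card with the given rank takes the rank-th batch within that slice.
    A trailing incomplete step is dropped.
    """
    per_step = world_size * batch_size
    steps = []
    for start in range(0, len(gsl) - per_step + 1, per_step):
        offset = start + rank * batch_size
        steps.append(gsl[offset:offset + batch_size])
    return steps
```

For a global order of eight items, two cards, and a batch size of two, the card with rank 0 accesses items 0, 1 and then 4, 5, while the card with rank 1 accesses items 2, 3 and then 6, 7.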
[0176] In one possible implementation, the device further includes:
[0177] The data prefetching module is used to load at least one third data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node, wherein the access order of the at least one third data is after the access order of the first target data.
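The prefetching behavior can be sketched with a background thread and a bounded buffer. This Python fragment is illustrative: `read_step` is a hypothetical callback that reads one step of data from the second storage layer, `depth` is an assumed parameter for how many steps may wait in memory, and error handling is omitted.

```python
import queue
import threading


def prefetching_loader(lsl_steps, read_step, depth=1):
    """Yield step data in LSL order while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer() -> None:
        for step in lsl_steps:
            # Blocks once `depth` steps are already waiting in memory.
            buffer.put(read_step(step))
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item
```

While the training loop processes one step, the producer thread is already reading later steps, so the data whose access order follows the first target data is in memory by the time it is requested.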
[0178] In one possible implementation, the device further includes:
[0179] The deletion module is used to delete the fourth data of the training task that meets the third condition from the second storage layer. The third condition means that the fourth data will not be accessed within the second time period; or, the third condition means that the number of times the fourth data is accessed reaches the third threshold.
[0180] In one possible implementation, the deletion module is used to delete the fourth data from the second storage layer during the process of loading training task data from the first storage layer to the second storage layer; or, during the process of loading training task data from the second storage layer to the memory of the computing node, the fourth data is deleted from the second storage layer.
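Because the global access order is known in advance, the time-based variant of the third condition can be evaluated by looking ahead in that order. The following Python sketch makes a simplifying assumption that the second time period is expressed as a number of upcoming access positions (`horizon`) rather than as wall-clock time; the function name is hypothetical.

```python
def keys_to_delete(gsl: list, position: int, cached: set, horizon: int) -> list:
    """Return cached keys that do not appear in the next `horizon` accesses.

    `position` is the index in the global access order of the next access.
    """
    upcoming = set(gsl[position:position + horizon])
    return sorted(key for key in cached if key not in upcoming)
```

For the access order a, b, c, a, d at position 1 with a horizon of 2, only b and c are needed soon, so a cached copy of a is a deletion candidate even though it will be needed again later.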
[0181] The parsing module 1201, the online data preloading module 1202, and the front-end data loading module 1203 can all be implemented in software or in hardware. For example, the implementation of the parsing module 1201 will be described below. Similarly, the implementation of the online data preloading module 1202 and the front-end data loading module 1203 can refer to the implementation of the parsing module 1201.
[0182] As an example of a software functional unit, the parsing module 1201 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, or a container. Further, the aforementioned computing instance may be one or more. For example, the parsing module 1201 may include code running on multiple hosts / virtual machines / containers. It should be noted that the multiple hosts / virtual machines / containers used to run the code may be distributed in the same region or in different regions. Further, the multiple hosts / virtual machines / containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ including one or more geographically proximate data centers. Typically, a region may include multiple AZs.
[0183] Similarly, multiple hosts / virtual machines / containers used to run this code can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.
[0184] As an example of a hardware functional unit, the parsing module 1201 may include at least one computing device, such as a server. Alternatively, the parsing module 1201 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be implemented using a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
[0185] The multiple computing devices included in the parsing module 1201 can be distributed within the same region or in different regions. Similarly, the multiple computing devices included in the parsing module 1201 can be distributed within the same Availability Zone (AZ) or in different AZs. Likewise, the multiple computing devices included in the parsing module 1201 can be distributed within the same Virtual Private Cloud (VPC) or in multiple VPCs. These multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
[0186] It should be noted that, in other embodiments, the parsing module 1201 can be used to execute any step in the data processing method. The steps implemented by the parsing module 1201, the online data preloading module 1202, and the front-end data loading module 1203 can be specified as needed. The parsing module 1201, the online data preloading module 1202, and the front-end data loading module 1203 respectively implement different steps in the data processing method to realize all the functions of the data processing device.
[0187] This application also provides a computing device 1300. Figure 13 is a schematic diagram of the structure of a computing device provided in an embodiment of this application. As shown in Figure 13, the computing device 1300 includes: a bus 1302, a processor 1304, a memory 1306, and a communication interface 1308. The processor 1304, the memory 1306, and the communication interface 1308 communicate with each other via the bus 1302. The computing device 1300 can be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1300.
[0188] The bus 1302 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 13, but this does not imply that there is only one bus or one type of bus. The bus 1302 can include pathways for transmitting information between the components of the computing device 1300 (e.g., the memory 1306, the processor 1304, and the communication interface 1308).
[0189] The processor 1304 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
[0190] The memory 1306 may include volatile memory, such as random access memory (RAM).
[0191] The memory 1306 stores executable program code, and the processor 1304 executes this executable program code to implement the functions of the aforementioned parsing module 1201, online data preloading module 1202, and front-end data loading module 1203, thereby realizing the data processing method. That is, the memory 1306 stores instructions for executing the data processing method.
[0192] Alternatively, the memory 1306 may store executable code, which the processor 1304 executes to implement the functions of the aforementioned data processing device, thereby implementing the data processing method. That is, the memory 1306 may store instructions for executing the data processing method.
[0193] The communication interface 1308 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 1300 and other devices or communication networks.
[0194] This application also provides a computing device cluster. Figure 14 is a schematic diagram of a computing device cluster provided in this application embodiment. As shown in Figure 14, the computing device cluster includes at least one computing device 1300. The memory 1306 of one or more computing devices 1300 in the computing device cluster may store the same instructions for executing data processing methods.
[0195] In some possible implementations, the memory 1306 of one or more computing devices 1300 in the computing device cluster may also store partial instructions for executing data processing methods. In other words, a combination of one or more computing devices 1300 can jointly execute instructions for executing data processing methods.
[0196] It should be noted that the memory 1306 in different computing devices 1300 within the computing device cluster can store different instructions, each used to execute a portion of the functions of the data processing device. That is, the instructions stored in the memory 1306 of different computing devices 1300 can implement the functions of one or more of the aforementioned parsing module 1201, online data preloading module 1202, and front-end data loading module 1203.
[0197] It should be understood that the functions of the computing device 1300 shown in Figure 14 can also be performed by multiple computing devices 1300.
[0198] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 15 is a schematic diagram of one possible implementation of a computing device cluster provided by an embodiment of this application. As shown in Figure 15, computing devices 1300A, 1300B, and 1300C are connected via a network. Specifically, each computing device is connected to the network through its communication interface. In this implementation, the memory 1306 in computing device 1300A stores instructions for performing the functions of the parsing module 1201, the online data preloading module 1202, and the front-end data loading module 1203. That is, computing device 1300A is equipped with the data processing device, and the memory 1306 in computing device 1300A is, for example, volatile memory. Additionally, the memory 1306 in computing device 1300B is used to implement the functions of the second storage layer described above. The memory 1306 in computing device 1300B may include, for example, non-volatile memory, such as read-only memory (ROM), flash memory, NVDIMM, SCM, hard disk drive (HDD), or solid-state drive (SSD). The memory 1306 in computing device 1300C is used to implement the functions of the first storage layer described above. The memory 1306 in computing device 1300C may include, for example, non-volatile memory.
[0199] This application also provides a computer program product containing instructions. This computer program product may be a software or program product containing instructions capable of running on a computing device cluster or stored on any available medium. When the computer program product runs on the computing device cluster, it causes the computing device cluster to perform a data processing method.
[0200] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device cluster can store data, or a data storage device, such as a data center, that contains one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device cluster to perform a data processing method.
[0201] Those skilled in the art will recognize that the method steps and units described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the steps and components of each embodiment have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0202] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0203] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, or they may be electrical, mechanical, or other forms of connection.
[0204] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of this application, depending on actual needs.
[0205] Furthermore, the units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or software.
[0206] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computing device (which may be a personal computer, server, or computing device, etc.) to execute all or part of the steps of the methods in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0207] In this application, the terms "first," "second," etc., are used to distinguish identical or similar items that have substantially the same function and purpose. It should be understood that there is no logical or temporal dependency between "first," "second," and "nth," nor does it limit the quantity or order of execution. It should also be understood that although the following description uses the terms "first," "second," etc., to describe various elements, these elements should not be limited by the terms. These terms are merely used to distinguish one element from another. For example, without departing from the scope of various examples, first data can be referred to as second data, and similarly, second data can be referred to as first data. Both first data and second data can be data, and in some cases, they can be separate and distinct data.
[0208] In this application, the term "at least one" means one or more, and the term "multiple" means two or more. The terms "system" and "network" are often used interchangeably.
[0209] It should also be understood that the term "if" can be interpreted as meaning "when," "upon," "in response to determining," or "in response to detecting." Similarly, depending on the context, the phrases "if it is determined..." or "if [the stated condition or event] is detected" can be interpreted as meaning "when it is determined...," "in response to determining...," "when [the stated condition or event] is detected," or "in response to detecting [the stated condition or event]."
[0210] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0211] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. This computer program product includes one or more computer program instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
[0212] The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer program instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., solid-state drives).
[0213] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0214] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A data processing method, characterized in that the method is applied to a computing system, the computing system includes a first computing node and a second computing node, each computing node being equipped with an accelerator card and memory, the first computing node and the second computing node being used to perform model training tasks, and the method includes: obtaining first access information for the training task, wherein the first access information is used to indicate the global access order of the data of the training task during the execution of the training task; during the execution of the training task, loading, based on the first access information, multiple first data from the first storage layer to the second storage layer, wherein the data of the training task includes the multiple first data; loading the first target data among the multiple first data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node performs the training task based on the first target data in the memory of the first computing node; and loading the second target data among the multiple first data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node performs the training task based on the second target data in the memory of the second computing node.
2. The method of claim 1, wherein the method further includes: before executing the training task, loading, based on the first access information, multiple second data from the first storage layer to the second storage layer until a first condition is met, wherein the access order of the multiple second data is prior to the access order of the multiple first data, the data of the training task includes the multiple second data, and the first condition refers to the amount of data in the second storage layer reaching a first threshold, or the first condition refers to the multiple second data including user-specified data in the data of the training task; before loading the first target data into the memory of the first computing node through the first computing node, loading the third target data among the multiple second data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node performs the training task based on the third target data in the memory of the first computing node; and before loading the second target data into the memory of the second computing node through the second computing node, loading the fourth target data among the multiple second data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node performs the training task based on the fourth target data in the memory of the second computing node.
3. The method of claim 2, wherein, in a scenario where multiple threads load the multiple second data from the first storage layer to the second storage layer and any two threads are responsible for loading different data, the method further includes: after the first condition is met, storing an index, wherein the index is used to indicate the data that has been loaded into the second storage layer during the execution of the training task; and the step of loading multiple first data from the first storage layer to the second storage layer based on the first access information includes: loading the multiple first data from the first storage layer to the second storage layer starting from the access order of the data indicated by the index in the first access information.
4. The method according to any one of claims 1 to 3, characterized in that the method further includes: after the first target data in the second storage layer is loaded into the memory of the first computing node through the first computing node, generating a data deletion request if the first target data meets a second condition, wherein the data deletion request is used to indicate that the first target data is to be deleted from the second storage layer, and the second condition is that the first target data is not accessed within a first time period, or the second condition is that the number of times the first target data is accessed reaches a second threshold.
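The second condition of claim 4 can be illustrated by a small predicate that emits a deletion request when either form of the condition holds. This is a sketch under stated assumptions: the `CacheEntry` and `DeleteRequest` types, their fields, and the use of seconds as the time unit are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    key: str
    last_access: float  # seconds on any monotonic clock (assumed unit)
    access_count: int


@dataclass
class DeleteRequest:
    key: str
    reason: str


def maybe_delete_request(
    entry: CacheEntry,
    now: float,
    idle_period: float,
    max_accesses: int,
) -> Optional[DeleteRequest]:
    """Return a deletion request if the entry meets the second condition."""
    if entry.access_count >= max_accesses:
        # The number of accesses has reached the second threshold.
        return DeleteRequest(entry.key, "access-count")
    if now - entry.last_access >= idle_period:
        # The entry has not been accessed within the first time period.
        return DeleteRequest(entry.key, "idle")
    return None
```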
5. The method according to any one of claims 1 to 4, characterized in that the method further includes: obtaining second access information of the training task, wherein the second access information is used to indicate the local access order of the data of the training task by the accelerator card on each computing node while multiple accelerator cards execute the training task in parallel, and the second access information is generated based on the first access information; and the step of loading the first target data from the plurality of first data in the second storage layer into the memory of the first computing node through the first computing node includes: loading the first target data, determined based on the second access information, into the memory of the first computing node through the first computing node.
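Claim 5 does not state how the local access order is generated from the global access order. As one hedged example, the sketch below assumes a round-robin split in which accelerator card `r` of `n` receives every `n`-th item starting at position `r`; the function name `derive_local_orders` and the round-robin rule are assumptions for illustration only.

```python
from typing import Dict, List


def derive_local_orders(global_order: List[str], num_cards: int) -> Dict[int, List[str]]:
    """Split a global access order into one local order per accelerator card.

    Assumed rule: card r receives global positions r, r + num_cards, ...
    """
    if num_cards <= 0:
        raise ValueError("num_cards must be positive")
    return {card: global_order[card::num_cards] for card in range(num_cards)}
```

With such a split, the first item of each card's local order identifies the target data that the corresponding computing node loads into its memory first.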
6. The method of claim 5, wherein the method further includes: during the process of loading the first target data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node, loading at least one third data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node, wherein the access order of the at least one third data is after the access order of the first target data.
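The prefetching of claim 6 can be sketched as loading the target item together with the next few items of the same local access order. The function name `load_with_prefetch`, the `depth` parameter, and the dictionary-based storage layers are hypothetical, and the sketch performs the loads sequentially rather than concurrently for simplicity.

```python
from typing import Dict, List


def load_with_prefetch(
    local_order: List[str],
    position: int,
    fast_layer: Dict[str, bytes],
    node_memory: Dict[str, bytes],
    depth: int = 2,
) -> List[str]:
    """Load the target at `position` and prefetch up to `depth` later items.

    Returns the keys that were prefetched in addition to the target.
    """
    target = local_order[position]
    node_memory[target] = fast_layer[target]
    prefetched: List[str] = []
    for key in local_order[position + 1 : position + 1 + depth]:
        # Prefetch only items that are available and not already in memory.
        if key in fast_layer and key not in node_memory:
            node_memory[key] = fast_layer[key]
            prefetched.append(key)
    return prefetched
```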
7. The method according to any one of claims 1 to 6, characterized in that the method further includes: deleting, from the second storage layer, fourth data of the training task that satisfies a third condition, wherein the third condition is that the fourth data will not be accessed during a second time period, or the third condition is that the number of times the fourth data is accessed reaches a third threshold.
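Because the global access order is known in advance, the first form of the third condition in claim 7 can be checked by looking ahead in that order. The sketch below approximates the second time period by a window of upcoming accesses counted in positions, which is an assumption of this example; the function name `evict_not_needed_soon` and the `window` parameter are likewise hypothetical.

```python
from typing import Dict, List


def evict_not_needed_soon(
    fast_layer: Dict[str, bytes],
    global_order: List[str],
    cursor: int,
    window: int,
) -> List[str]:
    """Delete items that do not appear in the next `window` accesses.

    `cursor` is the position of the next access in the global order.
    Returns the deleted keys in the fast layer's insertion order.
    """
    upcoming = set(global_order[cursor : cursor + window])
    victims = [key for key in fast_layer if key not in upcoming]
    for key in victims:
        del fast_layer[key]  # frees space for data loaded from the slow layer
    return victims
```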
8. The method of claim 7, wherein the step of deleting the fourth data from the second storage layer is performed during the process of loading the data of the training task from the first storage layer to the second storage layer; or the step of deleting the fourth data from the second storage layer is performed during the process of loading the data of the training task in the second storage layer into the memory of the computing node.
9. A data processing apparatus, characterized in that the apparatus is used in a computing system, the computing system comprising a first computing node and a second computing node, each computing node being equipped with an accelerator card and memory, the first computing node and the second computing node being used to perform a model training task, the apparatus comprising: a parsing module, used to obtain first access information of the training task, wherein the first access information is used to indicate the global access order of the data of the training task during the execution of the training task; an online data preloading module, used to load multiple first data from a first storage layer to a second storage layer based on the first access information during the execution of the training task, wherein the data of the training task includes the multiple first data; and a front-end data loading module, used to load the first target data from the plurality of first data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node executes the training task based on the first target data in the memory of the first computing node, and to load the second target data from the plurality of first data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node executes the training task based on the second target data in the memory of the second computing node.
10. The apparatus of claim 9, wherein the apparatus further includes: an offline data preloading module, used to load multiple second data from the first storage layer to the second storage layer based on the first access information before executing the training task, until a first condition is met, wherein the access order of the multiple second data precedes the access order of the multiple first data, the data of the training task includes the multiple second data, and the first condition is that the amount of data in the second storage layer reaches a first threshold, or the first condition is that the multiple second data include user-specified data in the data of the training task; and the front-end data loading module is also used to: before loading the first target data into the memory of the first computing node through the first computing node, load third target data among the plurality of second data in the second storage layer into the memory of the first computing node through the first computing node, so that the accelerator card on the first computing node performs the training task based on the third target data in the memory of the first computing node; and before loading the second target data into the memory of the second computing node, load fourth target data among the plurality of second data in the second storage layer into the memory of the second computing node through the second computing node, so that the accelerator card on the second computing node performs the training task based on the fourth target data in the memory of the second computing node.
11. The apparatus of claim 10, wherein, in a scenario where multiple threads load the plurality of second data from the first storage layer to the second storage layer and any two threads are responsible for loading different data, the apparatus further includes: a storage module, configured to store an index after the first condition is met, the index being used to indicate data that has been loaded into the second storage layer during the execution of the training task; and the online data preloading module is used to load the plurality of first data from the first storage layer to the second storage layer, starting from the access order of the data indicated by the index in the first access information.
12. The apparatus of any one of claims 9 to 11, wherein the apparatus further includes: a generation module, configured to, after the first target data in the second storage layer is loaded into the memory of the first computing node through the first computing node, generate a data deletion request if the first target data meets a second condition, wherein the data deletion request is used to instruct the deletion of the first target data from the second storage layer, and the second condition is that the first target data is not accessed within a first time period, or the second condition is that the number of times the first target data is accessed reaches a second threshold.
13. The apparatus according to any one of claims 9 to 12, characterized in that the parsing module is further configured to obtain second access information of the training task, wherein the second access information is used to indicate the local access order of the data of the training task by the accelerator card on each computing node while multiple accelerator cards execute the training task in parallel, and the second access information is generated based on the first access information; and the front-end data loading module is used to load the first target data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node.
14. The apparatus of claim 13, wherein the apparatus further includes: a data prefetching module, used to load at least one third data determined based on the second access information from the second storage layer into the memory of the first computing node through the first computing node, wherein the access order of the at least one third data is after the access order of the first target data.
15. The apparatus of any one of claims 9 to 14, wherein the apparatus further includes: a deletion module, used to delete, from the second storage layer, fourth data of the training task that meets a third condition, wherein the third condition is that the fourth data will not be accessed within a second time period, or the third condition is that the number of times the fourth data is accessed reaches a third threshold.
16. The apparatus of claim 15, wherein the deletion module is used to delete the fourth data from the second storage layer during the process of loading the data of the training task from the first storage layer to the second storage layer; or the deletion module is used to delete the fourth data from the second storage layer during the process of loading the data of the training task from the second storage layer into the memory of the computing node.
17. A computing system, characterized in that the computing system includes multiple computing nodes, each of which is equipped with an accelerator card and memory, the multiple computing nodes are used to perform a model training task, and the computing system is used to implement the data processing method according to any one of claims 1 to 8.
18. A cluster of computing devices, characterized in that the cluster includes at least one computing device, each computing device including a processor and memory; and the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to implement the data processing method of any one of claims 1 to 8.
19. A computer program product comprising instructions, characterized in that the instructions, when executed by a cluster of computing devices, cause the cluster of computing devices to implement the data processing method of any one of claims 1 to 8.
20. A computer-readable storage medium, characterized in that the storage medium stores computer program instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to implement the data processing method of any one of claims 1 to 8.