Checkpoint file processing method and system, electronic device and storage medium

By caching checkpoint files in memory and asynchronously persisting them during deep learning training, the problem of low read/write performance of checkpoint files is solved, achieving efficient checkpoint file processing, reducing training overhead and improving read/write performance.

CN117407370B · Active · Publication Date: 2026-06-12 · SHANGHAI SENSETIME TECH DEV CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI SENSETIME TECH DEV CO LTD
Filing Date: 2023-10-27
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

The existing checkpoint file read/write process is performed synchronously with the training task, resulting in poor read/write performance and increasing the overhead and latency of deep learning training.

Method used

The checkpoint file is cached in the memory space allocated by the processing process during the model training process, and memory sharing is achieved by using the memory file descriptor mechanism. It is then asynchronously and persistently stored in the storage medium. At the same time, the checkpoint file is backed up in the cluster to achieve asynchronous operation.

Benefits of technology

It reduces the additional overhead of checkpoint file reading and writing for model training, lowers read and write latency, improves the processing efficiency and read and write performance of checkpoint files, and enables rapid recovery of training in the event of node failure.

✦ Generated by Eureka AI based on patent content.

Abstract

The present disclosure relates to a checkpoint file processing method and system, an electronic device and a storage medium. The method comprises: a model training process obtaining a checkpoint file generated in a training task execution process and determining file information of the checkpoint file, and encapsulating the file information as a save request and sending the save request to a processing process; the processing process allocating memory space of local memory for the checkpoint file according to the file information in the save request and sharing the memory space allocated for the checkpoint file with the model training process; the model training process caching the checkpoint file to the memory space allocated by the processing process; and the processing process persistently storing the checkpoint file in the memory space into a storage medium in the case that the checkpoint file has been cached to the allocated memory space. The embodiment of the present disclosure can improve the read-write efficiency and read-write performance of the checkpoint file and reduce the additional overhead generated by reading and writing the checkpoint file in the model training process.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to a checkpoint file processing method and system, electronic device and storage medium.

Background Technology

[0002] Currently, deep learning training frameworks generally offer checkpointing functionality for inference or for resuming training. Checkpointing periodically saves the complete state of the model, so that when training fails it can resume from the most recently saved checkpoint instead of starting from scratch, which avoids wasting training time. For example, the popular PyTorch training framework supports saving the model, optimizer, gradients, and other states as checkpoint files and storing them on computer storage. However, the reading and writing of checkpoint files is currently performed synchronously with the training task. Each round of checkpoint reading and writing therefore moves a large amount of data while training waits, which reduces read/write performance, increases latency, and imposes significant additional overhead on deep learning training.

Summary of the Invention

[0003] This disclosure proposes a technical solution for checkpoint file processing.

[0004] According to one aspect of this disclosure, a method for processing checkpoint files is provided, comprising: a model training process acquiring checkpoint files generated during the execution of a training task, determining file information of the checkpoint files, and encapsulating the file information into a save request and sending the save request to a processing process, wherein the file information includes a filename and a file size; the processing process allocating local memory space for the checkpoint files according to the file information in the save request, and sharing the allocated memory space with the model training process; the model training process caching the checkpoint files in the memory space allocated by the processing process; and, when the checkpoint files are cached in the allocated memory space, the processing process persistently storing the checkpoint files in the memory space into a storage medium.

[0005] In one possible implementation, sharing the memory space allocated for the checkpoint file with the model training process includes: the processing process using the memory file descriptor mechanism of the native operating system to share the memory space allocated for the checkpoint file with the model training process, and sending the file descriptor corresponding to the allocated memory space to the model training process, wherein the file descriptor is used to access the allocated memory space; wherein, the model training process caching the checkpoint file in the memory space allocated by the processing process includes: the model training process caching the checkpoint file in the allocated memory space according to the file descriptor.

[0006] In one possible implementation, the model training process and the processing process are processes of a first node in a cluster; the cluster also includes at least one second node; the first node and the second node are used to perform model training tasks; the method further includes: the processing process in the first node backs up the checkpoint file in the memory space to at least one second node, provided that the checkpoint file has been cached in the allocated memory space.

[0007] In one possible implementation, the cluster includes at least two second nodes, and backing up the checkpoint file in memory space to at least one second node includes: determining a target second node from at least two second nodes according to a preset training task node list; wherein the training task node list is used to record the node numbers of the first node and the second node; the node number of the target second node is located after the node number of the first node; and backing up the checkpoint file in memory space to the target second node.
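The target selection described in [0007] can be sketched as follows. This is only an illustration under stated assumptions: the function name pick_backup_node is hypothetical, "located after" is read as "immediately following in the list", and wrapping from the last entry back to the first is an assumption that the text does not state.

```python
from typing import Sequence


def pick_backup_node(node_list: Sequence[int], first_node: int) -> int:
    """Return the node number that follows first_node in node_list.

    Assumption: the target is the entry immediately after the first node,
    and the last entry wraps around to the first one.
    """
    if len(node_list) < 2:
        raise ValueError("need at least one other node to back up to")
    idx = node_list.index(first_node)  # raises ValueError if the node is unknown
    return node_list[(idx + 1) % len(node_list)]
```

With a node list of [0, 1, 2], node 0 would back up to node 1, and under the wrap-around assumption node 2 would back up to node 0.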

[0008] In one possible implementation, when the checkpoint file has been cached in the allocated memory space, the processing process persists the checkpoint file in the memory space to the storage medium, including: when the checkpoint file has been cached in the allocated memory space, the processing process adds the filename of the checkpoint file to the processing queue; when the queue processing thread of the processing process receives the filename of the checkpoint file from the processing queue, it persists the checkpoint file in the memory space to the storage medium.

[0009] In one possible implementation, the step of persistently storing the checkpoint file in memory space into a storage medium includes: calling the interface of the storage medium corresponding to the storage type according to the storage type declared in the training task executed by the model training process, and persistently storing the checkpoint file in memory space into the storage medium corresponding to the storage type; wherein, the storage type includes object storage and / or file storage.
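One way to picture the storage-type dispatch in [0009] is a lookup table from the declared storage type to a backend function. The sketch below is hypothetical throughout: the names BACKENDS, persist, save_to_file_storage and save_to_object_storage are invented for illustration, and the object-storage backend is a local directory standing in for a real object store, whose client interface the text does not specify.

```python
from pathlib import Path
from typing import Callable, Dict


def save_to_file_storage(name: str, data: bytes, root: Path) -> Path:
    """Write the checkpoint as a regular file under root."""
    target = root / name
    target.write_bytes(data)
    return target


def save_to_object_storage(name: str, data: bytes, root: Path) -> Path:
    """Stand-in for an object-store upload: a local directory plays the bucket."""
    target = root / "bucket" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


# Hypothetical mapping from the storage type declared by the training task
# to the interface that is called for it.
BACKENDS: Dict[str, Callable[[str, bytes, Path], Path]] = {
    "file": save_to_file_storage,
    "object": save_to_object_storage,
}


def persist(storage_type: str, name: str, data: bytes, root: Path) -> Path:
    """Call the backend that matches the declared storage type."""
    try:
        backend = BACKENDS[storage_type]
    except KeyError:
        raise ValueError(f"unsupported storage type: {storage_type!r}") from None
    return backend(name, data, root)
```

Keeping the backends behind one table means a further storage type could be supported by registering one more function, without touching the caller.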

[0010] In one possible implementation, the method further includes: when the checkpoint file has been cached in the allocated memory space, the model training process stores the description information of the checkpoint file in a database, the description information including file information and the file status of the checkpoint file, the file status indicating at least one of the following: whether the checkpoint file is cached in local memory, whether it is persistently stored in a storage medium, or whether it has been backed up to a second node; when the processing process has completed persistently storing the checkpoint file in the memory space into a storage medium, and / or completed backing up the checkpoint file to a second node in the cluster, the process updates the file status of the checkpoint file recorded in the database.
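The description information and status updates in [0010] could be kept in a small table. The sketch below uses an in-memory SQLite database purely for illustration; the disclosure does not name a database, and the table layout, column names and function names are all assumptions.

```python
import sqlite3
from typing import Optional, Tuple


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with one row per checkpoint file."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE checkpoints ("
        " filename TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " in_memory INTEGER NOT NULL DEFAULT 0,"
        " persisted INTEGER NOT NULL DEFAULT 0,"
        " backup_node INTEGER)"  # NULL means not backed up
    )
    return db


def record_cached(db: sqlite3.Connection, filename: str, size: int) -> None:
    """Training-process side: register a file once it is cached in memory."""
    db.execute(
        "INSERT INTO checkpoints (filename, size, in_memory) VALUES (?, ?, 1)",
        (filename, size),
    )
    db.commit()


def mark_persisted(db: sqlite3.Connection, filename: str) -> None:
    """Processing-process side: persistence to the storage medium finished."""
    db.execute("UPDATE checkpoints SET persisted = 1 WHERE filename = ?", (filename,))
    db.commit()


def mark_backed_up(db: sqlite3.Connection, filename: str, node: int) -> None:
    """Processing-process side: backup to the given second node finished."""
    db.execute("UPDATE checkpoints SET backup_node = ? WHERE filename = ?", (node, filename))
    db.commit()


def file_status(db: sqlite3.Connection, filename: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Return (in_memory, persisted, backup_node), or None for an unknown file."""
    return db.execute(
        "SELECT in_memory, persisted, backup_node FROM checkpoints WHERE filename = ?",
        (filename,),
    ).fetchone()
```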

[0011] In one possible implementation, the method further includes: when the model training process needs to read the checkpoint file, sending a read request to the processing process, the read request including the filename of the checkpoint file; the processing process reads the file status of the checkpoint file indicated by the filename from the database according to the filename in the read request, and performs processing corresponding to the read file status according to the read file status, so that the model training process can read the checkpoint file from local memory or from storage medium according to the processing result of the processing process.

[0012] In one possible implementation, the step of performing processing corresponding to the read file state, so that the model training process can read the checkpoint file from local memory or from storage medium according to the processing result of the processing process, includes: if the read file state indicates that the checkpoint file is located in local memory, sending the file descriptor of the memory space where the checkpoint file is located to the model training process, so that the model training process can read the checkpoint file from the memory space of local memory according to the file descriptor sent by the processing process; if the read file state indicates that the checkpoint file is not located in local memory... If the checkpoint file has been backed up to the second node, the backed-up checkpoint file in the second node is read into the local memory according to the node information of the second node to which the checkpoint file is backed up. The file descriptor of the memory space where the checkpoint file is read into the local memory is sent to the model training process, so that the model training process can read the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process. If the read file status indicates that the checkpoint file is not located in the local memory and that the checkpoint file has not been backed up to the second node, the model training process is notified to read the checkpoint file from the storage medium.
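The three read cases in [0012] amount to a small decision function. The sketch below is a hedged reading of that paragraph: the middle case is taken to mean "not in local memory, but backed up to a second node", which is inferred from the sentences around it, and the names FileStatus and plan_read as well as the returned action labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileStatus:
    in_local_memory: bool
    backup_node: Optional[int]  # None means the file has not been backed up


def plan_read(status: FileStatus) -> str:
    """Choose what the processing process does for a read request."""
    if status.in_local_memory:
        # The file is cached locally: hand its file descriptor to the trainer.
        return "send_fd"
    if status.backup_node is not None:
        # Pull the backup into local memory first, then hand over the descriptor.
        return "fetch_from_backup_then_send_fd"
    # Neither cached nor backed up: the trainer reads from the storage medium.
    return "read_from_storage"
```

The ordering of the checks reflects the stated preference for local memory over a remote backup, and for either of those over the storage medium.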

[0013] According to one aspect of this disclosure, a checkpoint file processing system is provided, the system comprising: a model training process and a processing process; the model training process is configured to acquire checkpoint files generated during the execution of a training task, determine the file information of the checkpoint files, encapsulate the file information into a save request, and send the save request to the processing process, wherein the file information includes a filename and a file size; the processing process is configured to allocate local memory space for the checkpoint files according to the file information in the save request, and share the allocated memory space with the model training process; the model training process is configured to cache the checkpoint files in the memory space allocated by the processing process; the processing process is configured to persistently store the checkpoint files in the memory space into a storage medium when the checkpoint files are cached in the allocated memory space.

[0014] In one possible implementation, sharing the memory space allocated for the checkpoint file with the model training process includes: utilizing the memory file descriptor mechanism of the native operating system to share the memory space allocated for the checkpoint file with the model training process, and sending the file descriptor corresponding to the allocated memory space to the model training process, wherein the file descriptor is used to access the allocated memory space; wherein, the model training process caches the checkpoint file in the memory space allocated by the processing process includes: the model training process caches the checkpoint file in the allocated memory space according to the file descriptor.

[0015] In one possible implementation, the model training process and the processing process are processes of a first node in a cluster; the cluster also includes at least one second node; the first node and the second node are used to perform model training tasks; the processing process in the first node is also used to back up the checkpoint file in the memory space to at least one second node if the checkpoint file has been cached in the allocated memory space.

[0016] In one possible implementation, the cluster includes at least two second nodes, and backing up the checkpoint file in memory space to at least one second node includes: determining a target second node from at least two second nodes according to a preset training task node list; wherein the training task node list is used to record the node numbers of the first node and the second node; the node number of the target second node is located after the node number of the first node; and backing up the checkpoint file in memory space to the target second node.

[0017] In one possible implementation, the step of persisting the checkpoint file in the memory space to the storage medium when the checkpoint file has been cached in the allocated memory space includes: adding the filename of the checkpoint file to the processing queue when the checkpoint file has been cached in the allocated memory space; and persisting the checkpoint file in the memory space to the storage medium when the queue processing thread of the processing process receives the filename of the checkpoint file from the processing queue.

[0018] In one possible implementation, the step of persistently storing the checkpoint file in memory space into a storage medium includes: calling the interface of the storage medium corresponding to the storage type according to the storage type declared in the training task executed by the model training process, and persistently storing the checkpoint file in memory space into the storage medium corresponding to the storage type; wherein, the storage type includes object storage and / or file storage.

[0019] In one possible implementation, the model training process is further configured to, when the checkpoint file has been cached in the allocated memory space, store the description information of the checkpoint file in a database, the description information including file information and the file status of the checkpoint file, the file status indicating at least one of the following: whether the checkpoint file is cached in local memory, whether it is persistently stored in a storage medium, or whether it has been backed up to a second node; the processing process is further configured to, after completing the persistent storage of the checkpoint file in the memory space into the storage medium, and / or completing the backup of the checkpoint file to the second node in the cluster, update the file status of the checkpoint file recorded in the database.

[0020] In one possible implementation, the model training process is further configured to send a read request to the processing process when it is necessary to read the checkpoint file, the read request including the filename of the checkpoint file; the processing process is further configured to read the file status of the checkpoint file indicated by the filename from the database according to the filename in the read request, and perform processing corresponding to the read file status according to the read file status, so that the model training process can read the checkpoint file from local memory or from storage medium according to the processing result of the processing process.

[0021] In one possible implementation, the step of performing processing corresponding to the read file state, so that the model training process can read the checkpoint file from local memory or from storage medium according to the processing result of the processing process, includes: if the read file state indicates that the checkpoint file is located in local memory, sending the file descriptor of the memory space where the checkpoint file is located to the model training process, so that the model training process can read the checkpoint file from the memory space of local memory according to the file descriptor sent by the processing process; if the read file state indicates that the checkpoint file is not located in local memory... If the checkpoint file has been backed up to the second node, the backed-up checkpoint file in the second node is read into the local memory according to the node information of the second node to which the checkpoint file is backed up. The file descriptor of the memory space where the checkpoint file is read into the local memory is sent to the model training process, so that the model training process can read the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process. If the read file status indicates that the checkpoint file is not located in the local memory and that the checkpoint file has not been backed up to the second node, the model training process is notified to read the checkpoint file from the storage medium.

[0022] According to one aspect of this disclosure, an electronic device is provided, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the method described above.

[0023] According to one aspect of this disclosure, a computer-readable storage medium is provided that stores computer program instructions thereon, which, when executed by a processor, implement the above-described method.

[0024] In this embodiment, the model training process can generate file information for checkpoint files. The processing process allocates local memory space for the checkpoint files, the model training process caches the checkpoint files in the allocated memory space, and the processing process then persists the checkpoint files in the memory space to the storage medium. This achieves local caching and asynchronous persistent storage of checkpoint files, which can ensure that the subsequent execution of training tasks in the model training process is not affected. This makes the processing of checkpoint files and the execution of model training tasks asynchronous operations, thereby reducing the additional overhead of reading and writing checkpoint files for model training, reducing checkpoint file read and write latency, and improving the processing efficiency and read / write performance of checkpoint files.

[0025] It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure. Other features and aspects of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief Description of the Drawings

[0026] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the specification, serve to illustrate the technical solutions of this disclosure.

[0027] Figure 1 shows a flowchart of a checkpoint file processing method according to an embodiment of the present disclosure.

[0028] Figure 2 shows a schematic diagram of a checkpoint file processing system according to an embodiment of the present disclosure.

[0029] Figure 3 shows a block diagram of a checkpoint file processing system according to an embodiment of the present disclosure.

[0030] Figure 4 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.

Detailed Description

[0031] Various exemplary embodiments, features, and aspects of this disclosure will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0032] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0033] In this document, the term "and / or" merely describes an association between related objects and indicates that three relationships can exist. For example, A and / or B can represent three cases: A alone, both A and B, and B alone. Furthermore, the term "at least one" in this document means any one of a plurality of items, or any combination of at least two of them. For example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C.

[0034] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art will understand that this disclosure can be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure.

[0035] As mentioned above, the current process of reading and writing checkpoint files is synchronized with the training task. This results in a large amount of data being read and written for each round of checkpoint file reading and writing, poor read / write performance, and high read / write latency, leading to significant additional overhead for deep learning training. In view of this, this disclosure provides a checkpoint file processing method that, when the model training process generates and saves checkpoint files during the model training task, caches the checkpoint files in local memory without blocking subsequent execution of the model training task. Then, the checkpoint files in local memory are asynchronously persisted to file storage or object storage via a processing process. Furthermore, the checkpoint files are backed up to the memory of other nodes in the cluster via a network channel, so that if the first node fails, the backed-up checkpoint files can be retrieved from other nodes to quickly resume model training. Additionally, when the model training process reads checkpoint files, it can prioritize reading from local memory. Thus, through memory caching, asynchronous persistence, and backup, the performance of each level of storage media can be efficiently utilized, effectively improving the read / write efficiency and performance of checkpoint files.

[0036] The checkpoint file processing method of this disclosure can be deployed in an electronic device for performing model training tasks. For example, it can be deployed in multiple nodes of a cluster, where each node can be used to perform a portion of the training tasks of a distributed model training task. The nodes of the cluster can be devices such as servers or computers. Alternatively, it can be deployed in a terminal device for performing model training tasks. The terminal device can be a user equipment (UE), mobile device, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (PDA), handheld device, computing device, in-vehicle device, wearable device, etc. The method can be implemented by a processor in the electronic device calling computer-readable instructions stored in the memory.

[0037] Figure 1 shows a flowchart of a checkpoint file processing method according to an embodiment of this disclosure. As shown in Figure 1, the checkpoint file processing method includes:

[0038] In step S11, the model training process obtains the checkpoint file generated during the execution of the training task, determines the file information of the checkpoint file, encapsulates the file information into a save request, and sends the save request to the processing process. The file information includes the file name and file size.

[0039] The model training process can be a process used to execute training tasks. The training task executed can be an independent model training task or a part of the model training task. This disclosure does not limit this aspect.

[0040] Obtaining checkpoint files generated during the training task execution process can include serializing the checkpoint data generated during the training task execution into checkpoint files. It should be understood that model training typically involves multiple iterations, and each iteration may generate at least one checkpoint data point. Therefore, it is possible to set up a checkpoint file saving process after at least one training iteration. To facilitate the storage of checkpoint data, the checkpoint data generated after at least one training iteration can be serialized into checkpoint files, that is, unordered checkpoint data can be integrated into ordered checkpoint files for easier saving and retrieval. When serializing checkpoint data into checkpoint files, a unique filename can be automatically assigned to each checkpoint file to distinguish between different checkpoint files; and the size of the checkpoint file, i.e., the total data volume of the checkpoint file, can be calculated to allocate memory space based on the size of the checkpoint file.

[0041] Specifically, encapsulating the file information into a save request and sending it to the processing process means that the save request carries the file information of the checkpoint file and instructs the processing process to allocate memory space for the checkpoint file based on that file information.
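Step S11 can be illustrated with a short sketch. Everything named here is an assumption made for illustration: pickle stands in for whatever serializer the training framework uses, the filename pattern and the JSON message layout (including the "op" field) are invented, and build_save_request is a hypothetical helper.

```python
import json
import pickle
import uuid
from typing import Tuple


def build_save_request(state: dict, step: int) -> Tuple[bytes, bytes]:
    """Serialize checkpoint data and wrap its file information in a save request.

    Returns (payload, request): payload is the serialized checkpoint file,
    request is the message sent to the processing process.
    """
    payload = pickle.dumps(state)  # stand-in for the framework's serializer
    # A unique filename distinguishes this checkpoint file from all others.
    filename = f"ckpt_step{step}_{uuid.uuid4().hex}.bin"
    request = json.dumps(
        {"op": "save", "filename": filename, "size": len(payload)}
    ).encode("utf-8")
    return payload, request
```

Only the filename and size travel in the request. The payload itself stays with the model training process until memory space has been allocated for it.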

[0042] In step S12, the processing process allocates local memory space for the checkpoint file based on the file information in the save request, and shares the memory space allocated for the checkpoint file with the model training process so that the model training process and the processing process can jointly access the allocated memory space.

[0043] Upon receiving file information from the model training process, the processing process can allocate memory space corresponding to the file size indicated in the file information for the checkpoint file. This local memory refers to the memory of the electronic device executing the checkpoint file processing method of this embodiment, or, in other words, the memory of the electronic device executing the model training process and processing process of this embodiment.

[0044] It should be understood that the processing process and the model training process are different processes. By default, each process has its own address space, so different processes cannot use the same memory space. For the model training process and the processing process to access the allocated memory space together, so that the model training process can store the checkpoint file there, the processing process shares the memory space with the model training process after allocating it for the checkpoint file.

[0045] Optionally, sharing the memory space allocated for the checkpoint file with the model training process can include: the processing process utilizing the native operating system's memory file descriptor (memfd) mechanism to share the memory space allocated for the checkpoint file with the model training process, and sending the file descriptor corresponding to the allocated memory space to the model training process. The file descriptor is used to access the allocated memory space. This method leverages the native operating system's built-in memory file descriptor mechanism to achieve memory sharing, maximizing the utilization of native memory while avoiding competition between the processing process and the model training process for shared memory usage rights.

[0046] In the case of using the memory file descriptor mechanism to implement memory sharing, file descriptors for accessing the allocated memory space can be generated. Sending the file descriptors of the allocated memory space to the model training process makes it easier for the model training process to access the allocated memory space based on the file descriptors.

[0047] It should be understood that the memory file descriptor mechanism can be a memory sharing mechanism built into the local operating system. The processing process can call the memory file descriptor mechanism by calling the relevant interface provided by the operating system, thereby realizing memory sharing. Of course, those skilled in the art can also choose to use other known memory sharing mechanisms in the art, as long as they can realize inter-process memory sharing. This disclosure does not limit this.

[0048] In step S13, the model training process caches the checkpoint file in the memory space allocated by the processing process.

[0049] In practical applications, after the processing process completes the allocation of memory space for the checkpoint file and shares the allocated memory space with the model training process, it can notify the model training process to cache the checkpoint file in the memory space allocated by the processing process. Then, the model training process can execute the caching of the checkpoint file in the memory space allocated by the processing process.

[0050] As described above, the processing process can utilize the memory file descriptor mechanism of the local operating system to share the memory space allocated for the checkpoint file with the model training process, and send the file descriptor corresponding to the allocated memory space to the model training process. Based on this, the model training process caches the checkpoint file in the memory space allocated by the processing process, which can include: the model training process caching the checkpoint file in the allocated memory space according to the file descriptor, that is, caching the checkpoint file in the allocated memory space according to the file descriptor sent by the processing process. It should be understood that the model training process can access the allocated memory space using the file descriptor, and thus can cache the checkpoint file in the allocated memory space.

[0051] In step S14, the processing process persists the checkpoint file in the memory space to the storage medium, provided that the checkpoint file has been cached in the allocated memory space.

[0052] As described above, the processing process and the model training process share the allocated memory space. Therefore, the processing process can read the stored checkpoint file from the memory space and persist the checkpoint file to a storage medium. The storage medium may include storage media of storage types such as object storage and / or file storage. It is understood that the storage medium used to persist the checkpoint file can be a storage device independent of the electronic device executing the checkpoint file processing method of this disclosure embodiment, which is beneficial for persistent storage of the checkpoint file.

[0053] Persisting a checkpoint file to a storage medium typically takes a considerable amount of time. To keep the persistence of each checkpoint file orderly and free of errors, in one possible implementation, the processing process persistently storing the checkpoint file in the memory space into the storage medium, in the case that the checkpoint file has been cached in the allocated memory space, includes:

[0054] When the checkpoint file is already cached in the allocated memory space, the processing process adds the checkpoint file's filename to the processing queue. When the queue processing thread receives the checkpoint file's filename from the processing queue, it persists the checkpoint file from memory to the storage medium. This method allows the processing queue to sequentially execute the persistence of each checkpoint file to the storage medium, facilitating efficient asynchronous persistence without blocking the model training process from executing subsequent training tasks.

[0055] In this process, if the checkpoint file has been cached in the allocated memory space, the processing process adds the filename of the checkpoint file to the processing queue. This can be understood as follows: when the processing process receives a save request from the model training process and the model training process has completed storing the checkpoint file in the memory space, the processing process adds the filename of the checkpoint file to the processing queue to queue for persistent storage of the checkpoint file into the storage medium.

[0056] The queue processing thread can be understood as a thread of the processing process. When the queue processing thread receives a filename in the processing queue, it means that it needs to store the checkpoint file corresponding to the filename into the storage medium. At this time, it can find the checkpoint file corresponding to the filename in the local memory based on the filename received from the processing queue and store the checkpoint file into the storage medium.

[0057] In practical applications, if the processing process encounters an error while storing the checkpoint file in the storage medium, for example a file transfer failure caused by a network anomaly, the checkpoint file can be added back to the processing queue so that its persistent storage is re-executed. This does not affect the model training process, improves the processing performance of the checkpoint file, and provides a certain degree of fault tolerance.
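The queue-based persistence and re-queueing described in the preceding paragraphs can be sketched in Python as follows. This is an illustration only: the class name `PersistQueue`, the `store` callable, the dictionary standing in for the in-memory cache, and the `max_attempts` limit are assumptions that do not appear in this disclosure.

```python
import queue
import threading
from typing import Callable, Dict, List


class PersistQueue:
    """A queue processing thread that persists cached checkpoint files by filename."""

    def __init__(self, cache: Dict[str, bytes],
                 store: Callable[[str, bytes], None],
                 max_attempts: int = 3) -> None:
        self.cache = cache                # filename -> bytes, stand-in for local memory
        self.store = store                # writes to the storage medium, may raise
        self.max_attempts = max_attempts  # assumed retry limit
        self.failed: List[str] = []
        self._queue: "queue.Queue" = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, filename: str) -> None:
        """Called once the checkpoint file is fully cached; returns immediately."""
        self._queue.put((filename, 1))

    def _run(self) -> None:
        while True:
            filename, attempt = self._queue.get()
            try:
                self.store(filename, self.cache[filename])
            except Exception:
                if attempt < self.max_attempts:
                    # Add the file back to the queue to re-execute persistence.
                    self._queue.put((filename, attempt + 1))
                else:
                    self.failed.append(filename)
            finally:
                self._queue.task_done()

    def wait(self) -> None:
        """Block until every queued file has been persisted or given up on."""
        self._queue.join()
```

Because `submit` only enqueues a filename, the caller is not blocked while the worker thread performs the slow write, which mirrors the asynchronous behavior described above.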

[0058] To facilitate storing checkpoint files in the user's desired storage medium, the storage type can be declared in advance during the training task executed in the model training process. This allows the processing process to store the checkpoint files in the storage medium corresponding to the declared storage type. Specifically, persistently storing checkpoint files in memory can include: calling the interface of the storage medium corresponding to the declared storage type in the training task, and persistently storing the checkpoint files in memory in the storage medium corresponding to the storage type. The storage type includes object storage and / or file storage. This approach supports multiple storage media to meet various user storage needs and helps reduce the storage cost of checkpoint files.
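As a hedged illustration of selecting a storage interface by declared storage type, the following Python sketch dispatches to one of two backends. The class names, the `put` method, the `"file"` and `"object"` keys, and the in-memory stand-in for object storage are all hypothetical; a real deployment would call the interface of an actual file system or object store.

```python
import os
from typing import Dict, Protocol


class StorageBackend(Protocol):
    def put(self, filename: str, data: bytes) -> str: ...


class FileStorageBackend:
    """Writes checkpoint files under a root directory (file storage)."""

    def __init__(self, root: str) -> None:
        self.root = root

    def put(self, filename: str, data: bytes) -> str:
        path = os.path.join(self.root, filename)
        with open(path, "wb") as handle:
            handle.write(data)
        return path


class InMemoryObjectBackend:
    """Stand-in for an object store: keeps objects in a dict keyed by object key."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, filename: str, data: bytes) -> str:
        key = f"checkpoints/{filename}"
        self.objects[key] = data
        return key


def persist(backends: Dict[str, StorageBackend], storage_type: str,
            filename: str, data: bytes) -> str:
    """Call the interface of the backend matching the declared storage type."""
    if storage_type not in backends:
        raise ValueError(f"undeclared storage type: {storage_type}")
    return backends[storage_type].put(filename, data)
```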

[0059] According to embodiments of this disclosure, the model training process generates file information for checkpoint files. The processing process allocates local memory space for the checkpoint files, the model training process caches the checkpoint files in the allocated memory space, and the processing process then persists the checkpoint files in the memory space to a storage medium. This achieves local caching and asynchronous persistent storage of checkpoint files, so that the model training process can continue executing subsequent training tasks without being affected. The processing of checkpoint files and the execution of model training tasks thus become asynchronous, which reduces the additional overhead that reading and writing checkpoint files imposes on model training, reduces checkpoint file read / write latency, and improves the processing efficiency and read / write performance of checkpoint files.

[0060] As described above, the checkpoint file processing method of this disclosure can be applied to multiple nodes in a cluster, with each node executing a portion of the training task in the model training task. However, a node in the cluster may fail and restart during training. When a node is restarted and scheduled, the checkpoint file cached in the node's local memory will be lost. This forces the node to inefficiently load the checkpoint file from storage when resuming the training task, resulting in significant overhead for model training and increasing its duration.

[0061] It should be understood that each of the aforementioned nodes runs its own model training and processing processes. Assuming that a first node among these nodes runs both the model training and processing processes (i.e., the model training and processing processes are processes within the first node of the cluster), and that the cluster also includes at least one second node, both of which are used to execute model training tasks, the second node can be understood as any node in the cluster other than the first node. Based on this, the method may further include: when the checkpoint file is cached in the allocated memory space, the processing process in the first node backs up the checkpoint file in the memory space to at least one second node in the cluster. Backing up the checkpoint file in the memory space to the second node means backing up the checkpoint file to the memory space of the second node. In practical applications, the checkpoint file can be transferred to the second node via a high-performance network channel between the nodes in the cluster for backup, which improves the efficiency of checkpoint file backup.

[0062] It should be understood that any first node in the aforementioned cluster can perform the operation of backing up the checkpoint file to the second node. This allows the backed-up checkpoint file to be restored to the local memory from other nodes in the cluster if any node in the cluster fails and the checkpoint file cached in the local memory is lost. Compared to loading the checkpoint file directly from the storage medium, reading the checkpoint file from other nodes in the cluster is more efficient, thereby reducing the overhead of reading the checkpoint file and reducing the time consumed by model training.

[0063] In practical applications, if the cluster includes one second node, the checkpoint file can be directly backed up to that single second node; if the cluster includes at least two second nodes, the checkpoint file can be backed up to one or more of the at least two second nodes; to ensure that each node in the cluster knows which other node the checkpoint file has been backed up to, in one possible implementation, backing up the checkpoint file in memory space to at least one second node can include:

[0064] Based on a pre-defined list of training task nodes, a target second node is determined from at least two second nodes. The training task node list records the node numbers of the first and second nodes. The node number of the target second node follows the node number of the first node. The checkpoint file in memory is backed up to the target second node. This method allows each node in the cluster to easily and efficiently know which second node the checkpoint file has been backed up to, facilitating the reading of the backed-up checkpoint file from the second node.

[0065] In this process, after a model training task is created and assigned to multiple nodes that each execute a portion of the training task, a training task node list can be generated to record the node information of the nodes participating in the model training task; that is, the node numbers of the first and second nodes participating in the model training task are known. It should be understood that each node participating in the model training task can be assigned a unique number, and the processing process of each node knows the node number of its own node. When any first node needs to back up the checkpoint file, it can search the training task node list for a target second node whose node number follows that of the first node, and then back up the checkpoint file to that target second node. For example, the target second node may be the one whose node number is adjacent to the node number of the first node, or one separated from the node number of the first node by multiple node numbers; this embodiment of the present disclosure does not limit this.

[0066] Specifically, the checkpoint file in the memory space is backed up to the target second node whose node number is after the first node. In other words, the checkpoint file is transferred to the target second node whose node number is after the first node and stored in the memory space of the target second node.
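One way to picture the selection of the target second node is the short Python sketch below. The function name and the `offset` parameter are hypothetical. Wrapping around to the start of the list when the first node is the last entry is an assumption made here so that every node has a target; this disclosure only states that the target's node number follows that of the first node.

```python
from typing import List


def pick_backup_node(node_list: List[int], first_node: int, offset: int = 1) -> int:
    """Return the node number located `offset` positions after `first_node`.

    offset=1 selects the adjacent node; a larger offset selects a node
    separated from the first node by several node numbers.
    """
    if len(node_list) < 2:
        raise ValueError("a backup needs at least one second node")
    if offset % len(node_list) == 0:
        raise ValueError("offset would select the first node itself")
    index = node_list.index(first_node)
    # Assumed wrap-around: the last node backs up to the start of the list.
    return node_list[(index + offset) % len(node_list)]
```

Because every node evaluates the same rule on the same list, any node can work out where a given checkpoint file was backed up without extra coordination.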

[0067] As mentioned above, in order to prevent errors in the persistence of each checkpoint file, the processing process uses a processing queue to persist the checkpoint files to the storage medium. At the same time, in order to prevent errors in the backup of the checkpoint files, a processing queue can be used to back up the checkpoint files to the target second node. It should be understood that the processing queue for backing up the checkpoint files and the processing queue for persisting the checkpoint files mentioned above can be the same queue or different queues.

[0068] Based on this, when the queue processing thread of the processing process receives the filename of the checkpoint file from the processing queue, it can also perform the aforementioned process of backing up the checkpoint file in memory to the target second node in the cluster. This enables asynchronous backup of the checkpoint file, and if an error occurs during the backup process, such as a network anomaly preventing the checkpoint file from being successfully transferred to the target second node, the processing process can add the checkpoint file back to the processing queue to re-execute the backup.

[0069] According to the embodiments of this disclosure, by backing up the checkpoint file to the target second node in the cluster, that is, backing up the checkpoint file to other nodes in the cluster, the processing of the checkpoint file has fault tolerance capability. That is, when any node in the cluster fails and the checkpoint file cached in the local memory is lost, the backed-up checkpoint file can be restored from other nodes in the cluster to the local memory. Compared with loading the checkpoint file directly from the storage medium, reading the checkpoint file from other nodes in the cluster is more efficient, thereby reducing the reading overhead of the checkpoint file and reducing the time consumed by model training.

[0070] To facilitate subsequent model training processes reading the saved checkpoint files, in one possible implementation, the method further includes: when the checkpoint files have been cached in the allocated memory space, the model training process stores the description information of the checkpoint files in the database. The description information includes file information and the file status of the checkpoint files. The file status is used to indicate at least one of the following: whether the checkpoint files are cached in local memory, whether they are persistently stored in the storage medium, or whether they have been backed up to the second node; after the processing process has completed the persistent storage of the checkpoint files in the memory space into the storage medium, and / or has completed the backup of the checkpoint files to the second node in the cluster, it updates the file status of the checkpoint files recorded in the database.

[0071] In the aforementioned model training process, once the checkpoint file has been cached in the allocated memory space, the descriptive information of the checkpoint file is stored in the database. This can be understood as the model training process inserting the descriptive information of the checkpoint file into a database table after completing the caching of the checkpoint file in the allocated memory space. Specifically, a record is inserted into the database table, which can record information such as the checkpoint file's filename, file size, and file status. It should be understood that when the descriptive information is first inserted into the database table, the file status of the checkpoint file may only include that it is cached in local memory, and the database may also record the file descriptor corresponding to the memory space where the checkpoint file is cached.

[0072] When the processing process completes the persistence of the checkpoint file in memory to the storage medium, it can update the file status of the checkpoint file recorded in the database to indicate that the file has been persisted to the storage medium. It can also record the storage type of the storage medium in the database. When the processing process completes the backup of the checkpoint file to the second node in the cluster, it can update the file status of the checkpoint file recorded in the database to indicate that the file has been backed up to the second node. It can also record the node information of the second node in the database, such as the node information of the target second node to which the backup was made.
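The record insertion and status updates described above can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and function names are assumptions for illustration; this disclosure does not specify a schema or a particular database.

```python
import sqlite3
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE checkpoints (
    filename     TEXT PRIMARY KEY,
    size_bytes   INTEGER NOT NULL,
    in_memory    INTEGER NOT NULL DEFAULT 0,
    persisted    INTEGER NOT NULL DEFAULT 0,
    backed_up    INTEGER NOT NULL DEFAULT 0,
    mem_fd       INTEGER,
    storage_type TEXT,
    backup_node  INTEGER
)
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_cached(conn: sqlite3.Connection, filename: str,
                  size_bytes: int, mem_fd: int) -> None:
    """Model training process: insert the description once the file is cached."""
    conn.execute(
        "INSERT INTO checkpoints (filename, size_bytes, in_memory, mem_fd) "
        "VALUES (?, ?, 1, ?)",
        (filename, size_bytes, mem_fd),
    )
    conn.commit()


def mark_persisted(conn: sqlite3.Connection, filename: str, storage_type: str) -> None:
    """Processing process: persistence to the storage medium has completed."""
    conn.execute(
        "UPDATE checkpoints SET persisted = 1, storage_type = ? WHERE filename = ?",
        (storage_type, filename),
    )
    conn.commit()


def mark_backed_up(conn: sqlite3.Connection, filename: str, backup_node: int) -> None:
    """Processing process: backup to the target second node has completed."""
    conn.execute(
        "UPDATE checkpoints SET backed_up = 1, backup_node = ? WHERE filename = ?",
        (backup_node, filename),
    )
    conn.commit()


def get_status(conn: sqlite3.Connection, filename: str) -> Optional[Tuple]:
    return conn.execute(
        "SELECT in_memory, persisted, backed_up, storage_type, backup_node "
        "FROM checkpoints WHERE filename = ?",
        (filename,),
    ).fetchone()
```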

[0073] Based on the checkpoint file saving method provided in the above checkpoint file processing method, this disclosure also provides a checkpoint file reading method, or loading method. Specifically, the method further includes:

[0074] In step S21, when the model training process needs to read the checkpoint file, it sends a read request to the processing process. The read request includes the filename of the checkpoint file.

[0075] In step S22, the processing process reads the file status of the checkpoint file indicated by the file name in the read request from the database, and performs the corresponding processing according to the read file status, so that the model training process can read the checkpoint file from the local memory or from the storage medium according to the processing result of the processing process.

[0076] In step S21, when the model training process needs to resume training at a certain training stage, it needs to obtain the checkpoint file of that training stage. At this time, the model training process can send a read request, or load request, to the processing process. The read request may contain the filename of the checkpoint file to be read.

[0077] As described above, the database records the file information and file status of checkpoint files. The file status can indicate at least one of the following: whether the checkpoint file is cached in local memory, whether it is persistently stored in a storage medium, or whether it has been backed up to a second node. Therefore, when the processing process receives a read request sent by the model training process, it can read from the database the file status of the checkpoint file indicated by the filename in the read request and then perform different processing depending on that status. Specifically, in step S22, performing processing corresponding to the read file status, so that the model training process reads the checkpoint file from local memory or from the storage medium according to the processing result of the processing process, includes at least one of the following:

[0078] If the file status indicates that the checkpoint file is located in the local memory, the file descriptor of the memory space where the checkpoint file is located is sent to the model training process, so that the model training process can read the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process.

[0079] If the file status indicates that the checkpoint file is not located in the local memory and that the checkpoint file has been backed up to the second node, the checkpoint file backed up in the second node is read into the local memory according to the node information of the second node to which the checkpoint file was backed up. The file descriptor of the memory space into which the checkpoint file has been read is sent to the model training process, so that the model training process can read the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process.

[0080] If the file status indicates that the checkpoint file is not located in the local memory and that the checkpoint file has not been backed up to the second node, the model training process is notified to read the checkpoint file from the storage medium.
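The three cases above amount to a priority order: local memory, then the memory of a second node, then the storage medium. A minimal Python sketch of that decision follows. The `FileStatus` and `ReadResult` types, their field names, and the `fetch_from_peer` callable (assumed to copy the backup into local shared memory and return a file descriptor) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FileStatus:
    in_local_memory: bool
    backed_up: bool
    mem_fd: Optional[int] = None        # descriptor of the local cached copy
    backup_node: Optional[int] = None   # node information of the second node
    storage_type: Optional[str] = None  # e.g. "object" or "file"


@dataclass
class ReadResult:
    source: str                         # "local_memory", "peer_memory" or "storage"
    mem_fd: Optional[int] = None
    storage_type: Optional[str] = None


def handle_read(status: FileStatus,
                fetch_from_peer: Callable[[int], int]) -> ReadResult:
    """Processing-process side of step S22, returning the processing result."""
    if status.in_local_memory:
        # Case 1: send the descriptor of the cached copy.
        return ReadResult("local_memory", mem_fd=status.mem_fd)
    if status.backed_up and status.backup_node is not None:
        # Case 2: pull the backup into local memory, then send its descriptor.
        new_fd = fetch_from_peer(status.backup_node)
        return ReadResult("peer_memory", mem_fd=new_fd)
    # Case 3: notify the training process to read from the storage medium.
    return ReadResult("storage", storage_type=status.storage_type)
```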

[0081] As described above, when the description information of the checkpoint file is inserted into the database table, the file descriptor corresponding to the memory space where the checkpoint file is cached can also be recorded in the database. Therefore, the file descriptor of the memory space where the checkpoint file is located can be read from the database and sent to the model training process. The model training process can then read the checkpoint file from its local memory space based on the file descriptor sent by the processing process; that is, load the checkpoint file from its local memory space. The file descriptor sent by the processing process to the model training process can be a processing result.

[0082] If the file status indicates that the checkpoint file is not located in local memory, the cause may be that the checkpoint file cached in local memory has been lost. In this case, a notification message can be sent to the model training process to instruct it to read the checkpoint file from the storage medium. This notification message can include the storage type of the storage medium containing the checkpoint file, allowing the model training process to call the corresponding storage medium's interface to read the checkpoint file. This notification message can also be a processing result. Reading the checkpoint file from local memory first, whenever it is available there, improves reading efficiency and reduces reading overhead compared to reading the file directly from the storage medium.

[0083] As described above, checkpoint files generated by any node in the cluster can also be backed up to a second node in the cluster, and loading a checkpoint file from the memory space of the second node is more efficient than loading it directly from the storage medium. When a checkpoint file has been backed up to the second node, in addition to updating the file status recorded in the database to indicate the backup, the node information of the second node that holds the backup, such as the node information of the target second node, can be recorded at the same time. Therefore, the processing process can obtain from the database the node information of the second node to which the checkpoint file has been backed up and then, based on the obtained node information, remotely read the checkpoint file from the second node into the local memory space via a high-performance network channel. It should be understood that the memory space used at this time can be the memory space where the checkpoint file was previously cached, or it can be other memory space in the local memory that can be shared with the model training process; this embodiment of the present disclosure does not limit this. Then, the processing process can send the file descriptor of the memory space into which the checkpoint file has been read to the model training process, so that the model training process can read the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process.

[0084] It should be understood that if the read file status indicates that the checkpoint file is not located in the local memory and that the checkpoint file has not been backed up to the second node, it means that the checkpoint file has not been backed up to other nodes in the cluster and the checkpoint file in the local memory has also been lost. In this case, the processing process can send a notification message to the model training process to notify the model training process to read the checkpoint file from the storage medium.

[0085] According to embodiments of this disclosure, checkpoint files can be read efficiently from local memory first, which improves the reading efficiency of checkpoint files compared to reading checkpoint files directly from storage media. Furthermore, when checkpoint files are not stored in local memory, backup checkpoint files can be read from other second nodes in the cluster, which improves the reading efficiency of checkpoint files and reduces the additional overhead of reading checkpoint files compared to reading checkpoint files directly from storage media.

[0086] In practical applications, users can incorporate the code library corresponding to the checkpoint file processing method provided in this disclosure into their large-scale model training code. This enables more efficient saving and reading of checkpoint files during training, reduces the time overhead associated with saving and reading checkpoint files, and improves overall training efficiency. Furthermore, because checkpoint access remains efficient, checkpoint files can be saved more frequently, which reduces the loss of training progress caused by unexpected exits of the model training process or by node failures and thereby improves overall model training efficiency. Users can also use cheaper object storage as the persistent storage medium without reducing checkpoint read / write efficiency, thereby lowering model training costs.

[0087] Compared to the prior art that synchronously executes checkpoint file read / write operations and model training tasks, the embodiments of this disclosure, through efficient memory caching and asynchronous persistent storage, can greatly reduce the storage latency and reading latency of checkpoint files, improve the processing efficiency of checkpoint files, and reduce the high additional overhead caused by checkpoint file reading operations.

[0088] Compared to existing technologies that lack fault tolerance for node failures, the embodiments of this disclosure employ cross-node backup of checkpoint files. This mitigates the problem that, when a node is rescheduled and restarted after an unexpected failure, the checkpoint files cached in its local memory are lost, which forces checkpoints to be loaded inefficiently from the storage medium when the training task resumes and incurs high additional training overhead. The method therefore has fault tolerance for failures of non-adjacent nodes and does not affect the reading performance of checkpoint files.

[0089] Compared to existing technologies that primarily support costly file storage, the embodiments of this disclosure, by interfacing with various storage media, enrich the storage types of storage media and meet various user storage needs. In particular, they can persist checkpoint files to object storage, which can reduce the storage cost of checkpoint files.

[0090] Compared to the memory sharing mechanism used in existing checkpoint file read / write systems, the embodiments of this disclosure use the operating system's memfd mechanism to avoid the model training process and the processing process being heavily dependent on shared memory, thereby maximizing the use of local memory space and avoiding potential competition between the processing process and the model training process for the right to use shared memory.

[0091] It is understood that the model training process in this embodiment can be understood as the client (or front-end), or the model training process is integrated into the client, and the processing process can be understood as the server (or back-end). The checkpoint file processing method of this embodiment is realized through the cooperation of the two ends (that is, the cooperation of the two processes).

[0092] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.

[0093] In addition, this disclosure also provides a checkpoint file processing system, electronic device, computer-readable storage medium, and program, all of which can be used to implement any of the checkpoint file processing methods provided in this disclosure. The corresponding technical solutions and descriptions are described in the corresponding section of the method and will not be repeated here.

[0094] This disclosure also provides a checkpoint file processing system as shown in Figure 2. As shown in Figure 2, the processing system may include:

[0095] The model training process is used to execute the functions performed by the model training process in the checkpoint file processing method of the above-described embodiments of the present disclosure.

[0096] The processing process is used to execute the functions performed by the processing process in the checkpoint file processing method of the above-described embodiments of the present disclosure.

[0097] The storage medium interface calling tool is used to call the interface of the storage medium corresponding to the storage type declared in the training task, so as to persist the checkpoint file into the storage medium.

[0098] Storage media, used for persistently storing checkpoint files, may include object storage and / or file storage;

[0099] The task initiator is used to start the processing process and the model training process;

[0100] The database is used to store and manage the metadata of checkpoint files, such as file information, description information, file descriptor of the memory space where the checkpoint file is located, storage type of the storage medium to which it is stored, and node information of the second node to which it is backed up; the database is deployed in advance during the cluster deployment phase.

[0101] Local memory, also known as the node's local memory, is used to cache checkpoint files.

[0102] The system according to embodiments of this disclosure can save checkpoint files generated during model training to local memory for caching, without blocking subsequent execution of the model training task. Then, the checkpoint files in local memory are asynchronously persisted to storage media such as file storage or object storage via a processing process. Furthermore, the checkpoint files can be backed up to the memory of other second nodes in the cluster via a network channel, so that if any node fails, the backed-up checkpoint files can be retrieved from other nodes to quickly resume model training. Moreover, when the model training process reads checkpoint files, it can prioritize reading from local memory. Thus, through memory caching, asynchronous persistence, and backup, the performance of storage media at each level can be efficiently utilized, effectively improving the read / write efficiency and performance of checkpoint files and reducing the additional overhead incurred when reading checkpoint files.

[0103] Based on the checkpoint file processing method of the above embodiments of this disclosure, the embodiments of this disclosure also provide a checkpoint file processing system, a block diagram of which is shown in Figure 3. As shown in Figure 3, the checkpoint file processing system includes: a model training process 301 and a processing process 302;

[0104] The model training process 301 is used to serialize the checkpoint data generated during the execution of the training task into a checkpoint file, determine the file information of the checkpoint file, encapsulate the file information into a save request, and send the save request to the processing process, wherein the file information includes the file name and the file size;

[0105] The processing process 302 is configured to allocate local memory space for the checkpoint file according to the file information in the save request, and share the memory space allocated for the checkpoint file with the model training process, so that the model training process and the processing process can jointly access the allocated memory space.

[0106] The model training process 301 is used to cache the checkpoint file in the memory space allocated by the processing process;

[0107] The processing process 302 is used to persist the checkpoint file in the memory space to the storage medium when the checkpoint file has been cached in the allocated memory space.
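As a rough illustration of how the model training process 301 might serialize checkpoint data and encapsulate the file information (file name and file size) into a save request, consider the Python sketch below. The use of `pickle` and JSON, the request field names, and the function names are assumptions; this disclosure does not prescribe a serialization or message format.

```python
import json
import pickle
from typing import Any, Dict


def serialize_checkpoint(state: Dict[str, Any]) -> bytes:
    """Serialize checkpoint data generated by the training task into file bytes."""
    return pickle.dumps(state)


def build_save_request(filename: str, blob: bytes) -> str:
    """Encapsulate the file information into a save request message."""
    file_info = {"filename": filename, "size_bytes": len(blob)}
    return json.dumps({"type": "save", "file_info": file_info})


def parse_save_request(message: str) -> Dict[str, Any]:
    """Processing-process side: recover the file information used for allocation."""
    request = json.loads(message)
    if request.get("type") != "save":
        raise ValueError("not a save request")
    return request["file_info"]
```

The request carries only the file name and size, not the checkpoint bytes, so the processing process 302 can allocate memory of the right size before any data is copied.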

[0108] In one possible implementation, sharing the memory space allocated for the checkpoint file with the model training process includes: utilizing the memory file descriptor mechanism of the local operating system to share the memory space allocated for the checkpoint file with the model training process, and sending the file descriptor corresponding to the allocated memory space to the model training process, wherein the file descriptor is used to access the allocated memory space; and wherein the model training process caching the checkpoint file in the memory space allocated by the processing process includes: the model training process caching the checkpoint file in the allocated memory space according to the file descriptor.

[0109] In one possible implementation, the model training process 301 and the processing process 302 are processes of a first node in a cluster; the cluster also includes at least one second node; the first node and the second node are used to perform model training tasks; the processing process in the first node is also used to back up the checkpoint file in the memory space to at least one second node if the checkpoint file has been cached in the allocated memory space.

[0110] In one possible implementation, the cluster includes at least two second nodes, and backing up the checkpoint file in memory space to at least one second node includes: determining a target second node from at least two second nodes according to a preset training task node list; wherein the training task node list is used to record the node numbers of the first node and the second node; the node number of the target second node is located after the node number of the first node; and backing up the checkpoint file in memory space to the target second node.

[0111] In one possible implementation, the step of persisting the checkpoint file in the memory space to the storage medium when the checkpoint file has been cached in the allocated memory space includes: adding the filename of the checkpoint file to the processing queue when the checkpoint file has been cached in the allocated memory space; and persisting the checkpoint file in the memory space to the storage medium when the queue processing thread of the processing process 302 receives the filename of the checkpoint file from the processing queue.

[0112] In one possible implementation, the step of persistently storing the checkpoint file in memory space into a storage medium includes: calling the interface of the storage medium corresponding to the storage type according to the storage type declared in the training task executed by the model training process, and persistently storing the checkpoint file in memory space into the storage medium corresponding to the storage type; wherein, the storage type includes object storage and / or file storage.
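One way to picture the dispatch on storage type is a mapping from the declared type to a backend object with a common `save` interface. Everything named here is hypothetical: the `"file"` and `"object"` keys, the backend classes, and in particular `InMemoryObjectStorage`, which only stands in for a real object storage client.

```python
import os
from typing import Dict, Protocol


class StorageBackend(Protocol):
    def save(self, filename: str, data: bytes) -> None: ...


class FileStorage:
    """File storage backend: writes the checkpoint under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = root

    def save(self, filename: str, data: bytes) -> None:
        os.makedirs(self.root, exist_ok=True)
        with open(os.path.join(self.root, filename), "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes reach the device


class InMemoryObjectStorage:
    """Stand-in for an object storage client (assumption of this sketch)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def save(self, filename: str, data: bytes) -> None:
        self.objects[filename] = data


def persist_checkpoint(storage_type: str,
                       backends: Dict[str, StorageBackend],
                       filename: str,
                       data: bytes) -> None:
    """Call the interface of the backend matching the declared storage type."""
    backend = backends.get(storage_type)
    if backend is None:
        raise ValueError(f"unsupported storage type: {storage_type!r}")
    backend.save(filename, data)
```

Since the paragraph allows "object storage and / or file storage", a task that declares both types could simply call `persist_checkpoint` once per declared type.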

[0113] In one possible implementation, the model training process 301 is further configured to, when the checkpoint file has been cached in the allocated memory space, store the description information of the checkpoint file in a database, the description information including file information and the file status of the checkpoint file, the file status being used to indicate at least one of the following: whether the checkpoint file is cached in local memory, whether it is persistently stored in a storage medium, or whether it has been backed up to a second node; the processing process 302 is further configured to, after completing the persistent storage of the checkpoint file in the memory space into a storage medium, and / or completing the backup of the checkpoint file to a second node in the cluster, update the file status of the checkpoint file recorded in the database.
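The description information and its status updates can be sketched with a small table. The disclosure does not name a database product or schema, so the use of SQLite, the table layout, and all function names below are assumptions made for illustration.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE checkpoint (
    filename    TEXT PRIMARY KEY,
    size        INTEGER NOT NULL,
    in_memory   INTEGER NOT NULL DEFAULT 0,  -- cached in local memory?
    persisted   INTEGER NOT NULL DEFAULT 0,  -- stored in the storage medium?
    backup_node TEXT                         -- second node holding a backup, if any
)
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_cached(conn: sqlite3.Connection, filename: str, size: int) -> None:
    """Training-process side: store the description once the file is cached."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoint (filename, size, in_memory) VALUES (?, ?, 1)",
        (filename, size),
    )
    conn.commit()


def mark_persisted(conn: sqlite3.Connection, filename: str) -> None:
    """Processing-process side: persistence to the storage medium finished."""
    conn.execute("UPDATE checkpoint SET persisted = 1 WHERE filename = ?", (filename,))
    conn.commit()


def mark_backed_up(conn: sqlite3.Connection, filename: str, node: str) -> None:
    """Processing-process side: backup to a second node finished."""
    conn.execute("UPDATE checkpoint SET backup_node = ? WHERE filename = ?", (node, filename))
    conn.commit()


def get_status(conn: sqlite3.Connection, filename: str) -> Optional[dict]:
    row = conn.execute(
        "SELECT in_memory, persisted, backup_node FROM checkpoint WHERE filename = ?",
        (filename,),
    ).fetchone()
    if row is None:
        return None
    return {"in_memory": bool(row[0]), "persisted": bool(row[1]), "backup_node": row[2]}
```

Keeping the three status fields separate lets a later read request distinguish "cached locally", "backed up to a second node", and "persisted only" without touching the checkpoint data itself.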

[0114] In one possible implementation, the model training process 301 is further configured to send a read request to the processing process 302 when it needs to read the checkpoint file, the read request including the filename of the checkpoint file; the processing process 302 is further configured to read, from the database, the file status of the checkpoint file indicated by the filename in the read request, and to perform the processing that corresponds to that file status, so that the model training process can read the checkpoint file from local memory or from the storage medium according to the processing result of the processing process.

[0115] In one possible implementation, the step of performing processing corresponding to the read file status, so that the model training process can read the checkpoint file from local memory or from the storage medium according to the processing result of the processing process, includes: if the read file status indicates that the checkpoint file is located in local memory, sending the file descriptor of the memory space where the checkpoint file is located to the model training process, so that the model training process can read the checkpoint file from the memory space of local memory according to the file descriptor sent by the processing process; if the read file status indicates that the checkpoint file is not located in local memory and that the checkpoint file has been backed up to the second node, reading the backed-up checkpoint file in the second node into local memory according to the node information of the second node to which the checkpoint file is backed up, and sending the file descriptor of the memory space into which the checkpoint file has been read to the model training process, so that the model training process can read the checkpoint file from the memory space of local memory according to the file descriptor sent by the processing process; and if the read file status indicates that the checkpoint file is not located in local memory and that the checkpoint file has not been backed up to the second node, notifying the model training process to read the checkpoint file from the storage medium.
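The three read cases reduce to a small decision function. The sketch below only models the decision, not the data transfer; the `FileStatus` fields and the `ReadAction` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReadAction(Enum):
    SEND_LOCAL_FD = "send the descriptor of the local memory region"
    FETCH_BACKUP_THEN_SEND_FD = "read the backup into local memory, then send its descriptor"
    READ_FROM_STORAGE = "notify the training process to read from the storage medium"


@dataclass
class FileStatus:
    in_local_memory: bool
    backup_node: Optional[str]  # node information of the second node, if backed up


def decide_read_action(status: FileStatus) -> ReadAction:
    """Map the file status read from the database to the processing to perform."""
    if status.in_local_memory:
        return ReadAction.SEND_LOCAL_FD
    if status.backup_node is not None:
        return ReadAction.FETCH_BACKUP_THEN_SEND_FD
    return ReadAction.READ_FROM_STORAGE
```

The order of the checks reflects the cost of each path: local memory first, a peer node's memory second, and the storage medium only as the last resort.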

[0116] According to embodiments of this disclosure, the model training process generates file information for checkpoint files, the processing process allocates local memory space for the checkpoint files, the model training process caches the checkpoint files in the allocated memory space, and the processing process then persists the checkpoint files in the memory space to a storage medium. This achieves local caching and asynchronous persistent storage of checkpoint files without affecting the subsequent execution of training tasks in the model training process. The processing of checkpoint files and the execution of model training tasks thus become asynchronous, which reduces the additional overhead that reading and writing checkpoint files imposes on model training, reduces checkpoint file read / write latency, and improves the processing efficiency and read / write performance of checkpoint files.

[0117] The methods and systems described in this disclosure are technically related to the internal structure of a computer system. They can utilize the computer system's local memory and other hardware structures to improve the computer system's read and write efficiency of checkpoint files, reduce the amount of data transferred between checkpoint files and storage media, and increase hardware processing speed by caching checkpoint files, thereby achieving technical effects that improve the internal performance of the computer system in accordance with natural laws.

[0118] In some embodiments, the system provided in this disclosure may have functions or include processes that can be used to execute the methods described in the above method embodiments. The specific implementation of these methods can be referred to the description in the above method embodiments, and for the sake of brevity, they will not be repeated here.

[0119] This disclosure also proposes a computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the above-described method. The computer-readable storage medium can be volatile or non-volatile.

[0120] This disclosure also proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to execute the above-described method.

[0121] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in a processor of an electronic device, the processor in the electronic device performs the above-described method.

[0122] Electronic devices can be provided as terminals, servers, cluster nodes, or other forms of devices.

[0123] Figure 4 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server, a cluster node, or a terminal device. Referring to Figure 4, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 1922 is configured to execute instructions to perform the methods described above.

[0124] Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input / output (I / O) interface 1958. Electronic device 1900 can operate on an operating system stored in memory 1932, such as Microsoft's server operating system (Windows Server™), Apple's graphical user interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or similar.

[0125] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions that can be executed by a processing component 1922 of an electronic device 1900 to perform the above-described method.

[0126] This disclosure can be a system, method, and / or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of this disclosure.

[0127] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, (but not limited to) electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. The computer-readable storage media used herein are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0128] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.

[0129] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement various aspects of this disclosure.

[0130] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0131] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0132] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0133] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0134] The computer program product can be implemented specifically through hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is specifically embodied in a computer storage medium; in another alternative embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.

[0135] The description of the various embodiments above tends to emphasize the differences between them. For their identical or similar aspects, the embodiments may be referred to one another, and for the sake of brevity, these aspects will not be repeated here.

[0136] Those skilled in the art will understand that, in the methods of the specific implementations described above, the order in which the steps are written does not imply a strict execution order and does not limit the implementation process in any way. The specific execution order of the steps should be determined by their functions and possible internal logic.

[0137] If the technical solution of this application involves personal information, the product using this technical solution has clearly informed the user of the personal information processing rules and obtained the user's voluntary consent before processing the personal information. If the technical solution of this application involves sensitive personal information, the product using this technical solution has obtained the user's separate consent before processing the sensitive personal information, and also meets the requirement of "express consent". For example, at personal information collection devices such as cameras, clear and prominent signs are set up to inform users that they have entered the scope of personal information collection and that personal information will be collected. If an individual voluntarily enters the collection scope, it is deemed that they have agreed to the collection of their personal information; or on the personal information processing device, with clear signs / information informing users of the personal information processing rules, authorization is obtained from the individual through pop-up information or by asking the individual to upload their personal information; wherein, the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.

[0138] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for processing checkpoint files, characterized in that the method includes: the model training process acquires checkpoint files generated during the execution of the training task, determines the file information of the checkpoint files, encapsulates the file information into a save request, and sends the save request to the processing process, wherein the file information includes the file name and file size; the processing process allocates local memory space for the checkpoint file based on the file information in the save request, and shares the memory space allocated for the checkpoint file with the model training process; the model training process caches the checkpoint file in the memory space allocated by the processing process; and if the checkpoint file has been cached in the allocated memory space, the processing process persists the checkpoint file in the memory space to the storage medium.

2. The method according to claim 1, characterized in that the step of sharing the memory space allocated for the checkpoint file with the model training process includes: the processing process utilizes the memory file descriptor mechanism of the local operating system to share the memory space allocated for the checkpoint file with the model training process, and sends the file descriptor corresponding to the allocated memory space to the model training process, wherein the file descriptor is used to access the allocated memory space; and the step of the model training process caching the checkpoint file in the memory space allocated by the processing process includes: the model training process caches the checkpoint file in the allocated memory space according to the file descriptor.

3. The method according to claim 1, characterized in that the model training process and the processing process are processes of the first node in the cluster; the cluster also includes at least one second node; the first and second nodes are used to perform model training tasks; and the method further includes: if the checkpoint file has been cached in the allocated memory space, the processing process in the first node backs up the checkpoint file in the memory space to at least one second node.

4. The method according to claim 3, characterized in that the cluster includes at least two second nodes, and backing up the checkpoint file in memory space to at least one second node includes: determining a target second node from the at least two second nodes according to a preset training task node list, wherein the training task node list is used to record the node numbers of the first node and the second node, and the node number of the target second node is located after the node number of the first node; and backing up the checkpoint file in memory space to the target second node.

5. The method according to claim 1, characterized in that the step of the processing process persisting the checkpoint file in the memory space to the storage medium when the checkpoint file has been cached in the allocated memory space includes: if the checkpoint file has been cached in the allocated memory space, the processing process adds the filename of the checkpoint file to the processing queue; and when the queue processing thread of the processing process receives the filename of the checkpoint file from the processing queue, it persists the checkpoint file in the memory space to the storage medium.

6. The method according to claim 1 or 5, characterized in that the step of persistently storing the checkpoint file in memory space into a storage medium includes: according to the storage type declared in the training task executed by the model training process, calling the interface of the storage medium corresponding to the storage type to persistently store the checkpoint file in the memory space into the storage medium corresponding to the storage type; wherein the storage type includes object storage and / or file storage.

7. The method according to claim 3 or 4, characterized in that the method further includes: when the model training process has cached the checkpoint file in the allocated memory space, the description information of the checkpoint file is stored in the database, wherein the description information includes file information and the file status of the checkpoint file, and the file status is used to indicate at least one of the following: whether the checkpoint file is cached in local memory, whether it is persistently stored in the storage medium, or whether it has been backed up to the second node; and the processing process updates the file status of the checkpoint file recorded in the database after it has completed the persistence of the checkpoint file in the memory space to the storage medium and / or the backup of the checkpoint file to the second node in the cluster.

8. The method according to claim 7, characterized in that the method further includes: when the model training process needs to read the checkpoint file, it sends a read request to the processing process, wherein the read request includes the filename of the checkpoint file; and the processing process reads the file status of the checkpoint file indicated by the file name in the read request from the database, and performs processing corresponding to the read file status, so that the model training process can read the checkpoint file from local memory or from the storage medium according to the processing result of the processing process.

9. The method according to claim 8, characterized in that the step of performing processing corresponding to the read file status, so that the model training process can read the checkpoint file from local memory or from the storage medium according to the processing result of the processing process, includes at least one of the following cases: if the read file status indicates that the checkpoint file is located in the local memory, the file descriptor of the memory space where the checkpoint file is located is sent to the model training process, so that the model training process can read the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process; if the read file status indicates that the checkpoint file is not located in the local memory and indicates that the checkpoint file has been backed up to the second node, the checkpoint file backed up in the second node is read into the local memory according to the node information of the second node to which the checkpoint file is backed up, and the file descriptor of the memory space into which the checkpoint file is read is sent to the model training process, so that the model training process reads the checkpoint file from the memory space of the local memory according to the file descriptor sent by the processing process; and if the read file status indicates that the checkpoint file is not located in the local memory and that the checkpoint file has not been backed up to the second node, the model training process is notified to read the checkpoint file from the storage medium.

10. A checkpoint file processing system, characterized in that the system includes a model training process and a processing process; the model training process is used to obtain checkpoint files generated during the execution of the training task, determine the file information of the checkpoint files, encapsulate the file information into a save request, and send the save request to the processing process, wherein the file information includes the file name and the file size; the processing process is configured to allocate local memory space for the checkpoint file based on the file information in the save request, and share the memory space allocated for the checkpoint file with the model training process; the model training process is used to cache the checkpoint file in the memory space allocated by the processing process; and the processing process is used to persist the checkpoint file in the memory space to the storage medium when the checkpoint file has been cached in the allocated memory space.

11. An electronic device, characterized in that it includes: a processor; and a memory used to store processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 9.

12. A computer-readable storage medium having computer program instructions stored thereon, characterized in that, when the computer program instructions are executed by a processor, they implement the method according to any one of claims 1 to 9.