A data loading method, related device, equipment and readable storage medium

By using a distributed caching system and node monitoring service during model training, the problem of insufficient throughput of centralized storage services is solved, and more efficient data loading and model training are achieved.

CN117290557B · Active · Publication Date: 2026-06-26 · DUXIAOMAN TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: DUXIAOMAN TECH (BEIJING) CO LTD
Filing Date: 2023-09-22
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Insufficient throughput of centralized storage services slows down data loading for model training tasks, potentially causing training tasks to fail.

Method used

By deploying a distributed caching system, training files are stored in distributed storage nodes. The training nodes directly query the cache query service to obtain the required files, reducing the load on the centralized storage service. Abnormal nodes are replaced under the monitoring of the node monitoring service, thus optimizing the use of storage resources.

Benefits of technology

It effectively reduces the workload of centralized storage services, avoids data loading timeouts, and improves the efficiency and success rate of model training.



Abstract

The application provides a data loading method, related devices, equipment and a readable storage medium. The method comprises the following steps: sending a first query message to a data query service, wherein first information in the first query message is used to indicate information of a target storage node where a first training file required by a first model training task is located; receiving a first message sent from the data query service, wherein the first message comprises information of the target storage node; sending a data loading request message to the target storage node, wherein the data loading request message is used to request the first training file; receiving the first training file from the target storage node; and performing model training according to the first training file.

Description

Technical Field

[0001] This application relates to the technical field of data loading, and in particular to a data loading method, related apparatus, device, and computer-readable storage medium.

Background Technology

[0002] With the rapid development of big data technology, neural networks are increasingly being used in our daily lives. For example, in the field of image recognition, neural networks can extract and analyze the features of people in images to perform facial recognition; in the field of data analysis, the data to be processed can be used as input to a neural network, which then processes the input data according to its algorithm to obtain the results.

[0003] Model training is the crucial step in determining the performance of a neural network. The training process generally involves: using training sample data as input to the neural network; processing the input data; and outputting the results. Then, comparing the output with the labels of the training sample data yields a difference function. Based on this difference function, the network structure and parameters are adjusted to ensure the output closely approximates or perfectly matches the labels of the training sample data. As can be seen from this process, model training requires loading a large amount of training sample data.

Summary of the Invention

[0004] This application provides a data loading method, related apparatus, device, and readable storage medium, which solves the problem of insufficient throughput in centralized storage services.

[0005] In a first aspect, embodiments of this application provide a data loading method applied to a first training node. The method includes: sending a first query message to a data query service, wherein first information in the first query message is used to indicate information about the target storage node where a first training file required for a first model training task is located; receiving a first message sent from the data query service, wherein the first message includes information about the target storage node; sending a data loading request message to the target storage node, wherein the data loading request message is used to request the first training file; receiving the first training file from the target storage node; and performing model training based on the first training file.

[0006] In conjunction with the first aspect, in one possible implementation, after receiving the first training file from the target storage node, the method further includes: storing the first training file in memory if the storage space occupied by the first training file is less than or equal to the remaining available storage space of the training node.

[0007] In conjunction with the first aspect, in one possible implementation, after sending a data loading request message to the target storage node, the method further includes: receiving a first subscription message from the target storage node, wherein the first subscription information in the first subscription message is used to instruct the target storage node to send a first prompt message after the model training is completed; and, in the case that the model training is completed, sending a first prompt message to the target storage node, wherein the first prompt information in the first prompt message is used to characterize the completion of model training.

[0008] In a second aspect, embodiments of this application provide a data loading method, comprising: receiving a first query message from a first training node, wherein first information in the first query message is used to indicate information about the target storage node where the first training file required for a first model training task is located; querying information about the target storage node based on the first information; and sending a first message to the first training node, wherein the first message includes information about the target storage node.

[0009] In conjunction with the second aspect, in one possible implementation, the method further includes: receiving a second query message from a node monitoring service, the second query message including information about a storage node that is malfunctioning; querying training files cached by the storage node based on the information about the malfunctioning storage node; and sending identification information of the training files to the node monitoring service; wherein the node monitoring service is used to replace the malfunctioning storage node with a storage node that is functioning normally.

[0010] In a third aspect, embodiments of this application provide a data loading method applied to a storage node. The method includes: receiving a data loading request message from a first training node, the data loading request message being used to request a first training file; and sending the first training file to the first training node.

[0011] In conjunction with the third aspect, in one possible implementation, after receiving the data loading request message from the first training node, the method further includes: sending a first subscription message to the first training node, wherein the first subscription information in the first subscription message is used to instruct the storage node to send a first prompt message after the model training is completed; and receiving the first prompt message from the first training node, wherein the first prompt information in the first prompt message is used to characterize the end of model training.
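The subscription and prompt exchange between a storage node and a training node, as described in the first and third aspects, can be illustrated with a minimal in-process sketch. The class names, message fields (`type`, `notify_on`, `event`), and the `pending` / `finished` bookkeeping are assumptions made for illustration only; the application does not specify message formats, nor what the storage node does after receiving the prompt message.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical storage node that records which training nodes have finished."""
    files: dict[str, bytes]
    pending: set[str] = field(default_factory=set)   # nodes that have not reported yet
    finished: set[str] = field(default_factory=set)  # nodes that sent the prompt message

    def handle_load_request(self, trainer_id: str, file_id: str) -> tuple[bytes, dict]:
        # Return the requested file together with a subscription message asking
        # the training node to send a prompt message once training completes.
        self.pending.add(trainer_id)
        subscription = {"type": "subscribe", "notify_on": "training_complete"}
        return self.files[file_id], subscription

    def handle_prompt(self, trainer_id: str, prompt: dict) -> None:
        # The prompt information characterizes the end of model training.
        if prompt.get("event") == "training_complete":
            self.pending.discard(trainer_id)
            self.finished.add(trainer_id)


@dataclass
class TrainingNode:
    node_id: str

    def run(self, storage: StorageNode, file_id: str) -> int:
        data, subscription = storage.handle_load_request(self.node_id, file_id)
        trained_on = len(data)  # stand-in for actual model training
        if subscription.get("notify_on") == "training_complete":
            storage.handle_prompt(self.node_id, {"event": "training_complete"})
        return trained_on
```

In this sketch the storage node only records the notification; any follow-up action, such as releasing cached files, would be an additional design decision not taken from the text.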

[0012] In a fourth aspect, embodiments of this application provide a data loading apparatus, which includes a first sending unit, a first receiving unit, a second sending unit, a second receiving unit, and a training unit; wherein:

[0013] The first sending unit is used to send a first query message to the data query service. The first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located.

[0014] The first receiving unit is used to receive a first message sent from the data query service, the first message including information about the target storage node;

[0015] The second sending unit is used to send a data loading request message to the target storage node. The data loading request message is used to request the first training file.

[0016] The second receiving unit is used to receive the first training file from the target storage node;

[0017] The training unit is used to train the model based on the first training file.

[0018] In conjunction with the fourth aspect, in one possible implementation, the data loading device further includes:

[0019] A storage unit is used to store the first training file in memory when the storage space occupied by the first training file is less than or equal to the remaining available storage space of the training node.

[0020] In conjunction with the fourth aspect, in one possible implementation, the data loading device further includes a third receiving unit and a third sending unit, wherein:

[0021] The third receiving unit is used to receive a first subscription message from the target storage node. The first subscription information in the first subscription message is used to instruct the target storage node to send a first prompt message after the model training is completed.

[0022] The third sending unit is used to send a first prompt message to the target storage node when the model training is completed. The first prompt information in the first prompt message is used to indicate that the model training is completed.

[0023] In a fifth aspect, embodiments of this application provide a data loading apparatus, which includes a first receiving unit, a storage unit, and a first sending unit; wherein:

[0024] The first receiving unit is configured to receive a first query message from the first training node, wherein the first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located.

[0025] The storage unit is used to query the information of the target storage node based on the first information;

[0026] The first sending unit is used to send a first message to the first training node, the first message including information about the target storage node.

[0027] In conjunction with the fifth aspect, in one possible implementation, the data loading device further includes a second receiving unit, a querying unit, and a second sending unit; wherein:

[0028] The second receiving unit is configured to receive a second query message from the node monitoring service, the second query message including information about storage nodes that are malfunctioning.

[0029] The query unit is used to query the training files cached by the storage node based on the information of the storage node with the malfunction.

[0030] The second sending unit is used to send the identification information of the training files to the node monitoring service; wherein the node monitoring service is used to replace the malfunctioning storage node with a storage node that is working normally.

[0031] In a sixth aspect, embodiments of this application provide a data loading apparatus, which includes a first receiving unit and a first sending unit; wherein:

[0032] The first receiving unit is used to receive a data loading request message from the first training node, the data loading request message being used to request the first training file;

[0033] The first sending unit is used to send the first training file to the first training node.

[0034] In conjunction with the sixth aspect, in one possible implementation, the data loading device further includes a second sending unit and a second receiving unit; wherein:

[0035] The second sending unit is used to send a first subscription message to the first training node. The first subscription information in the first subscription message is used to indicate that a first prompt message should be sent to the storage node after the model training is completed.

[0036] The second receiving unit is used to receive a first prompt message from the first training node, wherein the first prompt information in the first prompt message is used to indicate that the model training has ended.

[0037] In a seventh aspect, embodiments of this application provide a data loading device, including a memory, a communication module, and a processor;

[0038] The memory is used to store program code, and the processor is used to call the program code stored in the memory to execute the data loading method in the first aspect and its various possible implementations, or to execute the data loading method in the second aspect and its various possible implementations, or to execute the data loading method in the third aspect and its various possible implementations.

[0039] In an eighth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data loading method in the first aspect and its various possible implementations, or implements the data loading method in the second aspect and its various possible implementations, or implements the data loading method in the third aspect and its various possible implementations.

[0040] In a ninth aspect, embodiments of this application provide a computer program including instructions that, when executed by a computer, cause a data loading device to execute the processes executed in the first aspect and its various possible implementations, or the processes executed in the second aspect and its various possible implementations, or the processes executed in the third aspect and its various possible implementations.

Description of the Drawings

[0041] The accompanying drawings used in the embodiments of this application are described below.

[0042] Figure 1 is a system architecture diagram of a data loading method provided in an embodiment of this application;

[0043] Figure 2 is a system architecture diagram of another data loading method provided in an embodiment of this application;

[0044] Figure 3 is a system architecture diagram of another data loading method provided in an embodiment of this application;

[0045] Figure 4 is a flowchart of a data loading method provided in an embodiment of this application;

[0046] Figure 5 is a schematic diagram of the structure of a data loading device 50 provided in an embodiment of this application;

[0047] Figure 6 is a schematic diagram of the structure of a data loading device 60 provided in an embodiment of this application;

[0048] Figure 7 is a schematic diagram of the structure of a data loading device 70 provided in an embodiment of this application;

[0049] Figure 8 is a schematic diagram of the structure of a data loading device 80 provided in an embodiment of this application;

[0050] Figure 9 is a schematic diagram of the structure of a data loading device 90 provided in an embodiment of this application;

[0051] Figure 10 is a schematic diagram of the structure of a data loading device 100 provided in an embodiment of this application.

Detailed Implementation

[0052] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. The term "embodiment" as used herein means that a specific feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in different places in the specification does not necessarily indicate the same embodiment, nor is it an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein can be combined with other embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0053] The terms "first," "second," "third," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish different objects and not to describe a particular order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, product, or device that includes a series of steps or units is not limited to the listed steps or units; it may optionally include steps or units that are not listed, or other steps or units inherent to such processes, methods, products, or devices.

[0054] The accompanying drawings show only the portions relevant to this application, not all of them. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts depict operations (or steps) as sequential processes, many of these operations may be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operation is completed, but may also have additional steps not included in the drawings. The process may correspond to a method, function, procedure, subroutine, subprogram, etc.

[0055] The terms "component," "module," "system," "unit," etc., used in this specification refer to computer-related entities, hardware, firmware, combinations of hardware and software, software, or software in execution. For example, a unit can be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, or a program, and a unit may be distributed between two or more computers. Furthermore, these units can be executed from various computer-readable media on which various data structures are stored. Units can communicate, for example, via local and/or remote processes based on signals having one or more data packets (e.g., data from a second unit interacting with another unit in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems via signals).

[0056] With the rapid development of big data technology, neural networks are increasingly being used in our daily lives. For example, in the field of image recognition, neural networks can extract and analyze the features of people in images to perform facial recognition; in the field of data analysis, the data to be processed can be used as input to a neural network, which then processes the input data according to its algorithm to obtain the results.

[0057] Model training is the crucial step in determining the performance of a neural network. The training process generally involves: using training sample data as input to the neural network; processing the input data; and outputting the results. Then, comparing the output with the labels of the training sample data yields a difference function. Based on this difference function, the network structure and parameters are adjusted to ensure the output closely approximates or perfectly matches the labels of the training sample data. As can be seen from this process, model training requires loading a large amount of training sample data.

[0058] As shown in Figure 1, the training files required for model training are stored in a centralized storage service, which can be the Hadoop Distributed File System (HDFS). The training files include the training sample data needed for model training. When a network device needs to execute a training task, it requests the training files from the centralized storage service and uses the training data in those files as input to train the model. While the centralized storage service provides enough capacity to store large amounts of sample data, every network device executing a training task sends its data loading request to the centralized storage service at the start of training to obtain the corresponding training files. Typically, a model training cluster executes multiple model training tasks simultaneously (usually more than a dozen). The more training tasks run concurrently, the more data loading requests are sent to the centralized storage service, and the more data the service must send to the model training cluster. When the data loading requests exceed the capacity of the centralized storage service, its data reading speed drops. As a result, some network devices executing model training tasks take longer to receive their training files (slower loading speed), that is, their data reading time increases. If the data reading time of a network device executing a model training task exceeds the set value, the model training task may fail.

[0059] In some embodiments, the limited throughput of the centralized storage service can be addressed by deploying a distributed caching system. For example, as shown in Figure 2, the system includes a distributed caching system and a centralized storage service. The distributed caching system comprises multiple storage nodes, each storing the training files required for model training. These cached training files can originate from the centralized storage service. During model training, the network device can send a query request to the cache query service; the query request may include identification information for the model training task. Upon receiving the query request, the cache query service uses the identification information to locate the storage node storing the required training files and sends that storage node's information (e.g., its IP address) to the network device. Upon receiving this information, the network device can send a data loading request to the corresponding storage node to obtain the training files needed for the model training task. In this way, the network device executing the model training task sends data loading requests to the storage nodes in the distributed caching system according to the requirements of its own task, instead of sending them directly to the centralized storage service. A single storage node generally receives fewer data loading requests than a centralized storage service would, and sends a smaller volume of training-file data to network devices in parallel. This greatly reduces the throughput pressure on each storage node and effectively prevents the data loading timeouts and model training task failures that occur when the volume of training files sent in parallel exceeds the throughput capacity of the node caching them.

[0060] The system architectures of two data loading methods provided in embodiments of this application were introduced above with reference to Figure 1 and Figure 2. Below, with reference to the accompanying drawings, another data loading method provided in embodiments of this application is described. Please refer to Figure 3, which is a system architecture diagram of a data loading method provided in an embodiment of this application. Figure 3 includes a model training cluster, a distributed storage system, a centralized storage service, a node monitoring service, and a cache query service.

[0061] The model training cluster comprises multiple training nodes, each corresponding to a model training task. The distributed caching system includes multiple distributed caching subsystems, each containing one or more storage nodes. Each distributed caching subsystem is independent and caches all training files required for a single model training task. The centralized storage service stores all training files required for all model training tasks. The node monitoring service monitors the working status of each storage node in the distributed storage system in real time. If an abnormal working status is detected, the node monitoring service allocates a new storage node and caches the training files of the abnormal node in the newly allocated storage node. The cache query service, based on a data loading request sent by a training node in the model training cluster, looks up the storage node that caches the training files that training node requires, and sends the information of that storage node to the training node.
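The replacement of an abnormal storage node by the node monitoring service can be sketched as follows. This is a minimal in-memory illustration, not the application's implementation: the function name, the dictionary-based data structures, the pool of spare nodes, and the choice to re-populate the new node from the centralized storage service are all assumptions.

```python
def replace_abnormal_nodes(node_status, file_to_node, spare_nodes, central_store, node_cache):
    """Replace each abnormal storage node with a spare node and re-cache its files.

    node_status:   {node_id: True if the node is working normally}
    file_to_node:  {file_id: node_id}, the storage node-training file mapping
    spare_nodes:   list of unused node ids (hypothetical allocation pool)
    central_store: {file_id: bytes}, stand-in for the centralized storage service
    node_cache:    {node_id: {file_id: bytes}}, contents of each storage node
    """
    replaced = {}
    for node_id, healthy in list(node_status.items()):
        if healthy:
            continue
        # Ask the mapping which training files the abnormal node was caching.
        lost_files = [f for f, n in file_to_node.items() if n == node_id]
        if not spare_nodes:
            raise RuntimeError("no spare storage node available")
        new_node = spare_nodes.pop(0)
        # Assumption: the new node is filled from the centralized storage service.
        node_cache[new_node] = {f: central_store[f] for f in lost_files}
        for f in lost_files:
            file_to_node[f] = new_node
        node_cache.pop(node_id, None)
        del node_status[node_id]
        node_status[new_node] = True
        replaced[node_id] = new_node
    return replaced
```

Updating the mapping in the same step means a later query for any of those files resolves to the replacement node rather than the failed one.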

[0062] The system architecture of a data loading method provided in this application embodiment was described above with reference to Figure 3. The flow of a data loading method provided in this application embodiment is described below with reference to the accompanying drawings. Please refer to Figure 4, which is a flowchart of a data loading method provided in an embodiment of this application. Figure 4 includes the first training node, which is one of the training nodes in the model training cluster, as well as the centralized storage service, the distributed storage system, the data query service, and the node monitoring service. The specific process is as follows:

[0063] S401: The first training node sends a first query message to the data query service. The first information in the first query message is used to indicate the information of the storage node where the training file required for the first model training task is located.

[0064] Specifically, the model training task performed by the first training node is the first model training task. When performing this task, the first training node needs to obtain the training file so as to train the model based on the training data in the training file. The first training node can send a first query message to the data query service. The first query message includes first information, which instructs the data query service to find the storage node where the first training file required by the first training node is located, so that the first training node can obtain the training file required to train the model from the target storage node.

[0065] S402: The data query service queries the information of the target storage node based on the first information and sends a first message to the first training node, wherein the first message includes the information of the target storage node.

[0066] Specifically, the information of the target storage node can be the physical address of the target storage node, the IP address of the target storage node, or the identification information of the target storage node. This application embodiment does not limit the type of information of the target storage node.

[0067] After receiving the first query message from the first training node, the data query service can parse the first information to obtain the identification information of the first training file. The data query service can then retrieve the information of the storage node corresponding to the identification information of the first training file from the storage node-training file mapping table. This mapping table includes the mapping relationship between the identification information of the training files stored in each storage node of the distributed storage system and the information of that storage node.

[0068] After the data query service finds the information of the storage node (target storage node) that caches the first training file through the storage node-training file mapping table, the data query service can send a first message to the first training node, which includes the information of the target storage node.
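The lookup in steps S401 to S402 amounts to a keyed search in the storage node-training file mapping table. A minimal sketch is given below; the class name, the message field names (`first_information`, `file_id`, `target_storage_node`), and the `not_found` reply are hypothetical, since the application does not define a wire format.

```python
class DataQueryService:
    """Hypothetical data query service backed by an in-memory mapping table."""

    def __init__(self, mapping):
        # mapping: {training file identification -> storage node information}
        self._file_to_node = dict(mapping)

    def handle_query(self, first_query_message):
        # Parse the first information to obtain the training file's identification.
        file_id = first_query_message["first_information"]["file_id"]
        node_info = self._file_to_node.get(file_id)
        if node_info is None:
            # Assumed behaviour: report a miss instead of raising an error.
            return {"status": "not_found", "file_id": file_id}
        # The first message carries the information of the target storage node.
        return {"status": "ok", "target_storage_node": node_info}
```

The node information is shown here as an IP address string when the service is used, but per paragraph [0066] it could equally be a physical address or another identifier.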

[0069] The target storage node is a storage node in the first distributed storage subsystem. This distributed storage system comprises multiple independent subsystems, each containing one or more storage nodes. Each subsystem stores all the training files required for a single model training task. By dividing the distributed storage system into multiple independent subsystems based on the model training task, the problem of one or more storage nodes failing and affecting the working status of other storage nodes in the system, thus preventing those nodes from reading or writing data, can be effectively avoided.

[0070] For example, in an existing distributed storage system, there are 100 storage nodes. The training files stored on these 100 nodes collectively form a resource pool. These 100 storage nodes are interconnected and influence each other. Assuming the tolerance of this distributed storage system is 3, the failure of no more than 3 storage nodes will not affect other storage nodes. However, if the number of failed storage nodes exceeds 3, it may affect the operation of the remaining storage nodes, causing the entire distributed storage system to be unable to read or write training files, thereby affecting the execution of training tasks in the model training cluster. In this embodiment, assuming the model training cluster needs to execute 10 model training tasks, designated as Task 1 to Task 10, these 100 storage nodes can be divided into 10 distributed storage subsystems, designated as Subsystem 1 to Subsystem 10. Each distributed storage subsystem can include 10 storage nodes, and each subsystem is independent and does not affect the others. The training files for Task 1 can be stored entirely in the storage nodes of subsystem 1, the training files for Task 2 can be stored entirely in the storage nodes of subsystem 2, and so on, until the training files for Task 10 are stored entirely in the storage nodes of subsystem 10. In this way, if one or more storage nodes in subsystem 10 fail, it will not affect the working status of the storage nodes in subsystems 1 through 9, and thus will not affect the execution of Tasks 1 through 9.
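The partitioning in this example, 100 storage nodes split into 10 independent subsystems of 10 nodes each, can be sketched as follows. The function names, the node and task naming scheme, and the requirement of an even split are assumptions made for illustration.

```python
def partition_nodes(node_ids, task_ids):
    """Assign an equal, disjoint slice of storage nodes to each training task."""
    if not task_ids:
        raise ValueError("at least one task is required")
    if len(node_ids) % len(task_ids) != 0:
        # Assumption: nodes divide evenly among tasks, as in the 100 / 10 example.
        raise ValueError("nodes must divide evenly among tasks")
    size = len(node_ids) // len(task_ids)
    return {
        task: node_ids[i * size:(i + 1) * size]
        for i, task in enumerate(task_ids)
    }


def affected_tasks(subsystems, failed_nodes):
    """Return the tasks whose subsystem contains at least one failed node."""
    failed = set(failed_nodes)
    return [task for task, nodes in subsystems.items() if failed & set(nodes)]


nodes = [f"node-{i}" for i in range(1, 101)]
tasks = [f"task-{i}" for i in range(1, 11)]
subsystems = partition_nodes(nodes, tasks)
```

With this layout, failures confined to nodes 91 through 100 affect only `task-10`, which mirrors the isolation argument in the paragraph above.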

[0071] S403: The first training node sends a data loading request message to the target storage node based on the information of the target storage node. The data loading request message includes second information representing the first training file.

[0072] Specifically, after receiving the first message from the data query service, the first training node can send a data loading request message to the target storage node based on the target storage node information in the first message. The data loading request message includes second information, which is used to characterize the first training file.

[0073] In some embodiments, after obtaining the information of the target storage node that caches the first training file, the first training node can directly send a data loading request message to the target storage node when it wants to retrieve the first training file to train the model again, without having to repeat steps S401-S402. This improves the efficiency of retrieving the training file. If the first training node cannot retrieve the first training file from the target storage node, the first training node can execute steps S401-S402 to query the storage node that caches the first training file.
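The behaviour described here, reusing the known target storage node and repeating S401 to S402 only when retrieval fails, can be sketched as below. The class and its two injected callables are hypothetical stand-ins for the data query service and for the request to a storage node; `query_count` exists only to make the saving visible.

```python
class TrainingNodeClient:
    """Hypothetical client that remembers where each training file is cached."""

    def __init__(self, query_service, fetch):
        self._query = query_service   # callable: file_id -> storage node info
        self._fetch = fetch           # callable: (node_info, file_id) -> bytes or None
        self._known_nodes = {}        # file_id -> last known target storage node
        self.query_count = 0          # number of S401-S402 round trips performed

    def _lookup(self, file_id):
        self.query_count += 1
        node = self._query(file_id)
        self._known_nodes[file_id] = node
        return node

    def load(self, file_id):
        node = self._known_nodes.get(file_id)
        if node is not None:
            # Skip S401-S402 and go straight to the remembered storage node.
            data = self._fetch(node, file_id)
            if data is not None:
                return data
        # First access, or the remembered node no longer has the file:
        # repeat S401-S402, then request the file again (S403-S404).
        node = self._lookup(file_id)
        data = self._fetch(node, file_id)
        if data is None:
            raise LookupError(f"training file {file_id!r} is not available")
        return data
```

A fetch that returns `None` is treated as "cannot retrieve the file"; how a real storage node signals that condition is not stated in the text.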

[0074] S404: The target storage node searches for the first training file based on the second information and sends a second message to the first training node, the second message including the first training file.

[0075] Specifically, after receiving a data loading request from the first training node, the target storage node can parse the second information in the data loading request to obtain the identification information of the first training file. Then, the target storage node can retrieve the first training file from memory based on the identification information. Next, the target storage node sends a second message to the first training node, which includes the first training file.
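For illustration only, the handling of a data loading request by the target storage node might look like the sketch below. The message classes `DataLoadingRequest` and `SecondMessage`, their field names, and the `StorageNode.handle` method are hypothetical; returning `None` for a file that is not cached is also an assumption, since the embodiments do not describe that case.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DataLoadingRequest:
    second_information: str  # carries the identification information of the training file


@dataclass
class SecondMessage:
    file_id: str
    training_file: Optional[bytes]  # None when the file is not cached (assumption)


class StorageNode:
    def __init__(self, cached_files: Dict[str, bytes]):
        self._memory = dict(cached_files)  # training files cached in memory

    def handle(self, request: DataLoadingRequest) -> SecondMessage:
        # Parse the second information to obtain the file's identification information,
        # then look the file up in memory and wrap it in the second message.
        file_id = request.second_information
        return SecondMessage(file_id, self._memory.get(file_id))


target_node = StorageNode({"train-0001": b"training data"})
reply = target_node.handle(DataLoadingRequest("train-0001"))
```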

[0076] S405: The first training node trains the model based on the training data in the first training file.

[0077] Specifically, after receiving the second message from the target storage node, the first training node can use the training data in the first training file as input to train the model.

[0078] In this way, the first training node does not need to send data loading requests directly to the centralized storage service. Instead, it sends data loading requests to the storage nodes of the distributed storage system according to the needs of its model training task, thereby obtaining the training files required for training the model. This greatly reduces the workload of the centralized storage service. It also avoids the following problem: when the centralized storage service sends training files to multiple training nodes in parallel and the total size of the data files to be sent exceeds its throughput limit, the centralized storage service sends data at a low rate, the training nodes' loading requests time out, and the training nodes ultimately fail to execute their model training tasks.

[0079] In one possible implementation, if the first training node has sufficient memory to store the first training file, it can store the first training file in memory. This way, when the first training node needs to use the first training file to train the model again, it can directly retrieve the first training file from memory without repeating the steps in S401-S404 above, which can greatly reduce data loading time and improve the efficiency of model training.
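A minimal sketch of this local caching decision is given below. The class `LocalFileCache` and its byte-based capacity accounting are hypothetical; the only rule taken from the text is that the file is kept in memory when its size does not exceed the remaining available space.

```python
from typing import Dict, Optional


class LocalFileCache:
    """Keeps a training file in memory only if it fits in the remaining space."""

    def __init__(self, capacity_bytes: int):
        self._capacity = capacity_bytes
        self._used = 0
        self._files: Dict[str, bytes] = {}

    def try_store(self, file_id: str, data: bytes) -> bool:
        if file_id in self._files:
            return True  # already cached locally
        remaining = self._capacity - self._used
        if len(data) <= remaining:
            self._files[file_id] = data
            self._used += len(data)
            return True
        return False  # not enough memory; the file must be reloaded via S401-S404

    def get(self, file_id: str) -> Optional[bytes]:
        return self._files.get(file_id)


cache = LocalFileCache(capacity_bytes=10)
stored_small = cache.try_store("file-1", b"12345678")  # 8 bytes fit in 10
stored_large = cache.try_store("file-2", b"12345")     # 5 bytes exceed the 2 remaining
```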

[0080] S406: The node monitoring service periodically monitors the working status of all storage nodes in the distributed storage system.

[0081] Optionally, the node monitoring service can periodically monitor the working status of all storage nodes in the distributed storage system. The node monitoring service can periodically broadcast monitoring messages to all storage nodes in the distributed storage system; the monitoring information in these messages instructs the storage nodes to report their working status. After receiving a monitoring message broadcast by the node monitoring service, a storage node can send a feedback message to the node monitoring service. The working status information in the feedback message characterizes the current working status of the storage node. The node monitoring service treats a storage node as having an abnormal working status if its working status information is abnormal or if it fails to send a feedback message before the timeout.
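The classification rule at the end of this step can be sketched as follows. The function name `find_abnormal_nodes`, the `"normal"` status string, and the representation of feedback as `(status, received_at)` pairs are assumptions made for illustration; the embodiments do not specify a message format.

```python
from typing import Dict, Iterable, Set, Tuple


def find_abnormal_nodes(all_nodes: Iterable[str],
                        feedback: Dict[str, Tuple[str, float]],
                        sent_at: float,
                        timeout_s: float) -> Set[str]:
    """Return nodes that reported an abnormal status or did not answer in time.

    feedback maps a node id to (reported status, time the feedback arrived).
    """
    abnormal = set()
    for node in all_nodes:
        reply = feedback.get(node)
        if reply is None:
            abnormal.add(node)  # no feedback message at all
            continue
        status, received_at = reply
        if received_at - sent_at > timeout_s or status != "normal":
            abnormal.add(node)  # late feedback or abnormal status information
    return abnormal


feedback = {
    "node-1": ("normal", 101.0),
    "node-2": ("abnormal", 101.5),
    "node-3": ("normal", 107.0),  # arrives after the 5-second timeout
}
result = find_abnormal_nodes(["node-1", "node-2", "node-3", "node-4"],
                             feedback, sent_at=100.0, timeout_s=5.0)
print(sorted(result))  # prints ['node-2', 'node-3', 'node-4']
```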

[0082] S407: If a storage node with an abnormal working status is detected, the node monitoring service will replace the storage node with a storage node with a normal working status.

[0083] Specifically, when the node monitoring service detects a storage node with an abnormal working status, it can send a second query message to the data query service, which includes information about the abnormal storage node. Upon receiving the second query message from the node monitoring service, the data query service queries the training files cached by the abnormal storage node and sends the identification information of those training files to the node monitoring service. After receiving the identification information of the training files (the first identification information) from the data query service, the node monitoring service can send a first request message, which includes the first identification information, to the centralized storage service. Upon receiving the first request message from the node monitoring service, the centralized storage service can find the training files corresponding to the first identification information and send them to the node monitoring service. After receiving the training files from the centralized storage service, the node monitoring service can allocate N new storage nodes, where N is the number of abnormal storage nodes. These N new storage nodes correspond one-to-one with the N abnormal storage nodes. Then, in each newly allocated storage node, the node monitoring service caches the same training files that the corresponding abnormal storage node had cached. Next, the node monitoring service can send a first update message to the data query service. This first update message includes the mapping information between the newly allocated storage nodes and the training files they cache. After receiving the first update message from the node monitoring service, the data query service can store the mapping information in the node-training file mapping table.
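The replacement procedure of S407 can be sketched as below, with plain dictionaries standing in for the services. All names (`replace_abnormal_nodes`, `node_file_table`, `central_storage`, `node_memory`) are hypothetical. Removing the abnormal node's entry from the mapping table is an assumption of this sketch; the text only states that the new mapping is stored.

```python
from typing import Dict, List


def replace_abnormal_nodes(abnormal: List[str],
                           spare_nodes: List[str],
                           node_file_table: Dict[str, List[str]],
                           central_storage: Dict[str, bytes],
                           node_memory: Dict[str, Dict[str, bytes]]) -> Dict[str, str]:
    """Pair each abnormal node with one new node and rebuild its cache."""
    if len(spare_nodes) < len(abnormal):
        raise RuntimeError("not enough spare storage nodes")
    replacements = {}
    for bad, new in zip(abnormal, spare_nodes):
        # Second query message: which training files did the abnormal node cache?
        # (Dropping the old entry here is an assumption of this sketch.)
        file_ids = node_file_table.pop(bad, [])
        # First request message: fetch those files from the centralized storage
        # service and cache them in the newly allocated storage node.
        node_memory[new] = {fid: central_storage[fid] for fid in file_ids}
        # First update message: record the new node-to-file mapping.
        node_file_table[new] = list(file_ids)
        replacements[bad] = new
    return replacements


central = {"f1": b"a", "f2": b"b", "f3": b"c"}
table = {"node-1": ["f1", "f2"], "node-2": ["f3"]}
memory = {"node-1": {"f1": b"a", "f2": b"b"}, "node-2": {"f3": b"c"}}
replacements = replace_abnormal_nodes(["node-1"], ["node-9"], table, central, memory)
```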

[0084] In some embodiments, after receiving a data loading request message from a training node, the storage node can send a first subscription message to the training node. The first subscription information in the first subscription message instructs the training node to send a first prompt message to the storage node after the model training task is completed. The first prompt information in the first prompt message indicates that the current model training task has ended. After receiving the first prompt message, the storage node can delete the training files that were cached for that model training task, thereby saving memory on the storage node.
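A minimal sketch of this subscribe-then-release behavior follows. The class `SubscribingStorageNode`, its method names, and the use of a task identifier to group files are hypothetical. The sketch also assumes each cached file is used by a single training task, so that deleting it on completion is safe.

```python
from typing import Dict, Set


class SubscribingStorageNode:
    def __init__(self, files: Dict[str, bytes]):
        self._memory = dict(files)
        self._files_by_task: Dict[str, Set[str]] = {}

    def on_data_loading_request(self, task_id: str, file_id: str) -> bytes:
        # Serving the request also registers the subscription: the training node
        # is expected to report when this task's model training has finished.
        self._files_by_task.setdefault(task_id, set()).add(file_id)
        return self._memory[file_id]

    def on_training_finished(self, task_id: str) -> None:
        # Training-finished message received: free the files cached for that task.
        for file_id in self._files_by_task.pop(task_id, set()):
            self._memory.pop(file_id, None)

    def cached_files(self) -> Set[str]:
        return set(self._memory)


node = SubscribingStorageNode({"f1": b"a", "f2": b"b"})
node.on_data_loading_request("task-1", "f1")
node.on_training_finished("task-1")
```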

[0085] It should be understood that S401-S407 in Figure 4 form a flow chart of a data loading method provided in the embodiments of this application. The execution order of S401-S407 can be adjusted, and any step in S401-S407 can be deleted. Adjusting the execution order of the steps in the Figure 4 embodiment and/or deleting steps from it can yield different embodiments, and the resulting embodiments still fall within the protection scope of the embodiments of this application.

[0086] The methods of the embodiments of this application have been described in detail above. The related devices, equipment, computer-readable storage media, and computer programs of the embodiments of this application are described below.

[0087] Please refer to Figure 5, which is a schematic diagram of the structure of a data loading device 50 provided in an embodiment of this application. The data loading device 50 may include a first sending unit 501, a first receiving unit 502, a second sending unit 503, a second receiving unit 504, and a training unit 505; wherein:

[0088] The first sending unit 501 is used to send a first query message to the data query service. The first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located.

[0089] The first receiving unit 502 is used to receive a first message sent from the data query service, the first message including information about the target storage node;

[0090] The second sending unit 503 is used to send a data loading request message to the target storage node. The data loading request message is used to request the first training file.

[0091] The second receiving unit 504 is used to receive the first training file from the target storage node;

[0092] Training unit 505 is used to train the model based on the first training file.

[0093] In one possible implementation, the data loading device 50 further includes:

[0094] A storage unit is used to store the first training file in memory when the storage space occupied by the first training file is less than or equal to the remaining available storage space of the training node.

[0095] In one possible implementation, the data loading device 50 further includes a third receiving unit and a third sending unit, wherein:

[0096] The third receiving unit is used to receive a first subscription message from the target storage node. The first subscription information in the first subscription message is used to indicate that a first prompt message should be sent to the target storage node after the model training is completed.

[0097] The third sending unit is used to send a first prompt message to the target storage node when the model training is completed. The first prompt information in the first prompt message is used to indicate that the model training is completed.

[0098] Please refer to Figure 6, which is a schematic diagram of the structure of a data loading device 60 provided in an embodiment of this application. The device 60 includes a first receiving unit 601, a storage unit 602, and a first sending unit 603; wherein:

[0099] The first receiving unit 601 is used to receive a first query message from the first training node, wherein the first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located.

[0100] Storage unit 602 is used to query information of the target storage node based on the first information;

[0101] The first sending unit 603 is used to send a first message to the first training node, the first message including information about the target storage node.

[0102] In one possible implementation, the data loading device 60 further includes a second receiving unit, a querying unit, and a second sending unit; wherein:

[0103] The second receiving unit is configured to receive a second query message from the node monitoring service, the second query message including information about storage nodes that are malfunctioning.

[0104] The query unit is used to query the training files cached by the malfunctioning storage node based on the information about that storage node.

[0105] The second sending unit is used to send the identification information of the training files to the node monitoring service; wherein the node monitoring service is used to replace the storage node with an abnormal working status with a storage node with a normal working status.

[0106] Please refer to Figure 7, which is a schematic diagram of the structure of a data loading device 70 provided in an embodiment of this application. The device 70 includes a first receiving unit 701 and a first sending unit 702; wherein:

[0107] The first receiving unit 701 is used to receive a data loading request message from the first training node, the data loading request message being used to request the first training file;

[0108] The first sending unit 702 is used to send the first training file to the first training node.

[0109] In one possible implementation, the data loading device 70 further includes a second sending unit and a second receiving unit; wherein:

[0110] The second sending unit is used to send a first subscription message to the first training node. The first subscription information in the first subscription message is used to indicate that a first prompt message should be sent to the storage node after the model training is completed.

[0111] The second receiving unit is used to receive a first prompt message from the first training node, wherein the first prompt information in the first prompt message is used to indicate that the model training has ended.

[0112] Please refer to Figure 8, which is a schematic diagram of the structure of a data loading device 80 provided in an embodiment of this application. The data loading device 80 may include a memory 801, a communication module 802, and a processor 803; wherein, the detailed description of each unit is as follows:

[0113] Memory 801 is used to store program code.

[0114] Processor 803 is used to call program code stored in memory to perform the following steps:

[0115] The communication module 802 sends a first query message to the data query service. The first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located.

[0116] The communication module 802 receives a first message sent from the data query service, the first message including information about the target storage node;

[0117] The communication module 802 sends a data loading request message to the target storage node. The data loading request message is used to request the first training file.

[0118] The first training file is received from the target storage node via the communication module 802;

[0119] The model is trained based on the first training file.

[0120] In one possible implementation, processor 803 is further used to call the program code stored in memory to perform the following step:

[0121] If the storage space occupied by the first training file is less than or equal to the remaining available storage space of the training node, the first training file will be stored in memory.

[0122] In one possible implementation, processor 803 is further used to call the program code stored in memory to perform the following steps:

[0123] The communication module 802 receives a first subscription message from the target storage node. The first subscription information in the first subscription message is used to indicate that a first prompt message should be sent to the target storage node after the model training is completed.

[0124] When model training is complete, a first prompt message is sent to the target storage node via the communication module 802. The first prompt message contains information indicating that model training has ended.

[0125] Please refer to Figure 9, which is a schematic diagram of the structure of a data loading device 90 provided in an embodiment of this application. The data loading device 90 may include a memory 901, a communication module 902, and a processor 903; wherein, the detailed description of each unit is as follows:

[0126] Memory 901 is used to store program code.

[0127] Processor 903 is used to call program code stored in memory to perform the following steps:

[0128] The communication module 902 receives a first query message from the first training node. The first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located.

[0129] Based on the first information, query the information of the target storage node;

[0130] The communication module 902 sends a first message to the first training node, which includes information about the target storage node.

[0131] In one possible implementation, processor 903 is further used to call the program code stored in memory to perform the following steps:

[0132] The communication module 902 receives a second query message from the node monitoring service, which includes information about storage nodes that are malfunctioning.

[0133] Based on the information about the malfunctioning storage node, query the training files cached by that storage node;

[0134] The identification information of the training file is sent to the node monitoring service through the communication module 902; the node monitoring service is used to replace the storage nodes with abnormal working status with the storage nodes with normal working status.

[0135] Please refer to Figure 10, which is a schematic diagram of the structure of a data loading device 100 provided in an embodiment of this application. The data loading device 100 may include a memory 1001, a communication module 1002, and a processor 1003; wherein, the detailed description of each unit is as follows:

[0136] Memory 1001 is used to store program code.

[0137] The processor 1003 is used to call the program code stored in memory to perform the following steps:

[0138] The communication module 1002 receives a data loading request message from the first training node, which is used to request the first training file; and sends the first training file to the first training node.

[0139] In one possible implementation, the processor 1003 is further used to call the program code stored in memory to perform the following steps:

[0140] The communication module 1002 sends a first subscription message to the first training node. The first subscription information in the first subscription message is used to indicate that a first prompt message should be sent to the storage node after the model training is completed.

[0141] The communication module 1002 receives a first prompt message from the first training node. The first prompt information in the first prompt message is used to indicate that the model training has ended.

[0142] This application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data loading method described in the above embodiments and their various possible implementations.

[0143] This application provides a computer program including instructions that, when executed by a computer, enable a first training node to execute the process executed by the first training node in the above embodiments and their various possible implementations, or enable a data query service to execute the process executed by the data query service in the above embodiments and their various possible implementations, or enable a storage node to execute the process executed by the storage node in the above embodiments and their various possible implementations, or enable a node monitoring service to execute the process executed by the node monitoring service in the above embodiments and their various possible implementations.

[0144] It should be noted that the memory in the above embodiments can be read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, random access memory (RAM) or other types of dynamic storage devices capable of storing information and instructions, or electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto. The memory can exist independently and be connected to the processor via a bus. The memory can also be integrated with the processor.

[0145] The processor in the above embodiments may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the programs of the above schemes.

[0146] For the foregoing method embodiments, in order to simplify the description, they are all expressed as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0147] In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of the units described above is merely a logical functional division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

[0148] The units described above as separate components may or may not be physically separate. Similarly, the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0149] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The aforementioned integrated unit can be implemented in hardware or as a software functional unit.

[0150] If the aforementioned integrated units are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in software form. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, server, or network device, specifically a processor in the computer device) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, portable hard drive, magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM).

[0151] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A data loading method, characterized in that it is applied to a first training node and comprises: sending a first query message to a data query service, wherein the first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located; receiving a first message sent from the data query service, the first message including information about the target storage node; sending a data loading request message to the target storage node, the data loading request message being used to request the first training file; receiving the first training file from the target storage node; and training the model based on the first training file.

2. The method as described in claim 1, characterized in that, after receiving the first training file from the target storage node, the method further comprises: if the storage space occupied by the first training file is less than or equal to the remaining available storage space of the training node, storing the first training file in memory.

3. The method as described in claim 1, characterized in that, after sending the data loading request message to the target storage node, the method further comprises: receiving a first subscription message from the target storage node, wherein the first subscription information in the first subscription message is used to indicate that a first prompt message should be sent to the target storage node after the model training is completed; and, when the model training is completed, sending a first prompt message to the target storage node, wherein the first prompt information in the first prompt message is used to indicate that the model training is completed.

4. A data loading method, characterized in that it is applied to a data query service and comprises: receiving a first query message from a first training node, wherein the first information in the first query message is used to indicate the information of the target storage node where the first training file required for the first model training task is located; querying the information of the target storage node based on the first information; and sending a first message to the first training node, the first message including information about the target storage node.

5. The method as described in claim 4, characterized in that the method further comprises: receiving a second query message from a node monitoring service, the second query message including information about a storage node that is malfunctioning; querying the training files cached by the malfunctioning storage node based on the information about that storage node; and sending the identification information of the training files to the node monitoring service; wherein the node monitoring service is used to replace storage nodes with an abnormal working status with storage nodes with a normal working status.

6. A data loading device, characterized in that it comprises units for performing the data loading method as described in any one of claims 1-3 or 4-5.

7. A data loading device, characterized in that it comprises a memory and a processor, wherein: the memory is used to store a computer program, the computer program including program instructions; and the processor is used to invoke the program instructions, causing the data loading device to perform the method as described in any one of claims 1-3 or 4-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method as described in any one of claims 1-3 or 4-5.