Data processing method, system, medium, and program product

By executing other tasks in parallel and incrementally transmitting data processing results in a high-performance computing system, the problem of resource waste caused by task data uploading is solved, and resource utilization and task execution efficiency are improved.

CN122240286A · Pending · Publication Date: 2026-06-19 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2024-12-17
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In high-performance computing systems, long upload times for task data can lead to prolonged occupation of allocated computing resources, resulting in resource waste and impacting task execution efficiency.

Method used

During task data transmission, computing nodes are allocated to execute other tasks in parallel, backfilling technology is used to optimize resource utilization, and data processing results are sent through incremental transmission.

Benefits of technology

It reduces the waste of computing resources, improves resource utilization, and achieves the effect of "computing and transmitting simultaneously", thereby improving task execution efficiency.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to the field of computer technology and discloses a data processing method, system, medium, and program product. In this method, the data processing system first allocates corresponding computing resources for a first task. It then determines the transmission time required for the task data of the first task to be transmitted to the data processing system. During the transmission of the task data of the first task, a second task is executed on the computing nodes allocated to the first task; the execution time of the second task is less than the transmission time. Once the transmission of the task data of the first task and the execution of the second task are both complete, the first task is executed on the computing nodes allocated to the first task. Thus, even if the transmission of the task data of the first task takes a significant amount of time, the computing nodes allocated to the first task can execute other tasks with shorter execution times during that transmission, thereby reducing the waste of computing resources and improving resource utilization.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a data processing method, system, medium, and program product.

Background Technology

[0002] High-performance computing (HPC) is a technology that uses multiple processors of the same computer or multiple computers in a cluster to perform complex computational processing.

[0003] Currently, when processing data based on HPC technology, the HPC system typically acquires one or more tasks to be processed and their corresponding task data uploaded by the user through a client, and stores the task data in the storage device corresponding to the HPC system. Then, based on the tasks to be processed, the required computing resources are determined, the necessary computing resources are allocated to the tasks, and the data corresponding to the tasks is processed.

[0004] However, assuming that the necessary computing resources are allocated to the pending tasks before the task data is uploaded to the HPC system's storage device, if the amount of task data is large and uploading the task data takes a long time, the allocated computing resources may be occupied for an extended period, resulting in resource waste.

Summary of the Invention

[0005] In order to solve the above problems, the purpose of this application is to provide a data processing method, system, medium and program product.

[0006] The first aspect of this application provides a data processing method applied to a data processing system. The method includes: allocating computing nodes in the data processing system for executing a first task; determining a first duration for transmitting task data to the data processing system based on the amount of task data of the first task; receiving task data in parallel and executing a second task on the computing nodes, wherein the execution duration of the second task on the computing nodes is less than the first duration; and executing the first task on the computing nodes based on the completion of the second task and the completion of the transmission of task data.

[0007] In this embodiment, the data processing system first determines the computing resources required for the first task and allocates corresponding computing resources (e.g., at least one computing node) to the first task. Then, based on the amount of task data (e.g., program instructions, data to be processed, etc.) of the first task, it determines the transmission time required for the task data of the first task to be transmitted to the data processing system (i.e., the resource availability time of the computing resources corresponding to the first task). During the transmission of the task data of the first task, at least one other task (e.g., a second task) is executed on the computing node allocated to execute the first task, and the execution time of the at least one other task is less than the transmission time. Once the task data corresponding to the first task has been transmitted to the data processing system and the computing node allocated to execute the first task has completed the other tasks, the first task is executed on that computing node, and the data processing result corresponding to the first task is obtained.

[0008] In this way, even if the data transmission of the first task to the data processing system takes a significant amount of time, the computing nodes allocated to execute the first task can still perform other tasks with shorter execution times during the data transmission process. This reduces the waste of computing resources and improves resource utilization. Furthermore, since the execution time of other tasks is shorter than the transmission time, the first task does not need to wait for computing resources after the data transmission is complete.
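
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `receive_task_data`, `run_second_task`, `run_first_task`, and `process` are hypothetical stand-ins, threads simulate the storage side and the reserved computing nodes, and `time.sleep` simulates the work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def receive_task_data(transfer_seconds: float) -> str:
    """Hypothetical stand-in for receiving the first task's data."""
    time.sleep(transfer_seconds)
    return "task-data"


def run_second_task(run_seconds: float) -> str:
    """Hypothetical stand-in for a short task run on the reserved nodes."""
    time.sleep(run_seconds)
    return "second-task-done"


def run_first_task(task_data: str) -> str:
    """Hypothetical stand-in for the first task, which needs its data."""
    return f"result-of-{task_data}"


def process(transfer_seconds: float, second_task_seconds: float) -> str:
    # Admit the second task only if it is shorter than the transfer.
    if second_task_seconds >= transfer_seconds:
        raise ValueError("second task must be shorter than the transfer")
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Receive the task data and run the second task in parallel.
        transfer = pool.submit(receive_task_data, transfer_seconds)
        second = pool.submit(run_second_task, second_task_seconds)
        second.result()                # the reserved nodes are free again
        task_data = transfer.result()  # the task data has fully arrived
    # Both conditions hold, so the first task can start.
    return run_first_task(task_data)
```

For instance, `process(0.05, 0.01)` runs the simulated second task while the simulated transfer is still in progress and then executes the first task.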

[0009] In one possible implementation of the first aspect described above, the first task corresponds to multiple data processing results; and the method further includes: when the multiple data processing results are generated sequentially and the generated data processing results satisfy a time condition or a quantity condition, sending the data processing results generated during the execution of the first task to a first storage device by incremental transmission; wherein the time condition is that the time interval between any two adjacent transmissions of data processing results from the data processing system to the first storage device is the same, and the quantity condition is that the amount of data in the data processing results sent by the data processing system to the first storage device is the same each time.

[0010] In this embodiment of the application, if the multiple data processing results corresponding to the first task are generated sequentially, the data processing system can send the multiple data processing results corresponding to the first task to the first storage device (e.g., external storage) in multiple increments.

[0011] In this way, if the data processing result is large, the data processing result can be transmitted to the external storage device at the same time during the execution of the first task, so as to achieve the effect of "compute and transmit at the same time" without having to wait for the first task to be completed and then spend a lot of time transmitting the data processing result.
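
One way to picture incremental transmission under a quantity condition or a time condition is the following Python sketch. The class `IncrementalSender`, its parameters, and the explicit `now` timestamp are hypothetical names introduced for illustration; the text does not specify how the conditions are checked, so the buffering logic here is an assumption.

```python
from typing import Callable, Optional


class IncrementalSender:
    """Buffers results and sends them in increments (illustrative only)."""

    def __init__(self, send: Callable[[bytes], None],
                 chunk_bytes: Optional[int] = None,
                 interval_seconds: Optional[float] = None) -> None:
        self._send = send
        self._chunk_bytes = chunk_bytes    # quantity condition, if set
        self._interval = interval_seconds  # time condition, if set
        self._buffer = bytearray()
        self._last_sent_at = 0.0

    def add_result(self, result: bytes, now: float) -> None:
        """Record a newly generated result at time `now` (in seconds)."""
        self._buffer.extend(result)
        # Quantity condition: every increment carries the same amount of data.
        if self._chunk_bytes is not None:
            while len(self._buffer) >= self._chunk_bytes:
                self._send(bytes(self._buffer[:self._chunk_bytes]))
                del self._buffer[:self._chunk_bytes]
                self._last_sent_at = now
        # Time condition: send what has accumulated once the interval elapsed.
        if (self._interval is not None and self._buffer
                and now - self._last_sent_at >= self._interval):
            self._send(bytes(self._buffer))
            self._buffer.clear()
            self._last_sent_at = now

    def flush(self) -> None:
        """Send whatever remains when the first task finishes."""
        if self._buffer:
            self._send(bytes(self._buffer))
            self._buffer.clear()
```

Passing the clock value in as `now` keeps the sketch deterministic; a real system would more likely rely on a timer, and the results would go to external storage rather than to a callback.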

[0012] In one possible implementation of the first aspect described above, the task data includes at least one of the following: program instructions corresponding to the first task, and data to be processed for the first task.

[0013] In one possible implementation of the first aspect described above, determining the first duration of task data transmission to the data processing system based on the data volume of the task data of the first task includes: acquiring the data volume of the task data and the transmission rate of the task data; and calculating the first duration based on the data volume of the task data and the transmission rate of the task data.

[0014] In one possible implementation of the first aspect described above, calculating the first duration based on the amount of task data and the transmission rate of task data includes: taking the product of the amount of task data and the transmission rate of task data as the first duration; or determining a preset ratio coefficient and taking the product of the preset ratio coefficient, the amount of task data, and the transmission rate of task data as the first duration.
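
The calculation can be sketched as follows. The text describes the first duration as a product, which is dimensionally consistent if the transmission rate is expressed as time per unit of data; that reading is an assumption, as are the function and parameter names. A second helper shows the equivalent form when the rate is instead given as a throughput (data units per second).

```python
def first_duration(data_volume: float, time_per_unit: float,
                   ratio: float = 1.0) -> float:
    """Product form: ratio x data volume x rate (time per unit of data).

    `ratio` plays the role of the preset ratio coefficient; 1.0 means
    no coefficient is applied.
    """
    if data_volume < 0 or time_per_unit < 0 or ratio <= 0:
        raise ValueError("inputs must be non-negative and ratio positive")
    return ratio * data_volume * time_per_unit


def first_duration_from_throughput(data_volume: float, throughput: float,
                                   ratio: float = 1.0) -> float:
    """Same estimate when the rate is a throughput (data units per second)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return first_duration(data_volume, 1.0 / throughput, ratio)
```

For example, 200 units of data at 0.5 seconds per unit give `first_duration(200, 0.5) == 100.0`; a ratio coefficient above 1 lengthens the estimate, and one below 1 shortens it.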

[0015] In one possible implementation of the first aspect described above, the second task is performed on the computing node by allocating the computing node to the second task using a backfilling technique and then performing the second task on the computing node.

[0016] In this embodiment, the data processing system may execute the second task on a computing node allocated for executing the first task in several ways. It can execute a single second task on that computing node, with the execution time of that second task being less than the first duration. Alternatively, it can execute multiple second tasks serially on the same computing node, with the execution time of these multiple second tasks being less than the first duration. Alternatively, it can execute multiple second tasks in parallel on the same computing node, with the execution time of these multiple second tasks being less than the first duration.
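
The three cases can be expressed as one admission check. This is a sketch under assumptions: the function name is hypothetical, serial tasks are assumed to occupy the node for the sum of their run times, and parallel tasks are assumed to start together so that the node is occupied for the longest run time.

```python
from typing import Sequence


def fits_before_transfer_ends(run_times: Sequence[float],
                              first_duration: float,
                              parallel: bool = False) -> bool:
    """Return True if the second task(s) finish before the transfer does."""
    if not run_times:
        return False
    # Serial: run times add up. Parallel (assumed to start together):
    # the node is busy for as long as the longest task runs.
    occupied = max(run_times) if parallel else sum(run_times)
    return occupied < first_duration
```

A single second task is the case of a one-element `run_times`. With a first duration of 60, tasks of 30 and 40 fit in parallel but not serially.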

[0017] A second aspect of this application provides a data processing system, comprising: a management node for allocating computing nodes within the data processing system to execute a first task; the management node further for determining a first duration for transmitting task data to the data processing system based on the amount of task data in the first task; a storage node for receiving task data; and computing nodes for executing a second task, wherein the execution duration of the second task on the computing nodes is less than the first duration, and wherein the execution of the second task by the computing nodes and the reception of task data by the storage nodes are performed in parallel; the computing nodes further for executing the first task upon completion of the second task and completion of the transmission of task data.

[0018] In one possible implementation of the second aspect described above, the storage node is further configured to, when multiple data processing results are generated sequentially and the generated data processing results satisfy a time condition or a quantity condition, send the data processing results generated during the execution of the first task to the first storage device by incremental transmission; wherein the time condition is that the time interval between any two adjacent transmissions of data processing results from the data processing system to the first storage device is the same, and the quantity condition is that the amount of data in the data processing results sent by the data processing system to the first storage device is the same each time.

[0019] A third aspect of this application provides a data processing system, comprising: a memory for storing instructions to be executed by one or more processing units of the data processing system; and a processing unit, one of the processing units of the data processing system, for executing the instructions stored in the memory to implement any of the methods of the first aspect described above.

[0020] A fourth aspect of this application provides a computer-readable storage medium storing instructions that, when executed on a model deployment device, cause the model deployment device to implement any of the methods described in the first aspect above.

[0021] The fifth aspect of this application provides a computer program product including instructions that, when executed on a model deployment device, cause the model deployment device to implement any of the methods described in the first aspect.

Attached Figure Description

[0022] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0023] Figure 1 shows a schematic diagram of the structure of an HPC system according to an embodiment of this application;

[0024] Figure 2 shows a schematic diagram of a high-performance computing scenario according to an embodiment of this application;

[0025] Figure 3 shows a logic block diagram of the process of an HPC system processing a task to be processed according to an embodiment of this application;

[0026] Figure 4A shows a logic block diagram of a data staging (DS) job execution process according to an embodiment of this application;

[0027] Figure 4B shows a resource-time relationship diagram during the execution of a DS job according to an embodiment of this application;

[0028] Figure 5A shows another logic block diagram of a DS job execution process according to an embodiment of this application;

[0029] Figure 5B shows another resource-time relationship diagram during the execution of a DS job according to an embodiment of this application;

[0030] Figure 6 shows a resource-time relationship diagram during the execution of a first task according to an embodiment of this application;

[0031] Figure 7 shows a schematic flowchart of a data processing method according to an embodiment of this application;

[0032] Figure 8 shows a schematic flowchart of another data processing method according to an embodiment of this application;

[0033] Figure 9 shows a structural block diagram of a data processing system 100 according to an embodiment of this application;

[0034] Figure 10 shows a logic block diagram of a data processing method according to an embodiment of this application;

[0035] Figure 11 shows a logic block diagram of the process of performing the data export phase of a DS job in an HPC system 10 according to an embodiment of this application;

[0036] Figure 12 shows an interaction flow diagram of the modules of a data management system 100 during the data storage stage according to an embodiment of this application;

[0037] Figure 13 shows an interaction flow diagram of the modules of a data management system 100 during task execution according to an embodiment of this application;

[0038] Figure 14 shows a schematic diagram of the hardware structure of a data processing system 100 according to an embodiment of this application.

Detailed Implementation

[0039] The illustrative embodiments of this application include, but are not limited to, a data processing method, system, medium, and program product.

[0040] Before introducing the technical solutions involved in the embodiments of this application, some of the terms included in the embodiments of this application will be explained.

[0041] (1) High-performance computing

[0042] High-performance computing (HPC) is a technology that uses multiple processors in a single computer or multiple computers in a cluster to perform complex computational processing. HPC provides extremely high floating-point computing power, which can be used to solve computationally intensive and massive data processing needs. For example, industries such as scientific research, weather forecasting, finance, simulation experiments, biopharmaceuticals, gene sequencing, and image processing utilize HPC to solve large-scale computational problems and meet computational requirements. Using HPC technology to handle large-scale computational problems can effectively shorten the computation time for data processing and improve computational accuracy.

[0043] (2) HPC System

[0044] HPC systems are computer cluster systems, typically deployed as clusters of multiple computers. The computers in an HPC system are usually connected using high-performance network interconnection technologies, such as InfiniBand (IB), remote direct memory access (RDMA) over Converged Ethernet (RoCE), and Myrinet.

[0045] Multiple processes can be run in parallel by multiple computers in an HPC system, or by multiple central processing units (CPUs) or multiple processor cores in a single computer.

[0046] For example, Figure 1 shows a schematic diagram of an HPC system.

[0047] As shown in Figure 1, the HPC system 10 includes a resource pool 110, a management node 120, and so on.

[0048] Resource pool 110 includes computing resources 111, storage resources 112, and network resources 113.

[0049] Computing resources 111 can be provided by a computing cluster within the system. The computing cluster contains at least two computing nodes (Computing Node 1…Computing Node N, where N is an integer greater than 1). Computing nodes are used to execute massive computing tasks. When executing massive computing tasks, a computing node can coordinate the execution of multiple processors across multiple computing nodes, multiple processors on the same computing node, or multiple processor cores within a single processor on the same computing node. Multiple computing nodes, computing nodes and management nodes, and computing nodes and storage nodes can be connected via high-speed networks (such as Ethernet, IB, etc.) for high-speed communication.

[0050] Storage resources 112 can be provided by a storage cluster in the system. The storage cluster contains at least two storage nodes (storage node 1…storage node M, where M is an integer greater than 1). Storage nodes provide storage services, such as storing code for computing nodes to perform computational tasks, task data required for performing computational tasks, and data processing results obtained by the computing nodes from performing computational tasks. A storage node includes one or more controllers, a network interface card (NIC), and multiple hard drives. Hard drives are used to store data. Hard drives can be disks or other types of storage media, such as solid-state drives (SSDs) or shingled magnetic recording (SMR) hard drives. The NIC is used to communicate with the computing nodes. The controller is used to write data to or read data from the hard drives according to read/write data requests sent by the computing nodes. During the read/write process, the controller needs to convert the address carried in the read/write data request into an address that the hard drive can recognize.

[0051] Network resource 113 can be provided by network devices such as switches and routers (e.g., data transfer tools). Network resource 113 is used to support data transfer between compute nodes and storage nodes.

[0052] The management node 120 is used to manage compute nodes and storage nodes, such as monitoring the working status of multiple compute nodes, remotely starting or stopping compute nodes, etc.

[0053] It can be understood that the HPC system shown in Figure 1 is merely an example; in practical applications, an HPC system may include fewer or more modules than illustrated. For instance, HPC system 10 may also include a scheduler, which provides the management node with services for managing and allocating computing resources and provides the computing nodes with services for executing jobs. The scheduler can be implemented in hardware or software. When implemented in hardware, the scheduler can be a device with data processing capabilities, such as a server. When implemented in software, the scheduler can be an application running on HPC system 10.

[0054] Alternatively, an HPC system can also be deployed as a single server, in which case the management node, storage node, and compute node are computing units within the server. For example, the management node and compute node can be implemented by the processor in the server, and the storage node can be implemented by the server's disk.

[0055] (3) Resources

[0056] Resources refer to at least one of the computing resources, storage resources, and network resources required to process a job. Examples include a central processing unit (CPU), a graphics processing unit (GPU), a data processing unit (DPU), a neural processing unit (NPU), memory, and input/output (I/O).

[0057] (4) Job

[0058] A job is a task submitted by a user to a computer system, such as an executable command or script. Jobs typically include computer programs, data, and control information.

[0059] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions in the embodiments of this application will be described clearly and in detail below with reference to the accompanying drawings.

[0060] The data processing method provided in this application can be applied to high-performance computing scenarios. For example, Figure 2 shows a schematic diagram of a high-performance computing scenario according to an embodiment of this application.

[0061] As shown in Figure 2, HPC system 10 and client 20 are connected via a network. Users can submit jobs (e.g., computation tasks) to HPC system 10 through client 20, and upload task data corresponding to submitted jobs to HPC system 10, so as to utilize the computing resources of HPC system 10 to complete the corresponding jobs.

[0062] For HPC system 10, refer to the relevant description of the HPC system 10 shown in Figure 1.

[0063] Client 20 can be a client of a target application, which can be an application that requires a large amount of data computation, such as the Weather Research and Forecasting (WRF) model or the computational fluid dynamics toolbox OpenFOAM.

[0064] Client 20 can also be a mobile phone, tablet computer, laptop computer or other terminal device.

[0065] As mentioned earlier, when performing data processing based on HPC technology, the HPC system typically acquires one or more tasks (e.g., jobs) uploaded by users through clients, along with the corresponding task data, and stores the task data in the storage nodes corresponding to the HPC system. Then, based on the tasks to be processed, the system determines the computing resources required for each task, allocates the necessary computing resources to the task, and processes the data corresponding to the task.

[0066] For example, Figure 3 shows a logic block diagram of the process of an HPC system processing a task. As shown in Figure 3, the user uploads the first task to be executed (e.g., a data staging (DS) job) to the management node of HPC system 10 through the submission machine (e.g., client 20). HPC system 10 then uses the computing node to perform data storage (stageIn), execute the command line (cmd) corresponding to the job (that is, execute the first task), and perform data export (stageOut).

[0067] Data storage involves the computing node notifying the cluster shared storage (e.g., the storage node of HPC system 10) to retrieve the task data corresponding to the first task from the external storage (e.g., the storage device corresponding to client 20); that is, it realizes the data transfer of the task data of the first task from the external storage to the cluster shared storage.

[0068] The compute node executes the command line corresponding to the job, that is, executes the first task, obtains the data processing result of the first task, and sends the data processing result of the first task to the cluster shared storage for storage.

[0069] Data export involves the compute node notifying the cluster shared storage to send the data processing results of the first task to external storage; that is, it realizes the data transfer of the data processing results of the first task from the cluster shared storage to external storage.

[0070] On the one hand, it is assumed that the required computing resources are allocated to the first task before the task data of the first task is transmitted to the cluster shared storage (storage node) of HPC system 10.

[0071] For example, taking the first task as a DS job, Figure 4A shows a logic block diagram of the DS job execution process.

[0072] For the execution details of the management node, compute nodes, and cluster shared storage (storage node) of HPC system 10 during the DS job execution process shown in Figure 4A, refer to the relevant description of Figure 3.

[0073] It can be understood that the required computing resources are already occupied during the data storage phase. If the amount of task data is large, the data transfer may take a long time, potentially causing the computing resources allocated to the first task to be occupied for an extended period without the task being executed, thus resulting in resource waste.

[0074] For example, Figure 4B shows a resource-time relationship diagram during the execution of a DS job.

[0075] As shown in Figure 4B, during the execution of the DS job: in the time interval 0-t1, some resources (e.g., network resources, storage resources) are used for data storage, and some resources are reserved for the computational job. At time t1, data storage is complete. In the time interval t1-t2, the resources reserved for the computational job are used to compute the DS job. At time t2, the computation of the DS job is complete. In the time interval t2-t3, some resources (e.g., network resources, storage resources) are used for data export.

[0076] It can be understood that during the time period 0-t1, the resources reserved for the computation job are occupied but not used, resulting in a waste of these resources.
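
To make the waste concrete, the following sketch computes the idle node-time of the reserved resources and the fraction of the reservation that does useful work. The function names and the sample numbers are hypothetical; only the interval boundaries t1 and t2 come from the text.

```python
def idle_node_seconds(reserved_nodes: int, t1: float) -> float:
    """Node-time that is reserved but unused while data storage runs (0-t1)."""
    return reserved_nodes * t1


def reserved_utilization(t1: float, t2: float) -> float:
    """Fraction of the reservation (0-t2) spent computing (t1-t2)."""
    if not 0 <= t1 <= t2 or t2 == 0:
        raise ValueError("expected 0 <= t1 <= t2 and t2 > 0")
    return (t2 - t1) / t2
```

With 8 reserved nodes, data storage finishing at t1 = 600 s, and computation finishing at t2 = 2400 s, the reservation idles for 4800 node-seconds and is used for 75% of its lifetime.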

[0077] On the other hand, it is assumed that the required computing resources are allocated to the first task only after the task data of the first task is uploaded to the cluster shared storage (storage node) of HPC system 10.

[0078] For example, still taking the first task as a DS job, Figure 5A shows another logic block diagram of the DS job execution process.

[0079] HPC system 10 uses the compute nodes to perform data storage, execute the command line corresponding to the job (that is, execute the first task), and perform data export. Data storage and the execution of the command line corresponding to the job are each scheduled by the management node. For details on data storage and data export, refer to the relevant description of Figure 3.

[0080] It can be understood that, compared with the execution process shown in Figure 4A, data storage and the execution of the command line corresponding to the job are scheduled separately by the management node. The compute node notifies the management node after completing the data storage, and only begins executing the command line corresponding to the job after being scheduled by the management node.

[0081] It can be understood that during the data storage phase, the required computing resources for the DS job are not allocated. If, after data storage is completed, the idle computing resources in HPC system 10 are insufficient to execute the command line corresponding to the DS job, it is necessary to wait until sufficient computing resources become available in HPC system 10 and are allocated to the DS job before execution of the command line corresponding to the DS job can begin. This may result in the DS job waiting for a relatively long time, thereby affecting its execution efficiency.

[0082] For example, Figure 5B shows another resource-time relationship diagram during the execution of a DS job.

[0083] As shown in Figure 5B, during the execution of job 2 (a DS job): from time 0 to t11, some resources (e.g., network resources, storage resources) are used to perform data storage for job 2. From time 0 to t12, some resources (computing resources) are used to compute jobs 1, 3, and 4. From time t12 to t13, some resources (e.g., computing resources) are used to compute job 2. At time t13, the computation of job 2 is completed. From time t13 to t14, some resources (e.g., network resources, storage resources) are used to perform data export for job 2.

[0084] It can be understood that although the data for job 2 has been stored by time t11, some computing resources in HPC system 10 are being used to execute jobs 3 and 4, leaving insufficient resources for job 2. Therefore, at time t11, job 2 has not yet started execution but is waiting to be allocated computing resources. If the t11-t12 time period (i.e., the waiting time) is long, it will affect the execution efficiency of job 2.
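
The waiting time in this late-allocation scenario can be illustrated with a small calculation. The function name, the node counts, and the assumption that no new jobs are admitted while the job waits are all hypothetical simplifications.

```python
from typing import Sequence, Tuple


def earliest_start(data_ready_at: float, nodes_needed: int, free_nodes: int,
                   releases: Sequence[Tuple[float, int]]) -> float:
    """Earliest time a job can start when nodes are allocated only after
    its data is stored.

    `releases` lists (time, nodes released) for the jobs currently running.
    """
    available = free_nodes
    if available >= nodes_needed:
        return data_ready_at
    for release_time, released in sorted(releases):
        available += released
        if available >= nodes_needed:
            # The job cannot start before its data is ready.
            return max(data_ready_at, release_time)
    raise ValueError("not enough nodes will ever be free")
```

For example, if the data is ready at t11 = 100 but the fourth required node is only released at 250, the job starts at t12 = 250 and waits 150 time units.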

[0085] In view of this, embodiments of this application provide a data processing method applied to high-performance computing scenarios. In this method, the data processing system (e.g., HPC system 10) first determines the computing resources required for a first task (e.g., as shown in Figure 5B) and allocates corresponding computing resources (e.g., at least one computing node) to the first task.

[0086] Then, based on the amount of task data (such as program instructions, data to be processed, etc.) of the first task, the transmission time required for the task data of the first task to be transmitted to the data processing system is determined (i.e., the resource availability time of the computing resources corresponding to the first task).

[0087] In parallel with receiving the task data of the first task, the system executes other tasks (e.g., a second task) on the computing nodes allocated to execute the first task. The execution time of these other tasks on those computing nodes is less than the transmission time. In other words, during the transmission of the task data of the first task, at least one other task (e.g., a second task) is executed on the computing nodes allocated to execute the first task, and the execution time of this at least one other task is less than the transmission time.

[0088] Once the task data corresponding to the first task has been transmitted to the data processing system and the computing nodes allocated to execute the first task have completed the other tasks, the first task is executed on those computing nodes, and the data processing result corresponding to the first task is obtained.

[0089] In this way, even if the data transmission of the first task to the data processing system takes a significant amount of time, the computing nodes allocated to execute the first task can still perform other tasks with shorter execution times during the data transmission process. This reduces the waste of computing resources and improves resource utilization. Furthermore, since the execution time of other tasks is shorter than the transmission time, the first task does not need to wait for computing resources after the data transmission is complete.

[0090] For example, refer to Figure 6, which shows the resource-time relationship during the execution of the first task.

[0091] As shown in Figure 6, during the execution of the first task: From time 0 to t10, some resources (e.g., network resources, storage resources, etc.) are used to perform the data storage corresponding to the first task. Also from time 0 to t10, some resources (computing resources) are used to execute backfill job 1 and backfill job 2. At time t10, the data storage corresponding to the first task is completed. From time t10 to t20, some resources (computing resources) are used to execute the first task. The computing resources used to execute the first task are at least partially the same as those used to execute backfill job 1 and backfill job 2. At time t20, the execution of the first task is completed. From time t20 to t30, some resources (e.g., network resources, storage resources, etc.) are used to perform data export.

[0092] It is understandable that during the data storage phase, the computing resources required for the first task can be used to execute backfill job 1 and backfill job 2. Backfill job 1 and backfill job 2 are completed before the end time of the data storage phase (time t10). After the data storage phase ends, the first task can begin to be executed.

[0093] Compared to the execution process of the DS job shown in Figure 4B, in the execution process shown in Figure 6 the computing resources required for the first task can be used during the data storage phase to execute backfill job 1 and backfill job 2, improving resource utilization. Furthermore, compared to the execution process of the DS job shown in Figure 5B, in the execution process shown in Figure 6 the computing resources required for the first task are determined in advance. Once the data storage phase is completed, the first task can be executed on the corresponding computing resources, without spending a long time waiting for computing resources to be allocated.

[0094] In some embodiments, a computing node that has been allocated to perform the first task can be used to perform at least one other task besides the first task by backfilling.

[0095] In some embodiments, if the multiple data processing results corresponding to the first task are generated sequentially, the data processing system can send the multiple data processing results corresponding to the first task to the external storage (as an instance of the first storage device) in multiple incremental transmissions.

[0096] In this way, if the data processing result is large, it can be transmitted to the external storage device while the first task is still executing. This achieves the effect of "compute and transmit at the same time", without having to wait for the first task to complete and then spend a long time transmitting the data processing result.

[0097] To better understand the technical solutions of the embodiments of this application, some of the technical solutions are introduced in detail below, taking the scenario shown in Figure 2 as an example.

[0098] Figure 7 shows a flowchart of a data processing method according to an embodiment of this application. It can be understood that the execution entity of the process shown in Figure 7 is the data processing system 100, which can be the aforementioned HPC system 10. For simplicity, the execution entity is not repeated in the following description of the process shown in Figure 7.

[0099] As shown in Figure 7, this process includes, but is not limited to:

[0100] S701: Allocate computing nodes in the data processing system 100 for performing the first task.

[0101] In some embodiments, the data processing system 100 obtains a first task submitted by a user (e.g., receiving a DS job sent by the client 20), determines the computing resources (e.g., computing nodes) required for the first task, and allocates computing nodes in the data processing system 100 to execute the first task.

[0102] It is understandable that, until the first task is completed, the data processing system 100 cannot allocate the computing nodes already allocated to the first task to other tasks unless it receives a resource allocation instruction.

[0103] In some embodiments, the first task may be the highest priority task in the task queue (multiple tasks to be executed) corresponding to the data processing system 100.

[0104] In some embodiments, when the data processing system 100 acquires the first task, it also acquires the address information of the storage space in the external storage corresponding to the task data of the first task, and the address information of the storage space in the external storage corresponding to the data processing result of the first task.

[0105] It is understood that the external storage corresponding to the data processing result of the first task and the external storage corresponding to the task data of the first task can be the same storage device or different storage devices, and this application does not limit this. The external storage can be the storage device corresponding to the client 20 or other storage devices specified by the user, and this application does not limit this.

[0106] S702: Determine the first duration for transmitting the task data to the data processing system 100 based on the amount of task data for the first task.

[0107] In some embodiments, the data processing system 100 acquires the amount of task data for the first task and the data transfer rate between the data processing system 100 and external storage (e.g., the storage device corresponding to the client 20). Based on the amount of task data for the first task and the data transfer rate, the time required for the task data of the first task to be transferred from external storage to the data processing system 100 (i.e., a first duration) is determined, which serves as the resource lending time for the computing resources (e.g., computing nodes) corresponding to the first task.

[0108] For example, the task data volume divided by the data transmission rate can be used as the first duration. As another example, a preset scaling factor (e.g., 1.1, 1.2, etc.) can be set, the task data volume divided by the data transmission rate can be multiplied by the preset scaling factor, and the result can be used as the first duration. This application does not limit the specific calculation method for the first duration.
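The calculation of the first duration can be illustrated with a minimal sketch. It assumes that the data transmission rate is expressed in bytes per second, so that the duration is the data volume divided by the rate; if the rate were instead expressed as time per unit of data, a product would apply. The function name `estimate_first_duration` and its parameters are hypothetical and are not defined in this application.

```python
def estimate_first_duration(data_volume_bytes: float,
                            transfer_rate_bytes_per_s: float,
                            scaling_factor: float = 1.0) -> float:
    """Estimate the first duration (transfer time of the task data), in seconds.

    Assumption: the rate is given in bytes per second, so the time is
    volume / rate. The optional scaling factor (e.g. 1.1 or 1.2) adds a
    safety margin, as in the second example of the description.
    """
    if data_volume_bytes < 0:
        raise ValueError("data volume must be non-negative")
    if transfer_rate_bytes_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    if scaling_factor <= 0:
        raise ValueError("scaling factor must be positive")
    return scaling_factor * data_volume_bytes / transfer_rate_bytes_per_s


# Example: 500 GB of task data over a link sustaining 1 GB/s, with a 1.2 margin.
first_duration_s = estimate_first_duration(500e9, 1e9, scaling_factor=1.2)
```

A scaling factor greater than 1 lengthens the estimated lending window, which is only safe if the actual transfer really is that slow; the choice of factor is left open here, as it is in the description.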

[0109] S703: Receive the task data and, in parallel, execute a second task on the computing node, wherein the execution time of the second task on the computing node is less than the first duration.

[0110] In some embodiments, the data processing system 100 receives task data for a first task transmitted from external storage. Furthermore, the data processing system 100 executes a second task on a computing node allocated for executing the first task. The transmission of the task data for the first task and the execution of the second task are performed in parallel within the data processing system 100.

[0111] In other words, while the data processing system 100 stores the task data of the first task, it obtains the second task and executes it on the computing node that was allocated to execute the first task.

[0112] It is understood that the execution time of the second task on the computing node allocated to execute the first task is less than the first duration. Thus, once the transmission of the task data of the first task and the execution of the second task are both complete, the data processing system 100 can execute the first task on that computing node, without the first task having to wait for the data processing system 100 to allocate resources.

[0113] In some embodiments, the second task is performed in the form of backfilling on the computing node assigned to perform the first task.

[0114] It is understood that the data processing system 100 executing the second task based on the computing nodes allocated for executing the first task can be done in several ways. Specifically, it can execute only one second task on the same computing nodes, with the execution time of this single second task being less than the first duration. Alternatively, the data processing system 100 can execute multiple second tasks serially on the same computing nodes, with the execution time of these multiple second tasks being less than the first duration. Furthermore, the data processing system 100 can execute multiple second tasks in parallel on the same computing nodes, with the execution time of these multiple second tasks being less than the first duration. This application does not limit the scope of these methods.
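The three cases above (a single second task, several second tasks run serially, several run in parallel) can be checked with a short sketch. It rests on two assumptions that the application does not state: for serial execution the relevant time is the sum of the individual execution times, and for parallel execution it is the longest individual execution time, with the lent computing nodes having enough capacity to run all the tasks at once. The name `fits_lending_window` is hypothetical.

```python
from typing import Sequence


def fits_lending_window(execution_times_s: Sequence[float],
                        first_duration_s: float,
                        mode: str = "serial") -> bool:
    """Return True if the second task(s) finish strictly before the first duration.

    mode="serial":   tasks run one after another, so the times add up.
    mode="parallel": tasks run concurrently, so the longest one dominates
                     (assumes the lent nodes can host all tasks at once).
    A single second task is the one-element case of either mode.
    """
    if not execution_times_s:
        return False  # nothing to backfill
    if mode == "serial":
        total = sum(execution_times_s)
    elif mode == "parallel":
        total = max(execution_times_s)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return total < first_duration_s
```

For instance, two second tasks of 60 s and 50 s fit a 100 s window when run in parallel but not when run serially.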

[0115] S704: Upon completion of the second task and completion of the transmission of the task data, execute the first task on the computing node.

[0116] In some embodiments, once the second task has completed and the task data of the first task has been fully transmitted, the data processing system 100 executes the first task on the computing node that was previously allocated to execute the first task. That is, the computing node previously allocated to execute the first task is reassigned to the first task.

[0117] It can be understood that, in the data processing method shown in Figure 7, before the task data of the first task is transmitted, the data processing system 100 allocates the computing resources required for the first task and, based on the amount of task data of the first task, determines the available resource lending time of the computing nodes allocated to execute the first task. During the available resource lending time, the second task is executed on the computing nodes allocated to execute the first task. After the transmission of the task data of the first task and the execution of the second task are both completed, the first task is executed on the computing nodes allocated to execute the first task.

[0118] In this way, even if transmitting the task data of the first task to the data processing system takes a significant amount of time, the computing nodes allocated to execute the first task can still execute other tasks during the data transmission, reducing resource waste and improving resource utilization. Furthermore, since the execution time of the other tasks is less than the transmission time, the first task does not need to wait for computing resources after the data transmission is complete.
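The ordering of S703 and S704 can be sketched as follows: the transfer of the task data and the second task run in parallel, and the first task starts only after both have finished. This is a simplified single-process illustration using threads, not the scheduler implementation of this application; the function name `run_with_backfill` and its three callable parameters are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_with_backfill(receive_task_data, second_task, first_task):
    """Run the data transfer and the second task in parallel, then the first task.

    receive_task_data: callable returning the task data of the first task.
    second_task:       callable standing in for the backfilled second task.
    first_task:        callable taking the task data and returning the result.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        transfer = pool.submit(receive_task_data)  # S703: receive the task data
        backfill = pool.submit(second_task)        # S703: execute the second task
        wait([transfer, backfill])                 # S704 precondition: both done
        task_data = transfer.result()              # re-raises a transfer error
        backfill.result()                          # re-raises a second-task error
    return first_task(task_data)                   # S704: execute the first task


# Usage with stand-in callables (the sleep simulates a slow upload).
def _slow_upload():
    time.sleep(0.05)
    return [1, 2, 3]

result = run_with_backfill(_slow_upload, lambda: None, sum)
```

In this sketch the first task waits on both futures, so a second task that overruns the transfer would delay it; the requirement that the second task's execution time be less than the first duration is what prevents that in the method described above.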

[0119] In some embodiments, after the data processing system 100 completes the first task, it can transmit the data processing result corresponding to the first task to external storage (such as the storage device corresponding to the client or the storage device specified by the user when submitting the first task).

[0120] In other embodiments, if the first task produces multiple data processing results that are generated sequentially, the data processing system 100 can send these data processing results to the external storage in multiple incremental transmissions during the execution of the first task.

[0121] For example, Figure 8 shows a flowchart of another data processing method according to an embodiment of this application. It can be understood that the execution entity of the process shown in Figure 8 is the data processing system 100, which can be the aforementioned HPC system 10. For simplicity, the execution entity is not repeated in the following description of the process shown in Figure 8.

[0122] As shown in Figure 8, this process includes, but is not limited to:

[0123] S801: Allocate a computing node in the data processing system 100 to perform the first task.

[0124] S802: Determine the first duration for transmitting the task data to the data processing system 100 based on the amount of task data for the first task.

[0125] S803: Receive the task data and, in parallel, execute a second task on the computing node, wherein the execution time of the second task on the computing node is less than the first duration.

[0126] S804: Upon completion of the second task and completion of the transmission of the task data, execute the first task on the computing node.

[0127] Specifically, for S801 to S804, refer to the descriptions of S701 to S704 in Figure 7, which are not repeated here.

[0128] S805: Detect the generation of the data processing result corresponding to the first task, and store the generated data processing result to the storage node.

[0129] In some embodiments, during the execution of the first task, the data processing system 100 detects the generation of data processing results and stores the generated data processing results to a storage node.

[0130] S806: When the data processing results stored in the storage node meet the transmission conditions, the data processing results are transmitted to external storage.

[0131] In some embodiments, when the data processing system 100 detects that the first task has completed, that is, the data processing result stored in the storage node is the complete data processing result of the first task, it transmits the data processing result to external storage.

[0132] In other embodiments, the first task produces multiple data processing results that are not generated sequentially (e.g., they are generated out of order). In this case, when the data processing system 100 detects that the first task has completed, that is, the data processing results stored in the storage node are all of the data processing results of the first task, it transmits the data processing results to external storage.

[0133] In other embodiments, if the first task includes multiple data processing results, and the multiple data processing results are generated sequentially, then the data processing system 100 can transmit the multiple data processing results corresponding to the first task to external storage through incremental transmission.

[0134] For example, if the data processing system 100 detects that the data processing results stored in the storage node meet the time or quantity conditions, it will transfer the data processing results stored in the storage node to the external storage in the form of incremental transfer.

[0135] For example, the data processing system 100 sets a transmission time interval and, during the execution of the first task, transmits the data processing results to external storage using incremental transmission. On the first transmission, all data processing results of the first task generated so far are sent to external storage. On each subsequent transmission, snapshot information of the data processing results generated by the data processing system 100 and snapshot information of the data processing results stored in external storage are obtained. By comparing the two sets of snapshot information, the data processing results generated by the data processing system 100 that have not yet been sent to external storage (that is, the incremental data) are determined, and the incremental data is sent to external storage. The time interval between two adjacent data transmissions is the preset transmission time interval.

[0136] It is understandable that snapshot information may include information such as the size of the data processing result, the generation time, and the hash value.
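As an illustration of how snapshot information could be used to determine the incremental data, the following sketch records the size, generation time, and hash value of each data processing result and compares the local and remote snapshots. It assumes that each result is an individually named object (for example, a file) and that SHA-256 is used as the hash; neither is specified in this application, and the names `Snapshot`, `make_snapshot`, and `incremental_names` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    size: int            # size of the data processing result, in bytes
    generated_at: float  # generation time (e.g. a POSIX timestamp)
    digest: str          # hash value of the content (SHA-256 here, an assumption)


def make_snapshot(content: bytes, generated_at: float) -> Snapshot:
    """Build the snapshot information for one data processing result."""
    return Snapshot(len(content), generated_at, hashlib.sha256(content).hexdigest())


def incremental_names(local: Dict[str, Snapshot],
                      remote: Dict[str, Snapshot]) -> List[str]:
    """Names of results that are missing from, or differ in, external storage.

    Size and hash decide whether a result differs; the generation time is
    recorded but not compared, so an identical copy is never re-sent.
    """
    pending = []
    for name, snap in local.items():
        other = remote.get(name)
        if other is None or (other.size, other.digest) != (snap.size, snap.digest):
            pending.append(name)
    return sorted(pending)
```

Only the results returned by `incremental_names` would be sent in a subsequent transmission; results already present and unchanged in external storage are skipped.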

[0137] As another example, the data processing system 100 sets a preset data volume (e.g., 64 kilobytes, 1 megabyte, etc.) and transmits the data processing results to external storage using incremental transmission. On the first transmission, all data processing results of the first task generated so far are sent to external storage. On each subsequent transmission, snapshot information of the data processing results generated by the data processing system 100 and snapshot information of the data processing results stored in external storage are obtained. By comparing the two sets of snapshot information, the data processing results that have not yet been sent to external storage (that is, the incremental data) are determined, and the incremental data is sent to external storage. The amount of data transmitted each time is the same, namely the preset data volume.

[0138] In other words, the transmission conditions include, but are not limited to: the first task is completed, the data processing results of the first task are not generated sequentially and the first task is completed, the data processing results of the first task are generated sequentially and the generated data processing results meet the time condition, and the data processing results of the first task are generated sequentially and the generated data processing results meet the quantity condition.

[0139] The time condition can be a preset transmission time interval between two adjacent data transmissions. The quantity condition can be a preset data volume for each data transmission. This application does not impose any limitations on this.
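The transmission conditions listed above can be condensed into a single decision function, sketched below. The function name `should_transmit` and its parameters are hypothetical, and the use of "greater than or equal to" for the time and quantity thresholds is an assumption, since the application does not fix the exact comparison.

```python
from typing import Optional


def should_transmit(task_completed: bool,
                    results_sequential: bool,
                    seconds_since_last: float,
                    pending_bytes: int,
                    interval_s: Optional[float] = None,
                    preset_bytes: Optional[int] = None) -> bool:
    """Decide whether stored data processing results should be sent now.

    - A completed first task always triggers transmission.
    - Results not generated sequentially are only sent on completion.
    - Sequential results are sent when the time condition (interval_s)
      or the quantity condition (preset_bytes) is met.
    """
    if task_completed:
        return True
    if not results_sequential:
        return False
    if interval_s is not None and seconds_since_last >= interval_s:
        return True
    if preset_bytes is not None and pending_bytes >= preset_bytes:
        return True
    return False
```

Leaving both `interval_s` and `preset_bytes` unset reduces the behavior to "transmit once, when the first task completes", which corresponds to the first transmission condition.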

[0140] It can be understood that, in the data processing method shown in Figure 8, before the task data of the first task is transmitted, the data processing system 100 allocates the computing resources required for the first task and, based on the amount of task data of the first task, determines the available resource lending time of the computing nodes allocated to execute the first task. During the available resource lending time, the second task is executed on the computing nodes allocated to execute the first task. After the transmission of the task data of the first task and the execution of the second task are both completed, the first task is executed on the computing nodes allocated to execute the first task. Furthermore, if the first task produces multiple data processing results that are generated sequentially, the data processing system 100 transmits the generated data processing results to external storage via incremental transmission during the execution of the first task.

[0141] Thus, even if transmitting the task data of the first task to the data processing system takes a significant amount of time, the computing nodes allocated to execute the first task can still execute other tasks during the data transmission, reducing resource waste and improving resource utilization. Furthermore, since the execution time of the other tasks is less than the transmission time, the first task does not need to wait for computing resources after the task data transmission is complete. In addition, if the data processing result is large, it can be transmitted to external storage while the first task is still executing, achieving the effect of "compute and transmit at the same time" without waiting for the first task to complete and then spending a long time transmitting the result.

[0142] The following introduces the software structure of the data processing system 100.

[0143] For example, Figure 9 shows a structural block diagram of a data processing system 100 according to an embodiment of this application.

[0144] As shown in Figure 9, the data processing system 100 includes a management node 1110, a computing node 1120, and a storage node 1130.

[0145] The management node 1110 is used to manage compute nodes and storage nodes, for example by monitoring the operation of multiple compute nodes and remotely starting or stopping compute nodes. The management node 1110 includes a management service 1111, which is used to monitor and manage the compute nodes. The management service 1111 can be a service provided to the management node by a scheduler (or scheduler program). For example, in some embodiments, the management service 1111 is used to allocate compute nodes for a first task, determine the available resource time of the compute nodes corresponding to the first task, and allocate the compute nodes corresponding to the first task to execute a second task.

[0146] Computing node 1120 is used to execute tasks to be executed (e.g., a first task and a second task) and manage the transmission of task data and data processing results corresponding to the tasks to be executed (e.g., the first task). Computing node 1120 includes an agent service 1121 and a transmission module 1122. The agent service 1121 may be a service provided to the computing node by a scheduler (or scheduler program). The transmission module 1122 may be, for example, a data transfer tool. The agent service is used to control the execution of the tasks to be executed, control the transmission module 1122 to complete the transmission of task data and data processing results, and control the transmission module 1122 to obtain information such as the task data to be transmitted and the data transmission rate corresponding to the tasks to be executed. The transmission module 1122 is used to complete the transmission of task data and data processing results, and to obtain information such as the task data to be transmitted and the data transmission rate corresponding to the tasks to be executed.

[0147] Storage node 1130, also known as cluster shared storage, is used to provide storage services. For example, it stores task data (task program instructions, data required for task execution, etc.) corresponding to tasks executed by compute nodes, as well as data processing results obtained by compute nodes from executing tasks.

[0148] The following takes HPC system 10 as the data processing system, the DS job as the first task, and the backfill job as the second task, and describes the data processing method provided in the embodiments of this application in detail in combination with the structure of the data processing system 100 shown in Figure 10.

[0149] For example, Figure 10 shows a logic block diagram of a data processing method according to an embodiment of this application.

[0150] As shown in Figure 10, the data processing method includes the following steps:

[0151] ① The user submits the DS job to the management node 1110 of HPC system 10. The management service 1111 of the management node 1110 allocates resources (e.g., compute node 1120) to the DS job based on its resource requirements.

[0152] ② The management node 1110 of HPC system 10 distributes the DS job to the corresponding compute node 1120 of HPC system 10.

[0153] ③ The agent service 1121 of the compute node 1120 corresponding to the DS job performs the data storage of the DS job, transmitting the task data in the external storage 220 to the storage node 1130 of HPC system 10 through the transmission module 1122.

[0154] ④ The agent service 1121 of the compute node 1120 corresponding to the DS job uses the transmission module 1122 to obtain the data volume and transmission rate of the task data to be transmitted, and reports them to the management service 1111 of the management node 1110 of HPC system 10. Based on the data volume and transmission rate of the task data to be transmitted, the management service 1111 calculates the resource availability time of the compute node 1120 corresponding to the DS job, and reallocates the compute node 1120 corresponding to the DS job.

[0155] ⑤ The user submits the backfill job to the management node 1110 of HPC system 10. The management service 1111 of the management node 1110 determines the backfill strategy for resource allocation. For example, it determines whether the backfill job can be executed on the compute node 1120 corresponding to the DS job. If so, the compute node 1120 corresponding to the DS job is allocated to the backfill job. The management service 1111 then distributes the backfill job to the agent service 1121 of the compute node 1120 corresponding to the DS job, and the agent service 1121 executes the backfill job.
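The backfill decision in step ⑤ can be illustrated with a simple greedy sketch: the management service walks the job queue in priority order and lends the DS job's compute nodes to every job that needs no more nodes than were lent and whose estimated runtime still fits in the remaining resource availability time. The greedy, serial packing and the names `Job` and `pick_backfill_jobs` are assumptions made for illustration; the application does not prescribe a particular selection strategy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int            # number of compute nodes the job requires
    est_runtime_s: float  # estimated execution time, in seconds


def pick_backfill_jobs(queue: List[Job],
                       lent_nodes: int,
                       window_s: float) -> List[Job]:
    """Select backfill jobs to run one after another on the lent nodes.

    queue is assumed to be in priority order. A job is accepted only if it
    fits on the lent nodes and finishes strictly before the remaining window
    ends, so the DS job is never delayed.
    """
    chosen: List[Job] = []
    remaining = window_s
    for job in queue:
        if job.nodes <= lent_nodes and job.est_runtime_s < remaining:
            chosen.append(job)
            remaining -= job.est_runtime_s
    return chosen
```

Because the check uses a strict comparison against the remaining window, the total runtime of the selected jobs always stays below the resource availability time calculated in step ④.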

[0156] It is understood that the management service 1111 of management node 1110 and the agent service 1121 of computing node 1120 are both services provided by the scheduler (or scheduler program). The management service 1111 is used to monitor and manage computing resources (e.g., allocate computing nodes to DS jobs, determine the available time of the computing nodes corresponding to DS jobs, and allocate the computing nodes corresponding to DS jobs for executing backfill jobs). The agent service 1121 is used to control the execution of tasks to be executed (e.g., DS jobs and backfill jobs), to control the transmission of the task data and data processing results corresponding to DS jobs, and to obtain information such as the task data to be transmitted and the data transmission rate corresponding to DS jobs.

[0157] It can be understood that Figure 10 illustrates the data storage process of HPC system 10 executing a DS job. The following, in combination with Figure 11, continues to describe the process of HPC system 10 executing the DS job and exporting the data of the DS job.

[0158] For example, Figure 11 shows a logic block diagram of HPC system 10 performing the data export phase of a DS job according to an embodiment of this application.

[0159] As shown in Figure 11, HPC system 10 performs data storage for the DS job. After the data storage is complete, it executes the DS job (the command line of the DS job) and then performs data export for the DS job. For the data storage process of HPC system 10 for the DS job, refer to the description of Figure 10 above.

[0160] Continuing to refer to Figure 11, during the execution of the DS job, the compute node 1120 of HPC system 10 generates data (data processing results) and stores the data processing results in the storage node 1130 of HPC system 10. The data processing results in storage node 1130 are transferred to external storage 220 in multiple transmissions.

[0161] In other words, the data processing system 100 can transmit the data processing results of the first task to the external storage 220 through incremental transmission during the execution of the first task.
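A minimal sketch of this "compute and transmit at the same time" behavior is given below. It models the first task as an iterable that yields data processing results one by one, and sends a batch to external storage each time a preset number of results has accumulated. Counting results rather than bytes is a simplification, and the names `run_and_export`, `result_stream`, `send`, and `preset_count` are hypothetical.

```python
from typing import Callable, Iterable, List


def run_and_export(result_stream: Iterable,
                   send: Callable[[List], None],
                   preset_count: int) -> None:
    """Export results in increments while the first task is still producing them.

    result_stream: yields data processing results in generation order.
    send:          stands in for the transfer to external storage.
    preset_count:  number of buffered results that triggers a transmission.
    """
    if preset_count < 1:
        raise ValueError("preset_count must be at least 1")
    buffer: List = []
    for result in result_stream:
        buffer.append(result)          # store the result (storage node role)
        if len(buffer) >= preset_count:
            send(list(buffer))         # incremental transmission
            buffer.clear()
    if buffer:
        send(list(buffer))             # final transmission after completion


# Usage: collect the transmitted batches in a list instead of a real transfer.
sent_batches: List[List[int]] = []
run_and_export(range(5), sent_batches.append, preset_count=2)
```

By the time the first task finishes, only the last partial batch remains to be sent, which is the benefit the incremental transmission is intended to provide.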

[0162] The following, in combination with the structure of the data processing system 100 shown in Figure 9, introduces the interaction among the modules of the data processing system 100 in the data processing method provided in this application.

[0163] For example, Figure 12 shows an interaction flow diagram of the modules of the data processing system 100 during the data storage stage according to an embodiment of this application.

[0164] As shown in Figure 12, this process includes, but is not limited to:

[0165] S1201: Client 20 sends the first task to the management service 1111 of the management node 1110 of the data processing system 100.

[0166] In some embodiments, when client 20 sends the first task to the management service 1111, it also sends data information of the first task. This data information includes, but is not limited to: address information of the storage space in external storage 220 corresponding to the task data, and address information of the storage space in external storage 220 corresponding to the data processing result of the first task.

[0167] S1202: The management service 1111 of the management node 1110 of the data processing system 100 allocates a computing node for the first task.

[0168] S1203: The management service 1111 of the management node 1110 of the data processing system 100 sends the first task to the agent service 1121 of the computing node 1120 of the data processing system 100.

[0169] S1204: The agent service 1121 of the computing node 1120 of the data processing system 100 performs the data storage stage.

[0170] S1205: The agent service 1121 of the computing node 1120 of the data processing system 100 sends a message to the transmission module 1122 of the computing node 1120 of the data processing system 100, instructing it to perform the data storage stage.

[0171] S1206: The transmission module 1122 of the computing node 1120 of the data processing system 100 controls the storage node 1130 of the data processing system 100 to perform data transfer according to the data information corresponding to the first task.

[0172] In some embodiments, the transmission module 1122 retrieves the task data of the first task from the corresponding storage space based on the address information of the storage space in the external storage 220 corresponding to the task data. For example, it sends a data retrieval request to the external storage 220 and sends the address information to the external storage 220.

[0173] S1207: External storage 220 sends the task data of the first task to storage node 1130 of the data processing system 100.

[0174] In some embodiments, the external storage 220 retrieves the task data of the first task from the corresponding storage space based on the address information corresponding to the task data, and sends the retrieved task data to the storage node 1130 of the data processing system 100.

[0175] S1208: The transmission module 1122 of the computing node 1120 of the data processing system 100 obtains the data volume and transmission rate of the task data to be transmitted.

[0176] S1209: The transmission module 1122 of the computing node 1120 of the data processing system 100 sends the data volume and transmission rate of the task data to be transmitted to the agent service 1121 of the computing node 1120 of the data processing system 100.

[0177] S1210: The agent service 1121 of the computing node 1120 of the data processing system 100 sends the data volume and transmission rate of the task data to be transmitted to the management service 1111 of the management node 1110 of the data processing system 100.

[0178] S1211: The management service 1111 of the management node 1110 of the data processing system 100 calculates the available time of the computing node corresponding to the first task.

[0179] In some embodiments, the management service 1111 uses the data volume of the task data divided by the data transmission rate as the available lending time (i.e., the first duration) of the computing node corresponding to the first task.

[0180] In other embodiments, the management service 1111 multiplies the data volume of the task data divided by the data transmission rate by a preset proportional coefficient (e.g., 1.2), and uses the result as the available lending time of the computing node corresponding to the first task. This application does not limit the specific calculation method for the available lending time of the computing node corresponding to the first task.

[0181] S1212: Client 10 sends a second task to the management service 1111 of the management node 1110 of the data management system 100.

[0182] S1213: The management service 1111 of the management node 1110 of the data management system 100 will assign the computing node assigned to the first task to the second task.

[0183] S1214: The management service 1111 of the management node 1110 of the data management system 100 sends the second task to the proxy service 1121 of the computing node 1120 of the data management system 100.

[0184] S1215: The proxy service 1121 of the computing node 1120 of the data management system 100 executes the second task until the execution of the second task is completed.

[0185] In some embodiments, the proxy service 1121 executes the second task based on backfilling technology. That is, the execution of the second task does not delay the execution of the first task. In other words, the execution time of the second task is less than the available lending time of the computing node corresponding to the first task.
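The backfilling constraint described above can be illustrated with a short sketch. The `Task` type, its `estimated_runtime_s` field, and the first-fit selection order are assumptions made for illustration; this application does not specify how a candidate second task is chosen, only that its execution time is less than the available lending time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative only.
    name: str
    estimated_runtime_s: float


def pick_backfill_task(queue: List[Task], lending_time_s: float) -> Optional[Task]:
    """Return the first queued task that fits strictly inside the lending window.

    A task qualifies only if its estimated runtime is less than the lending
    time, so that running it does not delay the first task.
    """
    for task in queue:
        if task.estimated_runtime_s < lending_time_s:
            return task
    return None  # nothing fits; the node stays reserved for the first task
```

A first-fit scan is used here only for simplicity; a scheduler could equally prefer the longest task that still fits in the window.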

[0186] S1216: The proxy service 1121 of the computing node 1120 of the data management system 100 notifies the management service 1111 of the management node 1110 of the data management system 100 that the second task has been completed.

[0187] S1217: The management service 1111 of the management node 1110 of the data management system 100 reassigns the computing node to the first task.

[0188] It can be understood that, before the task data of the first task is transmitted, the data processing system 100 allocates the computing resources required for the first task and determines, based on the data volume of the task data of the first task, the available lending time of the computing nodes allocated to execute the first task. During the available lending time, the second task is executed on the computing nodes allocated to execute the first task. After both the transmission of the task data of the first task and the execution of the second task are completed, the first task is executed on the computing nodes allocated to it.

[0189] In this way, even if transmitting the task data of the first task to the data processing system takes a significant amount of time, the computing nodes allocated to execute the first task can still execute other tasks during the data transmission process, reducing resource waste and improving resource utilization. Furthermore, since the execution time of the other tasks is less than the transmission time, the first task does not need to wait for computing resources after the data transmission is complete.
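The ordering constraint summarized above (the first task starts only after both the task data has arrived and the second task has finished) can be sketched with Python's `asyncio`. The coroutine names and the use of `asyncio.sleep` as a stand-in for the real upload and the real second task are assumptions for illustration only.

```python
import asyncio
from typing import List


async def receive_task_data(events: List[str], transfer_s: float) -> None:
    # Stands in for the upload of the first task's data to the storage node.
    await asyncio.sleep(transfer_s)
    events.append("task_data_received")


async def run_second_task(events: List[str], runtime_s: float) -> None:
    # Stands in for the second task running on the lent computing node.
    await asyncio.sleep(runtime_s)
    events.append("second_task_done")


async def schedule(transfer_s: float, second_runtime_s: float) -> List[str]:
    events: List[str] = []
    # The upload and the second task proceed in parallel; gather() returns
    # only when both have completed.
    await asyncio.gather(
        receive_task_data(events, transfer_s),
        run_second_task(events, second_runtime_s),
    )
    # Only now is the computing node handed back to the first task.
    events.append("first_task_started")
    return events


def run_demo() -> List[str]:
    return asyncio.run(schedule(transfer_s=0.05, second_runtime_s=0.01))
```

Because the second task's runtime is chosen to be shorter than the transfer time, the node is idle-free during the upload and the first task is not held up once its data arrives.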

[0190] For example, Figure 13 illustrates an interaction flow diagram of the modules of the data management system 100 during task execution, according to an embodiment of this application.

[0191] As shown in Figure 13, this process includes, but is not limited to:

[0192] S1301: The proxy service 1121 of the computing node 1120 of the data management system 100 executes the first task and starts the data export corresponding to the first task.

[0193] In some embodiments, the data management system 100 executes the first task and transmits the data processing results of the first task in parallel.

[0194] S1302: The proxy service 1121 of the computing node 1120 of the data management system 100 generates data processing results and sends the data processing results to the storage node 1130 of the data processing system 100.

[0195] S1303: The storage node 1130 of the data processing system 100 writes the data processing results to the result file.

[0196] In some embodiments, after receiving the data processing result, the storage node 1130 writes the received data processing result into the result file that stores the data processing results of the first task.

[0197] S1304: The proxy service 1121 of the computing node 1120 of the data management system 100 starts the transmission module 1122 of the computing node 1120 of the data management system 100.

[0198] In some embodiments, upon detecting that a data processing result of the first task has been generated, the proxy service 1121 starts the transmission module 1122.

[0199] S1305: The transmission module 1122 of the computing node 1120 of the data management system 100 obtains the contents of the result file from the storage node 1130 of the data management system 100 and detects the generation status of the data processing results.

[0200] In some embodiments, the transmission module 1122 obtains the data processing results stored in the result file corresponding to the data processing results of the first task from the storage node 1130, so as to detect the generation status of the data processing results of the first task.

[0201] S1306: The transmission module 1122 of the computing node 1120 of the data management system 100 detects that the transmission conditions are met and starts the transmission task.

[0202] In some embodiments, the transmission module 1122 detects that the generated data processing result meets the transmission conditions and starts the transmission task.

[0203] In some embodiments, the transmission conditions include, but are not limited to: the first task is completed; the data processing results of the first task are not generated sequentially and the first task is completed; the data processing results of the first task are generated sequentially and the generated data processing results meet the time condition; or the data processing results of the first task are generated sequentially and the generated data processing results meet the quantity condition.

[0204] The time condition can be a preset transmission time interval between two adjacent data transmissions. The quantity condition can be a preset data volume for each data transmission. This application does not impose any limitations on this.
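The transmission conditions listed above can be expressed as a single predicate, sketched below. The `TransferState` record, its field names, and the default thresholds (a 60-second interval and a 64 MiB chunk) are hypothetical values chosen for illustration; this application does not fix any particular interval or data volume.

```python
from dataclasses import dataclass


@dataclass
class TransferState:
    # Hypothetical snapshot of the first task's result generation.
    task_completed: bool            # the first task has finished
    results_sequential: bool        # results are generated one after another
    seconds_since_last_send: float  # time elapsed since the previous transmission
    pending_bytes: int              # generated but not yet transmitted data


def should_transmit(state: TransferState,
                    interval_s: float = 60.0,
                    chunk_bytes: int = 64 * 1024 * 1024) -> bool:
    """Return True when one of the transmission conditions is met."""
    if state.task_completed:
        # Covers both "task completed" cases, sequential or not.
        return True
    if not state.results_sequential:
        # Non-sequential results are only sent once the task has completed.
        return False
    time_condition = state.seconds_since_last_send >= interval_s
    quantity_condition = state.pending_bytes >= chunk_bytes
    return time_condition or quantity_condition
```

Treating the time and quantity conditions as alternatives is one reading of the list above; an implementation could also apply only one of them.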

[0205] S1307: The transmission module 1122 of the computing node 1120 of the data management system 100 controls the storage node 1130 of the data management system 100 to perform data transfer.

[0206] In some embodiments, the transmission module 1122 sends the storage node 1130 a message instructing it to transmit the data processing result, together with the address information of the storage space in the external storage 220 in which the corresponding data processing result is to be stored.

[0207] S1308: The storage node 1130 of the data management system 100 sends the data processing result to the external storage 220.

[0208] In some embodiments, the storage node 1130 transmits the generated data processing results to the external storage 220 in an incremental manner, based on the address information of the storage space in the external storage 220 in which the corresponding data processing results are to be stored.

[0209] It can be understood that, if the data processing result is large, it can be transmitted to the external storage 220 while the first task is still executing, achieving the effect of "compute and transmit at the same time", so that there is no need to wait for the first task to complete before spending a long time transmitting the data processing result.
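Incremental transmission of a growing result file can be sketched as follows. This is an illustrative assumption about one way to realize it, not the mechanism of the storage node 1130: the function name, the byte-offset bookkeeping, and the `send` callback (which stands in for the transfer to external storage) are all hypothetical.

```python
from typing import Callable


def send_increment(result_path: str,
                   offset: int,
                   send: Callable[[bytes], None]) -> int:
    """Send the bytes appended to the result file since `offset`.

    Returns the new offset, so the next call transmits only data generated
    after this one. `send` is a hypothetical callback standing in for the
    transfer to external storage.
    """
    with open(result_path, "rb") as result_file:
        result_file.seek(offset)
        chunk = result_file.read()
    if chunk:
        send(chunk)
    return offset + len(chunk)
```

Calling this function each time a transmission condition is met sends only the newly generated results, so the total data sent over the task's lifetime equals the size of the result file.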

[0210] In summary, the data processing method provided in this application embodiment allocates computing resources required for the first task before the task data of the first task is transmitted, and determines the available resource lending time of the computing nodes allocated for executing the first task based on the data volume of the task data of the first task. During the available resource lending time, the second task is executed based on the computing nodes allocated for executing the first task. After the transmission of the task data of the first task and the execution of the second task are completed, the first task is executed based on the computing nodes allocated for executing the first task. Furthermore, if the first task includes multiple data processing results and the multiple data processing results are generated sequentially, the data processing system 100 transmits the generated data processing results to external storage in an incremental manner during the execution of the first task.

[0211] Thus, even if transmitting the task data of the first task to the data processing system takes a significant amount of time, the computing nodes allocated to execute the first task can still execute other tasks during the data transmission process, reducing resource waste and improving resource utilization. Furthermore, since the execution time of the other tasks is less than the transmission time, the first task does not need to wait for computing resources after the task data transmission is complete. In addition, if the data processing result is large, it can be transmitted to external storage while the first task is still executing, achieving the effect of "compute and transmit at the same time", without waiting for the first task to complete before spending significant time transmitting the result.

[0212] The hardware structure of the data processing system 100 is described below with reference to the accompanying drawings.

[0213] For example, Figure 14 shows a schematic diagram of the hardware structure of a data processing system 100 according to an embodiment of this application.

[0214] As shown in Figure 14, the data processing system 100 includes one or more (only one is shown in the figure) processing units 1410, a memory 1420, a communication interface 1430, and a bus 1440. The processing units 1410, the memory 1420, and the communication interface 1430 are interconnected via the bus 1440.

[0215] The processing unit 1410 includes, but is not limited to, a management node and a service node. It is understood that if the data processing system 100 is a cluster system, the processing unit 1410 can be a server, such as a GPU server, an NPU server, or a similar type of server. If the data processing system 100 is a standalone server, the processing unit 1410 can be a CPU, GPU, NPU, TPU, microprocessor, application-specific integrated circuit, etc. The processing unit 1410 is used to execute relevant programs (e.g., a scheduler) to implement the functions required by the management node and computing node in the data processing system 100 of this application embodiment.

[0216] Memory 1420 may include one or more memories for storing data (program instructions corresponding to the task and data to be processed required to execute the task). The memory may be read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). Memory 1420 may be a storage system with a distributed storage architecture or a centralized storage architecture.

[0217] The steps of the method disclosed in the embodiments of this application can be directly manifested as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory 1420. The processing unit 1410 reads information from memory 1420 and, in conjunction with its hardware, completes the functions required by the management nodes, computing nodes, and other nodes of the data processing system of this application embodiment.

[0218] Communication interface 1430 is used to enable communication between the data processing system 100 and other devices or communication networks. In some embodiments, the data processing system 100 establishes a communication connection with the client 20 through communication interface 1430. In other embodiments, the data processing system 100 establishes a communication connection with external storage 220 through communication interface 1430.

[0219] Bus 1440 is used to connect processing unit 1410, memory 1420, communication interface 1430 and other possible modules or circuits.

[0220] It should be understood that the structure of the data processing system 100 shown in Figure 14 is only an example. In other embodiments, the data processing system 100 may include more or fewer modules, which is not limited here.

[0221] In some embodiments, this application also provides a computer-readable storage medium storing at least one computer program instruction, at least one program segment, code set, or instruction set, which is loaded and executed by a data processing system to implement the data processing methods provided in the above-described method embodiments.

[0222] In some embodiments, this application also provides a computer program product, which includes computer program instructions that, when executed by a data processing system, enable the system to implement the data processing methods provided in the above-described method embodiments.

[0223] The various embodiments of the mechanisms disclosed in this application can be implemented in hardware, software, firmware, or a combination of these implementation methods. Embodiments of this application can be implemented as computer programs or program code executable on a programmable system, the programmable system including at least one processor, a storage system (including volatile and non-volatile memory and / or storage elements), at least one input device, and at least one output device.

[0224] Program code can be applied to input instructions to execute the functions described in this application and generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, the processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

[0225] The program code can be implemented using a high-level procedural language or an object-oriented programming language to communicate with the processing system. Assembly language or machine language can also be used when needed. In fact, the mechanisms described in this application are not limited to any particular programming language. In either case, the language can be a compiled language or an interpreted language.

[0226] In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or through other computer-readable media. Therefore, machine-readable media may include any mechanism for storing or transmitting information in a machine-readable (e.g., computer-readable) form, including but not limited to floppy disks, optical discs, magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet in the form of electrical, optical, acoustic, or other propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Therefore, machine-readable media include any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a machine-readable (e.g., computer-readable) form.

[0227] In the accompanying drawings, some structural or methodological features may be shown in a specific arrangement and / or order. However, it should be understood that such a specific arrangement and / or order may not be necessary. Rather, in some embodiments, these features may be arranged in a manner and / or order different from that shown in the illustrative drawings. Furthermore, the inclusion of structural or methodological features in a particular figure does not imply that such features are required in all embodiments, and in some embodiments, these features may be omitted or may be combined with other features.

[0228] It should be noted that all units / modules mentioned in the device embodiments of this application are logical units / modules. Physically, a logical unit / module can be a physical unit / module, a part of a physical unit / module, or a combination of multiple physical units / modules. The physical implementation of these logical units / modules themselves is not the most important factor; the combination of functions implemented by these logical units / modules is the key to solving the technical problems proposed in this application. Furthermore, to highlight the innovative aspects of this application, the above-described device embodiments of this application have not introduced units / modules that are not closely related to solving the technical problems proposed in this application. This does not mean that the above-described device embodiments do not contain other units / modules.

[0229] It should be noted that in the examples and description of this patent, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0230] Although this application has been illustrated and described with reference to certain preferred embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made thereto without departing from the spirit and scope of this application.

Claims

1. A data processing method, characterized in that the method is applied to a data processing system and includes: allocating, in the data processing system, a computing node for executing a first task; determining, based on the data volume of the task data of the first task, a first duration for transmitting the task data to the data processing system; executing a second task on the computing node in parallel with receiving the task data, wherein the execution time of the second task on the computing node is less than the first duration; and executing the first task on the computing node in response to completion of the second task and completion of the transmission of the task data.

2. The method according to claim 1, characterized in that the first task includes multiple data processing results, and the method further includes: when the multiple data processing results are generated sequentially and the generated data processing results meet a time condition or a quantity condition, sending the data processing results generated during the execution of the first task to a first storage device via incremental transmission; wherein the time condition is that the time interval between any two consecutive transmissions of data processing results from the data processing system to the first storage device is the same, and the quantity condition is that the amount of data processing results sent by the data processing system to the first storage device is the same each time.

3. The method according to claim 1 or 2, characterized in that, The task data includes at least one of the following: program instructions corresponding to the first task, and data to be processed for the first task.

4. The method according to any one of claims 1 to 3, characterized in that determining the first duration for transmitting the task data to the data processing system based on the data volume of the task data of the first task includes: obtaining the data volume of the task data and the transmission rate of the task data; and calculating the first duration based on the data volume of the task data and the transmission rate of the task data.

5. The method according to claim 4, characterized in that calculating the first duration based on the data volume and transmission rate of the task data includes: using the quotient of the data volume of the task data divided by the transmission rate of the task data as the first duration; or determining a preset proportional coefficient, and using the product of the preset proportional coefficient and that quotient as the first duration.

6. The method according to any one of claims 1 to 5, characterized in that, The second task is performed on the computing node using the following method: The computing node is assigned to the second task using a backfilling technique, and the second task is executed on the computing node.

7. A data processing system, characterized in that it includes: a management node, used to allocate a computing node in the data processing system for executing a first task, and further used to determine, based on the data volume of the task data of the first task, a first duration for transmitting the task data to the data processing system; a storage node, used to receive the task data; and the computing node, used to execute a second task, wherein the execution time of the second task on the computing node is less than the first duration, and the execution of the second task by the computing node proceeds in parallel with the reception of the task data by the storage node; the computing node being further configured to execute the first task in response to completion of the second task and completion of the transmission of the task data.

8. A data processing system, characterized in that it includes: a memory for storing instructions to be executed by one or more processing units of the data processing system; and a processing unit, which is one of the processing units of the data processing system and is used to execute the instructions stored in the memory to implement the method of any one of claims 1 to 7.

9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores instructions that, when executed on the system, cause the system to perform the method of any one of claims 1 to 7.

10. A computer program product, characterized in that, The computer program product includes instructions that, when executed on the system, cause the system to perform the method of any one of claims 1 to 7.