Data processing methods, apparatus, computers, and storage media for data streams

By obtaining the dependency DAG graph and the data communication volume at the operating system level, and by optimizing thread allocation and scheduling, the invention addresses the limited efficiency and limited room for optimization of the data flow execution mode on traditional control flow processors, enabling efficient data processing on multi-core processors and large-scale distributed systems.

CN116048759B · Active · Publication Date: 2026-06-30 · SHENZHEN POLYTECHNIC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN POLYTECHNIC
Filing Date: 2023-01-10
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In traditional control flow processor environments, existing technologies have limited efficiency and optimization potential in data flow execution modes, making it difficult to effectively support multi-core processors and large-scale distributed systems.

Method used

The operating system obtains the dependency DAG graph and data communication volume of node tasks from the process block PCB of the program, allocates threads to node tasks based on ready nodes, direct successor nodes and the total number of system threads, and performs scheduling based on processor system load and communication volume, optimizing task allocation through pre-scheduling and latency estimation.

Benefits of technology

It provides support for data flow execution mode at the operating system level, which significantly improves efficiency and optimization potential, and is suitable for multi-core processors and large-scale distributed systems.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a data processing method for data streams, comprising: the operating system obtains the dependency DAG graph of node tasks and the data communication volume from the process block PCB of the program, wherein nodes in the dependency DAG graph represent node tasks, and edges connecting the nodes in the dependency DAG graph represent the transmission quantity and data volume of the node tasks; threads are allocated to node tasks based on ready nodes, direct successor nodes, and the total number of system threads in the dependency DAG graph; and threads are scheduled based on the current processor system load, node task communication relationships, and communication volume. This invention provides a support method for data stream execution mode at the operating system level, greatly improving its efficiency and optimization potential.

Description

Technical Field

[0001] This invention belongs to the field of data processing technology, and in particular to a data processing method, apparatus, computer, and storage medium for data streams.

Background Technology

[0002] The development of processors has shifted from simply increasing processing speed to focusing on multi-core processors, and large-scale distributed systems are becoming increasingly common. Traditionally, programming uses a sequential command structure, where data is often "static," requiring continuous data access. This makes programs less suitable for multi-core processors and large-scale distributed systems. Dataflow programming, on the other hand, emphasizes data-driven operations, explicitly defining input and output connections. It doesn't use commands; as soon as data is ready and valid, the relevant operations are executed immediately. Therefore, dataflow programming is inherently parallel and can run well on multi-core processors and large-scale distributed systems.

[0003] In the current context of massively parallel applications, dataflow computing outperforms the existing mainstream control-flow execution model in both programming and execution. Dataflow execution can be implemented at the application level in current control-flow processor environments: for example, TensorFlow's internal execution engine uses dataflow execution, and some dedicated dataflow libraries (e.g., TaskFlow) implement dataflow execution on top of existing control-flow processors, operating systems, and programming languages.

[0004] However, since these implementations have no support at the operating system level, their efficiency and room for optimization are greatly limited.

Summary of the Invention

[0005] To address the aforementioned technical problems, embodiments of the present invention provide a data processing method for a data stream, comprising:

[0006] The operating system obtains the dependency DAG graph of node tasks and the data communication volume from the process block PCB of the program. The nodes in the dependency DAG graph represent the node tasks, and the edges connecting the nodes of the dependency DAG graph represent the number of transmissions and the amount of data of the node tasks.

[0007] Threads are allocated to node tasks based on the ready nodes, direct successor nodes, and total system threads in the dependency DAG graph.

[0008] Threads are scheduled based on the current processor's system load, node task communication relationships, and communication volume.

[0009] Further, the step of allocating threads to node tasks based on the ready nodes, direct successor nodes, and total system threads in the dependency DAG graph includes:

[0010] The ready nodes of the node tasks in the dependency DAG graph are sorted from most to least number of edges, and the node tasks are assigned to the thread of the first online ready node.

[0011] Furthermore, the scheduling of threads based on the current processor's system load, node task communication relationships, and communication volume includes:

[0012] The total amount of task data on the corresponding processor is obtained by statistically analyzing the amount of task data in each processor.

[0013] Pre-schedule tasks one by one to each processor according to the preset scheduling algorithm;

[0014] Calculate the total latency estimate for all processors based on the pre-scheduling results and the total workload;

[0015] The various pre-scheduling results are evaluated, and the threads of the ready node tasks are bound to the pre-scheduled processor with the minimum total latency estimate for data processing.

[0016] Further, the step of calculating the total latency estimate for all processors based on the pre-scheduling results and the total task load includes:

[0017] The estimated data transfer time of the node task on the DAG graph is calculated as Wedge = the sum of the times it takes for all input data to be copied from the NUMA node where the predecessor node is located to the NUMA node where the processor is located;

[0018] The ratio k of total data capacity to cache is obtained by obtaining the total data capacity and the last-level cache capacity share of the node tasks on each processor core;

[0019] Calculate the total latency estimate for each processing core: Td = Wedge + total data capacity * k * λ, where λ is an empirical value.

[0020] Furthermore, the remaining nodes without allocated threads are offline ready nodes and offline direct successor nodes, and the method further includes:

[0021] The readiness status of online blocked nodes is tracked by the operating system based on the sequential dependencies in the PCB, while the readiness status of offline direct successor nodes is tracked and supported by user code or user-mode runtime libraries.

[0022] A data processing apparatus for a data stream, comprising:

[0023] The acquisition module is used to acquire the dependency DAG graph of node tasks and the data communication volume from the process block PCB of the program. The nodes in the dependency DAG graph represent the node tasks, and the edges connecting the nodes of the dependency DAG graph represent the transmission quantity and data volume of the node tasks.

[0024] The processing module is used to allocate threads to node tasks based on the ready nodes, direct successor nodes and the total number of system threads in the dependency DAG graph;

[0025] The execution module is used to schedule threads based on the current processor's system load, node task communication relationships, and communication volume.

[0026] Furthermore, the processing module is also used to sort the ready nodes of the node tasks in the dependency DAG graph from most to least number of edges, and assign the node tasks to the thread of the first online ready node.

[0027] Furthermore, the execution module includes:

[0028] The first acquisition submodule is used to calculate the total amount of task data on each processor by counting the amount of task data in each processor.

[0029] The first processing submodule is used to pre-schedule tasks one by one to each processor according to a preset scheduling algorithm.

[0030] The second processing submodule is used to calculate the total latency estimate of all processors based on the pre-scheduling results and the total workload.

[0031] The first execution submodule is used to evaluate various pre-scheduling results and bind the threads of the ready node tasks to the pre-scheduled processor with the smallest total latency estimate for data processing.

[0032] Furthermore, the first execution submodule includes:

[0033] The second acquisition submodule is used to calculate the estimated data transfer time of the node task on the dependency DAG graph, Wedge = the sum of the time for all input data to be copied from the NUMA node where the predecessor node is located to the NUMA node where the processor is located;

[0034] The third acquisition submodule is used to obtain the total data capacity and the last-level cache capacity share of the node tasks on each processor core to obtain the ratio k of the total data capacity to the cache.

[0035] The second execution submodule is used to calculate the total latency estimate for each processing core, Td = Wedge + total data capacity * k * λ, where λ is an empirical value.

[0036] Furthermore, the remaining nodes without allocated threads are offline ready nodes and offline direct successor nodes. The execution module is also used to track the ready status of online blocked nodes according to the sequential dependencies in the PCB, while the ready status of the offline direct successor nodes is tracked by user code or a user-mode runtime library.

[0037] A computer device includes a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the data processing method for the data stream as described above.

[0038] A storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method for the data stream as described above.

[0039] In this embodiment of the invention, the operating system obtains the dependency DAG graph of node tasks and data communication volume from the process block PCB of the program. Nodes in the dependency DAG graph represent node tasks, and edges connecting nodes represent the transmission quantity and data volume of each node task. Threads are allocated to node tasks based on ready nodes, direct successor nodes, and the total number of system threads in the dependency DAG graph. Threads are scheduled according to the current processor's system load, node task communication relationships, and communication volume. This embodiment of the invention provides a support method for data flow execution mode at the operating system level, greatly improving its efficiency and optimization potential.

Attached Figure Description

[0040] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0041] Figure 1 is a schematic flowchart illustrating the data processing method for a data stream provided in an embodiment of the present invention;

[0042] Figure 2 is a schematic diagram of the data flow provided in an embodiment of the present invention;

[0043] Figure 3 is a basic structural block diagram of a data processing device for data streams provided in an embodiment of the present invention;

[0044] Figure 4 is a basic structural block diagram of a computer device provided in an embodiment of the present invention.

Detailed Implementation

[0045] To enable those skilled in the art to better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.

[0046] In some of the processes described in the specification, claims, and accompanying drawings of this invention, multiple operations appearing in a specific order are included. However, it should be clearly understood that these operations may not be executed in the order they appear herein, or may be executed in parallel. The operation numbers, such as 101, 102, etc., are merely used to distinguish different operations and do not represent any execution order. Furthermore, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that the descriptions such as "first," "second," etc., in this document are used to distinguish different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit "first" and "second" to different types.

[0047] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0048] Referring to Figure 1, an embodiment of the present invention provides a data processing method for a data stream. As shown in Figure 1, the method specifically includes the following steps:

[0049] S1. The operating system obtains the dependency DAG graph of node tasks and the data communication volume from the process block PCB of the program. The nodes in the dependency DAG graph represent the node tasks, and the edges connecting the nodes of the dependency DAG graph represent the transmission quantity and data volume of the node tasks.

[0050] Dataflow programming is a high-performance parallel programming model that addresses the utilization efficiency of multi-core processors. It differs significantly from traditional programming languages in that it executes in a data-driven manner, distributing the data to be processed across the cores, separating computation from communication, and leveraging the parallel characteristics of software pipelining through task scheduling and allocation to fully exploit the potential parallelism within the streaming program, thus achieving load balancing across the cores. In the dataflow paradigm, a static instance of a dataflow program is described, according to its structure, as a directed acyclic graph (DAG). As shown in Figure 2, nodes represent computational units, and edges represent data transmission paths. Adjacent nodes transmit data through edges, nodes consume data to perform computations, and output the resulting data to the input-output sequence as input for the next computational unit.

[0051] It should be noted that, in this embodiment of the invention, the entire data flow computation is managed as a directed acyclic graph (DAG), and data flow tasks are executed using threads as carriers. The information in the process control block (PCB) is extended with fields that indicate the sequential dependencies between data flow tasks, fields that record the size (number of bytes) of the data corresponding to the output edges, a field that records the stack frame length required by the task, a data preparation condition count for each data flow task node, and a data flow task activation flag.

[0052] S2. Allocate threads to node tasks based on the ready nodes, direct successor nodes and total system threads in the dependency DAG graph.

[0053] Specifically, step S2 involves sorting the ready nodes of the node tasks in the dependency DAG graph from most to least number of edges, and assigning the node tasks to the thread of the first online ready node.
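One possible reading of this step is sketched below in C: the ready nodes are sorted by edge count in descending order, and the available threads go to the nodes at the head of that order, which then count as online. The `ready_node` record, the tie-break on `id`, and the function names are assumptions made for this illustration and do not appear in the patent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical ready-node record; field names are illustrative only. */
struct ready_node {
    int id;          /* node task identifier */
    int edge_count;  /* number of edges attached to the node */
    int online;      /* 1 if a thread has been allocated, else 0 */
};

/* qsort comparator: more edges first; ties broken by smaller id (assumed). */
static int by_edges_desc(const void *a, const void *b)
{
    const struct ready_node *x = a;
    const struct ready_node *y = b;
    if (x->edge_count != y->edge_count)
        return (x->edge_count < y->edge_count) - (x->edge_count > y->edge_count);
    return (x->id > y->id) - (x->id < y->id);
}

/*
 * Sort the ready nodes by edge count (descending) and mark the first
 * `threads` of them as online. Returns the number of nodes that
 * received a thread; the rest stay offline.
 */
size_t allocate_threads(struct ready_node *nodes, size_t n, size_t threads)
{
    assert(nodes != NULL);
    qsort(nodes, n, sizeof nodes[0], by_edges_desc);
    size_t granted = n < threads ? n : threads;
    for (size_t i = 0; i < n; i++)
        nodes[i].online = (i < granted);
    return granted;
}
```

Nodes left with `online == 0` correspond to the offline ready nodes mentioned elsewhere in the description. How direct successor nodes share the thread budget is not modeled here.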

[0054] The operating system calculates the number of node tasks currently running on each processor core, summing the data size of the output edges required by these tasks and the stack frame length required by the tasks to obtain the total data capacity. This total data capacity is calculated on each core and used as a basis for new task scheduling. Specifically, the node tasks currently processing data are counted from the end of the DAG (Directed Acyclic Graph), and the total data capacity of the node tasks executing across all processor cores is calculated, including:

[0055] Step 1: Find the currently ready node in the DAG directed graph that is associated with the target node where the data being processed is located;

[0056] Step 2: Sum the data size of the edge between the target node and the currently ready node and the required stack frame length to obtain the total data capacity of each processor core.
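The per-core bookkeeping described above can be sketched as follows. This is a minimal illustration that assumes each running task already knows its core, the summed size of its output-edge data, and its stack frame length. The `flow_task` record, `NR_CORES`, and the use of `size_t` are assumptions of the sketch; the patent's own declaration is an `int current_size[CPUs]` array.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CORES 4  /* assumed core count for this sketch */

/* Hypothetical per-task record; names are illustrative only. */
struct flow_task {
    int core;               /* core the task currently runs on */
    size_t out_edge_bytes;  /* summed size of the task's output-edge data */
    size_t frame_size;      /* stack frame length required by the task */
};

/*
 * For every core, accumulate output-edge bytes plus stack frame length
 * of each task running on it. The result plays the role of the
 * per-core total data capacity used as a basis for new scheduling.
 */
void total_data_capacity(const struct flow_task *tasks, size_t n,
                         size_t current_size[NR_CORES])
{
    for (int c = 0; c < NR_CORES; c++)
        current_size[c] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].core >= 0 && tasks[i].core < NR_CORES);
        current_size[tasks[i].core] +=
            tasks[i].out_edge_bytes + tasks[i].frame_size;
    }
}
```

Recomputing the totals from scratch keeps the sketch simple. A kernel would more plausibly update the per-core counter incrementally as tasks start and finish.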

[0057] S3. Schedule threads based on the current processor's system load, node task communication relationships, and communication volume.

[0058] Specifically, step S3 includes the following steps:

[0059] Step 1: Calculate the total amount of task data on each processor.

[0060] Step 2: Perform pre-scheduling according to the preset scheduling algorithm to schedule tasks one by one to each processor;

[0061] Step 3: Calculate the total latency estimate for all processors based on the pre-scheduling results and the total task load;

[0062] In practical applications, step three includes: calculating the estimated data transfer time Wedge of the node task on the dependency DAG graph, which is the sum of the times it takes for all input data to be copied from the NUMA node where the predecessor node is located to the NUMA node where the processor is located; obtaining the total data capacity and the last-level cache capacity share of the node task on each processor core to obtain the ratio k of total data capacity to cache; and calculating the total latency estimate Td of each processing core, which is Wedge + total data capacity * k * λ, where λ is an empirical value.
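Written out as formulas, the estimate in step three could read as follows. The symbols pred(v), t_copy, D_i and C_i are notation introduced here for illustration only (predecessors of node task v, copy time of one input to candidate core i, total data capacity on core i, and that core's share of the last-level cache); W_edge stands for the quantity the text calls Wedge.

```latex
\[
\begin{aligned}
W_{\text{edge}} &= \sum_{p \in \mathrm{pred}(v)} t_{\text{copy}}(p \to i) \\
k &= \frac{D_i}{C_i} \\
T_d &= W_{\text{edge}} + D_i \cdot k \cdot \lambda
\end{aligned}
\]
```

As a worked example under assumed values (not taken from the patent): with W_edge = 20 microseconds, D_i = 512 kB, C_i = 1024 kB and λ = 1 microsecond/kB, the ratio is k = 0.5 and the estimate is T_d = 20 + 512 × 0.5 × 1 = 276 microseconds.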

[0063] When there are multiple cores, the total data capacity and last-level cache capacity of the node tasks on each processor core need to be equally divided.

[0064] In one embodiment of the present invention, the following are obtained from the control block PCB of the node task: the predecessor relationships and corresponding data sizes, the difference in communication cost between processor cores, the last-level cache capacity of each processor core, and the total data capacity, which is the sum of the data size of the output edges of the predecessor nodes, the data size of the output edges of this node, and the stack frame length of this node's task. All processor cores are then traversed to obtain the total data capacity and last-level cache capacity of the node tasks on each processor core, from which the ratio k of the total data capacity to the cache is obtained.

[0065] In one embodiment of the present invention, for example in the Linux kernel, a structure member is added to the thread control block task_struct{} structure.

[0066] struct pre_suc { int pre_count; // Number of predecessor nodes

[0067] struct pre_suc **prenodes; // Array of pointers to predecessor nodes
int suc_count; // Number of successor nodes

[0068] struct pre_suc **sucnodes; // Array of pointers to successor nodes

[0069] };

[0070] Add the required stack frame length for the current node's task to the task_struct:

[0071] int frame_size; // The stack frame length required by the task in this node

[0072] Add a data preparation condition for this node and an activation flag to the task_struct:

[0073] `int data_ready_count;` // Records the number of predecessor data items that are ready

[0074] `int activated;` // Set to 1 when data_ready_count == pre_count

[0075] In the operating system kernel, add a record of the data overhead of data stream tasks on each CPU:

[0076] `int current_size[CPUs];` // One entry per CPU core; each core keeps its own total

[0077] current_size[n] records the total data overhead of all data flow tasks on processor core n, that is, the total length of their output-edge data and stack frames.

[0078] Take the data stream task shown in Figure 2 as an example:

[0079] After task a completes its computation, the task_struct information for task c will be updated: the count of ready predecessor data items, data_ready_count, will be incremented by one; if data_ready_count = pre_count, the task will be activated and activated = 1. The same operation will be performed on task f.
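The update performed when a task finishes can be sketched in C as below. The `flow_node` structure mirrors the fields named in the description (`pre_count`, `suc_count`, `sucnodes`, `data_ready_count`, `activated`), but the structure itself, the function name `complete_task`, and the return value are assumptions of this sketch rather than the patent's kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields added to task_struct. */
struct flow_node {
    int pre_count;                /* number of predecessor nodes */
    int suc_count;                /* number of successor nodes */
    struct flow_node **sucnodes;  /* pointers to successor nodes */
    int data_ready_count;         /* predecessor data items already ready */
    int activated;                /* 1 once all predecessor data is ready */
};

/*
 * Called when `done` finishes its computation: increment each
 * successor's data_ready_count and activate the successor once the
 * count reaches pre_count. Returns how many successors became active.
 */
int complete_task(struct flow_node *done)
{
    assert(done != NULL);
    int newly_active = 0;
    for (int i = 0; i < done->suc_count; i++) {
        struct flow_node *s = done->sucnodes[i];
        s->data_ready_count++;
        if (!s->activated && s->data_ready_count == s->pre_count) {
            s->activated = 1;
            newly_active++;
        }
    }
    return newly_active;
}
```

With task a feeding tasks c and f, a call to `complete_task` on a increments both counters, and each successor is activated only if a was its last outstanding predecessor. The sketch is single-threaded; concurrent completions would need atomic updates or locking, which is not shown.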

[0080] If task c is activated at this time, scheduling will be implemented using the data added in this patent. A feasible scheduling scheme example is shown below.

[0081] Assuming task c is scheduled to processor i, the following calculations are performed: Calculate the estimated data transfer time Wedge on the edges of the DAG graph. Calculate the total data capacity on processor core i and the total data capacity of task c, obtaining the ratio k of the total data capacity to the cache. Calculate the total latency estimate Td = Wedge + total data capacity of c * k * λ, where λ is an empirical value obtained statistically (e.g., unit: 1 microsecond / kB). Iterate through all processor cores, performing the above calculations one by one, and select the core m with the lowest Td, scheduling task c to run on core m.

[0082] Step 4: Evaluate the various pre-scheduling results and bind the threads of the ready node tasks to the pre-scheduled processor with the lowest total latency estimate for data processing.
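Steps 1 to 4 can be sketched as a loop over candidate cores that evaluates Td = Wedge + total data capacity * k * λ for each and keeps the minimum. The `core_state` record, the units, and the rule that the first core wins a tie are assumptions of this sketch; in particular, it assumes that the total data capacity passed in already includes the candidate task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-core snapshot for one pre-scheduling attempt. */
struct core_state {
    double wedge_us;      /* estimated input copy time to this core, microseconds */
    double total_kb;      /* total data capacity on the core, kB (assumed to include the task) */
    double llc_share_kb;  /* this core's share of the last-level cache, kB */
};

/* Td = Wedge + total * k * lambda, with k = total / last-level cache share. */
double latency_estimate(const struct core_state *c, double lambda_us_per_kb)
{
    assert(c != NULL && c->llc_share_kb > 0.0);
    double k = c->total_kb / c->llc_share_kb;
    return c->wedge_us + c->total_kb * k * lambda_us_per_kb;
}

/*
 * Pre-schedule the task on every core in turn and return the index of
 * the core with the smallest latency estimate (first core wins ties).
 */
size_t pick_core(const struct core_state *cores, size_t n,
                 double lambda_us_per_kb)
{
    assert(cores != NULL && n > 0);
    size_t best = 0;
    double best_td = latency_estimate(&cores[0], lambda_us_per_kb);
    for (size_t i = 1; i < n; i++) {
        double td = latency_estimate(&cores[i], lambda_us_per_kb);
        if (td < best_td) {
            best_td = td;
            best = i;
        }
    }
    return best;
}
```

The returned index is the core to which the ready node task's thread would then be bound. The binding itself is outside this sketch.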

[0083] In this embodiment of the invention, the remaining nodes without allocated threads are offline ready nodes and offline direct successor nodes. The ready status of online blocked nodes is tracked by the operating system according to the sequential dependencies in the PCB, and the ready status of offline direct successor nodes is tracked by user code or a user-mode runtime library.

[0084] In this embodiment of the invention, the operating system obtains the dependency DAG graph of node tasks and data communication volume from the process block PCB of the program. Nodes in the dependency DAG graph represent node tasks, and edges connecting nodes represent the transmission quantity and data volume of each node task. Threads are allocated to node tasks based on ready nodes, direct successor nodes, and the total number of system threads in the dependency DAG graph. Threads are scheduled according to the current processor's system load, node task communication relationships, and communication volume. This embodiment of the invention provides a support method for data flow execution mode at the operating system level, greatly improving its efficiency and optimization potential.

[0085] As shown in Figure 3, to solve the above problems, this embodiment of the invention also provides a data processing device for data streams, including: an acquisition module 2100, a processing module 2200, and an execution module 2300. The acquisition module 2100 is used by the operating system to acquire the dependency DAG graph of node tasks and the data communication volume from the process block PCB of the program. Nodes in the dependency DAG graph represent node tasks, and edges connecting nodes in the dependency DAG graph represent the transmission quantity and data volume of the node tasks. The processing module 2200 is used to allocate threads to node tasks based on ready nodes, direct successor nodes, and the total number of system threads in the dependency DAG graph. The execution module 2300 is used to schedule threads based on the current processor's system load, node task communication relationships, and communication volume.

[0086] In some implementations, the processing module is further configured to sort the ready nodes of the node tasks in the dependency DAG graph from most to least number of edges, and assign the node tasks to the thread of the first online ready node.

[0087] In some embodiments, the execution module includes: a first acquisition submodule, used to statistically analyze the amount of task data in each processor to obtain the total amount of task data on the corresponding processor; a first processing submodule, used to pre-schedule tasks one by one to each processor according to a preset scheduling algorithm; a second processing submodule, used to calculate the total latency estimate of all processors based on the pre-scheduling result and the total amount of tasks; and a first execution submodule, used to evaluate various pre-scheduling results and bind the threads of the ready node tasks to the pre-scheduled processor with the smallest total latency estimate for data processing.

[0088] In some implementations, the first execution submodule includes: a second acquisition submodule, configured to calculate the estimated data transfer time Wedge of the node task on the dependency DAG graph, which is the sum of the times it takes for all input data to be copied from the NUMA node where the predecessor node is located to the NUMA node where the current processor is located; a third acquisition submodule, configured to obtain the total data capacity and the last-level cache capacity share of the node task on each processor core to obtain the ratio k of the total data capacity to the cache; and a second execution submodule, configured to calculate the estimated total latency Td of each processing core, which is Td = Wedge + total data capacity * k * λ, where λ is an empirical value.

[0089] In some implementations, the remaining nodes without allocated threads are offline ready nodes and offline direct successor nodes. The execution module is also used to track the ready status of online blocked nodes according to the sequential dependencies in the PCB, while the ready status of the offline direct successor nodes is tracked by user code or a user-mode runtime library.

[0090] In this embodiment of the invention, the data processing device for data streams obtains the dependency DAG graph of node tasks and data communication volume from the process block PCB of the program. Nodes in the dependency DAG graph represent node tasks, and edges connecting the nodes represent the transmission quantity and data volume of each node task. Threads are allocated to node tasks based on ready nodes, direct successor nodes, and the total number of system threads in the dependency DAG graph. Threads are scheduled according to the current processor's system load, node task communication relationships, and communication volume. This embodiment of the invention provides a support method for the data stream execution mode at the operating system level, greatly improving its efficiency and optimization potential.

[0091] To address the aforementioned technical problems, embodiments of the present invention also provide a computer device. Please refer to Figure 4, which is a basic structural block diagram of the computer device in this embodiment.

[0092] Figure 4 shows the internal structure of the computer device. As shown in Figure 4, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected via a system bus. The non-volatile storage medium stores an operating system, a database, and computer-readable instructions. The database may store a sequence of control information. When executed by the processor, the computer-readable instructions enable the processor to implement the data processing method for a data stream. The processor provides computational and control capabilities, supporting the operation of the entire computer device. The memory stores computer-readable instructions which, when executed by the processor, enable the processor to perform the data processing method for a data stream. The network interface of the computer device is used for communication with a terminal. Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0093] In this embodiment, the processor is used to execute the specific functions of the acquisition module 2100, processing module 2200, and execution module 2300 shown in Figure 3, and the memory stores the program code and various types of data required to execute the above modules. The network interface is used for data transmission between the user terminal and the server. In this embodiment, the memory stores the program code and data required to execute all sub-modules of the data processing method for a data stream, and the server can call this program code and data to execute the functions of all sub-modules.

[0094] The computer device provided in this invention allows the operating system to obtain the dependency DAG graph of node tasks and data communication volume from the process block PCB of a program. Nodes in the dependency DAG graph represent node tasks, and edges connecting the nodes represent the transmission quantity and data volume of each node task. Threads are allocated to node tasks based on ready nodes, direct successor nodes, and the total number of system threads in the dependency DAG graph. Threads are scheduled according to the current processor's system load, node task communication relationships, and communication volume. This invention provides a method for supporting data flow execution mode at the operating system level, significantly improving its efficiency and optimization potential.

[0095] The present invention also provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method for a data stream described in any of the above embodiments.

[0096] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. The aforementioned storage medium can be a non-volatile storage medium such as a magnetic disk, optical disk, or read-only memory (ROM), or random access memory (RAM).

[0097] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0098] The above description covers only some embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A data processing method for a data stream, comprising:
the operating system obtaining the dependency DAG graph of node tasks and the data communication volume from the process control block (PCB) of the program, wherein the nodes in the dependency DAG graph represent the node tasks, and the edges connecting the nodes of the dependency DAG graph represent the number of transmissions and the amount of data of the node tasks;
allocating threads to node tasks based on the ready nodes, direct successor nodes, and total number of system threads in the dependency DAG graph; and
scheduling threads based on the current processor's system load, node task communication relationships, and communication volume;
wherein scheduling threads based on the current processor's system load, node task communication relationships, and communication volume comprises:
obtaining the total amount of task data on each processor by counting the amount of task data in that processor;
pre-scheduling tasks one by one to each processor according to a preset scheduling algorithm;
calculating the total latency estimate for all processors based on the pre-scheduling results and the total task data volume; and
evaluating the various pre-scheduling results, and binding the threads of the ready node tasks to the pre-scheduled processor with the minimum total latency estimate for data processing;
and wherein calculating the total latency estimate for all processors based on the pre-scheduling results and the total task data volume comprises:
calculating the estimated data transfer time W_edge of the node task on the dependency DAG graph, W_edge being the sum of the times taken to copy all input data from the NUMA node where the predecessor node is located to the NUMA node where the processor is located;
obtaining the total data capacity and the last-level cache capacity share of the node tasks on each processor core, and from them the ratio k of total data capacity to cache; and
calculating the total latency estimate for each processing core [formula not reproduced in this text], where λ is an empirical value.
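The pre-scheduling loop of claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-byte `copy_cost` table, all function and parameter names, and in particular `placeholder_combine` are hypothetical. The formula that combines W_edge, k, and λ is not reproduced in this text, so the combining function is passed in as a parameter, and the placeholder used here is not the patent's formula.

```python
def transfer_time(inputs, target_numa, copy_cost):
    """W_edge: sum of copy times for all inputs to the target NUMA node.

    inputs: list of (predecessor_numa_node, data_bytes).
    copy_cost[(src, dst)]: assumed seconds per byte between NUMA nodes.
    """
    return sum(nbytes * copy_cost[(src, target_numa)] for src, nbytes in inputs)

def placeholder_combine(w_edge, k, lam):
    # ASSUMPTION: stand-in only, NOT the patent's formula (which is missing
    # from the text). It penalizes cache pressure quadratically.
    return w_edge + lam * k * k

def pick_core(inputs, task_bytes, core_numa, load_bytes, llc_share,
              copy_cost, combine, lam):
    """Pre-schedule the task on each core in turn and keep the best one.

    core_numa:  core id -> NUMA node of that core.
    load_bytes: core id -> total task data already on that core.
    llc_share:  core id -> last-level cache capacity share of that core.
    Returns (core with minimum total latency estimate, that estimate).
    """
    best_core, best_total = None, float("inf")
    for candidate in sorted(core_numa):
        total = 0.0
        for core in core_numa:
            on_candidate = core == candidate
            data = load_bytes[core] + (task_bytes if on_candidate else 0)
            k = data / llc_share[core]  # ratio of total data to cache share
            w = (transfer_time(inputs, core_numa[core], copy_cost)
                 if on_candidate else 0.0)
            total += combine(w, k, lam)
        if total < best_total:
            best_core, best_total = candidate, total
    return best_core, best_total
```

In a real system the returned core would then be used to bind the thread, for example through a CPU affinity call; that step is omitted here because it is platform specific.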

2. The data processing method according to claim 1, characterized in that allocating threads to node tasks based on the ready nodes, direct successor nodes, and total number of system threads in the dependency DAG graph comprises: sorting the ready nodes of the node tasks in the dependency DAG graph by number of edges, from most to least, and assigning the node tasks to the thread of the first online ready node.
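One possible reading of the allocation step in claim 2 is sketched below in Python. It assumes that each ready node at the front of the sorted order receives one thread and becomes "online", while nodes left without a thread are "offline"; this interpretation, the function name `allocate_threads`, and the alphabetical tie-break are assumptions, not details given in the claim.

```python
def allocate_threads(ready, edge_count, total_threads):
    """Split ready nodes into online (given a thread) and offline (not).

    ready:         iterable of ready node names.
    edge_count:    node name -> number of edges of that node in the DAG.
    total_threads: total number of system threads available.
    """
    # Most edges first; ties broken by name for a deterministic order.
    ordered = sorted(ready, key=lambda n: (-edge_count[n], n))
    online = ordered[:total_threads]
    offline = ordered[total_threads:]
    return online, offline
```

With ready nodes a, b, and c having 1, 3, and 2 edges and two system threads, b and c are brought online and a stays offline.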

3. The data processing method according to claim 1, characterized in that the remaining nodes to which no thread is allocated are offline ready nodes and offline direct successor nodes, and the method further comprises: the operating system tracking the readiness status of online blocked nodes based on the sequential dependencies in the PCB, while the readiness status of offline direct successor nodes is tracked and supported by user code or a user-mode runtime library.
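Readiness tracking from sequential dependencies is commonly done with a pending-predecessor counter, and a minimal Python sketch of that general technique follows. The claim does not say how the operating system or the user-mode runtime library implements tracking, so the class `ReadinessTracker` and its methods are hypothetical; the same counter could in principle sit on either side of the kernel boundary.

```python
class ReadinessTracker:
    """Tracks when blocked nodes become ready (illustrative sketch only)."""

    def __init__(self, predecessors):
        # predecessors: node -> list of nodes it depends on.
        self.pending = {n: len(p) for n, p in predecessors.items()}
        self.successors = {}
        for node, preds in predecessors.items():
            for p in preds:
                self.successors.setdefault(p, []).append(node)

    def complete(self, node):
        """Mark a node finished; return the nodes that just became ready."""
        newly_ready = []
        for succ in self.successors.get(node, []):
            self.pending[succ] -= 1
            if self.pending[succ] == 0:
                newly_ready.append(succ)
        return sorted(newly_ready)
```

A node with two predecessors becomes ready only after the second of them completes.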

4. A data processing device for a data stream, characterized in that it comprises:
an acquisition module, used to acquire the dependency DAG graph of node tasks and the data communication volume from the process control block (PCB) of the program, wherein the nodes in the dependency DAG graph represent the node tasks, and the edges connecting the nodes of the dependency DAG graph represent the number of transmissions and the data volume of the node tasks;
a processing module, used to allocate threads to node tasks based on the ready nodes, direct successor nodes, and total number of system threads in the dependency DAG graph; and
an execution module, used to schedule threads based on the current processor's system load, node task communication relationships, and communication volume;
wherein the execution module includes:
a first acquisition submodule, used to obtain the total amount of task data on each processor by counting the amount of task data in that processor;
a first processing submodule, used to pre-schedule tasks one by one to each processor according to a preset scheduling algorithm;
a second processing submodule, used to calculate the total latency estimate of all processors based on the pre-scheduling results and the total amount of task data; and
a first execution submodule, used to evaluate the various pre-scheduling results and bind the threads of the ready node tasks to the pre-scheduled processor with the smallest total latency estimate for data processing;
and wherein the first execution submodule includes:
a second acquisition submodule, used to calculate the estimated data transfer time W_edge of the node task on the dependency DAG graph, W_edge being the sum of the times taken to copy all input data from the NUMA node where the predecessor node is located to the NUMA node where the current processor is located;
a third acquisition submodule, used to obtain the total data capacity and the last-level cache capacity share of the node tasks on each processor core, and from them the ratio k of the total data capacity to the cache; and
a second execution submodule, used to calculate the total latency estimate for each processing core [formula not reproduced in this text], where λ is an empirical value.

5. The data processing device according to claim 4, characterized in that the processing module is further configured to sort the ready nodes of the node tasks in the dependency DAG graph by number of edges, from most to least, and assign the node tasks to the thread of the first online ready node.

6. A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the data processing method for a data stream as claimed in any one of claims 1 to 3.

7. A storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method for a data stream as described in any one of claims 1 to 3.