A method for processing distributed artificial intelligence tasks and a storage medium

By employing peer-to-peer networks and distributed state synchronization mechanisms in the distributed system, the problems of single-point performance bottlenecks and failure risks in distributed computing architectures are solved, achieving highly reliable and adaptive intelligent task scheduling, and improving the system's scalability and task execution efficiency.

CN122240256A · Pending · Publication Date: 2026-06-19 · DATANG HYDROPOWER SCI & TECH RES INST CO LTD +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DATANG HYDROPOWER SCI & TECH RES INST CO LTD
Filing Date
2026-03-03
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing distributed computing architectures suffer from single-point performance bottlenecks and failure risks, have high network environment requirements, insufficient security, fragmented resources and low utilization, and complex programming models, making it difficult to achieve highly reliable and adaptive intelligent task scheduling.

Method used

A peer-to-peer network protocol is adopted to enable each computing node to form an autonomous network. A global resource view is maintained through a distributed state synchronization mechanism. The target node for executing computing tasks is determined based on task attributes and real-time status. Intelligent decision-making is carried out using a multi-dimensional cost evaluation function to avoid single-point bottlenecks and failure risks of the central scheduler.

Benefits of technology

It achieves high scalability and service reliability, makes full use of distributed computing resources, improves task execution efficiency, realizes load balancing, and avoids the performance bottlenecks and failure risks of traditional central scheduler architecture.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a method for processing distributed artificial intelligence tasks and a storage medium. The distributed artificial intelligence task processing method provided by this invention enables each computing node (the first computing node) to autonomously network and securely interconnect based on a peer-to-peer network. By utilizing a distributed state synchronization mechanism to maintain a global resource view, each node can intelligently determine the optimal node (its own or a peer node) to execute the computing task based on the real-time global view and task attributes, thereby achieving local execution of the task or migration to the optimal peer node. This method effectively avoids the single-point performance bottlenecks and failure risks of traditional centralized scheduler architectures, significantly improving the system's scalability and service reliability. Simultaneously, adaptive scheduling decisions based on global state can fully utilize distributed computing resources, achieve load balancing, and improve the overall efficiency of task execution.

Description

Technical Field

[0001] This invention relates to the technical field of distributed computing and artificial intelligence systems, specifically to a method for processing distributed artificial intelligence tasks and a storage medium.

Background Technology

[0002] Existing distributed computing frameworks (such as Ray, Spark, and Kubernetes) generally adopt a centralized control architecture, in which one or more control nodes in the cluster are responsible for task scheduling and state maintenance. While this architecture is efficient in cloud data center scenarios, it has the following significant shortcomings in large-scale edge or heterogeneous network environments. Single-point bottlenecks and central dependency: failure of the head or master node can interrupt global scheduling or degrade performance; as the cluster grows, all node registration, heartbeat, task submission, and status query requests converge on the head node, which can easily become a bottleneck in computing, memory, and network bandwidth, limiting the scalability of the cluster.

[0003] High requirements for the network environment: existing frameworks rely heavily on stable, low-latency communication between worker nodes and the head node. In unstable network environments (such as wide area networks, IoT, or cross-region edge computing scenarios), an interrupted connection between a node and the head node directly causes node disconnection and task scheduling failure, resulting in poor robustness.

[0004] Weak security trust model and insufficient built-in security: existing systems assume that the internal network is trustworthy, which makes it difficult to prevent node spoofing, data eavesdropping, or malicious attacks and creates serious security risks. They must rely on external measures such as VPNs and firewalls for remediation, which provide little protection against security threats at the data level.

[0005] Fragmented computing resources, resource silos, and low utilization: Each application or project needs to deploy an independent cluster, causing computing resources to be tied to the application, forming "resource silos." This makes it difficult to achieve cross-project and cross-departmental computing resource sharing and peak / valley smoothing, resulting in a large amount of idle and wasted hardware resources. In multi-tenant environments, computing resources are often statically bound to a single project, leading to low resource pooling and sharing utilization.

[0006] The programming model is complex: existing frameworks are mostly based on interpreted languages (such as Python), which are subject to the Global Interpreter Lock (GIL), making it difficult to fully utilize multi-core and parallel performance.

[0007] In summary, among the relevant technologies, the distributed computing architecture based on a central scheduler has the drawback of single-point performance bottlenecks and failure risks caused by the central node, which leads to the technical problem of difficulty in achieving highly reliable and adaptive intelligent task scheduling.

Summary of the Invention

[0008] The purpose of this invention is to overcome the above-mentioned technical deficiencies and provide a method for processing distributed artificial intelligence tasks and a storage medium to solve the technical problem of difficulty in achieving highly reliable and adaptive intelligent task scheduling in related technologies.

[0009] To achieve the above-mentioned technical objectives, the present invention adopts the following technical solution. In a first aspect, the present invention provides a method for processing distributed artificial intelligence tasks, applied to a first computing node in a distributed system, the method comprising:
obtaining a task data frame to be processed, wherein the task data frame encapsulates the corresponding computing task and the dependency and attribute information of the computing task;
establishing, based on a peer-to-peer network protocol, a secure communication connection with at least one second computing node in the distributed system to join the peer-to-peer network of the distributed system;
obtaining and maintaining a global state view of the distributed system by performing distributed state synchronization with the at least one second computing node, wherein the global state view records the real-time resource status of each computing node in the distributed system;
determining, based on the attribute information of the task data frame and the global state view, the target computing node for executing the computing task, wherein the target computing node is the first computing node or one of the at least one second computing node;
if the target computing node is the first computing node, invoking, by the first computing node, its local environment to execute the computing task; and
if the target computing node is one of the at least one second computing node, migrating the task data frame to the target computing node so that the target computing node can execute the computing task.

[0010] Further, the step of acquiring the task data frame to be processed includes: generating the task data frame in response to a task description file received from the task orchestration layer; wherein the task description file is generated through a visual graphical orchestration interface or a declarative configuration file interface provided by the task orchestration layer, and is used to define a task flow containing at least one computation task.

[0011] Furthermore, the step of establishing a secure communication connection with at least one second computing node in the distributed system based on a peer-to-peer network protocol to join the peer-to-peer network of the distributed system includes: performing two-way authentication with the second computing node based on public-key cryptography to verify the legitimacy of the node identifiers and security tokens of both parties; and, after the two-way authentication succeeds, establishing an end-to-end encrypted communication channel with the second computing node.

[0012] Furthermore, the step of obtaining and maintaining a global state view of the distributed system by performing distributed state synchronization with at least one second computing node includes: adopting a dual-path synchronization mechanism that combines a primary communication path and a backup communication path to periodically interact with the at least one second computing node and obtain the real-time resource status of each node; merging the exchanged state data based on conflict-free replicated data types (CRDTs) to eliminate state conflicts; and updating the locally maintained global state view based on the merged state data, so that the global state view of the first computing node eventually converges with the global state views of the other computing nodes in the distributed system.

[0013] Furthermore, the primary communication path is a communication path constructed based on the minimum spanning tree (MST) of the network topology, and the backup communication path is a communication path constructed based on the Gossip protocol. The first computing node preferentially interacts with the at least one second computing node through the communication path constructed by the minimum spanning tree (MST), and switches to the path constructed by the Gossip protocol for state interaction when an anomaly is detected in the communication path constructed by the minimum spanning tree (MST).

[0014] Furthermore, the step of determining the target computing node for executing the computing task based on the attribute information of the task data frame and the global state view includes: calculating a comprehensive score for each candidate computing node using a preset multi-dimensional cost evaluation function, based on the resource requirement constraints indicated in the attribute information of the task data frame and the real-time resource status of the candidate computing nodes recorded in the global state view; and determining the target computing node from the candidate computing nodes based on the comprehensive score.

[0015] Furthermore, the multi-dimensional cost evaluation function calculates the comprehensive score based on at least two of the following factors: the computing resource utilization rate of the candidate computing node, the memory resource occupancy rate, the network transmission latency, and the node reliability index; wherein, for different types of computing tasks, the weighting coefficients of the corresponding factors in the multi-dimensional cost evaluation function can be dynamically adjusted.

[0016] Furthermore, the local environment is a programming language runtime environment that eliminates the global interpreter lock.

[0017] In a third aspect, the present invention provides an electronic device, comprising: a memory, and one or more processors communicatively connected to the memory; the memory stores instructions executable by the one or more processors, the instructions being executed by the one or more processors to cause the one or more processors to implement the method described above.

[0018] In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method.

[0019] Beneficial effects: The distributed artificial intelligence task processing method provided by this invention enables each computing node (the first computing node) to autonomously network and securely interconnect based on a peer-to-peer network. By utilizing a distributed state synchronization mechanism to maintain a global resource view, each node can intelligently determine the optimal node (its own or a peer node) to execute the computing task based on the real-time global view and task attributes, thereby achieving local execution of the task or migration to the optimal peer node. This method effectively avoids the single-point performance bottlenecks and failure risks of traditional centralized scheduler architectures, significantly improving the system's scalability and service reliability. Simultaneously, adaptive scheduling decisions based on global state can fully utilize distributed computing resources, achieve load balancing, and improve the overall efficiency of task execution.

Attached Figure Description

[0020] The accompanying drawings, which form part of this specification, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Figure 1 is a first flowchart of a method for processing distributed artificial intelligence tasks provided in an embodiment of the present invention;
Figure 2 is a second flowchart of a method for processing distributed artificial intelligence tasks provided in an embodiment of the present invention;
Figure 3 is a block diagram of a distributed artificial intelligence task processing device used in an embodiment of the present invention;
Figure 4 is a block diagram of an electronic device used in an embodiment of the present invention.

Detailed Implementation

[0021] The present invention will now be described in detail with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described herein can be combined with each other.

[0022] The following detailed description is exemplary and intended to provide further detailed explanation of the invention. Unless otherwise specified, all technical terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in this invention is for describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention.

[0023] In related technologies, distributed artificial intelligence computing systems are often used to handle computationally intensive tasks such as training machine learning models and inference on large-scale data. These systems break down complex computational tasks into multiple subtasks and schedule them to be executed in parallel on a cluster of multiple computing nodes (e.g., servers, virtual machines, or containers) to improve computational efficiency.

[0024] In related technologies, mainstream distributed computing frameworks often adopt a central scheduler architecture. This architecture includes a central node (e.g., the Driver in Spark, the Head Node in Ray) and multiple worker nodes. The central scheduler is responsible for receiving all computing tasks, maintaining the resource status of the entire cluster, and assigning execution nodes (worker nodes) to each subtask based on the global information it possesses. Worker nodes are responsible for receiving task instructions and data from the central scheduler, performing specific computations, and returning the results. Coordination of inter-task dependencies, data flow, and error recovery all rely on the central scheduler.

[0025] This architecture, centered around a central scheduler, has inherent flaws in its operating mechanism. First, the central scheduler easily becomes a performance bottleneck during task distribution, state management, and node coordination, limiting the scalability and task throughput of the entire cluster. Second, and more critically, the central scheduler constitutes a single point of failure. If this node fails, task scheduling for the entire cluster will be completely interrupted, unfinished task states may be lost, and the system requires complex recovery procedures or even manual intervention, severely compromising the system's high availability and service reliability. Furthermore, because all scheduling decisions are centralized on a single node, the system struggles to quickly and adaptively respond to dynamically changing load and network conditions within the cluster, limiting the flexibility and intelligence of the scheduling strategy.

[0026] Therefore, the following technical problems exist in the related technologies: the distributed computing architecture based on the central scheduler has single-point performance bottlenecks and failure risks caused by the central node, which makes it difficult to achieve highly reliable and adaptive intelligent task scheduling.

[0027] In view of this, the inventive concept of this invention lies in: abandoning a centralized scheduler, enabling each computing node to autonomously discover and securely interconnect based on a peer-to-peer network protocol, and utilizing a consistent distributed state synchronization mechanism to allow all nodes to maintain an approximate global resource view. Based on this, each node can serve as the starting point for scheduling decisions. By integrating intelligent decision-making logic that combines multi-dimensional resource evaluation and task adaptation, the optimal target node for executing computing tasks is determined locally or among peer nodes, thereby achieving a decentralized, highly reliable, and adaptive intelligent task scheduling technical solution.

[0028] As shown in Figure 1 and Figure 2, this embodiment provides a method for processing distributed artificial intelligence tasks, applied to the first computing node in a distributed system. The method includes: Step S10: Obtain the task data frame to be processed; wherein, the task data frame encapsulates the corresponding computing task and the dependency and attribute information of the computing task.

[0029] In this embodiment, the distributed system can be represented as a decentralized collaborative computing system formed by multiple geographically dispersed computing nodes, each with independent computing capabilities, connected through a peer-to-peer network protocol. In this system, all computing nodes are equal in status, eliminating the need for a centralized scheduler for task allocation and state management. Through direct communication between nodes, state synchronization, and distributed decision-making, collaborative execution of large-scale artificial intelligence tasks is achieved. The first computing node can be represented as any computing node in the distributed system that initiates or receives pending artificial intelligence tasks and performs core operations such as task acquisition, network access, state synchronization, target node decision-making, and task execution / migration. It can have the same hardware configuration, software modules, and network interaction permissions as other computing nodes in the distributed system (i.e., the second computing node).

[0030] In one possible implementation, the distributed system may include multiple computing nodes (a first computing node and at least one second computing node), a peer-to-peer network communication link, a distributed state sharing module, and a zero-trust security authentication system. The peer-to-peer network communication link is built based on a P2P protocol, supporting direct data transmission between nodes. The distributed state sharing module stores the real-time resource status and task execution status of each node, ensuring the consistency of the global state view. The zero-trust security authentication system is used for node authentication, data transmission encryption, and operation permission control.

[0031] In one possible specific implementation, the computing nodes in the distributed system can be industrial servers, edge computing gateways, smart terminal devices, and IoT sensor nodes, etc.

[0032] In this embodiment, the task data frame can be represented as a structured data encapsulation unit, which is used to bundle together the computational task to be executed and its related metadata, serving as the basic unit for scheduling, transmission, and execution in a distributed system. Specifically, the content encapsulated in the task data frame may include:
1. The corresponding computational task entity, such as the machine learning model training script to be executed, batch processing inference code, or data preprocessing function;
2. The dependencies of the computational task, which can be used to indicate the predecessor and/or successor tasks of the task in the execution logic. These dependencies can collectively constitute a task pipeline described by a directed acyclic graph (DAG);
3. The attribute information of the computational task, such as task type (e.g., CPU-intensive, GPU-intensive, I/O-intensive), estimated resource requirements (e.g., required number of CPU cores, memory size, GPU model and quantity), priority, and the identifier or location of the data input source.
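For illustration only, the three kinds of content listed above might be modelled as follows. This is a minimal sketch rather than the implementation of the embodiment; the class names, field names, and the `execution_order` helper are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TaskAttributes:
    """Attribute information; every field name here is illustrative."""
    task_type: str                 # e.g. "cpu", "gpu", "io"
    cpu_cores: int = 1
    memory_gb: float = 1.0
    gpu_count: int = 0
    priority: int = 0
    input_uri: str = ""            # identifier or location of the input data


@dataclass
class TaskDataFrame:
    """One schedulable unit: task entity, dependencies, and attributes."""
    task_id: str
    payload: str                   # the task entity, e.g. a script path
    attributes: TaskAttributes
    depends_on: list[str] = field(default_factory=list)   # predecessor task ids


def execution_order(frames: list[TaskDataFrame]) -> list[str]:
    """Return task ids in an order that respects the dependency DAG."""
    graph = {f.task_id: set(f.depends_on) for f in frames}
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(graph).static_order())


frames = [
    TaskDataFrame("train", "train.py", TaskAttributes("gpu", gpu_count=1), ["prep"]),
    TaskDataFrame("prep", "prep.py", TaskAttributes("cpu", cpu_cores=4)),
    TaskDataFrame("infer", "infer.py", TaskAttributes("gpu", gpu_count=1), ["train"]),
]
order = execution_order(frames)
print(order)
```

Recording only predecessor identifiers is enough to recover the whole pipeline, since successor relationships follow from the same graph.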

[0033] In one possible implementation, the first computing node can generate the task data frame in response to a task description file received from the task orchestration layer. The task description file can be generated by the task orchestration layer, which provides a visual graphical orchestration interface or a declarative configuration file interface. Users can define a task flow including at least one computing task by dragging and dropping graphical components or writing configuration files. The task orchestration layer can automatically generate a corresponding task description file based on this task flow. After receiving the task description file, the first computing node can parse the task information in the file and encapsulate it into a task data frame.
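As a hedged sketch of the parsing step described above, the following example turns a declarative task description into task data frames. The JSON schema (`flow`, `tasks`, `id`, `run`, `type`, `needs`) and the function name `build_frames` are hypothetical; the source does not specify the file format.

```python
import json

# Hypothetical declarative task description; the schema is an assumption.
DESCRIPTION = """
{
  "flow": "demo-pipeline",
  "tasks": [
    {"id": "prep",  "run": "prep.py",  "type": "cpu", "needs": []},
    {"id": "train", "run": "train.py", "type": "gpu", "needs": ["prep"]}
  ]
}
"""


def build_frames(description_text: str) -> list[dict]:
    """Parse a task description and wrap each task as a task data frame."""
    doc = json.loads(description_text)
    known_ids = {task["id"] for task in doc["tasks"]}
    frames = []
    for task in doc["tasks"]:
        # Reject references to tasks that the flow does not define.
        missing = set(task["needs"]) - known_ids
        if missing:
            raise ValueError(
                f"task {task['id']!r} depends on unknown tasks {sorted(missing)}"
            )
        frames.append({
            "task_id": task["id"],
            "payload": task["run"],
            "dependencies": list(task["needs"]),
            "attributes": {"task_type": task["type"], "flow": doc["flow"]},
        })
    return frames


frames = build_frames(DESCRIPTION)
print([f["task_id"] for f in frames])
```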

[0034] In one possible implementation, the first computing node can generate task data frames locally. When the first computing node has the ability to initiate tasks, it can directly construct computing tasks and their corresponding dependencies and attribute information based on the locally preset task configuration information, and then encapsulate them into task data frames.

[0035] In one possible implementation, the first computing node can receive task data frames from other computing nodes. In a distributed system, if other computing nodes determine that the first computing node is the better execution node during task scheduling, they can transmit the encapsulated task data frame to the first computing node. The first computing node then receives the task data frame through a secure communication channel, completing the acquisition process.

[0036] Step S12: Based on the peer-to-peer network protocol, establish a secure communication connection with at least one second computing node in the distributed system to join the peer-to-peer network of the distributed system.

[0037] In this embodiment, the peer-to-peer network protocol can be represented as a network communication protocol without a central control node, where all nodes have equal status, supporting direct interaction and resource sharing between nodes without the need for data forwarding through a central node. The second computing node can be represented as any other node in the distributed system besides the first computing node that possesses computing capabilities and participates in the peer-to-peer network. The secure communication connection can be represented as a communication link with authentication and data encryption functions, capable of preventing security risks such as node forgery, data eavesdropping, and data tampering.

[0038] In one possible implementation, after the first computing node starts up, it can actively search for or passively learn of at least one second computing node in the network through a node discovery mechanism (e.g., the multicast-based mDNS protocol, the Kademlia protocol based on a distributed hash table, or querying a pre-configured list of seed nodes). Once the second computing node with which a connection is to be established has been identified, both parties can perform a two-way authentication process based on a zero-trust security model. For example, the first computing node can use its private key to sign its digital identity (e.g., a certificate) and a security token that includes a timestamp, and send the signature, its public key, and the token to the target second computing node. The target second computing node uses the received public key to verify the validity of the signature and checks the legitimacy and freshness of the token (e.g., checking the timestamp to prevent replay attacks). Similarly, the first computing node verifies the identity of the other node. Only after both authentications succeed do the two parties establish an end-to-end, two-way encrypted communication channel using the Transport Layer Security (TLS) protocol or a successor to it. By establishing such secure connections with one or more second computing nodes in sequence, the first computing node can integrate into the peer-to-peer network of the distributed system.
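The token checks described above (signature validity, timestamp freshness, replay rejection) can be sketched as follows. This is an illustration under loud assumptions: the Python standard library has no public-key signatures, so an HMAC over a shared demo secret stands in for the private-key signature, which is not what the embodiment describes. The function names, the token layout, and the 30-second freshness window are likewise hypothetical.

```python
import hashlib
import hmac
import json

MAX_TOKEN_AGE_S = 30.0   # assumed freshness window


def make_token(node_id: str, key: bytes, now: float) -> dict:
    """Build a timestamped token and sign it.

    ASSUMPTION: HMAC-SHA256 replaces the public-key signature of the embodiment.
    """
    body = json.dumps({"node_id": node_id, "ts": now}, sort_keys=True).encode()
    return {"body": body, "sig": hmac.new(key, body, hashlib.sha256).digest()}


def verify_token(token: dict, key: bytes, now: float, seen: set) -> bool:
    """Check the signature, then freshness, then reject an already-seen token."""
    expected = hmac.new(key, token["body"], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, token["sig"]):
        return False                       # forged or tampered token
    ts = json.loads(token["body"])["ts"]
    if abs(now - ts) > MAX_TOKEN_AGE_S:
        return False                       # stale token
    if token["sig"] in seen:
        return False                       # replayed token
    seen.add(token["sig"])
    return True


key = b"demo-shared-secret"
seen_by_b: set = set()
t0 = 1_000_000.0
token_a = make_token("node-a", key, t0)

ok_fresh = verify_token(token_a, key, t0 + 1, seen_by_b)
ok_replay = verify_token(token_a, key, t0 + 2, seen_by_b)
ok_stale = verify_token(make_token("node-a", key, t0), key, t0 + 120, set())
tampered = {"body": token_a["body"] + b" ", "sig": token_a["sig"]}
ok_tampered = verify_token(tampered, key, t0 + 1, set())
print(ok_fresh, ok_replay, ok_stale, ok_tampered)
```

For two-way authentication, each node would run `verify_token` on the token received from the other side before the encrypted channel is opened.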

[0039] Step S14: Obtain and maintain a global state view of the distributed system by performing distributed state synchronization with at least one second computing node; wherein the global state view records the real-time resource status of each computing node in the distributed system.

[0040] In this embodiment, distributed state synchronization can be represented as the process by which the nodes of a peer-to-peer network communicate with one another so that their respective information about the overall system state gradually converges. The global state view can be represented as a dataset maintained locally by the first computing node, characterizing the real-time operating state of all participating nodes in the distributed system, and serving as the basis for task scheduling decisions. The real-time resource state can be represented as quantifiable state parameters such as the current hardware resource usage, network communication status, and node reliability of each computing node. The maintenance action can be represented as the first computing node updating and correcting its local global state view based on dynamic changes in node states, so that the view reflects the latest known state of the distributed system.

[0041] In one possible implementation, the first computing node may exchange status information with one or more second computing nodes, periodically or in an event-driven manner, via a secure communication connection established in step S12. The status information may include: the identifier of each computing node, its network address, and real-time resource load information, such as the utilization rate of the central processing unit (CPU), the utilization rate and video memory usage of the graphics processing unit (GPU), memory occupancy, disk I/O status, network bandwidth availability, and node reliability coefficients calculated based on historical operating data.

[0042] In one possible implementation, state information can be propagated and collected primarily via a low-latency, high-efficiency main communication path (e.g., a minimum spanning tree path dynamically constructed based on the current peer-to-peer network topology). Simultaneously, a Gossip-based communication path can be used as a backup. When a communication anomaly or timeout is detected on the main path, the system automatically switches to the backup path for pervasive propagation of state information, ensuring robust synchronization during network fluctuations.
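The dual-path idea above can be sketched as follows: a minimum spanning tree over measured link latencies provides the primary peers, and a random gossip sample is used when the primary path is reported unhealthy. Kruskal's algorithm is used here as one common way to build an MST; the source does not name an algorithm. The function names, the `fanout` parameter, and the example topology are assumptions.

```python
import random


def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm. edges is a list of (latency, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    tree = []
    for latency, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # edge joins two components
            parent[root_u] = root_v
            tree.append((latency, u, v))
    return tree


def sync_targets(node, tree, all_nodes, primary_ok, fanout=2, rng=random):
    """Pick peers for the next state exchange.

    Use MST neighbours while the primary path is healthy; otherwise fall
    back to a random gossip sample of the other nodes.
    """
    if primary_ok:
        return sorted({v if u == node else u for _, u, v in tree if node in (u, v)})
    others = [n for n in all_nodes if n != node]
    return rng.sample(others, min(fanout, len(others)))


nodes = ["a", "b", "c", "d"]
edges = [(5, "a", "b"), (1, "b", "c"), (4, "a", "c"), (2, "c", "d"), (9, "a", "d")]
tree = minimum_spanning_tree(nodes, edges)
primary = sync_targets("c", tree, nodes, primary_ok=True)
backup = sync_targets("c", tree, nodes, primary_ok=False, rng=random.Random(0))
print(tree, primary, backup)
```

The tree keeps message volume low in the normal case, while the gossip sample does not depend on any single link, which is why it suits the fallback role.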

[0043] In this embodiment, upon receiving state information from other nodes, the first computing node processes it using a conflict-free data merging algorithm. For example, a conflict-free replicated data type (CRDT) is used to store and merge state data. This data type ensures that regardless of the order in which state updates arrive from different nodes, the first computing node can deterministically merge them into its locally maintained global state view, automatically eliminating potential data conflicts, so that the local global state view eventually converges to the same content as the views of the other nodes in the system.
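One simple state-based CRDT that fits this description is a map holding one versioned entry per node, merged by keeping the higher version. The sketch below is an assumed design, not the data type used by the embodiment: it relies on each node being the only writer of its own entry and increasing `version` on every update, so equal versions always carry equal content.

```python
def merge_views(local: dict, remote: dict) -> dict:
    """Merge two global state views keyed by node id.

    Per node, the entry with the higher version wins. Under the single-writer
    assumption this merge is commutative, associative, and idempotent.
    """
    merged = dict(local)
    for node_id, entry in remote.items():
        current = merged.get(node_id)
        if current is None or entry["version"] > current["version"]:
            merged[node_id] = entry
    return merged


view_a = {"n1": {"version": 3, "cpu": 0.40}, "n2": {"version": 1, "cpu": 0.90}}
view_b = {"n2": {"version": 2, "cpu": 0.55}, "n3": {"version": 7, "cpu": 0.10}}

ab = merge_views(view_a, view_b)
ba = merge_views(view_b, view_a)
print(ab == ba, ab["n2"])
```

Because the result does not depend on arrival order, two nodes that have seen the same set of updates hold the same view, which is the convergence property the paragraph above relies on.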

[0044] Step S16: Based on the attribute information of the task data frame and the global state view, determine the target computing node for executing the computing task; wherein the target computing node is one of the first computing node or at least one second computing node.

[0045] In this embodiment, the target computing node can be represented as the node most suitable for executing the current computing task selected from the first computing node and all second computing nodes, and is the specific carrier of task execution.

[0046] In one possible implementation, the first computing node can parse the attribute information in the task data frame obtained in step S10, extracting the resource requirement constraints of the computing task (e.g., requiring GPU acceleration, requiring large memory) and the task type label. Simultaneously, it queries the global state view maintained in step S14 to obtain the real-time resource status of all candidate computing nodes (including the first computing node itself and other available second computing nodes). Then, the first computing node calls a preset multi-dimensional cost evaluation function. This function can take the resource status parameters of the candidate nodes in various dimensions (e.g., CPU utilization, memory usage, network latency, node reliability coefficient, etc.) and task attribute information as input. Internally, the function can preset corresponding weighting coefficients for each evaluation dimension. Specifically, these weighting coefficients can be dynamic and adjusted according to the task type of the current computing task. For example, for GPU-intensive tasks, the weights of GPU utilization and memory-related dimensions can be increased. For pipeline tasks that require frequent exchange of intermediate data, the weight of the network latency dimension can be increased.

[0047] This evaluation function can calculate a comprehensive score for each candidate computing node. The score reflects the estimated cost or suitability of performing the current task on that node; a higher score indicates a better suitability. Finally, the first computing node can determine the target computing node from all candidate nodes based on the comprehensive score. This can be done by selecting the node with the highest comprehensive score. The target computing node can be the first computing node itself or a second computing node.
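The scoring and selection described in the two preceding paragraphs might look like the following sketch. The weight values, the normalization of each dimension to a 0-to-1 "headroom" figure, the 200 ms latency cap, and all names are assumptions chosen for the example; the source specifies only that the weights vary by task type and that the highest score can be selected.

```python
BASE_WEIGHTS = {"cpu": 0.3, "memory": 0.3, "latency": 0.2, "reliability": 0.2}
# Hypothetical per-task-type weights; a real system would tune these.
TYPE_WEIGHTS = {
    "cpu_intensive": {"cpu": 0.5, "memory": 0.2, "latency": 0.1, "reliability": 0.2},
    "pipeline": {"cpu": 0.2, "memory": 0.2, "latency": 0.4, "reliability": 0.2},
}
MAX_LATENCY_MS = 200.0   # latencies at or above this count as zero headroom


def score(node: dict, task: dict) -> float:
    """Weighted sum of per-dimension headroom; higher means more suitable."""
    weights = TYPE_WEIGHTS.get(task["type"], BASE_WEIGHTS)
    headroom = {
        "cpu": 1.0 - node["cpu_util"],
        "memory": 1.0 - node["mem_util"],
        "latency": 1.0 - min(node["latency_ms"], MAX_LATENCY_MS) / MAX_LATENCY_MS,
        "reliability": node["reliability"],
    }
    return sum(weights[k] * headroom[k] for k in weights)


def pick_target(view: dict, task: dict):
    """Filter by the hard resource constraint, then take the best score."""
    candidates = {nid: n for nid, n in view.items()
                  if n["free_mem_gb"] >= task["mem_gb"]}
    if not candidates:
        return None
    return max(candidates, key=lambda nid: score(candidates[nid], task))


view = {
    "local": {"cpu_util": 0.9, "mem_util": 0.5, "latency_ms": 0.0,
              "reliability": 0.99, "free_mem_gb": 16},
    "peer1": {"cpu_util": 0.2, "mem_util": 0.3, "latency_ms": 120.0,
              "reliability": 0.95, "free_mem_gb": 32},
    "peer2": {"cpu_util": 0.1, "mem_util": 0.1, "latency_ms": 180.0,
              "reliability": 0.90, "free_mem_gb": 4},
}
cpu_choice = pick_target(view, {"type": "cpu_intensive", "mem_gb": 8})
pipe_choice = pick_target(view, {"type": "pipeline", "mem_gb": 8})
print(cpu_choice, pipe_choice)
```

With these example numbers the CPU-intensive task moves to the lightly loaded peer, while the latency-sensitive pipeline task stays local, showing how the task type alone can change the decision for the same global view.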

[0048] Step S18: If the target computing node is the first computing node, the first computing node calls its local environment to execute the computing task. If the target computing node is one of the at least one second computing node, the task data frame is migrated to the target computing node so that the target computing node can execute the computing task.

[0049] In this embodiment, the local environment can be represented as a software runtime environment provided on the first computing node, capable of loading and running the code and data upon which the computing task depends. Specifically, this environment can be a programming language runtime environment without a global interpreter lock. For example, it could be a runtime built on the Mojo language. This type of environment allows multiple threads within a computing task to execute in true parallel, thereby fully utilizing the computing power of a multi-core CPU and avoiding performance bottlenecks caused by a global interpreter lock. The first computing node invokes this local runtime to submit the computing task in the task data frame for execution.

[0050] In this embodiment, when remote migration is required, the first computing node can migrate the entire task data frame (or the portion necessary for its execution) to the second computing node identified as the target computing node via a secure communication connection established or temporarily created in step S12. The migration process can be implemented using a high-performance data transmission protocol, such as streaming using the Apache Arrow Flight framework, to reduce migration latency. After the task data frame arrives at the target second computing node, it is received by that node, which then invokes its own local environment (preferably a runtime environment without a global interpreter lock) to execute the encapsulated computational task. The execution result can be returned to the first computing node or passed to the node where the subsequent task resides, according to the dependencies indicated in the task data frame.

[0051] The distributed artificial intelligence task processing method provided in this embodiment enables each computing node (the first computing node) to autonomously network and securely interconnect based on a peer-to-peer network. By utilizing a distributed state synchronization mechanism to maintain a global resource view, each node can intelligently determine the optimal node (its own or a peer node) to execute the computing task based on the real-time global view and task attributes, thereby achieving local execution of the task or migration to the optimal peer node. This method effectively avoids the single-point performance bottlenecks and failure risks of traditional centralized scheduler architectures, significantly improving the system's scalability and service reliability. Simultaneously, adaptive scheduling decisions based on global state can fully utilize distributed computing resources, achieve load balancing, and improve the overall efficiency of task execution.

[0052] In some implementations, the step of acquiring the task data frame to be processed includes: Step S102: In response to the task description file received from the task orchestration layer, generate the task data frame; wherein the task description file is generated through a visual graphical orchestration interface or a declarative configuration file interface provided by the task orchestration layer, and is used to define a task flow containing at least one computation task.

[0053] In this embodiment, the task orchestration layer can be represented as a functional layer independent of the distributed system peer-to-peer network, used to receive user task configurations, generate task execution logic, and output task description files. The task description file can be represented as a structured text or binary file generated by the task orchestration layer, recording task flow definition information.

[0054] In one possible implementation, a visual graphical orchestration interface is provided. This interface presents an interactive graphical canvas to the user. Users can drag and drop graphical components representing different computational tasks from a predefined library of computational units (e.g., data loading, data cleaning, feature extraction, model training, model evaluation, result export, etc.) onto the canvas. Users then define the data flow and dependencies between operators by drawing connecting lines, thus visually constructing a directed acyclic graph (DAG) structured workflow. Users can also configure specific parameters for each operator (e.g., script path, input data source, hyperparameters, etc.) through a property panel. Once the user completes the orchestration and triggers the execution command, the task orchestration layer can convert the graphical workflow description on the canvas into a structured task description file.

[0055] In one possible specific implementation, a declarative configuration file interface is used. This interface allows users to define task flows declaratively by writing text files in a predefined format (e.g., YAML files, JSON files, or DSL scripts). In this configuration file, users can list all computational tasks included in the workflow using code or configuration key-value pairs, and explicitly specify the type of each task, its execution entry point, its required parameters, and the input / output dependencies between tasks.

[0056] In this embodiment, the task scheduling client or agent running on the first computing node can receive a task description file from the task orchestration layer. This receiving action can be the result of the first computing node actively polling the task orchestration layer to request new tasks, or it can be that the task orchestration layer actively pushes the task to the first computing node after preparing it. After receiving the task description file, the first computing node initiates the parsing and generation process. First, the task description file is parsed for syntax and structure, extracting each defined computing task instance, the attribute configuration of each task, and the dependency graph between tasks. Then, based on the parsing results, an independent task data frame is generated for each computing task instance in the workflow. When generating each task data frame, the execution code (or references to it), configuration parameters, input data source information, etc., of the corresponding computing task parsed from the description file can be encapsulated into the computing task content of the data frame. The predecessor and successor task identifiers of the parsed task in the graph can be encapsulated into the dependency relationship of the data frame. The parsed task type, resource requirements, priority, and other attributes can be encapsulated into the attribute information of the data frame.
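As a non-limiting illustration of the parsing and generation process described above, the following Python sketch converts a JSON task description into one task data frame per computing task. The field names (`tasks`, `id`, `entry`, `depends_on`, `resources`, and so on) and the `TaskDataFrame` class are hypothetical and chosen only for this example; the embodiment does not prescribe a concrete file schema.

```python
import json
from dataclasses import dataclass


@dataclass
class TaskDataFrame:
    task_id: str
    content: dict        # computing task content: entry point, parameters, input source
    dependencies: dict   # predecessor and successor task identifiers
    attributes: dict     # task type, resource requirements, priority


def build_frames(description_text):
    """Parse a task description (hypothetical JSON schema) into task data frames."""
    tasks = json.loads(description_text)["tasks"]
    known = {t["id"] for t in tasks}
    successors = {task_id: [] for task_id in known}
    # Derive successor lists from the declared predecessors.
    for t in tasks:
        for dep in t.get("depends_on", []):
            if dep not in known:
                raise ValueError(f"task {t['id']!r} depends on unknown task {dep!r}")
            successors[dep].append(t["id"])
    frames = []
    for t in tasks:
        frames.append(TaskDataFrame(
            task_id=t["id"],
            content={"entry": t["entry"],
                     "params": t.get("params", {}),
                     "input": t.get("input")},
            dependencies={"predecessors": list(t.get("depends_on", [])),
                          "successors": successors[t["id"]]},
            attributes={"type": t.get("type", "generic"),
                        "resources": t.get("resources", {}),
                        "priority": t.get("priority", 0)},
        ))
    return frames


EXAMPLE_DESCRIPTION = """
{"tasks": [
  {"id": "load", "entry": "load.py", "type": "io"},
  {"id": "train", "entry": "train.py", "type": "gpu_training",
   "depends_on": ["load"], "resources": {"needs_gpu": true}, "priority": 5}
]}
"""

frames = build_frames(EXAMPLE_DESCRIPTION)
```

In this sketch the dependency graph is stored on each frame as predecessor and successor identifiers, so a receiving node can decide where to forward results without consulting the original description file.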

[0057] In some implementations, the step of establishing a secure communication connection with at least one second computing node in the distributed system based on a peer-to-peer network protocol to join the peer-to-peer network of the distributed system includes: Step S122: Perform two-way authentication with the second computing node based on public-key cryptography to verify the legitimacy of the node identifiers and security tokens of both parties.

[0058] In this embodiment, public-key cryptography can be represented as a cryptographic technique based on asymmetric key pairs (public and private keys) to achieve encryption, decryption, and digital signatures. The public key can be publicly transmitted, while the private key can be held exclusively by a node. The two-way authentication can be represented as a process where a first computing node and a second computing node mutually verify each other's identity, unlike one-way authentication (where only one party verifies the other), ensuring that both parties are legitimate nodes in the distributed system and preventing malicious nodes from accessing the system. The node identifier can be represented as characteristic information used to uniquely distinguish each computing node in the distributed system, possessing global uniqueness and serving as the basis for identity authentication. The security token can be represented as a temporary identity credential generated by a pre-set authentication center in the distributed system or through negotiation between nodes.

[0059] In one possible implementation, each computing node in the distributed system (including the first and second computing nodes) is pre-configured with its own asymmetric key pair (public and private keys). The private key can be securely stored locally on the node (e.g., via an encryption chip or secure memory area), while the public key can be shared with other nodes in the system through a pre-defined trusted channel or temporarily transmitted securely during the initial authentication phase. Each computing node can pre-store a list of node identifiers (or node identifier verification rules) of legitimate nodes in the system to initially determine whether the other node belongs to the system's accessible range. The security token can be requested by the node from the system's trusted authentication module before authentication is initiated, or generated by the node initiating authentication based on its own private key and carrying valid information.

[0060] In one possible implementation, after the first computing node discovers the second computing node, it can generate an authentication request message. The message content may include: the first computing node's node identifier, its public key, a security token, and a digital signature of the above information using the first computing node's private key (the signature is to prevent information tampering). The authentication request message can be transmitted to the second computing node via a temporary basic communication link (this basic communication link is only used to transmit authentication-related information and is not encrypted). Upon receiving the authentication request message, the second computing node first extracts the first computing node's node identifier and compares it against a locally stored list of valid node identifiers (or according to preset verification rules) to determine if the node identifier is a valid identifier recognized by the system. If not, authentication is rejected directly, and the basic communication link is disconnected. If the node identifier is valid, the second computing node can extract the first computing node's public key and use it to decrypt and verify the digital signature in the authentication request message. If the decrypted information matches the original information (node identifier, public key, security token) in the authentication request, the signature of the first computing node is confirmed to be valid, and the information has not been tampered with. The second computing node can also further verify the validity of the security token. The specific verification content may include: the validity period of the security token (to determine if it has expired), the scope of permissions in the security token (to determine if the user has the right to access the peer network), and the signature information of the security token (if the token is generated by a trusted authentication module, the signature must be verified using the public key of the authentication module). If the security token verification fails, authentication is rejected and the reason for failure is reported. After the second computing node completes the identity verification of the first computing node, if all verification items pass, it can generate an authentication response message. The message content may include: the node identifier of the second computing node, the public key of the second computing node, the security token of the second computing node, and a digital signature of the above information using the private key of the second computing node. This response message can be transmitted to the first computing node, simultaneously initiating the reverse authentication.

[0061] After receiving the authentication response message, the first computing node can use the same verification logic as the second computing node to sequentially verify the legitimacy of the second computing node's node identifier, the validity of the digital signature, and the legitimacy of the security token. The verification process is consistent with the above steps, ensuring that the second computing node is a legitimate node in the system.

[0062] If the first computing node successfully verifies the identity of the second computing node, it sends an authentication confirmation message to the second computing node. If any verification item fails, it sends an authentication failure message and disconnects the communication link. After the second computing node receives the authentication confirmation message, the two-way authentication is complete. If it receives an authentication failure message or does not receive a confirmation message within a preset timeout period, it terminates the authentication process.

[0063] Step S124: After the two-way authentication is successful, establish an end-to-end encrypted communication channel with the second computing node.

[0064] In this embodiment, the end-to-end encrypted communication channel can be represented as a communication link directly established between the first computing node and the second computing node, with data transmission encrypted throughout.

[0065] In one possible specific implementation, after two-way authentication is successful, the first computing node and the second computing node can exchange encryption-parameter negotiation requests and responses through the basic communication link (which is still unencrypted at this time, but can temporarily carry negotiation information because both identities have been verified).

[0066] The parameters to be negotiated can include the following. Encryption algorithm type: this can be selected from the AES algorithm (e.g., AES-256), the ChaCha20 algorithm, the SM4 algorithm, etc.

[0067] Key exchange method: this is the method used to generate the session key, which can be a public-key cryptography-based key exchange protocol (e.g., the ECDH protocol or RSA key exchange) or a pre-shared key.

[0068] Data verification algorithm: this is the algorithm used to verify the integrity of transmitted data, such as a hash algorithm like SHA-256 or SM3.

[0069] Communication protocol: the basic communication protocol can be TCP, UDP, etc., which can be combined with the encryption mechanism to form a secure communication protocol.

[0070] In one possible implementation, the ECDH key exchange protocol can be used to generate the session key. Specifically, the first computing node and the second computing node can calculate the same shared key using the ECDH algorithm, based on their own private key and the other party's public key, respectively. This shared key is then hashed to generate the final session key (the session key is a symmetric encryption key used for subsequent encrypted data transmission).
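As a non-limiting illustration of how both nodes arrive at the same session key, the following Python sketch uses classic finite-field Diffie-Hellman as a stand-in for ECDH so that it runs with the standard library alone. The group parameters `P` and `G` are toy values assumed for this example and are not secure; a practical implementation would use a vetted ECDH library and a standard key derivation function rather than a bare hash.

```python
import hashlib
import secrets

# Toy group parameters (assumption for illustration only, NOT secure).
P = 2**127 - 1   # a Mersenne prime used as the modulus
G = 3            # public base


def make_keypair():
    """Return (private, public) for one node."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def derive_session_key(own_private, peer_public):
    """Combine own private key with the peer's public key, then hash the result."""
    shared = pow(peer_public, own_private, P)
    shared_bytes = shared.to_bytes((P.bit_length() + 7) // 8, "big")
    return hashlib.sha256(shared_bytes).digest()   # 32-byte symmetric session key


# Each node generates its key pair locally and only the public halves are exchanged.
first_private, first_public = make_keypair()
second_private, second_public = make_keypair()

key_at_first = derive_session_key(first_private, second_public)
key_at_second = derive_session_key(second_private, first_public)
```

Both calls yield the same 32-byte value because each side raises the other's public value to its own private exponent, so the session key itself never crosses the link.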

[0071] In one possible implementation, RSA key exchange is used to generate the session key. Specifically, the first computing node can randomly generate the session key, encrypt it using the public key of the second computing node, and transmit it to the second computing node. The second computing node then decrypts the session key using its own private key to obtain the session key.

[0072] In one possible implementation, after the first and second computing nodes negotiate encryption parameters and generate a session key, each can configure its local encrypted communication module. Specifically, they can set the encryption algorithm, session key, data verification algorithm, and communication protocol parameters to initialize the encrypted communication link. After initialization, the first computing node can send a channel-ready message (encrypted using the session key) to the second computing node. The second computing node receives and decrypts this message and then sends back a confirmation message (also encrypted). Once the first computing node receives the confirmation message, the end-to-end encrypted communication channel is formally established. All subsequent data transmissions (e.g., task data frames, status synchronization information) will be conducted through this channel.

[0073] In some implementations, the step of obtaining and maintaining a global state view of the distributed system by performing distributed state synchronization with at least one second computing node includes: Step S142: A dual-path synchronization mechanism combining the main communication path and the backup communication path is adopted to periodically interact with the at least one second computing node to obtain the real-time resource status of each node.

[0074] In this embodiment, the first computing node can maintain and dynamically update two logical communication topologies within the established secure peer-to-peer network. The main communication path can be constructed as a tree structure covering the entire network, acyclic, and with minimal connection overhead. For example, it can be a minimum spanning tree communication path dynamically calculated based on the current network topology (considering factors such as inter-node latency and bandwidth). This path can serve as an efficient, low-redundancy state synchronization backbone. The first computing node can preferentially send its own state information to its neighboring nodes through this path and receive update information from these neighboring nodes that aggregates the states of other nodes in the network, thereby achieving rapid and orderly propagation of state information along the tree structure.

[0075] In this embodiment, the backup communication path can adopt a decentralized, randomized communication mode, such as a communication path built on the Gossip protocol. In this mode, the first computing node can periodically and randomly select several other nodes in the network (not necessarily its direct neighbors in the minimum spanning tree (MST) of the main path) and send them its own state or the partial global state it knows, in a gossip-style exchange, while also receiving random state pushes from other nodes.

[0076] In one possible implementation, the first computing node can preferentially perform periodic, regular state interactions via the main path of the minimum spanning tree (MST). Simultaneously, it can continuously monitor the communication quality of the main path. If an anomaly is detected in the interaction with certain neighboring nodes via the MST path (e.g., consecutive timeouts or an excessively high packet loss rate), a switchover is automatically triggered, transferring some or all of the state interaction traffic to the backup path based on the Gossip protocol. This ensures that the state synchronization process is not interrupted during local network fluctuations or temporary node failures. The real-time resource status being exchanged can be a dataset, which may specifically include: the identifier of each computing node, its network endpoint address, CPU utilization, memory utilization and availability, GPU utilization and video memory status, disk storage usage, network interface input / output bandwidth, network round-trip latency to several key peer nodes, and node reliability metrics calculated from the historical task success rate and online duration.

[0077] Step S144: Merge the exchanged state data based on conflict-free replicated data types (CRDTs) to eliminate state conflicts.

[0078] In this embodiment, the conflict-free replicated data type (CRDT) can be represented as a data structure that allows data merging among distributed nodes without central coordination and ensures consistency of the merging results.

[0079] The state conflict can be represented as a situation where real-time resource state data sent by different nodes to the same target node differs (for example, node A perceives node B's CPU load rate as 30%, while node C perceives node B's CPU load rate as 50%). The merging process can be represented as the first computing node calling the CRDT's processing logic to integrate the state data from different nodes in the temporary buffer.

[0080] In one possible implementation, the appropriate CRDT type can be selected based on the type of real-time resource status data. For example, for monotonically increasing data such as node online time and task execution count, a grow-only counter (G-Counter) can be selected; for fluctuating data such as CPU load rate and memory usage ratio, a last-writer-wins register (LWW-Register) can be selected; and for data that requires a combined evaluation from multiple nodes, such as node reliability coefficients, a weighted set (Weighted-Set) can be selected.

[0081] In one possible implementation, the synchronization module of the first computing node can pre-define a mapping table between various status data and CRDT types. After receiving status data, it can automatically match the corresponding CRDT processing logic according to the data type. At the same time, the core parameters of CRDT are configured (e.g., the timestamp precision of LWW-Register is in milliseconds, and the evaluation weight of each node in Weighted-Set is the node's own reliability coefficient).

[0082] In one possible implementation, the first computing node can first classify the state data in the temporary cache, grouping it by target node identifier and data type. Then, within each group, it retains only the data whose timestamps fall within a preset validity period (e.g., within 10 seconds) and discards expired data (to avoid using outdated states). Consistency checks can be performed on each group of data. If the core values of all data within a group (e.g., a specific percentage of CPU load) are completely identical, the group is determined to be conflict-free, and the data is directly considered valid. If there are numerical differences, a state conflict is determined, and the CRDT merging logic is initiated.

[0083] Specifically, for the LWW-Register type (e.g., CPU load rate), timestamps can be extracted from each group of data, and the data with the latest timestamp can be selected as the merged result. If the timestamps are the same, the reliability coefficients of the data sending nodes are further compared, and data sent by the node with the higher coefficient is selected to ensure the accuracy of the result. For the Weighted-Set type (e.g., node reliability coefficient), the reliability coefficients evaluated by each node in each group can be extracted, combined with the weights of each sending node (its own reliability coefficient), and a weighted average can be calculated as the merged result (e.g., if node A evaluates node B with a coefficient of 0.8 and a weight of 0.9, and node C evaluates node B with a coefficient of 0.9 and a weight of 0.8, then the merged result is (0.8×0.9+0.9×0.8) / (0.9+0.8)≈0.85). For the G-Counter type (e.g., online duration), the values of all data in the group can be directly summed to obtain the merged result (since online duration is incremental data, the summation can reflect the true online status of the nodes). After the merge is complete, the results can be validated for reasonableness (e.g., CPU load rate should be between 0% and 100%, and memory usage should not be negative). If a result falls outside reasonable limits, the abnormal data should be removed, and the next best data should be selected for merging to ensure the validity of the merged results.
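As a non-limiting illustration of the LWW-Register and Weighted-Set merging rules described above, the following Python sketch merges a group of reports about one target node. The `Report` class and its field names are hypothetical; the final tie-break on the sender identifier is an added assumption so that every node picks the same winner even when timestamps and reliability coefficients are both equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    sender: str                 # node that sent this observation
    value: float                # observed value (e.g., CPU load in percent)
    timestamp_ms: int           # collection time in milliseconds
    sender_reliability: float   # reliability coefficient of the sender


def merge_lww(reports):
    """Latest timestamp wins; ties fall back to sender reliability, then sender id."""
    winner = max(reports, key=lambda r: (r.timestamp_ms, r.sender_reliability, r.sender))
    return winner.value


def merge_weighted(reports):
    """Weighted average of the values, weighted by each sender's reliability."""
    total_weight = sum(r.sender_reliability for r in reports)
    if total_weight <= 0:
        raise ValueError("at least one report must carry a positive weight")
    return sum(r.value * r.sender_reliability for r in reports) / total_weight


def is_reasonable_percent(value):
    """Post-merge sanity check for percentage-valued state such as CPU load."""
    return 0.0 <= value <= 100.0


# Conflicting CPU load reports about node B from nodes A and C.
cpu_reports = [Report("A", 30.0, 1000, 0.9), Report("C", 50.0, 1200, 0.8)]
merged_cpu = merge_lww(cpu_reports)

# Reliability evaluations of node B, using the figures from the example above.
reliability_reports = [Report("A", 0.8, 1000, 0.9), Report("C", 0.9, 1000, 0.8)]
merged_reliability = merge_weighted(reliability_reports)
```

Because `max` over a totally ordered key and a weighted average are both independent of the order in which reports arrive, two nodes holding the same set of reports compute the same merged value.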

[0084] Step S146: Update the locally maintained global state view according to the merged state data, so that the global state view of the first computing node eventually converges with the global state views of other computing nodes in the distributed system.

[0085] In this embodiment, the locally maintained global state view can be represented as a structured dataset stored locally on the first computing node, used to record the real-time state of all legitimate nodes in the distributed system. Eventual convergence can be represented as the property that, after a finite period of state synchronization and merging, the global state views of all computing nodes in the distributed system hold completely consistent records of the same state data for the same target node (or records whose differences are within a preset error range), ensuring that the scheduling decisions of all nodes are based on a unified standard.

[0086] In one possible implementation, a dual storage mechanism of in-memory caching and local persistent storage is adopted for the storage of the global state view. The in-memory cache (e.g., a Redis in-memory database) stores the latest global state view, ensuring fast querying during scheduling decisions. Local persistent storage (e.g., LevelDB) stores historical update records and snapshots of the view, preventing data loss after node restarts. The global state view can employ a multi-level hash table structure. The first level uses the target node identifier as the key, corresponding to the second level. The second level uses the state data type (e.g., CPU load rate, memory usage ratio) as the key, corresponding to specific state values, data collection timestamps, merge processing timestamps, and other information.

[0087] In one possible implementation, the global state view can be updated as follows: First, after the first compute node completes the merging of state data, it compares the merged result with the corresponding old data in the local global state view. If the merged result differs from the old data (or the old data has expired), an incremental update is triggered (only the differing parts are updated, rather than a full update, reducing resource consumption). Then, the merged state data is written to the corresponding location in the memory cache, replacing the old data. Simultaneously, a version number (version number = original version number + 1) and an update timestamp are added to each updated data entry to mark the latest state of the data. A snapshot of the global state view in the memory cache can be generated and stored in the local persistent storage module every preset period (e.g., 30 seconds) or when the number of view updates reaches a preset threshold (e.g., 10 times). During the update process, state data in the global state view that has not been updated for more than a preset validity period (e.g., 30 seconds) can be synchronously cleaned up (the corresponding node may be offline), and the corresponding node is marked as a suspected failed node, with its status being closely monitored in subsequent synchronization cycles.
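As a non-limiting illustration of the two-level view structure and the incremental update procedure described above, the following Python sketch keeps the view in a nested dictionary. The class and field names are hypothetical. One added assumption is a separate `last_seen` timestamp that is refreshed even when the value is unchanged, so that a node reporting a steady value is not mistaken for an offline node; snapshotting to persistent storage is omitted.

```python
class GlobalStateView:
    """Two-level mapping: node identifier -> state data type -> record."""

    def __init__(self, validity_s=30.0):
        self.validity_s = validity_s
        self.entries = {}
        self.suspected = set()   # nodes marked as suspected failed

    def apply(self, node_id, metric, value, collected_at, now):
        """Write a merged value; return True only if an incremental update happened."""
        metrics = self.entries.setdefault(node_id, {})
        old = metrics.get(metric)
        expired = old is not None and now - old["last_seen"] > self.validity_s
        self.suspected.discard(node_id)   # fresh data clears the suspicion
        if old is not None and old["value"] == value and not expired:
            old["last_seen"] = now        # unchanged: refresh liveness only
            return False
        metrics[metric] = {
            "value": value,
            "collected_at": collected_at,
            "merged_at": now,
            "last_seen": now,
            "version": (old["version"] if old else 0) + 1,
        }
        return True

    def sweep(self, now):
        """Drop records not refreshed within the validity period and flag their nodes."""
        for node_id, metrics in self.entries.items():
            stale = [m for m, rec in metrics.items()
                     if now - rec["last_seen"] > self.validity_s]
            for m in stale:
                del metrics[m]
            if stale:
                self.suspected.add(node_id)
        return set(self.suspected)


view = GlobalStateView(validity_s=30.0)
first_write = view.apply("B", "cpu_load", 30.0, collected_at=0.0, now=0.0)
repeat_write = view.apply("B", "cpu_load", 30.0, collected_at=5.0, now=5.0)
changed_write = view.apply("B", "cpu_load", 50.0, collected_at=10.0, now=10.0)
```

Passing `now` explicitly keeps the sketch deterministic; a running node would typically supply its local clock reading instead.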

[0088] In this implementation, all computing nodes in the distributed system can use the same CRDT merging rules, view update logic, and data validity period settings to ensure that different nodes have consistent processing results for the same state data.

[0089] In some implementations, the primary communication path is a communication path constructed based on the minimum spanning tree (MST) of the network topology, and the backup communication path is a communication path constructed based on the Gossip protocol; wherein, the first computing node preferentially interacts with the at least one second computing node through the communication path constructed by the minimum spanning tree (MST), and switches to the path constructed by the Gossip protocol for state interaction when an anomaly is detected in the communication path constructed by the minimum spanning tree (MST).

[0090] In this embodiment, the main communication path can specifically be a minimum spanning tree communication path based on the network topology. This network topology can be represented as a graph consisting of all active computing nodes in the current distributed system and their logical connections, where the weight of each edge can be the network round-trip delay between nodes, the reciprocal of the communication bandwidth, or a measure of the overall network overhead.

[0091] The minimum spanning tree (MST) can be based on this weighted topology graph. It can be a tree structure with the minimum total weight of edges connecting all nodes, calculated by an algorithm (e.g., Prim's algorithm or Kruskal's algorithm). The logical links in this tree structure serve as the main communication paths for state synchronization. After joining the network, the first computing node can participate in or independently calculate the current network's MST and determine its parent and child nodes (i.e., logical neighbors) in the tree, thus embedding itself into this efficient synchronization backbone network.
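As a non-limiting illustration of deriving the main communication path, the following Python sketch applies Prim's algorithm to a small weighted topology and then reads off a node's logical neighbors in the resulting tree. The node names and edge weights (interpreted here as round-trip delays in milliseconds) are invented for the example, and the graph is assumed to be connected.

```python
import heapq


def minimum_spanning_tree(edges, root):
    """Prim's algorithm. `edges` maps (u, v) pairs to weights; returns tree edges."""
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, []).append((weight, u, v))
        adjacency.setdefault(v, []).append((weight, v, u))
    visited = {root}
    frontier = list(adjacency.get(root, []))
    heapq.heapify(frontier)
    tree = []
    while frontier:
        weight, u, v = heapq.heappop(frontier)
        if v in visited:
            continue              # this edge would close a cycle
        visited.add(v)
        tree.append((u, v, weight))
        for edge in adjacency[v]:
            if edge[2] not in visited:
                heapq.heappush(frontier, edge)
    return tree


def tree_neighbors(tree, node):
    """Parent and child nodes of `node` in the tree, i.e. its logical neighbors."""
    return sorted({v if u == node else u for u, v, _ in tree if node in (u, v)})


# Hypothetical topology: edge weight = round-trip delay in milliseconds.
TOPOLOGY = {
    ("A", "B"): 5, ("A", "C"): 20, ("B", "C"): 8,
    ("C", "D"): 3, ("B", "D"): 15,
}

mst = minimum_spanning_tree(TOPOLOGY, root="A")
neighbors_of_c = tree_neighbors(mst, "C")
```

For this topology the tree keeps the A-B, B-C, and C-D links, so node C synchronizes state only with B and D on the main path.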

[0092] In this embodiment, the backup communication path can specifically be a communication path built based on the Gossip protocol. The Gossip protocol is a communication mode that spreads information by mimicking the spread of epidemics. Its specific working process is as follows: each node (the first computing node) can periodically and randomly select one or more nodes from a known list of nodes as infection targets, sending its current state information to them. Simultaneously, it can also act as a receiver, passively receiving random state pushes from other nodes. This mechanism does not rely on a fixed routing structure.

[0093] In this embodiment, to detect anomalies in the MST path, the first computing node can continuously monitor the communication status of its MST path. The anomaly detection criteria can take various forms, such as: no acknowledgment or response is received within a predetermined timeout period after state synchronization messages have been sent to a certain MST neighbor node several times in succession; the packet loss rate with a certain neighbor node consistently exceeds a preset threshold; or the application layer perceives an abnormally prolonged interval between state updates received through this path.

[0094] Once an anomaly is detected on one or all links of the MST-based communication path, the first computing node can immediately trigger a switchover mechanism. It can temporarily downgrade or bypass the problematic MST link and instead activate and switch to state interaction via the Gossip protocol. At this time, it can encapsulate the state information that needs to be synchronized into Gossip messages. Following the logic of the Gossip protocol, it randomly selects other currently reachable peer nodes (not limited to the original MST neighbors) for information dissemination and collection, thereby keeping the state synchronization process continuous and preventing the node from becoming disconnected due to a single-point or partial failure on the main path.
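As a non-limiting illustration of the switchover between the main path and the backup path, the following Python sketch tracks consecutive timeouts per MST neighbor and picks the synchronization targets for the next round. The class name, the threshold of three consecutive timeouts, and the gossip fan-out of two are assumptions made for this example; only the consecutive-timeout criterion is modeled.

```python
import random


class SyncPathSelector:
    """Chooses state-synchronization targets: MST neighbors first, gossip on failure."""

    def __init__(self, mst_neighbors, known_peers, max_timeouts=3, fanout=2, rng=None):
        self.mst_neighbors = list(mst_neighbors)
        self.known_peers = list(known_peers)
        self.max_timeouts = max_timeouts
        self.fanout = fanout
        self.rng = rng or random.Random()
        self.timeouts = {n: 0 for n in self.mst_neighbors}

    def record_ack(self, neighbor):
        self.timeouts[neighbor] = 0     # a response resets the failure streak

    def record_timeout(self, neighbor):
        self.timeouts[neighbor] += 1

    def targets(self):
        """Return (mode, peers) for the next synchronization round."""
        healthy = [n for n in self.mst_neighbors
                   if self.timeouts[n] < self.max_timeouts]
        if len(healthy) == len(self.mst_neighbors):
            return "mst", healthy
        # At least one MST link is failing: keep the healthy links and add
        # randomly chosen peers outside the MST neighborhood as gossip targets.
        candidates = [p for p in self.known_peers if p not in self.mst_neighbors]
        extra = self.rng.sample(candidates, min(self.fanout, len(candidates)))
        return "gossip", healthy + extra


selector = SyncPathSelector(
    mst_neighbors=["B", "C"],
    known_peers=["B", "C", "D", "E", "F"],
    rng=random.Random(0),
)
mode_before, peers_before = selector.targets()
for _ in range(3):
    selector.record_timeout("B")        # three consecutive timeouts on the A-B link
mode_after, peers_after = selector.targets()
```

Resetting the counter in `record_ack` lets a recovered neighbor rejoin the main path automatically, so the switchover is temporary rather than permanent.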

[0095] In some implementations, the step of determining the target computing node for executing the computing task based on the attribute information of the task data frame and the global state view includes: Step S162: Based on the resource requirement constraints indicated in the attribute information of the task data frame and the real-time resource status of the candidate computing nodes recorded in the global status view, calculate the comprehensive score of each candidate computing node through a preset multi-dimensional cost evaluation function.

[0096] In this embodiment, the first computing node first parses the attribute information in the task data frame to extract the explicitly indicated resource requirement constraints. These constraints may be the minimum resource specifications necessary or desired to complete the computing task, specifically including: a minimum number of required CPU cores, a minimum memory capacity, whether a graphics processing unit (GPU) is required and its model or computing power requirements, the required local temporary storage space size, and a sensitivity declaration to network bandwidth or latency, etc. Simultaneously, the first computing node queries its maintained global state view to obtain the real-time resource status of all candidate computing nodes. Candidate computing nodes may include the first computing node itself and all second computing nodes recorded in the current global state view that are in an available or healthy state.

[0097] In this embodiment, the first computing node can invoke a preset multi-dimensional cost evaluation function. This function can be a pre-packaged evaluation logic module, specifically an algorithm that maps resource states of different dimensions, together with the task constraints, into a single comparable value. The multi-dimensionality of the function is reflected in its internal evaluation model simultaneously considering multiple resource indicator dimensions, such as at least the computing dimension (CPU/GPU utilization), the memory dimension, the network dimension (latency, bandwidth), and the reliability dimension.

[0098] Specifically, for each candidate computing node, the function can take its specific real-time resource status data vector and the resource requirement constraints of the current task as input. Internally, the function can define a corresponding basic scoring sub-function for each evaluation dimension. This sub-function outputs a raw score based on the degree of matching between the node's actual state and the task requirements (e.g., a high score if the number of available CPU cores far exceeds the requirement, and a low score if the utilization is close to saturation). Then, the function can assign a corresponding weighting coefficient to each dimension, representing the relative importance of that dimension in this evaluation.

[0099] It should be noted that the weighting coefficients mentioned above are dynamic, and can be dynamically adjusted based on the task type of the current computation task (obtained from the attribute information of the task data frame). For example, when the task type attribute identifies it as GPU-intensive training, the weights of GPU-related dimensions (e.g., GPU utilization, available video memory) can be significantly increased, while the weights of dimensions related to pure CPU computation can be correspondingly decreased. Then, the function can combine the raw scores of each dimension into a single value through weighted synthesis, which is the comprehensive score of the candidate computation node.

[0100] Step S164: Based on the comprehensive score, determine the target computing node from the candidate computing nodes.

[0101] In this embodiment, the candidate computing node with the highest comprehensive score can be selected as the target computing node.
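Steps S162 and S164 can be sketched, under simplifying assumptions, as a weighted sum followed by an arg-max. The per-dimension scoring formulas, the status keys (`cpu_util`, `mem_util`, `latency_ms`, `reliability`), and the normalization by the weight total are all choices made for this sketch; the text does not fix any of them, and the sub-functions here ignore the task's own requirements for brevity.

```python
from typing import Dict


def raw_scores(status: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical per-dimension sub-functions; each returns a score in [0, 1]."""
    return {
        "compute": 1.0 - status["cpu_util"],          # idle capacity scores high
        "memory": 1.0 - status["mem_util"],
        "network": 1.0 / (1.0 + status["latency_ms"] / 10.0),
        "reliability": status["reliability"],
    }


def comprehensive_score(status: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted synthesis of the raw scores into one comparable value."""
    scores = raw_scores(status)
    total_weight = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total_weight


def select_target(view: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Step S164: pick the candidate with the highest comprehensive score."""
    return max(view, key=lambda node_id: comprehensive_score(view[node_id], weights))
```

Passing a different `weights` dictionary per task type is how the dynamic adjustment described in paragraph [0099] would enter this sketch.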

[0102] In some implementations, the multi-dimensional cost evaluation function calculates the comprehensive score based on factors in at least two of the following dimensions: the computing resource utilization rate of the candidate computing node, the memory resource occupancy rate, the network transmission latency, and the node reliability index; wherein, for different types of computing tasks, the weighting coefficients of the corresponding factors in the multi-dimensional cost evaluation function can be dynamically adjusted.

[0103] In this implementation, the node reliability index is used to measure the stability of candidate computing nodes in historical operation and the reliability of successful task completion. Specifically, it can be a value calculated by a statistical model based on historical data such as the node's recent task failure rate, unplanned offline duration, and number of hardware failure reports (e.g., a reliability score between 0 and 1).
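The text does not specify the statistical model, so the following is only one hypothetical way to combine the three historical inputs into a value between 0 and 1. The multiplicative form, the one-week observation window, and the `1 / (1 + n)` hardware penalty are assumptions of this sketch.

```python
def reliability_index(
    failure_rate: float,
    offline_hours: float,
    hw_fault_reports: int,
    window_hours: float = 168.0,
) -> float:
    """Combine recent history into a reliability score in [0, 1] (hypothetical model)."""
    # Fraction of recent tasks that succeeded, clamped to [0, 1].
    success = 1.0 - min(max(failure_rate, 0.0), 1.0)
    # Fraction of the observation window the node was online.
    availability = 1.0 - min(max(offline_hours, 0.0) / window_hours, 1.0)
    # Each hardware failure report reduces the score further.
    hw_penalty = 1.0 / (1.0 + max(hw_fault_reports, 0))
    return success * availability * hw_penalty
```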

[0104] In this embodiment, dynamically adjusting the weighting coefficients for different types of computing tasks means that the contribution weight of each dimension factor to the final comprehensive score is not preset and fixed, but is configured adaptively in real time according to the type attribute of the computing task currently being scheduled.

[0105] In one possible implementation, when the first computing node executes the evaluation function, it can first parse the task type identifier from the attribute information of the task data frame. This identifier can be a predefined enumerated value. Then, based on the identified task type, the function can invoke a preset weight configuration strategy to assign appropriate weighting coefficients to different dimensions (computation, memory, network, reliability). More specifically: For GPU-intensive training tasks, the weight of computational resource utilization (especially GPU-related metrics) can be increased, and the weight of memory resource utilization can also be increased accordingly (because training often requires a large amount of memory), while the weight of network transmission latency can be reduced (assuming that data exchange is not frequent during the training phase).

[0106] For low-latency inference tasks, the weight of the network transmission latency dimension can be set to the highest level to ensure that the task is scheduled to the node closest to the request source or data source, while the utilization of computing resources can also be given a certain weight to ensure processing speed.

[0107] For highly reliable batch processing tasks, the weight of node reliability metrics can be emphasized.
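The weight configuration strategy for the three task types above can be sketched as a lookup keyed by an enumerated task type. The enumeration values, the `GENERIC` fallback, and every numeric weight are illustrative assumptions; only the relative emphasis (which dimension is raised or lowered for which task type) follows the text.

```python
from enum import Enum
from typing import Dict


class TaskType(Enum):
    GPU_TRAINING = "gpu_training"
    LOW_LATENCY_INFERENCE = "low_latency_inference"
    RELIABLE_BATCH = "reliable_batch"
    GENERIC = "generic"  # fallback profile, an assumption of this sketch


# Numeric values are placeholders chosen so that each profile sums to 1.
WEIGHT_PROFILES: Dict[TaskType, Dict[str, float]] = {
    TaskType.GPU_TRAINING: {"compute": 0.45, "memory": 0.30, "network": 0.05, "reliability": 0.20},
    TaskType.LOW_LATENCY_INFERENCE: {"compute": 0.25, "memory": 0.10, "network": 0.50, "reliability": 0.15},
    TaskType.RELIABLE_BATCH: {"compute": 0.20, "memory": 0.15, "network": 0.10, "reliability": 0.55},
    TaskType.GENERIC: {"compute": 0.25, "memory": 0.25, "network": 0.25, "reliability": 0.25},
}


def weights_for(task_type_id: str) -> Dict[str, float]:
    """Map the task type identifier from the data frame to a weight profile."""
    try:
        task_type = TaskType(task_type_id)
    except ValueError:
        task_type = TaskType.GENERIC
    return dict(WEIGHT_PROFILES[task_type])
```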

[0108] In some implementations, the local environment is a programming language runtime environment that eliminates global interpreter locks.

[0109] In this implementation, the global interpreter lock can be represented as a mutex lock used in a programming language interpreter (e.g., the standard CPython interpreter) to protect the interpreter's internal data structures. That is, only one thread is allowed to execute Python bytecode at a time. Even in a multi-core CPU hardware environment, multithreading cannot achieve true parallel computing; it can only achieve pseudo-concurrency through thread switching. This is the core bottleneck restricting the execution efficiency of CPU-intensive AI tasks.

[0110] Understandably, in distributed AI task processing scenarios, the AI tasks that the first computing node needs to execute (e.g., data preprocessing, model inference, feature computation, etc.) are mostly CPU/GPU-intensive. Traditional runtime environments with a GIL have two main limitations. First, multi-core CPU resources are wasted: even if the first computing node is equipped with a multi-core CPU, an interpreter with a GIL allows only one thread at a time to occupy a CPU core for computation, leaving the other cores idle and the hardware underutilized. Second, task execution efficiency is low: distributed AI tasks often need to process multiple batches of data frames in parallel, but the GIL prevents multi-threaded parallel processing and forces sequential processing, so the task execution time increases linearly with the amount of data and fails to meet real-time requirements.

[0111] In this embodiment, the local environment can be the PyPy interpreter, Jython, or IronPython in the Python ecosystem, or a native programming language runtime environment without a GIL. For example, for high-performance AI tasks, the first computing node can deploy runtime environments for GIL-free programming languages such as Go, Rust, and Julia.
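As a hedged illustration of the batch-parallel pattern discussed above, the sketch below processes batches with a thread pool and reports whether the running interpreter has a GIL. `sys._is_gil_enabled()` is available in CPython 3.13 and later; on interpreters without that function the sketch simply assumes a GIL is present, which is a simplification that holds for older CPython but not necessarily for other runtimes. The function names and the toy workload are hypothetical.

```python
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def gil_enabled() -> bool:
    """Best-effort probe; interpreters lacking the probe are assumed to have a GIL."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


def feature_sum(batch: List[int]) -> int:
    # Stand-in for a CPU-bound step such as feature computation.
    return sum(x * x for x in batch)


def process_batches(batches: List[List[int]], workers: int = 4) -> List[int]:
    """Run one task per batch on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(feature_sum, batches))
```

The results are identical with or without a GIL; only on a GIL-free runtime can the threads be expected to occupy several cores at once for this kind of CPU-bound work.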

[0112] As shown in Figure 3, according to an embodiment of the present invention, a processing apparatus for distributed artificial intelligence tasks is provided, applied to a first computing node in a distributed system, the apparatus comprising: an acquisition module, used to acquire the task data frame to be processed, wherein the task data frame encapsulates the corresponding computing task and the dependency and attribute information of the computing task; a connection module, used to establish a secure communication connection with at least one second computing node in the distributed system based on a peer-to-peer network protocol, so as to join the peer-to-peer network of the distributed system; a synchronization module, used to acquire and maintain a global state view of the distributed system by performing distributed state synchronization with the at least one second computing node, wherein the global state view records the real-time resource status of each computing node in the distributed system; a determination module, used to determine the target computing node for performing the computing task based on the attribute information of the task data frame and the global state view, wherein the target computing node is the first computing node or one of the at least one second computing node; and an execution module, configured to, when the target computing node is the first computing node, have the first computing node invoke its local environment to execute the computing task, and when the target computing node is one of the at least one second computing node, migrate the task data frame to the target computing node so that the target computing node can execute the computing task.

[0113] In one specific implementation plan, a distributed artificial intelligence computing and scheduling method based on a decentralized zero-trust architecture is provided. This method relies on a P2P network structure without a central node to build a distributed system, which can unify edge computing, cloud computing and multi-tenant secure collaboration scenarios, enabling any device to participate in distributed intelligent task processing as an independent computing node.

[0114] The system corresponding to this implementation plan is divided into three layers from bottom to top, which work together to close the loop of distributed artificial intelligence computing and task scheduling. The first layer is the Mojo-based runtime layer, which provides a Python-like syntax interface for algorithm development engineers. It achieves high-performance execution through JIT compilation and eliminates the global interpreter lock (GIL) to support concurrent thread tasks, breaking through the performance and multi-threaded concurrency bottlenecks of Python while retaining its development efficiency and readability, and providing the core computing capability of a single node.

[0115] The second layer is the distributed scheduling layer, which is responsible for node discovery, task scheduling, efficient transmission of structured data between nodes, and access security. It constructs a distributed network topology without a central node, where all nodes are peers and interact through the P2P communication protocol. It adopts the MST main path and the Gossip backup path to ensure network connectivity, and ensures that the status of all nodes automatically converges and becomes consistent by periodically synchronizing status information in combination with CRDT technology.

[0116] The third layer is the task orchestration layer, which defines task dependencies and data flow directions for business logic. It allows developers to build tasks and data flows (DAGs) through simple descriptions or drag-and-drop processes, driving the computation process in the form of data frames. Each data frame carries metadata such as task ID, type, priority, and token, and the scheduler determines the execution node based on resource availability. Each layer provides security authentication and state sharing support through the SecurityToken and KVStore modules, forming a collaborative connection where the runtime layer provides computing power, the distributed scheduling layer handles communication and scheduling, and the task orchestration layer calls the scheduling layer to allocate tasks.

[0117] In this distributed system, each node connects to the network through four main modules. Inter-node communication uses a decentralized public-private key signature verification method, with each communication carrying a signature and timestamp to effectively prevent forgery or replay attacks. The four modules are: 1) Discovery module, responsible for proactively discovering and identifying other peer nodes in a dynamic network environment and establishing a list of potential communication neighbors; 2) Connectivity module, establishing persistent, bidirectional, and secure encrypted communication channels with neighboring nodes found by the discovery module, undertaking the underlying transmission responsibility for all structured data; 3) Synchronization module, efficiently and reliably synchronizing resource load, network topology, and other state information among all nodes in the network, ensuring that the states of all nodes eventually automatically converge to consistency under network partitioning or concurrent operation scenarios; and 4) Scheduling module, acting as the decision-making brain of the nodes, intelligently deciding how to execute new tasks (local execution or migration to other more suitable nodes in the cluster) based on the global state view provided by the synchronization module.
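The convergence property required of the synchronization module can be illustrated with one simple CRDT, a last-writer-wins map. The text names CRDT technology but not a specific data type, so the choice of a last-writer-wins map, the `(logical timestamp, writer id, value)` entry layout, and the function name `merge` are assumptions of this sketch.

```python
from typing import Any, Dict, Tuple

# (logical timestamp, writer id, value); the writer id breaks timestamp ties.
Entry = Tuple[int, str, Any]


def merge(a: Dict[str, Entry], b: Dict[str, Entry]) -> Dict[str, Entry]:
    """Merge two replicas of the state view; the newest entry per key wins."""
    merged = dict(a)
    for key, entry in b.items():
        # Compare only (timestamp, writer id) so values never need ordering.
        if key not in merged or entry[:2] > merged[key][:2]:
            merged[key] = entry
    return merged
```

Because this merge is commutative, associative, and idempotent, replicas that exchange state in any order, including after a network partition heals, end up with the same view.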

[0118] The distributed artificial intelligence computing and scheduling method in this implementation plan follows the following data processing flow: First, in the task definition phase, users define task flows through a visual interface or YAML file provided by the task orchestration layer. Second, in the data frame generation phase, each task unit (Filter) generates a DataFrame structure including input data and metadata tags. Third, in the task dispatch phase, the distributed scheduling layer receives the data frame and sequentially performs steps such as target task type analysis, neighbor node resource table lookup, optimal node evaluation (based on parameters such as GPU availability, latency, network bandwidth, and token validity), and execution method decision (local execution or remote migration). Next, in the task execution phase, the target node executes the task in parallel using XLang Runtime, generating new data frames from the execution results and returning them upstream. Then, in the result aggregation phase, the Galaxy Pipeline merges the execution results from each node, triggering subsequent tasks or transmitting them to the user interface. Finally, in the security audit phase, each task execution or migration is recorded by the SecurityToken module, generating encrypted audit logs to prevent unauthorized or forged operations.

[0119] This implementation plan achieves efficient and secure operation of distributed computing and scheduling through the following core technologies: Firstly, it employs a decentralized scheduling technology. Unlike traditional systems like Ray and Spark that rely on Head / Master nodes for centralized task allocation, this method distributes scheduling logic across nodes via a P2P communication layer, ensuring stable scheduling in large-scale network scenarios and overcoming centralized bottlenecks. Each node possesses local scheduling capabilities and can independently decide whether to execute or migrate tasks. The minimum spanning tree algorithm ensures connectivity across all nodes with minimal communication latency. It also features an automatic topology reconstruction mechanism; when a node goes offline, neighboring nodes automatically re-establish connectivity paths, enabling system self-healing.
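The minimum spanning tree construction can be sketched with Prim's algorithm over a latency-weighted graph. The text names the MST but not the algorithm, so Prim's algorithm is a choice made for this sketch, as are the adjacency-dictionary input format and the function name.

```python
import heapq
from typing import Dict, List, Tuple


def minimum_spanning_tree(
    latency: Dict[str, Dict[str, float]], root: str
) -> List[Tuple[str, str, float]]:
    """Return MST edges (u, v, latency) for a connected, undirected graph."""
    visited = {root}
    heap = [(w, root, v) for v, w in latency[root].items()]
    heapq.heapify(heap)
    edges: List[Tuple[str, str, float]] = []
    while heap and len(visited) < len(latency):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # this edge would close a cycle
        visited.add(v)
        edges.append((u, v, w))
        for nxt, w2 in latency[v].items():
            if nxt not in visited:
                heapq.heappush(heap, (w2, v, nxt))
    return edges
```

Topology reconstruction after a node goes offline corresponds, in this sketch, to removing that node from `latency` and calling the function again on the remaining graph.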

[0120] Secondly, dynamic task migration and resource awareness technology. Each node in the system periodically publishes its own resource usage information, including CPU core count, GPU load, memory usage, and network latency; the scheduling engine comprehensively evaluates node adaptability through a weighted decision function.

[0121] In the weighted decision function, each of the six indicators has its own weighting coefficient, which can be dynamically adjusted according to the task type. Four of the indicators are the CPU load rate, GPU load rate, memory usage ratio, and network latency, each taken after unifying its dimension and direction; the other two are the network bandwidth availability and the node reliability coefficient. When a node's load exceeds a threshold or path latency changes, the system automatically migrates tasks to other nodes. The migration process maintains state consistency through a checkpoint mechanism, ensuring computational continuity.
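One form consistent with this description is a linear weighted sum over the six indicators. The symbols below are assumed for illustration, as is the linear form itself: for candidate node $j$, $C_j$, $G_j$, $M_j$, and $L_j$ denote the CPU load rate, GPU load rate, memory usage ratio, and network latency after normalization to a common scale and direction (here taken as $[0, 1]$ with larger values more favorable), $B_j$ the network bandwidth availability, and $R_j$ the node reliability coefficient.

```latex
S_j = w_1 C_j + w_2 G_j + w_3 M_j + w_4 L_j + w_5 B_j + w_6 R_j,
\qquad w_i \ge 0, \quad \sum_{i=1}^{6} w_i = 1
```

Under this assumed convention the node with the largest $S_j$ is the most suitable, and the constraint on the weights is a normalization added for the sketch rather than a condition stated in the text.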

[0122] Third, zero-trust secure communication technology. This method does not rely on the assumption of trust within the internal network. Each node communication requires verification of the token and signature. The specific communication steps are: the requester signs the task token; the receiver verifies the signature and timestamp; after successful verification, an encrypted channel is established and data transmission continues; all communication logs are written to a security audit database. This mechanism effectively prevents node forgery and data leakage.
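The sign-then-verify steps, including the timestamp and replay checks, can be sketched with the Python standard library. Note the substitution: the text describes public-private key signatures, whereas this sketch uses a shared-secret HMAC as a stand-in because the standard library has no public-key signing primitive. The 30-second skew limit, the message layout, and the in-memory `seen` set are likewise assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional, Set

MAX_SKEW_SECONDS = 30.0  # hypothetical freshness window


def sign(key: bytes, token: str, timestamp: float) -> str:
    """Requester side: sign the task token together with its timestamp."""
    message = f"{token}|{timestamp:.3f}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(
    key: bytes,
    token: str,
    timestamp: float,
    signature: str,
    seen: Set[str],
    now: Optional[float] = None,
) -> bool:
    """Receiver side: check freshness, signature validity, and replay."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # stale or future-dated request
    if not hmac.compare_digest(sign(key, token, timestamp), signature):
        return False  # forged or tampered request
    if signature in seen:
        return False  # replayed request
    seen.add(signature)
    return True
```

Only after `verify` returns `True` would the encrypted channel be established and the event written to the audit log.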

[0123] Fourth, data frame-driven and intelligent orchestration execution technology. As the core carrier of computing units, the data frame contains key information such as input data pointers, task dependency information (upstream / downstream node IDs), and metadata tags (priority, target model, execution strategy). The task orchestration layer uses an event-driven mechanism to parse and schedule data frames, automatically identifying the execution order and dependencies, eliminating the need for manual synchronization control and significantly reducing development complexity.
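How dependency information carried in data frames yields an execution order can be sketched with the standard-library `graphlib` module. The class name `TaskFrame` and its fields are hypothetical simplifications of the data frame described above (the input data pointer and security token are omitted), and a static topological sort stands in for the event-driven mechanism.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List


@dataclass
class TaskFrame:
    task_id: str
    task_type: str
    priority: int = 0
    upstream: List[str] = field(default_factory=list)  # IDs this task depends on


def execution_order(frames: List[TaskFrame]) -> List[str]:
    """Order task IDs so that every task runs after all of its upstream tasks."""
    graph = {frame.task_id: set(frame.upstream) for frame in frames}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())
```

For example, frames `load`, `preprocess` (upstream `load`), and `infer` (upstream `preprocess`) are ordered as `load`, `preprocess`, `infer` regardless of the order in which they are supplied.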

[0124] Fifth, high-performance language runtime technology. Based on the open-source Mojo, a high-performance runtime environment is implemented, which has core features such as just-in-time (JIT) compilation to native machine code and a multi-threaded lock-free execution model. It also has a built-in Python interpreter, which supports the direct import and use of any Python module in Mojo code. This runtime effectively solves the performance bottleneck of traditional Ray systems that are limited by Python GIL, and achieves efficient concurrent execution on a single node with multiple cores.

[0125] Sixth, task self-healing and fault tolerance technology. When a node fails or the network is abnormal, other nodes automatically rebuild the network topology and re-instantiate computing units according to the historical task state; by saving state snapshots through distributed KVStore, tasks can be resumed and executed on any node, ensuring system robustness.
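Resuming a task from a state snapshot can be sketched as follows. `InMemoryKVStore` is a hypothetical single-process stand-in for the distributed KVStore, and the summation workload, the snapshot layout, and the `fail_at` parameter (used only to simulate a node failure) are assumptions of this sketch.

```python
import json
from typing import Dict, List, Optional


class InMemoryKVStore:
    """Single-process stand-in for the distributed KVStore."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)

    def get(self, key: str) -> Optional[dict]:
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None


def run_with_checkpoints(
    task_id: str,
    items: List[int],
    store: InMemoryKVStore,
    fail_at: Optional[int] = None,
) -> int:
    """Sum the items, saving a snapshot after each one so any node can resume."""
    state = store.get(task_id) or {"next_index": 0, "total": 0}
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        store.put(task_id, state)  # snapshot after every completed step
    return state["total"]
```

A second call with the same `task_id` and store continues from the last snapshot instead of restarting, which mirrors resuming the task on a different node.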

[0126] According to an embodiment of the present invention, an electronic device is provided; please refer to Figure 4. The electronic device in this embodiment may include one or more of the following components: a processor, a network interface, memory, non-volatile memory, and one or more application programs, wherein the one or more application programs may be stored in non-volatile memory and configured to be executed by one or more processors, and the one or more programs are configured to perform the methods as described in the foregoing method embodiments.

[0127] According to embodiments of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a computer, causes the computer to perform the method described in any of the above embodiments.

[0128] According to embodiments of the present invention, a computer program product comprising instructions is also provided, which, when executed by a computer, cause the computer to perform a method in any of the above embodiments.

[0129] As is known from common technical knowledge, this invention can be implemented through other embodiments that do not depart from its spirit or essential characteristics. Therefore, the embodiments disclosed above are merely illustrative in all respects and are not exclusive. All modifications within the scope of this invention or its equivalents are included in this invention.

[0130] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0131] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0132] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0133] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0134] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation of the present invention. Any modifications or equivalent substitutions that do not depart from the spirit and scope of the present invention should be covered within the scope of protection of the claims of the present invention.

Claims

1. A method for processing distributed artificial intelligence tasks, characterized in that the method is applied to a first computing node in a distributed system and includes: obtaining a task data frame to be processed, wherein the task data frame encapsulates the corresponding computing task and the dependency and attribute information of the computing task; establishing, based on a peer-to-peer network protocol, a secure communication connection with at least one second computing node in the distributed system to join the peer-to-peer network of the distributed system; obtaining and maintaining a global state view of the distributed system by performing distributed state synchronization with the at least one second computing node, wherein the global state view records the real-time resource status of each computing node in the distributed system; determining, based on the attribute information of the task data frame and the global state view, the target computing node for executing the computing task, wherein the target computing node is the first computing node or one of the at least one second computing node; if the target computing node is the first computing node, invoking, by the first computing node, its local environment to execute the computing task; and if the target computing node is one of the at least one second computing node, migrating the task data frame to the target computing node so that the target computing node can execute the computing task.

2. The method of claim 1, wherein the step of obtaining the task data frame to be processed includes: generating the task data frame in response to a task description file received from the task orchestration layer, wherein the task description file is generated through a visual graphical orchestration interface or a declarative configuration file interface provided by the task orchestration layer, and is used to define a task flow containing at least one computing task.

3. The method of claim 1, wherein the step of establishing a secure communication connection with at least one second computing node in the distributed system based on a peer-to-peer network protocol to join the peer-to-peer network of the distributed system includes: performing two-way authentication with the second computing node based on public-key cryptography to verify the legitimacy of the node identifiers and security tokens of both parties; and, after the two-way authentication succeeds, establishing an end-to-end encrypted communication channel with the second computing node.

4. The method of claim 1, wherein the step of obtaining and maintaining a global state view of the distributed system by performing distributed state synchronization with at least one second computing node includes: adopting a dual-path synchronization mechanism combining a primary communication path and a backup communication path to periodically interact with the at least one second computing node and obtain the real-time resource status of each node; merging the exchanged state data based on conflict-free replicated data types (CRDTs) to eliminate state conflicts; and updating the locally maintained global state view based on the merged state data, so that the global state view of the first computing node eventually converges with the global state views of the other computing nodes in the distributed system.

5. The method of claim 4, wherein the primary communication path is a communication path constructed based on the minimum spanning tree (MST) of the network topology, and the backup communication path is a communication path constructed based on the Gossip protocol; the first computing node preferentially interacts with the at least one second computing node through the communication path constructed by the MST, and switches to the path constructed by the Gossip protocol for state interaction when an anomaly is detected in the communication path constructed by the MST.

6. The method of claim 4, wherein the step of determining the target computing node for executing the computing task based on the attribute information of the task data frame and the global state view includes: calculating a comprehensive score for each candidate computing node using a preset multi-dimensional cost evaluation function, based on the resource requirement constraints indicated in the attribute information of the task data frame and the real-time resource status of the candidate computing nodes recorded in the global state view; and determining the target computing node from the candidate computing nodes based on the comprehensive score.

7. The method of claim 6, wherein the multi-dimensional cost evaluation function calculates the comprehensive score based on factors in at least two of the following dimensions: the computing resource utilization rate of the candidate computing nodes, the memory resource occupancy rate, the network transmission latency, and the node reliability index; wherein, for different types of computing tasks, the weighting coefficients of the corresponding factors in the multi-dimensional cost evaluation function can be dynamically adjusted.

8. The method of claim 1, wherein the local environment is a programming language runtime environment that eliminates the global interpreter lock.

9. An electronic device, comprising: a memory, and one or more processors communicatively connected to the memory; wherein the memory stores instructions that can be executed by the one or more processors to cause the one or more processors to implement the method of any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that the readable storage medium stores a computer program that, when executed by a processor, implements the method of any one of claims 1 to 8.