A method for orchestrating and running AI task flows

By encapsulating custom operators into images and executing task flows in container form, the problems of low resource utilization and incomplete isolation in existing technologies are solved, achieving efficient task flow orchestration and visualization, and improving user work efficiency.

CN120085986B · Active · Publication Date: 2026-06-30 · SHENZHEN KAIQIAO TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN KAIQIAO TECHNOLOGY CO LTD
Filing Date
2025-02-13
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing cloud-native AI task flow orchestration systems suffer from low resource utilization, incomplete resource isolation, and complex user orchestration, making it difficult to achieve drag-and-drop task flow orchestration and custom operators.

Method used

User-defined operators are encapsulated into images, registered to the task flow orchestration and scheduling platform, and the task flow is executed in the form of containers. The task flow is then orchestrated in a drag-and-drop manner on the web interface, enabling the visualization of the task flow and the storage of results.

Benefits of technology

It improves resource utilization and resource isolation, simplifies the task flow orchestration process, enhances user work efficiency, and enables rapid task flow creation and modification without complex configuration or programming.



Abstract

This invention provides a method for orchestrating and executing AI task flows, comprising: a user developing code for custom operators; encapsulating the custom operators into images; registering the images in an operator library of a task flow orchestration and scheduling platform; the user orchestrating a task flow of operators and configuring task execution parameters through a drag-and-drop interface; arranging the nodes of upstream and downstream tasks and the execution parameters of each task into a task flow, which is then executed in the form of containers; and, after the containers have finished running, returning the execution results to the web interface for visual display to the user. Because containers host task execution, resources are occupied only while a task is executing, which achieves higher resource utilization and better resource isolation. Furthermore, this invention supports user-defined task operators for task flow orchestration, enabling more flexible application scenarios for enterprises.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and more specifically, to a method for orchestrating and running AI task flows.

Background Technology

[0002] Cloud-native AI development platforms integrate mature AI development frameworks with cloud-native tools to flexibly access cloud resources and efficiently deploy cloud applications. On one hand, they help enterprise developers improve the efficiency of algorithm model development; on the other hand, they enhance the efficiency of delivery, deployment, and maintenance while reducing various costs. AI model training requires repeated running, evaluation, and correction of the model. Through automated process components provided by cloud-native technologies such as pipelines, the automation level of processes such as parameter selection, hyperparameter tuning, and periodic training with new data can be improved, thereby increasing the speed of model output.

[0003] In cloud-native AI task flow orchestration, tasks are typically carried out as processes, such as those in Airflow and DolphinScheduler. This architecture requires the platform to deploy the necessary components, such as worker executors, as pods before tasks start. Deploying these worker executors already consumes machine resources. Airflow cannot implement drag-and-drop task flow orchestration via a web interface; users need to write Python scripts to define each task and its upstream/downstream relationships, which is quite complex. Furthermore, Airflow tasks are carried out as processes running within pre-deployed workers. Even when tasks are not running, processes still consume machine resources, resulting in low resource utilization and a lack of resource isolation between tasks. DolphinScheduler can implement drag-and-drop task flow orchestration via a web interface, but users cannot define custom operators. Like Airflow, DolphinScheduler tasks are also carried out as processes running within pre-deployed workers, leading to similar issues of low resource utilization and incomplete resource isolation.

Summary of the Invention

[0004] To address the aforementioned problems, the present invention aims to provide a method for orchestrating and executing AI task flows.

[0005] A method for orchestrating and executing AI task flows includes:

[0006] Step 1: Users develop their own code for custom operators;

[0007] Step 2: Package the custom operator into an image;

[0008] Step 3: Register the image to the operator library in the task flow orchestration and scheduling platform;

[0009] Step 4: Users can use a drag-and-drop interface to orchestrate the task flow of operators and configure task execution parameters.

[0010] Step 5: Arrange the multiple nodes of upstream and downstream tasks and the execution parameters of each task into a task flow and execute it in the form of containers;

[0011] Step 6: After the container finishes running, return the results to the web interface for a visual display to the user.

[0012] Preferably, in step 2, the custom operator is encapsulated into a container image, and the custom operator has a start time and an end time.

[0013] Preferably, in step 5, the task flow orchestration and scheduling platform renders the corresponding operator into a web interface based on the task execution parameters, and then the user fills in the input parameters. The task flow orchestration and scheduling platform then passes the user input parameters to the image of the corresponding operator, thereby starting the task.

[0014] Preferably, in step 6, the container includes a control container and a business container; the control container is started, and the control container starts the business container to run the orchestrated task flow. The business container stores the visual output of the running results in the /metric.json file; the control container obtains the /metric.json file from the business container and writes the /metric.json file to distributed storage; the control container reads /metric.json from the distributed storage, parses it, and displays it on the web interface.

[0015] Preferably, in the task flow, the transfer of variables between upstream and downstream tasks is divided into key-value type variable transfer, file transfer, and cache transfer.

[0016] Preferably, for key-value type variable passing, each task node in the task flow will start the corresponding business container and control container. The business container will output the variable and write it to the /output file, and the control container will write it to the distributed storage. Downstream nodes read the upstream output as input through {{task_name.output}}.

[0017] Preferably, for file transfer, each task automatically mounts its personal distributed storage directory into the container, and the container can write variables to the file in the upstream task and read the file in the downstream task.

[0018] Preferably, for cache passing, each task's container will configure the address of the cached Redis in the form of an environment variable, and users can read and write variables in the cache themselves in the task.

[0019] The present invention also provides an electronic device, including a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the transceiver, the memory, and the processor are connected via the bus, characterized in that the computer program, when executed by the processor, implements the steps in the above-described method for orchestrating and running an AI task flow.

[0020] The present invention also provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps in the above-described method for orchestrating and running an AI task flow.

[0021] The beneficial effects of the AI task flow orchestration method provided by this invention are as follows: Compared with the prior art, this invention uses containers to carry out task execution, so resources are occupied only while a task is executing, achieving higher resource utilization and better resource isolation. Furthermore, this invention achieves task flow orchestration through drag-and-drop, allowing users to quickly create and modify task processes without complex configuration or programming, thus improving work efficiency.

[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Description of the Drawings

[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0024] Figure 1 is a flowchart of a method for orchestrating and running AI task flows according to an embodiment of the present invention;

[0025] Figure 2 is a schematic diagram of task flow orchestration provided in an embodiment of the present invention;

[0026] Figure 3 is a schematic diagram of a process serving as the task carrier according to an embodiment of the present invention;

[0027] Figure 4 is a schematic diagram of a container serving as the task carrier provided in an embodiment of the present invention;

[0028] Figure 5 is a flowchart of the visualization implementation provided in an embodiment of the present invention;

[0029] Figure 6 shows the front-end display effect provided in an embodiment of the present invention;

[0030] Figure 7 is a flowchart of the variable transfer process provided in an embodiment of the present invention;

[0031] Figure 8 is a flowchart of the file transfer process provided in an embodiment of the present invention;

[0032] Figure 9 is a schematic diagram of the definition of constants provided in an embodiment of the present invention;

[0033] Figure 10 is a schematic diagram of the use of constants provided in an embodiment of the present invention;

[0034] Figure 11 is a schematic diagram of flow direction control of the task flow provided in an embodiment of the present invention.

Detailed Implementation

[0035] In the description of this invention, it should be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," and "counterclockwise," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this invention.

[0036] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0037] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0038] Referring to Figure 1, a method for orchestrating and executing AI task flows comprises:

[0039] Step 1: Users develop their own code for custom operators;

[0040] Step 2: Package the custom operator into an image;

[0041] In step 2, the custom operator is encapsulated into a container image, and the custom operator has a start time and an end time;

[0042] Step 3: Register the image to the operator library in the task flow orchestration and scheduling platform;

[0043] Step 4: Users can use a drag-and-drop interface to orchestrate the task flow of operators and configure task execution parameters.

[0044] Operators are functionalities fixed through code and can accept user parameters, allowing even users with no prior knowledge to implement algorithmic or engineering processing templates through simple parameter configuration. Task operators are nodes in the task flow orchestration interface that are dragged and dropped; each node represents an actual executable process. To enable the reuse of general algorithm training functions and allow users without algorithmic skills to directly debug algorithms, general functions need to be abstracted and defined as task operators. Task operators contain certain general algorithm training functions, forming an operator list as shown on the left. Users drag and drop task operators onto the canvas to create tasks, configure task startup parameters, complete task definition, and define upstream and downstream relationships between multiple tasks through connections.
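The upstream and downstream relationships drawn as connections on the canvas form a directed graph, and any valid execution order must run each task after its upstream tasks. The following is a minimal sketch of deriving such an order using Python's standard `graphlib` module. The task names and the edge-list representation are hypothetical; the patent does not specify how the platform stores the connections.

```python
from graphlib import TopologicalSorter


def execution_order(edges):
    """Return task names in an order that respects upstream -> downstream edges."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): the downstream task depends on the upstream one
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


# Hypothetical three-node flow, as a user might connect it on the canvas.
edges = [("load_data", "train_model"), ("train_model", "evaluate")]
order = execution_order(edges)
print(order)
```

A cycle in the connections would make `static_order()` raise `graphlib.CycleError`, which is one way an orchestration interface could reject an invalid flow.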

[0045] Step 5: Arrange the multiple nodes of upstream and downstream tasks and the execution parameters of each task into a task flow and execute it in the form of containers;

[0046] In step 5, the task flow orchestration and scheduling platform renders the corresponding operators into a web interface based on the task execution parameters. Then, the user fills in the input parameters, and the task flow orchestration and scheduling platform passes the user input parameters to the image of the corresponding operator, thereby starting the task.

[0047] Step 6: After the container finishes running, return the results to the web interface for a visual display to the user.

[0048] In step 6, the container includes a control container and a business container; the control container is started, and the control container starts the business container to run the orchestrated task flow. The business container stores the visualization output of the running results in the /metric.json file; the control container obtains the /metric.json file from the business container and writes the /metric.json file to distributed storage; the control container reads the /metric.json file from the distributed storage, parses it, and displays it on the web interface.

[0049] In a task flow, variable passing between upstream and downstream tasks can be categorized into key-value variable passing, file passing, and cache passing.

[0050] For key-value type variable passing, each task node in the task flow will start the corresponding business container and control container. The business container will output the variable and write it to the /output file, and the control container will write it to the distributed storage. Downstream nodes read the upstream output as input through {{task_name.output}}.

[0051] For file transfer, each task automatically mounts its own distributed storage directory to the container. The container can write variables to the file in the upstream task and read the file in the downstream task.

[0052] For cache passing, each task's container will configure the address of the cached Redis as an environment variable, and users can read and write variables in the cache themselves within the task.

[0053] The following describes a method for orchestrating and running AI task flows according to the present invention in further detail with reference to specific embodiments:

[0054] As shown in Figures 1 and 2, users fill in parameters on the web interface according to the corresponding operator, and these parameters are saved to the platform's backend. The platform then starts a container based on the image, using the operator registered by the developer and the parameters entered by the user. The parameters are passed into the container. In other words, the container passes the user-entered parameters to the developer-defined code, thereby enabling the reuse of the developer's functional code. After the container finishes running, the results are returned to the web interface for a visual display to the user.

[0055] To implement the above process, the following six functions need to be implemented first:

[0056] 1. Task container: Containers should be used as the task carrier, which makes it easier to schedule and isolate AI task resources in a multi-machine environment. Figure 3 is a schematic diagram of a process serving as the task carrier: before a task runs, the container hosting it has already been started, so its resources are already occupied. Figure 4 is a schematic diagram of a container serving as the task carrier: tasks and containers correspond one to one, and a container is started, and resources consumed, only when a task needs to run. Therefore, using containers as task carriers achieves higher resource utilization and better resource isolation.
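To make the one-to-one correspondence between tasks and containers concrete, the sketch below builds a per-task pod description with explicit resource limits. It assumes a Kubernetes-style environment, which the text suggests by mentioning pods but does not state outright. The function name, image name, and default limits are hypothetical, and the sketch only constructs the manifest as a dictionary without submitting it to any cluster.

```python
def build_task_pod(task_name, image, args, cpu="1", memory="1Gi"):
    """Describe one pod for one task; nothing is reserved until it is created."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name},
        "spec": {
            # The operator must terminate, so the pod is never restarted.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "business",
                    "image": image,
                    "args": list(args),
                    # Per-task limits give each task its own resource boundary.
                    "resources": {"limits": {"cpu": cpu, "memory": memory}},
                }
            ],
        },
    }


# Hypothetical task built from a registered operator image.
pod = build_task_pod(
    "train-model",
    "registry.example.com/operators/decision-tree:1.0",
    ["--max-depth", "3"],
)
print(pod["spec"]["containers"][0]["resources"])
```

Because the manifest exists only when a task is scheduled, no worker has to sit idle holding resources between tasks, which is the contrast with the process-based carrier of Figure 3.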

[0057] 2. Formal Definition of Operators: Operators are not defined by the platform, but by the user. However, users must follow the platform's rules for operator definition. Users can implement any operator code, without any coding language restrictions. After the code development is complete, it can be packaged into a container image and then registered as an operator. Users only need to provide the platform with the relevant information about the input parameters required for the operator image to run.

[0058] Rules for operator definition:

[0059] 1) Operators must have a start time and an end time; they cannot be permanently online.

[0060] Since the operator ultimately exists as a task node in the task flow, it must have a start time and an end time; otherwise, the next task cannot be started.

[0061] 2) The operator code is not subject to language restrictions;

[0062] Since operators ultimately need to be built into container images, the implementation language of the operators is not important, as long as they can be encapsulated into images.

[0063] 3) The operator code must be able to accept startup parameters and startup commands used to start the code;

[0064] This allows operators to be reused, enabling different users to achieve different functionalities by setting different startup parameters. The startup parameters must be of type string, long text, JSON, etc.

[0065] 4) Operators must be able to be packaged into container images.

[0066] The final operator runs on a cloud-native environment, so it must exist in the form of an image, because the cloud-native environment is essentially container orchestration, and containers are started based on images.
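The four rules above can be illustrated with a minimal operator entrypoint: it accepts startup parameters on the command line, does its work, and then exits so the next task can start. This is only a sketch in Python (the rules allow any language); the parameter names `--input-path` and `--model-params` are hypothetical and not taken from the patent.

```python
import argparse
import json


def parse_startup_params(argv):
    """Parse startup parameters passed by the platform as strings or JSON."""
    parser = argparse.ArgumentParser(description="hypothetical operator entrypoint")
    parser.add_argument("--input-path", required=True)
    # JSON-typed parameters arrive as a string and are decoded here.
    parser.add_argument("--model-params", type=json.loads, default={})
    return parser.parse_args(argv)


def main(argv):
    params = parse_startup_params(argv)
    # The real operator logic would run here; this sketch only echoes its input.
    print(f"processing {params.input_path} with {params.model_params}")
    # Returning ends the process, giving the operator a definite end time.
    return 0


exit_code = main(["--input-path", "/data/train.csv", "--model-params", '{"max_depth": 3}'])
```

In a container image, the startup command would presumably invoke such an entrypoint with the arguments the platform supplies.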

[0067] The operators that users can implement themselves are divided into four categories:

[0068] 1) Request external interfaces of the platform via API;

[0069] This involves sending requests to external server interfaces and retrieving the corresponding returned data. Developers typically use these external platforms. For example, they might use an API to pull data to their local machine, such as downloading the Huggingface dataset, downloading a model, or importing data from an annotation platform.

[0070] 2) Call the platform's own interface via API;

[0071] These operators call internal interfaces, and developers are typically platform developers. Operators can automate the interaction between different modules within the platform via pipelines. For example, deploying inference services through a pipeline on the platform itself and automatically publishing them is similar to acting as a client, except that here it connects to the platform itself, allowing for automation and interaction with other platform modules. A simple pipeline cannot deploy online services; it can only deploy services with a defined end time. However, operators enable interaction between pipeline modules and inference service modules.

[0072] 3) Execute a standalone task;

[0073] Standalone operation refers to the ability to run independently within a single container without interacting with other application platforms. Examples include traditional machine learning model training, model evaluation, offline inference, and the processing of image, text, and audio data.

[0074] 4) Start a distributed cluster;

[0075] Starting, monitoring, and recycling this distributed cluster are tasks that require high-level permissions and complex parameter configurations. Examples include distributed TensorFlow, PyTorch, and distributed DeepSpeed.

[0076] For example, to develop a decision tree model operator that lets users reuse decision tree model training code while controlling only operator startup parameters such as input data addresses and model parameters, the following steps are needed:

[0077] 1) Develop training code for the decision tree model;

[0078] 2) Write a Dockerfile to define the environment required to run the code in 1);

[0079] 3) Write a shell script to package the code into an image and push the image to the image repository;

[0080] 4) Register the image as an operator, configure the startup parameters, and publish the template;

[0081] 5) Drag and drop operators in the pipeline to form task nodes, configure startup parameters, and execute.

[0082] The page for filling in parameters in the web interface is shown below. These are the startup parameters related to the training of a decision tree model operator.

[0083]

[0084] The platform can render the operator into a web interface based on the information of the operator input parameters. Then, the user fills in the input parameters, and the platform passes the user input parameters to the operator's image, thereby starting the task.
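One way to picture this step is a parameter description that drives both the rendered form and the arguments handed to the operator image. The sketch below is an assumption-laden illustration: the `ParamSpec` fields, the widget names, and the `--name value` argument convention are all hypothetical, since the patent does not define the registration schema.

```python
from dataclasses import dataclass

# Hypothetical mapping from parameter type to a front-end widget.
WIDGETS = {"string": "input", "text": "textarea", "json": "json-editor"}


@dataclass
class ParamSpec:
    name: str
    kind: str  # "string", "text", or "json"
    label: str
    default: str = ""


def render_form(specs):
    """Turn registered parameter info into form-field descriptions."""
    return [
        {"name": s.name, "label": s.label, "widget": WIDGETS[s.kind], "value": s.default}
        for s in specs
    ]


def build_container_args(specs, user_input):
    """Convert the filled-in form into startup arguments for the image."""
    args = []
    for spec in specs:
        value = user_input.get(spec.name, spec.default)
        args += [f"--{spec.name}", str(value)]
    return args


specs = [
    ParamSpec("input-path", "string", "Input data address"),
    ParamSpec("model-params", "json", "Model parameters", default="{}"),
]
form = render_form(specs)
container_args = build_container_args(specs, {"input-path": "/data/train.csv"})
print(container_args)
```

Keeping the form and the argument list derived from the same description means an operator author only declares parameters once at registration time.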

[0085] The visualization results of a task are written to /metric.json in the container by the user's code while the container is running. The format is defined as follows: it can visualize images, text, CSV data, ECharts source code, HTML source code, iframes embedding other pages, etc.
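The exact schema of the metric file is not reproduced in this text, so the following sketch uses a purely hypothetical layout (a JSON list of entries with `type` and `content` keys, filled with placeholder values) just to show how user code inside the business container might emit visualization output.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(entries, path="/metric.json"):
    """Serialize visualization entries to the metric file as JSON."""
    Path(path).write_text(
        json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Hypothetical entries: the "type"/"content" keys and values are illustrative only.
entries = [
    {"type": "text", "content": "training finished"},
    {"type": "csv", "content": "epoch,loss\n1,0.9\n2,0.7"},
    {"type": "html", "content": "<b>done</b>"},
]

# Demo: write to a temporary directory instead of the container root.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metric.json"
    write_metrics(entries, target)
    loaded = json.loads(target.read_text(encoding="utf-8"))
```

Inside a real business container the default path would be used, so the call reduces to `write_metrics(entries)`.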

[0086] The process flow for visualizing the results is shown in Figure 5.

[0087] 1. First, the task will start the control container, and at the same time, the backend will inform the control container of the task ID in the form of environment variables;

[0088] 2. The control container starts the business container to run the operator. The business container generates the output to be visualized and stores it in the /metric.json file in the format defined above.

[0089] 3. The control container proactively obtains /metric.json from the business container and then writes the file to distributed storage.

[0090] 4. The backend reads /metric.json from distributed storage;

[0091] 5. The front-end parses `/metric.json` and displays it in the web interface, as shown in Figure 6.
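The five steps above can be simulated locally with plain directories standing in for the business container's filesystem and for distributed storage. This is a sketch under stated assumptions: the environment variable name `TASK_ID`, the per-task storage layout, and the metric content are hypothetical.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def collect_metrics(business_root, storage_root, task_id):
    """Control-container step: copy metric.json into storage under the task ID."""
    src = Path(business_root) / "metric.json"
    dst_dir = Path(storage_root) / task_id
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / "metric.json"
    shutil.copyfile(src, dst)
    return dst


def load_metrics(storage_root, task_id):
    """Backend step: read the stored file so the front-end can parse it."""
    path = Path(storage_root) / task_id / "metric.json"
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    business_root = Path(tmp) / "business"
    storage_root = Path(tmp) / "storage"
    business_root.mkdir()
    # The business container writes its visualization output (placeholder content).
    (business_root / "metric.json").write_text(
        json.dumps([{"type": "text", "content": "hello"}]), encoding="utf-8"
    )
    # The backend tells the control container the task ID via an environment variable.
    task_id = os.environ.get("TASK_ID", "task-001")
    collect_metrics(business_root, storage_root, task_id)
    metrics = load_metrics(storage_root, task_id)
```

Keying the stored file by task ID is what lets the backend find the right output after the containers themselves have gone away.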

[0092] Task inputs and outputs:

[0093] The input and output of a task, that is, the transfer of variables between upstream and downstream tasks in a task flow, can take three forms. The first is the transfer of key-value type variables, the second is file transfer, and the third is cache transfer.

[0094] 1. Passing variables (for key-value type variables)

[0095] As shown in Figure 7, each task node in the task flow starts its corresponding business container and control container. The business container outputs variables, writes them to the `/output` file, and the control container writes them to distributed storage. Downstream nodes read the upstream output as input through `{{task_name.output}}`.

[0096] When the system reads a template variable written in {{}} that contains an upstream node name, it automatically converts it into a reference to that node's upstream output in Argo, and Argo then supplies the output variable that was transferred through object storage.

[0097] Because the control container completes the variable transfer before the business container starts, these variables can be used as startup parameters for the business container, which is the biggest difference from the other two scenarios.
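The substitution of `{{task_name.output}}` into downstream startup parameters can be sketched in plain Python. Note that this is a stand-in for illustration: according to the text the real system delegates the resolution to Argo, whereas here a dictionary of already-collected outputs plays that role, and the function and task names are hypothetical.

```python
import re

# Matches {{task_name.output}} with optional inner whitespace.
TEMPLATE = re.compile(r"\{\{\s*([A-Za-z_][\w-]*)\.output\s*\}\}")


def resolve_params(params, outputs):
    """Replace upstream-output templates in each startup parameter value."""

    def substitute(match):
        task = match.group(1)
        if task not in outputs:
            raise KeyError(f"no output recorded for upstream task {task!r}")
        return outputs[task]

    return {key: TEMPLATE.sub(substitute, value) for key, value in params.items()}


# Output previously written to /output by the upstream business container.
outputs = {"train_model": "/models/run1"}
resolved = resolve_params({"model-path": "{{train_model.output}}/best"}, outputs)
print(resolved)
```

Because resolution happens before the downstream business container starts, the resolved values can be passed as ordinary startup parameters.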

[0098] 2. File transfer (for large file variables)

[0099] As shown in Figure 8, each task automatically mounts its personal distributed storage directory into the container. The task container can write variables to a file in the upstream task and read the file in the downstream task.
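File-based passing amounts to two tasks agreeing on a path inside the shared mounted directory. In the sketch below a temporary directory stands in for the mounted distributed storage; the file name and payload are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def upstream_task(mount_dir):
    """Write a (potentially large) variable to a file in the mounted directory."""
    out = Path(mount_dir) / "features.json"
    out.write_text(json.dumps({"rows": 3, "columns": ["a", "b"]}), encoding="utf-8")
    return out.name


def downstream_task(mount_dir, filename):
    """Read the file that the upstream task left in the same directory."""
    return json.loads((Path(mount_dir) / filename).read_text(encoding="utf-8"))


# A temporary directory plays the role of the mounted storage directory.
with tempfile.TemporaryDirectory() as mount_dir:
    name = upstream_task(mount_dir)
    received = downstream_task(mount_dir, name)
```

Unlike key-value passing, nothing here is resolved before the container starts; the downstream task reads the file itself at run time.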

[0100] 3. Passing data from a cache (for large memory variables)

[0101] Each task's container will configure the address of the cached Redis as an environment variable, and users can read and write the variables in the cache themselves within the task.
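A task would first read the cache address from its environment and then read and write variables through a client. The sketch below assumes a hypothetical variable name `REDIS_ADDR` in `host:port` form, neither of which is specified in the patent, and uses a small in-memory stand-in so it runs without a Redis server; in a real task a Redis client library offering the same `set`/`get` operations would presumably take its place.

```python
import os


def cache_address(env=os.environ, var="REDIS_ADDR", default="localhost:6379"):
    """Parse a host:port cache address from an environment mapping."""
    host, _, port = env.get(var, default).rpartition(":")
    return host, int(port)


class InMemoryCache:
    """Stand-in with the set/get shape of a key-value cache client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


host, port = cache_address({"REDIS_ADDR": "cache.internal:6380"})
cache = InMemoryCache()
# Upstream task stores a variable; downstream task reads it back by key.
cache.set("flow-42:embedding-dim", "128")
value = cache.get("flow-42:embedding-dim")
```

Prefixing keys with a flow identifier, as in the example, is one simple way to keep concurrent task flows from overwriting each other's variables.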

[0102] Global constants for the task flow: These are user-defined and support Python objects, such as `datetime`. Users configure them themselves when using them. For example, the process for defining a time variable is as follows:

[0103] Define the constant, as shown in Figure 9:

[0104] YYYYMMDD={{datetime.datetime.now().strftime('%Y-%m-%d')}}

[0105] Use the constant, as shown in Figure 10:

[0106] v{{YYYYMMDD}}

[0107] Task flow direction control: As shown in Figure 11, the downstream task to be run is determined by the name of the downstream task contained in the last line of the task's log.
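This branching rule can be sketched as a small function that inspects the last non-empty log line. The matching strategy (exact match against whitespace-separated tokens) and the task names are assumptions made for illustration; the patent only states that the last line contains the downstream task's name.

```python
def next_task(log_text, downstream_tasks):
    """Pick the downstream task whose name appears in the last log line."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return None
    tokens = lines[-1].split()
    for name in downstream_tasks:
        # Exact token match avoids confusing names that share a prefix.
        if name in tokens:
            return name
    return None


# Hypothetical log from an evaluation task that decides the branch.
log = "epoch 1 done\nscore below threshold\nretrain\n"
chosen = next_task(log, ["deploy", "retrain"])
print(chosen)
```

A task can therefore steer the flow simply by printing the chosen downstream task's name as its final line of output.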

[0108] This invention uses containers to carry out task execution, so resources are occupied only while a task is executing, achieving higher resource utilization and better resource isolation. Furthermore, this invention enables task flow orchestration through drag-and-drop, allowing users to quickly create and modify task processes without complex configuration or programming, thus improving work efficiency.

[0109] The present invention also provides an electronic device, including a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor. The transceiver, the memory, and the processor are connected via the bus. The computer program, when executed by the processor, implements the steps in the above-described method for orchestrating and running an AI task flow. Compared with the prior art, the beneficial effects of the electronic device provided by the present invention are the same as those of the AI task flow orchestration and running method described in the above-described technical solution, and will not be repeated here.

[0110] The present invention also provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps in the above-described method for orchestrating and running an AI task flow. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the present invention are the same as the beneficial effects of the above-described method for orchestrating and running an AI task flow, and will not be repeated here.

[0111] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for orchestrating and running an AI task flow, characterized in that it comprises:
Step 1: Users develop their own code for custom operators;
Step 2: Package the custom operator into an image; in step 2, the custom operator is encapsulated into a container image, and the custom operator has a start time and an end time;
Step 3: Register the image to the operator library in the task flow orchestration and scheduling platform;
Step 4: Users use drag-and-drop to orchestrate the task flow of operators and configure the execution parameters of tasks in the web interface;
Step 5: The multiple nodes of upstream and downstream tasks and the execution parameters of each task are finally orchestrated into a task flow; the tasks in the task flow are executed in the form of containers, and there is a one-to-one correspondence between tasks and containers; in step 5, the task flow orchestration and scheduling platform renders the corresponding operators into a web interface according to the task execution parameters, then the user fills in the input parameters, and the task flow orchestration and scheduling platform passes the user input parameters to the image of the corresponding operator, thereby starting the task;
Step 6: After the container finishes running, return the results to the web interface for a visual display to the user; in step 6, the container includes a control container and a business container; the control container starts the business container to run the orchestrated task flow; the business container stores the visual output of the results in the /metric.json file; the control container retrieves the /metric.json file from the business container and writes it to distributed storage; the control container then reads the /metric.json file from the distributed storage, parses it, and displays it on the web interface;
for key-value type variable passing, each task node in the task flow will start the corresponding business container and control container; the business container will output the variable and write it to the /output file, and the control container will write it to the distributed storage; downstream nodes read the upstream output as input through {{task_name.output}};
for file transfer, each task will automatically mount its own distributed storage directory to the container; the container can write variables to the file in the upstream task and read the file in the downstream task;
for cache passing, each task's container will configure the address of the cached Redis as an environment variable, and users can read and write variables in the cache themselves within the task.

2. An electronic device comprising a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the transceiver, the memory, and the processor are connected via the bus, characterized in that, when the computer program is executed by the processor, it implements the steps in the method for orchestrating and running an AI task flow as described in claim 1.

3. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps in the method for orchestrating and running an AI task flow as described in claim 1.