Model management method and device for model training and inference based on orchestration system

By integrating model training, storage, and deployment functions into the orchestration system, and using directed acyclic graphs and an orchestration engine to manage the lifecycle of machine learning models automatically, the method overcomes the separation of model training from deployment and achieves efficient, automated end-to-end management and resource optimization.

CN122198130A | Pending | Publication Date: 2026-06-12 | ULTRAPOWER SOFTWARE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
ULTRAPOWER SOFTWARE
Filing Date
2026-03-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing machine learning model lifecycle management, the separation of model training, storage, and deployment functions leads to processes that are cumbersome, error-prone, and inefficient in task coordination.

Method used

By adopting an orchestration-based approach, model training, storage, and deployment functions are integrated into a single platform. A training task is defined as a directed acyclic graph built by dragging and dropping nodes in a visual interface. The orchestration engine resolves dependencies and executes the subtasks automatically, and computing resources and image files are allocated dynamically based on model metadata and the resource status of the computing cluster, achieving fully automated management of the entire process.

Benefits of technology

It achieves fully automated management of the entire process from model development to deployment, improves the work efficiency and collaboration efficiency of R&D personnel, and keeps inference services running efficiently, with good resource utilization, under different load conditions.

✦ Generated by Eureka AI based on patent content.


  • Figure CN122198130A_ABST

Abstract

This application provides a model management method and apparatus based on an orchestration system for model training and inference, which integrates model training, storage, and automatic deployment functions onto a single platform, achieving fully automated management of the entire process from model development to deployment. The method includes: acquiring a directed acyclic graph (DAG) generated by the user through a visual interface by dragging and configuring multiple nodes; this graph is used to execute the target model training task, with each node corresponding to a subtask; responding to the user's start command for the DAG, resolving the logical dependencies between nodes through an orchestration engine, and executing the subtasks sequentially according to the node configuration information; after the training process is completed, uploading the trained model file and corresponding metadata to a storage center; responding to the user's deployment command for the model file, dynamically determining the deployment resources and image file based on the metadata and real-time resources of the computing cluster, and using the model file to create and expose the inference service of the target model in the computing cluster.

Description

Technical Field

[0001] This application relates to the field of machine learning lifecycle management technology, and in particular to a model management method and apparatus for model training and inference based on an orchestration system.

Background Technology

[0002] With the widespread application of machine learning technology, the complete lifecycle management of models, from development and training to final deployment, has become increasingly important. A typical machine learning model lifecycle usually includes multiple stages such as data preparation, model training, model storage, and service deployment.

[0003] Currently, the industry typically manages the lifecycle of machine learning models using a technical solution pieced together from multiple independent tools. For example, a separate training framework is used for model development and training, the resulting model files are uploaded to a separate model repository for storage, and finally, a separate deployment platform loads and publishes the model as an online inference service. In this loose architecture, the various systems rely heavily on manual coordination; that is, operations or development personnel need to manually perform configuration conversions, file transfers, and interface integrations between different systems. However, this approach requires multiple manual transfers and configurations between multiple systems from training completion to deployment, making the process cumbersome, error-prone, and inefficient in terms of task coordination.

[0004] Therefore, there is an urgent need for a management solution that integrates model training, storage, and deployment functions on the same platform to achieve fully automated management of the entire process from model development to deployment.

Summary of the Invention

[0005] This application provides a model management method and apparatus for model training and inference based on an orchestration system, which can integrate model training, storage and automatic deployment functions into the same platform to achieve fully automated management of the entire process from model development to online deployment.

[0006] Firstly, a model management method based on an orchestration system for model training and inference is provided, the method including: The system obtains a directed acyclic graph (DAG) generated after the user drags and drops multiple nodes through a visual interface and configures them. The DAG is used to perform the training task of the target model, and each node corresponds to a subtask of the training task. In response to the user's command to initiate the operation of the directed acyclic graph, the orchestration engine parses the logical dependencies between nodes and executes the corresponding subtasks in sequence according to the configuration information of each node based on the logical dependencies. After the training process of the target model is completed, the trained model file and metadata of the target model are uploaded to the storage center. The metadata includes the deployment configuration information corresponding to the model file. In response to the user's deployment command for the model file of the target model stored in the storage center, the associated metadata is obtained through the storage center. Based on the metadata and the real-time resource status of the computing power cluster, the computing resources required for deploying the model file and at least one first image file are dynamically determined. Each first image file is a container image containing the execution logic of an inference service of the target model. Based on the computing resources required to deploy the model file, at least one first image file, the corresponding deployment configuration information, and the model file, the inference service corresponding to the target model is automatically created and exposed in the computing power cluster.

[0007] In a feasible design, the metadata includes the size and framework of the model file, and dynamically determining, based on the metadata and the real-time resource status of the computing cluster, the computing resources required to deploy the model file and the at least one first image file includes: dynamically allocating the computing resources required to deploy the model file based on the size of the model file, the performance requirements in the deployment configuration information, and the real-time resource status of the computing cluster; and determining, based on the model framework corresponding to the model file, at least one first image file of the latest version that is compatible with that framework.
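The two determinations in [0007] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `InferenceImage` record, the semantic-version tuple, and the sizing rule (model size times a hypothetical overhead factor of 1.5, never below the configured performance requirement, rounded up to whole GPU cards) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class InferenceImage:
    name: str        # hypothetical image name in the image repository
    framework: str   # model framework the image can serve, e.g. "pytorch"
    version: tuple   # assumed semantic version as (major, minor, patch)


def latest_compatible_images(images, framework, count=1):
    """Return up to `count` of the newest images that support `framework`."""
    compatible = [img for img in images if img.framework == framework]
    compatible.sort(key=lambda img: img.version, reverse=True)
    return compatible[:count]


def size_gpu_request(model_size_gb, required_gpu_mem_gb, gpu_mem_per_card_gb,
                     free_cards, overhead=1.5):
    """Hypothetical sizing rule: model size times a runtime overhead factor,
    but never below the performance requirement in the deployment config."""
    needed_gb = max(model_size_gb * overhead, required_gpu_mem_gb)
    cards = math.ceil(needed_gb / gpu_mem_per_card_gb)
    # Compare against the cluster's real-time free capacity.
    if cards > free_cards:
        raise RuntimeError(f"need {cards} GPU cards, only {free_cards} free")
    return {"gpu_cards": cards, "gpu_memory_gb": needed_gb}
```

In a real deployment, `free_cards` would come from a live query of the cluster rather than a constant.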

[0008] In a feasible design, the container orchestration system of the computing power cluster is Kubernetes. Automatically creating and exposing the inference service corresponding to the target model in the computing power cluster, based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file, includes: determining service configuration information based on the computing resources required to deploy the model file, the at least one first image file, and the corresponding deployment configuration information, where the service configuration information includes the deployment configuration information and one or more of the following parameters: service name, namespace, resource requests and limits, auto-scaling policy, and health check and readiness check settings; and sending the service configuration information to Kubernetes through an interface to trigger Kubernetes to create and expose the inference service corresponding to the target model in the computing cluster based on the service configuration information and the model file.
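The service configuration information in [0008] maps naturally onto standard Kubernetes objects. The sketch below builds a Deployment (resource requests and limits, readiness and liveness probes), a Service (exposure), and an `autoscaling/v2` HorizontalPodAutoscaler (auto-scaling policy) as plain dicts. The function name, the `MODEL_URI` environment variable, the `/healthz` path, and the 70% CPU target are hypothetical choices, not details from the source.

```python
def build_service_config(name, namespace, image, model_uri, port, cpu, memory,
                         gpus=0, min_replicas=1, max_replicas=3,
                         health_path="/healthz"):
    """Build Deployment, Service, and HorizontalPodAutoscaler manifests."""
    labels = {"app": name}
    resources = {"cpu": cpu, "memory": memory}
    if gpus:
        # Extended resource name exposed by the NVIDIA device plugin.
        resources["nvidia.com/gpu"] = str(gpus)
    container = {
        "name": name,
        "image": image,  # a "first image file" holding the inference logic
        "ports": [{"containerPort": port}],
        # Hypothetical env var telling the image where to fetch the model file.
        "env": [{"name": "MODEL_URI", "value": model_uri}],
        "resources": {"requests": dict(resources), "limits": dict(resources)},
        "readinessProbe": {"httpGet": {"path": health_path, "port": port}},
        "livenessProbe": {"httpGet": {"path": health_path, "port": port}},
    }
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": min_replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels},
                         "spec": {"containers": [container]}},
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"selector": labels, "ports": [{"port": port, "targetPort": port}]},
    }
    hpa = {
        "apiVersion": "autoscaling/v2", "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                               "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{"type": "Resource", "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70}}}],
        },
    }
    return [deployment, service, hpa]
```

The resulting manifests could then be submitted through the Kubernetes API, for example with `kubectl apply` or an official client library.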

[0009] In one feasible design, the metadata includes non-deployment configuration information other than the deployment configuration information. The storage center stores a first association relationship, which associates the non-deployment configuration information with the deployment configuration information of successfully deployed models. Before the trained model file and metadata of the target model are uploaded to the storage center, the method further includes a step of determining the deployment configuration information, which includes: determining the similarity between the non-deployment configuration information in the metadata and the non-deployment configuration information in the first association relationship; determining, as reference deployment configuration information, the deployment configuration information that the first association relationship associates with non-deployment configuration information whose similarity meets a preset condition; and determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information.

[0010] In a feasible design, determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information includes: obtaining a user-defined deployment strategy, which specifies whether to use a resource-saving mode or a high-performance mode; and determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information and the deployment strategy.
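The source does not say how each mode changes the reference configuration. As one possible reading of [0010], the sketch below derives the final configuration from the reference configuration by halving the replica ceiling in resource-saving mode and doubling replicas and CPU in high-performance mode. The field names and all scaling factors are illustrative assumptions.

```python
import copy


def apply_deployment_strategy(reference, strategy):
    """Derive a deployment config from a reference config and a user strategy.

    The adjustments below are hypothetical; the source only names the two modes.
    """
    config = copy.deepcopy(reference)  # never mutate the stored reference
    if strategy == "resource_saving":
        config["min_replicas"] = 1
        config["max_replicas"] = max(1, reference["max_replicas"] // 2)
    elif strategy == "high_performance":
        config["min_replicas"] = reference["min_replicas"] * 2
        config["max_replicas"] = reference["max_replicas"] * 2
        config["cpu_cores"] = reference["cpu_cores"] * 2
    else:
        raise ValueError(f"unknown deployment strategy: {strategy!r}")
    return config
```

The function works on a deep copy, so the reference configuration stored in the first association relationship is left unchanged.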

[0011] In a feasible design, the metadata also includes the version of the model file, and the method also includes: In response to a user's rollback command for a target historical version of the target model, the metadata of the corresponding target historical model file stored in the storage center is determined based on the target historical version, where the target historical version is earlier than the version of the model file. Based on the metadata corresponding to the target historical model file and the real-time resource status of the computing cluster, determine the computing resources required to deploy the target historical model file and at least one first image file; Based on the computing resources required to deploy the target historical model file, at least one first image file, and the corresponding deployment configuration information, the inference service corresponding to the target model is recreated and exposed in the computing power cluster.
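The version check in the rollback step of [0011] can be sketched as a lookup over stored metadata records. The `ModelRecord` type and its `created_at` ordering field are hypothetical stand-ins for whatever the storage center actually keeps. After validation, the returned record's metadata would feed the same resource and image determination used for a normal deployment.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    version: str                 # "<model file name>-<timestamp>"
    created_at: int              # hypothetical field used to order versions
    deploy_config: dict = field(default_factory=dict)


def resolve_rollback(records, current_version, target_version):
    """Return the record to redeploy, requiring it to predate the current one."""
    by_version = {r.version: r for r in records}
    if current_version not in by_version or target_version not in by_version:
        raise KeyError("unknown model version")
    current, target = by_version[current_version], by_version[target_version]
    if target.created_at >= current.created_at:
        raise ValueError("rollback target must be earlier than the current version")
    return target
```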

[0012] In a feasible design, the deployment configuration information includes the name of the model file, the identifier of the computing cluster, the name of the available first image file, the computing power configuration parameters required to execute each first image file, the computing power configuration parameters required to run the model file, and the configuration parameters for calling the interface of the target model.

[0013] In one feasible design, the configuration information of the configured node includes the node name, the name of the second image file, and the computing power configuration parameters required to execute the second image file. When the node is the first node, the configuration information of the node also includes the name of the initial model file of the target model and the name of the training dataset. The second image file is a container image containing the code that executes the node.

[0014] In a feasible design, the method also includes: Based on the code of the second image file corresponding to each node, generate the executable code template for the corresponding subtask; In response to the user's editing command on the code template of the target node, a code editing interface is provided to the user so that the user can perform at least one of the following operations on the code template: Fill in the parameters required to run the second image file and add debugging commands.

[0015] Secondly, a model management device based on an orchestration system for model training and inference is provided, comprising: The Directed Acyclic Graph (DAG) acquisition module is used to acquire the DAG generated after the user drags and drops multiple nodes through the visual interface and configures them. The DAG is used to perform the training task of the target model, and each node corresponds to a subtask of the training task. The model training execution module is used to respond to the user's command to start the operation of the directed acyclic graph. It parses the logical dependencies between nodes through the orchestration engine and executes the corresponding subtasks in sequence according to the configuration information of each node based on the logical dependencies. The model file upload module is used to upload the trained model file and metadata of the target model to the storage center after the training process of the target model is completed. The metadata includes the deployment configuration information corresponding to the model file. The model deployment module is used to respond to user deployment instructions for model files of target models stored in the storage center. It obtains the associated metadata through the storage center and dynamically determines the computing resources required to deploy the model files and at least one first image file based on the metadata and the real-time resource status of the computing power cluster. Each first image file is a container image containing the execution logic of an inference service of the target model. The model deployment module is also used to automatically create and expose the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the model file, at least one first image file, the corresponding deployment configuration information and the model file.

[0016] In this embodiment, after the user generates a directed acyclic graph for model training by dragging and dropping nodes in the visual interface provided by the platform, the orchestration engine automatically resolves dependencies and executes the subtasks of each node in sequence. This integrates the definition and execution of the training process within the platform, solving the problems of manually executing each subtask and of managing dependencies between tasks in the traditional approach, and achieves automated orchestration and observability of the training process. After training, the platform automatically uploads the generated model file and metadata to a unified storage center, where they are centrally managed. When deployment is required, the user can issue a deployment command by selecting the model file to be deployed without leaving the platform. This triggers the platform to dynamically allocate the computing resources required for deploying the model file and at least one first image file based on the metadata and the real-time resource status of the computing cluster, which keeps the inference service running efficiently under different load conditions and maximizes resource utilization. Finally, based on the computing resources, image files, model file, and corresponding deployment configuration information, the inference service is automatically created and exposed in the computing cluster. This application integrates the entire process of model training, associated storage of model files and their metadata, automatic allocation of deployment resources, and automated deployment into a single platform, achieving "one-stop" operation from model development to service launch. This solves the problems of fragmented toolchains and manual configuration in traditional solutions, thereby improving the work efficiency and collaboration efficiency of R&D personnel.
Attached Figure Description

[0017] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic flowchart of the model management method for model training and inference based on an orchestration system provided in an exemplary embodiment of this application. Figure 2 is a schematic diagram of the model management device for model training and inference based on an orchestration system provided in an exemplary embodiment of this application.

Detailed Implementation

[0019] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0020] To integrate model training, storage, and deployment functions onto a single platform, and thereby improve the efficiency and automation of the work from model development and training to final deployment, this application provides, as shown in Figure 1, a model management method for model training and inference based on an orchestration system. The method includes the following steps. S110: Obtain the directed acyclic graph generated after the user drags and drops multiple nodes in the visual interface and configures them.

[0021] The directed acyclic graph is used to perform the training task of the target model, and each node corresponds to a subtask of the training task.

[0022] The training tasks of the model include subtasks that can be set according to actual needs, such as data loading, data preprocessing, data splitting, model evaluation, and model deployment. Each subtask corresponds to an executable node on the visualization interface, and the logical dependencies between nodes (including the connection order) reflect the execution logic order of the subtasks.
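To show how logical dependencies between nodes fix the execution order, here is a minimal sketch using Python's standard `graphlib.TopologicalSorter`. It groups subtasks into stages whose members could run in parallel. The node names follow the example subtasks in [0022], with two hypothetical additions: a `model_training` node and a `data_validation` node that runs in parallel with preprocessing.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its predecessors).
TRAINING_DAG = {
    "data_loading": set(),
    "data_preprocessing": {"data_loading"},
    "data_validation": {"data_loading"},  # hypothetical parallel branch
    "data_splitting": {"data_preprocessing", "data_validation"},
    "model_training": {"data_splitting"},  # assumed; not listed in [0022]
    "model_evaluation": {"model_training"},
    "model_deployment": {"model_evaluation"},
}


def execution_stages(dag):
    """Group nodes into stages; every node in a stage has all inputs ready."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)  # in practice, mark done only after the subtask finishes
    return stages
```

Because `prepare()` rejects cycles, a graph a user draws with a loop would be caught before any subtask runs.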

[0023] Taking Argo Workflows as an orchestration engine as an example, after users drag and drop nodes and configure parameters in the visual interface, the nodes form a Directed Acyclic Graph (DAG). This DAG can be parsed by the orchestration engine into executable workflow tasks for model training.

[0024] In one feasible design, the configuration information of the configured node includes the node name, the name of the second image file, and the computing power configuration parameters required to execute the second image file. When the node is the first node, the configuration information of the node also includes the name of the initial model file of the target model and the name of the training dataset. The second image file is a container image containing the code that executes the node.

[0025] For example, the computing power configuration parameters required to execute the second image file include, but are not limited to, the number of CPU cores, CPU memory capacity, number of GPU cards, GPU type, GPU memory capacity, and storage capacity (e.g., the required hard disk capacity). These computing power configuration parameters can be set according to actual needs, and this application does not limit them.
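The node configuration in [0024] and [0025] can be modeled as a small typed record. In the sketch below, the field names are hypothetical. The first-node rule comes from the source. The GPU-type check is an added assumption, included only to show where further validation would go.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComputeSpec:
    cpu_cores: int
    cpu_memory_gb: int
    gpu_cards: int = 0
    gpu_type: Optional[str] = None
    gpu_memory_gb: int = 0
    storage_gb: int = 0


@dataclass
class NodeConfig:
    node_name: str
    image_name: str                          # the node's "second image file"
    compute: ComputeSpec
    initial_model: Optional[str] = None      # required for the first node only
    training_dataset: Optional[str] = None   # required for the first node only

    def validate(self, is_first_node: bool) -> None:
        if is_first_node and not (self.initial_model and self.training_dataset):
            raise ValueError(
                f"{self.node_name}: first node needs an initial model and a training dataset")
        # Hypothetical consistency rule, not stated in the source.
        if self.compute.gpu_cards > 0 and not self.compute.gpu_type:
            raise ValueError(f"{self.node_name}: gpu_type is required when GPUs are requested")
```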

[0026] It should be noted that the image files of this application are stored in an image repository, which can be used to manage and access the image files stored therein, ensuring that tasks on each node are executed in the specified runtime environment. The model files of this application (including the initial model files before training and the model files generated after training) are stored in a storage center, and the training dataset can also be stored in the storage center. The storage center can store unstructured data and can be used to manage and access the model files and training datasets stored therein.

[0027] In the example above, users can configure node names through a visual interface, clearly identifying the function and sequence of each subtask, facilitating subsequent tracking and maintenance. By specifying a second image file name, it ensures that each node calls the corresponding image file from the image repository, enabling the node to execute the corresponding logic in the required runtime environment. By configuring the computing power parameters required to execute the second image file, it ensures that the node task receives sufficient computing resources to guarantee the efficient and stable operation of the training task. Furthermore, by configuring image files for nodes instead of directly injecting execution logic, the same image file can be used for multiple different training tasks, improving developer efficiency.

[0028] In a feasible design, the method also includes: Based on the code of the second image file corresponding to each node, generate the executable code template for the corresponding subtask; In response to the user's editing command on the code template of the target node, a code editing interface is provided to the user so that the user can perform at least one of the following operations on the code template: Fill in the parameters required to run the second image file and add debugging commands.

[0029] In the above embodiments, the executable code template is a runnable code template generated based on the code in the second image file corresponding to the node, and is specifically designed for the subtask of that node. It extracts the fixed execution logic structure from the image file, retains the core execution flow of the subtask, and eliminates the need for users to write code from scratch. When a user triggers an editing command, this application provides a code editing interface, allowing users to supplement the personalized parameters required for running the second image file, enabling the framework to adapt to specific training scenarios; or, to add debugging commands for troubleshooting and performance optimization during task execution, thereby improving the flexibility and controllability of model training.
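As a rough illustration of the parameter-filling and debug-command editing described in [0029], the sketch below renders a template with the standard `string.Template` class. The template text and the `/app/preprocess.py` path are hypothetical. In the described system, the template would be derived from the code packaged in the node's second image file.

```python
from string import Template

# Hypothetical template; the real one would come from the node's image code.
PREPROCESS_TEMPLATE = Template(
    "python /app/preprocess.py --input $input_path --output $output_path"
)


def render_subtask_script(template, params, debug_commands=()):
    """Fill user-supplied parameters and prepend optional debugging commands.

    Template.substitute raises KeyError when a parameter is missing, which
    surfaces incomplete configuration before the node runs.
    """
    lines = ["#!/bin/sh", "set -e"]
    lines.extend(debug_commands)  # e.g. "set -x" to trace each command
    lines.append(template.substitute(params))
    return "\n".join(lines) + "\n"
```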

[0030] S120: In response to the user's command to start running the directed acyclic graph, resolve the logical dependencies between nodes through the orchestration engine, and execute the corresponding subtasks in sequence according to the configuration information of each node and the logical dependencies.

[0031] Specifically, Argo Workflows transforms the directed acyclic graph into custom Workflow resources within a Kubernetes-managed computing cluster. The Argo Workflows controller listens to and parses these custom Workflow resources. Based on the logical dependencies between nodes, the corresponding container groups (Pods) are created and run sequentially for each node in the computing cluster by calling the Kubernetes application programming interface. The container groups are created and run based on the second image file configured for the corresponding node and the corresponding computing power configuration parameters to execute the corresponding subtasks.
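The transformation in [0031] can be sketched as a function that emits an Argo `Workflow` manifest (`argoproj.io/v1alpha1`). The manifest contains one DAG template whose tasks list their `dependencies`, and one container template per node carrying the node's image and compute parameters. The input dict shape and the `main` entrypoint name are assumptions. Node names must therefore be valid Kubernetes-style names and must not be `main`.

```python
def to_argo_workflow(workflow_name, nodes, dependencies):
    """Convert configured nodes and their dependency edges into an Argo
    Workflow manifest: one DAG template plus one container template per node."""
    templates = []
    tasks = []
    for node in nodes:
        res = {"cpu": str(node["cpu_cores"]), "memory": node["memory"]}
        templates.append({
            "name": node["name"],
            "container": {
                "image": node["image"],  # the node's "second image file"
                "resources": {"requests": dict(res), "limits": dict(res)},
            },
        })
        task = {"name": node["name"], "template": node["name"]}
        upstream = sorted(dependencies.get(node["name"], ()))
        if upstream:
            task["dependencies"] = upstream  # Argo starts the task after these succeed
        tasks.append(task)
    dag_template = {"name": "main", "dag": {"tasks": tasks}}
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{workflow_name}-"},
        "spec": {"entrypoint": "main", "templates": [dag_template, *templates]},
    }
```

Once this resource is submitted, the Argo controller creates one Pod per task in dependency order, as [0031] describes.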

[0032] S130: After the training process of the target model is completed, the trained model file and metadata of the target model are uploaded to the storage center.

[0033] The metadata includes deployment configuration information corresponding to the model file and non-deployment configuration information other than deployment configuration information.

[0034] Specifically, the node in the DAG used to publish the model file (usually the last node) can extract the metadata corresponding to the model file and upload the trained model file and metadata of the target model to the storage center.

[0035] For example, a Directed Acyclic Graph (DAG) includes nodes such as data loading, data preprocessing, data splitting, model evaluation, and model deployment. Once the model evaluation node completes, the target model training process is finished. At this point, the model deployment node is triggered, extracting the metadata corresponding to the trained model file. The trained model file and its metadata are automatically uploaded to a designated storage center. The storage center persists the model file to MinIO object storage and writes the metadata to a MySQL database. Model files and their corresponding metadata are associated through their names, facilitating unified management and version control of the model files.
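The name-based association between model files and metadata in [0035] can be illustrated without external services. This sketch deliberately swaps in stand-ins: a local directory plays the role of MinIO object storage, and an SQLite table plays the role of the MySQL metadata database. The object key and the primary key of the metadata row are both the version string, which embeds the model name. The class and table names are hypothetical.

```python
import json
import shutil
import sqlite3
from pathlib import Path


class LocalStorageCenter:
    """Stand-in storage center: a local directory instead of MinIO,
    SQLite instead of MySQL."""

    def __init__(self, object_root, db_path=":memory:"):
        self.object_root = Path(object_root)
        self.object_root.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS model_metadata ("
            " version TEXT PRIMARY KEY, model_name TEXT NOT NULL, metadata TEXT NOT NULL)"
        )

    def upload(self, model_file, model_name, version, metadata):
        # Object key and metadata row share the version string.
        object_path = self.object_root / version
        shutil.copyfile(model_file, object_path)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO model_metadata (version, model_name, metadata) VALUES (?, ?, ?)",
                (version, model_name, json.dumps(metadata)),
            )
        return object_path

    def fetch(self, version):
        row = self.db.execute(
            "SELECT metadata FROM model_metadata WHERE version = ?", (version,)
        ).fetchone()
        if row is None:
            raise KeyError(version)
        return self.object_root / version, json.loads(row[0])
```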

[0036] For example, the non-deployment configuration information in the metadata includes one or more of the following: the hyperparameters used in the training process that produced the model file, the name and version of the second image file used by each node during training, the model file version, the model file size, performance metrics, the model framework, timestamps, and model training configuration parameters.

[0037] For example, the model file version is generated using the format of "model file name + timestamp" to ensure that the model file version produced by each training is unique and traceable, which facilitates subsequent model iteration management and online rollback.
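The "model file name + timestamp" scheme in [0037] can be sketched as follows. The hyphen separator and the second-resolution UTC timestamp format are assumptions. With second resolution, two trainings of the same model that finish within the same second would collide, so a real system might add finer precision.

```python
from datetime import datetime, timezone


def make_model_version(model_name, now=None):
    """Build a "<model file name>-<UTC timestamp>" version string."""
    now = now or datetime.now(timezone.utc)
    return f"{model_name}-{now.strftime('%Y%m%d%H%M%S')}"


def version_timestamp(version):
    """Recover the timestamp (split on the last hyphen, since names may contain hyphens)."""
    return datetime.strptime(version.rsplit("-", 1)[1], "%Y%m%d%H%M%S")
```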

[0038] For example, the hyperparameters used in the training process that produced the model file include one or more of the following: learning rate, which controls how much the model adjusts its parameters in response to prediction error (analogous to the size of the steps a student takes when correcting a mistake); batch size, which affects the stability of model updates and the training speed (analogous to how many exercises a student completes as a batch); number of layers in a neural network, which determines the model's capacity (more layers can solve more complex problems but are also more prone to overfitting); regularization parameter, which penalizes model complexity to prevent overfitting (analogous to requiring concise answers); number of trees, that is, the size of the "expert committee" in a random forest, which affects stability and accuracy; and number of neurons, which controls the complexity and fitting capacity of the model.

[0039] For example, the performance metrics include one or more of the following. Accuracy is calculated as (TP+TN) / (TP+TN+FP+FN), where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively; it is the proportion of correctly classified samples among all samples. Precision is calculated as TP / (TP+FP); it is the proportion of actual positives among the samples predicted as positive. Recall is calculated as TP / (TP+FN); it is the proportion of actual positive samples that are correctly predicted as positive. The F1 score is calculated as 2 × (precision × recall) / (precision + recall); it is the harmonic mean of precision and recall and is used to evaluate the model comprehensively on imbalanced data. Mean squared error (MSE) is calculated as $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is the true value of the i-th sample, $\hat{y}_i$ is its predicted value, and n is the total number of samples; it measures the deviation between predicted and true values and penalizes large errors more heavily. Mean absolute error (MAE) is calculated as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$; it measures the absolute deviation between predicted and actual values and is less sensitive to outliers than MSE. Expected inference latency is the time the model takes to complete a single inference request. Throughput is the number of requests the model can process per unit of time.
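The metric definitions in [0039] translate directly into code. The sketch below is a plain reference implementation of those formulas, which a model-evaluation node might compute before the metrics are written into the metadata. Returning 0.0 when a denominator is zero is a convention chosen here, not something the source specifies.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MSE = mean of squared errors; MAE = mean of absolute errors."""
    n = len(y_true)
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    return {"mse": mse, "mae": mae}
```

For example, with TP = 40, TN = 50, FP = 10, and FN = 0, accuracy is 0.9, precision is 0.8, recall is 1.0, and F1 is 16/18 ≈ 0.889.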

[0040] For example, model training configuration parameters include one or more of the following parameters: Learning rate, batch size, number of training epochs, optimizer, model type (i.e., the model architecture of the target model), and regularization method.

[0041] The above embodiments store metadata extracted from model files. Since the model file version in the metadata includes the model file name, and the model file's own attributes also include its name, the model file and metadata can be stored by association through the model file name. This achieves centralized management of information such as model files, dependent images, and hyperparameters, avoiding the problem of scattered storage. This allows developers to quickly trace the model's training configuration and performance, improves the efficiency of model file management, and facilitates subsequent horizontal comparisons and iterative optimization of model performance.

[0042] In one feasible design, the deployment configuration information includes the name of the model file, the identifier of the computing cluster, the name of the available first image file, the computing power configuration parameters required to execute each first image file, the computing power configuration parameters required to run the model file, and the configuration parameters for calling the interface of the target model. Each first image file is a container image containing the execution logic of an inference service of the target model.

[0043] The model file name can be used by Kubernetes to pull the model file from the storage center and match it with metadata. The identifier of the computing power cluster is used to determine the deployment environment for deploying the model file. The name of the first image file can be used by Kubernetes to pull the corresponding image file from the image repository. The computing power configuration parameters required to execute the first image file are used to specify the minimum computing resources required to execute the first image file (i.e., the performance requirements for executing the first image file). The computing power configuration parameters required to run the model file are used to specify the minimum computing resources required to run the model file (i.e., the performance requirements for running the target model). The configuration parameters for calling the interface of the target model are used to expose services.

[0044] For example, the computing power configuration parameters required to execute at least one first image file include, but are not limited to, the number of CPU cores, CPU memory capacity, number of GPU cards, GPU type, GPU memory capacity, and storage capacity (e.g., the required hard disk capacity). These computing power configuration parameters can be set according to actual needs, and this application does not limit them.

[0045] For example, the computing power configuration parameters required to run the model file include, but are not limited to, GPU type, number of GPU cards, memory capacity, number of CPU cores, mounted volumes (including the base model path, fine-tuning model path, etc.), startup command, and environment variables. These computing power configuration parameters can be set according to actual needs, and this application does not limit them.

[0046] For example, the configuration parameters for calling the target model's interface include, but are not limited to, port number, interface name, interface description, component path, request method, request path, request parameters (including path parameters, query parameters, request body, etc.), data format, response format, and security authentication method. The configuration parameters for calling the target model's interface can be set according to actual needs, and this application does not impose any limitations on them.
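The deployment configuration information described in paragraphs [0042] to [0046] can be pictured as a small structured record. The sketch below is illustrative only: the class names, field names, and the `validate` check are hypothetical, since the application lists the kinds of information but does not define a concrete schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComputeSpec:
    """Computing power configuration parameters (hypothetical field names)."""
    cpu_cores: int
    memory_gib: int
    gpu_cards: int = 0
    gpu_type: str = ""
    gpu_memory_gib: int = 0
    storage_gib: int = 0


@dataclass
class ModelRuntimeSpec:
    """Parameters for running the model file, per paragraph [0045]."""
    compute: ComputeSpec
    volumes: Dict[str, str] = field(default_factory=dict)  # e.g. base / fine-tuned model paths
    command: List[str] = field(default_factory=list)       # startup command
    env: Dict[str, str] = field(default_factory=dict)      # environment variables


@dataclass
class InterfaceSpec:
    """Parameters for calling the target model's interface, per paragraph [0046]."""
    port: int
    name: str
    method: str = "POST"
    path: str = "/predict"
    response_format: str = "application/json"
    auth: str = "token"


@dataclass
class DeploymentConfig:
    model_file_name: str
    cluster_id: str
    image_names: List[str]                 # available first image files
    image_compute: Dict[str, ComputeSpec]  # parameters per first image file
    model_runtime: ModelRuntimeSpec
    interface: InterfaceSpec

    def validate(self) -> None:
        # Every listed first image file needs its own computing power parameters.
        missing = [n for n in self.image_names if n not in self.image_compute]
        if missing:
            raise ValueError(f"missing compute parameters for images: {missing}")
        if not 0 < self.interface.port < 65536:
            raise ValueError(f"port out of range: {self.interface.port}")
```

Keeping the per-image parameters in a dictionary keyed by image name mirrors the one-to-one pairing between first image files and their computing power configuration parameters.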

[0047] The content of the deployment configuration information can be set according to actual needs, and this application does not limit it. For example, it can also include the English name of the model, the Chinese name of the model, the model description, and other information. Because this content is configurable, the solution in this application is applicable to various inference frameworks and cluster environments.

[0048] In one feasible design, the metadata includes non-deployment configuration information other than the deployment configuration information, and the storage center stores a first association relationship that associates the non-deployment configuration information with the deployment configuration information of successfully deployed models. Before the trained model file and metadata of the target model are uploaded to the storage center, the method further includes determining the deployment configuration information, which includes: determining the similarity between the non-deployment configuration information in the metadata and the non-deployment configuration information in the first association relationship; determining, as reference deployment configuration information, the deployment configuration information in the first association relationship that corresponds to the non-deployment configuration information whose similarity meets a preset condition; and determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information.

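[0049] For example, the preset condition is that the similarity is the highest, that is, the selected entry in the first association relationship is the one whose non-deployment configuration information is most similar to the non-deployment configuration information in the metadata of the model file.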

[0050] For example, deployment configuration information can also be obtained through a deployment configuration interface on which users enter it. The above embodiments establish, in advance, a first association relationship in the storage center between the non-deployment configuration information and the deployment configuration information of successfully deployed models. Before a model file is uploaded, the non-deployment configuration information in its metadata is matched against that of the successfully deployed models in the association relationship, and the historical deployment configuration information corresponding to the entries that meet the preset similarity condition is used as a reference for determining the deployment configuration information of the model file. This method makes full use of historical deployment experience, recommends reasonable deployment parameters through similarity matching, improves the efficiency and success rate of model deployment, and keeps deployments of new models stable and consistent under similar conditions.
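The application does not specify how the similarity between non-deployment configuration information is measured. As a minimal sketch, assuming Jaccard similarity over key/value pairs (a stand-in metric chosen for illustration) and the "highest similarity" preset condition of paragraph [0049], the first association relationship can be modeled as a list of (non-deployment info, deployment info) pairs. The function names and the optional `min_score` threshold are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# One entry of the first association relationship:
# (non-deployment configuration info, deployment configuration info)
Record = Tuple[Dict[str, Any], Dict[str, Any]]


def similarity(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Jaccard similarity over (key, value) pairs; a stand-in metric."""
    pa = {(k, str(v)) for k, v in a.items()}
    pb = {(k, str(v)) for k, v in b.items()}
    if not pa and not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


def reference_deployment_config(meta: Dict[str, Any], association: List[Record],
                                min_score: float = 0.0) -> Optional[Dict[str, Any]]:
    """Return the deployment info of the most similar successfully deployed model."""
    best_score, best_cfg = -1.0, None
    for non_deploy, deploy in association:
        score = similarity(meta, non_deploy)
        if score > best_score:
            best_score, best_cfg = score, deploy
    if best_cfg is None or best_score < min_score:
        return None  # no usable reference; fall back to the configuration interface
    return dict(best_cfg)  # copy so the historical record is not modified
```

A production system would likely weight fields differently (for example, framework over description text); the flat Jaccard measure is only meant to show where the preset condition plugs in.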

[0051] In a feasible design, determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information includes: obtaining a user-defined deployment strategy, which specifies whether to use a resource-saving mode or a high-performance mode; and determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information and the deployment strategy.

[0052] The deployment strategy can be obtained from user input through the deployment configuration interface.

[0053] High-performance mode refers to a configuration strategy that prioritizes optimizing service performance while meeting the basic operational requirements of the target model. This mode ensures lower inference latency and higher throughput by prioritizing the use of the latest image files and allocating more computing resources (such as GPUs and CPUs) and storage resources in the computing cluster. Resource-saving mode refers to a configuration strategy that prioritizes improving resource utilization efficiency while meeting the basic operational requirements of the target model. This mode maximizes resource utilization and reduces operational costs by prioritizing lightweight images with lower resource consumption and by scheduling and utilizing idle computing and storage resources in the computing cluster.

[0054] In the above example, after the deployment configuration information of historical models similar to the target model (i.e., the reference deployment configuration information) is obtained, it is adapted or filtered according to the user-specified deployment strategy (e.g., high-performance mode or resource-saving mode) to generate the final deployment configuration information for the current target model. This lets users balance response speed against resource consumption according to the actual application scenario, improving the adaptability and cost-effectiveness of model deployment.
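The adaptation step can be sketched as a pure function over the reference configuration. This is a minimal illustration under stated assumptions: the 1.5x scale factor, the image keys `release` and `footprint_gib`, and the function name `apply_strategy` are all hypothetical, because the application states only the direction of each mode (newest images and more resources versus lightweight images and reuse of idle resources).

```python
import math
from copy import deepcopy
from typing import Any, Dict, List

# Hypothetical factor: the application says high-performance mode allocates
# more resources, not by how much.
HIGH_PERFORMANCE_SCALE = 1.5


def apply_strategy(reference: Dict[str, Any], images: List[Dict[str, Any]],
                   strategy: str) -> Dict[str, Any]:
    """Adapt reference deployment configuration information to a strategy.

    Each candidate image is a dict with hypothetical keys "name",
    "release" (sortable) and "footprint_gib" (resource consumption).
    """
    if not images:
        raise ValueError("no candidate first image files")
    cfg = deepcopy(reference)  # never mutate the historical record
    if strategy == "high_performance":
        image = max(images, key=lambda i: i["release"])  # newest image
        for key in ("cpu_cores", "memory_gib"):
            cfg[key] = math.ceil(cfg[key] * HIGH_PERFORMANCE_SCALE)
    elif strategy == "resource_saving":
        image = min(images, key=lambda i: i["footprint_gib"])  # lightest image
        # Keep the reference minimums; placing on idle nodes is left to the scheduler.
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    cfg["image"] = image["name"]
    return cfg
```

Scheduling onto idle cluster resources in resource-saving mode is not modeled here; in a Kubernetes setting that decision belongs to the scheduler once requests are set.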

[0055] For example, the steps for obtaining the computing power configuration parameters required to execute the at least one first image file include: querying the image repository, based on the name of each first image file, for the computing power configuration parameters corresponding to that image file, where the image repository stores pre-entered image files and their corresponding computing power configuration parameters; and determining the computing power configuration parameters required for the at least one first image file based on the computing power configuration parameters corresponding to each first image file.

[0056] Specifically, after the computing power configuration parameters of each first image file are queried in the image repository by name, if there are multiple first image files, the total computing power configuration parameters required for these image files to run together are calculated from the parameters of each file (e.g., by taking the maximum or the sum of each parameter, or by combining them according to the deployment strategy) to ensure that sufficient resources are allocated when the inference service is deployed.
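The two aggregation rules named above (maximum or sum of each parameter) can be sketched as follows, assuming the image repository is modeled as a plain dictionary from image name to parameters. The function name and the dictionary layout are hypothetical; combining "according to the deployment strategy" is left out.

```python
from typing import Dict, List

Spec = Dict[str, int]


def required_compute(repository: Dict[str, Spec], image_names: List[str],
                     mode: str = "sum") -> Spec:
    """Combine the per-image parameters recorded in the image repository.

    mode="sum": images that run side by side need the sum of each parameter.
    mode="max": e.g. when the images do not run at the same time, the largest
    value of each parameter suffices.
    """
    if mode not in ("sum", "max"):
        raise ValueError(f"unknown mode: {mode}")
    unknown = [n for n in image_names if n not in repository]
    if unknown:
        raise KeyError(f"images not in repository: {unknown}")
    total: Spec = {}
    for name in image_names:
        for key, value in repository[name].items():
            if mode == "sum":
                total[key] = total.get(key, 0) + value
            else:
                total[key] = max(total.get(key, 0), value)
    return total
```

A parameter present on only some images (such as `memory_gib` below) is carried through unchanged under either rule.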

[0057] In the above embodiments, after the user selects the name of the first image file, the system can automatically query the computing power configuration parameters corresponding to the first image file in the image repository, and use the query results as the computing power configuration parameters required to execute the first image file. These parameters are then displayed on the interface for the user to confirm or adjust. This avoids the need for manual querying or estimation of resources based on experience during each deployment, significantly improving configuration efficiency and accuracy.

[0058] S140: In response to the user's deployment instruction for the model file of the target model stored in the storage center, obtain the associated metadata from the storage center, and dynamically determine the computing resources required to deploy the model file and at least one first image file based on the metadata and the real-time resource status of the computing power cluster.

[0059] Each first image file is a container image containing the execution logic of an inference service for the target model.

[0060] Obtaining the associated metadata from the storage center includes retrieving the corresponding metadata from the storage center based on the name of the model file.

[0061] In a feasible design, the metadata includes the size of the model file and its model framework. Dynamically determining the computing resources required to deploy the model file and at least one first image file, based on the metadata and the real-time resource status of the computing cluster, includes: dynamically allocating the computing resources required to deploy the model file based on the model size of the model file, the performance requirements in the deployment configuration information, and the real-time resource status of the computing cluster; and determining, based on the model framework corresponding to the model file, at least one latest first image file compatible with that framework.

[0062] Specifically, based on the model size of the model file, the performance requirements for executing the first image file in the deployment configuration information, and the performance requirements for running the target model, computing resources such as GPUs and CPUs are allocated to the model file from the real-time resources of the computing cluster. At least one of the latest first image files (such as the Triton Inference Server image file) that is compatible with the model framework is automatically selected from the available first image files indicated in the deployment configuration information.

[0063] The above embodiments dynamically allocate the computing resources required to deploy model files by combining the size of the model file in the metadata, the performance requirements in the deployment configuration information, and the real-time resource status of the computing cluster. This allows the resource supply to be adjusted to the current cluster load and model requirements, avoiding resource waste or insufficient performance and improving resource utilization and deployment stability. At the same time, the latest compatible image file is determined automatically from the model framework corresponding to the model file, ensuring that the runtime environment is compatible with the model framework. This also allows framework updates to be adopted promptly, reducing tedious manual image configuration and the associated compatibility risks.
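Both determinations in paragraphs [0061] to [0063] can be sketched in a few lines. The sizing rule (model weights plus 20% overhead must fit in GPU memory), the `Node` record, and the image dictionary keys are hypothetical assumptions for illustration; the application does not give a formula for deriving resources from model size.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    free_gpus: int
    gpu_memory_gib: int  # memory per GPU card


def gpus_needed(model_size_gib: float, gpu_memory_gib: int,
                overhead: float = 1.2) -> int:
    # Hypothetical sizing rule: weights plus 20% runtime overhead must fit in GPU memory.
    return max(1, math.ceil(model_size_gib * overhead / gpu_memory_gib))


def place_model(model_size_gib: float, min_gpus: int,
                nodes: List[Node]) -> Optional[Tuple[str, int]]:
    """Return (node name, GPU count) for the node needing the fewest GPUs, or None."""
    feasible = []
    for node in nodes:
        # min_gpus stands in for the performance requirement in the deployment config.
        need = max(min_gpus, gpus_needed(model_size_gib, node.gpu_memory_gib))
        if node.free_gpus >= need:  # real-time free capacity
            feasible.append((need, node.name))
    if not feasible:
        return None
    need, name = min(feasible)
    return name, need


def latest_compatible_image(images: List[Dict], framework: str) -> Optional[str]:
    """images: dicts with hypothetical keys "name", "frameworks" and "release"."""
    compatible = [i for i in images if framework in i["frameworks"]]
    if not compatible:
        return None
    return max(compatible, key=lambda i: i["release"])["name"]
```

Choosing the placement that needs the fewest GPUs is one possible policy; a high-performance strategy might instead prefer the node with the most headroom.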

[0064] S150: Based on the computing resources required for deploying the model file, at least one first image file, the corresponding deployment configuration information, and the model file, automatically create and expose the inference service corresponding to the target model in the computing power cluster.

[0065] In a feasible design, the container orchestration system of the computing power cluster is Kubernetes. Automatically creating and exposing the inference service corresponding to the target model in the computing power cluster, based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file, includes: determining service configuration information based on the computing resources required to deploy the model file, the at least one first image file, and the corresponding deployment configuration information, where the service configuration information includes the deployment configuration information and one or more of the following parameters: service name, namespace, resource requests and limits, auto-scaling policy, and health check and readiness check settings; and sending the service configuration information to Kubernetes via an interface to trigger Kubernetes to create and expose the inference service corresponding to the target model in the computing power cluster based on the service configuration information and the model file.
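Before the service configuration information is sent to Kubernetes, it is useful to reject values the cluster would refuse anyway. The sketch below is hypothetical in its class layout and defaults; the naming rules it checks are Kubernetes conventions (Service names are DNS-1035 labels, namespaces are DNS-1123 labels, both at most 63 characters), and Kubernetes also requires each resource request not to exceed its limit.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict

DNS1035 = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")     # Service names
DNS1123 = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")  # namespace names


@dataclass
class ServiceConfig:
    deployment: Dict[str, Any]  # deployment configuration information
    service_name: str
    namespace: str = "default"
    requests: Dict[str, float] = field(default_factory=dict)  # e.g. {"cpu": 2}
    limits: Dict[str, float] = field(default_factory=dict)
    min_replicas: int = 1
    max_replicas: int = 1
    cpu_target_pct: int = 70  # hypothetical auto-scaling threshold
    liveness_path: str = "/health"
    readiness_path: str = "/ready"

    def validate(self) -> None:
        for label, value, pattern in (("service name", self.service_name, DNS1035),
                                      ("namespace", self.namespace, DNS1123)):
            if len(value) > 63 or not pattern.fullmatch(value):
                raise ValueError(f"invalid {label}: {value!r}")
        for key, req in self.requests.items():
            if key in self.limits and req > self.limits[key]:
                raise ValueError(f"request for {key} exceeds its limit")
        if not 1 <= self.min_replicas <= self.max_replicas:
            raise ValueError("replica bounds must satisfy 1 <= min <= max")
```

Failing early on the platform side gives the user a clear message instead of an API rejection after the deployment command has been issued.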

[0066] The service name is used to uniquely identify the inference service.

[0067] Namespaces are logical units in Kubernetes for implementing resource isolation and organization. By deploying the inference service within a specified namespace, naming conflicts with other applications can be effectively avoided, and access control, resource quota management, and environment partitioning (such as development, testing, and production environments) can be facilitated.

[0068] Resource requests and limits define the minimum guaranteed value (i.e., resource request) and the maximum upper limit (i.e., resource limit) of computing resources (such as CPU and memory) required by each container in the inference service. Resource requests are used for scheduling decisions to ensure that containers have enough resources to run; resource limits prevent containers from over-consuming cluster resources and ensure the overall stability and fairness of the cluster.

[0069] The automatic scaling strategy dynamically adjusts the number of instance replicas for the inference service based on real-time load metrics (such as CPU utilization) to adapt to traffic fluctuations. By configuring parameters such as the minimum/maximum replica count and scaling thresholds, instances can be added automatically during peak periods to ensure responsiveness, and resources can be reclaimed during off-peak periods to improve utilization.

[0070] Health checks are used to detect whether containers are functioning correctly (e.g., by sending HTTP requests or executing commands in the container). If a health check fails, Kubernetes automatically restarts the container to restore service. Readiness checks determine whether a container is ready to receive traffic; the service forwards traffic to an instance only after its readiness check passes. Together, these settings ensure the reliability of the inference service and enable smooth rolling updates.

[0071] Kubernetes retrieves the model file from the storage center based on the model file name in the service configuration information, and pulls the first image file from the image repository based on the name of the first image file. Combining parameters such as the service name, namespace, resource requests and limits, auto-scaling policy, and health check and readiness check settings, Kubernetes then automatically creates and exposes the inference service corresponding to the target model in the computing power cluster.
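One way the service configuration information could be turned into Kubernetes objects is shown below: a Deployment (resource requests and limits, liveness and readiness probes), a Service that exposes it, and an `autoscaling/v2` HorizontalPodAutoscaler. The object structure follows the Kubernetes API; the function name, default probe paths, and probe timings are hypothetical, the `nvidia.com/gpu` resource in the usage below assumes the NVIDIA device plugin, and how the model file reaches the container (volume, init step, or download at startup) is not modeled.

```python
from typing import Any, Dict, List


def render_manifests(name: str, namespace: str, image: str, port: int,
                     requests: Dict[str, str], limits: Dict[str, str],
                     min_replicas: int, max_replicas: int, cpu_target_pct: int,
                     liveness_path: str = "/health",
                     readiness_path: str = "/ready") -> List[Dict[str, Any]]:
    labels = {"app": name}
    meta = {"name": name, "namespace": namespace, "labels": labels}
    container = {
        "name": name,
        "image": image,  # first image file pulled from the image repository
        "ports": [{"containerPort": port}],
        "resources": {"requests": requests, "limits": limits},
        # A failing liveness probe makes the kubelet restart the container.
        "livenessProbe": {"httpGet": {"path": liveness_path, "port": port},
                          "initialDelaySeconds": 30, "periodSeconds": 10},
        # The Pod receives Service traffic only while the readiness probe passes.
        "readinessProbe": {"httpGet": {"path": readiness_path, "port": port},
                           "periodSeconds": 5},
    }
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment", "metadata": dict(meta),
        "spec": {
            "replicas": min_replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels},
                         "spec": {"containers": [container]}},
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service", "metadata": dict(meta),
        "spec": {"selector": labels, "ports": [{"port": port, "targetPort": port}]},
    }
    hpa = {
        "apiVersion": "autoscaling/v2", "kind": "HorizontalPodAutoscaler",
        "metadata": dict(meta),
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{"type": "Resource", "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": cpu_target_pct}}}],
        },
    }
    return [deployment, service, hpa]
```

The returned dictionaries can be serialized to JSON or YAML and submitted through the Kubernetes API; keeping rendering separate from submission makes the configuration easy to inspect before deployment.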

[0072] The above embodiments integrate the computing resources required for deploying model files, compatible image files, and deployment configuration information into structured service configuration information (including deployment configuration information, service name, namespace, resource requests and limits, automatic scaling strategies, health check and readiness check settings, etc.), and send this configuration information to the Kubernetes cluster through a standard interface, triggering the cluster to automatically create and expose the inference service based on the model files. This solution automates the entire process from resource configuration to service deployment, reducing the risk of manual intervention and configuration errors. Simultaneously, by finely defining resource limits and scaling strategies, it can dynamically adapt to changes in business load, ensuring service response speed and resource utilization efficiency. Furthermore, the introduction of health check and readiness check mechanisms effectively improves the service's operational stability and fault self-healing capabilities, providing standardized deployment support for large-scale, highly available model inference services.

[0073] For example, the method further includes: in response to the user's command to view the logs of the inference service corresponding to the model file, obtaining the runtime logs of the container group (Pod) instances corresponding to the inference service from the computing power cluster, and displaying the runtime logs to the user.

[0074] The above example enables users to monitor the running status of the model inference service in real time and troubleshoot problems. By intuitively displaying detailed log information of container group instances, it helps users quickly locate anomalies and optimize model performance.
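When an inference service runs several container group instances, their logs have to be merged into the single time-ordered view described in paragraph [0079]. The sketch below assumes each line is prefixed with an RFC 3339 UTC timestamp, as Kubernetes returns when timestamps are requested (for example, via the official Python client's `CoreV1Api.read_namespaced_pod_log(name, namespace, timestamps=True)`); the fetching itself is left out, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def _sort_key(timestamp: str) -> Tuple[str, str]:
    """Order RFC 3339 UTC timestamps such as 2024-05-01T10:00:00.25Z.

    Fractional seconds may be printed without trailing zeros, so plain string
    comparison can misorder lines; pad the fraction to nanoseconds first.
    """
    whole, _, fraction = timestamp.rstrip("Z").partition(".")
    return whole, fraction.ljust(9, "0")


def merge_pod_logs(pod_logs: Dict[str, List[str]]) -> List[str]:
    """Merge '<timestamp> <message>' lines from several Pods into one time-ordered view."""
    entries = []
    for pod, lines in pod_logs.items():
        for line in lines:
            timestamp, _, message = line.partition(" ")
            entries.append((_sort_key(timestamp), pod, timestamp, message))
    entries.sort(key=lambda e: e[0])  # stable sort: equal timestamps keep their original order
    return [f"[{pod}] {ts} {msg}" for _, pod, ts, msg in entries]
```

Prefixing each line with its Pod name keeps the merged view traceable when the user needs to locate which replica produced an anomaly.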

[0075] To identify model version changes, automatically adjust the deployment resource configuration, and thereby support version rollback, a feasible design implements historical version rollback of the target model as follows: in response to a user's rollback command for a target historical version of the target model, determining, based on the target historical version, the metadata of the corresponding target historical model file stored in the storage center, where the target historical version is earlier than the version of the model file; determining, based on the metadata corresponding to the target historical model file and the real-time resource status of the computing cluster, the computing resources required to deploy the target historical model file and at least one first image file; and recreating and exposing the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the target historical model file, the at least one first image file, and the corresponding deployment configuration information.

[0076] Specifically, after a user triggers a version rollback operation for the target model, the system retrieves from the database the metadata record corresponding to the user-specified target historical version (i.e., the name and timestamp of a historical model file), which is the metadata of the target historical model file, and obtains the deployment configuration information from this metadata. Then, based on this deployment configuration information, the system calls the management interface of the computing cluster to trigger Kubernetes to remove the inference service corresponding to the currently running model file and deploy the inference service corresponding to the target historical model file.

[0077] In the above embodiments, the metadata also includes deployment configuration information. During fault rollback, manual cross-system verification is unnecessary; all relevant configurations and parameters can be traced directly through the metadata. The above embodiments also support version rollback based on the deployment configuration information in the metadata, effectively solving the technical problem of fragmented version management. This not only improves the efficiency of rollback operations but also reduces errors that may result from manual intervention.
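The rollback flow in paragraphs [0075] and [0076] reduces to looking up the historical metadata, checking that the target version is earlier than the current one, and producing an ordered remove-then-deploy plan. In this sketch the `ModelVersion` record, the epoch-second `uploaded_at` field, and the plan format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ModelVersion:
    file_name: str
    uploaded_at: int  # hypothetical: upload timestamp in epoch seconds
    deploy_config: Dict[str, Any] = field(default_factory=dict)  # stored in the metadata


def plan_rollback(history: List[ModelVersion], current_file: str,
                  target_file: str) -> List[Tuple[str, Any]]:
    by_name = {v.file_name: v for v in history}
    if current_file not in by_name or target_file not in by_name:
        raise KeyError("unknown model file")
    current, target = by_name[current_file], by_name[target_file]
    if target.uploaded_at >= current.uploaded_at:
        raise ValueError("target historical version must be earlier than the current version")
    # Remove the running service first, then redeploy from the historical metadata.
    return [("undeploy", current.file_name),
            ("deploy", (target.file_name, target.deploy_config))]
```

Because the deployment configuration travels inside the metadata, the plan needs no lookup in any other system; a zero-downtime variant could deploy the historical version before removing the current one.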

[0078] In this embodiment, after the user generates a directed acyclic graph for model training by dragging and dropping nodes in the visual interface provided by the platform, the orchestration engine automatically resolves dependencies and executes the subtasks of each node in sequence. This integrates the definition and execution of the training process within the platform, solving the problems of executing each subtask manually and of managing dependencies between tasks in traditional approaches, and achieves automated orchestration and observability of the training process. After training, the platform automatically uploads the generated model files and metadata to a unified storage center, where they are managed centrally. When deployment is required, the user can issue a deployment command by selecting the model file to be deployed, without leaving the platform. This triggers the platform to dynamically allocate the computing resources required for deploying the model file and at least one first image file based on the metadata and the real-time resource status of the computing cluster, ensuring efficient operation of the inference service under different loads and maximizing resource utilization. Finally, the inference service is automatically created and exposed in the computing cluster based on the computing resources, image files, model file, and corresponding deployment configuration information. This application integrates model training, associated storage of model files and their metadata, automatic allocation of deployment resources, and automated deployment into a single platform, achieving a "one-stop" workflow from model development to service launch. This solves the problems of fragmented toolchains and manual configuration in traditional solutions, thereby improving the work efficiency and collaboration efficiency of R&D personnel.

[0079] In practical applications, the user first drags and drops multiple nodes and configures their parameters in the platform's visual interface to generate a directed acyclic graph (DAG) for training the target model. The platform generates an executable code template based on the second image file corresponding to each node, allowing the user to supplement runtime parameters or add debugging commands. Next, the user issues the start command for the DAG. The platform uses the orchestration engine to parse the logical dependencies between nodes and executes the subtasks corresponding to each node in sequence. After training is completed, the platform automatically uploads the trained model file, extracts metadata containing the model name, hyperparameters, dependent image information, and so on, and stores both in the storage center in association. Then, in response to the user's deployment command for the model file of the target model stored in the storage center, the platform obtains the associated metadata from the storage center and, based on the metadata and the real-time resource status of the computing power cluster, dynamically determines the computing resources required to deploy the model file and at least one first image file. Based on these computing resources, the at least one first image file, the corresponding deployment configuration information, and the model file, the platform automatically creates and exposes the inference service corresponding to the target model in the computing power cluster. In addition, users can view the logs of the inference service through a control provided by the platform; the platform immediately pulls the real-time logs of the corresponding container group instances and displays them on the interface as a time-ordered scrolling view. Users can also select a historical model version and trigger a rollback through the version rollback control provided by the platform, and the platform automatically rebuilds the inference service based on the metadata of that version.
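The dependency resolution performed by the orchestration engine is, at its core, a topological ordering of the DAG. As a minimal sketch (not the platform's orchestration engine), Python's standard `graphlib.TopologicalSorter` (Python 3.9+) can order the nodes and detect cycles before any subtask starts; the node names in the usage below and the `run_subtask` callback are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Set


def run_dag(dependencies: Dict[str, Set[str]],
            run_subtask: Callable[[str], None]) -> List[str]:
    """Resolve node dependencies, then execute each node's subtask in order.

    dependencies maps a node to the set of nodes it depends on.
    A cycle raises graphlib.CycleError before any subtask runs.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for node in order:
        run_subtask(node)  # an exception stops the run; later nodes are skipped
    return order
```

For example, with `{"preprocess": {"load_data"}, "train": {"preprocess", "load_model"}, "evaluate": {"train"}}`, both loading nodes are ordered before `train`, and `evaluate` runs last. Sequential execution matches the description above; `TopologicalSorter` also offers `get_ready()`/`done()` if independent nodes were to run in parallel.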

[0080] As shown in Figure 2, this application also provides a model management device for model training and inference based on an orchestration system, comprising the following modules. The directed acyclic graph (DAG) acquisition module is used to acquire the DAG generated after the user drags and drops multiple nodes in the visual interface and configures them; the DAG is used to perform the training task of the target model, and each node corresponds to a subtask of the training task. The model training execution module is used to respond to the user's start command for the DAG by parsing the logical dependencies between nodes through the orchestration engine and executing the corresponding subtasks in sequence according to those dependencies and the configuration information of each node. The model file upload module is used to upload the trained model file and metadata of the target model to the storage center after the training process of the target model is completed; the metadata includes the deployment configuration information corresponding to the model file. The model deployment module is used to respond to the user's deployment instruction for the model file of the target model stored in the storage center by obtaining the associated metadata through the storage center and dynamically determining the computing resources required to deploy the model file and at least one first image file based on the metadata and the real-time resource status of the computing power cluster; each first image file is a container image containing the execution logic of an inference service of the target model. The model deployment module is also used to automatically create and expose the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file.

[0081] In a feasible design, the metadata includes the size of the model file and its model framework. The model deployment module dynamically determines the computing resources required to deploy the model file and at least one first image file, based on the metadata and the real-time resource status of the computing cluster, as follows: dynamically allocating the computing resources required to deploy the model file based on the model size of the model file, the performance requirements in the deployment configuration information, and the real-time resource status of the computing cluster; and determining, based on the model framework corresponding to the model file, at least one latest first image file compatible with that framework.

[0082] In a feasible design, the container orchestration system of the computing power cluster is Kubernetes. The model deployment module automatically creates and exposes the inference service corresponding to the target model in the computing power cluster, based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file, as follows: determining service configuration information based on the computing resources required to deploy the model file, the at least one first image file, and the corresponding deployment configuration information, where the service configuration information includes the deployment configuration information and one or more of the following parameters: service name, namespace, resource requests and limits, auto-scaling policy, and health check and readiness check settings; and sending the service configuration information to Kubernetes via an interface to trigger Kubernetes to create and expose the inference service corresponding to the target model in the computing power cluster based on the service configuration information and the model file.

[0083] In one feasible design, the device further includes a deployment configuration information determination module. The metadata includes non-deployment configuration information other than the deployment configuration information, and the storage center stores a first association relationship that associates the non-deployment configuration information with the deployment configuration information of successfully deployed models. Before the trained model file and metadata of the target model are uploaded to the storage center, the deployment configuration information determination module determines the deployment configuration information as follows: determining the similarity between the non-deployment configuration information in the metadata and the non-deployment configuration information in the first association relationship; determining, as reference deployment configuration information, the deployment configuration information in the first association relationship that corresponds to the non-deployment configuration information whose similarity meets the preset condition; and determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information.

[0084] In a feasible design, the deployment configuration information determination module determines the deployment configuration information corresponding to the model file based on the reference deployment configuration information as follows: obtaining a user-defined deployment strategy, which specifies whether to use a resource-saving mode or a high-performance mode; and determining the deployment configuration information corresponding to the model file based on the reference deployment configuration information and the deployment strategy.

[0085] In one feasible design, the metadata also includes the version of the model file, and the device also includes a version rollback module, which is used to determine the metadata of the corresponding target historical model file stored in the storage center based on the target historical version in response to the user's rollback command for the target historical version of the target model, wherein the target historical version is earlier than the version of the model file. The model deployment module is also used to determine the computing resources required to deploy the target historical model file and at least one first image file based on the metadata corresponding to the target historical model file and the real-time resource status of the computing power cluster. The model deployment module is also used to recreate and expose the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the target historical model file, at least one first image file, and the corresponding deployment configuration information.

[0086] In a feasible design, the deployment configuration information includes the name of the model file, the identifier of the computing cluster, the names of the available first image files, the computing power configuration parameters required to execute each first image file, the computing power configuration parameters required to run the model file, and the configuration parameters for calling the interface of the target model.

[0087] In one feasible design, the configuration information of the configured node includes the node name, the name of the second image file, and the computing power configuration parameters required to execute the second image file. When the node is the first node, the configuration information of the node also includes the name of the initial model file of the target model and the name of the training dataset. The second image file is a container image containing the code that executes the node.
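The node configuration rules in paragraph [0087] (every node names a second image file and its computing power parameters; the first node additionally names the initial model file and the training dataset) can be checked before the DAG is started. The class, field names, and error messages below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeConfig:
    node_name: str
    image_name: str                      # second image file containing the node's code
    compute: Dict[str, float] = field(default_factory=dict)
    initial_model: Optional[str] = None  # required on the first node only
    dataset: Optional[str] = None        # required on the first node only


def validate_nodes(nodes: List[NodeConfig], first_node: str) -> List[str]:
    """Return human-readable problems; an empty list means the node configuration is valid."""
    problems = []
    names = [n.node_name for n in nodes]
    if len(names) != len(set(names)):
        problems.append("node names must be unique")
    if first_node not in names:
        problems.append(f"first node {first_node!r} not found")
    for node in nodes:
        if not node.image_name:
            problems.append(f"{node.node_name}: missing second image file")
        if not node.compute:
            problems.append(f"{node.node_name}: missing computing power parameters")
        if node.node_name == first_node and not (node.initial_model and node.dataset):
            problems.append(f"{node.node_name}: first node needs initial model and dataset")
    return problems
```

Returning a list of problems rather than raising on the first one lets the visual interface highlight every misconfigured node at once.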

[0088] In one feasible design, the model training execution module is also used to generate an executable code template for the corresponding subtask based on the code of the second image file corresponding to each node, and, in response to the user's editing instruction for the code template of a target node, to provide the user with a code editing interface on which the user can perform at least one of the following operations on the code template: filling in the parameters required to run the second image file, and adding debugging commands.
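A code template with user-fillable parameters can be built with Python's standard `string.Template`. In this sketch the skeleton text, the `input_path`/`output_path` parameters, and the function names are hypothetical; in the described platform the template would be derived from the code in each node's second image file.

```python
from string import Template
from typing import Dict, Iterable

# Hypothetical skeleton; "$$" escapes a literal "$", so the runtime parameters
# survive the first substitution as ${...} placeholders for the user to fill.
SKELETON = Template(
    "# subtask: $node\n"
    "# image: $image\n"
    "$entry --input $${input_path} --output $${output_path}\n"
)


def generate_template(node: str, image: str, entry: str) -> str:
    """Build an editable code template for one DAG node."""
    return SKELETON.substitute(node=node, image=image, entry=entry)


def fill_template(template_text: str, params: Dict[str, str],
                  debug_commands: Iterable[str] = ()) -> str:
    """Apply the user's edits: fill runtime parameters and prepend debugging commands."""
    body = Template(template_text).substitute(params)  # KeyError if a parameter is missing
    return "".join(f"{cmd}\n" for cmd in debug_commands) + body
```

Using `substitute` rather than `safe_substitute` in the second step makes a forgotten runtime parameter fail loudly before the subtask is launched.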

[0089] Other implementation methods and effects of the above-mentioned device can be found in the description of the model management method embodiment based on the orchestration system for model training and inference, and will not be repeated here.

[0090] The basic principles of this application have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this application are merely examples and not limitations, and should not be considered essential features of each embodiment of this application. Furthermore, the specific details disclosed above are provided for illustration and ease of understanding only and are not limitations; this application is not required to be implemented using those specific details.

[0091] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0092] The block diagrams of components, apparatuses, devices, and systems involved in this application are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to,” and are used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and/or,” and are used interchangeably with it, unless the context clearly indicates otherwise. The term “such as” as used herein means “such as but not limited to,” and is used interchangeably with that phrase.

[0093] It should also be noted that in the apparatus, devices, and methods of this application, the components or steps can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of this application.

[0094] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects without departing from the scope of this application. Therefore, this application is not intended to be limited to the aspects shown herein, but rather to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0095] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this application to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations thereof.

Claims

1. A model management method based on an orchestration system for model training and inference, characterized in that the method comprises: obtaining a directed acyclic graph (DAG) generated after a user drags and drops multiple nodes through a visual interface and configures them, wherein the DAG is used to perform a training task of a target model and each node corresponds to a subtask of the training task; in response to the user's instruction to start running the DAG, parsing the logical dependencies between the nodes through an orchestration engine, and executing the corresponding subtasks in sequence according to the configuration information of each node based on the logical dependencies; after the training process of the target model is completed, uploading the trained model file and metadata of the target model to a storage center, wherein the metadata includes deployment configuration information corresponding to the model file; in response to the user's deployment instruction for the model file of the target model stored in the storage center, obtaining the associated metadata through the storage center, and dynamically determining, based on the metadata and the real-time resource status of a computing power cluster, the computing resources required to deploy the model file and at least one first image file, wherein each first image file is a container image containing the execution logic of an inference service of the target model; and automatically creating and exposing the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file.
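The dependency resolution performed by the orchestration engine in claim 1 amounts to executing the nodes in a topological order of the DAG. The following is a minimal Python sketch that assumes the standard-library `graphlib.TopologicalSorter` (Python 3.9+) in place of whatever orchestration engine the platform actually uses, and assumes subtasks run one at a time; the names `run_training_dag` and `execute_subtask` are hypothetical.

```python
from __future__ import annotations

from graphlib import TopologicalSorter
from typing import Callable


def run_training_dag(
    node_configs: dict[str, dict],
    dependencies: dict[str, set[str]],
    execute_subtask: Callable[[str, dict], None],
) -> list[str]:
    """Run each node's subtask only after all of its upstream nodes have run.

    `dependencies` maps a node name to the names of the nodes it depends on.
    A cycle raises graphlib.CycleError (a ValueError) before any subtask starts.
    """
    unknown = set().union(*dependencies.values()) - node_configs.keys()
    if unknown:
        raise ValueError(f"dependencies on undefined nodes: {sorted(unknown)}")
    sorter = TopologicalSorter(
        {name: dependencies.get(name, set()) for name in node_configs})
    order = list(sorter.static_order())  # validates the whole graph first
    for name in order:
        execute_subtask(name, node_configs[name])
    return order
```

An engine that runs independent branches in parallel would instead use `TopologicalSorter.get_ready()` and `done()` to dispatch nodes as their predecessors finish; the sequential loop here only illustrates the ordering guarantee.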

2. The method according to claim 1, characterized in that the metadata includes the size of the model file and the model framework, and dynamically determining the computing resources required to deploy the model file and the at least one first image file based on the metadata and the real-time resource status of the computing power cluster comprises: dynamically allocating the computing resources required to deploy the model file based on the size of the model file, the performance requirements in the deployment configuration information, and the real-time resource status of the computing power cluster; and determining, based on the model framework corresponding to the model file, at least one latest first image file compatible with the model framework.
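Claim 2 can be sketched in Python as two small steps: sizing each inference replica from the model file size and placing the requested number of replicas on nodes with enough free memory, then choosing the newest image whose framework matches the model. The 1.5x memory headroom factor, the first-fit placement, the replica count standing in for the performance requirement, and the `ImageInfo` record are assumptions for illustration, not the allocation method disclosed in the application.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    name: str                  # first image file (inference container image)
    framework: str             # model framework the image can serve
    version: tuple[int, ...]   # comparable version, e.g. (2, 1)


def allocate_replicas(model_size_gib: float, target_replicas: int,
                      free_gib_per_node: list[float],
                      overhead: float = 1.5) -> tuple[float, int]:
    """Return (memory per replica in GiB, number of replicas that fit).

    Replicas are placed first-fit, starting with the node with the most free memory.
    """
    per_replica = model_size_gib * overhead  # hypothetical headroom factor
    placed = 0
    for free in sorted(free_gib_per_node, reverse=True):
        while placed < target_replicas and free >= per_replica:
            free -= per_replica
            placed += 1
    if placed == 0:
        raise RuntimeError("no node has enough free memory for one replica")
    return per_replica, placed


def select_latest_image(images: list[ImageInfo], framework: str) -> ImageInfo:
    """Pick the newest image compatible with the model framework."""
    compatible = [img for img in images if img.framework == framework]
    if not compatible:
        raise LookupError(f"no inference image for framework {framework!r}")
    return max(compatible, key=lambda img: img.version)
```

When fewer replicas fit than requested, the caller can decide whether to deploy with reduced capacity or wait for cluster resources to free up.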

3. The method according to claim 1 or 2, characterized in that the container orchestration system of the computing power cluster is Kubernetes, and automatically creating and exposing the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file comprises: determining service configuration information based on the computing resources required to deploy the model file, the at least one first image file, and the corresponding deployment configuration information, wherein the service configuration information includes the deployment configuration information and one or more of the following parameters: service name, namespace, resource requests and limits, auto-scaling policy, and health check and readiness check settings; and sending the service configuration information to the Kubernetes cluster through an interface to trigger the Kubernetes cluster to create and expose the inference service corresponding to the target model in the computing power cluster based on the model file and the service configuration information.
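The service configuration information of claim 3 maps naturally onto standard Kubernetes objects. The Python sketch below builds plain-dictionary manifests for a `Deployment` (with resource requests and limits plus liveness and readiness probes), a `Service` that exposes it, and an `autoscaling/v2` `HorizontalPodAutoscaler` for the auto-scaling policy. The probe paths `/healthz` and `/ready`, the `MODEL_URI` environment variable, the port numbers, and the 70% CPU target are hypothetical choices; the application does not state which Kubernetes objects or which interface it uses.

```python
from __future__ import annotations


def build_inference_manifests(name: str, namespace: str, image: str,
                              model_uri: str, cpu: str, memory: str,
                              max_replicas: int, port: int = 8080) -> list[dict]:
    """Build Deployment, Service and HPA manifests for one inference service."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,                                      # first image file
        "env": [{"name": "MODEL_URI", "value": model_uri}],  # hypothetical variable
        "ports": [{"containerPort": port}],
        "resources": {                                       # requests and limits
            "requests": {"cpu": cpu, "memory": memory},
            "limits": {"cpu": cpu, "memory": memory},
        },
        "livenessProbe": {"httpGet": {"path": "/healthz", "port": port},
                          "initialDelaySeconds": 10},
        "readinessProbe": {"httpGet": {"path": "/ready", "port": port},
                           "periodSeconds": 5},
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels},
                         "spec": {"containers": [container]}},
        },
    }
    service = {  # routes port 80 of the Service to the pods' serving port
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": port}]},
    }
    hpa = {  # auto-scaling policy: scale on average CPU utilization
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                               "name": name},
            "minReplicas": 1,
            "maxReplicas": max_replicas,
            "metrics": [{"type": "Resource", "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70}}}],
        },
    }
    return [deployment, service, hpa]
```

These dictionaries could be submitted through the Kubernetes API, for example by serializing them to YAML for `kubectl apply` or by passing them to the official Python client's `AppsV1Api.create_namespaced_deployment` and `CoreV1Api.create_namespaced_service`. A `Service` of the default `ClusterIP` type exposes the model only inside the cluster; external access would additionally need a `LoadBalancer` Service or an Ingress.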

4. The method according to claim 1 or 2, characterized in that the metadata includes non-deployment configuration information other than the deployment configuration information; the storage center stores a first association relationship, which associates the non-deployment configuration information of each successfully deployed model with its deployment configuration information; and before the trained model file and metadata of the target model are uploaded to the storage center, the method further comprises a step of determining the deployment configuration information, which includes: determining the similarity between the non-deployment configuration information in the metadata and each item of non-deployment configuration information in the first association relationship; determining, as referenceable deployment configuration information, the deployment configuration information in the first association relationship whose corresponding non-deployment configuration information has a similarity meeting a preset condition; and determining the deployment configuration information corresponding to the model file based on the referenceable deployment configuration information.
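The similarity matching of claim 4 can be sketched with a Jaccard similarity over the key/value pairs of flat non-deployment configurations whose values are hashable. The choice of Jaccard similarity, the default threshold of 0.5 used as the "preset condition", and the representation of the first association relationship as a list of `(non_deployment, deployment)` pairs are assumptions; the application does not name a similarity measure.

```python
from __future__ import annotations


def jaccard_similarity(a: dict[str, object], b: dict[str, object]) -> float:
    """Share of identical key/value pairs between two flat configurations."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


def find_reference_configs(
    non_deploy: dict[str, object],
    association: list[tuple[dict[str, object], dict[str, object]]],
    threshold: float = 0.5,
) -> list[dict[str, object]]:
    """Return deployment configs of past successful deployments whose
    non-deployment configuration is similar enough, most similar first."""
    scored = [(jaccard_similarity(non_deploy, past), deploy)
              for past, deploy in association]
    matches = [(score, deploy) for score, deploy in scored if score >= threshold]
    matches.sort(key=lambda item: item[0], reverse=True)
    return [deploy for _, deploy in matches]
```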

5. The method according to claim 4, characterized in that determining the deployment configuration information corresponding to the model file based on the referenceable deployment configuration information comprises: obtaining a user-defined deployment strategy, which specifies whether to use a resource-saving mode or a high-performance mode; and determining the deployment configuration information corresponding to the model file based on the referenceable deployment configuration information and the deployment strategy.
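One possible reading of claim 5, sketched in Python: in resource-saving mode, take the smallest value of each numeric field across the referenceable configurations; in high-performance mode, take the largest. This min/max policy, the field names, and the requirement that all referenceable configurations share the same numeric fields are hypothetical; the application only states that the strategy selects between the two modes.

```python
from __future__ import annotations

from enum import Enum


class DeploymentStrategy(Enum):
    RESOURCE_SAVING = "resource_saving"
    HIGH_PERFORMANCE = "high_performance"


def derive_deployment_config(references: list[dict[str, float]],
                             strategy: DeploymentStrategy) -> dict[str, float]:
    """Combine referenceable configs field by field according to the strategy."""
    if not references:
        raise ValueError("no referenceable deployment configuration found")
    # Resource-saving keeps the smallest value seen, high-performance the largest.
    pick = min if strategy is DeploymentStrategy.RESOURCE_SAVING else max
    return {key: pick(ref[key] for ref in references) for key in references[0]}
```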

6. The method according to claim 1 or 2, characterized in that the metadata further includes the version of the model file, and the method further comprises: in response to the user's rollback instruction for a target historical version of the target model, determining, based on the target historical version, the metadata of the corresponding target historical model file stored in the storage center, wherein the target historical version is earlier than the version of the model file; determining, based on the metadata corresponding to the target historical model file and the real-time resource status of the computing power cluster, the computing resources required to deploy the target historical model file and at least one first image file; and recreating and exposing the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the target historical model file, the at least one first image file, and the corresponding deployment configuration information.
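The version check in claim 6 can be sketched as follows, with the storage center modeled as a dictionary from version strings to metadata. The `vMAJOR.MINOR.PATCH` version format and the function names are assumptions. Versions are compared numerically rather than as strings, so that `v1.10.0` correctly counts as later than `v1.9.0`.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def select_rollback_metadata(stored: dict[str, dict], current_version: str,
                             target_version: str) -> dict:
    """Return the stored metadata of an earlier target version for redeployment."""
    if target_version not in stored:
        raise KeyError(f"version {target_version} is not in the storage center")
    if parse_version(target_version) >= parse_version(current_version):
        raise ValueError("rollback target must be earlier than the current version")
    return stored[target_version]
```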

7. The method according to claim 1 or 2, characterized in that the deployment configuration information includes the name of the model file, the identifier of the computing power cluster, the names of the available first image files, the computing power configuration parameters required to execute each first image file, the computing power configuration parameters required to run the model file, and the configuration parameters for calling the interface of the target model.
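The deployment configuration fields listed in claim 7 could be represented as a typed record. The Python dataclasses below are a hypothetical schema: the field names, the Kubernetes-style quantity strings for CPU and memory, and the consistency check in `__post_init__` are illustrative assumptions rather than the format used by the application.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class ComputeSpec:
    cpu: str       # Kubernetes-style quantity, e.g. "2" or "500m" (assumed format)
    memory: str    # e.g. "4Gi"
    gpu: int = 0


@dataclass
class DeploymentConfig:
    model_file_name: str
    cluster_id: str                          # identifier of the computing power cluster
    image_names: list[str]                   # available first image files
    image_compute: dict[str, ComputeSpec]    # compute parameters per first image file
    model_compute: ComputeSpec               # compute parameters for running the model file
    api_config: dict[str, str] = field(default_factory=dict)  # interface-call parameters

    def __post_init__(self) -> None:
        # Every listed image must come with its own compute parameters.
        missing = [name for name in self.image_names if name not in self.image_compute]
        if missing:
            raise ValueError(f"missing compute parameters for images: {missing}")
```

`dataclasses.asdict` turns such a record into nested dictionaries, which is convenient for storing it as part of the model metadata.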

8. The method according to claim 1 or 2, characterized in that the configuration information of each configured node includes the node name, the name of a second image file, and the computing power configuration parameters required to execute the second image file; when the node is the first node, its configuration information further includes the name of the initial model file of the target model and the name of the training dataset; and the second image file is a container image containing the code that executes the node.
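The node configuration of claim 8 can likewise be sketched as a dataclass, together with a check that every first node also names the initial model file and the training dataset. Treating any node with no upstream dependencies as a first node is an assumption made here; the field and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NodeConfig:
    node_name: str
    image_name: str                       # second image file: code executing this node
    compute: dict[str, str] = field(default_factory=dict)  # e.g. {"cpu": "4", "gpu": "1"}
    initial_model: str | None = None      # required only for a first node
    training_dataset: str | None = None   # required only for a first node


def validate_node_configs(nodes: list[NodeConfig],
                          dependencies: dict[str, set[str]]) -> None:
    """Raise ValueError if a node without upstream nodes lacks model or dataset."""
    for node in nodes:
        is_first = not dependencies.get(node.node_name)
        if is_first and not (node.initial_model and node.training_dataset):
            raise ValueError(
                f"first node {node.node_name!r} must name an initial model file "
                "and a training dataset")
```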

9. The method according to claim 8, characterized in that the method further comprises: generating, based on the code of the second image file corresponding to each node, an executable code template for the corresponding subtask; and in response to the user's editing instruction for the code template of a target node, providing the user with a code editing interface so that the user can perform at least one of the following operations on the code template: filling in the parameters required to run the second image file, and adding debugging commands.

10. A model management device for model training and inference based on an orchestration system, characterized in that the device comprises: a directed acyclic graph acquisition module, configured to acquire a directed acyclic graph generated after a user drags and drops multiple nodes through a visual interface and configures them, wherein the directed acyclic graph is used to perform a training task of a target model and each node corresponds to a subtask of the training task; a model training execution module, configured to, in response to the user's instruction to start running the directed acyclic graph, parse the logical dependencies between the nodes through an orchestration engine and execute the corresponding subtasks in sequence according to the configuration information of each node based on the logical dependencies; a model file upload module, configured to upload the trained model file and metadata of the target model to a storage center after the training process of the target model is completed, wherein the metadata includes deployment configuration information corresponding to the model file; and a model deployment module, configured to, in response to the user's deployment instruction for the model file of the target model stored in the storage center, obtain the associated metadata through the storage center and dynamically determine, based on the metadata and the real-time resource status of a computing power cluster, the computing resources required to deploy the model file and at least one first image file, wherein each first image file is a container image containing the execution logic of an inference service of the target model; the model deployment module is further configured to automatically create and expose the inference service corresponding to the target model in the computing power cluster based on the computing resources required to deploy the model file, the at least one first image file, the corresponding deployment configuration information, and the model file.