Fault diagnosis method and apparatus

By applying SOP nodes, RAG agents, and LLM agents to the CI/CD pipeline, the problem of low fault diagnosis efficiency in the CI/CD pipeline is solved, and automated and scalable fault resolution capabilities are achieved.

WO2026123725A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD
Filing Date: 2025-08-05
Publication Date: 2026-06-18

IPC: G06F11/07

AI Tagging

Application Domain

Fault response

AI Technical Summary

Technical Problem

The challenges of fault diagnosis in CI/CD pipelines lie in the wide variety of faults, complex logs, complex distributed architecture, and difficulties in team collaboration, which lead to low efficiency in fault location and diagnosis.

Method used

By employing a fault diagnosis method, the fault categories and problem descriptions of the CI/CD pipeline are obtained. The diagnostic results are dynamically generated using SOP nodes, RAG agents, and LLM agents. Combined with the SOP tree structure, faults are resolved automatically and extensibly.

Benefits of technology

It improves the automation and efficiency of CI/CD pipeline fault diagnosis, reduces the possibility of manual intervention, and can dynamically learn solutions when faults cannot be directly resolved.

✦ Generated by Eureka AI based on patent content.

Abstract

Provided in the present application is a fault diagnosis method. The method is applied to a fault diagnosis service for a CI / CD pipeline, and comprises: acquiring a category of a fault of the CI / CD pipeline, a first problem description of the fault, and a plurality of SOP nodes of the fault diagnosis service, wherein the SOP nodes are used to obtain a diagnosis result on the basis of a problem description; on the basis of the first problem description and at least one first SOP node among the plurality of SOP nodes that corresponds to the category of the fault, determining at least one first diagnosis result; when the at least one first diagnosis result can be used as a solution to the fault, outputting the first diagnosis result; or, when the first diagnosis result cannot be used as a solution to the fault, on the basis of the at least one first diagnosis result and an RAG agent program and / or an LLM agent program of the fault diagnosis service, obtaining a second diagnosis result, wherein the second diagnosis result is used as a solution to the fault; and outputting the second diagnosis result.

Description

Fault Diagnosis Method and Device

[0001] This application claims the priority of a Chinese patent application with the application number 202411833490.3 and the invention title "Fault Diagnosis Method and Device", which was filed with the China National Intellectual Property Administration on December 12, 2024. The entire content of this application is incorporated herein by reference.

Technical Field

[0002] This application relates to the field of computer technology, and more specifically, to a fault diagnosis method and device.

Background Art

[0003] Continuous Integration (CI) and Continuous Delivery / Deployment (CD), often abbreviated as CI / CD, are software development practices. Specifically, continuous integration emphasizes that developers frequently commit code changes to the code repository, with builds and tests run automatically to ensure the stability and reliability of the code. Continuous delivery further deploys the tested code to a test environment so that it is ready for release to the production environment at any time, but this step does not automatically deploy the code to the production environment. Continuous deployment is an extension of continuous delivery that automatically releases the code to the production environment to achieve rapid iteration and delivery. CI / CD significantly speeds up the software development life cycle through automated processes, improves development efficiency and software quality, and is an essential part of modern agile software development.

[0004] The diversity and complexity of the tools involved in the CI / CD pipeline increase the probability of failures and the difficulty of troubleshooting and repair. Fault location and diagnosis are difficult because of the variety of fault types, complex logs, complex distributed architectures, and difficulties in team collaboration. To improve the efficiency of fault location and diagnosis for the CI / CD pipeline, it is necessary to optimize the automatic diagnosis service supporting the CI / CD pipeline.

Summary of the Invention

[0005] This application provides a fault diagnosis method and device to improve the efficiency of fault diagnosis for the Continuous Integration and Continuous Delivery / Deployment (CI / CD) pipeline.

[0006] In a first aspect, a fault diagnosis method is provided, applied to a fault diagnosis service for a CI / CD pipeline. The method includes: acquiring a category of a fault of the CI / CD pipeline, a first problem description of the fault, and multiple Standard Operating Procedure (SOP) nodes of the fault diagnosis service, wherein the SOP nodes are used to obtain a diagnostic result based on a problem description; determining at least one first diagnostic result based on the first problem description and at least one first SOP node, among the multiple SOP nodes, that corresponds to the fault category; outputting the first diagnostic result if the at least one first diagnostic result can be used as a solution to the fault; or, if the first diagnostic result cannot be used as a solution to the fault, obtaining a second diagnostic result based on the at least one first diagnostic result and a retrieval-augmented generation (RAG) agent and / or a large language model (LLM) agent of the fault diagnosis service, the second diagnostic result being used as a solution to the fault; and outputting the second diagnostic result.
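The two-stage flow described above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the names `Diagnosis`, `diagnose`, `check_disk`, and `stub_agent` are hypothetical, SOP nodes are modeled as plain callables, and the RAG / LLM agent is replaced by a stub function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Diagnosis:
    text: str
    solves_fault: bool  # whether this result can be used as a solution


# Assumption: an SOP node maps a problem description to a diagnosis.
SopNode = Callable[[str], Diagnosis]
# Assumption: the fallback agent sees the description and the failed first results.
FallbackAgent = Callable[[str, List[Diagnosis]], Diagnosis]


def diagnose(category: str, description: str,
             sop_nodes: Dict[str, List[SopNode]],
             fallback_agent: FallbackAgent) -> Diagnosis:
    # Stage 1: run the SOP nodes that correspond to the fault category.
    first_results = [node(description) for node in sop_nodes.get(category, [])]
    for result in first_results:
        if result.solves_fault:
            return result  # a first diagnostic result is output
    # Stage 2: no first result solves the fault, so hand over to the agent.
    return fallback_agent(description, first_results)


# Hypothetical SOP node and stub agent, for illustration only.
def check_disk(desc: str) -> Diagnosis:
    return Diagnosis("disk full: clean the workspace", "No space left" in desc)


def stub_agent(desc: str, priors: List[Diagnosis]) -> Diagnosis:
    return Diagnosis(f"agent suggestion after {len(priors)} SOP attempt(s)", True)


nodes = {"build": [check_disk]}
solved_by_sop = diagnose("build", "No space left on device", nodes, stub_agent)
solved_by_agent = diagnose("build", "linker error", nodes, stub_agent)
```

Passing the failed first results to the agent mirrors the statement that the second diagnostic result is obtained "based on at least one first diagnostic result".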

[0007] The embodiments of this application determine the diagnostic result based on the SOP nodes of the fault diagnosis service for the CI / CD pipeline, which can improve the automation and scalability of the fault diagnosis method, so that the solution to the current fault can be obtained from the solution to a corresponding historical fault. When a fault cannot be resolved by the first diagnostic result, a second diagnostic result can be determined by the RAG agent and used as the solution to the fault, thereby improving the efficiency of fault diagnosis for the CI / CD pipeline.

[0008] In some implementations of the first aspect, at least one first SOP node forms an SOP tree. Determining at least one first diagnostic result based on a first problem description and at least one first SOP node from among a plurality of SOP nodes corresponding to a fault category includes: obtaining a third diagnostic result based on a second SOP node in the SOP tree, wherein the third diagnostic result is used to determine the first diagnostic result, or the third diagnostic result is used as the first diagnostic result, and the second SOP node is an SOP node among at least one first SOP node; determining the first diagnostic result and whether the first diagnostic result can be used as a solution to the fault based on a first condition, a second condition, a third condition, and the third diagnostic result, wherein the first condition is that the third diagnostic result can be used as a solution to the fault, the second condition is that the second SOP node is a leaf node of the SOP tree, and the third condition is that the third diagnostic result can be used as a problem description input to a child node of the second SOP node.

[0009] In this implementation, the structure of the SOP tree can guide the fault diagnosis service to execute the SOP node path, enabling the fault diagnosis service to determine the first diagnosis result and whether the first diagnosis result can be used as a solution to the fault, thereby reducing the possibility of manual intervention in the fault diagnosis process and improving the efficiency of fault diagnosis for CI / CD pipelines.

[0010] In some implementations of the first aspect, if the first and second conditions are not met: if the third condition is not met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault; or, if the third condition is met, the third diagnostic result is obtained, including: obtaining the third diagnostic result based on the child nodes of the second SOP node and the second problem description, wherein the diagnostic result output by the second SOP node is used as the second problem description, and after determining the second problem description, the child nodes of the second SOP node are used as the second SOP node.

[0011] In this implementation, the SOP tree can break down a broad problem description into multiple problem descriptions and diagnostic results with logical and dependent relationships. The SOP nodes preset by the fault diagnosis service can correspond to specific functions to improve the efficiency of fault diagnosis of CI / CD pipelines.

[0012] In some implementations of the first aspect, if the first condition is not met but the second condition is met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault.

[0013] In this implementation, if the existing knowledge of the SOP tree cannot resolve the fault, the first diagnostic result can be used as input to the fault diagnosis service's agent program to obtain the second diagnostic result. The fault diagnosis service can dynamically learn the solution to the fault, thereby improving the efficiency of fault diagnosis for the CI / CD pipeline in subsequent use.

[0014] In some implementations of the first aspect, the third diagnostic result is used as the first diagnostic result when the first condition is met, wherein the first diagnostic result can be used as a solution to the fault.
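The interplay of the three conditions can be sketched as a loop over the SOP tree. This is a hedged illustration: `NodeResult`, `SopTreeNode`, and `walk` are hypothetical names, and the `next_child` index is an assumption about how a node signals that its result can be fed to a child (the text does not specify how the child is selected).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class NodeResult:
    text: str
    is_solution: bool                 # first condition
    next_child: Optional[int] = None  # third condition: child accepting this result, if any


@dataclass
class SopTreeNode:
    run: Callable[[str], NodeResult]
    children: List["SopTreeNode"] = field(default_factory=list)


def walk(root: SopTreeNode, description: str) -> Tuple[str, bool]:
    """Return (first diagnostic result, whether it solves the fault)."""
    node, problem = root, description
    while True:
        result = node.run(problem)  # the "third diagnostic result"
        if result.is_solution:      # first condition met
            return result.text, True
        if not node.children:       # second condition met: leaf node
            return result.text, False
        child = result.next_child
        if child is None or child >= len(node.children):  # third condition not met
            return result.text, False
        # Third condition met: the result becomes the second problem
        # description and the child becomes the current ("second") SOP node.
        node, problem = node.children[child], result.text


leaf = SopTreeNode(run=lambda p: NodeResult("install the missing header package", "compile" in p))
root = SopTreeNode(run=lambda p: NodeResult("compile step failed", False, next_child=0),
                   children=[leaf])
dead_end = SopTreeNode(run=lambda p: NodeResult("no known cause", False))
no_route = SopTreeNode(run=lambda p: NodeResult("unclassified", False), children=[leaf])
```

Here `walk(root, "build failed")` descends once and ends at the leaf with a solution, while `dead_end` and `no_route` return unsolved results that would be handed to the agent program.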

[0015] In some implementations of the first aspect, the method further includes: generating at least one third SOP node based on the fault category and the first problem description, wherein the third SOP node corresponds to the fault category and is used to obtain a second diagnostic result based on the first problem description.

[0016] In this implementation, the RAG agent can dynamically learn solutions to faults based on existing knowledge and unlearned faults, thereby improving the efficiency of fault diagnosis for CI / CD pipelines in subsequent use.
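One way to picture the generation of a third SOP node is to register a new node that replays the learned solution for the same problem. The sketch below is an assumption-laden simplification: `learn_sop_node` and the registry layout are hypothetical, and exact string matching stands in for whatever matching the service actually uses.

```python
from typing import Callable, Dict, List, Optional

# Assumption: SOP nodes are callables grouped by fault category.
SopRegistry = Dict[str, List[Callable[[str], Optional[str]]]]


def learn_sop_node(registry: SopRegistry, category: str,
                   first_description: str, second_result: str):
    """Add a node that yields the learned second diagnostic result."""
    def learned_node(description: str) -> Optional[str]:
        # Simplification: only an identical problem description matches.
        return second_result if description == first_description else None

    registry.setdefault(category, []).append(learned_node)
    return learned_node


registry: SopRegistry = {}
learn_sop_node(registry, "deploy", "timeout pulling image", "configure a registry mirror")
```

After this call, a later fault in the same category with the same description can be resolved by the SOP nodes without invoking the agent again.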

[0017] In some implementations of the first aspect, the SOP node includes metadata of at least one knowledge unit corresponding to the same category as the SOP node, wherein the knowledge unit includes a problem description and solution for a historical failure of the CI / CD pipeline, and the SOP node is obtained based on at least one knowledge unit.

[0018] In this implementation, the SOP node includes metadata of the knowledge unit, which can more accurately and quickly determine the correlation between the SOP node and historical faults, thereby improving the automation and scalability of the fault diagnosis method.

[0019] In some implementations of the first aspect, the SOP node is used to obtain diagnostic results based on the problem description and the problem description and solution of at least one knowledge unit corresponding to the SOP node.

[0020] In this implementation, knowledge units correspond to common faults in the CI / CD pipeline and difficult faults for which solutions mainly rely on human experience. The diagnostic results obtained from the knowledge units can further improve the automation and scalability of the fault diagnosis method.

[0021] In some implementations of the first aspect, the SOP node is used to obtain diagnostic results based on the problem description and the agent corresponding to the SOP node, wherein the agent includes at least one of an application programming interface (API) agent, an LLM agent, and a RAG agent.

[0022] In this implementation, API agents make fault diagnosis steps repeatable, traceable, and progress trackable, thereby improving the automation and scalability of fault diagnosis. LLM agents can effectively process natural language data such as logs, code snippets, and problem descriptions, enabling fault diagnosis services to work collaboratively with other services. RAG agents can effectively integrate human experience and external databases to attempt to solve unlearned faults, thereby improving the automation and scalability of fault diagnosis methods.
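The binding between an SOP node and one of the three agent kinds can be sketched as a small dispatch table. The names `AgentKind`, `bind_sop_node`, and `stub_agents` are hypothetical, and the agents are stubs that only tag their input; no real API, LLM, or retrieval call is made.

```python
from enum import Enum
from typing import Callable, Dict


class AgentKind(Enum):
    API = "api"
    LLM = "llm"
    RAG = "rag"


Agent = Callable[[str], str]


def bind_sop_node(kind: AgentKind, agents: Dict[AgentKind, Agent]) -> Agent:
    """Return an SOP node that delegates to the agent of the given kind."""
    agent = agents[kind]

    def sop_node(problem_description: str) -> str:
        return agent(problem_description)

    return sop_node


# Stub agents for illustration only.
stub_agents: Dict[AgentKind, Agent] = {
    AgentKind.API: lambda desc: f"[api] re-ran failed step for: {desc}",
    AgentKind.LLM: lambda desc: f"[llm] summarized log: {desc}",
    AgentKind.RAG: lambda desc: f"[rag] retrieved similar incident for: {desc}",
}
node = bind_sop_node(AgentKind.LLM, stub_agents)
```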

[0023] In a second aspect, a fault diagnosis device is provided, which is applied to a fault diagnosis service for a continuous integration and continuous delivery / deployment (CI / CD) pipeline. The device includes: an acquisition module for acquiring a category of a fault of the CI / CD pipeline, a first problem description of the fault, and multiple Standard Operating Procedure (SOP) nodes of the fault diagnosis service, wherein the SOP nodes are used to obtain a diagnostic result based on a problem description; and a processing module for determining at least one first diagnostic result based on the first problem description and at least one first SOP node, among the multiple SOP nodes, that corresponds to the fault category; outputting the first diagnostic result if the at least one first diagnostic result can be used as a solution to the fault; or, if the first diagnostic result cannot be used as a solution to the fault, obtaining a second diagnostic result based on the at least one first diagnostic result and a retrieval-augmented generation (RAG) agent and / or a large language model (LLM) agent of the fault diagnosis service, wherein the second diagnostic result is used as a solution to the fault; and outputting the second diagnostic result.

[0024] In some implementations of the second aspect, at least one first SOP node forms an SOP tree, and the processing module is further used for:

[0025] obtaining a third diagnostic result based on a second SOP node in the SOP tree, wherein the third diagnostic result is used to determine the first diagnostic result, or the third diagnostic result is used as the first diagnostic result, and the second SOP node is an SOP node among the at least one first SOP node; and determining, based on a first condition, a second condition, a third condition, and the third diagnostic result, the first diagnostic result and whether the first diagnostic result can be used as a solution to the fault, wherein the first condition is that the third diagnostic result can be used as a solution to the fault, the second condition is that the second SOP node is a leaf node of the SOP tree, and the third condition is that the third diagnostic result can be used as a problem description input to a child node of the second SOP node.

[0026] In some implementations of the second aspect, if the first and second conditions are not met: if the third condition is not met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault; or, if the third condition is met, the processing module is further configured to: obtain the third diagnostic result based on the child nodes of the second SOP node and the second problem description, wherein the diagnostic result output by the second SOP node is used as the second problem description, and after determining the second problem description, the child nodes of the second SOP node are used as the second SOP node.

[0027] In some implementations of the second aspect, if the first condition is not met but the second condition is met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault.

[0028] In some implementations of the second aspect, the third diagnostic result is used as the first diagnostic result when the first condition is met, wherein the first diagnostic result can be used as a solution to the fault.

[0029] In some implementations of the second aspect, the processing module is further configured to: generate at least one third SOP node based on the fault category and the first problem description, wherein the third SOP node corresponds to the fault category and is used to obtain a second diagnostic result based on the first problem description.

[0030] In some implementations of the second aspect, the SOP node includes metadata of at least one knowledge unit corresponding to the same category as the SOP node, wherein the knowledge unit includes a description of a historical failure of the CI / CD pipeline and a solution, and the SOP node is obtained based on at least one knowledge unit.

[0031] In some implementations of the second aspect, the SOP node is used to obtain diagnostic results based on the problem description and the problem description and solution of at least one knowledge unit corresponding to the SOP node.

[0032] In some implementations of the second aspect, the SOP node is used to obtain diagnostic results based on the problem description and the agent corresponding to the SOP node, wherein the agent includes at least one of the application programming interface (API) agent, LLM agent, and RAG agent.

[0033] In a third aspect, a computing device cluster is provided, the computing device cluster including at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the computing device cluster to perform the method as described in any implementation of the first aspect.

[0034] In a fourth aspect, a computer program product containing instructions is provided, which, when run by a cluster of computing devices, causes the cluster of computing devices to perform the method as described in any implementation of the first aspect.

[0035] In a fifth aspect, a computer-readable storage medium is provided, including computer program instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method as described in any implementation of the first aspect.

Brief Description of the Drawings

[0036] Figure 1 is a flowchart illustrating a fault diagnosis method provided in an embodiment of this application.

[0037] Figure 2 is a schematic block diagram of the architecture of a fault diagnosis service provided in an embodiment of this application.

[0038] Figure 3 is a flowchart illustrating a fault diagnosis method provided in an embodiment of this application.

[0039] Figure 4 is a schematic block diagram of a fault diagnosis device provided in an embodiment of this application.

[0040] Figure 5 is a schematic block diagram of a computing device provided in an embodiment of this application.

[0041] Figure 6 is a schematic block diagram of a computing device cluster provided in an embodiment of this application.

[0042] Figure 7 is a schematic block diagram of a computing device cluster provided in an embodiment of this application.

Detailed Description

[0043] First, the technical terms related to this application will be explained.

[0044] Continuous Integration (CI) and Continuous Delivery / Deployment (CD), often abbreviated as CI / CD, are software development practices. Specifically, Continuous Integration emphasizes developers frequently committing code changes to the code repository, automating builds and tests to ensure code stability and reliability. Continuous Delivery further deploys this tested code to a test environment, ready for release to the production environment, although this step does not automatically deploy the code to production. Continuous Deployment, an extension of Continuous Delivery, automatically releases code to the production environment, enabling rapid iteration and delivery.

[0045] In the CI / CD process, standard operating procedures (SOPs) define a series of standard steps and specifications for performing continuous integration and continuous delivery / deployment. An SOP node is a node in an SOP and represents a specific task or operational step, such as code commit, automated build, testing, or deployment. An SOP tree is a collection of SOP nodes organized in a tree structure. It illustrates the logical relationships and execution order between the nodes in the CI / CD process, ensuring the coherence and accuracy of the entire process.

[0046] In a Standard Operating Procedure (SOP) tree, the parent-child node relationship is the most fundamental and core relationship. A parent node represents a larger task or operation, while child nodes represent smaller, more specific tasks or operations broken down from that task or operation. For example, in a CI / CD process, a parent node might represent an overall build task, while its child nodes might represent specific build steps such as code fetching, compiling, and packaging. Sibling nodes are two or more nodes that share the same parent node. Sibling nodes are responsible for different tasks or operations, and these tasks or operations are performed within the framework of the larger task or operation represented by their common parent node. For example, in a CI / CD process, automated builds and automated tests might be two parallel sibling nodes. In an SOP tree, a node without a parent node is called the root node, which is usually used as the entry point for the SOP process. Nodes without child nodes are called leaf nodes. The SOP process ends after the task corresponding to the leaf node is completed. Other nodes in the SOP tree are usually called internal nodes.
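The tree terminology above (root, internal, leaf, and sibling nodes) can be made concrete with a small data structure. The class `Node` and its methods are hypothetical names chosen for illustration; the node names follow the build example in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursive field comparison
class Node:
    name: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, name: str) -> "Node":
        """Create a child node under this node and return it."""
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def kind(self) -> str:
        if self.parent is None:
            return "root"      # no parent: entry point of the SOP process
        if not self.children:
            return "leaf"      # no children: the SOP process ends here
        return "internal"

    def siblings(self) -> List[str]:
        """Names of the other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [c.name for c in self.parent.children if c is not self]


pipeline = Node("pipeline")
build = pipeline.add("build")
test = pipeline.add("test")
fetch = build.add("fetch code")
compile_step = build.add("compile")
package = build.add("package")
```

In this sketch `build` and `test` are siblings under the root, and the three build steps are leaf nodes under the internal `build` node.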

[0047] In the CI / CD process, agents act as a bridge connecting Standard Operating Procedure (SOP) nodes and the actual execution of operations. Specifically, an agent is assigned to execute tasks within one or more SOP nodes. When the CI / CD process progresses to a particular SOP node, the corresponding agent is triggered and begins executing the specific operations defined for that node. Furthermore, agents can adaptively adjust based on environmental awareness and user interaction results, improving the flexibility and robustness of task execution.

[0048] Retrieval-augmented generation (RAG) is a model architecture that combines the advantages of information retrieval and text generation. Before generating text, it retrieves relevant documents or paragraphs from an external knowledge base using a retrieval system. This retrieved information is then used as contextual input to the generative model, thereby enhancing the accuracy and relevance of the generated content. RAG technology mainly consists of a retrieval phase and a generation phase. In the retrieval phase, the model uses a retrieval system (such as vector-based retrieval techniques) to retrieve documents or paragraphs relevant to the input query from a pre-established knowledge base. This knowledge base can be any type of document collection, such as Wikipedia, specialized databases, or collections of academic papers. In the generation phase, the generative model, such as a pre-trained Large Language Model (LLM), uses the retrieved information as contextual input, combined with the original query, to generate the final text content.
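The retrieval phase and the generation phase can be sketched with standard-library code only. This is a toy under stated assumptions: bag-of-words cosine similarity stands in for a real vector retrieval system, the `generate` callable stands in for a pre-trained LLM, and the function names and sample documents are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def _vec(text: str) -> Counter:
    # Toy "embedding": lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Retrieval phase: return the k documents most similar to the query."""
    q = _vec(query)
    return sorted(docs, key=lambda d: _cosine(q, _vec(d)), reverse=True)[:k]


def rag_answer(query: str, docs: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Generation phase: pass retrieved context plus the query to the model."""
    context = "\n".join(retrieve(query, docs, k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)


docs = [
    "npm install fails when the registry is unreachable",
    "docker push denied: check registry credentials",
    "unit tests time out on slow runners",
]
top = retrieve("docker push denied", docs, k=1)
echoed = rag_answer("docker push denied", docs, generate=lambda prompt: prompt, k=1)
```

Using an identity function as `generate` makes the assembled prompt visible, which is convenient for checking what context the model would receive.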

[0049] A retrieval-augmented generation agent (RAG agent) is an intelligent entity that combines RAG technology with the functionality and characteristics of an agent. The RAG agent can perceive its environment, reason, make decisions, and execute tasks. By retrieving information from a large knowledge base and generating text based on that information, it produces more accurate and diverse text content. It not only possesses strong language understanding and generation capabilities but also improves the accuracy of generated content based on information retrieved in real time, making text generation more intelligent and personalized.

[0050] Automated diagnostics is an intelligent tool that automatically monitors, analyzes, and diagnoses potential problems in the CI / CD process. By collecting and analyzing log information, performance metrics, and other relevant data in real time, automated diagnostics can quickly identify faults, configuration problems, or performance bottlenecks in the code and provide detailed diagnostic reports and solution recommendations.

[0051] A problem-knowledge base, also known as a question-answer database, is a way of organizing information stored in the form of key-value pairs. Each problem serves as a key, and the related answers, solutions, or fault information are stored as values, forming multiple problem-knowledge pairs, which are also called knowledge units. In a CI / CD pipeline, the problem-knowledge base usually brings together known problems and their corresponding solutions from multiple sources, such as historical faults, code analysis, log records, and the experience of the operations and maintenance team.

[0052] The diversity and complexity of the tools involved in the CI / CD pipeline increase the probability of failures and the difficulty of troubleshooting and repair. Fault location and diagnosis are difficult because of the variety of fault types, complex logs, complex distributed architectures, and difficulties in team collaboration. To improve the efficiency of fault location and diagnosis for the CI / CD pipeline, it is necessary to optimize the automatic diagnosis service that supports the CI / CD pipeline.

[0053] In view of this, this application provides a fault diagnosis method, which is applied to a fault diagnosis service for a continuous integration and continuous delivery / deployment (CI / CD) pipeline. The method includes: obtaining the category of a fault of the CI / CD pipeline and a first problem description of the fault, as well as multiple Standard Operating Procedure (SOP) nodes of the fault diagnosis service, where the SOP nodes are used to obtain a diagnosis result according to a problem description; and determining at least one first diagnosis result according to the first problem description and at least one first SOP node, among the multiple SOP nodes, that corresponds to the category of the fault. If the at least one first diagnosis result can be used as a solution to the fault, the first diagnosis result is output; or, if the first diagnosis result cannot be used as a solution to the fault, a second diagnosis result is obtained according to the at least one first diagnosis result and the retrieval-augmented generation (RAG) agent program and / or the large language model (LLM) agent program of the fault diagnosis service, the second diagnosis result is used as a solution to the fault, and the second diagnosis result is output.

[0054] The embodiment of this application determines the diagnosis result according to the SOP nodes of the fault diagnosis service for the CI / CD pipeline, which can improve the degree of automation and the scalability of the fault diagnosis method, so that the solution to the current fault can be obtained from the solution to a corresponding historical fault. When a fault cannot be solved by the first diagnosis result, the second diagnosis result can also be determined by the RAG agent program and used as a solution to the fault, thereby improving the efficiency of fault diagnosis for the CI / CD pipeline.

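[0055] The following describes, with reference to the flowchart of the fault diagnosis method shown in Figure 1, the specific process by which the method is applied to the fault diagnosis service for a CI / CD pipeline.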

[0056] S110: Obtain the category of the CI / CD pipeline fault and the first problem description of the fault, as well as multiple SOP nodes of the fault diagnosis service, wherein the SOP nodes are used to obtain a diagnosis result based on a problem description.

[0057] First, we will explain the methods for obtaining the categories of CI / CD pipeline faults and the problem descriptions of those faults. In some embodiments, when a user executes a CI / CD pipeline and the pipeline is interrupted due to a fault, the fault diagnosis service can directly obtain the CI / CD pipeline's fault logs. Alternatively, the user can also describe the cause, type, and possible solutions of the fault in natural language, and input the corresponding description through the fault diagnosis service's large language model, allowing the service to obtain more fault information.

[0058] In one possible implementation, after obtaining the fault logs of the CI / CD pipeline, the fault logs can be organized into fault categories and fault descriptions using the log parsing service in the fault diagnosis service. First, log filtering is performed to remove irrelevant logs, retaining only log entries directly related to the fault. Next, log analysis is conducted, using pattern matching and log classification techniques to determine the approximate fault category and core fault information. Finally, context extraction is performed, collecting contextual metadata related to the current task through the metadata of the CI / CD process and the execution machine, such as the identifier of the build machine, code version, build configuration, execution machine file directory, environment configuration, etc. The final problem description typically includes precise fault logs and contextual metadata, and may also include a natural language description from the user. The fault category can be represented by relevant metadata. It should be understood that the fault information classification method can include classification based on at least one combination of the aforementioned metadata; this application does not limit the classification method.
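The three log-parsing steps (filtering, classification by pattern matching, and context extraction) can be sketched as below. Everything specific here is an assumption for illustration: the regular expressions, category names, log lines, and context keys are hypothetical and are not taken from the application.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical patterns; a real service would define its own.
FAULT_LINE = re.compile(r"\b(ERROR|FATAL)\b")
CATEGORY_PATTERNS = {
    "dependency_conflict": re.compile(r"version conflict|could not resolve dependenc", re.I),
    "build_failure": re.compile(r"compilation failed|build failed", re.I),
    "deployment_failure": re.compile(r"deploy(ment)? failed", re.I),
}


def parse_fault_log(lines: List[str], context: Dict[str, str]) -> Tuple[str, dict]:
    # Step 1, log filtering: keep only entries directly related to the fault.
    relevant = [line for line in lines if FAULT_LINE.search(line)]
    # Step 2, log analysis: the first matching pattern gives the category.
    category = "unknown"
    for name, pattern in CATEGORY_PATTERNS.items():
        if any(pattern.search(line) for line in relevant):
            category = name
            break
    # Step 3, context extraction: attach metadata about the current task.
    description = {"fault_log": relevant, "context": dict(context)}
    return category, description


log_lines = [
    "INFO fetching sources",
    "ERROR build failed: compilation failed in module core",
    "DEBUG cache hit",
]
category, description = parse_fault_log(
    log_lines, {"build_machine": "runner-1", "code_version": "abc123"}
)
```

The returned pair corresponds to the fault category and the problem description that are passed on to the SOP nodes.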

[0059] Before applying fault diagnosis services to a CI / CD pipeline, a Standard Operating Procedure (SOP) knowledge base needs to be set up for the fault diagnosis service. The following describes how to obtain multiple preset SOPs.

[0060] S210: Derive multiple knowledge units based on the problem descriptions and solutions of historical failures in the CI / CD pipeline.

[0061] In some embodiments, the development team builds a question-and-answer database based on experience. First, it collects data such as problem descriptions (e.g., fault logs) and solutions (e.g., optimization strategies) of historical faults closely related to the CI / CD pipeline and stores them uniformly in the database. Second, it organizes these data according to categories. Finally, it clarifies the correspondence between the problem description and the solution for each fault, forming multiple knowledge units.

[0062] In one possible implementation, a knowledge unit comprises multiple metadata entries used to determine its classification. For example, metadata related to a problem description might include a problem identifier, a problem title providing a brief description or summary of the problem, fault logs, stages of the CI / CD testing process (build, test, deployment, etc.), problem types such as build failures, deployment failures, and dependency conflicts, and the problem's fault code. Metadata related to a solution (i.e., knowledge) might include an answer identifier, answer sources such as expert suggestions, historical records, and external documentation. Other metadata might include tags composed of keywords or phrases related to the problem or knowledge, and identifiers associated with related problems.

[0063] It should be understood that the classification method of knowledge units includes a combination of at least one of the following categories: stage of CI / CD testing process, problem type, problem fault code, answer source, problem label, etc. This application does not limit the selection of categories in the clustering algorithm.

[0064] In one possible implementation, clustering algorithms are used to divide multiple knowledge units in a question-and-answer database into multiple categories. Specifically, clustering algorithms group data points in a dataset according to a specific criterion (such as distance or similarity), ensuring that data points within the same cluster are similar to each other, while data points in different clusters are significantly different. This grouping process does not require predefined category labels; instead, the algorithm automatically discovers patterns and structures in the data. Common clustering algorithms include K-means, hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN).
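
As a minimal sketch of similarity-based grouping, the following example clusters knowledge units by the overlap of their tags. It uses single-linkage grouping with a fixed Jaccard-similarity threshold, which is one simple form of hierarchical clustering chosen here for brevity; the function names, the threshold value, and the tag-based representation are assumptions of the example, not the method of this application.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two tag collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_tags(units, threshold=0.5):
    """Group unit ids whose tag sets are linked by similarity >= threshold.

    units: dict mapping a knowledge-unit id to its tags.
    Returns a sorted list of clusters, each a sorted list of ids.
    """
    parent = {uid: uid for uid in units}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of units that is similar enough (single linkage).
    for a, b in combinations(units, 2):
        if jaccard(units[a], units[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for uid in units:
        clusters.setdefault(find(uid), []).append(uid)
    return sorted(sorted(members) for members in clusters.values())
```

No category labels are supplied in advance; the groups emerge from the pairwise similarities alone, which matches the unsupervised nature of clustering described above.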

[0065] S220, generate multiple SOP nodes based on multiple knowledge units and preset generation rules. Each SOP node includes metadata of at least one knowledge unit corresponding to the same category as the SOP node.

[0066] In some embodiments, an SOP node includes metadata, such as a node identifier, node name, node type (build, test, deployment, etc.), triggering conditions (code commit, specific time, dependency task completion, etc.), input parameters (code repository address, build parameters, environment variables, etc.), operation instructions (query static analysis configuration, obtain build configuration, etc.), expected outputs (build artifacts, test reports, etc.), execution environment (specific compiler version, testing framework, etc.), log output configuration, and dependencies. The metadata of an SOP node also includes metadata for at least one corresponding knowledge unit, ensuring that the SOP node includes as complete and comprehensive information as possible to determine the correspondence between knowledge units and SOP nodes.
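
One possible in-memory representation of the metadata listed above is sketched below. The field names and types are assumptions made for illustration, and the rule that a knowledge unit "belongs to the same category" is approximated here by comparing the unit's pipeline stage with the node type.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeUnit:
    problem_id: str
    title: str
    stage: str            # build, test, deployment, ...
    problem_type: str     # e.g. build failure, dependency conflict
    fault_log: str
    solution: str
    answer_source: str = "historical record"
    tags: list[str] = field(default_factory=list)


@dataclass
class SOPNode:
    node_id: str
    name: str
    node_type: str        # build, test, deployment, ...
    trigger_conditions: list[str] = field(default_factory=list)
    input_parameters: dict[str, str] = field(default_factory=dict)
    operation_instructions: list[str] = field(default_factory=list)
    expected_outputs: list[str] = field(default_factory=list)
    execution_environment: dict[str, str] = field(default_factory=dict)
    depends_on: list[str] = field(default_factory=list)
    knowledge_units: list[KnowledgeUnit] = field(default_factory=list)

    def add_knowledge_unit(self, unit: KnowledgeUnit) -> None:
        # Assumption: "same category" is checked via stage == node_type.
        if unit.stage != self.node_type:
            raise ValueError(f"unit {unit.problem_id} does not match node type {self.node_type}")
        self.knowledge_units.append(unit)
```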

[0067] In one possible implementation, algorithms such as word segmentation and stemming can be used to preprocess the metadata of knowledge units. Then, multiple SOP nodes can be generated based on predefined generation rules such as text similarity or keyword matching. The specific processes of word segmentation, stemming, and text similarity matching can be implemented using a Large Language Model (LLM). Alternatively, suitable prompt templates can be set for the LLM, enabling it to generate SOP nodes conforming to a certain format based on the knowledge unit's metadata. Finally, the generated SOP nodes may require further manual review and modification.

[0068] In this implementation, the SOP node includes metadata of the knowledge unit, which can more accurately and quickly determine the correlation between the SOP node and historical faults, thereby improving the automation and scalability of the fault diagnosis method.

[0069] S230: Obtain an SOP knowledge base based on multiple SOP nodes, and the SOP nodes in the SOP knowledge base form an SOP tree.

[0070] In some embodiments, multiple SOP trees are obtained based on multiple SOP nodes and multiple knowledge units. Metadata of multiple SOP nodes is read, including node ID, name, type, execution conditions, input parameters, and detailed content of the knowledge units they comprise, to understand the function and role of each node. Then, based on the metadata of partial types of SOP nodes, such as their function or position in the CI / CD process, they are categorized into different types. For example, nodes can be categorized into build, test, and deployment types. Next, the dependencies between SOP nodes are analyzed. Typically, the execution of one SOP node may depend on the completion of another node; this dependency determines the parent-child relationship in the SOP tree. Furthermore, other metadata such as triggering conditions, build parameters, and execution environment also indirectly include information about the dependencies between SOP nodes. Finally, based on the dependencies, the parent and child nodes of each SOP node are determined. The parent node is the node that triggers the execution of the child node, and the child node is the node that executes after the parent node. The preprocessing of the metadata of the knowledge units and the construction of the SOP tree according to preset matching rules are similar to the aforementioned embodiments and will not be repeated here. Obviously, depending on the different classification methods, this embodiment will generate multiple SOP trees organized in different ways.
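
The tree construction described above can be sketched as follows, assuming each SOP node's metadata records its type and at most one node it depends on. The dictionary layout and the `depends_on` key are hypothetical; real metadata may express dependencies indirectly through triggering conditions or the execution environment.

```python
from collections import defaultdict


def build_sop_trees(nodes):
    """Derive parent-child links from dependency metadata.

    nodes: dict mapping node id to {"type": str, "depends_on": str or None}.
    Returns (roots_by_type, children), where children maps a parent id to
    the ids of the nodes that execute after it.
    """
    children = defaultdict(list)
    roots_by_type = defaultdict(list)
    for node_id, meta in nodes.items():
        parent = meta.get("depends_on")
        if parent is None:
            # A node without a dependency is the root of a tree for its type.
            roots_by_type[meta["type"]].append(node_id)
        elif parent in nodes:
            children[parent].append(node_id)
        else:
            raise KeyError(f"node {node_id} depends on unknown node {parent}")
    return dict(roots_by_type), dict(children)
```

With nodes #101, #102, and #103 chained by dependency and a separate deployment node #201, this yields one tree rooted at #101 for the build type and one rooted at #201 for the deployment type.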

[0071] In one possible implementation, SOP nodes and their corresponding knowledge units may not be stored in the same database. For example, according to the schematic diagram of the fault diagnosis service shown in Figure 2, SOP nodes form multiple SOP trees based on categories and are stored in the SOP knowledge base, while knowledge units are stored in the question-and-answer database. Specifically, SOP node #101 corresponds to two knowledge units #101, and SOP node #203 corresponds to two knowledge units #203. It can be considered that SOP node #101 includes two knowledge units #101, and SOP node #203 includes two knowledge units #203.

[0072] S120, based on the first problem description and at least one first SOP node among a plurality of SOP nodes corresponding to the category of the fault, determine at least one first diagnostic result.

[0073] In some embodiments, the representation of diagnostic results is similar to that of problem descriptions. Diagnostic results include diagnostic logs and contextual metadata, and may also include natural language output from the LLM. Furthermore, diagnostic results may include categories determined based on the contextual metadata. Therefore, in the SOP tree, the diagnostic results output by an SOP node can be used as problem descriptions input to its child nodes.

[0074] In some embodiments, at least one first SOP node forms an SOP tree as shown in Figure 2. A third diagnostic result can be obtained based on a second SOP node in the SOP tree, wherein the third diagnostic result is used to determine the first diagnostic result, or the third diagnostic result is used as the first diagnostic result, and the second SOP node is an SOP node among at least one first SOP node. Then, based on a first condition, a second condition, a third condition, and the third diagnostic result, it is determined whether the first diagnostic result can be used as a solution to the fault, wherein the first condition is that the third diagnostic result can be used as a solution to the fault, the second condition is that the second SOP node is a leaf node of the SOP tree, and the third condition is that the third diagnostic result can be used as a problem description input to a child node of the second SOP node.

[0075] In this implementation, the structure of the SOP tree can guide the fault diagnosis service to execute the SOP node path, enabling the fault diagnosis service to determine the first diagnosis result and whether the first diagnosis result can be used as a solution to the fault, thereby reducing the possibility of manual intervention in the fault diagnosis process and improving the efficiency of fault diagnosis for CI / CD pipelines.

[0076] Specifically, SOP nodes can perform various tasks through agents corresponding to those SOP nodes in the fault diagnosis service. These agents can be of multiple types and can perform tasks in various ways. For example, an Application Programming Interface (API) agent can query open-source component information, obtain access control template details, and retrieve the build machine environment; these tasks are completed by calling the corresponding API. An LLM agent can call the API of a large language model to generate the required natural language and code snippets through preset or dynamically concatenated prompts. A RAG agent can retrieve and obtain relevant information from a knowledge base and combine it with the aforementioned LLM text generation capabilities to support task execution. The knowledge base can include knowledge bases from the public internet or the question-and-answer database provided in this application. It should be understood that the above description is only an example, and this application does not limit the way the agent performs SOPs.

[0077] In this implementation, API agents make fault diagnosis steps repeatable, traceable, and progress trackable, thereby improving the automation and scalability of fault diagnosis. LLM agents can effectively process natural language data such as logs, code snippets, and problem descriptions, enabling fault diagnosis services to work collaboratively with other services. RAG agents can effectively integrate human experience and external databases to attempt to solve unlearned faults, thereby improving the automation and scalability of fault diagnosis methods.

[0078] The following example, based on the flowchart shown in Figure 3, illustrates a method for determining at least one first diagnostic result and whether the first diagnostic result can be used as a solution to the fault.

[0079] In some embodiments, if the first and second conditions are not met but the third condition is met, a third diagnostic result is obtained based on the child nodes of the second SOP node and the second problem description. The diagnostic result output by the second SOP node is used as the second problem description, and after determining the second problem description, the child nodes of the second SOP node are used as the second SOP node. For example, multiple first SOP nodes form an SOP tree corresponding to category #1 as shown in Figure 2. The initial value of the second SOP node is the root node (i.e., SOP node #101), and the category corresponding to the root node is "build failure." Its specific function is to determine the specific reason for the build failure. Assuming the current CI / CD pipeline fails during the build phase because the API agent detects a high-risk vulnerability in version 5.6 of an open-source software package A, causing the package to be intercepted (referred to as the vulnerability interception scenario), the diagnostic result output by SOP node #101 is similar to "(vulnerability interception, package A, 5.6, logs including vulnerability information)". Clearly, this diagnostic result cannot be directly used as a solution to the vulnerability interception problem; that is, this diagnostic result will not be used as the first diagnostic result, i.e., the first condition is not met. Furthermore, the root node is not a leaf node, i.e., the second condition is not met.

[0080] Next, it is determined whether the diagnostic results output by the second SOP node can be used as problem descriptions for its child nodes. For example, the specific function of a child node #102 of SOP node #101 is to determine the version number compatible with a package based on the specific version of the package. For example, based on the version 5.6 of package A, it is determined that versions 5.5 and 5.4 are compatible versions. In other words, the diagnostic results output by SOP node #101 are used as the second problem description, and SOP node #102 is used as the new second SOP node.

[0081] In this implementation, the SOP tree can break down a broad problem description into multiple problem descriptions and diagnostic results with logical and dependent relationships. The SOP nodes preset by the fault diagnosis service can correspond to specific functions to improve the efficiency of fault diagnosis of CI / CD pipelines.

[0082] In one possible implementation, the SOP node is used to obtain a diagnostic result based on the problem description, and the problem description and solution of at least one knowledge unit corresponding to the SOP node. Specifically, to determine whether the diagnostic result output by the second SOP node can be used as a problem description for a child node, an LLM agent can be used to calculate the text similarity between the log of the diagnostic result and the log of historical faults in the knowledge unit of the child node. Subsequently, the diagnostic result can be adjusted or a more reliable diagnostic result can be regenerated based on the text similarity to make the diagnostic result output by the second SOP node as usable as possible for a child node's problem description. Of course, the problem descriptions of historical faults and the current diagnostic result also include other information such as contextual metadata, which can also improve the reliability of the diagnostic result. This application does not limit the specific method for obtaining the diagnostic result.
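
The similarity check described above can be illustrated with a simple stand-in. This application contemplates an LLM agent computing the text similarity; the sketch below instead uses `difflib.SequenceMatcher` from the Python standard library as a lightweight substitute, and the function names and the 0.6 threshold are assumptions of the example.

```python
from difflib import SequenceMatcher


def log_similarity(a, b):
    """Character-level similarity ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matching_unit(diagnostic_log, historical_logs, threshold=0.6):
    """Find the historical fault log closest to the diagnostic log.

    historical_logs: dict mapping a knowledge-unit id to its fault log.
    Returns (unit_id, score), or (None, score) when no log reaches the threshold.
    """
    best_id, best_score = None, 0.0
    for unit_id, log in historical_logs.items():
        score = log_similarity(diagnostic_log, log)
        if score > best_score:
            best_id, best_score = unit_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score
```

A score below the threshold would be the signal to adjust or regenerate the diagnostic result before passing it to the child node.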

[0083] In this implementation, knowledge units correspond to common faults in the CI / CD pipeline and difficult faults for which solutions mainly rely on human experience. The diagnostic results obtained from the knowledge units can further improve the automation and scalability of the fault diagnosis method.

[0084] In one possible implementation, if the first condition is met, the third diagnostic result is used as the first diagnostic result, which can be used as a solution to the problem. For example, if the API agent finds package A with version 5.5 in the package repository, SOP node #102 can output the diagnostic result "Change the version of package A to 5.5 and rebuild" through the LLM agent. If there are no high-risk vulnerabilities in versions 5.5 and 5.4 of package A, "Change the version of package A to 5.5 and rebuild" is used as the first diagnostic result and can be used as a solution to the problem.

[0085] In one possible implementation, if the first, second, and third conditions are not met, the third diagnostic result is used as the first diagnostic result, whereby the first diagnostic result cannot be used as a solution to the fault. For example, if versions 5.5 and 5.4 of software package A also have vulnerabilities similar to version 5.6, the diagnostic result obtained by SOP node #102 is "version 5.5 of software package A has a similar vulnerability." This diagnostic result is merely a description of the fault and obviously cannot be used as a solution to the fault, i.e., the first condition is not met. Furthermore, SOP node #102 has a child node #103, so it is not a leaf node and therefore does not meet the second condition. Moreover, the specific function of child node #103 of SOP node #102 is set under the assumption that the diagnostic result of SOP node #102 is "change the version of software package A to 5.5 and rebuild," such as "mark and attempt to fix the different programming interfaces in the changed version of the software package compared to the original software package." Obviously, "version 5.5 of software package A has a similar vulnerability" cannot be used as a problem description input to SOP node #103, i.e., the third condition is not met. In conclusion, the SOP tree cannot further resolve the fault; "version 5.5 of package A has a similar vulnerability" is used as the first diagnostic result and cannot be used as a solution to the fault.

[0086] In some embodiments, the logging tool for software package C outputs a fault log to indicate that package C triggered a fault during the software build process; however, this log does not include the fault type and context information. Specifically, an appropriate prompt template can be set for the LLM agent, causing it to output a natural language description indicating that the fault is related to package C. However, this fault description is incomplete and serves as the initial diagnostic result, not a solution. Furthermore, the LLM agent can obtain common faults that package C may encounter from a question-and-answer database for further reference by developers.

[0087] In one possible implementation, if the first condition is not met but the second condition is met, the third diagnostic result is used as the first diagnostic result, where the first diagnostic result cannot be used as a solution to the fault. Assuming the SOP tree corresponding to category #1 shown in Figure 2 does not include SOP node #103, then SOP node #102 is a leaf node, satisfying the second condition, "version 5.5 of software package A has a similar vulnerability" is used as the first diagnostic result, but cannot be used as a solution to the fault.
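
The interplay of the first, second, and third conditions can be summarized in a short traversal sketch based on the vulnerability interception scenario. The `Node` and `Diagnosis` types, the `accepts` predicate, and the string checks are hypothetical simplifications; in the fault diagnosis service these decisions would be made by the agents of each SOP node.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Diagnosis:
    text: str
    is_solution: bool  # first condition: the result can be used as a solution


@dataclass
class Node:
    node_id: str
    run: Callable[[str], Diagnosis]                        # problem description -> diagnostic result
    accepts: Callable[[str], bool] = lambda text: True     # can the text be this node's input?
    children: list["Node"] = field(default_factory=list)


def diagnose(root, problem):
    """Walk the SOP tree; return (first diagnostic result, usable as solution)."""
    node = root
    while True:
        result = node.run(problem)                 # third diagnostic result
        if result.is_solution:                     # first condition met
            return result, True
        if not node.children:                      # second condition met: leaf node
            return result, False
        nxt = next((c for c in node.children if c.accepts(result.text)), None)
        if nxt is None:                            # third condition not met
            return result, False
        problem, node = result.text, nxt           # result becomes the next problem description


def build_tree(safe_versions):
    """Build the example tree #101 -> #102 -> #103 for a given set of safe versions."""
    def run_102(problem):
        if "5.5" in safe_versions:
            return Diagnosis("Change the version of package A to 5.5 and rebuild", True)
        return Diagnosis("version 5.5 of package A has a similar vulnerability", False)

    n103 = Node("103", lambda p: Diagnosis("interfaces marked", True),
                accepts=lambda t: t.startswith("Change the version"))
    n102 = Node("102", run_102, accepts=lambda t: "vulnerability interception" in t,
                children=[n103])
    return Node("101", lambda p: Diagnosis("vulnerability interception, package A, 5.6", False),
                children=[n102])
```

When version 5.5 is safe, the walk stops at node #102 with a usable solution; when it is not, the walk stops at node #102 because its output cannot serve as input to node #103, and the result is returned as not usable.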

[0088] In this implementation, if the existing knowledge of the SOP tree cannot resolve the fault, the first diagnostic result can be used as input to the fault diagnosis service's agent program to obtain the second diagnostic result. The fault diagnosis service can dynamically learn the solution to the fault, thereby improving the efficiency of fault diagnosis for the CI / CD pipeline in subsequent use.

[0089] S130, if at least one first diagnostic result can be used as a solution to the fault, output the first diagnostic result.

[0090] In some embodiments of the aforementioned vulnerability interception scenario, "changing the version of package A to 5.5 and rebuilding" is used as the first diagnostic result, and this first diagnostic result can be used as a solution to the fault. After the fault diagnosis service outputs the first diagnostic result, the CI / CD pipeline's automated build service can automatically complete the solution without manual intervention, or wait for the developer to complete the steps requiring manual intervention before performing appropriate post-processing. For example, in the aforementioned vulnerability interception scenario, if the API agent corresponding to SOP node #102 finds package A with version 5.5 in the package repository, and this version of package A has no vulnerabilities, the automated build service can call the API agent to change the version of package A in the build instruction to 5.5 and re-execute the build task.

[0091] S140, if the first diagnostic result cannot be used as a solution to the fault, a second diagnostic result is obtained based on at least one first diagnostic result and the RAG agent program and / or LLM agent program of the fault diagnosis service, and the second diagnostic result is used as a solution to the fault; the second diagnostic result is output.
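
The choice between S130 and S140 can be sketched as a small dispatch function. The representation of diagnostic results as `(text, is_solution)` pairs and the `fallback_agent` callable, which stands in for the RAG agent and / or LLM agent, are assumptions made for this example.

```python
def resolve_fault(first_results, fallback_agent):
    """Return (diagnostic result to output, source of that result).

    first_results: iterable of (text, is_solution) pairs from the SOP nodes.
    fallback_agent: callable taking the list of first diagnostic texts and
    returning a second diagnostic result (placeholder for a RAG / LLM agent).
    """
    results = list(first_results)
    for text, is_solution in results:
        if is_solution:
            # S130: a first diagnostic result already solves the fault.
            return text, "sop"
    # S140: no usable first result, so pass all of them on as context.
    second = fallback_agent([text for text, _ in results])
    return second, "fallback"
```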

[0092] In some embodiments of the aforementioned vulnerability interception scenario, "version 5.5 of software package A has a similar vulnerability" is used as the first diagnostic result and cannot be used as a solution to the problem. The RAG agent can search the knowledge base for software packages with similar functionality to software package A, based on the functionality of the code to be built, to obtain the second diagnostic result "use software package B with the same functionality and similar code interface as software package A," and generate code with the same functionality using software package B based on the functionality of the existing code using software package A. It should be understood that the generated code snippets and natural language descriptions of the code can also be considered part of the second diagnostic result.
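
The retrieval step of this scenario can be sketched as a ranking of candidate packages by how many of the required functions they cover. The catalog structure, the feature names, and the template-based wording of the second diagnostic result are hypothetical; an actual RAG agent would retrieve from the knowledge base and let the LLM generate the description and replacement code.

```python
def retrieve_alternatives(required_features, blocked_package, catalog):
    """Rank non-vulnerable packages by coverage of the required features.

    catalog: dict mapping a package name to {"features": set, "vulnerable": bool}.
    Returns package names, best coverage first (ties broken by name).
    """
    required = set(required_features)
    if not required:
        return []
    candidates = []
    for name, info in catalog.items():
        if name == blocked_package or info["vulnerable"]:
            continue
        coverage = len(required & set(info["features"])) / len(required)
        if coverage > 0:
            candidates.append((coverage, name))
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [name for _, name in candidates]


def second_diagnosis(first_result, required_features, blocked_package, catalog):
    """Combine the first diagnostic result with the best retrieved alternative."""
    ranked = retrieve_alternatives(required_features, blocked_package, catalog)
    if not ranked:
        return None
    return f"{first_result}; use software package {ranked[0]} in place of {blocked_package}"
```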

[0093] In some embodiments, at least one third SOP node is generated based on the fault category and the first problem description, wherein the third SOP node corresponds to the fault category and is used to obtain a second diagnostic result based on the first problem description. The specific methods for generating the third SOP node and storing it in the SOP knowledge base are similar to the methods described in S210, S220, and S230, and will not be elaborated here.

[0094] In this implementation, the RAG agent can dynamically learn solutions to faults based on existing knowledge and unlearned faults, thereby improving the efficiency of fault diagnosis for CI / CD pipelines in subsequent use.

[0095] This application also provides a fault diagnosis device, as shown in Figure 4, including:

[0096] The acquisition module is used to acquire the category of faults in the CI / CD pipeline and a first problem description of the fault, as well as multiple standard operating procedure (SOP) nodes of the fault diagnosis service, wherein the SOP nodes are used to obtain diagnostic results based on the problem description.

[0097] The processing module is configured to: determine at least one first diagnostic result based on a first problem description and at least one first SOP node corresponding to the fault category among multiple SOP nodes; output the first diagnostic result if at least one first diagnostic result can be used as a solution to the fault; or, if the first diagnostic result cannot be used as a solution to the fault, obtain a second diagnostic result based on at least one first diagnostic result and a retrieval-augmented generation (RAG) agent and / or a large language model (LLM) agent of the fault diagnosis service, the second diagnostic result being used as a solution to the fault; and output the second diagnostic result.

[0098] Both the processing module and the acquisition module can be implemented in software or hardware. For example, the implementation of the processing module will be described below. Similarly, the implementation of the acquisition module can be referenced from that of the processing module.

[0099] As an example of a software functional unit, a processing module may include code running on a computing instance. A computing instance may include at least one of a physical host (computing device), a virtual machine, or a container. Furthermore, the aforementioned computing instance may be one or more. For example, a processing module may include code running on multiple hosts / virtual machines / containers. It should be noted that the multiple hosts / virtual machines / containers used to run the code may be distributed within the same region or in different regions. Further, the multiple hosts / virtual machines / containers used to run the code may be distributed within the same availability zone (AZ) or in different AZs, each AZ comprising one or more geographically proximate data centers. Typically, a region may include multiple AZs.

[0100] Similarly, multiple hosts / virtual machines / containers used to run this code can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.

[0101] As an example of a hardware functional unit, a processing module may include at least one computing device, such as a server. Alternatively, a processing module may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The aforementioned PLD may be implemented using a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

[0102] The multiple computing devices included in the processing module can be distributed within the same region or in different regions. Similarly, they can be distributed within the same Availability Zone (AZ) or in different AZs. Likewise, they can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. These multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

[0103] This application also provides a computing device 1200. As shown in Figure 5, the computing device 1200 includes: a bus 1202, a processor 1204, a memory 1206, and a communication interface 1208. The processor 1204, the memory 1206, and the communication interface 1208 communicate with each other via the bus 1202. The computing device 1200 can be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1200.

[0104] Bus 1202 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 5, but this does not imply that there is only one bus or one type of bus. Bus 1202 can include pathways for transmitting information between various components of computing device 1200 (e.g., memory 1206, processor 1204, communication interface 1208).

[0105] The processor 1204 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

[0106] The memory 1206 may include volatile memory, such as random access memory (RAM). The memory 1206 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).

[0107] The memory 1206 stores executable program code, and the processor 1204 executes the executable program code to implement the functions of the aforementioned processing module and acquisition module, thereby realizing the fault diagnosis method. That is, the memory 1206 stores instructions for executing the fault diagnosis method.

[0108] The communication interface 1208 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 1200 and other devices or communication networks.

[0109] This application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.

[0110] As shown in Figure 6, the computing device cluster includes at least one computing device 1200. The memory 1206 of one or more computing devices 1200 in the computing device cluster may store the same instructions for executing fault diagnosis methods.

[0111] In some possible implementations, the memory 1206 of one or more computing devices 1200 in the computing device cluster may also store partial instructions for executing the fault diagnosis method. In other words, a combination of one or more computing devices 1200 can jointly execute the instructions for executing the fault diagnosis method.

[0112] It should be noted that the memory 1206 in different computing devices 1200 within the computing device cluster can store different instructions, each used to execute a portion of the functions of the fault diagnosis device. That is, the instructions stored in the memory 1206 of different computing devices 1200 can implement the functions of one or more modules in the processing module and the acquisition module.

[0113] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 7 illustrates one possible implementation. As shown in Figure 7, two computing devices 1200A and 1200B are connected via a network. Specifically, they are connected to the network through communication interfaces in each computing device. In this type of possible implementation, the memory 1206 of one or more computing devices 1200 in the computing device cluster can store the same instructions for executing fault diagnosis methods.

[0114] It should be understood that the functions of computing device 1200A shown in Figure 7 can also be performed by multiple computing devices 1200. Similarly, the functions of computing device 1200B can also be performed by multiple computing devices 1200.

[0115] This application also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any usable medium. When the computer program product is run on at least one computing device, it causes the at least one computing device to perform a fault diagnosis method.

[0116] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device can store data, or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to perform a fault diagnosis method.

[0117] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A fault diagnosis method, characterized in that, the method is applied to a fault diagnosis service for a continuous integration and continuous delivery / deployment (CI / CD) pipeline, and the method includes: obtaining the category of a fault of the CI / CD pipeline and a first problem description of the fault, as well as multiple standard operating procedure (SOP) nodes of the fault diagnosis service, wherein the SOP nodes are used to obtain diagnostic results based on the problem description; determining at least one first diagnostic result based on the first problem description and at least one first SOP node among the plurality of SOP nodes corresponding to the category of the fault; if at least one of the first diagnostic results can be used as a solution to the fault, outputting the first diagnostic result; or, if the first diagnostic result cannot be used as a solution to the fault, obtaining a second diagnostic result based on the at least one first diagnostic result and a retrieval-augmented generation (RAG) agent and / or a large language model (LLM) agent of the fault diagnosis service, wherein the second diagnostic result is used as a solution to the fault; and outputting the second diagnostic result.

2. The method according to claim 1, characterized in that, The at least one first SOP node forms an SOP tree, and the determination of at least one first diagnostic result based on the first problem description and at least one first SOP node among the plurality of SOP nodes corresponding to the category of the fault includes: A third diagnostic result is obtained based on the second SOP node in the SOP tree, wherein the third diagnostic result is used to determine the first diagnostic result, or the third diagnostic result is used as the first diagnostic result, and the second SOP node is an SOP node among the at least one first SOP node; Based on the first condition, the second condition, the third condition, and the third diagnostic result, determine whether the first diagnostic result can be used as a solution to the fault, wherein the first condition is that the third diagnostic result can be used as a solution to the fault, the second condition is that the second SOP node is a leaf node of the SOP tree, and the third condition is that the third diagnostic result can be used as a problem description for inputting the child node of the second SOP node.

3. The method according to claim 2, characterized in that, when neither the first condition nor the second condition is met: if the third condition is not met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault; or, if the third condition is met, obtaining the third diagnostic result includes: obtaining the third diagnostic result based on the child nodes of the second SOP node and a second problem description, wherein the diagnostic result output by the second SOP node is used as the second problem description, and after the second problem description is determined, the child nodes of the second SOP node are used as the second SOP node.

4. The method according to claim 2 or 3, characterized in that, if the first condition is not met but the second condition is met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault.

5. The method according to any one of claims 2 to 4, characterized in that, if the first condition is met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result can be used as a solution to the fault.
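
The three-condition traversal of claims 2 to 5 can be sketched as a recursive walk over the SOP tree. This is a minimal sketch under stated assumptions: the names `Result`, `SopTreeNode`, and `walk` are hypothetical, and visiting the child nodes in order and stopping at the first branch that yields a solution is an assumption, since the claims do not specify how multiple child nodes are ordered.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Result:
    text: str
    is_solution: bool              # first condition
    usable_as_input: bool = False  # third condition


@dataclass
class SopTreeNode:
    run: Callable[[str], Result]
    children: List["SopTreeNode"] = field(default_factory=list)


def walk(node: SopTreeNode, description: str) -> List[Result]:
    """Return the first diagnostic result(s) reached from this node."""
    third = node.run(description)
    if third.is_solution:
        # First condition met: the result solves the fault (claim 5).
        return [third]
    if not node.children:
        # Second condition met: leaf node, result is not a solution (claim 4).
        return [third]
    if not third.usable_as_input:
        # Third condition not met: stop here without a solution (claim 3).
        return [third]
    # Third condition met: the node's output becomes the second problem
    # description, and each child takes the role of the second SOP node.
    results: List[Result] = []
    for child in node.children:
        branch = walk(child, third.text)
        results.extend(branch)
        if any(r.is_solution for r in branch):
            break  # assumption: stop once a branch yields a solution
    return results
```

If none of the returned results has `is_solution` set, the caller would hand them to the RAG and/or LLM agent as described in claim 1.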

6. The method according to any one of claims 1 to 5, characterized in that the method further includes: generating at least one third SOP node based on the category of the fault and the first problem description, wherein the third SOP node corresponds to the category of the fault and is used to obtain the second diagnostic result based on the first problem description.
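
One way to picture the node generation of claim 6 is a registry that gains a new node after the fallback agent has resolved a fault. This is a sketch only: `make_sop_node` and `register_sop_node` are hypothetical names, and the exact-match lookup on the normalized description is a placeholder assumption, not the matching technique of the application.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical shape: a generated SOP node returns a stored solution,
# or None when the incoming problem description does not match.
GeneratedNode = Callable[[str], Optional[str]]


def make_sop_node(description: str, solution: str) -> GeneratedNode:
    key = description.strip().lower()

    def node(incoming: str) -> Optional[str]:
        # Placeholder matching rule: normalized exact match (assumption).
        return solution if incoming.strip().lower() == key else None

    return node


def register_sop_node(
    registry: Dict[str, List[GeneratedNode]],
    category: str,
    description: str,
    solution: str,
) -> None:
    # The new node is filed under the fault category it corresponds to.
    registry.setdefault(category, []).append(make_sop_node(description, solution))
```

With such a registry, a later fault of the same category and description could be answered by the generated node without invoking the RAG or LLM agent again.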

7. The method according to any one of claims 1 to 6, characterized in that the SOP node includes metadata of at least one knowledge unit of the same category as the SOP node, wherein the knowledge unit includes a problem description and a solution of a historical fault of the CI/CD pipeline, and the SOP node is obtained based on the at least one knowledge unit.

8. The method according to claim 7, characterized in that the SOP node is used to obtain the diagnostic result based on the input problem description and on the problem description and solution of the at least one knowledge unit corresponding to the SOP node.
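
The relationship between knowledge units and SOP nodes in claims 7 and 8 might be modeled as below. This is a hedged sketch: `KnowledgeUnit` and `KnowledgeSopNode` are hypothetical names, the metadata is reduced to a list of unit identifiers, and the word-overlap score is a simple stand-in for whatever matching the fault diagnosis service actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KnowledgeUnit:
    unit_id: str
    category: str
    problem: str   # problem description of a historical fault
    solution: str  # solution recorded for that fault


@dataclass
class KnowledgeSopNode:
    category: str
    unit_ids: List[str]  # metadata pointing at knowledge units of this category

    def diagnose(
        self, description: str, store: Dict[str, KnowledgeUnit]
    ) -> Optional[str]:
        words = set(description.lower().split())
        best: Optional[KnowledgeUnit] = None
        best_score = 0
        for unit_id in self.unit_ids:
            unit = store[unit_id]
            # Stand-in similarity: number of shared lowercase words.
            score = len(words & set(unit.problem.lower().split()))
            if score > best_score:
                best, best_score = unit, score
        # None means no knowledge unit shares any word with the description.
        return best.solution if best is not None else None
```

Keeping only identifiers on the node, with the full problem and solution text in a separate store, is one possible reading of "metadata of at least one knowledge unit".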

9. The method according to any one of claims 1 to 8, characterized in that the SOP node is used to obtain the diagnostic result based on the problem description and an agent program corresponding to the SOP node, wherein the agent program includes at least one of an application programming interface (API) agent program, an LLM agent program, and a RAG agent program.
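
The per-node agent program of claim 9 could be dispatched as in the following sketch. The names `AgentKind`, `Agent`, and `AgentSopNode` are hypothetical, and the agents are plain callables here; a real API, LLM, or RAG agent would sit behind the same interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class AgentKind(Enum):
    API = "api"
    LLM = "llm"
    RAG = "rag"


# Hypothetical shape: an agent program maps a problem description to a diagnosis.
Agent = Callable[[str], str]


@dataclass
class AgentSopNode:
    kind: AgentKind  # which agent program this SOP node corresponds to

    def diagnose(self, description: str, agents: Dict[AgentKind, Agent]) -> str:
        # Look up the agent program bound to this node and run it.
        return agents[self.kind](description)
```

A usage example with stub agents:

```python
agents = {
    AgentKind.API: lambda d: "api checked: " + d,
    AgentKind.LLM: lambda d: "llm answered: " + d,
    AgentKind.RAG: lambda d: "rag retrieved: " + d,
}
print(AgentSopNode(AgentKind.RAG).diagnose("deploy stuck", agents))
```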

10. A fault diagnosis apparatus, characterized in that the apparatus is used for a fault diagnosis service for a continuous integration and continuous delivery/deployment (CI/CD) pipeline, and the apparatus includes: an acquisition module, configured to acquire a category of a fault of the CI/CD pipeline, a first problem description of the fault, and a plurality of standard operating procedure (SOP) nodes of the fault diagnosis service, wherein each SOP node is used to obtain a diagnostic result based on a problem description; and a processing module, configured to determine at least one first diagnostic result based on the first problem description and at least one first SOP node, among the plurality of SOP nodes, that corresponds to the category of the fault; and to output the first diagnostic result if at least one of the first diagnostic results can be used as a solution to the fault; or, if the first diagnostic result cannot be used as a solution to the fault, to obtain a second diagnostic result based on the at least one first diagnostic result and a retrieval-augmented generation (RAG) agent and/or a large language model (LLM) agent of the fault diagnosis service, wherein the second diagnostic result is used as a solution to the fault, and to output the second diagnostic result.

11. The apparatus according to claim 10, characterized in that the at least one first SOP node forms an SOP tree, and the processing module is further configured to: obtain a third diagnostic result based on a second SOP node in the SOP tree, wherein the third diagnostic result is used to determine the first diagnostic result, or the third diagnostic result is used as the first diagnostic result, and the second SOP node is one of the at least one first SOP node; and determine, based on a first condition, a second condition, a third condition, and the third diagnostic result, whether the first diagnostic result can be used as a solution to the fault, wherein the first condition is that the third diagnostic result can be used as a solution to the fault, the second condition is that the second SOP node is a leaf node of the SOP tree, and the third condition is that the third diagnostic result can be used as a problem description input to the child node of the second SOP node.

12. The apparatus according to claim 11, characterized in that, when neither the first condition nor the second condition is met: if the third condition is not met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault; or, if the third condition is met, the processing module is further configured to: obtain the third diagnostic result based on the child nodes of the second SOP node and a second problem description, wherein the diagnostic result output by the second SOP node is used as the second problem description, and after the second problem description is determined, the child nodes of the second SOP node are used as the second SOP node.

13. The apparatus according to claim 11 or 12, characterized in that, if the first condition is not met but the second condition is met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result cannot be used as a solution to the fault.

14. The apparatus according to any one of claims 11 to 13, characterized in that, if the first condition is met, the third diagnostic result is used as the first diagnostic result, wherein the first diagnostic result can be used as a solution to the fault.

15. The apparatus according to any one of claims 10 to 14, characterized in that the processing module is further configured to: generate at least one third SOP node based on the category of the fault and the first problem description, wherein the third SOP node corresponds to the category of the fault and is used to obtain the second diagnostic result based on the first problem description.

16. The apparatus according to any one of claims 10 to 15, characterized in that the SOP node includes metadata of at least one knowledge unit of the same category as the SOP node, wherein the knowledge unit includes a problem description and a solution of a historical fault of the CI/CD pipeline, and the SOP node is obtained based on the at least one knowledge unit.

17. The apparatus according to claim 16, characterized in that the SOP node is used to obtain the diagnostic result based on the input problem description and on the problem description and solution of the at least one knowledge unit corresponding to the SOP node.

18. The apparatus according to any one of claims 10 to 17, characterized in that the SOP node is used to obtain the diagnostic result based on the problem description and an agent program corresponding to the SOP node, wherein the agent program includes at least one of an application programming interface (API) agent program, an LLM agent program, and a RAG agent program.

19. A computing device cluster, characterized in that the computing device cluster includes at least one computing device, each computing device including a processor and a memory; and the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, to cause the computing device cluster to perform the method according to any one of claims 1 to 9.

20. A computer program product containing instructions, characterized in that, when the instructions are executed by a computing device cluster, the computing device cluster is caused to perform the method according to any one of claims 1 to 9.

21. A computer-readable storage medium, characterized in that the storage medium includes computer program instructions which, when executed by a computing device cluster, cause the computing device cluster to perform the method according to any one of claims 1 to 9.