Fault location method, apparatus, device, medium, and program product

By receiving user fault descriptions in a cloud-native environment and using a multi-agent collaborative mechanism for diagnostic analysis, the problem of low efficiency and insufficient accuracy in fault root cause localization in existing technologies has been solved. This achieves efficient and accurate fault localization, adapts to the dynamic characteristics of the cloud-native environment, and ensures the stable operation of financial services.

CN122309207A · Pending · Publication Date: 2026-06-30 · INDUSTRIAL AND COMMERCIAL BANK OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INDUSTRIAL AND COMMERCIAL BANK OF CHINA
Filing Date
2026-03-06
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Fault diagnosis in cloud-native environments relies on monitoring and alarm platforms, centralized log analysis, and distributed tracing. However, due to data heterogeneity and high dynamism, the efficiency and accuracy of root cause location are low, and it is highly dependent on human expert experience, making it difficult to quickly respond to the high-frequency fault requirements of financial business.

Method used

By receiving user fault descriptions, the system uses intent recognition agents to identify domain problems, matches diagnostic agents for analysis, and iterates through a multi-agent collaborative mechanism to generate accurate fault location reports. This reduces human intervention and adapts to the heterogeneous and dynamic characteristics of cloud-native environments.

Benefits of technology

It improves the efficiency and accuracy of root cause analysis, reduces reliance on expert experience, ensures the continuous and stable operation of the cloud-native environment, and enables rapid response to fault diagnosis needs in financial scenarios.


Abstract

This application provides a fault location method, apparatus, device, medium, and program product, which can be applied to the fields of artificial intelligence and fintech, involving the application of large models in fault diagnosis scenarios. The method includes: receiving a fault description statement input by a user; invoking an intent recognition agent to analyze the fault description statement to determine the domain problem; matching a target agent from a diagnostic agent group based on the domain problem; invoking the target agent to perform diagnostic analysis on the domain problem and generate a current diagnostic result; verifying the current diagnostic result, and if verification fails, continuously iterating the diagnostic analysis process until the current diagnostic result passes verification; and generating a fault location report based on the verified diagnostic result.

Description

Technical Field

[0001] This application relates to the fields of artificial intelligence and fintech, and to the application of large models in fault diagnosis scenarios. More specifically, it relates to a fault location method, apparatus, device, medium, and program product.

Background Technology

[0002] The development of fintech has placed core demands on systems, including high concurrency, high availability, elastic scalability, and agile iteration. Cloud-native environments, with technologies such as microservices, containers, and service meshes at their core, are well-suited to the business innovation and computing power scheduling requirements of fintech, and support the stable operation of high-frequency businesses such as transaction settlement and supply chain finance. Financial businesses have stringent requirements for continuity and stability; failures in core processes such as transaction settlement can directly lead to economic losses, trigger regulatory compliance risks, and damage user trust. Therefore, efficient fault diagnosis capabilities in cloud-native environments are crucial to ensuring the smooth operation of financial businesses.

[0003] Currently, fault diagnosis in cloud-native environments mainly relies on monitoring and alarm platforms, centralized log analysis, and distributed tracing. However, the high complexity, heterogeneous data sources, and strong dynamism of cloud-native environments make these methods highly dependent on expert experience and manual intervention, so root cause location is slow, difficult, and insufficiently accurate.

Summary of the Invention

[0004] In view of the above problems, embodiments of this application provide a fault location method, apparatus, device, medium, and program product.

[0005] According to a first aspect of this application, a fault location method is provided, comprising: receiving a fault description statement input by a user; invoking an intent recognition agent to analyze the fault description statement to determine a domain problem; matching a target agent in a diagnostic agent group based on the domain problem; invoking the target agent to perform diagnostic analysis on the domain problem and generating a current diagnostic result; verifying the current diagnostic result, and if the verification fails, continuously iterating the diagnostic analysis process until the current diagnostic result passes verification; and generating a fault location report based on the verified diagnostic result.
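The flow of the first aspect can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the names `locate_fault`, `recognize_intent`, `adjust`, and `max_rounds` are hypothetical, the agents are stand-in functions rather than large models, and the round limit is added here so the loop terminates (the application itself describes iterating until verification passes).

```python
from typing import Callable, Dict


def locate_fault(
    statement: str,
    recognize_intent: Callable[[str], str],
    agents: Dict[str, Callable[[str], str]],
    verify: Callable[[str], bool],
    adjust: Callable[[str, str], str],
    max_rounds: int = 5,  # assumption: a cap so the sketch always terminates
) -> str:
    """Receive -> recognize intent -> match agent -> diagnose -> verify -> report."""
    domain = recognize_intent(statement)      # determine the domain problem
    for round_no in range(1, max_rounds + 1):
        agent = agents[domain]                # match a target agent for the domain
        finding = agent(statement)            # current diagnostic result
        if verify(finding):                   # verification passed: build the report
            return f"[{domain}] {finding} (round {round_no})"
        domain = adjust(domain, finding)      # otherwise adjust the problem and iterate
    raise RuntimeError("no diagnostic result passed verification")


# Toy stand-ins for the agents (hypothetical behavior, for illustration only).
toy_agents = {
    "container": lambda s: "pod OOMKilled, cause unclear",
    "operating system": lambda s: "node memory exhausted by a leaking process",
}
report = locate_fault(
    "container memory overflow",
    recognize_intent=lambda s: "container",
    agents=toy_agents,
    verify=lambda finding: "unclear" not in finding,
    adjust=lambda domain, finding: "operating system",
)
print(report)
```

In this toy run the first result fails verification, the problem is re-scoped to another domain, and the second round produces the report.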

[0006] According to an embodiment of this application, the step of invoking the target intelligent agent to perform diagnostic analysis on the domain problem and generate a current diagnostic result includes: decomposing the domain problem into sub-problems, and distributing the sub-problems to corresponding target intelligent agents based on a multi-agent collaboration mechanism; synchronizing and coordinating the state of the control flow and data flow of the target intelligent agents based on the multi-agent collaboration mechanism, invoking the target intelligent agents to analyze the sub-problems, and generating a single-step diagnostic summary corresponding to the target intelligent agent; and integrating the single-step diagnostic summaries corresponding to the target intelligent agents to generate the current diagnostic result.
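As a rough illustration of distributing sub-problems and integrating the single-step summaries, the sketch below runs each stand-in agent on its own sub-problem and joins the results. The function name `diagnose_collaboratively`, the use of a thread pool as the coordination mechanism, and the string format of the integrated result are all assumptions; the application does not specify how the collaboration mechanism is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def diagnose_collaboratively(
    sub_problems: Dict[str, str],
    agents: Dict[str, Callable[[str], str]],
) -> str:
    """sub_problems maps an agent name to the sub-problem assigned to that agent."""
    with ThreadPoolExecutor() as pool:
        # Distribute: each target agent analyzes its own sub-problem.
        futures = {
            name: pool.submit(agents[name], text)
            for name, text in sub_problems.items()
        }
        # Synchronize: wait for every single-step diagnostic summary.
        summaries = {name: future.result() for name, future in futures.items()}
    # Integrate the summaries into the current diagnostic result.
    return "; ".join(f"{name}: {summaries[name]}" for name in sorted(summaries))


# Hypothetical agents and sub-problems, for illustration only.
demo_agents = {
    "cluster": lambda text: "node reported NotReady",
    "os": lambda text: "kernel OOM killer was triggered",
}
demo_result = diagnose_collaboratively(
    {"cluster": "check node status", "os": "check memory pressure"},
    demo_agents,
)
print(demo_result)
```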

[0007] According to an embodiment of this application, the step of invoking the target intelligent agent to analyze the sub-problem and generate a single-step diagnostic summary corresponding to the target intelligent agent includes: invoking the target intelligent agent to perform contextual reasoning on the sub-problem to obtain a diagnostic plan; based on the diagnostic plan, invoking and executing a security tool to obtain system operation data; standardizing the system operation data to obtain system operation results; and invoking the target intelligent agent to perform reflective reasoning analysis on the system operation results to generate a single-step diagnostic summary corresponding to the target intelligent agent.
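The single-step process described above (plan, tool execution, standardization, reflection) might be sketched as follows. Every name here is hypothetical: `plan` and `reflect` stand in for calls to the agent's model, the tools are plain functions returning canned data, and the normalized record layout is an assumption.

```python
from typing import Any, Callable, Dict, List


def single_step(
    sub_problem: str,
    plan: Callable[[str], List[str]],
    tools: Dict[str, Callable[[], Any]],
    normalize: Callable[[str, Any], Dict[str, Any]],
    reflect: Callable[[str, List[Dict[str, Any]]], str],
) -> str:
    steps = plan(sub_problem)                            # contextual reasoning -> diagnostic plan
    raw = [(name, tools[name]()) for name in steps]      # invoke tools in plan order
    results = [normalize(name, data) for name, data in raw]  # standardize the operation data
    return reflect(sub_problem, results)                 # reflective reasoning -> summary


# Canned tools and stand-in reasoning functions (assumptions for illustration).
demo_tools = {
    "get_pod_status": lambda: "OOMKilled",
    "get_memory_usage": lambda: {"used_mib": 512, "limit_mib": 512},
}


def demo_reflect(sub_problem: str, results: List[Dict[str, Any]]) -> str:
    if any(r["value"] == "OOMKilled" for r in results):
        return "container exceeded its memory limit"
    return "inconclusive"


summary = single_step(
    "container memory overflow",
    plan=lambda p: ["get_pod_status", "get_memory_usage"],
    tools=demo_tools,
    normalize=lambda name, data: {"tool": name, "value": data},
    reflect=demo_reflect,
)
print(summary)
```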

[0008] According to an embodiment of this application, the step of invoking and executing security tools based on the diagnostic plan to obtain system operation data includes: invoking the security tools according to the processing order of the diagnostic plan; and obtaining the system operation data by acquiring multi-source observation data of the cloud-native environment through the security tools based on standard application programming interfaces.

[0009] According to an embodiment of this application, the step of invoking the target intelligent agent to perform contextual reasoning on the sub-problem and obtain a diagnostic plan includes: invoking the target intelligent agent, and using the plan generation model of the target intelligent agent to perform contextual reasoning on the sub-problem based on preset prompts to obtain the diagnostic plan; wherein, the plan generation model is a large language model trained based on historical expert diagnostic cases.

[0010] According to an embodiment of this application, the iterative diagnostic analysis process includes: adjusting the domain problem based on the current diagnostic result and the updated problem context; matching a new target agent based on the adjusted domain problem; and invoking the new target agent to perform diagnostic analysis on the adjusted domain problem.

[0011] According to an embodiment of this application, the method further includes: extracting diagnostic data during the diagnostic analysis process and storing the diagnostic data in a diagnostic case knowledge base; obtaining user feedback information from the fault location report; and updating the agents in the diagnostic agent group based on the diagnostic case knowledge base and the user feedback information.
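One way to picture the diagnostic case knowledge base and the feedback step is the sketch below. The class and field names are hypothetical, and the way agents are "updated" is an assumption: here, cases the user accepted are simply selected as candidate examples for a later update, since the application does not say how the update is performed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    case_id: int
    domain: str
    plan: List[str]
    conclusion: str
    accepted: Optional[bool] = None  # None until the user rates the report


class CaseKnowledgeBase:
    def __init__(self) -> None:
        self._cases: List[Case] = []

    def store(self, domain: str, plan: List[str], conclusion: str) -> int:
        """Store diagnostic data extracted during the analysis process."""
        case = Case(len(self._cases), domain, list(plan), conclusion)
        self._cases.append(case)
        return case.case_id

    def record_feedback(self, case_id: int, accepted: bool) -> None:
        """Attach the user's feedback on the fault location report."""
        self._cases[case_id].accepted = accepted

    def examples_for_update(self, domain: str) -> List[Case]:
        """Assumption: only user-accepted cases feed the agent update."""
        return [c for c in self._cases if c.domain == domain and c.accepted is True]


kb = CaseKnowledgeBase()
first = kb.store("container", ["get_pod_status"], "memory limit too low")
second = kb.store("container", ["get_events"], "image pull failure")
kb.record_feedback(first, accepted=True)
kb.record_feedback(second, accepted=False)
print([c.conclusion for c in kb.examples_for_update("container")])
```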

[0012] According to a second aspect of this application, a fault location device is provided, comprising: an intent recognition module, configured to receive a fault description statement input by a user, and invoke an intent recognition agent to analyze the fault description statement to determine a domain problem; a target determination module, configured to match a target agent in a diagnostic agent group based on the domain problem; a diagnostic analysis module, configured to invoke the target agent to perform diagnostic analysis on the domain problem and generate a current diagnostic result; a verification module, configured to verify the current diagnostic result, and if the verification fails, to continuously iterate the diagnostic analysis process until the current diagnostic result passes verification; and a report generation module, configured to generate a fault location report based on the verified diagnostic result.

[0013] According to a third aspect of this application, an electronic device is provided, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method described above.

[0014] According to a fourth aspect of this application, a computer-readable storage medium is also provided, on which a computer program or instructions are stored, wherein the computer program or instructions, when executed by a processor, implement the steps of the above-described method.

[0015] According to a fifth aspect of this application, a computer program product is also provided, including a computer program or instructions that, when executed by a processor, implement the steps of the above-described method.

[0016] In the embodiments of this application, the intent recognition agent parses the fault description and determines the domain problem, a target agent is then matched from the diagnostic agent group for targeted analysis, and the diagnostic result is refined through iterative verification until an accurate fault location report is generated. This realizes intelligent fault location in the cloud-native environment. Multi-agent diagnostic analysis reduces the heavy dependence on expert experience and human intervention and improves the efficiency and accuracy of root cause localization. Because each domain problem is handled by the agent suited to it, the method adapts to the heterogeneous and dynamic characteristics of the cloud-native environment, responds quickly to the fault diagnosis needs of financial scenarios, and helps ensure the continuous and stable operation of the cloud-native environment.

Description of the Drawings

[0017] The above-mentioned contents, other objects, features and advantages of this application will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:

[0018] Figure 1 schematically shows an example environment in which the methods according to embodiments of this application can be applied;

[0019] Figure 2 schematically shows a flowchart of a fault location method according to an embodiment of this application;

[0020] Figure 3 schematically shows a flowchart of a multi-agent diagnostic analysis process according to an embodiment of this application;

[0021] Figure 4 schematically shows a flowchart of single-step diagnosis by an intelligent agent according to an embodiment of this application;

[0022] Figure 5 schematically shows a flowchart of container diagnosis according to an embodiment of this application;

[0023] Figure 6 schematically shows the layered architecture of a fault location system according to an embodiment of this application;

[0024] Figure 7 shows a schematic diagram of a fault location device according to an embodiment of this application;

[0025] Figure 8 schematically shows a block diagram of an electronic device suitable for implementing a fault location method according to an embodiment of this application.

Detailed Description

[0026] The embodiments of this application will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of this application. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of this application for ease of explanation. However, it will be apparent that one or more embodiments may be implemented without these specific details. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts of this application.

[0027] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The terms “comprising,” “including,” etc., as used herein indicate the presence of features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

[0028] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.

[0029] In one or more embodiments described herein, the term "large language model (large model)" can refer to a deep learning model with a large number of model parameters, which can include hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of model parameters. A large model can also be called a foundational model / base model. It is pre-trained using large-scale unlabeled corpora to produce a pre-trained model with hundreds of millions of parameters. Such models can adapt to a wide range of downstream tasks and have good generalization ability; examples include large language models and multimodal pre-trained models. It should be understood that in practical applications, large models only require a small number of samples to fine-tune the pre-trained model before being applied to different tasks. Large models can be widely used in natural language processing, computer vision, and other fields. Specifically, they can be applied to computer vision tasks such as visual question answering, image captioning, and image generation, as well as natural language processing tasks such as text-based sentiment classification, text summarization generation, and machine translation.

[0030] Cloud-native architecture is the core foundation for the digital transformation of fintech, but its microservice-based distributed deployment dramatically increases system complexity, makes multi-source heterogeneous operational data difficult to integrate, and involves highly dynamic scheduling of containers and nodes. As a result, locating the root cause of failures is inefficient and insufficiently accurate, relies heavily on the experience and judgment of human experts, and easily introduces delays in fault handling. Fault diagnosis is crucial for the fintech field. Core financial businesses such as transaction settlement and account management have extremely high requirements for system continuity. Even minor faults in the cloud-native environment can trigger transaction chain interruptions and data anomalies, causing direct economic losses and regulatory compliance penalties and undermining user trust in financial institutions. At the same time, fintech cloud-native systems handle massive volumes of high-frequency transactions, which amplifies the cascading effects of faults. Efficient fault diagnosis is therefore a key support for the stable operation of financial businesses and for financial security.

[0031] Currently, fault diagnosis in cloud-native environments mainly relies on the following technical solutions:

[0032] Monitoring and Alarm Platform: This platform uses the monitoring and alarm tools of the business systems to collect system metrics and trigger alarms based on preset static threshold rules. Maintenance personnel need to log into multiple systems to manually investigate based on the alarm information.

[0033] Centralized Log Analysis System: This system utilizes a tool stack including log collection, storage, analysis, and visualization tools to centrally collect and retrieve distributed logs. When a fault occurs, operations and maintenance personnel use keyword searches, pattern matching, and other methods to search for error clues within a large volume of logs.

[0034] Distributed tracing: Using distributed tracing tools, the complete call path and time of a request are recorded between distributed systems, which can be used to locate performance bottlenecks and specific points of call failure.

[0035] Early intelligent operation and maintenance tools: These tools attempted to apply machine learning algorithms, mainly focusing on anomaly detection of single data sources, such as time-series prediction of CPU utilization or anomaly clustering of log templates. The output of these tools was mostly "anomaly of a certain indicator" or "anomaly log found", rather than the root cause of the failure.

[0036] However, the existing fault diagnosis solutions mentioned above have the following significant drawbacks:

[0037] (1) Data fragmentation and lack of correlation: Metrics, logs, and trace data are stored in different systems with different formats and dimensions, forming data silos. When troubleshooting, maintenance personnel must switch manually between multiple monitoring systems and correlate timestamps and trace identifiers by hand, which is cumbersome and error-prone.

[0038] (2) The contradiction between static rules and a dynamic environment: The cloud-native environment changes constantly; service instances are frequently scaled up and down, and dependencies change in real time. Alarm systems based on static thresholds and fixed rules have very high false-positive and false-negative rates and cannot accurately reflect the true health status of complex systems.

[0039] (3) Root cause localization is highly dependent on expert experience: Most current link tracing tools stop at the presentation of symptoms, leaving massive amounts of raw data and discrete alarms to maintenance personnel. The efficiency and quality of localization depend heavily on an individual's familiarity with the system architecture and historical faults. Knowledge is difficult to accumulate and reuse, resulting in long average repair times and high labor costs.

[0040] (4) Existing intelligent operation and maintenance solutions lack sufficient intelligence: Most solutions are merely offline, passive statistical analysis models, lacking proactive decision-making entities with reasoning capabilities. They cannot simulate the expert reasoning process of hypothesis testing, nor can they conduct purposeful interactive exploration during the localization process. Essentially, they are still high-level pattern matching, and their intelligence level needs to be optimized.

[0041] This application provides a fault location method, comprising: receiving a fault description statement input by a user; invoking an intent recognition agent to analyze the fault description statement to determine the domain problem; matching a target agent in a diagnostic agent group based on the domain problem; invoking the target agent to perform diagnostic analysis on the domain problem and generate a current diagnostic result; verifying the current diagnostic result, and if the verification fails, continuously iterating the diagnostic analysis process until the current diagnostic result passes verification; and generating a fault location report based on the verified diagnostic result. In this application's embodiments, the intent recognition agent accurately parses the fault description and locates the domain problem, then matches a target diagnostic agent in the diagnostic agent group for targeted analysis, and iteratively verifies and optimizes the diagnostic result, ultimately generating an accurate fault location report. This achieves intelligent fault location in a cloud-native environment. Utilizing multi-agent diagnostic analysis, it overcomes the high dependence on expert experience, improves the efficiency and accuracy of root cause localization, reduces manual intervention, and calls the corresponding agent to perform the corresponding diagnostic analysis. This effectively adapts to the heterogeneous and dynamic characteristics of the cloud-native environment, quickly responds to the fault diagnosis needs of financial scenarios, and ensures the continuous and stable operation of the cloud-native environment.

[0042] It should be noted that the fault location method and apparatus of this application can be used in the fields of artificial intelligence and fintech, involving the application of large models in fault diagnosis scenarios, and can also be used in any field other than artificial intelligence and fintech. The application fields of the fault location method and apparatus of this application are not limited.

[0043] Figure 1 schematically shows an example environment 100 to which the method according to an embodiment of this application can be applied. In this example environment 100 (such as a cloud-native environment), an application 125 is installed on a terminal device 110. A user 140 can interact with the application 125 via the terminal device 110 and/or an attached device of the terminal device 110.

[0044] In some embodiments, application 125 can be downloaded and installed on terminal device 110. In some embodiments, application 125 can also be accessed in other ways, such as through a web page. In environment 100 of Figure 1, in response to application 125 being launched, terminal device 110 can display the interface 150 of application 125.

[0045] In some embodiments, terminal device 110 can communicate with server 130 to provide services to application 125. Terminal device 110 can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio/video players, digital cameras/camcorders, television receivers, radio receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. In some embodiments, terminal device 110 can also support any type of user-facing interface. Server 130 can be any of various types of computing systems/servers capable of providing computing power, including but not limited to mainframes, edge computing nodes, and computing devices in cloud environments.

[0046] In some embodiments, application 125 can provide interaction capabilities with an intelligent agent. Application 125 may include an application specifically designed to provide services to an intelligent agent, or an application integrated with an intelligent agent. Although Figure 1 shows a single application, multiple applications can in practice be installed on the terminal device 110.

[0047] In embodiments of this application, multiple intelligent agents 160 can be deployed locally on terminal device 110 or remotely. In the case of remote deployment, terminal device 110 can directly invoke the intelligent agents, or it can invoke the intelligent agents via server 130. Exemplarily, intelligent agents 160 may have intelligent dialogue and task processing capabilities. Terminal device 110 provides an interface 150 that can present interactions with intelligent agents 160. In interface 150, user 140 can initiate task requests to intelligent agents 160 by inputting natural language (e.g., text input or voice input). Optionally, user 140 can upload online or offline file dialogues to instruct intelligent agents 160 to assist in completing various tasks.

[0048] In the embodiments of this application, during interaction with user 140, intelligent agent 160 can respond to user 140's requests and handle tasks instructed by the user. In some embodiments, during task processing, intelligent agent 160 can invoke one or more tools 165 to assist in task execution and the provision of task results as needed. These tools 165 can be any type of tool, such as text generation tools, file reading tools, information search tools, online or offline databases, image processing tools, chart generation tools, web page creation tools, etc.

[0049] In some embodiments, environment 100 may further include a management node for multiple agents 160, which can interact with the agents 160. In some examples, the management node may, in response to a task request from user 140, determine the task requirements corresponding to the task request. The management node may then, based on the task requirements, assign the task request to the agent 160 that matches the task requirements, requesting that agent 160 to perform the task. In other examples, the management node may also determine an execution plan for the task based on the task requirements. The execution plan may indicate one or more subtasks required to complete the task. The management node may assign these one or more subtasks to one or more agents 160, which will then execute their respective subtasks. Regarding the management node, in some examples, the management node may be implemented by one of the multiple agents 160. In other examples, the management node may be implemented by a machine learning model, such as a language model.

[0050] In some embodiments, agent 160 may be constructed based on one or more machine learning models. In some embodiments, the machine learning model on which agent 160 is based may include at least a language model, such as a large language model. In some embodiments, the machine learning model on which agent 160 is based may include a multimodal model capable of handling multiple modal inputs, such as text input, visual input (e.g., images, videos), audio input, etc. These machine learning models may include content-generating models capable of generating corresponding outputs based on model inputs. In some embodiments, the machine learning model may receive text-modal model inputs (e.g., natural language and / or machine language) and / or non-text-modal model inputs (e.g., images, speech, videos, etc.), and may obtain corresponding model outputs based on model inputs and prompts, thereby completing the task execution.

[0051] It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and are not intended to limit the scope of this application in any way.

[0052] Figure 2 schematically shows a flowchart of a fault location method according to an embodiment of this application. As shown in Figure 2, the fault location method 200 according to the embodiments of this application may include steps S210 to S250.

[0053] In step S210, a fault description statement input by the user is received, and the intent recognition agent is invoked to analyze the fault description statement in order to determine the domain problem.

[0054] The method provides users with a natural language interaction interface and receives fault descriptions that users input in text form. These descriptions are intuitive expressions of fault phenomena in the cloud-native environment and have no fixed format; they only need to reflect abnormal behavior in components such as microservices and containers, and they serve as the core input for subsequent fault diagnosis. Users can describe problems directly in business language without knowing professional monitoring terminology or query syntax. This greatly lowers the barrier to entry and lets business developers and junior operations personnel initiate root cause localization requests for cloud-native faults efficiently.

[0055] A user describes a fault in the cloud-native environment in natural language, which yields the fault description statement. The system receives the fault description statement input by the user (such as "container memory overflow"), and the multi-agent collaborative mechanism triggers the intent recognition agent. Based on a pre-trained semantic understanding model, the intent recognition agent parses the fault scenario and core requirements in the statement and, through domain keyword matching (e.g., container, cluster), determines the domain to which the problem belongs (such as container orchestration or database), that is, the domain problem. The task is then assigned to the corresponding diagnostic agent group through the routing mechanism of the language model graph orchestration framework.
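The domain keyword matching mentioned above can be illustrated with a small sketch. It covers only the keyword part; the pre-trained semantic understanding model is not modeled. The keyword table, the function name `recognize_domain`, and the count-based scoring are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword table; a real deployment would maintain its own.
DOMAIN_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "container orchestration": ("container", "pod", "cluster"),
    "database": ("database", "sql", "deadlock"),
    "operating system": ("kernel", "disk", "filesystem"),
}


def recognize_domain(statement: str) -> Optional[str]:
    """Return the domain whose keywords occur most often, or None if none match."""
    text = statement.lower()
    scores = {
        domain: sum(keyword in text for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(recognize_domain("container memory overflow"))
print(recognize_domain("the page looks odd"))
```

Plain substring matching is deliberately naive (for example, "pod" would also match "tripod"), which is one reason a semantic model is useful alongside keywords.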

[0056] It is worth noting that the domain problem refers to the technical field and core problem type that the intent recognition agent extracts and determines from the fault description statement, such as abnormal problems in the fields of containers, clusters, databases, and operating systems. This is the key basis for matching the corresponding diagnostic agent and carrying out accurate fault location.

[0057] Cloud-native environments refer to distributed application runtime environments built on containers, microservices, service meshes, and declarative application programming interfaces, characterized by elastic scaling, dynamic orchestration, and loose coupling.

[0058] Based on the intent recognition agent and the diagnostic agent group, a multi-agent collaborative mechanism is constructed using the language model graph orchestration framework. In this mechanism, multiple agents with specific capabilities work together under central control and scheduling and, through message passing, jointly complete complex fault root cause localization tasks. The mechanism simulates the way an expert team holds a consultation, achieving progressive and interpretable fault root cause localization through multiple rounds of "plan-execution-summary-verification" cycles.

[0059] The intent recognition agent is specifically designed to analyze user natural language input (such as descriptions of fault phenomena) and understand their deeper intents (e.g., performance issues, availability issues, configuration issues, etc.). This facilitates the subsequent categorization and distribution of issues to the corresponding specialized diagnostic agents. The intent recognition agent is trained on a natural language corpus from the operations and maintenance (O&M) domain, incorporating labeled data such as fault descriptions and domain keywords. It learns semantic understanding and domain matching capabilities, and combines this with an O&M scenario-based dialogue optimization model. After training, it can accurately parse natural language requests and determine the technical field to which the issue belongs.

[0060] The Language Model Graph Orchestration Framework (LMAG) is a framework for building complex, stateful multi-agent applications. It defines the control flow and data flow between agents through a graph structure and supports loops, branches, and persistent state management, forming the core technical foundation of the multi-agent collaborative architecture. Because adding an agent amounts to adding a node to the graph, specialized diagnostic agents (such as agents targeting specific middleware) can be added easily, which gives the system good scalability and adaptability to changes in the technology stack.
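The graph idea can be illustrated with a minimal sketch in plain Python. This is not the framework's actual API: the `Graph` class, the `END` marker, the node names, and the state keys are all hypothetical and only show how nodes, routing functions, and a loop over shared state might fit together.

```python
from typing import Callable, Dict

State = dict
Node = Callable[[State], State]
Router = Callable[[State], str]

END = "__end__"  # hypothetical marker for "stop the graph"


class Graph:
    """Toy stateful graph: nodes transform state, routers pick the next node."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.routers: Dict[str, Router] = {}

    def add_node(self, name: str, fn: Node, router: Router) -> None:
        # Adding an agent is just registering one more node and its router.
        self.nodes[name] = fn
        self.routers[name] = router

    def run(self, start: str, state: State, max_steps: int = 20) -> State:
        current = start
        for _ in range(max_steps):  # hard cap so a faulty loop cannot run forever
            if current == END:
                break
            state = self.nodes[current](state)
            current = self.routers[current](state)
        return state


def diagnose(state: State) -> State:
    # Hypothetical node: each visit adds one piece of evidence.
    return {**state, "evidence": state.get("evidence", 0) + 1}


def verify(state: State) -> State:
    # Hypothetical rule: three pieces of evidence count as verified.
    return {**state, "verified": state["evidence"] >= 3}


graph = Graph()
graph.add_node("diagnose", diagnose, lambda s: "verify")
graph.add_node("verify", verify, lambda s: END if s["verified"] else "diagnose")

final_state = graph.run("diagnose", {})
```

The branch in the `verify` router is what produces the loop: the graph keeps returning to `diagnose` until the state says verification passed.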

[0061] In step S220, based on the domain problem, a target agent is matched from the diagnostic agent group.

[0062] Once the domain problem is identified, the agent registry (which stores the domain adaptation scope and professional capability tags of each diagnostic agent) is queried. Through keyword matching and capability scoring mechanisms, a target agent suitable for the current domain problem is matched from the diagnostic agent group (such as container orchestration system cluster diagnostic agents, virtual machine diagnostic agents, operating system diagnostic agents, and database diagnostic agents), ensuring the professional matching of the diagnostic subject. It is worth noting that both the intent recognition agent and each agent in the diagnostic agent group are based on a general large model, specifically fine-tuned using cloud-native operational domain corpora to enhance fault semantic understanding and reasoning capabilities. Combined with tool invocation, process orchestration, and other functional encapsulation, specialized agents are formed, accurately adapting to the needs of scenarios such as fault analysis, domain determination, and root cause diagnosis.
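One way the registry lookup could work is sketched below. The registry entries, the tag vocabulary, and the scoring weights (2 points for a domain hit, 1 point per matching capability tag) are assumptions made for illustration; the source does not specify a scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEntry:
    name: str
    domains: frozenset        # domain adaptation scope
    capabilities: frozenset   # professional capability tags


# Hypothetical registry contents.
REGISTRY = [
    AgentEntry("cluster_agent", frozenset({"container", "cluster"}),
               frozenset({"scheduling", "node", "restart"})),
    AgentEntry("vm_agent", frozenset({"virtual machine"}),
               frozenset({"image", "startup", "resource"})),
    AgentEntry("os_agent", frozenset({"operating system"}),
               frozenset({"process", "port", "service"})),
    AgentEntry("db_agent", frozenset({"database"}),
               frozenset({"connection", "query", "corruption"})),
]


def score(entry, domain, keywords):
    """Assumed scoring: domain hit weighs 2, each matching capability tag weighs 1."""
    domain_hit = 2 if domain in entry.domains else 0
    return domain_hit + len(entry.capabilities & set(keywords))


def match_agent(domain, keywords, registry=REGISTRY):
    """Return the best-scoring agent entry, or None when nothing matches at all."""
    best = max(registry, key=lambda entry: score(entry, domain, keywords))
    return best if score(best, domain, keywords) > 0 else None
```

For example, `match_agent("database", ["query", "slow"])` selects the `db_agent` entry, while a domain and keywords that no entry covers yield `None`, so the caller can fall back to another path.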

[0063] The container orchestration system cluster diagnostic agent uses expert cases of container orchestration cluster fault diagnosis as its core training corpus, incorporates fault diagnosis experience in cluster scheduling, resource scheduling, node communication, etc., learns the fault reasoning logic in the field of container orchestration, and can dynamically generate adapted cluster fault diagnosis plans after training, and accurately call relevant tools to analyze problems.

[0064] The virtual machine diagnostic agent is trained based on historical cases and expert experience in virtual machine fault diagnosis. It covers scenarios such as virtual machine startup and shutdown, resource allocation, and image anomalies. It learns fault reasoning rules for virtual machine hardware adaptation and system operation. After training, it can specifically analyze the causes of faults at the virtual machine level and generate exclusive diagnostic steps.

[0065] The operating system diagnostic agent is trained based on troubleshooting cases of various operating system faults and knowledge of system kernel and process operation. It covers scenarios such as process anomalies, port occupancy, and system service failures. It learns the fault analysis logic at the system level and can accurately locate cloud-native environment problems caused by the operating system after training.

[0066] The database diagnostic agent is trained based on expert cases in database fault diagnosis and operation and maintenance knowledge, covering scenarios such as connection anomalies, query lag, and data corruption. It incorporates expert experience in database tuning and log analysis. After training, it can generate diagnostic plans for database-level problems and analyze database logs and indicators to locate the root cause.

[0067] In step S230, the target intelligent agent is invoked to perform diagnostic analysis on the domain problem and generate the current diagnostic results.

[0068] The target agent invokes a plan generation model trained on historical expert cases, combines prompts to constrain the reasoning direction, and generates a structured diagnostic plan. Following the diagnostic plan, security tools (such as monitoring and query tools and log analysis tools) are invoked to obtain multi-dimensional system data. This data is then analyzed and reasoned using a built-in expert experience model, outputting a single-step diagnostic summary (i.e., the current diagnostic result) that includes hypothesis verification results and current evidence.

[0069] In step S240, the current diagnostic result is verified, and if the verification fails, the diagnostic analysis process is iterated until the current diagnostic result is verified.

[0070] The current diagnostic results are verified through a reflection mechanism. First, the integrity of the evidence chain is checked: whether evidence such as tool call records, system operation data, and key indicators covers the core dimensions of every sub-problem, and whether any key data is missing. Second, the consistency of the conclusions is verified by comparing the logical connections between the single-step summaries of the target agents and by applying domain knowledge graph matching rules to identify contradictions or logical gaps. Finally, the hypothesis verification status is evaluated to determine whether the current evidence is sufficient to confirm the hypothesis and whether anything still prevents it from being verified. During verification, an anomaly marking mechanism is automatically triggered to generate specific feedback for each item that fails (such as insufficient evidence or conflicting conclusions).
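The three checks can be sketched as a single function that returns a pass flag plus feedback items. The `StepSummary` fields and the `contradictions` list of mutually exclusive conclusions are hypothetical; the list is a simple stand-in for the domain knowledge graph matching rules, which the source does not detail.

```python
from dataclasses import dataclass


@dataclass
class StepSummary:
    sub_problem: str
    evidence: list
    conclusion: str
    hypothesis_confirmed: bool


def verify(summaries, required_sub_problems, contradictions):
    """Return (passed, feedback) for evidence coverage, consistency, hypothesis status."""
    feedback = []

    # 1. Evidence chain integrity: every sub-problem needs at least one evidence item.
    covered = {s.sub_problem for s in summaries if s.evidence}
    for missing in sorted(set(required_sub_problems) - covered):
        feedback.append(f"insufficient evidence: {missing}")

    # 2. Consistency: flag pairs of conclusions declared mutually exclusive.
    conclusions = {s.conclusion for s in summaries}
    for first, second in contradictions:
        if first in conclusions and second in conclusions:
            feedback.append(f"conflicting conclusions: {first} vs {second}")

    # 3. Hypothesis status: every summary must have confirmed its hypothesis.
    for s in summaries:
        if not s.hypothesis_confirmed:
            feedback.append(f"hypothesis unverified: {s.sub_problem}")

    return (not feedback, feedback)
```

The feedback strings play the role of the anomaly marks: an empty list means the result is verified, and any entry tells the next iteration what to collect or reconcile.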

[0071] If the verification fails (e.g., the problem is not resolved or the chain of evidence is incomplete), the target agent reconstructs or adjusts the diagnostic plan based on the latest diagnostic context, calls the security tool again to collect supplementary data, and repeats the cycle of planning-execution-reflection until the root cause of the failure is initially clear, that is, the current diagnostic result is verified.

[0072] According to embodiments of this application, the iterative diagnostic analysis process includes: adjusting the domain problem based on the current diagnostic results and the updated problem context; matching a new target agent based on the adjusted domain problem; and invoking the new target agent to perform diagnostic analysis on the adjusted domain problem.

[0073] During iterative diagnostic analysis, the domain problem is first re-examined and adjusted based on the current diagnostic results and the updated problem context to better align with the actual fault scenario and achieve more precise problem localization. Then, based on the updated domain problem, a more suitable target agent is re-matched in the agent registry. Next, through a multi-agent collaboration mechanism, the new target agent is invoked to conduct diagnostic analysis, generating new diagnostic results which then enter the verification phase. This process is repeated until the diagnostic results are verified and the root cause is accurately located. The updated problem context refers to information such as newly added tool-returned data, log metrics, and intermediate conclusions during the diagnostic process; it is continuously enriched contextual information throughout the diagnostic process.
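A minimal sketch of this outer loop follows, assuming the matching, verification, and adjustment steps are supplied as callables. The function names, the `hint` field used to adjust the domain, and the stub agents are all hypothetical.

```python
def iterate_diagnosis(domain, match, verify, adjust, max_rounds=5):
    """Repeat match -> diagnose -> verify, adjusting the domain after each failure."""
    context = []  # updated problem context: grows with every round
    for round_no in range(1, max_rounds + 1):
        agent = match(domain)
        result = agent(domain, context)
        context.append(result)
        if verify(result):
            return result, round_no
        domain = adjust(domain, context)
    return None, max_rounds  # not verified within the round budget


# Stub agents standing in for real diagnostic agents.
def container_agent(domain, context):
    return {"agent": "container", "root_cause": None, "hint": "operating system"}


def os_agent(domain, context):
    return {"agent": "os", "root_cause": "container engine fault on host", "hint": None}


AGENTS = {"container": container_agent, "operating system": os_agent}

result, rounds = iterate_diagnosis(
    "container",
    match=lambda d: AGENTS[d],
    verify=lambda r: r["root_cause"] is not None,
    adjust=lambda d, ctx: ctx[-1]["hint"] or d,
)
```

Here the first round fails verification, the domain is adjusted from the container domain to the operating system domain, and the second round is verified. The `max_rounds` cap is an added safeguard that the source does not mention.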

[0074] In the embodiments of this application, based on the current diagnostic results and the updated context, the domain problem is adjusted, and a new target agent is invoked to re-perform the diagnostic analysis. This continuous iteration mechanism allows the diagnostic results to be dynamically optimized based on actual feedback, continuously correcting diagnostic biases and gradually converging to the true root cause of the fault. This not only solves the limitations of single diagnosis and improves the accuracy of root cause localization, but also adapts to complex and unknown fault scenarios.

[0075] In step S250, a fault location report is generated based on the verified diagnostic results.

[0076] Diagnostic reasoning data, such as reasoning steps, tool call records, and key evidence (log fragments, abnormal indicators), is collected during diagnosis. The verified diagnostic results and the diagnostic reasoning data are then integrated into a preset template to generate a fault location report. The report clearly presents the diagnostic logic and supporting data, making it easy for operation and maintenance personnel to verify.

[0077] The generated fault location report is not just a conclusion, but also includes clear reasoning steps, specific tools used, and key data evidence obtained. This white-box, auditable diagnostic report enables operations and maintenance personnel to quickly understand, verify, and trust the agent's judgment, effectively resolving the trust crisis caused by the black-box operation of traditional large models and facilitating human-machine collaboration.
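As an illustration of template-based report generation, the sketch below fills a preset template with the conclusion, reasoning steps, tool calls, and evidence. The template layout and field names are assumptions; the source only states that a preset template is used.

```python
from string import Template

# Hypothetical preset template.
REPORT_TEMPLATE = Template(
    "Fault location report\n"
    "Root cause: $root_cause\n"
    "Reasoning steps:\n$steps\n"
    "Tool calls:\n$tools\n"
    "Key evidence:\n$evidence\n"
)


def bullet(items):
    """Render a list as indented bullet lines."""
    return "\n".join(f"  - {item}" for item in items)


def build_report(root_cause, steps, tool_calls, evidence):
    return REPORT_TEMPLATE.substitute(
        root_cause=root_cause,
        steps=bullet(steps),
        tools=bullet(tool_calls),
        evidence=bullet(evidence),
    )


report = build_report(
    "container engine fault on host",
    ["checked instance status", "inspected host logs"],
    ["instance_status(web-1)", "log_query(host-3)"],
    ["error log fragment from container engine"],
)
```

Keeping the reasoning steps and tool calls in the report alongside the conclusion is what makes the output auditable by operations and maintenance personnel.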

[0078] In the embodiments of this application, the intent recognition agent parses the fault description and identifies the domain problem, a target diagnostic agent is matched from the diagnostic agent group for targeted analysis, and the diagnostic results are refined through iterative verification until an accurate fault location report is generated. This realizes intelligent fault location in the cloud-native environment. Multi-agent diagnostic analysis reduces the heavy dependence on expert experience and on human intervention, improves the efficiency and accuracy of root cause localization, adapts to the heterogeneous and dynamic characteristics of the cloud-native environment, responds quickly to the fault diagnosis needs of financial scenarios, and ensures the continuous and stable operation of the cloud-native environment.

[0079] According to an embodiment of this application, after generating a fault location report, the fault location method further includes: extracting diagnostic data from the diagnostic analysis process and storing the diagnostic data in a diagnostic case knowledge base; obtaining user feedback information on the fault location report; and updating and training the agents in the diagnostic agent group based on the diagnostic case knowledge base and the user feedback information.

[0080] After each successful diagnosis, the complete chain of problem context → diagnostic plan → execution evidence → final conclusion is automatically saved as a structured case and stored in the knowledge base as reference and training data for future similar fault diagnoses. Diagnostic data includes diagnostic analysis plans, execution status of security tools, system operation data, system operation results, and single-step diagnostic summaries.
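A structured case could be represented and stored roughly as follows. The field names mirror the chain described above, while the JSON encoding and the in-memory list used as the knowledge base are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DiagnosticCase:
    problem_context: str
    diagnostic_plan: list
    execution_evidence: list
    final_conclusion: str


def save_case(case, knowledge_base):
    """Serialize one case and append it to the (hypothetical) knowledge base."""
    knowledge_base.append(json.dumps(asdict(case), ensure_ascii=False))


def load_cases(knowledge_base):
    """Rebuild DiagnosticCase objects from the stored JSON records."""
    return [DiagnosticCase(**json.loads(record)) for record in knowledge_base]


knowledge_base = []
case = DiagnosticCase(
    problem_context="container instance restarts repeatedly",
    diagnostic_plan=["query instance status", "query memory metrics"],
    execution_evidence=["memory usage at limit before each restart"],
    final_conclusion="memory limit exceeded",
)
save_case(case, knowledge_base)
restored = load_cases(knowledge_base)
```

Storing cases as self-describing records keeps them usable both as retrieval references for similar faults and as training samples later on.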

[0081] The confirmation or correction feedback from maintenance personnel on the fault location report can be directly used to fine-tune the decision-making model of the relevant intelligent agent, thereby achieving continuous online optimization of the intelligent agent.

[0082] Structured diagnostic cases are retrieved from the knowledge base, and user feedback information is preprocessed (e.g., labeled and cleaned). The preprocessed feedback and diagnostic cases are then input into the corresponding domain-specific intelligent agent model. Decision parameters are fine-tuned, training is completed, and the effectiveness is verified. Once the model meets the standards, it is deployed online. The trained intelligent agents are updated, solidifying the diagnostic methodologies of top experts into the instincts of intelligent agents in different professional domains. This standardizes and scales the diagnostic process across the enterprise, solving the problem of operational capabilities relying on individuals and varying skill levels.

[0083] In the embodiments of this application, extracting diagnostic data and storing it in a knowledge base enables the structured retention of diagnostic and maintenance knowledge. Using user feedback on the reports to train the diagnostic intelligent agents keeps the training aligned with actual application scenarios, continuously optimizes the agents' diagnostic capabilities, promotes interactive learning and evolution of the system, and allows the diagnostic strategy to be continuously iterated and upgraded.

[0084] Figure 3 schematically shows a flowchart of a multi-agent diagnostic analysis process according to an embodiment of this application. As shown in Figure 3, according to an embodiment of this application, step S230, which calls the target intelligent agent to perform diagnostic analysis on the domain problem and generates the current diagnostic result, includes steps S310 to S330.

[0085] In step S310, the domain problem is decomposed to obtain sub-problems, and the sub-problems are distributed to the corresponding target agents based on the multi-agent cooperation mechanism.

[0086] By leveraging a knowledge graph from the cloud-native fault domain, complex domain problems are broken down into several independent sub-problems, each corresponding to a single technical dimension. A multi-agent collaboration mechanism matches target agents with corresponding expertise within the diagnostic agent group based on the domain attributes of the sub-problems. Then, through the task distribution module of the language model graph orchestration framework, the decomposed sub-problems are accurately distributed to the appropriate target agents, ensuring that each sub-problem is analyzed by a specialized agent. The decomposition process follows the logical hierarchy of fault diagnosis, avoiding omissions or duplication of sub-problems.

[0087] If the domain problem is decomposed into only one sub-problem, the multi-agent collaboration mechanism will match the target agent with the corresponding professional capability in the diagnostic agent group according to the domain attribute of the sub-problem, and send the domain problem to the target agent.
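The decomposition and distribution step might look like the sketch below. The knowledge graph is reduced to a plain dictionary, and its sub-problem entries, like the agent names, are invented for illustration; only the two example fault descriptions come from this application.

```python
# Hypothetical fault knowledge graph: domain problem -> (domain, sub-problem) edges.
FAULT_GRAPH = {
    "service slowed down": [
        ("container", "CPU throttling on container instances"),
        ("database", "slow queries"),
        ("container", "CPU throttling on container instances"),  # duplicate edge
    ],
    "container frequently restarts": [
        ("container", "memory limit exceeded"),
    ],
}


def decompose(domain_problem):
    """Return de-duplicated sub-problems, each tagged with one technical domain."""
    seen = set()
    sub_problems = []
    for domain, description in FAULT_GRAPH.get(domain_problem, []):
        if (domain, description) in seen:
            continue  # avoid duplicated sub-problems
        seen.add((domain, description))
        sub_problems.append({"domain": domain, "sub_problem": description})
    return sub_problems


def dispatch(domain_problem, agents):
    """Pair each sub-problem with the agent registered for its domain."""
    return [(agents[sp["domain"]], sp["sub_problem"])
            for sp in decompose(domain_problem)]
```

A problem that decomposes into a single sub-problem simply yields a one-element list, so the single-agent case needs no special handling in this sketch.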

[0088] In step S320, based on the multi-agent collaboration mechanism, state synchronization and collaborative interaction of control flow and data flow are performed among the target agents, each target agent is invoked to analyze its sub-problem, and a single-step diagnostic summary corresponding to each target agent is generated.

[0089] The multi-agent collaborative mechanism is based on a language model graph orchestration framework. It establishes control flow and data flow channels among the target agents to achieve global state synchronization. Each target agent can share the diagnostic context and sub-problem analysis progress in real time. According to preset collaborative interaction rules, each target agent is scheduled to analyze the corresponding sub-problems in parallel. Each agent calls the plan generation model and security tools to complete the investigation, data collection, and reasoning analysis of the sub-problems. Finally, a single-step diagnostic summary containing the sub-problem analysis process, evidence, and preliminary conclusions is generated and synchronously transmitted back to the shared data center of the collaborative system to ensure the orderliness and relevance of the diagnostic process.
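Parallel analysis with results written back to a shared store can be sketched with the standard library. The `SharedDataCenter` class and the stub agents are hypothetical; a thread pool is used here only as one simple way to run agents concurrently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedDataCenter:
    """Thread-safe store for single-step diagnostic summaries."""

    def __init__(self):
        self._lock = threading.Lock()
        self._summaries = {}

    def publish(self, sub_problem, summary):
        with self._lock:
            self._summaries[sub_problem] = summary

    def snapshot(self):
        with self._lock:
            return dict(self._summaries)


def run_parallel(assignments, center):
    """assignments maps each sub-problem to the agent callable that analyzes it."""

    def analyze(sub_problem, agent):
        center.publish(sub_problem, agent(sub_problem))

    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        futures = [pool.submit(analyze, sp, agent) for sp, agent in assignments.items()]
        for future in futures:
            future.result()  # re-raise any error raised inside an agent
    return center.snapshot()


# Stub agents returning canned summaries.
assignments = {
    "memory pressure": lambda sp: {"conclusion": "limit exceeded", "evidence": ["restart events"]},
    "node health": lambda sp: {"conclusion": "healthy", "evidence": ["node ready"]},
}
results = run_parallel(assignments, SharedDataCenter())
```

Because every agent publishes to the same store, a later integration step can read all summaries from one place regardless of which agent finished first.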

[0090] The multi-agent collaboration mechanism, based on a language model graph orchestration framework, employs a graph structure to manage agent states and workflows. This enables flexible task routing, context persistence, and complex control flow, supporting non-linear diagnostic paths and resulting in highly scalable systems (expandable to include other agents). The multi-agent collaboration mechanism allows different specialized diagnostic tasks (such as network troubleshooting, code-level log analysis, and resource configuration verification) to be executed in parallel or pipelined under scheduling and coordination, compressing diagnostic timelines and improving diagnostic efficiency.

[0091] In step S330, the single-step diagnostic summary corresponding to the target intelligent agent is integrated to generate the current diagnostic result.

[0092] The single-step diagnostic summaries returned by all target intelligent agents are extracted from the shared data center. Combined with the overall logic of cloud-native fault diagnosis, the single-step diagnostic summaries are structured and integrated according to the association hierarchy between sub-problems and original domain problems. Logical connection information between sub-problems is supplemented, and finally a complete and coherent current diagnostic result covering the analysis results of all sub-problems is formed, providing a comprehensive analytical basis for subsequent verification.

[0093] In the embodiments of this application, the domain problem is first decomposed and distributed to the corresponding target intelligent agents. Then, the control flow and data flow between the intelligent agents are synchronized through the multi-agent collaboration mechanism. Each intelligent agent generates a single-step diagnostic summary, and the summaries are integrated into the current diagnostic result. This achieves parallel collaborative diagnosis by multiple intelligent agents, compresses the diagnostic timeline, ensures the interoperability of diagnostic data, and provides a complete chain of evidence to support the diagnostic conclusions. It greatly improves the efficiency and accuracy of root cause localization and realizes a standardized diagnostic process for intelligent agents.

[0094] Figure 4 schematically shows a flowchart of single-step diagnosis by an intelligent agent according to an embodiment of this application. As shown in Figure 4, according to an embodiment of this application, step S320, which calls the target agent to analyze the sub-problem and generates a single-step diagnostic summary corresponding to the target agent, includes steps S410 to S440.

[0095] In step S410, the target intelligent agent is invoked to perform contextual reasoning on the sub-problem to obtain a diagnostic plan.

[0096] After receiving the corresponding sub-problem, the target agent retrieves the current fault context and the built-in expert experience base, and performs contextual reasoning using a plan generation model trained on historical diagnostic cases. Simultaneously, preset prompts guide the reasoning direction, preventing deviation from the core of the sub-problem. During the reasoning process, the model matches the corresponding fault investigation logic based on the sub-problem type, ultimately generating a dynamic diagnostic plan that includes specific diagnostic steps, tool call sequences, and expected verification targets. The safety tool calls and step progression in the plan align with the expert troubleshooting approach in actual operations and maintenance, ensuring professionalism and relevance.

[0097] A diagnostic plan is a dynamic execution scheme generated by the target agent (such as a cluster diagnostic agent or a virtual machine diagnostic agent) based on sub-problems, the current fault context, and an expert experience base. It includes specific diagnostic steps, tool call sequences, and expected verification targets.
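The three parts of a diagnostic plan map naturally onto a small data structure. The class and field names below are hypothetical, as are the tool names in the example.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    order: int
    description: str   # specific diagnostic step
    tool: str          # security tool to call for this step
    expected: str      # expected verification target


@dataclass
class DiagnosticPlan:
    sub_problem: str
    steps: list = field(default_factory=list)

    def tool_sequence(self):
        """Tool names in execution order."""
        return [step.tool for step in sorted(self.steps, key=lambda s: s.order)]


plan = DiagnosticPlan(
    sub_problem="container instance memory overflow",
    steps=[
        PlanStep(2, "search error logs", "log_query", "out-of-memory entries present"),
        PlanStep(1, "check memory trend", "metrics_query", "memory usage near limit"),
    ],
)
```

Recording the expected verification target per step gives the later reflection stage something concrete to compare each observation against.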

[0098] In step S420, based on the diagnostic plan, security tools are invoked and executed to obtain system operation data.

[0099] According to the diagnostic plan, security tools can be used to obtain system operation data through an observability system. The observability system consists of three core data types: indicators, logs, and link tracing. It is a comprehensive technical framework used to systematically monitor, diagnose, and understand the internal state of complex systems.

[0100] Security tools include native tools for container orchestration systems, monitoring metric query tools, log query tools, container command execution tools, and link dependency analysis tools.

[0101] Native tools for container orchestration systems: Query the running status of container instances, cluster nodes, and deployment instances.

[0102] Monitoring metric query tool: Obtains time-series data such as CPU, memory, and virtual machine stack.

[0103] Log query tool: Performs retrieval and filtering operations on critical error logs.

[0104] Container command execution tool: Securely executes diagnostic commands such as stack analysis and system process viewing in a sandbox environment.

[0105] Link dependency analysis tool: Investigates the impact of upstream and downstream connections on services.

[0106] The diagnostic process requires the agent to invoke multiple security tools to obtain direct, objective system operation status data as evidence, rather than relying solely on the internal knowledge of a large model or on a single data source, thus avoiding hallucinated reasoning.

[0107] The intelligent agent automatically performs various probing actions, from data querying to command execution, and calls and executes security tools, replacing the tedious work of operations and maintenance personnel manually switching between multiple consoles and entering complex query statements. This reduces the average diagnosis time from hours to minutes, significantly reducing the reliance on immediate responses from senior operations and maintenance personnel.

[0108] According to an embodiment of this application, step S420, which involves calling and executing a security tool based on a diagnostic plan to obtain system operation data, includes: calling the security tool according to the processing order of the diagnostic plan; and obtaining multi-source observation data of the cloud-native environment through the security tool based on a standard application programming interface to obtain system operation data.

[0109] Following the processing sequence defined in the diagnostic plan, appropriate security tools are invoked sequentially. These tools include native container orchestration tools and monitoring metric query tools. All security tools encapsulate atomic operations into high-order actions and interface with the cloud-native environment through standard application programming interfaces (APIs), acquiring multi-source observational data such as metrics, logs, and trace data from the observable system. API calls adhere to unified specifications to ensure data compatibility, while the acquisition of data from multiple tools and dimensions avoids the limitations of a single data source, providing objective and comprehensive system operation data for diagnostics.
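Ordered tool invocation can be sketched as a loop over plan steps against a tool registry. The tool names, their parameters, and the canned return values are all hypothetical stubs; real tools would call the cloud-native environment's APIs.

```python
def query_instance_status(instance):
    # Stub: a real tool would query the container orchestration system.
    return {"instance": instance, "status": "restarting"}


def query_metrics(instance, metric):
    # Stub: a real tool would query the monitoring system.
    return {"instance": instance, "metric": metric, "values": [91, 95, 99]}


TOOLS = {"instance_status": query_instance_status, "metrics": query_metrics}


def run_plan(steps, tools):
    """Call each tool in plan order and collect one observation per step."""
    observations = []
    for name, kwargs in steps:
        tool = tools.get(name)
        if tool is None:
            # Record the gap instead of aborting, so later steps still run.
            observations.append({"tool": name, "error": "unknown tool"})
            continue
        observations.append({"tool": name, "data": tool(**kwargs)})
    return observations


observations = run_plan(
    [
        ("instance_status", {"instance": "web-1"}),
        ("metrics", {"instance": "web-1", "metric": "memory_pct"}),
        ("trace_query", {"instance": "web-1"}),
    ],
    TOOLS,
)
```

Each observation keeps the tool name next to its data, which preserves the tool call record needed for the evidence chain.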

[0110] The ecosystem of security tools has been optimized for intelligent agents. The composite security tools encapsulate multiple atomic operations into high-order actions that conform to the expert's thinking mode, which significantly improves the efficiency and accuracy of tool invocation. At the same time, through carefully managed context strategies, the signal-to-noise ratio of interaction with large models has been optimized.

[0111] In the embodiments of this application, security tools are invoked in the order of the diagnostic plan to ensure that data acquisition conforms to the diagnostic logic. Then, multi-source observation data of the cloud-native environment is obtained through standard interfaces to ensure the standardization and compatibility of data acquisition. This ensures that the multi-source observation data covers all dimensions, avoiding the limitations of a single data source and providing rich and objective system data for diagnosis.

[0112] In step S430, the system operation data is standardized to obtain the system operation results.

[0113] The system standardizes and unifies multi-source, heterogeneous system operation data, including data format normalization, field alignment, invalid data cleaning, and anomaly data marking. For example, it converts unstructured log text into key-value pair structured data and standardizes the sampling frequency and units of time-series metrics. The processed system operation results can then be directly parsed by the target intelligent agent.
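Two of the standardization operations, converting log text into key-value pairs and unifying metric units, can be sketched as follows. The `key=value` log layout and the choice of MiB as the common unit are assumptions for this example.

```python
import re

# Matches key=value pairs; values may be bare tokens or double-quoted strings.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_log_line(line):
    """Turn a key=value style log line into a dictionary."""
    return {key: value.strip('"') for key, value in KV_PATTERN.findall(line)}


UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a memory reading to MiB so metrics from different tools align."""
    return value * UNIT_TO_MIB[unit]


record = parse_log_line('level=error pod=web-1 msg="out of memory"')
```

After this step, `record` is a plain dictionary and memory readings share one unit, so an agent can compare values from different tools directly.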

[0114] In step S440, the target intelligent agent is invoked to perform reflective reasoning analysis on the system operation results and generate a single-step diagnostic summary corresponding to the target intelligent agent.

[0115] The target agent retrieves the system's operational results, initiates a built-in reflective reasoning mechanism for analysis, first verifying whether the system's operational results match the expected verification objectives of the diagnostic plan, and then assessing whether the current data evidence is sufficient. The reasoning process incorporates the fault analysis logic of domain experts, combines the evidence chain to deduce the current investigation conclusions of sub-problems, and finally generates a single-step diagnostic summary containing data evidence, the reasoning process, the current conclusions, and hypothesis verification results, which is synchronously transmitted back to the shared data center of the multi-agent collaborative mechanism.

[0116] The target agent employs a "reasoning-action" interaction model. In this model, the agent generates internal reasoning chains to explain its thought process and determines its next action (such as invoking security tools), forming a "thought-action-observation" cycle. The model requires the agent to reason about and summarize the result (observation) of each step of system operation, and then to verify the conclusion of each step. This gradual convergence mimics the rigor of empirical scientific inquiry, ensuring that the final root cause conclusion is supported by a complete chain of evidence and has high credibility.
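The cycle can be sketched as a loop in which a policy chooses the next action from the trace so far. In a real system the policy would be a language model; here it is a scripted stub, and the tool names and return values are hypothetical.

```python
def react_loop(policy, tools, max_turns=5):
    """Run thought -> action -> observation turns until the policy finishes."""
    trace = []
    for _ in range(max_turns):
        thought, action, arg = policy(trace)
        if action == "finish":
            trace.append({"thought": thought, "conclusion": arg})
            return trace
        observation = tools[action](arg)
        trace.append({"thought": thought, "action": action, "observation": observation})
    return trace  # turn budget exhausted without a conclusion


def scripted_policy(trace):
    # Stand-in for model reasoning: decide based on how much has been observed.
    if not trace:
        return "Check memory first", "query_metrics", "web-1"
    if len(trace) == 1:
        return "Memory is high; look for out-of-memory events", "query_logs", "web-1"
    return "Metrics and logs agree", "finish", "memory limit exceeded"


TOOLS = {
    "query_metrics": lambda pod: {"pod": pod, "memory_pct": 98},
    "query_logs": lambda pod: [f"{pod} killed: out of memory"],
}

trace = react_loop(scripted_policy, TOOLS)
```

The returned trace holds every thought, action, and observation in order, which is the raw material for an interpretable report.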

[0117] In the embodiments of this application, an intelligent agent is used to perform contextual reasoning to formulate a targeted diagnostic plan. Then, according to the diagnostic plan, security tools are called to obtain real system operation data. After standardization processing to ensure data uniformity, a single-step diagnostic summary is generated through the intelligent agent's reflective reasoning. This makes the diagnostic analysis process planned and the data objective and verifiable, and forms a complete evidence chain for the intelligent agent's single-step diagnosis, avoiding model hallucination and improving the accuracy of the intelligent agent's single-step diagnosis.

[0118] According to an embodiment of this application, in step S410, calling the target agent to perform contextual reasoning on the sub-problem to obtain a diagnostic plan includes: calling the target agent, and using the plan generation model of the target agent to perform contextual reasoning on the sub-problem based on preset prompts to obtain a diagnostic plan; wherein, the plan generation model is a large language model trained based on historical expert diagnostic cases.

[0119] The target agent is invoked by first inputting the sub-problem, the current fault context (including previous diagnostic progress and acquired data), and preset prompts into its built-in plan generation model. This plan generation model is trained on a massive database of historical expert diagnostic cases, encoding implicit investigation experience and explicit strategies from domain experts. It constrains the inference boundary through prompts to avoid deviating from the core of the sub-problem, while dynamically analyzing the fault scenarios associated with the sub-problem within the context. Key verification nodes are decomposed according to expert investigation logic, matching corresponding security tool call sequences, clarifying the diagnostic objectives and execution order of each step, and ultimately generating a structured diagnostic plan.
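How the three inputs might be assembled into a single model input is sketched below. The prompt wording and layout are assumptions; the source does not disclose the actual preset prompts, and no model call is made here.

```python
def build_plan_prompt(preset_prompt, sub_problem, context):
    """Combine the preset prompt, sub-problem, and fault context into one input."""
    context_lines = "\n".join(f"- {item}" for item in context) or "- (none)"
    return (
        f"{preset_prompt}\n\n"
        f"Sub-problem: {sub_problem}\n\n"
        f"Fault context:\n{context_lines}\n\n"
        "Return numbered diagnostic steps, each naming one tool "
        "and one expected verification target."
    )


prompt = build_plan_prompt(
    "You are a container cluster diagnostic agent. Stay within the sub-problem.",
    "container instance memory overflow",
    ["instance restarted 4 times in 10 minutes"],
)
```

Placing the constraint text first and the required output shape last is one common arrangement; it keeps the sub-problem and context between the boundary and the format instructions.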

[0120] The "reasoning-action" paradigm deeply integrates expert experience into the intelligent agent: through prompt engineering and the plan generation model, the troubleshooting approach of operation and maintenance experts is internalized in the agent. This enables the intelligent agent not only to execute, but also to think and plan, achieving a leap from rule-driven to goal-driven operation.

[0121] In the embodiments of this application, a plan generation model trained with historical expert diagnostic cases is invoked within the target intelligent agent. This model, combined with prompts, performs contextual reasoning to generate a diagnostic plan. The plan generation model encodes implicit expert diagnostic experience into explicit strategies, ensuring the diagnostic plan aligns with actual operational scenarios and possesses both professionalism and relevance. Simultaneously, prompts guide the reasoning direction, preventing deviation from the core of the fault problem and ensuring the generated diagnostic plan is logically clear and its steps are reasonable.

[0122] Taking the scenario of diagnosing abnormal container states as an example, Figure 5 schematically shows a container diagnostic flowchart according to an embodiment of this application. As shown in Figure 5, the container diagnostic process is not a fixed script, but a dynamically generated, iteratively verified intelligent loop. Its core process strictly follows the "reasoning-action" interaction model. Users initiate requests in natural language, and the intent recognition agent analyzes the input, determines it to be a container cluster domain problem (such as "analyze the cause of the container instance's memory overflow"), and distributes it to the corresponding professional agent in the diagnostic agent group.

[0123] The specialized diagnostic agent receiving the task (such as the container diagnostic agent) applies its internalized expert experience to the current problem context and dynamically generates a preliminary, structured diagnostic plan that explicitly lists the sequence of steps required to verify the hypothesis.

[0124] According to the plan, the professional diagnostic agent invokes various tools provided by the security tool layer (such as native tools of container orchestration systems, monitoring metric query tools, log query tools, container command execution tools, etc.) to obtain real system data, such as container, host machine, system metrics, business metrics, logs, command interpreter (shell command execution) data, etc. The security tools are designed following the agent-friendly principle, going beyond simple application programming interface encapsulation to provide composite capabilities that conform to the operational logic of human experts. The agent invokes and executes the security tools and returns structured system operation results (observations) to the diagnostic agent.

[0125] The professional diagnostic agent analyzes and reasons about the system's operating results, generating a single-step diagnostic summary. This process incorporates a reflection mechanism to assess whether the current evidence is sufficient and whether the hypothesis has been verified or falsified.

[0126] If evaluation of the professional diagnostic agent's single-step summary determines that the problem remains unresolved, the agent regenerates or adjusts the subsequent diagnostic plan based on the latest context, initiating the next "plan-execute-reflect" cycle. Multi-turn dialogue capabilities enable the diagnostic process to follow the clues more deeply, simulating an expert's progressive investigation approach.

[0127] Once the chain of evidence is complete and the root cause of the failure is confirmed (e.g., the container failure is located to be caused by an abnormality in the host device's container engine), all inference steps, tool call records, and key evidence (such as stack traces, error logs, and anomaly indicators) will be formatted and integrated into a clear and interpretable final failure location report, which will be output to the user.

[0128] For example, a fault location system is constructed based on the fault location method. Figure 6 schematically illustrates a hierarchical architecture diagram of a fault location system according to an embodiment of this application. As shown in Figure 6, the fault location system includes an infrastructure layer, a data layer, an intelligent diagnosis layer (AI Agent), and a business scenario layer.

[0129] The infrastructure layer serves as the carrier for system deployment and operation, supporting multi-container orchestration system clusters and various data storage middleware, such as columnar databases, search engines, graph databases, monitoring systems, virtual machine systems, and cloud-native databases.

[0130] Data Layer: Integrates multi-source data and knowledge bases, and connects through standardized application programming interfaces. The system integrates and uniformly processes multi-dimensional observable data from the cloud-native environment to achieve environmental awareness. The observable data includes: monitoring indicators (system / business indicators), distributed links, service topology, application and system logs, etc.

[0131] Intelligent Diagnosis Layer: Based on a language model graph orchestration framework, a multi-agent collaborative mechanism is constructed, rather than a single-processor model. This multi-agent collaborative mechanism includes various specialized agents, each with its own specific function. The intent recognition agent is responsible for understanding the user's fault description or inquiry (such as "service slowed down" or "container frequently restarts") input in natural language, accurately determining the problem type and domain, and routing the task to the appropriate diagnostic or consultation agent. The diagnostic agent group consists of multiple domain expert agents, such as: container orchestration system cluster diagnostic agents, virtual machine diagnostic agents, operating system diagnostic agents, database diagnostic agents, etc. Each agent embeds the experience of senior experts in a specific domain. The agents collaborate in a stateful manner through the control flow and data flow defined by the language model graph orchestration framework, sharing the diagnostic context to achieve relay and collaboration of complex tasks.
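As a non-limiting illustration, the routing from the intent recognition agent to a domain agent over a shared diagnostic context might be sketched as follows. Keyword matching is used here only as a simple stand-in for the language-model-based intent recognition described above; the keyword table, the agent functions, and the context layout are assumptions made for this sketch.

```python
from typing import Any, Callable, Dict, List

# Hypothetical keyword table; a real system would use a language model.
DOMAIN_KEYWORDS: Dict[str, List[str]] = {
    "container": ["container", "pod", "restart"],
    "database": ["database", "sql", "query"],
    "operating_system": ["cpu", "memory", "disk"],
}

Context = Dict[str, Any]


def recognize_intent(description: str) -> str:
    """Return the best-matching domain, or 'consultation' if none matches."""
    text = description.lower()
    scores = {
        domain: sum(keyword in text for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "consultation"


def container_agent(ctx: Context) -> None:
    ctx["findings"].append("container agent: inspect restart history")


def database_agent(ctx: Context) -> None:
    ctx["findings"].append("database agent: inspect slow queries")


def os_agent(ctx: Context) -> None:
    ctx["findings"].append("os agent: inspect host resources")


def consultation_agent(ctx: Context) -> None:
    ctx["findings"].append("consultation agent: answer the inquiry")


AGENTS: Dict[str, Callable[[Context], None]] = {
    "container": container_agent,
    "database": database_agent,
    "operating_system": os_agent,
}


def route(description: str) -> Context:
    """Build the shared context and hand it to the matched agent."""
    ctx: Context = {
        "description": description,
        "domain": recognize_intent(description),
        "findings": [],
    }
    AGENTS.get(ctx["domain"], consultation_agent)(ctx)
    return ctx


print(route("Container frequently restarts"))
```

Because every agent writes into the same context object, a later agent can pick up where an earlier one left off.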

[0132] Business scenario layer: It connects to systems such as cloud platforms, event alarm platforms, and operation and maintenance diagnosis platforms, and actively reports anomalies through standard interfaces, synchronizing various information such as abnormal container status and abnormal service status.

[0133] The fault location system possesses intelligent operation and maintenance decision-making capabilities and autonomous evolution capabilities. From multiple dimensions such as accuracy, efficiency, knowledge management, ease of use, and security, the system systematically addresses the core challenges faced by operation and maintenance in the cloud-native era, enabling the transformation from automated to intelligent operation and maintenance. By introducing a multi-agent collaboration mechanism and a "plan-execute-verify" loop, the system achieves truly dynamic planning and autonomous reasoning-based intelligent diagnosis in the operation and maintenance field. No longer relying on fixed "conditional judgment statements," it acts like a seasoned expert, dynamically generating diagnostic paths for new and complex fault scenarios and adjusting strategies in real time based on feedback during execution. This gives the system the ability to handle unknown fault modes, elevating its intelligence level from simple anomaly detection to high-order problem-solving.

[0134] Based on the above-described fault location method, embodiments of this application also provide a fault location device. The device is described in detail below with reference to Figure 7.

[0135] Figure 7 schematically shows a block diagram of a fault location device according to an embodiment of this application.

[0136] As shown in Figure 7, the fault location device 1000 of this embodiment includes an intent recognition module 1010, a target determination module 1020, a diagnostic analysis module 1030, a verification module 1040, and a report generation module 1050.

[0137] The intent recognition module 1010 is used to receive a fault description statement input by the user, and to call the intent recognition agent to analyze the fault description statement in order to determine the domain problem. In one embodiment, the intent recognition module 1010 can be used to perform step S210 described above, which will not be repeated here.

[0138] The target determination module 1020 is used to match target agents in the diagnostic agent group based on the domain problem. In one embodiment, the target determination module 1020 can be used to perform step S220 described above, which will not be repeated here.

[0139] The diagnostic analysis module 1030 is used to invoke the target intelligent agent to perform diagnostic analysis on the domain problem and generate the current diagnostic result. In one embodiment, the diagnostic analysis module 1030 can be used to execute step S230 described above, which will not be repeated here.

[0140] The verification module 1040 is used to verify the current diagnostic result, and if the verification fails, it continues to iterate the diagnostic analysis process until the current diagnostic result passes verification. In one embodiment, the verification module 1040 can be used to execute step S240 described above, which will not be repeated here.

[0141] The report generation module 1050 is used to generate a fault location report based on the verified diagnostic results. In one embodiment, the report generation module 1050 can be used to perform step S250 described above, which will not be repeated here.

[0142] According to an embodiment of this application, the diagnostic analysis module 1030 includes: a problem distribution submodule, used to decompose the domain problem into sub-problems, and distribute the sub-problems to corresponding target agents based on a multi-agent collaboration mechanism; an agent processing submodule, used to synchronize the state of control flow and data flow of the target agents and perform collaborative interaction based on the multi-agent collaboration mechanism, call the target agents to analyze the sub-problems, and generate a single-step diagnostic summary corresponding to the target agents; and a diagnostic result generation submodule, used to integrate the single-step diagnostic summary corresponding to the target agents and generate the current diagnostic result.
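As a non-limiting illustration, the three submodules above (problem distribution, agent processing over shared state, and result integration) might be sketched as follows. The fixed decomposition, the agent functions, and their return strings are hypothetical; in the described system these would be produced by the agents themselves.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, List[str]]


def decompose(domain_problem: str) -> List[Tuple[str, str]]:
    """Hypothetical fixed split of a domain problem into sub-problems."""
    return [
        ("container", f"container view of: {domain_problem}"),
        ("operating_system", f"host view of: {domain_problem}"),
    ]


def container_agent(question: str, ctx: Context) -> str:
    return "container was OOM-killed repeatedly"


def os_agent(question: str, ctx: Context) -> str:
    # Reads the shared context written by agents that ran earlier.
    if any("OOM" in summary for summary in ctx["summaries"]):
        return "host memory is exhausted, consistent with the OOM kills"
    return "host looks healthy"


AGENTS: Dict[str, Callable[[str, Context], str]] = {
    "container": container_agent,
    "operating_system": os_agent,
}


def diagnose(domain_problem: str) -> Dict[str, object]:
    ctx: Context = {"summaries": []}
    for domain, question in decompose(domain_problem):
        # Each agent sees the summaries accumulated so far.
        ctx["summaries"].append(AGENTS[domain](question, ctx))
    return {
        "problem": domain_problem,
        "summaries": ctx["summaries"],
        "result": "; ".join(ctx["summaries"]),
    }


out = diagnose("container frequently restarts")
print(out["result"])
```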

[0143] According to an embodiment of this application, the intelligent agent processing submodule includes: a context reasoning unit, used to invoke the target intelligent agent to perform context reasoning on the sub-problem to obtain a diagnostic plan; a data acquisition unit, used to invoke and execute security tools based on the diagnostic plan to obtain system operation data; a data standardization unit, used to standardize the system operation data to obtain system operation results; and a reflective reasoning unit, used to invoke the target intelligent agent to perform reflective reasoning analysis on the system operation results to generate a single-step diagnostic summary corresponding to the target intelligent agent.

[0144] According to an embodiment of this application, the data acquisition unit includes: a tool invocation subunit, used to invoke the security tool according to the processing order of the diagnostic plan; and a multi-source data acquisition subunit, used to acquire multi-source observation data of the cloud-native environment through the security tool based on a standard application programming interface, and obtain the system operation data.
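As a non-limiting illustration, acquiring multi-source observation data and standardizing it into a single record format might be sketched as follows. The `Record` type and the field names of the raw metric and log entries are assumptions made for this sketch; actual monitoring and logging systems expose their own schemas.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Record:
    """Common shape for all observation data after standardization."""
    source: str
    entity: str
    timestamp: float  # seconds
    payload: Dict[str, Any]


def from_metric(metric: Dict[str, Any]) -> Record:
    # Hypothetical metric shape: timestamp already in seconds.
    return Record(
        source="metric",
        entity=metric["pod"],
        timestamp=float(metric["ts"]),
        payload={"name": metric["name"], "value": metric["value"]},
    )


def from_log(entry: Dict[str, Any]) -> Record:
    # Hypothetical log shape: timestamp in milliseconds.
    return Record(
        source="log",
        entity=entry["container"],
        timestamp=entry["time_ms"] / 1000.0,
        payload={"level": entry["level"], "message": entry["message"]},
    )


def standardize(metrics: List[Dict[str, Any]],
                logs: List[Dict[str, Any]]) -> List[Record]:
    """Merge both sources into one timeline ordered by timestamp."""
    records = [from_metric(m) for m in metrics] + [from_log(e) for e in logs]
    return sorted(records, key=lambda r: r.timestamp)


metrics = [{"pod": "pay-svc-0", "ts": 12, "name": "cpu", "value": 0.93}]
logs = [{"container": "pay-svc-0", "time_ms": 11500,
         "level": "ERROR", "message": "OOMKilled"}]
timeline = standardize(metrics, logs)
print([(r.source, r.timestamp) for r in timeline])
```

A single ordered timeline lets the reflective reasoning step correlate events from different sources without knowing each source's native format.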

[0145] According to an embodiment of this application, the context reasoning unit includes: a plan generation subunit, used to invoke the target agent, and through the plan generation model of the target agent, perform context reasoning on the sub-problem based on preset prompts to obtain the diagnostic plan; wherein, the plan generation model is a large language model trained based on historical expert diagnostic cases.

[0146] According to an embodiment of this application, the verification module 1040 includes: a problem adjustment unit, configured to adjust the domain problem based on the current diagnostic result and the updated problem context; an agent update unit, configured to match a new target agent based on the adjusted domain problem; and an analysis unit, configured to invoke the new target agent to perform diagnostic analysis on the adjusted domain problem.
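As a non-limiting illustration, the iteration performed when verification fails (adjusting the domain problem, matching a new target agent, and repeating the analysis) might be sketched as follows. The agent functions, the `verified` flag, and the `next_domain` hint are hypothetical simplifications of the verification and matching logic described above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Result = Dict[str, object]


def container_agent(problem: str) -> Result:
    return {"finding": "container killed by host",
            "verified": False, "next_domain": "operating_system"}


def os_agent(problem: str) -> Result:
    return {"finding": "container engine on host is abnormal",
            "verified": True, "next_domain": None}


AGENTS: Dict[str, Callable[[str], Result]] = {
    "container": container_agent,
    "operating_system": os_agent,
}


def verify(result: Result) -> bool:
    """Stand-in verification: trust the flag set by the agent."""
    return bool(result["verified"])


def iterate(domain: str, problem: str, max_rounds: int = 3
            ) -> Tuple[Optional[Result], List[Tuple[str, str]]]:
    history: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        result = AGENTS[domain](problem)
        history.append((domain, str(result["finding"])))
        if verify(result):
            return result, history
        if result["next_domain"] is None:
            break
        # Adjust the problem with the new context, then match a new agent.
        problem = f"{problem} | known: {result['finding']}"
        domain = str(result["next_domain"])
    return None, history


result, history = iterate("container", "container frequently restarts")
print(result, history)
```

The `max_rounds` bound is an addition of this sketch; it keeps the loop from running indefinitely when no agent can produce a result that passes verification.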

[0147] According to an embodiment of this application, the device 1000 further includes: an update training module, configured to extract diagnostic data during the diagnostic analysis process and store the diagnostic data in a diagnostic case knowledge base; obtain user feedback information on the fault location report; and update and train the agents in the diagnostic agent group based on the diagnostic case knowledge base and the user feedback information.
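As a non-limiting illustration, the diagnostic case knowledge base and the use of user feedback might be sketched as follows. The class `CaseKnowledgeBase` and its methods are hypothetical, and the sketch stops at selecting confirmed cases; how those cases are then used to update or train an agent is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Case:
    case_id: int
    description: str
    root_cause: str
    feedback: Optional[bool] = None  # None until the user responds


class CaseKnowledgeBase:
    def __init__(self) -> None:
        self._cases: Dict[int, Case] = {}

    def store(self, description: str, root_cause: str) -> int:
        """Save diagnostic data extracted from a finished diagnosis."""
        case_id = len(self._cases) + 1
        self._cases[case_id] = Case(case_id, description, root_cause)
        return case_id

    def record_feedback(self, case_id: int, correct: bool) -> None:
        """Attach the user's feedback on the fault location report."""
        self._cases[case_id].feedback = correct

    def training_examples(self) -> List[Case]:
        """Only cases the user confirmed are kept as training candidates."""
        return [c for c in self._cases.values() if c.feedback is True]


kb = CaseKnowledgeBase()
first = kb.store("container frequently restarts", "host container engine abnormality")
second = kb.store("service slowed down", "database lock contention")
kb.record_feedback(first, True)
kb.record_feedback(second, False)
print([c.case_id for c in kb.training_examples()])
```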

[0148] According to embodiments of this application, any multiple modules among the intent recognition module 1010, target determination module 1020, diagnostic analysis module 1030, verification module 1040, report generation module 1050, and update training module can be merged into one module, or any one of these modules can be split into multiple modules. Alternatively, at least some of the functions of one or more of these modules can be combined with at least some of the functions of other modules and implemented in one module. According to embodiments of this application, at least one of the intent recognition module 1010, target determination module 1020, diagnostic analysis module 1030, verification module 1040, report generation module 1050, and update training module can be at least partially implemented as hardware circuitry, such as field-programmable gate arrays, programmable logic arrays, systems-on-a-chip, systems-on-a-substrate, systems-on-package, application-specific integrated circuits, or any other reasonable means of integrating or packaging circuitry, or implemented in software, hardware, or firmware, or in any suitable combination of any of these three implementation methods. Alternatively, at least one of the intent recognition module 1010, target determination module 1020, diagnostic analysis module 1030, verification module 1040, report generation module 1050, and update training module can be implemented at least partially as a computer program module, which can perform corresponding functions when the computer program module is run.

[0149] Figure 8 schematically illustrates a block diagram of an electronic device suitable for implementing a fault location method according to an embodiment of this application.

[0150] As shown in Figure 8, an electronic device 1200 according to an embodiment of this application includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory 1202 or a program loaded from a storage section 1208 into a random access memory 1203. The processor 1201 may include, for example, a general-purpose microprocessor, an instruction set processor and/or an associated chipset, and/or a dedicated microprocessor. The processor 1201 may also include onboard memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for executing different steps of the method flow according to an embodiment of this application.

[0151] Random access memory 1203 stores various programs and data required for the operation of electronic device 1200. Processor 1201, read-only memory 1202, and random access memory 1203 are interconnected via bus 1204. Processor 1201 executes various steps of the method flow according to embodiments of this application by executing programs in read-only memory 1202 and / or random access memory 1203. It should be noted that the programs may also be stored in one or more memories other than read-only memory 1202 and random access memory 1203. Processor 1201 may also execute various steps of the method flow according to embodiments of this application by executing programs stored in said one or more memories.

[0152] According to embodiments of this application, the electronic device 1200 may further include an input / output interface 1205, which is also connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the input / output interface 1205: an input section 1206 including a keyboard, mouse, etc.; an output section 1207 including a cathode ray tube, liquid crystal display, etc., and a speaker, etc.; a storage section 1208 including a hard disk, etc.; and a communication section 1209 including a network interface card, such as a local area network card, modem, etc. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the input / output interface 1205 as needed. A removable medium 1211, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 1210 as needed so that computer programs read from it can be installed into the storage section 1208 as needed.

[0153] Embodiments of this application also provide a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist independently without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs, which, when executed, implement the method according to the embodiments of this application.

[0154] According to embodiments of this application, the computer-readable storage medium can be a non-volatile computer-readable storage medium, such as including but not limited to: portable computer disks, hard disks, random access memory, read-only memory, erasable programmable read-only memory, portable compact disk read-only memory, optical storage devices, magnetic storage devices, or any suitable combination thereof. In embodiments of this application, the computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this application, the computer-readable storage medium may include the read-only memory 1202, and / or random access memory 1203, and / or one or more memories other than read-only memory 1202 and random access memory 1203 described above.

[0155] Embodiments of this application also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code is used to cause the computer system to implement the methods provided in the embodiments of this application.

[0156] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and may be downloaded and installed via the communication section 1209, and / or installed from the removable medium 1211. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.

[0157] In embodiments of this application, the computer program can be downloaded and installed from a network via communication section 1209, and / or installed from removable medium 1211. When the computer program is executed by processor 1201, it performs the functions defined in the system of this application embodiment. According to embodiments of this application, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.

[0158] According to embodiments of this application, program code for executing the computer programs provided in the embodiments of this application can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The program code can be executed entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0159] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0160] Those skilled in the art will understand that the features described in the various embodiments of this application can be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly described in this application. In particular, the features described in the various embodiments of this application can be combined and/or integrated in various ways without departing from the spirit and teachings of this application. All such combinations and/or integrations fall within the scope of this application.

Claims

1. A fault location method, characterized in that the method includes: receiving a fault description statement input by a user, and invoking an intent recognition agent to analyze the fault description statement in order to determine a domain problem; matching a target agent in a diagnostic agent group based on the domain problem; invoking the target agent to perform diagnostic analysis on the domain problem and generate a current diagnostic result; verifying the current diagnostic result, and if the verification fails, continuing to iterate the diagnostic analysis process until the current diagnostic result passes verification; and generating a fault location report based on the verified diagnostic result.

2. The method according to claim 1, characterized in that invoking the target agent to perform diagnostic analysis on the domain problem and generate the current diagnostic result includes: decomposing the domain problem into sub-problems, and distributing the sub-problems to corresponding target agents based on a multi-agent collaboration mechanism; based on the multi-agent collaboration mechanism, synchronizing the state of the control flow and data flow of the target agents and performing collaborative interaction, and invoking the target agents to analyze the sub-problems to generate single-step diagnostic summaries corresponding to the target agents; and integrating the single-step diagnostic summaries corresponding to the target agents to generate the current diagnostic result.

3. The method according to claim 2, characterized in that invoking the target agent to analyze the sub-problem and generating a single-step diagnostic summary corresponding to the target agent includes: invoking the target agent to perform contextual reasoning on the sub-problem to obtain a diagnostic plan; invoking and executing security tools based on the diagnostic plan to obtain system operation data; standardizing the system operation data to obtain system operation results; and invoking the target agent to perform reflective reasoning analysis on the system operation results to generate the single-step diagnostic summary corresponding to the target agent.

4. The method according to claim 3, characterized in that invoking and executing security tools based on the diagnostic plan to obtain system operation data includes: invoking the security tools according to the processing order of the diagnostic plan; and acquiring, through the security tools and based on a standard application programming interface, multi-source observation data of the cloud-native environment to obtain the system operation data.

5. The method according to claim 3, characterized in that invoking the target agent to perform contextual reasoning on the sub-problem to obtain a diagnostic plan includes: invoking the target agent, and performing, through a plan generation model of the target agent, contextual reasoning on the sub-problem based on preset prompts to obtain the diagnostic plan; wherein the plan generation model is a large language model trained based on historical expert diagnostic cases.

6. The method according to claim 1, characterized in that iterating the diagnostic analysis process includes: adjusting the domain problem based on the current diagnostic result and the updated problem context; matching a new target agent based on the adjusted domain problem; and invoking the new target agent to perform diagnostic analysis on the adjusted domain problem.

7. The method according to claim 1, characterized in that the method further includes: extracting diagnostic data from the diagnostic analysis process and storing the diagnostic data in a diagnostic case knowledge base; obtaining user feedback information on the fault location report; and updating and training the agents in the diagnostic agent group based on the diagnostic case knowledge base and the user feedback information.

8. A fault location device, characterized in that the device includes: an intent recognition module, used to receive a fault description statement input by a user, and to invoke an intent recognition agent to analyze the fault description statement in order to determine a domain problem; a target determination module, used to match a target agent in a diagnostic agent group based on the domain problem; a diagnostic analysis module, used to invoke the target agent to perform diagnostic analysis on the domain problem and generate a current diagnostic result; a verification module, used to verify the current diagnostic result, and if the verification fails, to continue iterating the diagnostic analysis process until the current diagnostic result passes verification; and a report generation module, used to generate a fault location report based on the verified diagnostic result.

9. An electronic device, comprising: one or more processors; and a memory, used to store one or more computer programs; characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program or instructions stored thereon, characterized in that, when the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 7.

11. A computer program product, comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 7.