A method and system for fault monitoring and troubleshooting based on a digital employee network

By integrating the large model with the RAG dynamic empowerment algorithm, a two-way data interaction mechanism and strategy optimization algorithm are constructed, which solves the problems of cumbersome processes and lagging knowledge base in troubleshooting, and realizes efficient and accurate fault analysis and intelligent closed-loop operation and maintenance.

CN122204643A · Pending · Publication Date: 2026-06-12 · CHINA UNITECHS

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHINA UNITECHS
Filing Date: 2026-04-01
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing troubleshooting technologies suffer from cumbersome processes and from knowledge bases that are outdated and slow to update, resulting in low troubleshooting efficiency. Furthermore, large models are disconnected from the equipment layer, which prevents efficient and accurate fault analysis.

Method used

By integrating large models with the RAG dynamic empowerment algorithm, a two-way data interaction mechanism and strategy optimization algorithm are constructed. Combined with a static basic library and a dynamic incremental library, dynamic matching of fault scenarios and knowledge updates are achieved, thus opening up the data flow between decision-making, execution and knowledge.

Benefits of technology

The method improves the efficiency and accuracy of troubleshooting, shortens troubleshooting time, reduces the frequency of manual intervention, and moves operation and maintenance toward an intelligent closed loop.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a method and system for fault monitoring and troubleshooting based on a digital employee network. The method comprises fault access and intent parsing, prompt word matching and task decomposition, real-time execution and data feedback, dynamic decision-making and strategy output, and knowledge closed-loop updating. By fusing a bidirectional data interaction mechanism, a strategy optimization algorithm and a RAG dynamic empowerment algorithm, the method and system address several problems: large models are disconnected from the device layer; the MCP protocol serves only as a single data transmission channel, lacks deep linkage with intelligent decision-making, and cannot dynamically select interfaces and execution logic according to the fault scenario; and traditional RAG knowledge bases are updated late and return results that match the fault scenario poorly, so they cannot support accurate analysis of complex faults. The method and system open up the "decision-execution-knowledge" data flow, remedy the coordination and adaptability defects of general solutions, improve troubleshooting efficiency and accuracy, shorten troubleshooting time, improve the accuracy of root cause analysis, keep the knowledge base up to date, and reduce the frequency of manual intervention.

Description

Technical Field

[0001] This invention relates to the field of troubleshooting technology, and in particular to a method and system for fault monitoring and troubleshooting based on a digital employee network.

Background Technology

[0002] Existing troubleshooting technologies suffer from two significant drawbacks: First, when equipment triggers a fault alarm, the troubleshooting process is cumbersome and redundant, requiring manual execution of commands to locate the root cause of the fault, resulting in low efficiency due to the lack of dynamic strategy optimization mechanisms. Second, the supporting knowledge base suffers from outdated content, delayed updates, and limited coverage, and knowledge acquisition is highly dependent on human intervention, making it difficult to quickly adapt to the iteration of the current network architecture and the evolution of fault diagnosis approaches. This traditional model is not only time-consuming and labor-intensive but also prolongs the duration of faults, causing dual losses to production business continuity and human resource allocation.

[0003] The general-purpose large model, MCP, and RAG technologies have significant shortcomings in adaptability to network fault diagnosis scenarios: While the large model has strong semantic understanding capabilities, it lacks in-depth professional knowledge in the operation and maintenance field and is prone to outputting generalized conclusions; MCP can only achieve basic real-time device interaction and lacks the ability to decompose and schedule fault-based tasks in various scenarios; RAG generally suffers from problems such as knowledge matching relying on text similarity, lagging updates, and insufficient adaptability. None of the three technologies can meet the needs of efficient troubleshooting when applied alone.

Summary of the Invention

[0004] General-purpose large models are disconnected from the device layer. The MCP protocol often serves as a single data transmission channel without deep integration with intelligent decision-making, and cannot dynamically select interfaces and execution logic based on the fault scenario. Traditional RAG knowledge bases are updated late, and their search results match fault scenarios poorly, which makes accurate analysis of complex faults difficult. To address these problems, this invention provides a method and system for fault monitoring and troubleshooting based on a digital employee network. Through the integration of a bidirectional data interaction mechanism, a strategy optimization algorithm, and a RAG dynamic empowerment algorithm, it opens up the "decision-execution-knowledge" data flow and improves troubleshooting efficiency and accuracy. This achieves the technical effects of shortening troubleshooting time, improving the accuracy of root cause analysis, updating the knowledge base in a timely manner, and reducing the frequency of manual intervention.

[0005] To achieve the above objectives, the present invention adopts the following technical solution:

[0006] In one embodiment of the present invention, a method for troubleshooting and monitoring faults in a digital employee network is proposed, the method comprising:

[0007] After a device alarm is triggered, the large model extracts the device IP, port, and alarm type from the natural language alarm. Using the RAG dynamic empowerment algorithm, it retrieves the corresponding historical troubleshooting cases in the RAG static library and the corresponding latest troubleshooting cases in the RAG dynamic library based on the alarm type to initially match the fault type. At the same time, based on the retrieved data of similar scenarios, it initializes the decision weights of the strategy optimization algorithm and finally generates a structured task instruction that conforms to the two-way data interaction mechanism.

[0008] Using the RAG dynamic empowerment algorithm, based on the alarm type scenario, the RAG static library and the RAG dynamic library are retrieved, and the corresponding prompt word information is matched; the large model combines the prompt word information and, based on the initial weight of the strategy optimization algorithm, decomposes the troubleshooting task into multiple sub-tasks, generates a fault strategy tree, and the format of the sub-tasks conforms to the instruction specification of the two-way data interaction mechanism.

[0009] Upon receiving the structured task instructions, the system selects the optimal device interface to log in to the device according to the device interface adaptation rules, and executes the instructions corresponding to multiple sub-tasks in sequence. After execution, the system encapsulates the results in a standardized format, embeds data verification bits, and sends them back to the large model.

[0010] The large model combines the retrieved historical troubleshooting cases with the returned execution results to drive the strategy optimization algorithm to dynamically adjust the priority and execution order of the subtasks, ultimately determining the root cause, generating the adjusted troubleshooting strategy and converting it into standardized instructions, which are then pushed to the operations and maintenance personnel to execute the instructions and provide feedback on the verification results.

[0011] The structured text information of this fault is automatically extracted and pushed to the RAG dynamic library, while triggering the Dify platform to complete the text slicing and vector index update; the adjustment logic and execution data of the strategy optimization algorithm are synchronized to the RAG dynamic library.

[0012] Furthermore, the method also includes:

[0013] If a new device model or an unrecorded fault type is involved, the Dify platform will supplement the text details and trigger a secondary synchronization to the RAG dynamic library, while simultaneously updating the scene weight training data of the strategy optimization algorithm.

[0014] Furthermore, the large model obtains the search popularity and matching accuracy of newly added cases through the search feedback interface of the Dify platform, and optimizes the decision weights of its internal self-learning module in reverse.

[0015] In one embodiment of the present invention, a system for fault monitoring and troubleshooting based on a digital employee network is also proposed, the system comprising:

[0016] The fault access and intent parsing module is used to extract the device IP, port, and alarm type from the natural language alarm after a device alarm is triggered. Using the RAG dynamic empowerment algorithm, it retrieves the corresponding historical troubleshooting cases in the RAG static library and the corresponding latest troubleshooting cases in the RAG dynamic library according to the alarm type to initially match the fault type. At the same time, based on the retrieved data of similar scenarios, it initializes the decision weights of the strategy optimization algorithm and finally generates a structured task instruction that conforms to the two-way data interaction mechanism.

[0017] The prompt word matching and task decomposition module is used to utilize the RAG dynamic empowerment algorithm to retrieve the RAG static library and the RAG dynamic library according to the alarm type scenario, and match the corresponding prompt word information; the large model combines the prompt word information and, based on the initial weights of the strategy optimization algorithm, decomposes the troubleshooting task into multiple sub-tasks to generate a fault strategy tree, and the format of the sub-tasks conforms to the instruction specification of the two-way data interaction mechanism;

[0018] The real-time execution and data feedback module is used to receive the structured task instructions, select the optimal device interface to log in to the device according to the device interface adaptation rules, and execute the instructions corresponding to multiple sub-tasks in sequence; after the execution is completed, the results are packaged in a standardized format, embedded with data verification bits, and sent back to the large model.

[0019] The dynamic decision-making and strategy output module is used by the large model to combine the retrieved historical troubleshooting cases and the returned execution results to drive the strategy optimization algorithm to dynamically adjust the priority and execution order of the sub-tasks, ultimately determine the root cause, generate the adjusted troubleshooting strategy and convert it into standardized instructions, push it to the operation and maintenance personnel to execute the instructions and provide feedback on the verification results.

[0020] The knowledge closed-loop update module is used to automatically organize the standardized text information of this fault and push it to the RAG dynamic library, while triggering the Dify platform to complete the text slicing and vector index update; it also synchronizes the adjustment logic and effect data of the strategy optimization algorithm to the RAG dynamic library; if a new device model or an unrecorded fault type is involved, the Dify platform is used to supplement the text details and trigger a second synchronization to the RAG dynamic library, while simultaneously updating the scene weight training data of the strategy optimization algorithm.

[0021] Furthermore, the RAG static library is used to store basic text data, including historical troubleshooting cases, and is initialized through batch text upload and related import supported by the Dify platform; by setting a cron scheduled task, the updated basic text data is automatically pulled and the Dify platform is triggered to complete text re-slicing and vector index update;

[0022] The RAG dynamic library is used to store dynamic data, including the latest device information and the latest troubleshooting cases, and is synchronized through the API interface opened by the Dify platform. The API interface is called based on the status after the troubleshooting process is completed, and the latest troubleshooting cases are written into the RAG dynamic library. For new device models or alarm types not stored in the RAG static library, the RAG dynamic library receives the latest troubleshooting cases in real time through the API interface and completes the knowledge base update within a predetermined time.

[0023] In one embodiment of the present invention, a computer device is also proposed, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned method for fault monitoring and troubleshooting based on a digital employee network.

[0024] In one embodiment of the present invention, a computer-readable storage medium is also provided, which stores a computer program that performs the aforementioned method for fault monitoring and troubleshooting based on a digital employee network.

[0025] Beneficial effects:

[0026] 1. Core Architecture Integration and Innovation: This invention deeply integrates large-scale model semantic understanding, MCP real-time interaction and enhanced RAG technology to build a collaborative architecture of "large-scale model-led decision-making - MCP precise execution - RAG dynamic empowerment", which fundamentally connects the data flow of "decision-execution-knowledge" and lays the foundation for full-process automation.

[0027] 2. Innovative Customized Two-Way Data Interaction Mechanism: This invention adopts a dual-track design of "standardized instructions + structured feedback" and is equipped with a data verification mechanism to achieve precise linkage between large model decision-making and MCP execution, thus completely solving the core pain point of "disconnect between decision-making and execution" in general solutions.

[0028] 3. Innovation in dual-dimensional strategy optimization algorithm: This invention constructs an optimization model of "scene weight + real-time data correction" and dynamically adjusts the priority of troubleshooting sub-tasks through a gradient descent algorithm, which solves the problem of "rigid troubleshooting strategy" in traditional solutions and improves strategy adaptability and execution efficiency.

[0029] 4. RAG Dynamic Empowerment Algorithm Innovation: This invention relies on a dual-library architecture of "static basic library + dynamic incremental library" and integrates a scene weight retrieval algorithm to achieve "static accumulation + dynamic update" of knowledge, solving the defects of general RAG knowledge lag and fuzzy matching, and improving the matching degree.

[0030] 5. Innovative Two-Way Linkage Between Dify and MCP Plugins: This invention introduces a dedicated Dify MCP plugin (including instruction parsing, interface matching, and data encapsulation modules, implemented based on the Dify plugin development framework). Within 3 seconds of the MCP execution result being returned, data push and index update are completed. Two-way communication is achieved through visual configuration, ensuring the dynamic updating of data sources for the RAG knowledge base and connecting the entire "decision-execution-knowledge" chain to avoid data silos.

[0031] 6. Advantages of low-code adaptation and multi-model compatibility: This invention is built on a low-code platform architecture, allowing for process orchestration by dragging and dropping nodes, and enabling rapid adjustments to adapt to complex network environments; it supports multi-model embedding, and natural language interaction lowers the operational threshold and improves implementation efficiency.

[0032] 7. Full-process closed-loop optimization capability: This invention realizes an automated closed loop of "fault handling - text accumulation - knowledge base update - model self-learning", achieving the effect of "one troubleshooting, one capability improvement", which shortens the troubleshooting time compared with traditional solutions and promotes the upgrade of operation and maintenance mode to "intelligent closed loop".

Attached Figure Description

[0033] Figure 1 is a schematic diagram of the method for fault monitoring and troubleshooting based on a digital employee network according to the present invention.

[0034] Figure 2 is a schematic diagram of the structure of the system for fault monitoring and troubleshooting based on a digital employee network according to the present invention.

[0035] Figure 3 is a schematic diagram of the computer device structure of the present invention.

Detailed Implementation

[0036] The principles and spirit of the present invention will now be described with reference to several exemplary embodiments. It should be understood that these embodiments are provided merely to enable those skilled in the art to better understand and implement the present invention, and are not intended to limit the scope of the present invention in any way. Rather, these embodiments are provided to make this disclosure more thorough and complete, and to fully convey the scope of this disclosure to those skilled in the art.

[0037] Those skilled in the art will recognize that embodiments of the present invention can be implemented as a system, device, method, or computer program product. Therefore, this disclosure can be specifically implemented in the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.

[0038] According to an embodiment of the present invention, a method for troubleshooting and monitoring faults based on a digital employee network is proposed. Through the innovative integration of a two-way data interaction mechanism, a strategy optimization algorithm, and a RAG dynamic empowerment algorithm, this method addresses the problems of disconnect between the general large model and the device layer, the MCP protocol often serving as a single data transmission channel without deep integration with intelligent decision-making, the inability to dynamically select interfaces and execution logic based on fault scenarios, the lagging updates of the traditional RAG knowledge base, and the low matching degree between retrieval results and fault scenarios, making it difficult to support accurate analysis of complex faults. This method streamlines the data flow of "decision-execution-knowledge," solves the defects in the synergy and adaptability of general solutions, improves troubleshooting efficiency and accuracy, and ultimately achieves the technical effects of shortening troubleshooting time, improving the accuracy of root cause analysis, timely updating of the knowledge base, and reducing the frequency of manual intervention.

[0039] The principles and spirit of the present invention will be explained in detail below with reference to several representative embodiments.

[0040] Figure 1 is a schematic diagram of the method for fault monitoring and troubleshooting based on a digital employee network according to the present invention. As shown in Figure 1, the method achieves its core objective through a "three-layer collaboration + two-way feedback" architecture, with the specific logic as follows:

[0041] 1. Core technology integration design

[0042] (1) Intelligent upgrade of MCP protocol: Develop an intelligent scheduling submodule of MCP so that it can receive structured task instructions of large models, autonomously match the optimal device interface according to the "resource requirements" (such as log query and parameter adjustment) in the instructions, and support parallel calls of multiple interfaces. At the same time, it will standardize the execution results and send them back to the large model in real time, thus solving the limitation of the traditional MCP "passive transmission".
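The interface matching and parallel calls described in paragraph [0042] can be sketched as follows. This is a minimal illustration only: the rule table `ADAPTATION_RULES`, the interface names, and the functions `select_interface`, `run_call` and `dispatch` are hypothetical and do not come from the patent, and `run_call` is a stand-in that does not contact any device.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical adaptation rules: resource requirement -> interfaces in preference order.
ADAPTATION_RULES = {
    "log_query": ["ssh", "snmp"],
    "parameter_adjustment": ["netconf", "ssh"],
}


def select_interface(requirement, available):
    """Return the first preferred interface that the device actually exposes."""
    for name in ADAPTATION_RULES.get(requirement, []):
        if name in available:
            return name
    raise LookupError(f"no interface for requirement {requirement!r}")


def run_call(requirement, available):
    """Stand-in for a real device call; it only reports the chosen interface."""
    interface = select_interface(requirement, available)
    return {"requirement": requirement, "interface": interface, "status": "success"}


def dispatch(requirements, available):
    """Run one call per requirement in parallel, returning results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda r: run_call(r, available), requirements))


results = dispatch(["log_query", "parameter_adjustment"], {"ssh", "snmp"})
for item in results:
    print(item)
```

In this sketch a device that exposes only SSH and SNMP falls back to SSH for parameter adjustment, because the preferred interface in the assumed rule table is unavailable.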

[0043] (2) Enhanced RAG Knowledge Base Construction: A dual-library architecture of "static basic library + dynamic incremental library" is adopted, relying on the Dify platform for text upload and synchronization management. The static library stores basic text data such as historical troubleshooting cases and is initialized through the batch text upload and association import supported by the Dify platform. The dynamic library holds dynamic data such as the latest equipment information and the latest troubleshooting cases and is synchronized through the API interface opened by the Dify platform. The core synchronization mechanism is designed as follows. First, for static text, a cron scheduled task is set with the cron expression "0 2 * * *"; the synchronization script is written in Python and calls the batch update interface through the Dify SDK (Dify Software Development Kit). The updated basic text data is automatically pulled at 2 am every day, and the text re-slicing and vector index update of the Dify platform are triggered to keep static knowledge current. Second, for dynamic text, the API interface is triggered by the status mark set when the troubleshooting process ends: if the mark is "completed", the API interface is called to write the latest troubleshooting cases into the RAG dynamic library in real time. In addition, a "scene weight retrieval algorithm" is introduced and combined with Dify vector retrieval capabilities to improve the match between retrieval results and the current fault along dimensions such as alarm type and device model. This addresses the "outdated knowledge and inaccurate matching" of traditional RAG and the synchronization lag after text upload. For new device fault scenarios (not included in the static library), the RAG dynamic library receives fault data in real time via API, completes knowledge updates within 10 minutes, and generates adaptation solutions by combining the large model with the dynamic library information.
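The patent does not give a formula for the "scene weight retrieval algorithm". One plausible reading, shown here as a sketch under stated assumptions, is to blend vector similarity with bonuses for a matching alarm type and device model. The weights `w_alarm` and `w_model`, the two-dimensional vectors, and the case records are all invented for illustration, and the similarity is computed locally rather than through Dify.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scene_score(query, case, w_alarm=0.3, w_model=0.2):
    """Vector similarity plus bonuses for matching alarm type and device model.

    The weights are assumed values, not taken from the patent.
    """
    score = (1.0 - w_alarm - w_model) * cosine(query["vector"], case["vector"])
    if query["alarm_type"] == case["alarm_type"]:
        score += w_alarm
    if query["device_model"] == case["device_model"]:
        score += w_model
    return score


def retrieve(query, cases, top_k=1):
    """Return the top_k cases ranked by scene score, highest first."""
    return sorted(cases, key=lambda c: scene_score(query, c), reverse=True)[:top_k]


# Hypothetical records from the static and dynamic libraries.
cases = [
    {"id": "static-001", "vector": [0.9, 0.1],
     "alarm_type": "link flap", "device_model": "model-A"},
    {"id": "dynamic-007", "vector": [0.7, 0.3],
     "alarm_type": "port DOWN", "device_model": "model-A"},
]
query = {"vector": [0.9, 0.1], "alarm_type": "port DOWN", "device_model": "model-A"}
best = retrieve(query, cases)[0]
print(best["id"])
```

The static case is textually identical to the query vector, yet the dynamic case ranks first because its alarm type matches, which is the behavior the paragraph attributes to scene weighting over plain text similarity.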

[0044] 2. Implementation of closed-loop business processes

[0045] (1) Fault access and intent parsing: After a device alarm is triggered, the large model extracts core information such as device IP (10.10.0.1), port (GigabitEthernet0/0/0) and alarm type (port DOWN) from the natural language alarm (such as "diagnose port DOWN fault of GigabitEthernet0/0/0 port of device 10.10.0.1"). Using the RAG dynamic empowerment algorithm, the model retrieves the corresponding historical troubleshooting cases in the RAG static library and the corresponding latest troubleshooting cases in the RAG dynamic library according to the alarm type to initially match the fault type. At the same time, based on the retrieved data of similar scenarios, the model initializes the decision weight of the strategy optimization algorithm and finally generates a structured task instruction that conforms to the two-way data interaction mechanism.
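To make the expected output of intent parsing concrete, the sketch below extracts the same three fields from the example alarm. It deliberately swaps the large model for regular expressions, so it illustrates only the shape of the result, not how the patent performs extraction. The function `parse_alarm`, the `ALARM_TYPES` list, and the output keys are assumptions.

```python
import re

# Hypothetical list of alarm types the parser recognizes.
ALARM_TYPES = ("port DOWN",)


def parse_alarm(text):
    """Extract device IP, port and alarm type from a natural language alarm."""
    ip = re.search(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)
    # Matches names such as GigabitEthernet0/0/0, tolerating spaces around slashes.
    port = re.search(r"\b[A-Za-z]*Ethernet\d+(?:\s*/\s*\d+)*", text)
    alarm = next((a for a in ALARM_TYPES if a.lower() in text.lower()), None)
    return {
        "device_ip": ip.group(0) if ip else None,
        "port": re.sub(r"\s+", "", port.group(0)) if port else None,
        "alarm_type": alarm,
    }


alarm = parse_alarm(
    "diagnose port DOWN fault of GigabitEthernet0/0/0 port of device 10.10.0.1"
)
print(alarm)
```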

[0046] (2) Prompt word matching and task decomposition: Using the RAG dynamic empowerment algorithm, according to the alarm type scenario, the RAG static library and the RAG dynamic library are retrieved to match the corresponding prompt word information; the large model combines the prompt word information and, based on the initial weight of the strategy optimization algorithm, decomposes the troubleshooting task into three sub-tasks: "query port status → detect transmit and receive power → verify configuration items", and generates a fault strategy tree. The format of the sub-tasks conforms to the instruction specifications of the two-way data interaction mechanism (including fields such as device ID, operation type, parameter requirements and timeout threshold).
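The sub-task format named above (device ID, operation type, parameter requirements, timeout threshold) might be modeled as a small record type. This is a sketch, not the patent's schema: the class `SubTask`, its field names, the validation rules, and the 30-second timeout (borrowed from the worked embodiment in paragraph [0053]) are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SubTask:
    """One node of the fault strategy tree, using the four fields named in the text."""
    device_id: str
    operation_type: str
    parameter_requirements: str
    timeout_threshold_s: int

    def validate(self):
        """Reject sub-tasks that could not be executed (assumed rules)."""
        if not self.device_id or not self.operation_type:
            raise ValueError("device_id and operation_type are required")
        if self.timeout_threshold_s <= 0:
            raise ValueError("timeout threshold must be positive")
        return self


steps = [
    "query port status",
    "detect transmit and receive power",
    "verify configuration items",
]
# The strategy is kept as an ordered list; list order is the execution order.
strategy = [
    SubTask("10.10.0.1", step, "GigabitEthernet0/0/0", 30).validate()
    for step in steps
]
payload = json.dumps([asdict(task) for task in strategy])
print(payload)
```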

[0047] (3) MCP real-time execution and data feedback: The MCP intelligent scheduling submodule receives the structured task instructions, selects the optimal device SSH interface to log in to the device according to the device interface adaptation rules, and executes the instructions corresponding to multiple subtasks in sequence to query parameters such as port status, optical power and bit error rate. After the execution is completed, the results are packaged in a standardized format, embedded with data verification bits (the output results of the node) and sent back to the large model to provide real-time correction data for the strategy optimization algorithm, and the execution results are synchronized to the RAG dynamic library for subsequent archiving.
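The patent does not specify how the data verification bits are computed. As one possible illustration, the sketch below attaches a SHA-256 digest to the serialized result so the receiver can detect altered payloads. The digest choice, the envelope layout, and the function names `encapsulate` and `verify` are assumptions, not the patent's mechanism.

```python
import hashlib
import json


def encapsulate(result):
    """Serialize a result deterministically and attach a SHA-256 checksum."""
    body = json.dumps(result, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"body": body, "checksum": digest}


def verify(envelope):
    """Recompute the checksum and compare it with the one that was sent."""
    expected = hashlib.sha256(envelope["body"].encode("utf-8")).hexdigest()
    return expected == envelope["checksum"]


envelope = encapsulate({"device_ip": "10.10.0.1", "execution_status": "success"})
# Simulate corruption in transit by changing the body but not the checksum.
tampered = dict(envelope, body=envelope["body"].replace("success", "failure"))
print(verify(envelope), verify(tampered))
```

Sorting the keys before hashing keeps the checksum stable regardless of the order in which fields were collected.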

[0048] (4) Dynamic decision-making and strategy output: The large model combines the retrieved historical troubleshooting cases and the returned execution results to drive the strategy optimization algorithm to dynamically adjust the priority and execution order of the subtasks through the gradient descent algorithm (learning rate η=0.01, number of iterations ≤50, convergence threshold ε=0.001). The subtask order is optimized to "detect transmit and receive power → query port status → verify configuration parameters". Finally, the root cause is determined to be "received optical power is too high". The adjusted troubleshooting strategy is generated, which includes fault delimitation, root cause analysis and operation steps. It is converted into standardized instructions and pushed to the operation and maintenance personnel, who execute the instructions and provide feedback on the verification results.
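The patent states the hyperparameters (η=0.01, at most 50 iterations, ε=0.001) but not the objective being minimized. The sketch below assumes a simple squared-error loss that pulls each sub-task weight toward an evidence score derived from the execution results. The loss, the evidence scores in `targets`, and the function `optimize_weights` are assumptions; the initial weights are taken from the worked embodiment in paragraph [0053].

```python
def optimize_weights(weights, targets, lr=0.01, max_iter=50, eps=0.001):
    """Gradient descent on 0.5 * sum((w - target)^2), an assumed objective.

    Stops when the loss improves by less than eps or after max_iter steps.
    Returns the adjusted weights, the sub-task order, and the steps taken.
    """
    def loss(w):
        return 0.5 * sum((w[k] - targets[k]) ** 2 for k in w)

    w = dict(weights)
    previous = loss(w)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # The gradient of the assumed loss with respect to w[k] is (w[k] - target[k]).
        w = {k: v - lr * (v - targets[k]) for k, v in w.items()}
        current = loss(w)
        if abs(previous - current) < eps:
            break
        previous = current
    order = sorted(w, key=w.get, reverse=True)
    return w, order, iterations


initial = {
    "query port status": 0.6,
    "detect transmit and receive power": 0.5,
    "verify configuration parameters": 0.4,
}
# Hypothetical evidence scores: the returned data points at optical power.
targets = {
    "query port status": 0.2,
    "detect transmit and receive power": 1.0,
    "verify configuration parameters": 0.1,
}
adjusted, order, steps = optimize_weights(initial, targets)
print(order, steps)
```

With a learning rate this small the weights move slowly, so the sketch is meant to show how the ordering changes rather than to reproduce the specific weight values reported in the embodiment.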

[0049] (5) Knowledge Closed-Loop Update: After troubleshooting, the full-process text synchronization mechanism is activated. First, the structured text information of the fault "device + port + high optical power" is automatically extracted, including device name, alarm type, fault description, diagnostic command, execution result, root cause analysis, processing steps and verification results, and is standardized (redundant logs are removed and the terminology format is unified). Second, the standardized text is pushed to the RAG dynamic library in real time through the pre-configured API interface of the Dify platform. The API interface call triggers the Dify platform to complete the text slicing and vector index update, providing new data for subsequent retrieval by the RAG dynamic empowerment algorithm and ensuring that new knowledge can be retrieved immediately. At the same time, the adjustment logic and execution result of the current strategy optimization algorithm are synchronously archived to the RAG dynamic library to optimize the algorithm model. If new device models or unrecorded fault types are involved, maintenance personnel supplement text details through the Dify platform, triggering a secondary synchronization to the RAG dynamic library. This forms a dual guarantee of "automatic synchronization + manual verification" and simultaneously updates the scene weight training data of the strategy optimization algorithm. Third, if basic text data from the static base library is involved, maintenance personnel upload text or update associated data sources through the Dify platform, automatically triggering MCP collaborative verification. Upon successful verification, the data is synchronized to the static base library, updating the training data of the scene weight retrieval algorithm. Fourth, the large model obtains the retrieval popularity and matching accuracy of newly added cases through the Dify platform's retrieval feedback interface and uses them to adjust the decision weights of its internal self-learning module. The text synchronization stage itself requires no manual intervention, achieving a closed-loop automation of "fault handling - text accumulation - knowledge base update - capability upgrade."
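The extraction and standardization step in paragraph [0049] can be sketched as below. The eight field names follow the list in the text; everything else is assumed, including the terminology table, the rule that lines starting with "DEBUG" count as redundant logs, and the `push` callable, which stands in for the Dify API call and is not a real Dify function.

```python
import re

# Hypothetical terminology table used to unify wording.
TERMINOLOGY = {"port down": "port DOWN", "rx power": "received optical power"}

FIELDS = (
    "device_name", "alarm_type", "fault_description", "diagnostic_command",
    "execution_result", "root_cause_analysis", "processing_steps",
    "verification_results",
)


def standardize(text):
    """Drop assumed debug lines, collapse whitespace, and unify terminology."""
    lines = [ln for ln in text.splitlines()
             if not ln.strip().upper().startswith("DEBUG")]
    cleaned = re.sub(r"\s+", " ", " ".join(lines)).strip()
    for raw, canonical in TERMINOLOGY.items():
        cleaned = re.sub(re.escape(raw), canonical, cleaned, flags=re.IGNORECASE)
    return cleaned


def build_case(raw, status, push):
    """Standardize a case and hand it to push() only when the status is completed."""
    missing = [f for f in FIELDS if f not in raw]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    if status != "completed":
        return None
    case = {f: standardize(raw[f]) for f in FIELDS}
    push(case)
    return case


pushed = []  # stand-in for the dynamic library
raw = {f: "n/a" for f in FIELDS}
raw["execution_result"] = "DEBUG trace\nRx power   high"
case = build_case(raw, "completed", pushed.append)
print(case["execution_result"])
```

Gating on the status mark mirrors the trigger described for the dynamic library: nothing is written until the troubleshooting process reports completion.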

[0050] It should be noted that although the operation of the method of the present invention has been described in a specific order in the above embodiments and figures, this does not require or imply that the operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.

[0051] To explain the above method for fault monitoring and troubleshooting based on a digital employee network more clearly, a specific embodiment is described below. This embodiment serves only to illustrate the present invention and does not constitute an improper limitation of it.

[0052] Implementation process:

[0053] 1. Fault Access and Intent Parsing: After an alarm is triggered, the large model extracts core information from the natural language alarm (Device IP: 10.1.1.1, Alarm Type: port DOWN). Then, relying on the RAG dynamic empowerment algorithm, it retrieves historical troubleshooting cases related to "port DOWN" from the RAG static library and the latest troubleshooting cases from the dynamic library, initially matching the fault type as "physical link or port hardware failure". Based on the retrieved data of similar scenarios, it also initializes the decision weights of the strategy optimization algorithm: "Query Port Status" has an initial weight of 0.6, "Detect Transmit/Receive Optical Power" 0.5, and "Verify Configuration Parameters" 0.4. Finally, it generates a structured task instruction that conforms to the two-way data interaction mechanism: {"Device IP": "10.1.1.1", "Operation Type": "Port Diagnosis, Transmit/Receive Optical Power Detection, Configuration Verification", "Parameter Requirements": "Query Gi0/0/1 Port Status, Detect Transmit/Receive Optical Power, Verify Configuration", "Timeout Threshold": "30s"}.

[0054] 2. Prompt word matching and task decomposition: Based on the scenario of "alarm type is port DOWN", the RAG dynamic empowerment algorithm retrieves exclusive prompt words from the dual database: "Huawei port diagnostic command: display interface Gi0/0/1; optical power detection command: display interface Gi0/0/1; configuration verification command: display interface Gi0/0/1; port fault threshold: port status is down, transmit and receive power exceeds the threshold set by the device". The large model combines this prompt word and, based on the initial weights of the strategy optimization algorithm, decomposes the total troubleshooting task into 3 sub-tasks, generating a fault strategy tree: sub-task 1 (query port status) → sub-task 2 (detect transmit and receive power) → sub-task 3 (verify configuration parameters). The format of the sub-tasks all conforms to the command specifications of the two-way data interaction mechanism.

[0055] 3. Real-time Execution and Data Feedback: The MCP intelligent scheduling submodule receives the structured task instruction, autonomously selects the SSH interface to log in to the device according to the Huawei device interface adaptation rules, and executes the instructions corresponding to the three sub-tasks in sequence. After execution, the results are encapsulated in a standardized format, embedded with data verification bits, and sent back to the large model. The feedback data is as follows: {"Device IP": "10.1.1.1", "Execution Status": "Success", "Indicator Data": {"Port Status": "UP", "Transmit/Receive Optical Power": "Received 4.6dBm, threshold [-10.599, 4.500]dBm, Transmitted 0.99dBm, threshold [-4.300, 4.500]dBm"}}. At the same time, the execution data is synchronized to the RAG dynamic library for archiving, in preparation for subsequent knowledge updates.
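The text does not specify how the "data verification bits" are computed. As a sketch under that caveat, the example below assumes a CRC32 checksum over a canonical JSON serialization; `encapsulate` and `verify` are hypothetical helper names.

```python
import json
import zlib


def encapsulate(result: dict) -> dict:
    """Serialize the result canonically and attach a CRC32 check value (assumed scheme)."""
    payload = json.dumps(result, sort_keys=True, ensure_ascii=False)
    return {"payload": payload, "crc32": zlib.crc32(payload.encode("utf-8"))}


def verify(envelope: dict) -> dict:
    """Recompute the check value and return the decoded result if it matches."""
    payload = envelope["payload"]
    if zlib.crc32(payload.encode("utf-8")) != envelope["crc32"]:
        raise ValueError("verification failed: payload does not match its check value")
    return json.loads(payload)


feedback = {
    "Device IP": "10.1.1.1",
    "Execution Status": "Success",
    "Indicator Data": {"Port Status": "UP"},
}
envelope = encapsulate(feedback)
restored = verify(envelope)
```

A CRC detects accidental corruption only; if tamper resistance were required, a keyed hash would be a more suitable choice.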

[0056] 4. Dynamic Decision-Making and Solution Output: The large model combines the historical troubleshooting cases of "Huawei equipment received optical power exceeding the threshold causing port down" retrieved by the RAG dynamic empowerment algorithm with the real-time execution results returned by MCP (port status UP, received optical power over the limit) to drive the strategy optimization algorithm to run gradient descent. With "optimal fault diagnosis efficiency" as the goal, the priority of the sub-tasks is dynamically adjusted: the weight of "detect optical power" is adjusted to 0.8 (highest), the weight of "query port status" to 0.7, and the weight of "verify configuration parameters" to 0.3. After 3 iterations, the convergence threshold (ε=0.001) is reached, and the sub-task order is finally optimized to: sub-task 1 (detect optical power) → sub-task 2 (query port status) → sub-task 3 (verify configuration parameters). At the same time, combining the RAG retrieval results with the real-time data, the root cause of the fault is determined to be "the received optical power of port Gi0/0/1 exceeds the limit, causing the port to go down frequently", and an optimized troubleshooting strategy is generated: "1. Shut down the Gi0/0/1 port (command: shutdown Gi0/0/1); 2. Replace the optical module; 3. Enable the Gi0/0/1 port (command: undo shutdown Gi0/0/1); 4. Verify the port status and link attenuation value (command: display interface Gi0/0/1)". After the solution is generated, it is synchronously converted into standardized commands for backup and pushed to the terminal of the operation and maintenance personnel.
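The text names gradient descent and a convergence threshold of ε=0.001 but does not disclose the loss function or learning rate. The sketch below therefore assumes a quadratic loss that pulls each weight toward an evidence-derived target score, with the targets set to the adjusted weights quoted above purely for illustration. The function names and the learning rate are hypothetical, and the iteration count it produces depends on those assumptions and is not meant to reproduce the three iterations reported.

```python
def optimize_weights(weights, targets, lr=0.5, eps=0.001, max_iter=100):
    """Gradient descent on the assumed loss 0.5 * sum((w - t)^2).

    Stops once the largest update in an iteration is smaller than eps.
    Returns the final weights and the number of iterations used.
    """
    w = dict(weights)
    for iteration in range(1, max_iter + 1):
        # The gradient of the assumed loss with respect to w[k] is (w[k] - t[k]).
        steps = {k: lr * (w[k] - targets[k]) for k in w}
        for k in w:
            w[k] -= steps[k]
        if max(abs(step) for step in steps.values()) < eps:
            return w, iteration
    return w, max_iter


def reorder(weights):
    """Return sub-task names sorted by descending weight."""
    return sorted(weights, key=weights.get, reverse=True)


initial = {
    "Query Port Status": 0.6,
    "Detect Optical Power": 0.5,
    "Verify Configuration Parameters": 0.4,
}
# Assumed targets, taken from the adjusted weights in the worked example.
targets = {
    "Query Port Status": 0.7,
    "Detect Optical Power": 0.8,
    "Verify Configuration Parameters": 0.3,
}

final_weights, iterations = optimize_weights(initial, targets)
new_order = reorder(final_weights)
```

Under these assumptions the optical power check moves to the front of the order, followed by the port status query and the configuration check, which is the reordering described in the text.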

[0057] 5. Operation and maintenance execution and result verification: The operation and maintenance personnel replace the optical module of the Gi0/0/1 port according to the pushed troubleshooting strategy. After the relevant instructions are executed, the verification result is fed back through the MCP module: "Port status Up, received optical power -4.0dBm (normal), terminal can connect to the network normally". The fault diagnosis is then complete; the whole process takes 8 minutes, whereas traditional manual diagnosis takes more than 30 minutes.
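The "normal" judgement in the verification result can be illustrated with a simple range check against the thresholds quoted in the feedback data of step 3. This is a sketch: the helper name `out_of_range` and the short keys "rx" and "tx" are assumptions, and the post-repair reading is treated as dBm, consistent with those thresholds.

```python
def out_of_range(readings_dbm: dict, thresholds_dbm: dict) -> list:
    """Return the names of readings that fall outside their [low, high] range."""
    return [
        name
        for name, value in readings_dbm.items()
        if not thresholds_dbm[name][0] <= value <= thresholds_dbm[name][1]
    ]


# Thresholds as quoted in the feedback data of the worked example.
THRESHOLDS_DBM = {"rx": (-10.599, 4.500), "tx": (-4.300, 4.500)}

before_repair = out_of_range({"rx": 4.6, "tx": 0.99}, THRESHOLDS_DBM)
after_repair = out_of_range({"rx": -4.0}, THRESHOLDS_DBM)
```

Before the optical module is replaced, only the received power (4.6 dBm) is flagged; the post-repair reading of -4.0 falls inside the range, so nothing is flagged.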

[0058] 6. Knowledge Closed-Loop Update: After fault diagnosis, the text synchronization mechanism is activated automatically. First, the structured text information of this fault is extracted (Device IP: 10.1.1.1, Alarm Type: Port Down, Root Cause: Received Optical Power Exceeds Limit, Handling Steps: Shut Down Port → Replace Optical Module → Enable Port → Verify, Verification Result: Normal), and the text cleaning module removes redundant logs and normalizes the terminology. Second, the standardized text is pushed to the RAG dynamic incremental library in real time through the pre-configured API interface of the Dify platform. The API call simultaneously triggers the Dify platform to complete text slicing and vector index updates automatically, ensuring that this case can be retrieved immediately for subsequent similar faults. Third, this fault does not involve updates to the basic text data of the static library, so no MCP collaborative verification is required. Fourth, the large model obtains the search tags of the newly added case ("Huawei Port DOWN", "Received Optical Power Exceeds Limit") through the search feedback interface of the Dify platform and uses them as feedback to optimize the decision weights of its internal self-learning module, improving the efficiency of fault identification and decision-making in subsequent similar scenarios.
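The extraction, cleaning, and push steps above can be sketched as a small pipeline. Everything here is illustrative: the terminology map, the rule that lines starting with "DEBUG" count as redundant logs, and the function names are assumptions, and the push to the Dify platform is represented by an injected callable rather than a real API call, since the text does not name the endpoint.

```python
import re
from typing import Callable

# Hypothetical terminology map for the text cleaning step.
TERMS = {"port down": "Port Down", "rx power": "Received Optical Power"}


def clean_text(raw: str) -> str:
    """Drop DEBUG log lines, collapse whitespace, and normalize known terms."""
    lines = [
        line.strip()
        for line in raw.splitlines()
        if line.strip() and not line.lstrip().startswith("DEBUG")
    ]
    text = re.sub(r"\s+", " ", " ".join(lines))
    for variant, standard in TERMS.items():
        text = re.sub(re.escape(variant), standard, text, flags=re.IGNORECASE)
    return text


def build_case_record(fields: dict) -> str:
    """Flatten the structured fault fields into one line of text."""
    return "; ".join(f"{key}: {value}" for key, value in fields.items())


def sync_case(fields: dict, push: Callable[[str], None]) -> str:
    """Clean the record and hand it to the push function (stand-in for the platform API)."""
    record = clean_text(build_case_record(fields))
    push(record)
    return record


pushed = []
case = {
    "Device IP": "10.1.1.1",
    "Alarm Type": "port down",
    "Root Cause": "rx power exceeds limit",
    "Verification Result": "Normal",
}
record = sync_case(case, pushed.append)
```

Injecting the push function keeps the cleaning logic testable without network access; in a deployment it would wrap whatever interface the knowledge platform exposes.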

[0059] Based on the same inventive concept, this invention also proposes a system for fault monitoring and troubleshooting based on a digital employee network. For the implementation of this system, reference can be made to the implementation of the method described above, and overlapping details are not repeated here. The term "module" as used below can refer to a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

[0060] Figure 2 is a schematic diagram of the structure of the system for fault monitoring and troubleshooting based on a digital employee network according to the present invention. As shown in Figure 2, the system includes:

[0061] The fault access and intent parsing module 101 is used to extract the device IP, port and alarm type from the natural language alarm after the device alarm is triggered. Using the RAG dynamic empowerment algorithm, it retrieves the corresponding historical troubleshooting cases in the RAG static library and the corresponding latest troubleshooting cases in the RAG dynamic library according to the alarm type to initially match the fault type. At the same time, based on the retrieved similar scenario data, it initializes the decision weight of the strategy optimization algorithm and finally generates a structured task instruction that conforms to the two-way data interaction mechanism.

[0062] The prompt word matching and task decomposition module 102 uses the RAG dynamic empowerment algorithm to search the RAG static library and the RAG dynamic library according to the alarm type scenario and match the corresponding prompt word information; the large model combines the prompt word information and, based on the initial weights of the strategy optimization algorithm, decomposes the troubleshooting task into multiple sub-tasks to generate a fault strategy tree, and the format of the sub-tasks conforms to the instruction specification of the two-way data interaction mechanism;

[0063] The real-time execution and data feedback module 103 is used to receive the structured task instructions, select the optimal device interface to log in to the device according to the device interface adaptation rules, and execute the instructions corresponding to multiple sub-tasks in sequence; after the execution is completed, the results are packaged in a standardized format, embedded with data verification bits, and sent back to the large model.

[0064] The dynamic decision-making and strategy output module 104 is used to drive the strategy optimization algorithm to dynamically adjust the priority and execution order of the subtasks by combining the retrieved historical troubleshooting cases and the returned execution results, and finally determine the root cause, generate the adjusted troubleshooting strategy and convert it into standardized instructions, push it to the operation and maintenance personnel to execute the instructions and provide feedback on the verification results.

[0065] The knowledge closed-loop update module 105 is used to automatically organize the standardized text information of this fault and push it to the RAG dynamic library, while triggering the Dify platform to complete the text slicing and vector index update; it also synchronizes the adjustment logic and effect data of the strategy optimization algorithm to the RAG dynamic library; if a new device model or an unrecorded fault type is involved, the Dify platform is used to supplement the text details and trigger a second synchronization to the RAG dynamic library, while simultaneously updating the scene weight training data of the strategy optimization algorithm.

[0066] Preferably, the RAG static library is used to store basic text data, including historical troubleshooting cases, and is initialized by batch text upload and related import supported by the Dify platform; by setting a cron scheduled task, the updated basic text data is automatically pulled and the Dify platform is triggered to complete text re-slicing and vector index update;

[0067] The RAG dynamic library is used to store dynamic data, including the latest device information and the latest troubleshooting cases, and is synchronized through the API interface opened by the Dify platform. The API interface is called based on the status after the troubleshooting process is completed, and the latest troubleshooting cases are written into the RAG dynamic library. For new device models or alarm types not stored in the RAG static library, the RAG dynamic library receives the latest troubleshooting cases in real time through the API interface and completes the knowledge base update within a predetermined time.
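The dual-library retrieval that the modules above rely on can be sketched as a merged, ranked search over both libraries. This is an assumption-laden illustration: simple word overlap stands in for the vector retrieval performed by the platform, the tie-break in favour of the dynamic library is a design choice made here, and `score` and `retrieve` are hypothetical names.

```python
def score(query: str, text: str) -> float:
    """Fraction of query words that appear in the text (stand-in for vector similarity)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def retrieve(query: str, static_lib: list, dynamic_lib: list, top_k: int = 3) -> list:
    """Search both libraries and return (score, source, case) tuples, best first."""
    hits = [(score(query, case), "static", case) for case in static_lib]
    hits += [(score(query, case), "dynamic", case) for case in dynamic_lib]
    hits = [hit for hit in hits if hit[0] > 0]
    # Sort by score; on equal scores prefer the dynamic library (newer cases).
    hits.sort(key=lambda hit: (-hit[0], hit[1] != "dynamic"))
    return hits[:top_k]


static_cases = ["Huawei port down caused by fibre break"]
dynamic_cases = [
    "Huawei port down caused by received optical power over limit",
    "BGP neighbor flapping after route policy change",
]

results = retrieve("port down", static_cases, dynamic_cases)
```

For the query "port down", both port-related cases match equally, so the newer case from the dynamic library is ranked first and the unrelated BGP case is filtered out.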

[0068] It should be noted that although several modules of the system for monitoring and troubleshooting digital employee networks are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more modules described above can be embodied in one module. Conversely, the features and functions of one module described above can be further divided and embodied by multiple modules.

[0069] Based on the aforementioned inventive concept, as shown in Figure 3, the present invention also proposes a computer device 200, including a memory 210, a processor 220, and a computer program 230 stored in the memory 210 and executable on the processor 220. When the processor 220 executes the computer program 230, it implements the aforementioned method for fault monitoring and troubleshooting based on a digital employee network.

[0070] Based on the aforementioned inventive concept, the present invention also proposes a computer-readable storage medium storing a computer program for executing the aforementioned method for fault monitoring and troubleshooting based on a digital employee network.

[0071] While the spirit and principles of the invention have been described with reference to several specific embodiments, it should be understood that the invention is not limited to the disclosed specific embodiments, and the division of aspects does not imply that features in these aspects cannot be combined for benefit; such division is merely for ease of description. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

[0072] Regarding the limitation of the scope of protection of this invention, those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solution of this invention are still within the scope of protection of this invention.

Claims

1. A method for fault monitoring and troubleshooting based on a digital employee network, characterized in that the method includes:
after a device alarm is triggered, the large model extracts the device IP, port, and alarm type from the natural language alarm; using the RAG dynamic empowerment algorithm, it retrieves the corresponding historical troubleshooting cases in the RAG static library and the corresponding latest troubleshooting cases in the RAG dynamic library based on the alarm type to initially match the fault type; at the same time, based on the retrieved data of similar scenarios, it initializes the decision weights of the strategy optimization algorithm and finally generates a structured task instruction that conforms to the two-way data interaction mechanism;
using the RAG dynamic empowerment algorithm, based on the alarm type scenario, the RAG static library and the RAG dynamic library are searched, and the corresponding prompt word information is matched; the large model combines the prompt word information and, based on the initial weights of the strategy optimization algorithm, decomposes the troubleshooting task into multiple sub-tasks and generates a fault strategy tree, and the format of the sub-tasks conforms to the instruction specification of the two-way data interaction mechanism;
upon receiving the structured task instruction, the optimal device interface is selected to log in to the device according to the device interface adaptation rules, and the instructions corresponding to the multiple sub-tasks are executed in sequence; after execution, the results are encapsulated in a standardized format, embedded with data verification bits, and sent back to the large model;
the large model combines the retrieved historical troubleshooting cases with the returned execution results to drive the strategy optimization algorithm to dynamically adjust the priority and execution order of the sub-tasks, ultimately determining the root cause, generating the adjusted troubleshooting strategy and converting it into standardized instructions, which are then pushed to the operation and maintenance personnel, who execute the instructions and provide feedback on the verification results;
the structured text information of this fault is automatically extracted and pushed to the RAG dynamic library, while triggering the Dify platform to complete the text slicing and vector index update; the adjustment logic and execution data of the strategy optimization algorithm are synchronized to the RAG dynamic library.

2. The method for fault monitoring and troubleshooting based on a digital employee network according to claim 1, characterized in that the method further includes: if a new device model or an unrecorded fault type is involved, the Dify platform supplements the text details and triggers a secondary synchronization to the RAG dynamic library, while simultaneously updating the scene weight training data of the strategy optimization algorithm.

3. The method for fault monitoring and troubleshooting based on a digital employee network according to claim 1, characterized in that the large model obtains the search popularity and matching accuracy of newly added cases through the search feedback interface of the Dify platform, and uses them as feedback to optimize the decision weights of its internal self-learning module.

4. A system for fault monitoring and troubleshooting based on a digital employee network, characterized in that the system includes:
a fault access and intent parsing module, used to extract the device IP, port, and alarm type from the natural language alarm after a device alarm is triggered; using the RAG dynamic empowerment algorithm, it retrieves the corresponding historical troubleshooting cases in the RAG static library and the corresponding latest troubleshooting cases in the RAG dynamic library according to the alarm type to initially match the fault type; at the same time, based on the retrieved data of similar scenarios, it initializes the decision weights of the strategy optimization algorithm and finally generates a structured task instruction that conforms to the two-way data interaction mechanism;
a prompt word matching and task decomposition module, used to search the RAG static library and the RAG dynamic library according to the alarm type scenario using the RAG dynamic empowerment algorithm, and to match the corresponding prompt word information; the large model combines the prompt word information and, based on the initial weights of the strategy optimization algorithm, decomposes the troubleshooting task into multiple sub-tasks to generate a fault strategy tree, and the format of the sub-tasks conforms to the instruction specification of the two-way data interaction mechanism;
a real-time execution and data feedback module, used to receive the structured task instruction, select the optimal device interface to log in to the device according to the device interface adaptation rules, and execute the instructions corresponding to the multiple sub-tasks in sequence; after the execution is completed, the results are packaged in a standardized format, embedded with data verification bits, and sent back to the large model;
a dynamic decision-making and strategy output module, used by the large model to combine the retrieved historical troubleshooting cases and the returned execution results to drive the strategy optimization algorithm to dynamically adjust the priority and execution order of the sub-tasks, ultimately determine the root cause, generate the adjusted troubleshooting strategy and convert it into standardized instructions, and push them to the operation and maintenance personnel, who execute the instructions and provide feedback on the verification results;
a knowledge closed-loop update module, used to automatically organize the standardized text information of this fault and push it to the RAG dynamic library, while triggering the Dify platform to complete the text slicing and vector index update; it also synchronizes the adjustment logic and effect data of the strategy optimization algorithm to the RAG dynamic library; if a new device model or an unrecorded fault type is involved, the Dify platform is used to supplement the text details and trigger a second synchronization to the RAG dynamic library, while simultaneously updating the scene weight training data of the strategy optimization algorithm.

5. The system for fault monitoring and troubleshooting based on a digital employee network according to claim 4, characterized in that:
the RAG static library is used to store basic text data, including historical troubleshooting cases; initialization is completed through batch text upload and related import supported by the Dify platform; by setting a cron scheduled task, the updated basic text data is automatically pulled and the Dify platform is triggered to complete text re-slicing and vector index update;
the RAG dynamic library is used to store dynamic data, including the latest device information and the latest troubleshooting cases, and is synchronized through the API interface opened by the Dify platform; the API interface is called based on the status after the troubleshooting process is completed, and the latest troubleshooting cases are written into the RAG dynamic library; for new device models or alarm types not stored in the RAG static library, the RAG dynamic library receives the latest troubleshooting cases in real time through the API interface and completes the knowledge base update within a predetermined time.

6. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the method according to any one of claims 1-3.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that performs the method according to any one of claims 1-3.