Artificial intelligence-based automated operation and maintenance fault prediction and active recovery system
By using an AI-based automated operation and maintenance system, combined with early warning of health degradation, state-level precise repair, and domain-level deterministic isolation, the system solves the problems of delayed fault prediction, inaccurate repair, and rapid spread in edge computing and IoT scenarios. It achieves early warning, precise repair, and rapid isolation, ensuring the stable operation of edge nodes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING TONGYU HUAZHOU TECHNOLOGY CO LTD
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention relates to the field of automated operation and maintenance technology, and more specifically, to an automated operation and maintenance fault prediction and proactive recovery system based on artificial intelligence.
Background Technology
[0002] Automated operation and maintenance (O&M) is a core supporting technology for edge computing, IoT, and other fields. Its core requirement is to achieve early fault prediction and rapid handling to ensure the stable operation of distributed nodes and related services. With the increasing heterogeneity of edge nodes and the increasing complexity of business scenarios, traditional O&M methods face severe challenges, placing higher demands on the timeliness of fault prediction, the accuracy of repair, and the effectiveness of isolation.
[0003] Existing O&M technologies largely rely on single-metric threshold monitoring or centralized cloud-based analysis, which has significant shortcomings. In terms of fault prediction, traditional methods struggle to capture hidden anomalies such as a surge in disordered losses despite no decrease in effective output, resulting in delayed warnings and a high false alarm rate. Repair methods often employ crude approaches like process restarts and node resets, lacking precise control over process semantic states, leading to poor repair results and potential disruption to normal business operations. Regarding fault isolation, existing solutions lack targeted domain-level isolation mechanisms, resulting in rapid fault propagation, cumbersome processing procedures, and difficulty adapting to scenarios with limited edge node resources and unstable networks. Furthermore, cloud-based O&M models suffer from transmission delays, failing to meet the rapid fault response requirements in edge scenarios.
[0004] To address the aforementioned issues, this invention proposes an automated operation and maintenance fault prediction and proactive recovery system based on artificial intelligence. Through multi-module collaboration, it achieves early warning, accurate repair, and rapid isolation of faults, thus overcoming the shortcomings of existing technologies.
Summary of the Invention
[0005] In view of the shortcomings of existing technologies, the purpose of this invention is to provide an automated operation and maintenance fault prediction and proactive recovery system based on artificial intelligence.
[0006] To achieve the above objectives, the present invention provides the following technical solution: An AI-based automated operation and maintenance fault prediction and proactive recovery system includes an early health decline warning module, a state-level precise repair module, and a domain-level deterministic isolation module. The early health decline warning module is deployed locally on the edge node, collects multi-dimensional software, hardware and business data, quantifies the disorder and effective output of system operation through entropy change theory, establishes a steady-state efficiency evaluation system, and realizes early health decline warning. The state-level precise repair module subscribes to early warning events, captures the semantic state of the process based on reversible computation, generates reverse operation instructions through difference analysis, and executes precise repair locally. The domain-level deterministic isolation module monitors the repair results. For faults that fail to be repaired, it performs a health assessment based on a preset fault fence, identifies the disabled fence, and performs an atomic isolation operation to achieve deterministic isolation of the fault domain and system self-reorganization.
[0007] Furthermore, the multi-dimensional hardware, software, and business data collected by the early health decline warning module include edge node hardware performance indicators, operating system kernel events, hardware environment parameters, and effective output data from the business layer. The edge node hardware performance indicators include the number of CPU cycles, the number of cache misses at each level, the number of branch prediction failures, and the number of memory access latency cycles. The operating system kernel events include the number of context switches and the interrupt request frequency. The hardware environment parameters include hardware temperature, battery power fluctuation parameters, and power supply voltage fluctuation parameters. The effective output data from the business layer includes the number of successfully processed terminal requests, the number of data bytes effectively transmitted, and the local task execution completion rate.
[0008] Furthermore, the early health decline warning module constructs a steady-state efficiency evaluation system by calculating operation and maintenance entropy and ordered work. Operation and maintenance entropy represents the degree of disorder in the internal operation of the edge node, which is a weighted synthesis of edge node hardware performance indicators, operating system kernel events and hardware environment parameters. Ordered work represents the effective output of the edge node, which is quantified by the effective output data of the business layer. The two provide basic parameters for the calculation of steady-state efficiency coefficient.
[0009] Furthermore, the steady-state efficiency coefficient is the result of the correlation calculation between ordered work and operation and maintenance entropy, and reflects the effective output capability of edge nodes under limited resources. The early health decline warning module establishes a health baseline based on the historical data of the edge node's healthy period, and determines the early health decline state jointly from two quantities calculated in real time: the deviation of the steady-state efficiency coefficient from the health baseline, and the trend determination factor.
[0010] Furthermore, the trend determination factor is obtained by calculating the ratio of the time change rate of the operation and maintenance entropy to the time change rate of the ordered work. It is used to identify hidden anomalies where effective output has not improved but disordered loss has surged. When both the deviation and the trend determination factor meet the preset conditions, a first-level early warning event containing node identifiers and core parameters is generated.
[0011] Furthermore, the state-level precise repair module performs difference analysis by calculating the state deviation vector, mapping the current semantic state and the historical healthy semantic state to numerical feature vectors respectively. The state deviation vector is obtained by the difference between the two. Based on the state deviation vector, the degree of deviation of each indicator is quantified to locate the pollution state that causes the process to be abnormal. The semantic state includes, but is not limited to, process stack parameters, thread state information, resource handle state, edge data interaction context, local task execution state, memory fragmentation rate, abnormal log characteristics, and hardware association state.
[0012] Furthermore, the domain-level deterministic isolation module calculates the real-time health of the fault fence using a weighted summation method. The fault fence is based on edge node cluster attributes or business domain division, forms a logical fence by grouping nodes by node labels, and locks the communication range with the help of software-defined networking. Each fault fence is pre-set with at least two independent health detection anchor points, including an internal heartbeat deployed on the core node of the fence and a boundary API probe bound to the business interaction interface. The weighting is set according to the reliability of the anchor points and the fence type.
[0013] Furthermore, the domain-level deterministic isolation module calculates the short-term downward trend of the health of the fault fence and combines it with the absolute value of the health to jointly determine the disabled fence. It then performs atomic operations of connection disconnection, process termination, and clean reconstruction on the disabled fence in sequence. Connection disconnection blocks the spread path of the faulty network through software-defined network rules or firewall rules. Process termination uses a hierarchical termination mechanism to clear abnormal processes. Clean reconstruction is based on the verified clean base image stored locally on the edge node to rebuild the fence instance.
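The fence health assessment described in the two preceding paragraphs can be illustrated with a minimal Python sketch. The anchor names, the weights, and both decision limits (`health_floor`, `max_drop`) are hypothetical values chosen for illustration; the text only states that health is a weighted sum over at least two anchors and that the absolute health and its short-term downward trend are judged jointly.

```python
def fence_health(anchor_scores, weights):
    """Weighted sum of per-anchor health scores (each in [0, 1]).

    anchor_scores and weights are dicts keyed by anchor name; the weights
    are assumed to sum to 1.
    """
    if set(anchor_scores) != set(weights):
        raise ValueError("anchors and weights must use the same names")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * anchor_scores[name] for name in anchor_scores)


def is_disabled(history, health_floor=0.5, max_drop=0.2):
    """Joint decision on a short window of health values (oldest first).

    A fence counts as disabled only when its current health is below the
    floor AND it has dropped by more than max_drop over the window.
    Both limits are illustrative assumptions.
    """
    current = history[-1]
    drop = history[0] - current
    return current < health_floor and drop > max_drop


health = fence_health(
    {"heartbeat": 1.0, "api_probe": 0.5},
    {"heartbeat": 0.6, "api_probe": 0.4},
)
```

Requiring both conditions keeps a fence that is stable at a low but steady health value from being isolated on the absolute value alone.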
[0014] Compared with the prior art, the present invention has the following beneficial effects:
1. A steady-state efficiency evaluation system based on operation and maintenance entropy and ordered work overcomes the limitations of traditional single-indicator monitoring. This system integrates edge node hardware performance, kernel events, environmental parameters, and business output data. Through joint judgment of the deviation and the trend determination factor, it accurately identifies hidden health declines and issues predictive warnings earlier than traditional alarms, thus buying sufficient time for subsequent repairs and effectively reducing the probability of failures.
2. The integration of reversible computation for precise repair with the fault fence isolation mechanism optimizes the fault handling process. For general faults, reversible computation captures semantic states and generates reverse operation instructions for precise repair; for complex faults, the fault fence mechanism is activated. After the disabled fence is identified through health assessment, atomic isolation and clean reconstruction operations are performed to prevent fault propagation. Compared with traditional coarse-grained handling methods, this significantly reduces fault handling time and ensures the continuity of core business.
Attached Figure Description
[0015] Figure 1 is a block diagram of an AI-based automated operation and maintenance fault prediction and proactive recovery system. Figure 2 is a flowchart illustrating the implementation of the early health decline warning module of the present invention. Figure 3 is a flowchart illustrating the implementation of the state-level precise repair module of the present invention. Figure 4 is a flowchart illustrating the implementation of the domain-level deterministic isolation module of the present invention.
Detailed Implementation
[0016] Embodiment: Referring to Figure 1, the AI-based automated operation and maintenance fault prediction and proactive recovery system in this embodiment includes an early health decline warning module, a state-level precise repair module, and a domain-level deterministic isolation module.
Early Health Decline Warning Module: This module targets the high heterogeneity of edge nodes and the limited resources of individual nodes. It uses entropy change theory to quantify the energy consumption efficiency of edge devices during orderly operation, replacing single-indicator threshold monitoring and detecting implicit health decline earlier than traditional alarms. The module is deployed locally on the edge node as a lightweight process, so warning decisions are made locally without uploading data to the cloud. The first-level warning event it outputs provides the trigger signal and core data for the subsequent repair module. As shown in Figure 2, the specific implementation is as follows:
S11. Multi-dimensional raw data acquisition and preprocessing: The system deploys lightweight acquisition agents at edge nodes (such as industrial gateways, edge servers, and IoT terminals) to periodically capture underlying hardware performance counters and operating system kernel events, in a way that suits the low-power operation requirements of edge devices. Key data sources include edge node CPU cycles, cache misses at each level (L1, L2, L3), branch prediction failures, memory access latency cycles, and operating system-level context switches and interrupt request frequency. Additionally, edge-specific metrics such as hardware temperature and power supply stability (battery power fluctuations or power supply voltage fluctuations) are collected for heterogeneous edge devices.
At the business level, the key metrics for measuring the ordered output of edge nodes include the number of terminal requests successfully processed on the edge, the number of data bytes effectively transmitted between the edge and the terminal or between the edge and the cloud, and the local task completion rate.
[0017] All time-series data are standardized and smoothed using a sliding window to form a time series set. Taking into account the computing power of edge nodes, the sliding window length is set to 5-30 seconds, the sampling period is set to 1-2 seconds, and the standardization adopts the min-max normalization method to map the data to the 0-1 interval, reducing the complexity of subsequent calculations.
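The preprocessing step above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function and class names are hypothetical, the normalization bounds are assumed to be known in advance per metric, and a plain moving average stands in for the unspecified smoothing method.

```python
from collections import deque


def min_max_normalize(values, lo, hi):
    """Map raw samples to the 0-1 interval given bounds lo < hi.

    Samples outside the bounds are clamped (an assumption; the text does
    not say how out-of-range samples are treated).
    """
    span = hi - lo
    return [min(1.0, max(0.0, (v - lo) / span)) for v in values]


class SlidingWindowMean:
    """Moving average over the most recent `size` samples."""

    def __init__(self, size):
        self._buf = deque(maxlen=size)

    def push(self, sample):
        self._buf.append(sample)
        return sum(self._buf) / len(self._buf)


# A 10-second window at a 1-second sampling period holds 10 samples.
window = SlidingWindowMean(size=10)
normalized = min_max_normalize([0, 50, 100, 150], lo=0, hi=100)
smoothed = [window.push(x) for x in normalized]
```

Using `deque(maxlen=...)` keeps memory bounded at the window size, which fits the stated goal of low resource use on the edge node.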
[0018] S12. Calculate the entropy and ordered work of the computing system: Operational entropy characterizes the internal disorder or internal friction cost of edge nodes, and is a weighted synthesis of underlying events combined with the hardware characteristics of edge devices (such as embedded chips and limited caches); ordered work characterizes the effective output of edge nodes (including local task processing and edge collaborative transmission). Both provide basic parameters for subsequent steady-state efficiency coefficient calculation, and their dimensions are unified as dimensionless coefficients to meet the lightweight computing needs of edge nodes.
[0019] The operation and maintenance entropy $S(t)$ is calculated as follows:
$$S(t) = w_1 M(t) + w_2 C(t) + w_3 \sum_{i=1}^{3} \alpha_i L_i(t) + w_4 I(t)$$
where:
- $M(t)$ represents the normalized mean (dimensionless) of the memory access latency of the edge node within the time window t, reflecting the disordered loss of memory I/O of the edge node. The data comes from the number of memory access latency cycles collected in step S11, and is obtained after min-max normalization to suit the embedded memory characteristics of edge nodes.
- $C(t)$ represents the normalized mean (dimensionless) of the number of operating system context switches on the edge node within the time window t, reflecting the degree of disorder in edge process scheduling. The data is taken from the context switch time series collected in step S11 and is suitable for lightweight computing after standardization and compression.
- $L_i(t)$ represents the normalized mean (dimensionless) of the number of misses in the i-th level cache (L1, L2, L3) of the edge node within the time window t, where i = 1, 2, 3 correspond to the three cache levels respectively. The data comes from the hardware performance counter collection results in step S11 and is consistent with the limited cache architecture of edge nodes.
- $I(t)$ represents the normalized mean (dimensionless) of the interrupt request frequency of the edge node within the time window t, reflecting the interference of hardware interrupts with the operating order of the edge node. The data is taken from the interrupt request time series collected in step S11.
- $w_1, w_2, w_3, w_4$ are the weighting coefficients (each ranging from 0.1 to 0.4, summing to 1) of the four key indicators: memory access latency, context switching, cache misses, and interrupt requests. These coefficients are calculated using Principal Component Analysis (PCA) on historical edge node health data (categorized by device type, such as industrial gateways and IoT terminals) from the past three months. The core principle is the variance contribution of each indicator to the disorder of edge node operation, so that the weight allocation matches the characteristics of heterogeneous edge devices and the cumulative variance contribution rate is no less than 85%. In this embodiment, [value missing].
- $\alpha_1, \alpha_2, \alpha_3$ are the sub-weight coefficients (each ranging from 0.2 to 0.5, summing to 1) of the three cache levels' miss metrics, also determined using the PCA algorithm. Weights are assigned based on how significantly each cache level affects edge node performance; for example, the L1 cache has a more significant impact on embedded chips, so its weight can be appropriately increased. In this embodiment, [value missing].
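A minimal Python sketch of this calculation follows, assuming the weighted synthesis is a linear weighted sum of the four normalized indicators with a nested weighted sum over the three cache levels. The default weight values are placeholders that merely respect the stated ranges (0.1 to 0.4 and 0.2 to 0.5); the embodiment's PCA-derived values are not available here, and the function name is hypothetical.

```python
def operational_entropy(mem_latency, ctx_switches, cache_misses, irq_freq,
                        weights=(0.3, 0.2, 0.3, 0.2),
                        cache_weights=(0.5, 0.3, 0.2)):
    """Weighted sum of normalized (0-1) disorder indicators for one window.

    cache_misses is a 3-tuple of normalized L1, L2, L3 miss means.
    weights and cache_weights are illustrative placeholders, not the
    PCA-derived values of the embodiment.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("indicator weights must sum to 1")
    if abs(sum(cache_weights) - 1.0) > 1e-9:
        raise ValueError("cache sub-weights must sum to 1")
    w_mem, w_ctx, w_cache, w_irq = weights
    cache_term = sum(a * m for a, m in zip(cache_weights, cache_misses))
    return (w_mem * mem_latency + w_ctx * ctx_switches
            + w_cache * cache_term + w_irq * irq_freq)


entropy = operational_entropy(0.4, 0.2, (0.6, 0.3, 0.1), 0.5)
```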
[0020] Ordered work is defined as: [formula missing]. Its terms are as follows:
- The first term represents the number of terminal requests successfully processed by the edge node within the time window t (unit: requests/window). The data comes from the business layer collection results in step S11, covering core edge businesses such as IoT terminal data reporting and control command response, and directly characterizes the business output capability of the edge node.
- The second term represents the number of data bytes effectively transmitted between the edge node and the terminal or cloud within the time window t (unit: MB/window), excluding retransmissions and redundant data. It is taken from the transmission data collected by the service layer in step S11 and reflects the effective output of edge node data collaboration.
- The service adjustment coefficient (value from 0.001 to 0.01) is set according to the priority of edge services. For example, in industrial control scenarios, the priority of terminal request processing is higher than that of data transmission, and the coefficient can be set to [value missing]; in data acquisition scenarios it can be adjusted in the opposite direction. Its core function is to compress the scale of the business indicators so that the argument of the logarithmic function stays within a reasonable range (1~10), which suits the low computing power of edge nodes and avoids numerical overflow. In this embodiment, [value missing].
- The logarithmic function is the natural logarithm (base e), which compresses positive values mildly. It preserves differences in business output, avoids distortion of the ordered work value caused by an excessively high single business indicator (such as a peak in terminal requests), and reduces the computational complexity on edge nodes.
[0021] S13. Establish and monitor the steady-state efficiency coefficient: The steady-state efficiency coefficient is the ratio of ordered work to operational entropy, reflecting the work efficiency of edge nodes (i.e., the effective output capacity under limited resources). A benchmark value is established based on the health baseline data of edge nodes to provide a basis for subsequent decline warning. The coefficient is a dimensionless coefficient, and the calculation process is performed locally on the edge nodes to avoid network transmission delay.
[0022] The steady-state efficiency coefficient $\eta(t)$ of the system is defined as:
$$\eta(t) = \frac{W(t)}{S(t) + \varepsilon}$$
where:
- $W(t)$ is the ordered work (dimensionless) of the edge node within the time window t, taken from the calculation result of step S12; it represents the effective service output capability of the edge node under limited resources.
- $S(t)$ is the operation and maintenance entropy (dimensionless) of the edge node within the time window t, taken from the calculation result of step S12; it represents the disordered loss of the operation inside the edge node.
- $\varepsilon$ is a very small constant (with a value of $10^{-6}$, chosen according to engineering calculation practice) whose core function is to avoid a division-by-zero error when the operation and maintenance entropy approaches 0. Its dimension is consistent with the operation and maintenance entropy, and its value is much smaller than the normal operation and maintenance entropy range of edge nodes (0.2~1.5), so it does not affect the accuracy of the efficiency coefficient calculation.
- $\eta(t)$ is the steady-state efficiency coefficient (dimensionless), defined as the effective output divided by the disordered loss. It reflects the resource utilization efficiency of the edge node and has a normal operating range of 1.8 to 3.5 (based on statistics from 72 hours of edge node health baseline data). A value outside this range indicates that the node is operating abnormally.
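As a minimal Python sketch of the ratio described above: the constant 1e-6 and the normal range 1.8 to 3.5 come from the text, while the function names are hypothetical.

```python
EPSILON = 1e-6  # small constant from the text, guards against division by zero


def steady_state_efficiency(ordered_work, op_entropy, eps=EPSILON):
    """Ratio of ordered work to operation and maintenance entropy."""
    return ordered_work / (op_entropy + eps)


def in_normal_range(eta, low=1.8, high=3.5):
    """True when the coefficient lies in the stated normal operating range."""
    return low <= eta <= high


eta = steady_state_efficiency(1.2, 0.5)
```

Because the constant is several orders of magnitude below the stated entropy range of 0.2 to 1.5, it changes the result only in the sixth significant digit for healthy nodes.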
[0023] Health baseline parameter description: The health baseline learning period is set to 72~168 hours (3~7 days), based on the operational cycle of edge nodes (such as two-shift industrial production, or IoT terminals that are online 24 hours a day). Only data from stable periods with 30%~70% load is selected, excluding extreme scenarios such as peak terminal access (for example, the surge of devices coming online at 8 am) and cloud synchronization troughs (for example, 2 am), so that the baseline moving average $\mu_\eta$ and standard deviation $\sigma_\eta$ match the actual operating state of the edge nodes. In this embodiment, the normal range of $\sigma_\eta$ is 0.3~0.5 (based on actual measurement statistics of multiple models of edge devices).
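The baseline selection rule can be sketched in Python as below. The input layout (pairs of load fraction and efficiency coefficient) and the function name are assumptions made for illustration, and the population standard deviation is used because the text does not say which estimator applies.

```python
from statistics import mean, pstdev


def build_health_baseline(samples, load_lo=0.30, load_hi=0.70):
    """Return (mean, standard deviation) of the efficiency coefficient.

    samples: iterable of (load_fraction, efficiency) pairs gathered during
    the learning period. Only windows with 30%-70% load are kept, as the
    text prescribes; peak and trough windows are discarded.
    """
    stable = [eta for load, eta in samples if load_lo <= load <= load_hi]
    if len(stable) < 2:
        raise ValueError("not enough stable-period samples for a baseline")
    return mean(stable), pstdev(stable)


history = [(0.5, 2.0), (0.6, 3.0), (0.9, 10.0), (0.1, 0.1)]
mu, sigma = build_health_baseline(history)
```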
[0024] S14. Determination and issuance of early decline warnings: The deviation of the efficiency coefficient and the trend determination factor of the edge node are calculated in real time and used jointly to determine early health decline, which suits edge devices whose failures spread easily and whose repair windows are short. The generated first-level early warning event contains complete feature parameters, so that the repair module can accurately locate the anomalies of the edge node and the corresponding processes and start a local response without cloud intervention.
[0025] The efficiency coefficient $\eta(t)$ of the current window is calculated in real time, together with its Z-score deviation $Z(t)$ relative to the health baseline:
$$Z(t) = \frac{\eta(t) - \mu_\eta}{\sigma_\eta}$$
At the same time, a trend determination factor $T(t)$ is introduced:
$$T(t) = \frac{dS/dt}{dW/dt}$$
Both derivatives are expressed in dimensionless units per second, so $T(t)$ is a dimensionless coefficient, which suits the lightweight computing needs of edge nodes.
[0026] where:
- $\eta(t)$ is the steady-state efficiency coefficient (dimensionless) for the current time window, taken from the real-time calculation result in step S13.
- $\mu_\eta$ is the moving average (dimensionless) of the steady-state efficiency coefficient over the health baseline, and $\sigma_\eta$ is the corresponding standard deviation (dimensionless). Both are taken from the health baseline established in step S13 and calculated from the historical data of the edge node's healthy period.
- $Z(t)$ is the Z-score deviation (dimensionless), defined according to the statistical normal distribution principle. It quantifies how far the current efficiency coefficient deviates from the health baseline. A negative value indicates that the efficiency is lower than the baseline, and the larger the absolute value, the more significant the deviation.
- $dS/dt$ is the first derivative of the operation and maintenance entropy with respect to time (dimensionless/second), calculated by the first-order difference method, $dS/dt \approx \left(S(t) - S(t - \Delta t)\right) / \Delta t$ (with a sampling period $\Delta t$ of 1 to 5 seconds). It characterizes the rate of change of disordered loss inside the edge node, with positive values indicating an increase in disorder.
- $dW/dt$ is the first derivative of ordered work with respect to time (dimensionless/second), also calculated using the first-order difference method. It represents the rate of change of the effective output of the edge node, with a positive value indicating an improvement in business output.
- $T(t)$ is the trend determination factor (dimensionless), defined as the ratio of the change in disordered loss to the change in effective output. It is used to identify hidden anomalies where effective output does not increase but loss surges. For example, memory fragmentation at an edge node can cause disordered loss to increase while business output remains unchanged, in which case $T(t)$ increases significantly.
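The two quantities described above can be sketched in Python as follows. The function names are hypothetical. The text does not say how the ratio is evaluated when ordered work is flat or falling, so the denominator is floored at a small positive constant here; that guard is an assumption of this sketch, not part of the source.

```python
def z_score(eta, mu, sigma):
    """Deviation of the current efficiency from the baseline, in std units."""
    return (eta - mu) / sigma


def first_difference(curr, prev, dt):
    """First-order (backward) difference approximation of a time derivative."""
    return (curr - prev) / dt


def trend_factor(s_curr, s_prev, w_curr, w_prev, dt, eps=1e-6):
    """Ratio of the entropy change rate to the ordered-work change rate.

    Assumption: the denominator is floored at eps so that flat or falling
    output combined with rising entropy yields a large positive factor
    instead of a division by zero or a sign flip.
    """
    ds = first_difference(s_curr, s_prev, dt)
    dw = first_difference(w_curr, w_prev, dt)
    return ds / max(dw, eps)


z = z_score(1.5, 2.5, 0.4)
flat_output = trend_factor(0.8, 0.5, 1.0, 1.0, dt=1.0)
```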
[0027] An edge node is considered to be in early health decline when both of the following conditions are met:
1. The Z-score deviation satisfies $Z(t) < \theta_Z$, where $\theta_Z$ is the preset negative deviation threshold. It is set according to the statistical 3σ principle and adjusted for the fault tolerance of edge nodes (the Z-score [value missing] in the standard normal distribution corresponds to a probability of only 0.62%), so false alarms can be effectively avoided. Testing on 100 edge devices showed a false alarm rate of less than 1.2% for this threshold. In this embodiment, the value of $\theta_Z$ is [value missing]. Meeting this condition indicates that the resource utilization efficiency of the edge node has dropped significantly, which may be due to edge-specific issues such as hardware aging and memory leaks.
2. The trend determination factor satisfies $T(t) > \theta_T$, where $\theta_T$ is a positive threshold set based on actual measurements of typical edge node failure modes (memory leak, lock contention, hardware overheating). When $T(t) > \theta_T$, the growth rate of disordered loss is more than 1.5 times the growth rate of effective output, which is a clear sign of latent decline. Experimental verification shows that this threshold achieves an accuracy of over 92% in identifying early faults in edge nodes. In this embodiment, $\theta_T$ is set to [value missing]. Meeting this condition indicates that while the ordered work of the edge node is flat or decreasing, the internal friction cost shows a clear upward trend, requiring timely intervention to prevent the fault from spreading to the terminal equipment.
[0028] Once the above conditions are met, the early warning module generates a Level 1 early warning event and publishes it through the edge local event bus. The event includes the early warning level, the trigger time, the associated edge node ID/service identifier, and the calculated core parameter values, together with edge-specific parameters such as the edge node hardware temperature and power supply status. This joint criterion can sensitively capture the idling state of edge nodes caused by memory fragmentation, lock contention, software aging, hardware overheating, and similar causes, and achieves predictive maintenance much earlier than traditional threshold alarms.
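The joint decision and event publication can be illustrated with the Python sketch below. The class names, field names and in-process bus are hypothetical stand-ins, and the two threshold defaults (-2.5 and 1.5) are assumptions chosen to be compatible with the 0.62% tail probability and the 1.5x growth ratio mentioned in the text; the embodiment's actual values are not given.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Level1Warning:
    node_id: str
    efficiency: float
    z_score: float
    trend_factor: float
    temperature_c: float
    power_status: str
    level: int = 1
    triggered_at: float = field(default_factory=time.time)


class LocalEventBus:
    """Tiny in-process publish/subscribe bus standing in for the edge bus."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        for callback in self._subscribers:
            callback(event)


def evaluate(bus, node_id, eta, z, trend, temperature_c, power_status,
             z_threshold=-2.5, trend_threshold=1.5):
    """Publish a Level 1 warning only when BOTH conditions hold."""
    if z < z_threshold and trend > trend_threshold:
        event = Level1Warning(node_id, eta, z, trend, temperature_c, power_status)
        bus.publish(event)
        return event
    return None


bus = LocalEventBus()
received = []
bus.subscribe(received.append)  # e.g. the repair module's handler
fired = evaluate(bus, "edge-01", 1.2, -3.0, 2.0, 71.5, "mains")
skipped = evaluate(bus, "edge-01", 1.2, -3.0, 1.0, 71.5, "mains")
```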
[0029] State-Level Precise Repair Module: This module subscribes to the first-level warning events of the Early Health Decline Warning Module (hereinafter referred to as the warning module). It performs lightweight, surgical repairs on abnormal internal states of edge node processes, which suits edge nodes that have limited resources and cannot handle heavy repair tasks. The entire repair process is executed locally on the edge node, and the repair results are fed back synchronously to the edge control plane, which determines whether the edge node returns to normal or an isolation process is initiated; this avoids repair delays caused by reliance on the cloud. As shown in Figure 3, the specific implementation is as follows:
S21. Early warning event reception and preliminary analysis: The state-level precise repair module (hereinafter referred to as the repair module) is deployed locally on the edge node as a lightweight service. After receiving the early warning event, it locates the set of affected edge processes and checks whether the processes have a built-in lightweight semantic checkpoint function. The checkpoint function is implemented by injecting a lightweight SDK during compilation or deployment, within the storage and computing power limits of the edge node, and provides support for subsequent state capture and rollback.
[0030] After receiving a warning event, the repair module locates the set of potentially affected processes on the node based on the edge node ID/service identifier in the event, and checks whether these processes have a pre-configured lightweight semantic checkpoint function. The SDK is developed in C++ and is compatible with mainstream embedded operating systems for edge nodes (such as embedded Linux and RTOS); it supports both static compilation injection and dynamic link injection, so that it does not consume too many edge resources. Hooks (lightweight callback functions pre-embedded at specific execution points of the program to intercept, capture or modify the program's running state and data) are implanted at key logic branch points in the code, including edge terminal request processing entry points, edge-to-cloud/terminal data interaction return points, and edge node local resource allocation/release points. The hooks capture a predefined set of decisive internal semantic states and ensure the accuracy of the repair.
[0031] S22. Target process status capture and difference analysis: The system captures the current semantic state of the target process on the edge node and calls historical health checkpoints. It locates the pollution state through lightweight difference analysis, which is suitable for low computing power scenarios on the edge node. The anomaly score of the difference item quantifies the degree of anomaly, providing a basis for the generation of reverse operation instructions, ensuring that the repair operation is accurate and the resource consumption is controllable.
[0032] For target processes that support checkpoints, the repair module sends a local command to trigger the recording of the current semantic state, avoiding latency and data loss caused by cross-node transmission. At the same time, it loads the most recent checkpoint marked as healthy for the process from the local persistent storage (such as SD card or local solid-state drive) on the edge node, which includes the health semantic state. The checkpoint storage adopts a lightweight distributed KV database, with the storage period consistent with the collection period, retaining the most recent 100 health checkpoints, and the storage latency does not exceed 100 milliseconds, adapting to the limited storage resources of edge nodes.
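The retention rule for health checkpoints can be sketched in Python as follows. An in-memory bounded queue stands in for the lightweight KV database on local persistent storage, and the class and method names are hypothetical.

```python
from collections import deque


class HealthCheckpointStore:
    """Keeps only the most recent `capacity` healthy checkpoints."""

    def __init__(self, capacity=100):
        # deque with maxlen silently drops the oldest entry when full
        self._items = deque(maxlen=capacity)

    def save(self, timestamp, state):
        """Record a checkpoint that has been marked healthy."""
        self._items.append((timestamp, dict(state)))

    def latest(self):
        """Return the most recent healthy checkpoint, or None if empty."""
        return self._items[-1] if self._items else None

    def __len__(self):
        return len(self._items)


store = HealthCheckpointStore(capacity=3)
for t in range(5):
    store.save(t, {"active_threads": 4 + t})
latest_ts, latest_state = store.latest()
```

A real deployment would also need the integrity check code mentioned for checkpoints; that part is omitted here.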
[0033] The repair module executes a lightweight difference analysis algorithm to calculate the state deviation vector:
$$\Delta X = \phi(X_{cur}) - \phi(X_{health})$$
where $X_{cur}$ represents the current semantic state set of the target process on the edge node, taken from the state capture result in step S22. Based on the running characteristics of the edge process, 10-20 key semantic states are selected to suit the lightweight storage requirements of edge nodes. In this embodiment, the key semantic states include:
- Key parameters of the process stack: stack bottom address, stack top offset, current stack frame depth, total heap memory allocation and percentage used;
- Thread status information: number of active threads, core thread ID and running status (ready / blocked / running), thread lock holding status (held lock ID, lock type);
- Resource handle status: number of file handles and their effective percentage, number of network handles (sockets), terminal device connection handle ID and associated terminal MAC;
- Edge data interaction context: terminal request session ID, request processing progress (completed / blocked / pending response), and connection session status with the edge gateway / cloud (connected / disconnected / reconnecting);
- Local task execution status: currently executing task ID, task priority, remaining execution time, amount of locally cached data, and validity flag;
- Edge hardware resource association: status of bound edge node hardware interfaces (serial port / network port) and hardware resource usage threshold (current CPU / memory usage as a percentage of the process quota);
- Memory fragmentation rate: the percentage of heap memory fragments and the minimum size of contiguous available memory blocks;
- Exception log characteristics: whether there are hidden exception logs (no explicit alarms but exception markers), exception log generation timestamps and types (memory / network / hardware);
- Semantic checkpoint identifier: the most recent health checkpoint generation time, checkpoint integrity check code, and consistency identifier between the current state and the health checkpoint;
- Semaphore status: number of acquired semaphores, length of the queue waiting for semaphores;
- Power supply stability correlation: remaining battery percentage of battery-powered nodes, power supply voltage fluctuation records (peak / valley values in the last 5 seconds);
- Hardware temperature correlation: temperature of the core hardware the process is bound to, temperature alarm threshold trigger status.
$X_{health}$ represents the set of healthy semantic states of the target process at the edge node, taken from the historical health checkpoint loaded in step S22. It corresponds one-to-one with the indicators of the current state set to ensure the consistency of the difference analysis.
[0034] The mapping function $\Phi$ from semantic state to numerical feature vector is set according to the semantic index type of the edge process: discrete indices (such as handle state) use 0-1 encoding, while continuous indices (such as stack occupancy) use min-max normalization. The dimension of the mapped feature vector is set to 20-50, determined by the computing power of the edge node (too high a dimension increases the computational load, while too low a dimension loses key information); according to actual tests, 20-30 dimensions balance accuracy and computing power consumption. $\Delta S$ is the state deviation vector (dimensionless), in which each dimension is the difference between the current feature and the healthy feature and lies in the range [-1, 1]. Positive values indicate that the current indicator is higher than the healthy value, and negative values that it is lower; the larger the absolute value, the more significant the deviation.
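The mapping and difference described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the index names, the value bounds, and the four-dimensional schema are hypothetical (the embodiment uses 20-50 dimensions), and a discrete index is assumed to be a simple flag encoded as 0 or 1.

```python
from typing import Dict, List, Optional, Tuple, Union

Value = Union[bool, int, float]

# Hypothetical schema (not from the source): continuous indices carry
# (min, max) bounds for min-max normalization; discrete indices are None
# and are encoded as 0 or 1.
SCHEMA: Dict[str, Optional[Tuple[float, float]]] = {
    "heap_used_pct": (0.0, 100.0),
    "active_threads": (0.0, 64.0),
    "lock_held": None,
    "session_connected": None,
}


def to_features(state: Dict[str, Value]) -> List[float]:
    """Map a semantic state to a numerical feature vector in [0, 1]."""
    features = []
    for name, bounds in SCHEMA.items():
        value = state[name]
        if bounds is None:
            # Discrete index: 0-1 encoding.
            features.append(1.0 if value else 0.0)
        else:
            # Continuous index: clip to the bounds, then min-max normalize.
            lo, hi = bounds
            clipped = min(max(float(value), lo), hi)
            features.append((clipped - lo) / (hi - lo))
    return features


def deviation_vector(current: Dict[str, Value],
                     healthy: Dict[str, Value]) -> List[float]:
    """Current features minus healthy features; each entry lies in [-1, 1]."""
    cur, ref = to_features(current), to_features(healthy)
    return [c - h for c, h in zip(cur, ref)]
```

Because both feature vectors lie in [0, 1], every component of the difference falls in [-1, 1], matching the range stated for the state deviation vector.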
[0035] For each difference item, an anomaly score is calculated, and items whose scores exceed the threshold are judged to be in a contaminated state. The anomaly score threshold (0.3~0.5) is based on the Euclidean distance calculation principle combined with experience in edge process fault repair: for example, when the Euclidean distance of the deviation vector is greater than 0.4, the probability that the corresponding process is anomalous exceeds 88%. The threshold can also be adjusted dynamically according to the business type: it is set to 0.3 for real-time control business (such as industrial equipment control) and 0.5 for non-real-time business (such as data acquisition), to balance warning sensitivity against the false repair rate.
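A minimal sketch of the thresholding step follows. The text does not give the exact per-item score, so the sketch assumes it is the absolute value of each deviation component (a one-dimensional Euclidean distance); the function name and business-type keys are hypothetical.

```python
from typing import Dict, List

# Thresholds stated in the embodiment for the two business types.
THRESHOLDS: Dict[str, float] = {
    "real_time_control": 0.3,   # e.g. industrial equipment control
    "non_real_time": 0.5,       # e.g. data acquisition
}


def contaminated_items(deviation: Dict[str, float],
                       business_type: str) -> List[str]:
    """Return the names of items whose anomaly score exceeds the threshold.

    Assumption: the per-item anomaly score is abs(deviation component).
    """
    threshold = THRESHOLDS[business_type]
    return [name for name, d in deviation.items() if abs(d) > threshold]
```

With the same deviations, a real-time control process is flagged for more items than a non-real-time one, which reflects the sensitivity trade-off described above.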
[0036] S23. Generation, verification, and execution of the reverse operation instruction sequence: Based on the difference analysis results, a simplified reverse operation instruction sequence is generated, verified in a local lightweight sandbox to ensure safety, and then injected into the process. This minimizes the impact on other services and terminal devices on edge nodes. It allows the process state to be rolled back accurately to a healthy state, avoids secondary failures caused by repair operations, and meets the business continuity requirements of edge scenarios.
[0037] Based on the difference analysis results, the repair module dynamically generates a simplified sequence of reverse operation instructions according to the state change logic of the edge process. The instruction sequence size is kept within 10 KB to reduce memory consumption. For example, an unlock instruction is generated when the lock identifier is abnormal, a memory release instruction is generated when the memory block information grows abnormally, and a handle release instruction is generated for edge node-specific problems (such as terminal connection handle leakage). The instruction sequence is encapsulated in JSON format and supports atomic execution and rollback, ensuring the reliability of the repair operation. The specific process for generating the reverse operation instructions is as follows:
Step 1: Reversible modeling that fuses operation logs with state snapshots.
[0038] The system uses an injected lightweight SDK to synchronously record two types of information on the critical path of the process: (a) Semantic state snapshot: the aforementioned current semantic state set, which periodically captures crucial internal states such as process stacks, resource handles, and thread states.
[0039] (b) Lightweight operation log: records the high-level operation identifiers that lead to state changes (such as memory allocation, locking, file opening, and network sending) together with their key parameters, aligned with the timestamps of the state snapshots.
[0040] When an alert triggers a repair, the system not only compares the current state snapshot with the historical health snapshot to calculate the state deviation vector, but also analyzes the operation logs recorded since the health checkpoint. By associating each abnormal state variable with the operation record that most recently modified it, the system locates the suspected operations that caused the state contamination, providing direct evidence for generating reverse operations.
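The association between abnormal state variables and the operations that last modified them can be sketched as below. The `LogEntry` fields and the function name are hypothetical; the text only states that operation identifiers, key parameters, and timestamps are recorded.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogEntry:
    """One lightweight operation log record (field names are assumptions)."""
    timestamp: float
    operation: str        # high-level operation identifier
    params: dict          # key parameters of the operation
    modified: List[str]   # state variables changed by this operation


def suspect_operations(log: List[LogEntry],
                       abnormal_vars: List[str],
                       checkpoint_time: float) -> Dict[str, Optional[LogEntry]]:
    """For each abnormal variable, find the most recent operation recorded
    after the health checkpoint that modified it (None if there is none)."""
    suspects: Dict[str, Optional[LogEntry]] = {}
    for var in abnormal_vars:
        candidates = [e for e in log
                      if e.timestamp > checkpoint_time and var in e.modified]
        suspects[var] = (max(candidates, key=lambda e: e.timestamp)
                         if candidates else None)
    return suspects
```

Only entries newer than the health checkpoint are considered, since operations before the checkpoint are already reflected in the healthy state.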
[0041] Step 2: Ensuring the accuracy of reverse operation generation and execution.
[0042] (a) Rule-based reverse operation generation: The system maintains an extensible reverse operation rule library. Each rule corresponds to one type of reversible operation and explicitly defines its forward operation pattern, its reverse operation generation logic, the verification conditions that must hold before safe execution, and the expected state after execution. For example, for a forward operation ..., its reverse rule may generate the corresponding reverse instruction and verify, before execution, whether the address is still legitimately held by the process.
[0043] (b) Sandbox simulation verification: The generated reverse operation instruction sequence is not injected directly into the production process. In a lightweight sandbox environment local to the edge node, the system first reconstructs a simulated process from the health checkpoint and the operation logs recorded before the contaminating operation. It then executes the contaminating operation and its planned reverse operation sequentially within the sandbox, verifies that the state has returned to the healthy baseline, and checks for unexpected side effects (such as resource leaks or data inconsistencies). The reverse operation sequence is approved for execution only if the sandbox verification succeeds completely.
[0044] (c) Atomic execution and rollback contingency plan: The approved sequence of reverse operations is encapsulated as an atomic transaction. A rollback point is created before execution, and process health metrics are monitored in real time during execution. If all reverse operations succeed and the process recovers within the observation period, the repair is successful. If any step fails or the process fails to recover, a transaction rollback is triggered that returns the process state to what it was before the repair attempt; the repair is immediately marked as failed and the isolation process is triggered, preventing secondary failures caused by the repair operation.
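The rule-based generation in item (a) can be sketched as a small lookup table that emits a JSON-encapsulated instruction sequence. Everything named here is hypothetical: the operation identifiers, the rule fields, and the choice to skip (rather than abort on) an operation whose precondition fails are assumptions made for illustration. The lock and memory rules follow the unlock and memory release examples given earlier in the description.

```python
import json
from typing import Dict, List, Set

# Hypothetical rule library: forward operation -> reverse operation, the
# parameter that names the resource, and the process-state set in which the
# resource must still appear (the pre-execution verification condition).
RULES: Dict[str, Dict[str, str]] = {
    "lock": {"inverse": "unlock", "resource_key": "lock_id",
             "held_set": "held_locks"},
    "mem_alloc": {"inverse": "mem_release", "resource_key": "address",
                  "held_set": "allocated_blocks"},
}

MAX_SEQUENCE_BYTES = 10 * 1024  # size limit stated in the description


def generate_inverse(forward_ops: List[dict],
                     process_state: Dict[str, Set[int]]) -> str:
    """Build a JSON instruction sequence that undoes forward_ops."""
    instructions = []
    for op in reversed(forward_ops):          # undo in reverse order
        rule = RULES.get(op["operation"])
        if rule is None:
            raise ValueError(f"no reverse rule for {op['operation']!r}")
        resource = op["params"][rule["resource_key"]]
        # Precondition: the resource must still be held by the process.
        # Assumption: operations that fail the check are skipped.
        if resource not in process_state[rule["held_set"]]:
            continue
        instructions.append({"op": rule["inverse"],
                             rule["resource_key"]: resource})
    payload = json.dumps({"atomic": True, "instructions": instructions})
    if len(payload.encode("utf-8")) > MAX_SEQUENCE_BYTES:
        raise ValueError("instruction sequence exceeds 10 KB")
    return payload
```

Reversing the order of the forward operations keeps the undo sequence consistent with the order in which resources were acquired.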
[0045] After the sandbox verification passes, the repair module securely injects the reverse operation instruction sequence into the target process for execution and accurately rolls the internal state of the process back to the healthy state corresponding to the health checkpoint. The whole process takes very little time, which suits the rapid repair needs of edge faults.
[0046] If the repair succeeds, the repair flow ends and the edge node returns to normal operation; if it fails or times out, that event is the key signal that triggers the domain-level deterministic isolation module (hereinafter, the isolation module) to intervene and prevent the fault from continuing to affect terminal services.
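The atomic execution and rollback behavior of item (c), together with the success or failure outcome described above, can be sketched as follows. The process state is modeled as a plain dictionary and the rollback point as a deep copy, which are simplifying assumptions; a real implementation would act on a live process.

```python
import copy
from typing import Callable, Dict, List

Step = Callable[[Dict], None]


def execute_atomically(state: Dict,
                       steps: List[Step],
                       is_healthy: Callable[[Dict], bool]) -> bool:
    """Apply all reverse-operation steps or none of them.

    Returns True if every step succeeded and the health check passed.
    Returns False after restoring the rollback point otherwise; the caller
    would then publish a repair-failure event to trigger isolation.
    """
    rollback_point = copy.deepcopy(state)    # created before execution
    try:
        for step in steps:
            step(state)
        if is_healthy(state):
            return True
    except Exception:
        pass                                 # any failing step aborts the transaction
    # Roll back: restore the state as it was before the repair attempt.
    state.clear()
    state.update(rollback_point)
    return False
```

A `False` return value corresponds to the repair failure that activates the isolation module.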
[0047] Domain-level deterministic isolation module: This module serves as the ultimate safeguard for the edge computing system. Adapted to the distributed deployment of edge nodes and the ease with which faults spread across nodes, it monitors local edge alerts and repair result events. Upon repair failure or timeout, it immediately starts a local isolation process. Through fence health assessment and rapid reconstruction, it deterministically blocks the impact of the fault, preventing it from spreading to other edge nodes and terminal devices and ensuring the continuity of core edge services. As shown in Figure 4, the specific implementation is as follows:
S31. Isolation trigger and fence health assessment: The isolation module is deployed on the edge gateway or a core edge node. After activation, it quickly scans the health of all preset fault fences through dual-anchor detection, which accommodates edge network fluctuations and node heterogeneity. The health calculation uses a lightweight weighted summation, so that the assessment is accurate, reliable, and efficient and provides the basis for identifying disabled fences. The entire process is executed locally without cloud dependency.
[0048] Fault fence pre-configuration: Fault fences must be pre-configured during system deployment or edge service expansion, with standardized configuration completed through the edge control plane. The core construction logic is as follows:
1. Division rules: fences are divided by edge node cluster (grouping devices of the same region / model) and by business domain (industrial control, data acquisition, terminal access, etc.), so that nodes / businesses within a fence are highly correlated and the impact range of isolation is minimized;
2. Technical implementation: node labels are configured through the edge control plane and logical fences are formed from the label groups; software-defined networking (SDN) is used to delimit the network boundary and lock the communication range of the nodes within a fence;
3. Core configuration: each fence is pre-configured with two health detection anchor points, an internal heartbeat deployed on the core node of the fence and a boundary API probe bound to the business interaction interface; basic parameters such as anchor point weights and the probe latency threshold are also recorded for the subsequent health assessment;
4. Heterogeneous adaptation: fences composed of industrial gateways and edge servers are configured with the complete dual anchor points, while fences of lightweight devices such as IoT terminals can be simplified to a single heartbeat and a lightweight probe to reduce resource consumption.
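The label-based grouping in items 1, 2, and 4 can be sketched as below. The `Node` fields, the anchor names, and the choice to key a fence on the pair (cluster, business domain) are assumptions for illustration; SDN boundary configuration is outside the scope of the sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    node_id: str
    cluster: str               # label: region / model group
    domain: str                # label: business domain
    lightweight: bool = False  # e.g. an IoT terminal


def build_fences(nodes: List[Node]) -> Dict[Tuple[str, str], dict]:
    """Group nodes into logical fences by their (cluster, domain) labels."""
    groups: Dict[Tuple[str, str], List[Node]] = defaultdict(list)
    for node in nodes:
        groups[(node.cluster, node.domain)].append(node)

    fences = {}
    for key, members in groups.items():
        # Fences made only of lightweight devices get the simplified anchors.
        simplified = all(n.lightweight for n in members)
        fences[key] = {
            "members": [n.node_id for n in members],
            "anchors": (["heartbeat", "lightweight_probe"] if simplified
                        else ["heartbeat", "api_probe"]),
        }
    return fences
```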
[0049] Once the triggering conditions are met, the isolation module starts immediately without attempting to diagnose the specific root cause of the fault (reflecting the complexity of fault location and the short repair window at the edge). It performs a health scan on all predefined fault fences in the edge cluster in parallel. Each fence is preset with at least two independent health detection anchors, namely the heartbeat inside the edge node (sent once every 100 milliseconds) and the boundary API probe (targeting the interaction interface between the edge node and the terminal / gateway), which mitigates the loss of detection packets caused by an unstable edge network.
[0050] The isolation module collects the status of all anchor points within a very short time window and calculates the instantaneous health of each fence using the following formula: $H_i = w_1 \cdot \mathbb{I}(HB_i \text{ received within } T) + w_2 \cdot \mathbb{I}(\mathrm{latency}(API_i) \le \tau_{max})$, where:
$H_i$ is the instantaneous health of the i-th fault fence (dimensionless, range [0, 1]), with 0 representing complete failure and 1 complete health. It is defined as a weighted combination of the dual-anchor states, which compensates for the unreliability of single-anchor detection under edge network fluctuations.
$HB_i$ is the heartbeat signal inside the edge node of the i-th fence, with a transmission period of 100 milliseconds. The period is set according to the communication delay characteristics of the edge node, so that the heartbeat promptly reflects whether the node is alive without misjudging its status because of brief network fluctuations.
$API_i$ is the boundary API probe of the i-th fence, targeting the core interaction interface between the edge node and the terminal or gateway (such as an MQTT interface or an industrial Ethernet interface). The probe request period is set according to the edge service communication protocol, and the heartbeat period is consistent with the probe request period.
$\mathbb{I}(\cdot)$ is an indicator function that takes the value 1 when its condition is met and 0 otherwise; its role is to quantify the anchor point state into a computable index.
$T$ is the health assessment time window (1-3 seconds), set according to the edge fault propagation speed so that the assessment completes before the fault spreads to other nodes while leaving enough time to collect anchor point status and avoid missed detections.
$\tau_{max}$ is the maximum allowable latency threshold for API probes (100~500 milliseconds), set according to the heterogeneous characteristics of edge networks: 100~200 milliseconds for wired nodes such as industrial gateways and 300~500 milliseconds for wireless nodes such as IoT terminals.
$w_1$ and $w_2$ are the anchor point weights (empirically 0.4 to 0.6 each, summing to 1), set according to anchor point reliability: node fences (primarily based on hardware status) increase the heartbeat weight, and service fences (primarily based on communication status) increase the probe weight. In this embodiment, ...; according to actual tests on edge clusters, this weight allocation achieves a health assessment accuracy of over 93%.
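The weighted summation of the two anchor states can be sketched as a short function. The default weights of 0.5 each are placeholders chosen within the stated 0.4 to 0.6 range, not the values of the embodiment, and the parameter names are hypothetical.

```python
from typing import Optional


def fence_health(heartbeat_seen: bool,
                 probe_latency_ms: Optional[float],
                 max_latency_ms: float,
                 w_heartbeat: float = 0.5,
                 w_probe: float = 0.5) -> float:
    """Instantaneous fence health in [0, 1] from two anchor indicators.

    heartbeat_seen   -- whether a heartbeat arrived within the assessment window
    probe_latency_ms -- latency of the boundary API probe, or None if it got no reply
    max_latency_ms   -- maximum allowable probe latency for this fence
    """
    if abs(w_heartbeat + w_probe - 1.0) > 1e-9:
        raise ValueError("anchor weights must sum to 1")
    # Indicator functions: 1 when the condition holds, 0 otherwise.
    hb_ok = 1 if heartbeat_seen else 0
    probe_ok = 1 if (probe_latency_ms is not None
                     and probe_latency_ms <= max_latency_ms) else 0
    return w_heartbeat * hb_ok + w_probe * probe_ok
```

With two binary anchors the health can only take the values 0, one of the two weights, or 1, so the weights determine which single-anchor failure counts as more severe.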
[0051] S32. Disabled fence identification and isolation decision: The system identifies a disabled fence from the combination of its absolute health value and its declining trend, which suits the rapid spread of edge faults. No root cause diagnosis is needed; decisions are based solely on edge node status indicators, ensuring timely isolation and preventing faults from affecting the operation of terminal equipment.
[0052] The isolation module calculates the short-term downward trend of the health of each fence. The trend is calculated using a lightweight linear regression algorithm, with the regression window set to 3-5 evaluation periods to reduce the computational load on edge nodes. A fence is determined to be disabled if its health is below the critical health threshold and, at the same time, its trend is less than 0. The critical health threshold (0.3~0.5) is set according to the importance of the fence: 0.4~0.5 for core business fences (such as industrial control fences) and 0.3~0.4 for ordinary fences (such as data backup fences), so that failures of core business trigger isolation earlier. In this embodiment, the threshold is 0.5 for core business fences and 0.3 for ordinary fences; according to actual tests, these thresholds achieve a fault propagation blocking rate of over 89%.
S33. Execution of the minimum-impact-domain teardown protocol: The system performs three sequential atomic operations on the disabled fence: severing connections, terminating processes, and performing a clean reconstruction. This suits the limited resources of edge nodes and the need for rapid service recovery. The operation is atomic and irreversible, ensuring that the fault is completely isolated and that the new instance carries no residual risk, while minimizing the impact on other edge nodes and terminal services.
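The disabled-fence test of S32 (health below the critical threshold and a negative trend) can be sketched with an ordinary least-squares slope over the most recent evaluation periods. The function names are hypothetical, and taking the slope over evenly spaced periods is an assumption about how the linear regression is applied.

```python
from typing import Sequence


def trend_slope(history: Sequence[float]) -> float:
    """Least-squares slope of health values over evenly spaced periods."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two evaluation periods")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    numerator = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(history))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def is_disabled(history: Sequence[float], critical_threshold: float) -> bool:
    """A fence is disabled when its latest health is below the critical
    threshold and its short-term trend is negative.

    history holds the last 3-5 health values, oldest first.
    """
    return history[-1] < critical_threshold and trend_slope(history) < 0
```

Requiring both conditions means a fence that is unhealthy but recovering is not torn down, and neither is a fence that is declining but still above its threshold.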
[0053] Upon identifying a disabled fence, the isolation module immediately invokes a pre-defined recovery protocol consisting of three atomic operations executed in sequence:
(1) Connection cut-off: Using software-defined network rules or firewall rules local to the edge node, all external network connections of the edge nodes within the fence (including connections to other edge nodes, the cloud, and terminal devices) are cut off instantly, while the local cache and offline operation capability of terminal devices are preserved as a priority. The rules take effect within 100 milliseconds and cover all TCP and UDP ports, preventing the fault from spreading through the network.
(2) Process termination: A safe termination signal is sent to all abnormal processes on the edge nodes within the fence, with a waiting time of 1 to 3 seconds; any processes remaining after the timeout are forcibly terminated. SIGTERM is used first to attempt graceful termination and reduce data loss, and SIGKILL is used for forced termination to ensure that abnormal processes are cleared quickly. The CPU and memory load of the edge nodes is monitored at the same time to ensure that termination does not crash the node.
(3) Clean reconstruction: A new fence instance is quickly rebuilt from the verified clean base image stored on local read-only media of the edge gateway, avoiding the reconstruction delay of relying on a cloud image repository. Image storage uses distributed read-only storage (local cached copies on edge nodes), reconstruction takes no more than 60 seconds, image integrity is verified with the SHA-256 algorithm, and the adaptation configuration of edge nodes and terminal devices is restored automatically after reconstruction.
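Steps (2) and (3) rely on two standard mechanisms that can be sketched with the Python standard library: graceful-then-forced termination and SHA-256 verification of the base image. The function names and the 2-second default are illustrative assumptions; on POSIX systems `Popen.terminate()` sends SIGTERM and `Popen.kill()` sends SIGKILL.

```python
import hashlib
import subprocess


def terminate_gracefully(proc: subprocess.Popen,
                         grace_seconds: float = 2.0) -> str:
    """Try SIGTERM first; fall back to SIGKILL after the grace period."""
    proc.terminate()                      # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)  # description states a 1 to 3 s wait
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX
        proc.wait()
        return "killed"


def image_is_intact(image_bytes: bytes, expected_sha256: str) -> bool:
    """Check a base image against its recorded SHA-256 digest."""
    return hashlib.sha256(image_bytes).hexdigest() == expected_sha256.lower()
```

A rebuild would proceed only when `image_is_intact` returns `True` for the locally cached image.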
[0054] The clean base image adopts a layered architecture, mainly including: (a) a minimal operating system layer: customized based on the edge device hardware architecture, containing only the necessary kernel modules, system libraries and initialization processes; (b) a security and operation and maintenance foundation layer: integrating the lightweight collection agent, communication middleware and security hardening components (such as firewall rules and access control lists) required by this system; (c) a business ready layer: providing a standard runtime environment and pre-configured business framework to ensure that business applications can be quickly loaded after reconstruction.
[0055] S34. System self-reorganization and event archiving: After the new instance starts, it automatically completes edge service registration and terminal traffic switching and archives the entire event chain to local edge storage, which supports offline operation of edge nodes. The archive provides data for subsequent system optimization and closes the edge operation and maintenance loop without relying on cloud data analysis.
[0056] After the instance starts, it announces its availability to the edge control plane through the service registration mechanism preset locally at the edge. The surrounding edge nodes, load balancers and terminal devices automatically switch traffic to the new instance according to the preset resilience policy, giving priority to core terminal services. The resilience policy supports retries and failover, with a switching latency of no more than 500 milliseconds. Meanwhile, the offline data cache of the terminal devices is retained and synchronized to the edge nodes after the connection is restored.
[0057] Once the entire isolation and reconstruction process is complete, the isolation module will issue an isolation completion event, recording the isolated fence, the IDs of the involved edge nodes, the operation time, and the results. This complete event chain from alert to isolation will be archived to the local storage of the edge nodes for a storage period of no less than 90 days, and will also be periodically synchronized to the cloud for backup (only when the network is stable). The archived data will be used for subsequent edge node health analysis and model optimization, including weight coefficient adjustment, threshold optimization, and adaptation to changes in edge node heterogeneity and operating characteristics.
[0058] Furthermore, the three modules described above are deployed in a distributed manner on edge nodes and edge gateways and cooperate through a local event bus as a loosely coupled, efficient system suited to the distributed architecture of edge computing. The workflow advances as a state machine, providing a progressive defense from routine monitoring through early warning and repair to isolation, and ensuring that the edge computing system remains highly available in scenarios with limited resources, unstable networks and heterogeneous nodes.
[0059] (a) Routine monitoring: The early warning module is deployed in a lightweight manner on each edge node and continuously calculates the node efficiency coefficient. In this routine monitoring state it occupies only a small amount of node resources and does not affect core business.
(b) Warning release and repair attempt: When the early warning module determines that both the deviation of the efficiency coefficient and the trend factor of the edge node exceed their thresholds, it generates a first-level warning event and publishes it through the local event bus. The repair module, as an event subscriber, is awakened immediately, enters the repair attempt state, and starts the repair countdown; the whole response is local, with no cloud delay.
(c) Repair feedback: The repair module completes the local repair attempt within the countdown and publishes a repair result event whether it succeeds or fails; the event is received by both the early warning module and the isolation module. On success, the early warning module uses the case as positive feedback to fine-tune the local model parameters (including thresholds and weight coefficients), so that edge nodes learn on their own without unified optimization in the cloud; the flow ends and the system returns to routine monitoring. On failure or timeout, the result is the core trigger condition that activates the standby isolation module to prevent the fault from spreading to other edge nodes and terminal devices.
(d) Ultimate isolation: Once activated, the isolation module quickly identifies disabled fences from fence health and trends and performs rapid isolation and reconstruction without diagnosis, completing the entire process on the edge side. Terminal services are restored automatically after reconstruction, and an isolation completion event is then published.
(e) Closed-loop learning: The early warning module receives the isolation completion event, treats this serious failure case as important data, optimizes the sensitivity of the local early warning model, and adjusts the algorithm for edge node-specific failure modes (such as hardware overheating and network fluctuations) so that similar hidden dangers are detected earlier in the future.
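The progression in items (a) to (e) can be sketched as a small event-driven state machine. The state and event names are hypothetical labels for the stages and events described above; the sketch models only the transitions, not the event bus or the learning steps.

```python
from enum import Enum, auto


class State(Enum):
    MONITORING = auto()       # (a) routine monitoring
    REPAIR_ATTEMPT = auto()   # (b)/(c) repair attempt with countdown
    ISOLATION = auto()        # (d) ultimate isolation


# (current state, event) -> next state
TRANSITIONS = {
    (State.MONITORING, "warning"): State.REPAIR_ATTEMPT,
    (State.REPAIR_ATTEMPT, "repair_success"): State.MONITORING,
    (State.REPAIR_ATTEMPT, "repair_failed"): State.ISOLATION,
    (State.REPAIR_ATTEMPT, "repair_timeout"): State.ISOLATION,
    (State.ISOLATION, "isolation_complete"): State.MONITORING,
}


def next_state(state: State, event: str) -> State:
    """Advance the workflow; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Ignoring events that do not apply to the current state keeps the modules loosely coupled: each subscriber can receive every event on the bus and react only to the ones relevant to its stage.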
[0060] Through the detailed description of the above embodiments, the AI-based automated operation and maintenance fault prediction and proactive recovery system of the present invention constructs a comprehensive operation and maintenance fault prevention and control system through the layered and collaborative operation of three core modules. The early warning module, based on the entropy change theory, achieves early detection of latent decay; the repair module relies on reversible computing to complete accurate state repair; and the isolation module uses a fault fence mechanism to block fault propagation and quickly rebuild. The system executes entirely locally at the edge, overcoming the pain points of traditional operation and maintenance such as delayed early warning, crude repair, and inefficient isolation, while also adapting to the special needs of edge computing scenarios. This significantly improves the intelligence level of automated operation and maintenance and the stability of system operation, providing strong protection for the safe and reliable operation of edge nodes and related services.
[0061] The above formulas are all dimensionless calculations, and the preset parameters in the formulas should be set by those skilled in the art according to the actual situation.
[0062] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any other combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.
[0063] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0064] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0065] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0066] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0067] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0068] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. An automated operation and maintenance fault prediction and proactive recovery system based on artificial intelligence, characterized in that the system comprises an early health decline warning module, a state-level precise repair module, and a domain-level deterministic isolation module; the early health decline warning module is deployed locally on the edge node, collects multi-dimensional software, hardware and business data, quantifies the disorder and effective output of system operation through entropy change theory, establishes a steady-state efficiency evaluation system, and realizes early health decline warning; the state-level precise repair module subscribes to early warning events, captures the semantic state of the process based on reversible computation, generates reverse operation instructions through difference analysis, and executes precise repair locally; the domain-level deterministic isolation module monitors the repair results and, for faults that fail to be repaired, performs a health assessment based on preset fault fences, identifies the disabled fence, and performs an atomic isolation operation to achieve deterministic isolation of the fault domain and system self-reorganization.
2. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 1, characterized in that the multi-dimensional software, hardware and business data collected by the early health decline warning module include edge node hardware performance indicators, operating system kernel events, hardware environment parameters, and effective output data from the business layer. The edge node hardware performance indicators include the number of CPU cycles, the number of cache misses at each level, the number of branch prediction failures, and the number of memory access latency cycles. The operating system kernel events include the number of context switches and the interrupt request frequency. The hardware environment parameters include hardware temperature, battery power fluctuation parameters, and power supply voltage fluctuation parameters. The effective output data from the business layer include the number of successfully processed terminal requests, the number of data bytes effectively transmitted, and the local task execution completion rate.
3. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 2, characterized in that, The early health decline warning module constructs a steady-state efficiency evaluation system by calculating operation and maintenance entropy and ordered work. Operation and maintenance entropy represents the degree of disorder in the internal operation of the edge node, which is a weighted synthesis of edge node hardware performance indicators, operating system kernel events and hardware environment parameters. Ordered work represents the effective output of the edge node, which is quantified by the effective output data of the business layer. The two provide basic parameters for the calculation of steady-state efficiency coefficient.
4. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 3, characterized in that, The steady-state efficiency coefficient is the result of the correlation calculation between ordered work and operation and maintenance entropy, which is used to reflect the effective output capability of edge nodes under limited resources. The early health decline warning module establishes a health baseline based on the historical data of the edge node's health period, and jointly determines the early health decline state by calculating the deviation of the steady-state efficiency coefficient from the health baseline and the trend judgment factor in real time.
5. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 4, characterized in that the trend determination factor is obtained by calculating the ratio of the time rate of change of operation and maintenance entropy to the time rate of change of ordered work, and is used to identify hidden anomalies in which effective output has not improved but disordered loss has surged. When both the deviation and the trend determination factor meet preset conditions, a first-level early warning event containing the node identifier and core parameters is generated.
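The joint condition of claim 5 can be illustrated with the sketch below. The threshold values, the event field names, and the clamping of a flat or falling work rate to a small positive floor (so that surging entropy with stagnant output yields a large factor) are assumptions introduced for the example only.

```python
from typing import Optional

EPS = 1e-6  # assumed floor applied when the work rate is flat or falling


def trend_factor(entropy_rate: float, work_rate: float) -> float:
    """Ratio of the entropy rate of change to the ordered-work rate of change."""
    return entropy_rate / max(work_rate, EPS)


def first_level_warning(
    node_id: str,
    deviation: float,
    entropy_rate: float,
    work_rate: float,
    deviation_limit: float = 0.2,  # hypothetical preset condition
    trend_limit: float = 2.0,      # hypothetical preset condition
) -> Optional[dict]:
    """Return a first-level warning event only when both conditions hold."""
    factor = trend_factor(entropy_rate, work_rate)
    if deviation > deviation_limit and factor > trend_limit:
        return {"level": 1, "node_id": node_id, "deviation": deviation, "trend_factor": factor}
    return None


# Entropy rising while output is flat: both conditions are met.
event = first_level_warning("edge-01", deviation=0.25, entropy_rate=0.4, work_rate=0.0)
# Entropy and output rising at the same rate: the factor is 1.0, so no warning.
quiet = first_level_warning("edge-01", deviation=0.25, entropy_rate=0.4, work_rate=0.4)
```

Requiring both conditions means that a node whose efficiency has dipped while its output grows in step with its entropy does not raise a warning under these assumed thresholds.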
6. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 1, characterized in that the state-level precise repair module performs difference analysis by calculating a state deviation vector. It maps the current semantic state and the historical healthy semantic state into respective numerical feature vectors and takes the difference between the two as the state deviation vector. Based on the state deviation vector, the degree of deviation of each indicator is quantified to locate the polluted state that causes the process abnormality. The semantic state includes, but is not limited to, process stack parameters, thread state information, resource handle state, edge data interaction context, local task execution state, memory fragmentation rate, abnormal log characteristics, and hardware association state.
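A minimal sketch of the difference analysis in claim 6 follows. The four feature names, the relative-deviation measure, and the tolerance value are hypothetical; the claim only requires that both semantic states be mapped to numerical vectors and compared element by element.

```python
from typing import Dict, List, Tuple

# Hypothetical subset of the semantic state, in a fixed order.
FEATURES = ("thread_count", "open_handles", "memory_fragmentation", "error_log_rate")


def to_vector(state: Dict[str, float]) -> List[float]:
    """Map a semantic state to a numerical feature vector."""
    return [float(state[name]) for name in FEATURES]


def deviation_vector(current: Dict[str, float], healthy: Dict[str, float]) -> List[float]:
    """Element-wise difference between the current and the healthy state."""
    return [c - h for c, h in zip(to_vector(current), to_vector(healthy))]


def locate_polluted(
    current: Dict[str, float], healthy: Dict[str, float], tolerance: float = 0.5
) -> List[Tuple[str, float]]:
    """Return features whose relative deviation exceeds the tolerance, largest first."""
    flagged = []
    for name, diff, ref in zip(FEATURES, deviation_vector(current, healthy), to_vector(healthy)):
        relative = abs(diff) / max(abs(ref), 1e-9)
        if relative > tolerance:
            flagged.append((name, relative))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


healthy_state = {"thread_count": 8, "open_handles": 100, "memory_fragmentation": 0.1, "error_log_rate": 1.0}
current_state = {"thread_count": 8, "open_handles": 400, "memory_fragmentation": 0.12, "error_log_rate": 1.0}
polluted = locate_polluted(current_state, healthy_state)
```

In this example only `open_handles` is flagged, which would point a repair at the resource handle state rather than at the whole process.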
7. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 1, characterized in that the domain-level deterministic isolation module calculates the real-time health of each fault fence using a weighted summation method. The fault fence is divided according to edge node cluster attributes or business domains, forms a logical fence by grouping nodes by node label, and locks its communication range by means of software-defined networking. Each fault fence is preset with at least two independent health detection anchor points, including an internal heartbeat deployed on the core node of the fence and a boundary API probe bound to the business interaction interface. The weights are set according to the reliability of the anchor points and the fence type.
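The weighted summation of claim 7 might look like the sketch below. The `Anchor` type, the score range of 0 to 1, the weight values, and the normalization by total weight are assumptions for illustration; the check for at least two anchor points mirrors the claim.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Anchor:
    name: str      # e.g. internal heartbeat or boundary API probe
    score: float   # assumed health reading in [0, 1]
    weight: float  # assumed to reflect anchor reliability and fence type


def fence_health(anchors: Sequence[Anchor]) -> float:
    """Weighted sum of anchor scores, normalized by the total weight."""
    if len(anchors) < 2:
        raise ValueError("a fault fence needs at least two independent anchor points")
    total_weight = sum(a.weight for a in anchors)
    return sum(a.weight * a.score for a in anchors) / total_weight


anchors = [
    Anchor("internal_heartbeat", score=1.0, weight=0.75),
    Anchor("boundary_api_probe", score=0.0, weight=0.25),
]
health = fence_health(anchors)
```

With the heartbeat healthy and the boundary probe failing, these assumed weights give a fence health of 0.75.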
8. The AI-based automated operation and maintenance fault prediction and proactive recovery system according to claim 7, characterized in that the domain-level deterministic isolation module calculates the short-term downward trend of the health of the fault fence and combines it with the absolute value of the health to jointly determine a failed fence. It then performs, in sequence, the atomic operations of connection disconnection, process termination, and clean reconstruction on the failed fence. Connection disconnection blocks the network propagation path of the fault through software-defined network rules or firewall rules. Process termination uses a hierarchical termination mechanism to clear abnormal processes. Clean reconstruction rebuilds the fence instance from a verified clean base image stored locally on the edge node.
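The joint determination and the ordered isolation steps of claim 8 can be sketched as follows. The slope estimate over a short sample window, the two thresholds, and the three placeholder step functions are hypothetical; a real system would replace the placeholders with software-defined network rules, hierarchical process termination, and a rebuild from the local clean base image.

```python
from typing import Callable, List, Sequence


def short_term_slope(samples: Sequence[float]) -> float:
    """Average change per sample across a short window of fence health readings."""
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def is_failed(samples: Sequence[float], floor: float = 0.5, max_slope: float = -0.1) -> bool:
    """A fence is failed only if its health is low AND it is dropping quickly."""
    return samples[-1] < floor and short_term_slope(samples) <= max_slope


log: List[str] = []

# Placeholder steps that only record what a real implementation would do.
def disconnect(fence_id: str) -> None:
    log.append(f"disconnect:{fence_id}")

def terminate(fence_id: str) -> None:
    log.append(f"terminate:{fence_id}")

def rebuild(fence_id: str) -> None:
    log.append(f"rebuild:{fence_id}")


def isolate(fence_id: str, steps: Sequence[Callable[[str], None]]) -> None:
    """Run the isolation steps strictly in the given order."""
    for step in steps:
        step(fence_id)


recent_health = [0.875, 0.625, 0.375]
if is_failed(recent_health):
    isolate("fence-a", [disconnect, terminate, rebuild])
```

Combining the absolute value with the trend keeps a fence that is low but stable, such as one reading 0.4 throughout the window, from being torn down under these assumed thresholds.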