A device failure review analysis method and system

By performing time consistency verification and rearranging on multi-source fault data, generating fault evidence packages and reconstructing fault event chains, the problem of difficulty in analyzing multi-source data under a unified time benchmark in existing technologies is solved, the traceability and verifiability of fault evidence are realized, and the interpretability and credibility of fault root cause conclusions are improved.

CN122240380A · Pending · Publication Date: 2026-06-19 · CHENGDU MINGSHUYINGHE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHENGDU MINGSHUYINGHE TECHNOLOGY CO LTD
Filing Date
2026-04-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In the retrospective analysis of industrial equipment failures, existing technologies suffer from clock drift, sampling period differences, and out-of-order issues in multi-source data, making it difficult to form a traceable and verifiable chain of failure evidence under a unified time reference. Furthermore, the lack of structured solidification and reliable quantification verification mechanisms affects the accuracy and interpretability of failure root cause conclusions.

Method used

By obtaining the time point of the failure, collecting multi-source failure data and performing time consistency verification, including drift correction and out-of-order rearrangement, a failure evidence package is generated. A control group and an experimental group playback environment are constructed for differential comparison. The failure event chain is reconstructed by combining the equipment topology relationship, and a credibility score and integrity verification summary are generated.

Benefits of technology

It enables comparable and correlated analysis of multi-source fault data under a unified time benchmark, improves the stability and repeatability of fault retrospective analysis, ensures the traceability and verifiability of fault evidence, reduces the risk of misjudgment, and improves the interpretability and credibility of fault root cause conclusions.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240380A_ABST

Abstract

This invention relates to the field of industrial equipment fault diagnosis and maintenance analysis, and discloses a method and system for equipment fault retrospective analysis. The method includes: responding to fault trigger events to obtain the fault occurrence time point and determining the fault retrospective time window; collecting multi-source fault data within the time window; performing time consistency verification and implementing drift correction and out-of-order rearrangement to obtain consistent multi-source data; generating a fault evidence package based on the multi-source data, the evidence package containing a configuration snapshot, consistency verification records, and integrity verification summary; performing event tuple generation on the evidence package and reconstructing the fault event chain to generate root cause candidates and evidence references; generating a replay script, isolating or simulating external dependency interfaces in the control and experimental group replay environments, executing replay and differential comparison, and outputting the fault root cause and confidence level. The system includes modules for retrospective window, data acquisition, verification, evidence package, event chain, root cause reasoning, differential replay, differential verification, and archiving.

Description

Technical Field

[0001] This invention relates to the field of industrial equipment fault diagnosis and operation and maintenance analysis, and specifically to a method and system for equipment fault retrospective analysis.

Background Technology

[0002] During the operation of industrial automation equipment, electromechanical equipment, and production lines, equipment failures often manifest as alarm triggering, abnormal parameters, control mode switching, actuator response lag, or shutdown protection. After a failure occurs, it is usually necessary to review and debrief the events before and after the failure to clarify the cause of the anomaly, the propagation path, and the effectiveness of the handling, and to provide a basis for subsequent operation and maintenance strategies and parameter tuning.

[0003] Current retrospective analyses primarily rely on alarm and event logs, controller status records, and partial sensor curves for retrieval and playback. Some platforms support trend overlay, alarm correlation, work order records, and report summaries. Other solutions involve exporting multi-source data offline and manually aligning and comparing it, or replaying key inputs in a simulation environment based on experience to verify a hypothesis. Regarding configuration and version information, the common practice is to only record the current version or add it retrospectively, making it difficult to maintain consistency with the time of the failure.

[0004] Multi-source data suffers from clock drift, sampling period differences, missing and out-of-order records, and inconsistent data quality. In addition, the review process lacks mechanisms for structured preservation of fault evidence, quantified credibility, and reproducible comparative verification. As a result, it is difficult to form a traceable and verifiable chain of fault evidence, and to stably verify the root cause conclusion, under a unified time benchmark.

Summary of the Invention

[0005] In view of the shortcomings of the prior art, the present invention provides a method and system for reviewing and analyzing equipment failures to solve the technical problems existing in the prior art.

[0006] The above technical objective of the present invention is achieved through the following technical solution. A method for retrospective analysis of equipment failures includes the following steps:
S1. Respond to the fault triggering event of the target device, obtain the fault occurrence time point, and determine the fault review time window based on the fault occurrence time point, the preset pre-fault review duration, and the preset post-fault review duration.
S2. Collect multi-source fault data within the fault review time window. The multi-source fault data includes at least two of the following types of data: sensor time series data, controller status data, actuator feedback data, alarm and event log data, and configuration parameters and version information.
S3. Perform time consistency verification on the multi-source fault data, and perform drift correction and out-of-order rearrangement on the multi-source fault data based on the verification results to obtain the multi-source fault data after consistency processing.
S4. Generate a fault evidence package based on the multi-source fault data after consistency processing.
S5. Perform event tuple processing on the fault evidence package to generate an event tuple set, and reconstruct the fault event chain based on the event tuple set.
S6. Generate a root cause candidate set based on the fault event chain, and associate each root cause candidate with corresponding evidence reference information.
S7. Generate a playback script based on the fault evidence package, construct the control group playback environment and the experimental group playback environment, and execute the playback script in each environment to obtain the control playback results and the experimental playback results.
S8. Compare the control playback results with the experimental playback results to obtain the differential index, determine the root cause of the failure and the corresponding confidence level based on the differential index, and output the failure review analysis results.

[0007] Preferably, the time consistency verification includes:
extracting anchor events, which include at least one of alarm trigger events, control command issuance events, and state variable transition events;
aligning the timestamps of data from different sources based on the anchor events, and resampling data with different sampling periods so that the multi-source data can be compared under a unified time reference;
calculating timestamp deviations, performing drift correction on timestamps that deviate, rearranging out-of-order data in time order, and marking missing data with missing markers; and
writing the alignment results, drift correction results, and missing markers into the consistency verification report area to form a traceable verification record.

[0008] Preferably, the fault evidence package includes an original data area, a derived feature area, a configuration snapshot area, and a consistency verification report area, wherein: the configuration snapshot area includes at least two of the following: device model information, firmware version information, key parameter configuration items, and device topology relationship information; the derived feature area includes one or more of the following: rate of change features, mutation point features, threshold crossing features, statistical aggregation features, and frequency domain features; and the derived features are generated from the multi-source fault data after consistency processing. After the fault evidence package is generated, an integrity verification digest is further generated to verify the integrity of the fault evidence package content.

[0009] Preferably, the fault evidence package further includes a credibility score area, which is used to generate a credibility score for each data source. The credibility score is determined by at least two of the following indicators: data integrity indicator, time consistency indicator, conflict consistency indicator, and noise level indicator. When a data conflict is detected, the conflict type and conflict location are recorded. The conflict type includes at least one of numerical conflict, state conflict, and sequence conflict, and the record is used for evidence screening when generating the root cause candidate set.

[0010] Preferably, the event tuple includes a device identifier, a component identifier or a signal identifier, an event type, an event value, a timestamp, and a credibility score. Event tuple processing includes: converting continuous sampled data into state change events, threshold overrun events, or mutation events; converting control commands and actuator feedback into control action events and response confirmation events; and supplementing the event tuples with data source identifiers and sampling period identifiers to record the basis on which each event tuple was generated.
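The conversion of continuous samples into threshold overrun events can be illustrated with a minimal Python sketch. The `EventTuple` class, the `threshold_events` function, and the `threshold_return` event type are hypothetical names introduced here for illustration; the source only lists the tuple fields and the event categories.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class EventTuple:
    # Fields follow the tuple described in the text; names are illustrative.
    device_id: str
    signal_id: str
    event_type: str
    value: float
    timestamp: float
    credibility: float
    source_id: str          # supplementary: which data source produced it
    sampling_period: float  # supplementary: sampling period of that source


def threshold_events(samples: Sequence[Tuple[float, float]], limit: float, *,
                     device_id: str, signal_id: str, credibility: float,
                     source_id: str, sampling_period: float) -> List[EventTuple]:
    """Emit one event each time the signal crosses `limit` in either direction."""
    events: List[EventTuple] = []
    above = False
    for timestamp, value in samples:
        now_above = value > limit
        if now_above != above:
            # "threshold_return" is an assumed label for the downward crossing.
            kind = "threshold_overrun" if now_above else "threshold_return"
            events.append(EventTuple(device_id, signal_id, kind, value, timestamp,
                                     credibility, source_id, sampling_period))
            above = now_above
    return events


events = threshold_events(
    [(0.0, 1.0), (1.0, 5.0), (2.0, 6.0), (3.0, 2.0)], limit=4.0,
    device_id="pump-01", signal_id="pressure", credibility=0.9,
    source_id="sensor-A", sampling_period=1.0)
```

Only the crossings become events, so a long stretch of samples above the limit collapses into a single overrun event followed by a single return event.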

[0011] Preferably, the reconstruction of the fault event chain includes:
generating an event sequence based on the temporal order of the event tuples;
determining causal constraints between events based on the equipment topology, control dependencies, component associations, or signal associations, where the causal constraints include two or more of the following: component connection constraints, control link constraints, and alarm association constraints;
constructing a directed event graph based on the event sequence and the causal constraints, and extracting the critical path from the directed event graph as the fault event chain; and
marking the key turning points in the fault event chain, which include at least one of the following: state change, alarm escalation, and control mode switching.
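As a rough illustration of building a directed event graph and extracting a critical path, the Python sketch below makes two assumptions that the source does not state: an edge is allowed only from an earlier event to a later event whose components are joined by a permitted link, and the "critical path" is taken to be the longest chain of events. The function name `critical_chain` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Event = Tuple[str, str, float]  # (event id, component id, timestamp)


def critical_chain(events: Sequence[Event],
                   links: Set[Tuple[str, str]]) -> List[str]:
    """Return the longest chain of events consistent with time order and links."""
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e[2])
    best_len: Dict[str, int] = {}
    prev: Dict[str, Optional[str]] = {}
    for i, (eid, comp, ts) in enumerate(ordered):
        best_len[eid] = 1
        prev[eid] = None
        for pid, pcomp, pts in ordered[:i]:
            # Causal constraint: strictly earlier and connected by a permitted link.
            if pts < ts and (pcomp, comp) in links and best_len[pid] + 1 > best_len[eid]:
                best_len[eid] = best_len[pid] + 1
                prev[eid] = pid
    # Walk back from the event that ends the longest chain.
    node: Optional[str] = max(ordered, key=lambda e: best_len[e[0]])[0]
    chain: List[str] = []
    while node is not None:
        chain.append(node)
        node = prev[node]
    return chain[::-1]


chain = critical_chain(
    [("e1", "pump", 1.0), ("e2", "valve", 2.0),
     ("e3", "sensor", 2.5), ("e4", "controller", 3.0)],
    links={("pump", "valve"), ("valve", "controller"), ("pump", "sensor")})
```

A real implementation would likely weight edges by credibility scores or alarm severity; the unweighted longest chain is used here only to keep the idea visible.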

[0012] Preferably, the generation of the root cause candidate set includes:
identifying a set of suspected components based on the key turning-point events in the fault event chain, and screening root cause candidates in combination with equipment topology path constraints;
calculating similarity against a historical fault case database and outputting the candidate ranking results, where the similarity is based on one or more of the following: event sequence similarity, alarm sequence similarity, and key state feature similarity; and
generating a list of evidence references for each root cause candidate, where the list includes the corresponding event tuple identifiers, the locations of the original data fragments, and the configuration snapshot items.

[0013] Preferably, the construction of the control group playback environment and the experimental group playback environment includes:
loading, in an isolated execution space, the baseline configuration version corresponding to the time of the failure to build the control group playback environment, and loading the modified repair configuration version to build the experimental group playback environment;
using the fault evidence package as a unified input to drive the same playback script in both the control group playback environment and the experimental group playback environment; and
isolating or simulating external dependency interfaces during playback, which includes redirecting external dependency interface requests to a simulation service or returning preset response data through an interface adaptation layer, to ensure that playback execution is deterministic and reproducible.

[0014] Preferably, the differential index includes one or more of the following: critical state quantity deviation index, state transition sequence difference index, alarm trigger sequence difference index, abnormal duration difference index, and recovery time difference index. Determining the root cause of the failure and its corresponding confidence level based on the differential index includes: updating the confidence level of each root cause candidate according to how consistent the credibility scores associated with that candidate are with the differential index; and, when the experimental playback results meet the preset differential judgment conditions relative to the control playback results, increasing the confidence level of the corresponding root cause candidate and outputting it as the root cause of the failure. The preset differential judgment conditions include one or more of the following: the difference in recovery time reaches a preset threshold, the difference in alarm trigger sequence meets the preset sequence constraints, and the difference in abnormal duration reaches a preset threshold. Outputting the fault review analysis results includes generating a review report, which includes a fault timeline, a list of evidence references, root cause conclusions, differential indices, handling suggestions, and verification conclusions. The fault evidence package, control playback results, experimental playback results, and review report are associated and archived, and a traceable task identifier is generated.
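A minimal Python sketch of two of the listed indices and a judgment check follows. The `ReplayResult` structure, the sign convention (control minus experiment), and the "alarms cleared" stand-in for the alarm trigger sequence difference are assumptions made for illustration; the source does not define how each index is computed or how confidence is updated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReplayResult:
    recovery_time: float      # seconds from fault to recovery
    abnormal_duration: float  # seconds the monitored state stayed abnormal
    alarms: Sequence[str]     # alarm codes in trigger order


def difference_indices(control: ReplayResult, experiment: ReplayResult) -> dict:
    """Positive values mean the experimental (repaired) run behaved better."""
    return {
        "recovery_time_diff": control.recovery_time - experiment.recovery_time,
        "abnormal_duration_diff": control.abnormal_duration - experiment.abnormal_duration,
        # Assumed proxy for the alarm sequence difference: alarms that no longer fire.
        "alarms_cleared": [a for a in control.alarms if a not in experiment.alarms],
    }


def meets_conditions(indices: dict, recovery_threshold: float,
                     duration_threshold: float) -> bool:
    """True when at least one of the preset judgment conditions is satisfied."""
    return (indices["recovery_time_diff"] >= recovery_threshold
            or indices["abnormal_duration_diff"] >= duration_threshold)


control = ReplayResult(120.0, 90.0, ["A1", "A2"])
experiment = ReplayResult(30.0, 10.0, ["A1"])
indices = difference_indices(control, experiment)
supported = meets_conditions(indices, recovery_threshold=60.0, duration_threshold=60.0)
```

When `supported` is true, the candidate whose repair configuration was loaded into the experimental environment would have its confidence raised; the size of that increase is left open here because the source does not specify it.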

[0015] A device failure review and analysis system, comprising:
a review window management module, used to respond to the fault triggering event of the target device, obtain the fault occurrence time point, and determine the fault review time window based on the fault occurrence time point, the preset pre-fault review duration, and the preset post-fault review duration;
a data acquisition module, used to collect multi-source fault data within the fault review time window;
a consistency verification module, used to perform time consistency verification on the multi-source fault data and, based on the verification results, perform drift correction, out-of-order rearrangement, and missing marker processing on the multi-source fault data to obtain consistent multi-source fault data;
a fault evidence package generation module, used to generate a fault evidence package based on the multi-source fault data after consistency processing, where the fault evidence package includes the original data area, the derived feature area, the configuration snapshot area, and the consistency verification report area, and to generate an integrity verification summary;
an event chain reconstruction module, used to process the fault evidence package into event tuples to generate an event tuple set, and to reconstruct the fault event chain based on the device topology and control dependencies;
a root cause reasoning module, used to generate a root cause candidate set based on the fault event chain and to associate evidence reference information and a confidence level with each root cause candidate;
a differential playback module, used to generate playback scripts based on the fault evidence package, construct the control group playback environment and the experimental group playback environment, execute the playback in each environment to obtain the control playback results and the experimental playback results, and isolate or simulate external dependency interfaces;
a differential verification module, used to compare the control playback results with the experimental playback results to obtain differential indices, update the root cause candidate confidence based on the differential indices, and output the root cause of the failure; and
a report archiving module, used to generate review reports, associate and archive the fault evidence package, control playback results, and experimental playback results with the review report, and generate a traceable task identifier.

[0016] In summary, the present invention has the following main beneficial effects: This application determines a fault review time window based on the fault occurrence time and collects at least two types of data within the time window, including sensor timing data, controller status data, actuator feedback data, alarm and event log data, and configuration parameters and version information. This achieves full-link coverage of the fault cause, fault trigger, and fault evolution process. By performing time consistency verification based on anchor events and performing drift correction, out-of-order rearrangement, resampling, and missing data marking on multi-source data, it achieves the effect of cross-data source comparability and correlation analysis under a unified time benchmark. This reduces the risk of misjudgment caused by clock drift, sampling period differences, or data out-of-order, and improves the stability and repeatability of fault review analysis.

[0017] Compared with existing technologies, this application constructs a fault evidence package by structurally encapsulating the multi-source fault data after consistency processing. The fault evidence package includes a configuration snapshot area, a consistency verification report area, a credibility score area, and an integrity verification summary. This transforms fault-related evidence from a temporary data display into a traceable, verifiable, and reproducible evidence carrier. By recording conflict types and conflict locations and generating a credibility score for each data source, the application allows evidence quality to be evaluated quantitatively and screened. High-credibility evidence can then be used preferentially, and low-credibility evidence down-weighted, during the root cause reasoning and differential verification stages, which improves the interpretability, reproducibility, and engineering credibility of the review conclusions.

[0018] Compared with existing technologies, this application constructs causal constraints between events by converting multi-source data into event tuples and combining them with the device topology and control dependencies. The fault event chain is thus reconstructed under the constraints of topology and control links, which gives the generation of root cause candidates a clear structural and logical basis. By constructing a control group playback environment and an experimental group playback environment, executing the same playback script in an isolated execution space, and isolating or simulating external dependency interfaces, the application achieves differential verification of the repair configuration version under the same evidence input conditions. This forms a closed loop of reasoning, playback, and differential verification, makes the output fault root causes and confidence levels verifiable, and reduces the uncertainty of conclusions drawn from experience alone.

Attached Figure Description

[0019] Figure 1 is a flowchart of the method of the present invention.

Detailed Implementation

[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0021] Example 1: Referring to Figure 1, a method for retrospective analysis of equipment failures includes the following steps:
S1. Respond to the fault triggering event of the target device, obtain the fault occurrence time point, and determine the fault review time window based on the fault occurrence time point and the preset pre-fault review duration and post-fault review duration.
S2. Collect multi-source fault data within the fault review time window. The multi-source fault data shall include at least two of the following types of data: sensor time series data, controller status data, actuator feedback data, alarm and event log data, and configuration parameters and version information.
S3. Perform time consistency verification on the multi-source fault data, and perform drift correction and out-of-order rearrangement on the multi-source fault data based on the verification results to obtain the multi-source fault data after consistency processing.
S4. Generate a fault evidence package based on the multi-source fault data after consistency processing.
S5. Perform event tuple processing on the fault evidence package to generate an event tuple set, and reconstruct the fault event chain based on the event tuple set.
S6. Generate a root cause candidate set based on the fault event chain, and associate each root cause candidate with corresponding evidence citation information.
S7. Generate a replay script based on the fault evidence package, construct the control group replay environment and the experimental group replay environment, and execute the replay script respectively to obtain the control replay results and the experimental replay results.
S8. Compare the control playback results with the experimental playback results to obtain the difference index, determine the root cause of the failure and the corresponding confidence level based on the difference index, and output the failure review analysis results.

[0022] This method is applied to an equipment failure review and analysis system to review, reproduce, and verify the processes before and after a failure in a target device, and outputs differentially verified root causes and review conclusions. The target device can be industrial automation equipment, electromechanical equipment, production line equipment, pump and valve units, conveying equipment, robotic equipment, or other intelligent equipment with controllers and sensors. The equipment failure review and analysis system can be deployed on an edge gateway, plant server, or cloud platform, supporting failure review and analysis tasks for single devices or clusters of similar devices.

[0023] S1. Respond to the fault triggering event of the target device, obtain the time point of the fault occurrence, and determine the fault review time window based on the time point of the fault occurrence and the preset pre-fault review duration and post-fault review duration.

[0024] In a specific implementation, the fault triggering event includes at least one of the following triggering methods:
1) Alarm trigger: the controller reports a fault code or the alarm level reaches the preset level;
2) Threshold trigger: a critical state variable exceeds its limit and the excursion persists for a preset duration;
3) State trigger: the actuator feedback is inconsistent with the control command, or the control mode switches abnormally;
4) Manual trigger: operation and maintenance personnel initiate a review and analysis task on the maintenance platform.
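The threshold trigger (limit exceeded and held for a preset duration) can be sketched in Python as follows. The function name `detect_threshold_trigger` is hypothetical, and taking the start of the excursion as the fault occurrence time point is an assumption; the source does not say which instant within the excursion is used.

```python
from typing import Optional, Sequence, Tuple


def detect_threshold_trigger(samples: Sequence[Tuple[float, float]],
                             limit: float,
                             hold_seconds: float) -> Optional[float]:
    """Return the start time of the first excursion above `limit` that
    lasts at least `hold_seconds`, or None if no such excursion exists."""
    run_start: Optional[float] = None
    for timestamp, value in samples:
        if value > limit:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= hold_seconds:
                return run_start  # assumed fault occurrence time point
        else:
            run_start = None  # excursion ended before the hold time elapsed
    return None


samples = [(0.0, 1.0), (1.0, 5.0), (2.0, 6.0), (3.0, 7.0), (4.0, 2.0)]
fault_time = detect_threshold_trigger(samples, limit=4.0, hold_seconds=2.0)
```

Requiring the excursion to persist filters out single-sample spikes that would otherwise start a review task unnecessarily.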

[0025] The preset pre-fault review duration and post-fault review duration can be configured according to the equipment type and fault characteristics. For example, the pre-fault review duration is used to cover the accumulation process of fault causes, and the post-fault review duration is used to cover the protection action, recovery process, and fault propagation process. Once the fault review time window is determined, it serves as the boundary condition for subsequent multi-source fault data acquisition and analysis.
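As a small illustration of the window calculation in S1, the Python sketch below derives the review window from the fault occurrence time and the two preset durations. The names `ReviewWindow` and `build_review_window`, and the example durations, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ReviewWindow:
    start: datetime
    end: datetime

    def contains(self, ts: datetime) -> bool:
        """Boundary check used when collecting data for the review."""
        return self.start <= ts <= self.end


def build_review_window(fault_time: datetime,
                        pre_fault: timedelta,
                        post_fault: timedelta) -> ReviewWindow:
    if pre_fault < timedelta(0) or post_fault < timedelta(0):
        raise ValueError("review durations must be non-negative")
    return ReviewWindow(fault_time - pre_fault, fault_time + post_fault)


window = build_review_window(datetime(2026, 4, 21, 10, 0, 0),
                             pre_fault=timedelta(minutes=30),
                             post_fault=timedelta(minutes=10))
```

In practice the two durations would be looked up per equipment type, for example a longer pre-fault duration for slowly accumulating thermal faults.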

[0026] S2. Collect multi-source fault data within the fault review time window. The multi-source fault data includes at least two of the following types of data: sensor timing data, controller status data, actuator feedback data, alarm and event log data, and configuration parameters and version information.

[0027] In a specific implementation, the collected multi-source fault data includes at least:
Sensor time series data: continuously sampled data such as temperature, pressure, vibration, current, voltage, displacement, and rotational speed;
Controller status data: operating phase, control mode, status word, control loop open/closed status, protection enable status, etc.;
Actuator feedback data: valve position feedback, drive current feedback, position confirmation signal, switching quantity feedback, etc.;
Alarm and event log data: alarm code, alarm level, alarm start and end time, event text, alarm-related fields, etc.;
Configuration parameters and version information: device model, firmware version, threshold configuration, control policy version, key parameter configuration items, device topology, etc.

[0028] To support subsequent evidence citation and traceability, each piece of collected data includes at least a data source identifier, a sampling period identifier, and a collection channel identifier, and records the start and end times of the collection. For configuration parameters and version information, a configuration snapshot corresponding to the time of the failure is generated during collection to avoid configuration changes after the failure affecting playback consistency.

[0029] S3. Perform time consistency verification on the multi-source fault data, and perform drift correction and out-of-order rearrangement on the multi-source fault data based on the verification results to obtain the multi-source fault data after consistency processing.

[0030] This step is used to resolve issues such as inconsistent timestamps from multiple data sources, clock drift, different sampling periods, out-of-order data, and missing data, ensuring that subsequent event chain construction and differential playback are based on a unified time base.

[0031] 1) Anchor event extraction and alignment: Anchor events include at least one of the following: alarm trigger events, control command issuance events, and state variable transition events.

[0032] In one specific implementation, the system extracts alarm trigger timestamps from alarm and event log data, control action timestamps from controller status data, and transition edge timestamps from sensor timing data and actuator feedback data, which are used as anchor event sets. The system selects a reference time base, such as the platform's unified time or controller time, and maps timestamps from other data sources to the reference time base.

[0033] 2) Drift correction model and parameter estimation: When a data source experiences clock drift, the original timestamps of that data source are linearly corrected. Let the original timestamp be $t$ and the corrected timestamp be $t'$; then:

$$t' = a\,t + b$$

where $a$ is the time scale coefficient, which describes how fast or slow the clock runs, and $b$ is the time offset, which describes the overall shift.

[0034] Anchor event alignment yields the set of pairs $\{(t_i, \tau_i)\}_{i=1}^{N}$, where $t_i$ is the original timestamp of the $i$-th anchor event in this data source and $\tau_i$ is the timestamp of the corresponding anchor event under the reference time base. The parameters $a$ and $b$ are obtained by least squares estimation:

[0035] $$(\hat{a}, \hat{b}) = \arg\min_{a,\,b} \sum_{i=1}^{N} \left(a\,t_i + b - \tau_i\right)^2$$

[0036] where $N$ is the number of anchor event pairs.

[0037] When the number of anchor event pairs is insufficient for a stable estimate, the system employs a degenerate strategy: let $a = 1$ and use the median of the time differences $\tau_i - t_i$ between anchor events as $b$. This ensures that the method remains practicable even when the number of anchor points is insufficient.
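The linear drift model and its fallback can be sketched in Python as below, writing the correction as corrected = a * original + b with scale coefficient `a` and offset `b`. The function name `estimate_drift` and the `min_pairs` cutoff of 3 are assumptions; the source does not state how many anchor pairs count as "insufficient".

```python
from statistics import median
from typing import Sequence, Tuple


def estimate_drift(pairs: Sequence[Tuple[float, float]],
                   min_pairs: int = 3) -> Tuple[float, float]:
    """Estimate (a, b) from (original, reference) anchor timestamp pairs.

    Uses ordinary least squares; falls back to a = 1 and the median time
    difference when there are too few pairs or no spread in the originals.
    """
    if not pairs:
        raise ValueError("at least one anchor pair is required")
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    if n < min_pairs or var_t == 0.0:
        # Degenerate strategy: pure offset, robust to a single bad anchor.
        return 1.0, median(r - t for t, r in pairs)
    cov = sum((t - mean_t) * (r - mean_r) for t, r in pairs)
    a = cov / var_t
    b = mean_r - a * mean_t
    return a, b


def correct_timestamp(t: float, a: float, b: float) -> float:
    return a * t + b


a, b = estimate_drift([(0.0, 2.0), (10.0, 12.01), (20.0, 22.02)])
a_fallback, b_fallback = estimate_drift([(0.0, 2.0), (10.0, 12.5)])
```

The fitted parameters would be stored per data source in the consistency verification report so that every corrected timestamp can be traced back to its original value.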

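[0038] 3) Resampling and out-of-order rearrangement: To enable comparison of multi-source data under a unified time reference, data with different sampling periods are resampled. For a continuous signal $x(t)$, the resampled value at the target time $t^{*}$ is obtained by linear interpolation:

$$x(t^{*}) = x(t_k) + \frac{x(t_{k+1}) - x(t_k)}{t_{k+1} - t_k}\,(t^{*} - t_k)$$

where $t_k \le t^{*} \le t_{k+1}$, and $t_k$ and $t_{k+1}$ are the adjacent sample times that bracket $t^{*}$.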

[0039] The disordered data is sorted and rearranged according to the corrected timestamps, and the mapping relationship between the original sequence number and the rearranged sequence number is recorded to support the subsequent citation and location of evidence.
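The rearrangement with its index mapping, followed by linear-interpolation resampling, might look like the Python sketch below. The function names and the sample values are hypothetical; the timestamps are assumed to be already drift-corrected.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def reorder(timestamps: Sequence[float]) -> List[int]:
    """Original indices listed in corrected-time order (stable for ties)."""
    return sorted(range(len(timestamps)), key=lambda i: timestamps[i])


def resample_linear(times: Sequence[float], values: Sequence[float],
                    target: float) -> float:
    """Linearly interpolate a time-sorted signal at `target`."""
    if not times or not times[0] <= target <= times[-1]:
        raise ValueError("target time is outside the sampled range")
    k = bisect_right(times, target) - 1
    if k == len(times) - 1:
        return values[-1]
    t0, t1 = times[k], times[k + 1]
    x0, x1 = values[k], values[k + 1]
    return x0 + (x1 - x0) * (target - t0) / (t1 - t0)


raw_times = [3.0, 1.0, 2.0]      # arrived out of order
raw_values = [30.0, 10.0, 20.0]

order = reorder(raw_times)
# Mapping from original sequence number to rearranged sequence number.
mapping: Dict[int, int] = {orig: new for new, orig in enumerate(order)}
times = [raw_times[i] for i in order]
values = [raw_values[i] for i in order]
resampled = resample_linear(times, values, 2.5)
```

Keeping `mapping` alongside the sorted data is what lets a later evidence reference point back to the exact record as it was originally received.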

[0040] 4) Missing markers and consistency verification report: Missing markers are generated for missing segments. Each marker includes at least the start and end times of the missing segment, the identifier of the data source with missing data, and the reason for the gap. The alignment results, drift correction results, and missing markers are written into the consistency verification report area to form a traceable verification record.
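One simple way to generate missing markers is to flag any gap between consecutive samples that is clearly longer than the expected sampling period. The Python sketch below assumes this gap rule; the `MissingMarker` class, the `tolerance` factor of 1.5, and the default reason text are hypothetical choices, not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MissingMarker:
    source_id: str
    start: float   # last sample time before the gap
    end: float     # first sample time after the gap
    reason: str


def mark_missing(source_id: str, times: Sequence[float], period: float,
                 tolerance: float = 1.5,
                 reason: str = "gap exceeds expected sampling period") -> List[MissingMarker]:
    """Flag every interval between consecutive samples longer than tolerance * period."""
    markers: List[MissingMarker] = []
    for previous, current in zip(times, times[1:]):
        if current - previous > tolerance * period:
            markers.append(MissingMarker(source_id, previous, current, reason))
    return markers


markers = mark_missing("sensor-A", [0.0, 1.0, 2.0, 5.0, 6.0], period=1.0)
```

Marking the gap rather than silently interpolating across it keeps later stages from treating interpolated values as measured evidence.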

[0041] The multi-source fault data obtained after the above processing is consistent and serves as the unified data basis for subsequent fault evidence package generation and event chain reconstruction.

[0042] S4. Generate a fault evidence package based on the multi-source fault data after consistency processing.

[0043] In one specific implementation, the fault evidence package includes at least the original data area, the derived feature area, the configuration snapshot area, and the consistency verification report area, and further includes a credibility score area, while generating an integrity verification summary.

[0044] 1) Data area division of the fault evidence package:
Raw data area: stores the multi-source fault data after consistency processing, and records the data source identifier, sampling period identifier, acquisition channel identifier, and data segment location information;
Derived feature area: stores features generated from the multi-source fault data after consistency processing;
Configuration snapshot area: stores the configuration snapshot corresponding to the time the failure occurred;
Consistency verification report area: stores time alignment, drift correction, missing marker, and resampling records;
Credibility score area: records the credibility score, conflict information, and conflict locations of each data source.

[0045] The configuration snapshot area includes at least two of the following: device model information, firmware version information, key parameter configuration items, and device topology relationship information, to ensure that the playback environment construction and causal constraint generation can be reproduced.

[0046] 2) Derived feature generation rules: The derived feature area includes one or more of the following: rate of change features, abrupt change features, threshold crossing features, statistical aggregation features, and frequency domain features.

[0047] For example, rate of change features are used to reflect signal change trends, abrupt change features are used to locate the time of state abrupt changes, threshold out-of-bounds features are used to determine abnormal intervals, statistical aggregation features are used to describe the interval mean and variance, and frequency domain features are used to reflect the spectral changes of vibration signals. Derived features are all generated from multi-source fault data after consistency processing to maintain input consistency.

[0048] 3) Credibility rating and conflict record: The credibility scoring area is used to generate credibility scores for each data source. Credibility scores are determined by at least two of the following: data integrity metrics, time consistency metrics, conflict consistency metrics, and noise level metrics.

[0049] In one specific implementation, the credibility score is obtained by weighted fusion:

$$S = w_1 I + w_2 T + w_3 (1 - C) + w_4 (1 - N)$$

where $S$ is the credibility score; $I$ is the data integrity indicator, with a value from 0 to 1; $T$ is the time consistency indicator, with a value from 0 to 1; $C$ is the conflict consistency indicator, with a value from 0 to 1, where larger values indicate more severe conflicts; $N$ is the noise level indicator, with a value from 0 to 1, where larger values indicate higher noise levels; and $w_1$, $w_2$, $w_3$, $w_4$ are the weighting coefficients, satisfying $w_1 + w_2 + w_3 + w_4 = 1$.
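The weighted fusion can be sketched as follows. This is an illustration under stated assumptions: the conflict and noise indicators are assumed to enter inverted (as one minus the indicator) because larger values of those two indicators mean worse data, and the default weights are arbitrary example values.

```python
def credibility_score(integrity, time_consistency, conflict, noise,
                      weights=(0.4, 0.3, 0.2, 0.1)):
    """Fuse four indicators in [0, 1] into one credibility score in [0, 1]."""
    indicators = (integrity, time_consistency, conflict, noise)
    if any(not 0.0 <= x <= 1.0 for x in indicators):
        raise ValueError("every indicator must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w1, w2, w3, w4 = weights
    # Assumption: conflict and noise are 'larger is worse', so they are inverted.
    return (w1 * integrity + w2 * time_consistency
            + w3 * (1.0 - conflict) + w4 * (1.0 - noise))

clean_source = credibility_score(1.0, 1.0, 0.0, 0.0)
noisy_source = credibility_score(0.9, 0.8, 0.5, 0.6)
```

With this form a source with full integrity, perfect time consistency, no conflicts, and no noise scores 1, and the score falls as conflicts or noise increase.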

[0050] When a data conflict is detected, the conflict type and location are recorded. Conflict types include at least one of numerical conflict, state conflict, and sequence conflict, which are used for evidence screening during the root cause candidate set generation phase.

[0051] 4) Integrity verification summary: After the fault evidence package is generated, an integrity verification digest is further generated to verify the integrity of the fault evidence package content. Let the contents of each area of the evidence package be concatenated in a preset order to obtain the message $M$; then:

$$H = \mathrm{Hash}(M)$$

where $\mathrm{Hash}(\cdot)$ is the function for calculating the digest, $H$ is the integrity verification digest, and $M$ is the content message of the evidence package. When the evidence package is used later, the digest can be recalculated and compared to confirm that nothing in the evidence package is missing or has been tampered with.
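A minimal sketch of the digest calculation is shown below. The choice of SHA-256, the canonical JSON serialization, and the section names in `SECTION_ORDER` are assumptions for illustration; the application only requires a digest function applied to the areas concatenated in a preset order.

```python
import hashlib
import json

# Assumed preset concatenation order of the evidence package areas.
SECTION_ORDER = ("raw_data", "derived_features", "config_snapshot",
                 "consistency_report", "credibility_scores")

def integrity_digest(package):
    """Hash the package areas in the preset order and return a hex digest."""
    hasher = hashlib.sha256()
    for name in SECTION_ORDER:
        # Canonical JSON so that equal content always yields equal bytes.
        payload = json.dumps(package[name], sort_keys=True, separators=(",", ":"))
        hasher.update(name.encode("utf-8"))
        hasher.update(payload.encode("utf-8"))
    return hasher.hexdigest()

def verify(package, expected_digest):
    """Recompute the digest and compare it with the stored value."""
    return integrity_digest(package) == expected_digest

package = {
    "raw_data": {"sensor-1": [1.0, 2.0, 6.0]},
    "derived_features": {"mean": 3.0},
    "config_snapshot": {"firmware": "1.0.0"},
    "consistency_report": {"missing": []},
    "credibility_scores": {"sensor-1": 0.9},
}
stored = integrity_digest(package)
```

Recomputing the digest before playback or review and comparing it with `stored` detects a missing or altered area.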

[0052] To facilitate implementation by those skilled in the art, the following are examples of fields in the evidence package (fields can be expanded according to device type):
Raw data area: data source identifier, channel identifier, sampling period identifier, timestamp sequence, numerical sequence, and data segment location information;
Derived feature area: feature type, feature value, corresponding data source identifier, feature time range;
Configuration snapshot area: device model, firmware version, list of key parameters, list of topology relationships, threshold configuration;
Consistency verification report area: anchor event list, drift correction parameters, resampling rules, missing marker list, rearrangement mapping table;
Credibility score area: data source identifier, credibility score indicators (data integrity, time consistency, conflict consistency, noise level), conflict type, conflict location;
Integrity verification summary: the digest value.
Through the above structured encapsulation, the fault evidence package carries not only the data but also consistency and credibility information, giving subsequent differential playback verification a deterministic input and a traceable foundation.
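The field examples above could be carried by a structure such as the following. All class and attribute names here are hypothetical, and only a subset of the listed fields is modeled.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RawSegment:
    source_id: str
    channel_id: str
    sampling_period_s: float
    timestamps: list
    values: list
    segment_location: str  # for example an index range or file offset

@dataclass
class CredibilityRecord:
    source_id: str
    score: float
    conflict_type: Optional[str] = None      # "numerical", "state" or "sequence"
    conflict_location: Optional[str] = None

@dataclass
class EvidencePackage:
    raw_data: list = field(default_factory=list)
    derived_features: list = field(default_factory=list)
    config_snapshot: dict = field(default_factory=dict)
    consistency_report: dict = field(default_factory=dict)
    credibility: list = field(default_factory=list)
    digest: str = ""

package = EvidencePackage(config_snapshot={"device_model": "PUMP-X", "firmware": "1.0.0"})
package.raw_data.append(
    RawSegment("sensor-1", "ch0", 0.1, [0.0, 0.1], [1.0, 1.1], "seg-0"))
package.credibility.append(CredibilityRecord("sensor-1", 0.9))
```

Keeping the segment location on every raw segment is what later allows an event tuple or an evidence citation to point back to a specific piece of data.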

[0053] S5. Perform event tuple processing on the fault evidence package to generate an event tuple set, and reconstruct the fault event chain based on the event tuple set.

[0054] 1) Definition and generation rules of event tuples: The event tuple includes device identifier, component identifier or signal identifier, event type, event value, timestamp, and credibility score.

[0055] Event tuple processing includes: converting continuous sampling data into state change events, threshold crossing events, or abrupt change events; converting control commands and actuator feedback into control action events and response confirmation events; and adding data source identifiers and sampling period identifiers to the event tuples to record the basis on which each event tuple was generated.

[0056] In one specific implementation, each event tuple is bound to the location information of the data segment in the original data area, ensuring that the evidence reference can be located to the specific data segment and avoiding ambiguity.
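The conversion of continuous samples into event tuples can be sketched as follows for the threshold crossing case. The `EventTuple` class, the `threshold_events` function, and the textual format of `segment_location` are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventTuple:
    device_id: str
    signal_id: str
    event_type: str
    value: float
    timestamp: float
    credibility: float
    source_id: str
    sampling_period_s: float
    segment_location: str  # binds the event to a segment of the raw data area

def threshold_events(device_id, signal_id, source_id, period_s,
                     times, values, limit, credibility):
    """Emit one event each time the signal rises above the limit."""
    events = []
    for i in range(1, len(values)):
        if values[i - 1] <= limit < values[i]:
            events.append(EventTuple(
                device_id, signal_id, "threshold_crossing", values[i], times[i],
                credibility, source_id, period_s,
                segment_location=f"{source_id}[{i - 1}:{i + 1}]",
            ))
    return events

events = threshold_events("pump-1", "outlet_temp", "sensor-1", 1.0,
                          [0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 6.0, 5.0],
                          limit=4.0, credibility=0.9)
```

State change and abrupt change events would be produced by analogous detectors that fill the same tuple fields.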

[0057] 2) Fault event chain construction: Reconstructing the fault event chain includes: generating an event sequence based on the temporal order of the event tuples; determining event causal constraints based on equipment topology, control dependencies, component relationships, or signal relationships; constructing a directed event graph from the event sequence and the event causal constraints, and extracting the critical path from the directed event graph as the fault event chain; and marking the key turning points in the fault event chain.

[0058] Event causal constraints include at least two of the following: component connection constraints, control link constraints, and alarm association constraints.

[0059] In one specific implementation, the nodes of the directed event graph are event tuples. When event A satisfies the causal constraint pointing to event B and the timestamp of event A is earlier than that of event B, a directed edge is established between event A and event B. Strong constraint edges are established between control action events and response confirmation events on the same control link to ensure their sequential relationship. When extracting the critical path, fault alarm events or critical turning events are used as the terminating nodes, and the fault event chain is extracted based on the principle of maximizing the path length or the cumulative path weight.
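A sketch of the graph construction and critical path extraction, under simplifying assumptions: each event is reduced to a component name and a timestamp, causal constraints are given as a set of allowed (upstream component, downstream component) pairs, and the path criterion is maximum path length with unit edge weights. All names and the sample events are illustrative.

```python
def build_edges(events, constraints):
    """Add edge A -> B when the constraint allows it and A precedes B in time."""
    edges = []
    for a, (comp_a, t_a) in events.items():
        for b, (comp_b, t_b) in events.items():
            if (comp_a, comp_b) in constraints and t_a < t_b:
                edges.append((a, b))
    return edges

def critical_path(events, edges, terminal):
    """Longest chain ending at the terminal node, by number of events."""
    incoming = {}
    for a, b in edges:
        incoming.setdefault(b, []).append(a)
    best = {e: 1 for e in events}      # longest chain length ending at e
    prev = {e: None for e in events}
    # Edges always point forward in time, so timestamp order is a valid
    # topological order for the dynamic programme.
    for node in sorted(events, key=lambda e: events[e][1]):
        for a in incoming.get(node, []):
            if best[a] + 1 > best[node]:
                best[node], prev[node] = best[a] + 1, a
    path, node = [], terminal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

events = {
    "valve_cmd": ("controller", 1.0),
    "temp_drift": ("sensor", 1.2),
    "valve_ack": ("valve", 1.5),
    "pressure_high": ("pipe", 2.0),
    "alarm": ("alarm_unit", 3.0),
}
constraints = {("controller", "valve"), ("valve", "pipe"),
               ("pipe", "alarm_unit"), ("sensor", "alarm_unit")}
edges = build_edges(events, constraints)
chain = critical_path(events, edges, terminal="alarm")
```

Replacing the unit increment with a per-edge weight gives the cumulative path weight variant mentioned in the text.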

[0060] Key turning points include at least one of state changes, alarm escalation, and control mode switching, and are marked in the event chain for the generation of root cause candidate sets.

[0061] S6. Generate a root cause candidate set based on the fault event chain, and associate each root cause candidate with corresponding evidence reference information.

[0062] The generation of the root cause candidate set includes: 1) identifying a set of suspected components based on the key turning points in the fault event chain, and filtering root cause candidates by combining equipment topology path constraints; 2) calculating similarity based on the historical failure case database and outputting candidate ranking results; 3) establishing an evidence citation list for each root cause candidate.

[0063] In one specific implementation, the historical fault case database stores fault event chains, alarm sequences, key state feature summaries, and final root cause conclusions, indexed by traceable task identifier. Similarity can be computed by multi-feature weighted fusion, where event sequence similarity is based on the degree of matching of key turning events, alarm sequence similarity is based on the edit distance or common subsequence length of the alarm sequences, and key state feature similarity is based on a feature vector distance metric. The final output is a candidate ranking result.
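The multi-feature weighted fusion can be sketched as follows. The concrete choices are assumptions for illustration: Jaccard overlap for key turning events, normalized Levenshtein distance for alarm sequences, an inverse Euclidean distance for feature vectors, and the example weights. The case records are invented sample data.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def sequence_similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def case_similarity(current, case, weights=(0.4, 0.4, 0.2)):
    cur_turn, old_turn = set(current["turning_events"]), set(case["turning_events"])
    union = cur_turn | old_turn
    event_sim = len(cur_turn & old_turn) / len(union) if union else 1.0
    alarm_sim = sequence_similarity(current["alarms"], case["alarms"])
    feature_sim = 1.0 / (1.0 + math.dist(current["features"], case["features"]))
    w_event, w_alarm, w_feature = weights
    return w_event * event_sim + w_alarm * alarm_sim + w_feature * feature_sim

def rank_cases(current, cases):
    """Return historical cases ordered from most to least similar."""
    return sorted(cases, key=lambda c: case_similarity(current, c), reverse=True)

current = {"turning_events": ["mode_switch", "alarm_escalation"],
           "alarms": ["LOW_FLOW", "HIGH_TEMP", "TRIP"], "features": [0.8, 0.1]}
cases = [
    {"task_id": "T-002", "turning_events": ["state_change"],
     "alarms": ["OVERCURRENT"], "features": [0.1, 0.9]},
    {"task_id": "T-001", "turning_events": ["mode_switch"],
     "alarms": ["LOW_FLOW", "TRIP"], "features": [0.7, 0.1]},
]
ranking = [c["task_id"] for c in rank_cases(current, cases)]
```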

[0064] The evidence citation list should include at least the corresponding event tuple identifier, the location of the original data fragment, and the configuration snapshot item, so that the retrospective report can be reviewed and verified, and a clear correspondence between the root cause candidates and the evidence can be guaranteed.

[0065] S7. Generate a replay script based on the fault evidence package, construct the control group replay environment and the experimental group replay environment, and execute the replay script respectively to obtain the control replay results and the experimental replay results.

[0066] 1) Construction of playback environments for the control group and the experimental group: Within the isolated execution space, the baseline configuration version corresponding to the time of the failure is loaded to build the control group replay environment, and the modified repair configuration version is loaded to build the experimental group replay environment. The isolated execution space can be a container environment, a virtual machine environment, or an independent process sandbox to avoid external interference.

[0067] 2) Replay script structure: The replay script should include at least: Data source mapping table: Maps the data source identifier of the fault evidence package to the playback environment input channel; Data injection sequence table: specifies the injection time base and order; Interface call sequence list: Specifies the order in which control action events are triggered and the method for verifying response confirmation; Playback stop conditions: The preset playback duration is reached or a critical alarm event is triggered.

[0068] The fault evidence package is used as a unified input to drive the same playback script in both the control group and experimental group playback environments, ensuring that the input and injection order are consistent between the two environments.
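The idea of driving one playback script through two environments that differ only in configuration can be illustrated with a toy model. Everything here is a simplifying assumption: the environment is reduced to a per-channel alarm limit, and `run_replay`, the script keys, and the sample values are hypothetical.

```python
def run_replay(script, config):
    """Inject samples in time order and collect the alarms that are raised."""
    alarms = []
    for t, source, value in sorted(script["injections"]):
        if t > script["max_duration_s"]:
            break  # stop condition: preset playback duration reached
        channel = script["source_map"][source]  # data source mapping table
        if value > config["limits"][channel]:
            alarms.append((t, channel))
            if channel in script["critical_channels"]:
                break  # stop condition: critical alarm triggered
    return alarms

script = {
    "source_map": {"sensor-1": "temp_in"},
    "injections": [(0.0, "sensor-1", 70.0), (2.0, "sensor-1", 90.0),
                   (1.0, "sensor-1", 85.0)],
    "critical_channels": {"temp_in"},
    "max_duration_s": 10.0,
}
baseline_config = {"limits": {"temp_in": 80.0}}   # control group
repair_config = {"limits": {"temp_in": 95.0}}     # experimental group

control_result = run_replay(script, baseline_config)
experiment_result = run_replay(script, repair_config)
```

Because the script object is shared, any difference between the two results can only come from the configuration version, which is the property the differential comparison relies on.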

[0069] 3) External dependency interface isolation or emulation: During replay execution, external dependency interfaces are isolated or simulated. Isolation or simulation processing includes redirecting external dependency interface requests to the simulation service or returning preset response data through the interface adaptation layer to ensure the determinism and reproducibility of the replay execution. The response data of the simulation service can come from the original data area of the fault evidence package or be generated by the response rules preset by the replay script, thereby ensuring that the replay process of the control group and the experimental group is consistent.

[0070] S8. Compare the control playback results with the experimental playback results to obtain the difference index, and determine the root cause of the failure and the corresponding confidence level based on the difference index, and output the failure review analysis results.

[0071] 1) Definition and calculation rules of the difference index: Differential metrics include one or more of the following: critical state quantity deviation metrics, state transition sequence difference metrics, alarm trigger sequence difference metrics, abnormal duration difference metrics, and recovery time difference metrics.

[0072] In a specific implementation: Key state quantity deviation index: A statistical measure of the difference between key state quantity sequences under a unified time base is calculated as a deviation measure. State transition sequence difference index: Extract the state transition sequence of control mode, operation stage or protection state, and compare the transition edge set; Alarm trigger sequence difference index: Extract alarm sequences in chronological order and calculate sequence edit distance or common subsequence length; Anomaly Duration Difference Index: Determine the duration of the anomaly based on the anomaly start and end boundaries and calculate the difference; Recovery time difference index: The recovery time is determined based on the recovery completion criteria and the difference is calculated.
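Three of these difference indices can be sketched as follows. The mean absolute deviation, the symmetric difference of transition edge sets, and the longest common subsequence length are example choices consistent with the description; the function names and sample sequences are assumptions.

```python
def state_deviation(control, experiment):
    """Mean absolute deviation between two series on the same time base."""
    return sum(abs(c - e) for c, e in zip(control, experiment)) / len(control)

def transition_edge_diff(control_states, experiment_states):
    """Transition edges that occur in exactly one of the two playbacks."""
    def edges(states):
        return {(a, b) for a, b in zip(states, states[1:]) if a != b}
    return edges(control_states) ^ edges(experiment_states)

def lcs_length(a, b):
    """Length of the longest common subsequence of two alarm sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

deviation = state_deviation([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
edge_diff = transition_edge_diff(["RUN", "RUN", "PROTECT", "STOP"],
                                 ["RUN", "RUN", "RUN", "RUN"])
common = lcs_length(["LOW_FLOW", "HIGH_TEMP", "TRIP"], ["LOW_FLOW"])
```

The abnormal duration and recovery time indices reduce to subtracting two durations once the start, end, and recovery criteria have been fixed.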

[0073] 2) Preset difference judgment conditions: When the experimental playback results meet the preset differential judgment conditions compared to the control playback results, the confidence level of the corresponding root cause candidate is increased and it is output as the root cause of the failure. The preset differential judgment conditions include one or more of the following conditions: The difference in recovery time reaches a preset threshold; The difference in the duration of the anomaly reaches a preset threshold; The alarm trigger sequence difference meets the preset sequence constraints.

[0074] Among them, the preset sequence constraints include at least the alarm sequence edit distance not exceeding the preset upper limit or the key alarm nodes no longer appearing, to ensure that the judgment rules can be implemented.

[0075] 3) Root cause candidate confidence update: Determining the root cause and its corresponding confidence level based on the differential index includes updating the root cause candidate confidence according to the credibility score associated with the root cause candidate and its consistency with the differential index. Let the initial confidence of the $i$-th root cause candidate be $P_i$ and the updated confidence be $P_i'$. The updated confidence $P_i'$ is computed from $P_i$, $S_i$, and $D_i$ using the update coefficients $\alpha$ and $\beta$, where $S_i$ is the credibility score associated with the $i$-th root cause candidate and $D_i$ is the differential consistency score, which characterizes the degree of matching between the differential index and the root cause candidate.

[0076] In a specific implementation, the differential consistency score can be obtained by normalizing and weighting the differential indicators associated with the root cause candidate. The normalization method can be threshold normalization or interval normalization, so that different differential indicators are comparable.
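A sketch of threshold normalization followed by a confidence update is given below. The update rule used here, a convex combination of the initial confidence, the credibility score, and the differential consistency score, is an illustrative assumption and not necessarily the formula of this application; the coefficient values, thresholds, and indicator names are likewise invented for the example.

```python
def threshold_normalize(value, threshold):
    """Map a non-negative indicator to [0, 1], saturating at the threshold."""
    return min(max(value / threshold, 0.0), 1.0)

def differential_consistency(indicators, thresholds, weights):
    """Weighted average of the normalized differential indicators."""
    total = sum(weights.values())
    return sum(
        weights[k] * threshold_normalize(indicators[k], thresholds[k])
        for k in weights
    ) / total

def update_confidence(initial, credibility, consistency, alpha=0.2, beta=0.3):
    # Assumed form: convex combination, so the result stays in [0, 1]
    # whenever all three inputs are in [0, 1] and alpha + beta <= 1.
    return (1.0 - alpha - beta) * initial + alpha * credibility + beta * consistency

score = differential_consistency(
    indicators={"recovery_time_diff_s": 30.0, "anomaly_duration_diff_s": 10.0},
    thresholds={"recovery_time_diff_s": 60.0, "anomaly_duration_diff_s": 10.0},
    weights={"recovery_time_diff_s": 0.5, "anomaly_duration_diff_s": 0.5},
)
updated = update_confidence(initial=0.5, credibility=1.0, consistency=score)
```

Under this assumed rule a candidate whose repair produces a large, credible difference gains confidence, while a candidate with weak evidence and no observable difference loses it.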

[0077] 4) Post-mortem report generation and archiving: The output of the failure review analysis results includes the generation of a debriefing report. The debriefing report should include at least a failure timeline, a list of evidence cited, root cause conclusions, differential indicators, remedial recommendations, and verification conclusions.

[0078] The system associates and archives the fault evidence package, control playback results, experimental playback results, and debriefing report, and generates a traceable task identifier. This traceable task identifier, along with the integrity verification summary, ensures that the same evidence package and playback output can be located during subsequent reviews, preventing inconsistencies in conclusions due to data changes.

[0079] Embodiment 2: This embodiment provides a device failure review and analysis system that applies the method described in Embodiment 1. The system includes:
1) A review window management module, used to respond to the fault triggering event of the target device, obtain the fault occurrence time point, and determine the fault review time window based on the fault occurrence time point and the preset pre-fault review duration and post-fault review duration;
2) A data acquisition module, used to collect multi-source fault data within the fault review time window;
3) A consistency verification module, used to perform time consistency verification on the multi-source fault data, and to perform drift correction, out-of-order rearrangement, and missing mark processing on the multi-source fault data based on the verification results, so as to obtain multi-source fault data after consistency processing;
4) A fault evidence package generation module, used to generate fault evidence packages based on the multi-source fault data after consistency processing. The fault evidence package includes the original data area, derived feature area, configuration snapshot area, and consistency verification report area, and further includes a credibility score area; an integrity verification summary is also generated;
5) An event chain reconstruction module, used to process the fault evidence package into event tuples to generate an event tuple set, and to reconstruct the fault event chain based on the equipment topology and control dependency;
6) A root cause reasoning module, used to generate a root cause candidate set based on the fault event chain, and to associate evidence citation information and a confidence level with each root cause candidate;
7) A differential playback module, used to generate playback scripts based on fault evidence packages, construct the control group playback environment and the experimental group playback environment and execute playback in each to obtain control playback results and experimental playback results, and isolate or simulate external dependent interfaces;
8) A differential verification module, used to perform differential comparison between the control playback results and the experimental playback results, obtain the differential index, update the root cause candidate confidence based on the differential index, and output the root cause of the failure;
9) A report archiving module, used to generate a debriefing report, associate and archive the fault evidence package, the control playback results, the experimental playback results, and the debriefing report, and generate a traceable task identifier.

[0080] The system described in this embodiment, through the collaboration of the above modules, enables the fault review process to have a unified time benchmark, solidified evidence package, differential playback verification, and traceable results, thereby ensuring that the review conclusions are reproducible, verifiable, and can be implemented.

[0081] The equipment failure retrospective analysis method and system of this application takes the solidification of multi-source evidence with a unified time benchmark as the input basis, the reconstruction of event chains as the reasoning line, and the differential playback verification between the control group and the experimental group as the conclusion verification method, forming a reproducible closed loop from data collection, evidence construction, root cause reasoning to verification output.

[0082] During operation, the system first responds to the fault triggering event of the target device through the review window management module, obtains the time point of the fault occurrence, and determines the fault review time window by combining the preset pre-fault review duration and post-fault review duration. Subsequently, the data acquisition module synchronously collects at least two types of multi-source fault data within the fault review time window, including sensor time-series data, controller status data, actuator feedback data, alarm and event log data, and configuration parameters and version information, thereby covering the key evidence sources involved in the fault cause, fault occurrence, and post-fault evolution process.

[0083] To address issues such as clock inconsistencies, different sampling periods, out-of-order data, and missing data in data from different sources, the consistency verification module uses anchor events as the alignment basis. Anchor events include at least one or more of the following: alarm trigger events, control command issuance events, and state variable transition events. The system matches the timestamps of anchor events in each data source with a reference time base. For data sources with clock drift, a linear drift model is used to correct the timestamps. Continuous signals with different sampling periods are resampled to ensure comparability of multi-source data under a unified time base. Simultaneously, out-of-order data is time-rearranged, missing segments are marked, and the alignment, correction, resampling, and missing tagging results are written to the consistency verification report area, forming a traceable verification record. After the above processing, consistent multi-source fault data is obtained, providing consistent input for subsequent analysis.
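The anchor-based alignment can be sketched as follows, assuming the linear drift model takes the form reference time = a × local time + b fitted by least squares over matched anchor events. The function names, the gap tolerance factor, and the sample timestamps are assumptions; resampling is omitted for brevity.

```python
def fit_linear_drift(local_times, reference_times):
    """Least-squares fit of reference = a * local + b from matched anchors."""
    n = len(local_times)
    mean_x = sum(local_times) / n
    mean_y = sum(reference_times) / n
    sxx = sum((x - mean_x) ** 2 for x in local_times)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(local_times, reference_times))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def correct_and_reorder(samples, a, b):
    """Apply the drift correction, then sort samples by corrected time."""
    return sorted(((a * t + b, v) for t, v in samples), key=lambda s: s[0])

def mark_missing(samples, period, tolerance=1.5):
    """Return (start, end) gaps whose spacing exceeds tolerance * period."""
    return [(t0, t1) for (t0, _), (t1, _) in zip(samples, samples[1:])
            if t1 - t0 > tolerance * period]

# Anchor events seen by the local clock and by the reference time base.
a, b = fit_linear_drift([0.0, 100.0, 200.0], [5.0, 105.1, 205.2])
# Out-of-order samples from the drifting source, nominal period 10 s.
aligned = correct_and_reorder(
    [(10.0, 1.0), (30.0, 3.0), (20.0, 2.0), (60.0, 6.0)], a, b)
gaps = mark_missing(aligned, period=10.0)
```

The fitted parameters, the reordering, and the gap list are the kind of records that would be written to the consistency verification report area.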

[0084] Based on this, the fault evidence package generation module structurally encapsulates the multi-source fault data after consistency processing to generate a fault evidence package. The fault evidence package includes at least an original data area, a derived feature area, a configuration snapshot area, and a consistency verification report area, and further includes a credibility scoring area, while also generating an integrity verification summary. The original data area carries multi-source data and its fragment location information under a unified time benchmark; the derived feature area extracts features such as change rate, mutation point, threshold out-of-bounds, statistical aggregation, or frequency domain features from the original data to enhance fault characterization capabilities; the configuration snapshot area solidifies the device model, firmware version, key parameters, and device topology relationship information corresponding to the fault occurrence time point, used to constrain subsequent causal inference and playback environment construction; the credibility scoring area calculates the credibility score of each data source based on indicators such as data integrity, time consistency, conflict consistency, and noise level, and records the type and location of numerical conflicts, state conflicts, or sequence conflicts, used for filtering and weighting evidence during the root cause reasoning stage; the integrity verification summary is used to verify the integrity of the evidence package content, ensuring that evidence is not missing or replaced during subsequent playback and review, thereby achieving evidence solidification and traceability.

[0085] Subsequently, the event chain reconstruction module performs event tuple processing on the fault evidence package, converting continuous sampled data into state change events, threshold overrun events, or abrupt change events, and converting control commands and actuator feedback into control action events and response confirmation events. Each event tuple is configured with a device identifier, component identifier or signal identifier, event type, event value, timestamp, and credibility score, while also supplementing data source identifiers and sampling period identifiers to ensure traceability of event sources. The system generates an event sequence based on the temporal relationship of the event tuples and, combined with the device topology and control dependencies in the configuration snapshot area, constructs causal constraints such as component connection constraints, control link constraints, and alarm association constraints. Under these causal constraints, the system constructs a directed event graph and extracts the critical path terminating at a fault alarm or key turning point event as the fault event chain. Key turning point events such as state abrupt changes, alarm escalation, and control mode switching are labeled to highlight the critical nodes and propagation paths of fault evolution.

[0086] The root cause reasoning module generates a root cause candidate set based on the fault event chain: First, it locates a set of suspected components based on key turning events and filters root cause candidates by combining equipment topology path constraints; Second, it calculates similarity based on a historical fault case library to rank root cause candidates, where the similarity is at least a comprehensive measure of the matching degree between event sequences, alarm sequences, and key state features; Third, it establishes an evidence citation list for each root cause candidate, which includes at least the event tuple identifier, the original data fragment location, and the configuration snapshot item, thereby forming a clear correspondence between root cause candidates and evidence, facilitating review and traceability.

[0087] To avoid misjudgments caused by drawing conclusions directly based on reasoning results, the system further implements a closed-loop verification through a differential replay module. The differential replay module generates a replay script based on the fault evidence package. The replay script includes at least the data source mapping relationship, data injection order, and interface call order. Within the isolated execution space, it loads the baseline configuration version corresponding to the fault occurrence time to construct the control group replay environment, and simultaneously loads the modified repair configuration version to construct the experimental group replay environment. Both replay environments execute the same replay script under the same fault evidence package input to ensure consistent input and order. During replay, external dependent interfaces are isolated or simulated by redirecting requests to the simulation service or by the interface adaptation layer returning preset response data, ensuring the determinism and reproducibility of the replay process and preventing external system fluctuations from introducing non-fault factors.

[0088] The differential verification module performs differential comparison between the control playback results and the experimental playback results to obtain differential indicators such as key state variable deviation, state transition sequence difference, alarm trigger sequence difference, abnormal duration difference, and recovery time difference. Verification judgment is then performed based on preset differential judgment conditions. When the experimental group meets preset thresholds or preset sequence constraints compared to the control group in terms of recovery time, abnormal duration, or alarm trigger sequence, it indicates that the repair configuration version has a verifiable impact on the fault performance. The system uses the differential indicators and the confidence score associated with the root cause candidates to update the confidence of the root cause candidates, thereby outputting the differentially verified root cause of the fault and its corresponding confidence. Finally, the report archiving module generates a debriefing report, which includes at least a fault timeline, a list of evidence citations, root cause conclusions, differential indicators, handling suggestions, and verification conclusions. The fault evidence package, control playback results, experimental playback results, and debriefing report are associated and archived, generating a traceable task identifier to support subsequent review, reproduction, and accountability closure.

[0089] Through the above workflow, this application realizes event chain reasoning based on a unified time benchmark and solidified evidence, and introduces a differential playback verification mechanism that compares a control group with an experimental group, making the fault review conclusions reproducible, traceable, and verifiable, thereby improving the reliability and engineering applicability of fault root cause determination.

[0090] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for retrospective analysis of equipment failures, characterized in that, Includes the following steps: S1. Respond to the fault triggering event of the target device, obtain the fault occurrence time point, and determine the fault review time window based on the fault occurrence time point and the preset pre-fault review duration and post-fault review duration. S2. Collect multi-source fault data within the fault review time window. The multi-source fault data shall include at least two of the following types of data: sensor time series data, controller status data, actuator feedback data, alarm and event log data, and configuration parameters and version information. S3. Perform time consistency verification on the multi-source fault data, and perform drift correction and out-of-order rearrangement on the multi-source fault data based on the verification results to obtain the multi-source fault data after consistency processing. S4. Generate a fault evidence package based on the multi-source fault data after consistency processing; S5. Perform event tuple processing on the fault evidence package to generate an event tuple set, and reconstruct the fault event chain based on the event tuple set; S6. Generate a root cause candidate set based on the fault event chain, and associate each root cause candidate with corresponding evidence citation information. S7. Generate a replay script based on the fault evidence package, construct the control group replay environment and the experimental group replay environment, and execute the replay script respectively to obtain the control replay results and the experimental replay results; S8. Compare the control playback results with the experimental playback results to obtain the difference index, and determine the root cause of the failure and the corresponding confidence level based on the difference index, and output the failure review analysis results.

2. The equipment failure retrospective analysis method according to claim 1, characterized in that, The time consistency check includes: Extract anchor events, which include at least one of alarm trigger events, control command issuance events, and state variable transition events; The timestamps of data from different sources are aligned based on anchor events, and resampling is performed on data with different sampling periods so that multi-source data can be compared under a unified time reference. Based on timestamp deviation calculation, drift correction is performed on timestamps with deviation, out-of-order data is rearranged in time, and missing data is marked with missing tags. The alignment results, drift correction results, and missing markers are written into the consistency verification report area to form a traceable verification record.

3. The equipment failure retrospective analysis method according to claim 2, characterized in that, The fault evidence package includes an original data area, a derived feature area, a configuration snapshot area, and a consistency verification report area, wherein: The configuration snapshot area includes at least two of the following: device model information, firmware version information, key parameter configuration items, and device topology relationship information. The derived feature region includes at least one or more of the following: rate of change features, mutation point features, threshold crossing features, statistical aggregation features, and frequency domain features. The derived features are generated from multi-source fault data after consistency processing. After the fault evidence package is generated, an integrity verification digest is further generated to verify the integrity of the fault evidence package content.

4. The equipment failure retrospective analysis method according to claim 3, characterized in that, The credibility scoring area is used to generate credibility scores for each data source. The credibility score is determined by at least two of the following indicators: data integrity indicator, time consistency indicator, conflict consistency indicator, and noise level indicator. When a data conflict is detected, the conflict type and conflict location are recorded. The conflict type includes at least one of numerical conflict, state conflict, and sequence conflict, which is used for evidence screening when generating the root cause candidate set.

5. The equipment failure retrospective analysis method according to claim 4, characterized in that, The event tuple includes device identifier, component identifier or signal identifier, event type, event value, timestamp, and credibility score; Event tuple processing includes: converting continuous sampled data into state change events, threshold overrun events, or mutation events; converting control commands and actuator feedback into control action events and response confirmation events; and supplementing event tuples with data source identifiers and sampling period identifiers to characterize the basis for event tuple generation.

6. The equipment failure retrospective analysis method according to claim 5, characterized in that reconstructing the fault event chain includes: generating an event sequence based on the temporal order of the event tuples; determining the event causal constraints based on the equipment topology, control dependency, component association, or signal association, wherein the event causal constraints include two or more of the following: component connection constraints, control link constraints, and alarm association constraints; constructing a directed event graph based on the event sequence and the event causal constraints, and extracting the critical path from the directed event graph as the fault event chain; and marking the key turning points in the fault event chain, wherein the key turning points include at least one of the following: state change, alarm escalation, and control mode switching.

7. The equipment failure retrospective analysis method according to claim 6, characterized in that generating the root cause candidate set includes: identifying a set of suspected components based on the key turning points in the fault event chain, and screening root cause candidates in combination with equipment topology path constraints; calculating a similarity against a historical fault case database and outputting candidate ranking results, the similarity being based on one or more of the following: event sequence similarity, alarm sequence similarity, and key state feature similarity; and generating an evidence reference list for each root cause candidate, the evidence reference list including the corresponding event tuple identifiers, the locations of the original data fragments, and the configuration snapshot items.
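
As a non-limiting illustration of the ranking step in claim 7, the following Python sketch scores each root cause candidate by event sequence similarity against historical cases. The use of `difflib.SequenceMatcher` as the similarity measure, the layout of the case database as (root cause, event sequence) pairs, and the name `rank_candidates` are assumptions for this example.

```python
from difflib import SequenceMatcher


def sequence_similarity(a, b):
    """Similarity in [0, 1] between two event-type sequences."""
    return SequenceMatcher(None, a, b).ratio()


def rank_candidates(observed, candidates, history):
    """Rank candidates by their best match in the historical case database.

    observed:   list of event types from the reconstructed fault event chain
    candidates: iterable of candidate root cause labels
    history:    list of (root_cause_label, event_type_sequence) pairs
    """
    scored = []
    for candidate in candidates:
        sims = [sequence_similarity(observed, seq)
                for cause, seq in history if cause == candidate]
        # A candidate with no historical case gets a similarity of 0.
        scored.append((candidate, max(sims, default=0.0)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Alarm sequence similarity could be computed in the same way on alarm identifiers, and the individual similarities combined with weights chosen for the application.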

8. The equipment failure retrospective analysis method according to claim 7, characterized in that constructing the control group playback environment and the experimental group playback environment includes: loading, in an isolated execution space, the baseline configuration version corresponding to the time of the failure to build the control group playback environment, and loading the modified repair configuration version to build the experimental group playback environment; using the fault evidence package as a unified input to drive the same playback script in both the control group playback environment and the experimental group playback environment; and isolating or simulating external dependent interfaces during playback execution, wherein the isolation or simulation processing includes redirecting external dependent interface requests to a simulation service, or returning preset response data through an interface adaptation layer, to ensure that playback execution is deterministic and reproducible.
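
As a non-limiting illustration of claim 8, the following Python sketch runs one playback function against a baseline configuration and a repair configuration, with the external dependency replaced by a simulation service that returns preset responses. The configuration key `alarm_threshold`, the request name `sensor_offset`, the package layout, and all class and function names are assumptions for this example.

```python
import copy


class SimulatedService:
    """Stand-in for an external interface; returns preset responses only."""

    def __init__(self, preset):
        self._preset = dict(preset)

    def request(self, name):
        return self._preset[name]


def run_playback(config, evidence_package, service):
    """Replay the recorded samples under one configuration version."""
    config = copy.deepcopy(config)  # each run works on its own copy
    offset = service.request("sensor_offset")
    alarms = [ts for ts, raw in evidence_package["samples"]
              if raw + offset > config["alarm_threshold"]]
    return {"alarm_timestamps": alarms}


def differential_playback(evidence_package, baseline_config, repair_config, preset):
    """Drive the same playback with the same input in both environments."""
    control = run_playback(baseline_config, evidence_package,
                           SimulatedService(preset))
    experiment = run_playback(repair_config, evidence_package,
                              SimulatedService(preset))
    return control, experiment
```

Since neither run touches a live interface or shared state, repeating the playback with the same evidence package yields the same results, which is the property the claim relies on for reproducibility.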

9. The equipment failure retrospective analysis method according to claim 8, characterized in that the differential indicators include one or more of the following: a critical state quantity deviation indicator, a state transition sequence difference indicator, an alarm trigger sequence difference indicator, an abnormal duration difference indicator, and a recovery time difference indicator; the step of determining the root cause of the failure and its corresponding confidence level based on the differential indicators includes: updating the confidence level of each root cause candidate based on the consistency between the credibility score associated with the root cause candidate and the differential indicators; and, when the experimental playback results meet preset differential judgment conditions relative to the control playback results, increasing the confidence level of the corresponding root cause candidate and outputting it as the root cause of the failure, the preset differential judgment conditions including one or more of the following: the recovery time difference reaches a preset threshold, the alarm trigger sequence difference satisfies preset sequence constraints, and the abnormal duration difference reaches a preset threshold; outputting the fault review analysis results includes generating a review report, which includes a fault timeline, an evidence reference list, root cause conclusions, differential indicators, handling suggestions, and verification conclusions; and the fault evidence package, the control playback results, the experimental playback results, and the review report are associated and archived, and a traceable task identifier is generated.
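
As a non-limiting illustration of claim 9, the following Python sketch derives two differential indicators from the playback results and raises a candidate's confidence level when the judgment conditions are met. The result keys, the requirement that every configured threshold be reached, the fixed increment `boost`, and the function names are assumptions for this example; the claim does not prescribe an update rule.

```python
def differential_indicators(control, experiment):
    """Differences between control and experimental playback results."""
    return {
        "recovery_time_diff":
            control["recovery_time_s"] - experiment["recovery_time_s"],
        "abnormal_duration_diff":
            control["abnormal_duration_s"] - experiment["abnormal_duration_s"],
    }


def update_confidence(prior, indicators, thresholds, boost=0.2):
    """Raise the confidence when all configured conditions are met.

    Returns (updated_confidence, conditions_met).
    """
    met = all(indicators[name] >= limit for name, limit in thresholds.items())
    updated = min(1.0, prior + boost) if met else prior
    return updated, met
```

For example, if the repair configuration shortens recovery from 120 s to 30 s and the preset threshold is 60 s, the recovery time difference of 90 s meets the condition and the candidate's confidence level is increased.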

10. A device failure review and analysis system applicable to the equipment failure retrospective analysis method according to any one of claims 1-9, characterized by comprising: a review window management module, used to respond to a fault triggering event of the target device, obtain the fault occurrence time point, and determine the fault review time window based on the fault occurrence time point, a preset pre-fault review duration, and a preset post-fault review duration; a data acquisition module, used to collect multi-source fault data within the fault review time window; a consistency verification module, used to perform time consistency verification on the multi-source fault data and, based on the verification results, perform drift correction, out-of-order rearrangement, and missing-data labeling on the multi-source fault data to obtain consistent multi-source fault data; a fault evidence package generation module, used to generate a fault evidence package based on the consistent multi-source fault data and to generate an integrity verification summary, the fault evidence package including an original data area, a derived feature area, a configuration snapshot area, and a consistency verification report area; an event chain reconstruction module, used to process the fault evidence package into a set of event tuples and to reconstruct the fault event chain based on the device topology and control dependencies; a root cause reasoning module, used to generate a root cause candidate set based on the fault event chain and to generate evidence reference information and a confidence level for each root cause candidate; a differential playback module, used to generate playback scripts based on the fault evidence package, construct the control group playback environment and the experimental group playback environment, execute the playback in each environment to obtain the control playback results and the experimental playback results, and isolate or simulate external dependent interfaces; a differential verification module, used to perform a differential comparison between the control playback results and the experimental playback results to obtain differential indicators, update the confidence level of each root cause candidate based on the differential indicators, and output the root cause of the failure; and a report archiving module, used to generate a review report, associate and archive the fault evidence package, the control playback results, and the experimental playback results with the review report, and generate a traceable task identifier.
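
As a non-limiting illustration of the integrity verification summary and the traceable task identifier named in claim 10, the following Python sketch hashes a canonical JSON form of each archived item and ties the digests together under one identifier. The choice of SHA-256, the JSON canonicalization, the use of a random UUID as the task identifier, and the record layout are assumptions for this example.

```python
import hashlib
import json
import uuid


def integrity_digest(item):
    """SHA-256 over a canonical JSON encoding of a JSON-serializable item."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def archive_record(evidence_package, control_result, experiment_result, report):
    """Associate the four artifacts under one traceable task identifier."""
    return {
        "task_id": uuid.uuid4().hex,  # hypothetical identifier scheme
        "digests": {
            "evidence_package": integrity_digest(evidence_package),
            "control_result": integrity_digest(control_result),
            "experiment_result": integrity_digest(experiment_result),
            "report": integrity_digest(report),
        },
    }
```

Recomputing a digest at review time and comparing it with the archived value would show whether the corresponding artifact has changed since it was archived.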