A knowledge graph-driven power grid fault root cause analysis system and method
A power grid fault root cause analysis system built on multi-source sensor data acquisition and a time-series alignment algorithm addresses the ambiguous causal relationships and unclear propagation paths in power grid fault analysis, enabling accurate source tracing and prevention of power grid faults.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- STATE GRID ANHUI ELECTRIC POWER CO LTD ELECTRIC POWER SCI RES INST
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-30
AI Technical Summary
Existing power grid fault analysis methods are unable to fully present the complete evolution path from initial disturbance to final fault manifestation, making it impossible for operation and maintenance personnel to accurately determine the cause of fault escalation and formulate reliable preventive measures.
Real-time power grid data is collected by multi-source sensors, an ordered event list is generated using a time-series alignment algorithm, direct coupling relationships are determined based on electrical connection strength, causal relationships are extended using a protection logic rule base, a fault propagation network is constructed, key turning points are identified, and root causes are analyzed using a causal reasoning algorithm to generate an evolution path report.
It enables precise root cause analysis of power grid faults, improves the accuracy of fault tracing and proactive prevention, and provides strong technical support for the safe and stable operation of the power grid.
Figure CN122311622A_ABST
Description
Technical Field
[0001] This invention relates to the field of information technology, and in particular to a knowledge graph-driven system and method for analyzing the root causes of power grid faults.
Background Technology
[0002] As the lifeline of modern society, the safe and stable operation of the power system is directly related to the normal order of the national economy and people's lives. Once a fault occurs, it may cause local power outages or even large-scale power outages or system collapse. Therefore, it is extremely important to quickly and accurately find the root cause of the fault and understand its entire process of occurrence and development.
[0003] Current power grid fault analysis methods mostly rely on human experience or rule-based diagnostic systems. When faced with complex power grids, these methods can often only identify the surface phenomena and the equipment directly involved in the fault, and cannot present the complete evolution path from the initial disturbance to the final manifestation of the fault. As a result, operation and maintenance personnel cannot clearly determine how the fault gradually expanded or which links could have been blocked but were not, which leaves no reliable basis for formulating targeted preventive measures.
[0004] The process from the occurrence to the manifestation of a power grid fault is a dynamic process involving multiple time scales, multiple devices, and multiple protection devices that are closely coupled. Millisecond-level electrical transient disturbances, second-level relay protection actions, and minute-level manual interventions are intertwined. These events not only have a strict sequential order, but also form a complex causal chain through electrical connections and protection logic. If events of different time granularities cannot be uniformly placed on a time axis and sequentially connected, it is impossible to accurately reconstruct the true path of fault propagation, and it is also difficult to identify which link is the key turning point that leads to the accelerated loss of control of the fault.
[0005] For example, in a line grounding fault, the initial insulation breakdown may only cause local voltage fluctuations. However, if an adjacent protection device operates beyond its own protection level (override tripping) because of improper coordination, it will trip a wider range of circuit breakers, leading to busbar undervoltage and cascading trips of multiple lines. The entire process may be completed within a few hundred milliseconds. If the post-event analysis only sees the final tripped devices and ignores the sequence of intermediate protection maloperations or failures to operate, it is impossible to truly understand why a small fault evolved into a major power outage. Therefore, accurately capturing and reconstructing the complete event chain from root disturbance to system instability from massive amounts of heterogeneous data has become a key issue for achieving accurate root cause analysis and effective fault prevention.
Summary of the Invention
[0006] This invention provides a knowledge graph-driven power grid fault root cause analysis system and method, mainly including: collecting real-time power grid data through multi-source sensors to obtain event records including electrical transient disturbances, relay protection actions, and manual intervention behaviors, thereby obtaining an initial event sequence; based on the initial event sequence, using a time-series alignment algorithm to process data at different time granularities to obtain an ordered event list on a unified time axis; if the electrical connection strength between adjacent events in the ordered event list exceeds a preset threshold, determining that they are directly coupled and establishing a preliminary causal chain; for the preliminary causal chain, obtaining the protection logic rule base and extracting the matching protection device action modes from it to obtain the extended causal relationship; based on the extended causal relationship, constructing a fault propagation network using a graph model and determining the propagation paths between network nodes to obtain the complete event chain; if there are action delay events in the complete event chain, determining the key turning points by comparing the delay duration against the unified time axis; based on the key turning points, obtaining historical fault data and using a causal reasoning algorithm to analyze the disturbance instability modes before and after the turning points to obtain root cause classification results; and, based on the root cause classification results, generating an evolution path report, identifying missing links in the report, and recommending supplementary preventive measures.
[0007] The technical solutions provided by the embodiments of the present invention may include the following beneficial effects: This invention discloses a method for root cause analysis and prevention of power grid faults. It collects real-time power grid data from multi-source sensors to obtain event records including electrical transient disturbances, relay protection actions, and manual intervention, forming an initial event sequence. Subsequently, a time-series alignment algorithm is used to unify data at different time granularities, generating an ordered event list. Based on the electrical connection strength between adjacent events, direct coupling relationships are determined and a preliminary causal chain is constructed. Further, matching protection action modes are extracted from a protection logic rule base to extend the causal relationship. On this basis, a fault propagation network graph model is constructed to fully characterize the event chain and propagation path. For action delay events in the chain, key turning points are identified, and a causal reasoning algorithm is used, together with historical fault data, to analyze the disturbance instability modes before and after the turning points, ultimately achieving accurate root cause classification. Based on the root cause classification results, an evolution path report is automatically generated, missing links are identified, and targeted preventive measures are suggested. This invention addresses the core problems of difficult correlation of multi-source heterogeneous data, ambiguous causal relationships, unclear propagation paths, delayed root cause location, and unsystematic formulation of preventive measures in complex power grid faults. It improves the accuracy, timeliness, and proactiveness of fault tracing and provides technical support for the safe and stable operation of the power grid.
Attached Figure Description
[0008] Figure 1 is a flowchart of a knowledge graph-driven power grid fault root cause analysis system and method according to the present invention.
[0009] Figure 2 is a schematic diagram of a knowledge graph-driven power grid fault root cause analysis system and method according to the present invention.
[0010] Figure 3 is another schematic diagram of a knowledge graph-driven power grid fault root cause analysis system and method according to the present invention.
Detailed Implementation
[0011] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0012] As shown in Figures 1-3, this embodiment of a knowledge graph-driven power grid fault root cause analysis system and method may specifically include: S101. Collect real-time power grid data through multi-source sensors, obtain event records including electrical transient disturbances, relay protection actions and manual intervention behaviors, and obtain an initial event sequence.
[0013] Real-time power grid data is collected from multi-source sensors to obtain an initial event sequence. The initial event sequence is then timestamped to obtain an aligned event sequence. Transient disturbance segments and protection action segments are extracted from the aligned event sequence to obtain a disturbance segment set and an action segment set. The time ranges of the disturbance segment set and the action segment set are matched. If the difference between the end time of a disturbance segment and the start time of an action segment falls within a preset time window, the two are identified as a related event pair, resulting in a related event pair set. For each event pair in the related event pair set, the corresponding human intervention records are obtained, resulting in an event pair set with intervention labels. Disturbance features and action features are extracted from the event pair set with intervention labels, and a decision tree algorithm is used to classify them, resulting in disturbance type labels and action type labels. The event pair set with intervention labels is grouped according to the disturbance type labels and action type labels, resulting in a grouped event set. For each group in the grouped event set, the frequency of human intervention actions is counted, resulting in intervention statistics for each group.
[0014] By collecting real-time data from the power grid using multi-source sensors, an initial event sequence containing various information such as voltage, current, and power can be obtained.
[0015] For example, multiple monitoring points at a substation report abnormal fluctuations within a short period of time, and these reports form the initial event sequence.
[0016] In one embodiment, the initial event sequence is timestamped using a unified high-precision clock reference to align all data to millisecond-level precision, resulting in an aligned event sequence. This eliminates the sequential disorder caused by device clock drift, ensuring the timing accuracy of subsequent analysis.
[0017] Specifically, transient disturbance segments and protection action segments are extracted from the aligned event sequence. Transient disturbance segments typically manifest as a sudden drop in voltage or a sudden increase in current exceeding a threshold for tens of milliseconds to several seconds, while protection action segments correspond to the time points when relay trip signals or circuit breaker opening commands occur, thus forming a set of disturbance segments and a set of action segments.
[0018] For example, in the event of a single-phase ground fault on a line, the disturbance segment records the process of the voltage of phase A dropping from 220 kV to about 150 kV, lasting about 120 ms; the action segment records the tripping command issued by the distance protection device about 80 ms after the fault occurred.
[0019] Preferably, matching is performed based on the time difference between the end time of the disturbance segment and the start time of the action segment. If this difference falls within a preset time window, such as ±200 ms, the two are determined to be a related event pair. This time window matching method can effectively filter out irrelevant isolated events and improve the accuracy of association identification.
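The time window matching described above can be illustrated with a minimal Python sketch. The `Segment` class, the `pair_related_events` function, and the device names are hypothetical and introduced only for illustration; the sample values reuse the 120 ms disturbance and the 80 ms trip command from the preceding example, and the ±200 ms window is the example value given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A disturbance or action segment on the unified time axis (hypothetical type)."""
    device: str
    start_ms: int  # segment start in milliseconds
    end_ms: int    # segment end in milliseconds


def pair_related_events(disturbances, actions, window_ms=200):
    """Pair each disturbance with every action whose start time lies within
    +/- window_ms of the disturbance's end time."""
    pairs = []
    for d in disturbances:
        for a in actions:
            if abs(a.start_ms - d.end_ms) <= window_ms:
                pairs.append((d, a))
    return pairs


# Disturbance lasting 120 ms; trip command issued 80 ms after fault onset.
disturbances = [Segment("line phase A", 0, 120)]
actions = [
    Segment("distance protection", 80, 85),
    Segment("unrelated breaker", 5000, 5010),  # far outside the window
]
related = pair_related_events(disturbances, actions)
```

Here the trip command starts 40 ms before the disturbance ends, so it falls inside the window, while the isolated event five seconds later is filtered out.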
[0020] In one possible implementation, for each associated event pair, corresponding records of human intervention actions are further obtained. For example, a dispatcher might perform actions such as remotely controlling power transmission, adjusting protection settings, or manually tripping the circuit breaker within 30 seconds of a fault. This results in a set of event pairs labeled with intervention. These human intervention labels provide crucial causal clues for subsequent analysis.
[0021] For example, for a set of event pairs with intervention labels, features such as maximum voltage drop amplitude, recovery time, and harmonic content are extracted from disturbance segments, and features such as protection action delay, action phase, and whether reclosing was successful are extracted from action segments. A decision tree algorithm is used for classification to obtain disturbance type labels such as "single-phase grounding", "phase-to-phase short circuit", and "lightning strike", as well as action type labels such as "distance protection stage I", "zero-sequence protection", and "reclosing action".
[0022] It will be understood that the event set is grouped according to the disturbance type label and action type label. For example, all "single-phase grounding + distance protection stage I + successful reclosing" events are grouped into one group, and "single-phase grounding + distance protection stage I + manual forced reclosing" events are grouped into another group, thus obtaining the grouped event set. For each group, the frequency of manual intervention behavior is counted.
[0023] For example, in the group "single-phase grounding + distance protection stage I + reclosing failure", manual intervention occurred 12 times, including 8 forced connections, 3 setting adjustments, and 1 other instance; in the group "single-phase grounding + distance protection stage I + reclosing success", manual intervention occurred only 2 times. These intervention statistics can reveal which disturbance types result in lower protection reliability and higher demand for manual intervention, providing data support for optimizing protection strategies, reducing false trips or failures to operate, and improving the power grid's self-healing capabilities, thus contributing to more intelligent fault diagnosis and handling.
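As a rough sketch of the grouping and counting step, the following Python fragment groups labeled event pairs and tallies the manual interventions per group. The dictionary field names (`disturbance_type`, `action_type`, `reclosing`, `interventions`) and the function name are assumptions made for this illustration, and the sample records are invented toy data, not the figures quoted above.

```python
from collections import Counter, defaultdict


def intervention_stats(event_pairs):
    """Group event pairs by (disturbance type, action type, reclosing result)
    and count each kind of manual intervention within a group."""
    stats = defaultdict(Counter)
    for pair in event_pairs:
        key = (pair["disturbance_type"], pair["action_type"], pair["reclosing"])
        stats[key].update(pair["interventions"])
    return stats


events = [
    {"disturbance_type": "single-phase grounding",
     "action_type": "distance protection stage I",
     "reclosing": "failure",
     "interventions": ["forced connection", "setting adjustment"]},
    {"disturbance_type": "single-phase grounding",
     "action_type": "distance protection stage I",
     "reclosing": "failure",
     "interventions": ["forced connection"]},
    {"disturbance_type": "single-phase grounding",
     "action_type": "distance protection stage I",
     "reclosing": "success",
     "interventions": []},
]
stats = intervention_stats(events)
failure_key = ("single-phase grounding", "distance protection stage I", "failure")
success_key = ("single-phase grounding", "distance protection stage I", "success")
```

Comparing `sum(stats[failure_key].values())` with `sum(stats[success_key].values())` gives the per-group intervention totals that the text uses to compare protection reliability.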
[0024] S102. Based on the initial event sequence, use a time-series alignment algorithm to process data with different time granularities to obtain an ordered list of events on a unified time axis.
[0025] Step 1: Obtain the initial event sequence data from the repository, and perform timestamp standardization on data from different sources to obtain a preliminary event set.
Step 2: Based on the preliminary event set, use a time-series alignment algorithm to match and adjust data at different time granularities and determine a unified time reference point.
Step 3: Using the unified time reference point, arrange all events in chronological order to generate an ordered event structure based on a timeline.
Step 4: For the ordered event structure, check for missing or duplicate events on the timeline. If anomalies are detected, fill in the missing data using interpolation methods to obtain complete sequence records.
Step 5: Based on the complete sequence records, analyze the time interval distribution between events to determine whether there are any abnormal intervals. If an interval exceeds a preset threshold, mark it as a node to be processed.
Step 6: Obtain the data of the nodes to be processed, correct them using contextual event information, and adjust the abnormal intervals using a smoothing method to obtain the final ordered event list.
Step 7: Using the final ordered event list, generate a structured time-series data file and store it in a designated database, completing the data integration process.
[0026] For example, in real-time power grid data processing, timestamp standardization is particularly important for the initial event sequence data obtained from a repository. Timestamp standardization refers to unifying the time format of data records from different sources into a standard format. For example, adjusting the local time recorded by some devices to Coordinated Universal Time (UTC) ensures that subsequent analysis will not produce errors due to inconsistent time formats.
[0027] In one possible implementation, suppose a power grid monitoring system collects data from multiple substations. Some substations record the time as "2023-10-01 08:00:00", while others record it as "10/01/2023 08:00:00". Through standardization, all time formats are unified to the form "2023-10-01 08:00:00", laying the foundation for subsequent time sequence alignment.
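A minimal sketch of this format unification, using only the Python standard library, might look as follows. The function name and the list of accepted formats are assumptions; in particular, the slash-separated form is assumed to be month-first, which is what the example mapping above implies. Time zone conversion to UTC is not covered here.

```python
from datetime import datetime

# Accepted input formats (assumed); the second is read as month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M:%S")
STANDARD_FORMAT = "%Y-%m-%d %H:%M:%S"


def standardize_timestamp(raw):
    """Return the timestamp rewritten in STANDARD_FORMAT, or raise ValueError
    if none of the known formats matches."""
    cleaned = raw.strip().replace(" / ", "/")  # tolerate spaces around slashes
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime(STANDARD_FORMAT)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


unified = [standardize_timestamp(t) for t in
           ("2023-10-01 08:00:00", "10/01/2023 08:00:00")]
```

Raising an error for unknown formats, rather than guessing, keeps a misread timestamp from silently corrupting the event order.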
[0028] For example, in the application of time-series alignment algorithms, a unified time reference point can be set to adjust data with different time granularities. Suppose a sensor in a substation records voltage fluctuations every 5 seconds, while another device records current changes every 10 seconds. Using a time-series alignment algorithm, the data can be unified to a 5-second interval, with missing data points estimated through linear interpolation. This method ensures the correspondence of data from different devices on the timeline, facilitating subsequent event sequencing.
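The resampling in this example can be sketched in Python as below. This is an illustration under stated assumptions, not the patented alignment algorithm: `resample_linear` is a hypothetical helper, timestamps are given in seconds and must be strictly increasing, and the current readings are invented sample values.

```python
def resample_linear(samples, step_s):
    """Resample (t_seconds, value) pairs onto a regular grid with spacing
    step_s, estimating values between recorded points by linear interpolation."""
    if len(samples) < 2:
        return list(samples)
    out = []
    i = 0
    t = samples[0][0]
    end = samples[-1][0]
    while t <= end:
        # Advance to the recorded interval [samples[i], samples[i + 1]] containing t.
        while samples[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = samples[i], samples[i + 1]
        frac = (t - t0) / (t1 - t0)
        out.append((t, v0 + frac * (v1 - v0)))
        t += step_s
    return out


# Current recorded every 10 s, brought onto the 5 s grid used by the voltage sensor.
current_10s = [(0, 100.0), (10, 104.0), (20, 102.0)]
current_5s = resample_linear(current_10s, 5)
# current_5s == [(0, 100.0), (5, 102.0), (10, 104.0), (15, 103.0), (20, 102.0)]
```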
[0029] For example, when generating an ordered event structure based on a timeline, all events can be arranged chronologically to form a clear timeline. Suppose a voltage drop event occurs in the power grid on a certain day, recorded at 08:05:00, and subsequently, the relay protection device operates, recorded at 08:05:02. By arranging the events along the timeline, the relationship between the two events can be clearly seen, providing a basis for subsequent analysis.
[0030] In one possible implementation, once a check of the timeline has found missing or duplicate events, interpolation is a common method for filling the gaps. For example, if a sensor records one data point per minute between 08:05:00 and 08:10:00, but the data at 08:07:00 is missing, the missing value can be estimated by averaging the two data points immediately before and after 08:07:00, thus ensuring sequence integrity.
[0031] For example, when analyzing the distribution of event time intervals, if it is found that the interval between two events is 30 minutes, while the normal interval should be within 5 minutes, then it is marked as a node to be processed. By combining contextual information, such as checking whether the failure to upload data is due to equipment failure, the abnormal interval can be corrected.
[0032] In one possible implementation, smoothing out anomaly intervals can be achieved by adjusting time points or supplementing intermediate data points. For example, if an anomaly interval is 20 minutes, it can be smoothed to a record point every 5 minutes by inserting intermediate estimates, thus improving data continuity.
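The detection and smoothing of abnormal intervals described in the two paragraphs above can be sketched as follows. The function name is hypothetical, timestamps are expressed in minutes for simplicity, and the 5-minute threshold is the example value from the text. The sketch only places the intermediate record points; estimating their values would be a separate interpolation step.

```python
def fill_abnormal_gaps(timestamps_min, max_gap_min=5):
    """Insert intermediate record points wherever the gap between two
    consecutive timestamps exceeds max_gap_min.

    Returns (filled, inserted): the complete timeline and the points added,
    so that estimated records can still be told apart from measured ones."""
    filled = [timestamps_min[0]]
    inserted = []
    for prev, cur in zip(timestamps_min, timestamps_min[1:]):
        t = prev + max_gap_min
        while t < cur:  # only true when the gap is larger than max_gap_min
            filled.append(t)
            inserted.append(t)
            t += max_gap_min
        filled.append(cur)
    return filled, inserted


# A normal 5-minute interval followed by an abnormal 20-minute interval.
timeline, estimated = fill_abnormal_gaps([0, 5, 25])
# timeline == [0, 5, 10, 15, 20, 25]; estimated == [10, 15, 20]
```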
[0033] For example, generating structured time-series data files and storing them in a database provides reliable data support for subsequent power grid operation analysis. Assuming all processed data is stored in a single file in chronological order, containing information such as voltage, current, and protection actions, it facilitates quick retrieval and access. These processing steps collectively ensure high-quality data integration, providing a solid foundation for power grid monitoring and anomaly early warning.
[0034] S103. If the electrical connection strength of adjacent events in the ordered event list exceeds a preset threshold, it is determined to be direct coupling, and a preliminary causal chain is established.
[0035] Obtain the timestamps and electrical parameters of all events in the ordered event sequence. For adjacent event pairs, calculate the electrical connection strength between each pair. If the electrical connection strength is greater than a preset threshold, the adjacent event pairs are determined to have a direct coupling relationship. Connect the adjacent event pairs determined to be directly coupled in chronological order to obtain a preliminary causal chain. Obtain the electrical connection strength value of each coupling relationship in the preliminary causal chain. Traverse the preliminary causal chain; if three consecutive events are directly coupled, mark this chain segment as a strongly correlated path. Extract key event nodes from the strongly correlated paths to form a simplified causal sub-chain.
[0036] For example, when processing ordered event sequences, one can start by analyzing timestamps and electrical parameters to examine the correlation between events in an electrical system. Suppose that in a power transmission network, an event sequence records the changes in the operating status of multiple devices within a substation. Each event includes a specific timestamp and corresponding electrical parameters, such as voltage and current values. Extracting this data can lay the foundation for subsequent causal analysis.
[0037] Specifically, the calculation of electrical connection strength for adjacent event pairs can be based on the changing trends of electrical parameters to determine the correlation between the two events. Suppose two adjacent events have timestamps of 10:00:00 and 10:00:05, with the first event having a voltage of 220V and the second event having a voltage of 218V, and the current values also showing slight fluctuations. This parameter variation may indicate some kind of electrical influence between the two events. The connection strength assessment can combine the magnitude of parameter changes and the time interval, setting a threshold; for example, a strength value greater than 0.8 is considered direct coupling. This method helps identify potential correlations between events.
[0038] For example, after determining direct coupling, these adjacent event pairs can be connected chronologically to form a preliminary causal chain. Suppose there are 10 events in a day's monitoring data, and 3 pairs of events are determined to be directly coupled, with the chronological order being event A to B, B to C, and C to D. Then the preliminary chain is ABCD. This chain reflects the possible event propagation paths in the electrical system, providing an intuitive basis for subsequent analysis.
[0039] Specifically, when three consecutive events are directly coupled, they can be marked as strongly correlated paths. For example, if the connection strengths of AB, BC, and CD in the above chain are all above a threshold, then the path ABCD is marked as a strongly correlated path. This marking method highlights potentially critical chains of influence in the system, facilitating the rapid identification of important events.
[0040] For example, when extracting key event nodes, nodes with a significant impact on the system can be selected from strongly correlated paths to form a simplified causal sub-chain. Suppose that in the ABCD path, the electrical parameter changes of events B and D are the most significant, such as a sudden voltage drop or a sudden current surge. Then, B and D can be extracted as key nodes, forming a simplified BD sub-chain. This simplification method reduces data redundancy, focuses on core events, and facilitates subsequent analysis and decision-making.
[0041] Specifically, numerical analysis of electrical connection strength can provide data support for path optimization. Assuming the strength value of AB is 0.85, BC is 0.9, and CD is 0.88, these values can be used to further evaluate the stability of the path. This refined analysis helps identify weak links in the chain, providing a reference for system maintenance.
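The chain construction, strong-path marking, and weak-link check from the preceding paragraphs can be sketched in Python as follows. How the electrical connection strength itself is computed is not specified in the text, so the sketch takes the strength values as given inputs. The function names are hypothetical, and "strongly correlated" is interpreted here as a run of at least three directly coupled events, which is one reading of the description.

```python
def build_causal_chains(events, strengths, threshold=0.8):
    """events: event ids in time order; strengths[i] is the connection
    strength between events[i] and events[i + 1].
    Returns the maximal runs of directly coupled adjacent events."""
    chains, current = [], [events[0]]
    for nxt, s in zip(events[1:], strengths):
        if s > threshold:
            current.append(nxt)
        else:
            if len(current) > 1:
                chains.append(current)
            current = [nxt]
    if len(current) > 1:
        chains.append(current)
    return chains


def strongly_correlated(chains, min_events=3):
    """Keep chains containing at least min_events consecutive coupled events."""
    return [c for c in chains if len(c) >= min_events]


def weakest_link(events, strengths):
    """Return (event, next_event, strength) for the lowest-strength coupling."""
    i = min(range(len(strengths)), key=strengths.__getitem__)
    return events[i], events[i + 1], strengths[i]


events = ["A", "B", "C", "D", "E"]
strengths = [0.85, 0.9, 0.88, 0.3]  # AB, BC, CD from the text; DE is invented
chains = build_causal_chains(events, strengths)
strong = strongly_correlated(chains)
weak = weakest_link(events[:4], strengths[:3])
```

With these inputs the only chain is A, B, C, D, it qualifies as a strongly correlated path, and the weakest coupling inside it is A to B at 0.85.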
[0042] For example, in practical applications, the extraction of strongly correlated paths and simplified subchains can help maintenance personnel quickly locate potential problem areas in electrical systems. Suppose that in the monitoring of a substation, the above method reveals that a certain path frequently exhibits strong correlations. Combined with critical node analysis, this can provide early warnings of potential equipment aging or overload risks. The application of this method can significantly improve system stability and response efficiency.
[0043] S104. For the preliminary causal chain, obtain the protection logic rule base, extract the matching protection device action mode from it, and obtain the extended causal relationship.
[0044] A preliminary causal chain is received from an external system. Using the key component identifiers in the preliminary causal chain, all action modes of the corresponding protection device are retrieved from the protection logic rule base. Each retrieved action mode is compared with the causal description in the preliminary causal chain using exact string matching. If a match is found, the triggering condition and consequence event of that action mode are appended to the end of the causal chain, resulting in a first extended causal chain. Based on the newly added consequence event in the first extended causal chain, the associated next-level protection device action modes are retrieved again from the protection logic rule base. Pattern matching is used to determine whether a newly retrieved action mode forms a causal connection with the end event of the first extended causal chain. If it does, that action mode is also appended, resulting in a second extended causal chain. This expansion check is repeated on the second extended causal chain, and when no more matching next-level action modes exist in the protection logic rule base, the final extended causal relationship is output.
[0045] For example, in the field of power system protection, after receiving a preliminary causal chain from an external system, it needs to be expanded to form a more complete causal network. The preliminary causal chain might include an initial event triggered by a primary equipment failure within a substation, such as a short circuit on a line causing an abnormal increase in current. For this event, the corresponding protection device action mode can be extracted from the protection logic rule base using key component identifiers, such as line numbers or circuit breaker numbers. Assuming the rule base records that the main protection device corresponding to this line will trigger a trip when the current exceeds twice the rated value, this action mode becomes the basis for expanding the chain.
[0046] Specifically, the comparison process for the operating modes of protection devices can be achieved through precise string matching. Assuming the causal description in the initial causal chain is "line short circuit - abnormal current increase," while the operating mode description in the rule base is "abnormal current increase - main protection trip," a logical connection is found between the two. Therefore, the consequence event "main protection trip" and its triggering condition "current exceeds twice the rated value" are added to the end of the chain, forming the first extended causal chain. This process ensures the logical continuity of the causal chain and provides a more comprehensive event sequence for subsequent analysis.
[0047] For example, after the first extended causal chain is formed, the newly added "main protection trip" event may trigger a wider impact. Based on this event, the action mode of the next level protection device can be extracted again from the rule base, such as the protection logic of the backup protection device connected to the line or adjacent equipment. Assuming there is a rule in the rule base described as "main protection trip - backup power supply switching", pattern matching reveals that it forms a causal connection with the end event "main protection trip" of the first extended causal chain. Then, the action mode "backup power supply switching" is added to the chain, forming a second extended causal chain. This extension process reflects the hierarchy and linkage of the protection system.
[0048] Specifically, the cyclical expansion judgment is the core of the entire process. For the second extended causal chain, the rule base is queried again. Suppose that the "backup power supply switching" event may trigger "temporary voltage fluctuation", but there is no further next-level action mode in the rule base that matches it. Then the expansion process terminates, and the final extended causal relationship is output. This cyclical mechanism ensures that the causal chain does not expand indefinitely, while covering all possible protection logic associations.
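The cyclic expansion can be illustrated with a small Python sketch. The rule base layout (a dictionary keyed by triggering event, holding a condition and a consequence) and the function name are assumptions made for this example; the rule contents are taken from the scenario described above. A step limit and a repeat check are added as safeguards against circular rules, which the text does not discuss.

```python
# Hypothetical rule base: triggering event -> condition and consequence event.
RULE_BASE = {
    "abnormal current increase": {
        "condition": "current exceeds twice the rated value",
        "consequence": "main protection trip",
    },
    "main protection trip": {
        "condition": None,
        "consequence": "backup power supply switching",
    },
}


def extend_causal_chain(chain, rule_base, max_steps=100):
    """Repeatedly look up the chain's last event in the rule base (exact
    string match) and append the consequence until no rule matches."""
    extended = list(chain)
    for _ in range(max_steps):
        rule = rule_base.get(extended[-1])
        if rule is None or rule["consequence"] in extended:
            break  # no matching next-level rule, or the rule would loop back
        extended.append(rule["consequence"])
    return extended


preliminary = ["line short circuit", "abnormal current increase"]
final_chain = extend_causal_chain(preliminary, RULE_BASE)
```

Starting from the two-event preliminary chain, the loop appends "main protection trip" and then "backup power supply switching", and stops because no rule is keyed on the last event.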
[0049] For example, from another perspective, the extraction of key component identifiers and the retrieval of rule bases can be achieved through a pre-defined mapping table. Suppose that each line and circuit breaker in a substation has a unique number, such as line number L-001 corresponding to main protection device P-001. The rule base stores the action modes of P-001 under different fault scenarios, such as tripping during a short circuit and alarming during an overload. Through this mapping relationship, relevant rules can be quickly located, improving expansion efficiency.
[0050] Specifically, pattern matching can also be implemented by combining priority rules. Assuming the rule base contains multiple action patterns related to "main protection tripping," such as "backup power supply switching" and "adjacent line protection activation," the most likely pattern can be selected for expansion based on the priority of the protection logic. For example, prioritizing power supply switching to ensure power supply continuity. This approach makes the expansion of the causal chain more aligned with actual operational needs. Through these multi-level and multi-faceted expansion methods, the resulting causal network comprehensively reflects the dynamic evolution of power system protection logic, providing a reliable basis for subsequent analysis and decision-making.
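The mapping table and priority-based selection described in the last two paragraphs might be sketched as follows. The table contents reuse the identifiers L-001 and P-001 from the text; the numeric priorities, the data layout, and the function name are assumptions, with a lower number meaning a higher priority.

```python
# Mapping from component identifier to its main protection device.
COMPONENT_TO_PROTECTION = {"L-001": "P-001"}

# Candidate follow-up action modes per event, each with an assumed priority.
FOLLOW_UP_RULES = {
    "main protection trip": [
        {"action": "adjacent line protection activation", "priority": 2},
        {"action": "backup power supply switching", "priority": 1},
    ],
}


def select_follow_up(event, rules):
    """Return the highest-priority follow-up action for an event,
    or None when the rule base has no candidate."""
    candidates = rules.get(event, [])
    if not candidates:
        return None
    return min(candidates, key=lambda rule: rule["priority"])["action"]


device = COMPONENT_TO_PROTECTION["L-001"]
chosen = select_follow_up("main protection trip", FOLLOW_UP_RULES)
```

In this toy rule base, power supply switching is ranked first, matching the example of prioritizing supply continuity.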
[0051] S105. Based on the extended causal relationship, a fault propagation network is constructed using a graph model, the propagation path between network nodes is determined, and the complete event chain is obtained.
[0052] Obtain system operation logs and alarm records. Extract initial causal relationship pairs using a pre-established causal knowledge base. Transform the causal relationship pairs into a directed graph structure using a graph model, where nodes represent fault events and directed edges represent causal propagation directions. For each node in the graph model, expand its outgoing edges to obtain a set of directly downstream fault nodes. Determine if each downstream node has a known triggering condition; if so, add the downstream node to the propagation path sequence. Obtain the last node in the current propagation path sequence and continue expanding its outgoing edges according to the graph model to obtain the next layer of fault nodes. Repeat the expansion and determination process until the current node no longer generates new downstream nodes or reaches the maximum propagation level preset by the graph model. Traverse all propagation path sequences initiated by the initial fault nodes and merge path segments with containment relationships. Determine the complete chain sequence with the highest frequency among all propagation path sequences to obtain the main fault event chain.
[0053] For example, in power system fault analysis, the fault propagation path can be constructed from system operation logs and alarm records through the following steps, beginning with the acquisition of those logs and records.
[0054] The system records equipment operating status and abnormal alarm information in real time. For example, if a circuit breaker in a substation trips at 14:30, the log records the trip time as 14:30 and the alarm record shows an overload alarm. This data provides the foundation for subsequent analysis.
[0055] The pre-established causal knowledge base is then used to construct the initial causal pairs. Assuming the knowledge base stores the causal rule "overload causes circuit breaker tripping," combining this rule with the overload alarm and the tripping record from the logs yields the initial causal pair: overload is the cause, and tripping is the result. This method quickly identifies the initial associations of a fault.
[0056] For example, in the process of converting causal pairs into a directed graph structure, fault events can be treated as nodes, such as overload as node A and circuit breaker tripping as node B. Directed edges from A to B represent the direction of causal propagation. Such a graph model intuitively shows how a fault propagates from one event to another, facilitating subsequent extended analysis.
[0057] For example, when expanding the outgoing edges of each node in the graph model, suppose the downstream of node B includes the event "line power outage," designated as node C. By querying the knowledge base, it is confirmed that the triggering condition for a line power outage is a circuit breaker tripping. Node C is then added to the propagation path sequence, forming a path from A to B and then to C. This expansion helps to discover the deeper impact of faults.
[0058] For example, regarding the continued expansion of the propagation path sequence, suppose that downstream of node C there is an event called "customer power outage," designated as node D. If the relevant triggering condition exists in the knowledge base, node D is added to the path, forming a complete chain from overload to customer power outage. Such layer-by-layer expansion reveals the breadth of the fault's impact and continues until no new downstream nodes are produced or a preset maximum propagation level is reached, such as a limit of 5 levels, which avoids excessively long paths.
[0059] For example, when merging propagation path sequences, if paths originating from different initial fault nodes share a segment, such as both containing the segment from "circuit breaker tripping" to "line power outage," the shared segments are merged to reduce redundancy. This approach improves analysis efficiency and clearly presents the main trunk of fault propagation.
[0060] For example, to identify the primary fault event chain, the most frequently occurring complete chain among all paths can be identified. Suppose the chain "overload - circuit breaker tripping - line power outage - customer power outage" occurs most frequently across multiple paths, for example, 8 times out of 10 paths, then it is identified as the primary fault event chain. This method focuses on the core issue, providing crucial information for subsequent fault localization and handling. Through this approach, not only can the overall picture of fault propagation be systematically analyzed, but key impact paths can also be effectively identified, supporting rapid response and recovery.
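The expansion, merging, and frequency-based selection steps of S105 can be illustrated with the following sketch. It is a simplified illustration under stated assumptions: the function names, the dictionary-based edge representation, and the choice to count frequencies over the raw observed paths (before any merging) are introduced here for illustration and are not prescribed by this application.

```python
from collections import Counter


def expand_paths(edges, known_triggers, start, max_levels=5):
    """Enumerate propagation paths from `start` by depth-first expansion."""
    paths = []

    def walk(path):
        downstream = [
            n for n in edges.get(path[-1], [])
            if n in known_triggers and n not in path  # skip unconfirmed nodes and cycles
        ]
        # Stop when nothing new is produced or the level limit is reached.
        if not downstream or len(path) - 1 >= max_levels:
            paths.append(tuple(path))
            return
        for node in downstream:
            walk(path + [node])

    walk([start])
    return paths


def is_contained(short, long):
    """True if `short` appears as a contiguous segment of `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def merge_contained(paths):
    """Drop duplicate paths and paths fully contained in another path."""
    unique = list(dict.fromkeys(paths))
    return [p for p in unique if not any(p != q and is_contained(p, q) for q in unique)]


def main_chain(observed_paths):
    """Return (chain, count) for the most frequent complete chain."""
    return Counter(observed_paths).most_common(1)[0]


# Hypothetical graph for the overload example in the text.
EDGES = {
    "overload": ["circuit breaker tripping"],
    "circuit breaker tripping": ["line power outage"],
    "line power outage": ["customer power outage"],
}
KNOWN = {"circuit breaker tripping", "line power outage", "customer power outage"}

print(expand_paths(EDGES, KNOWN, "overload"))
```

In this sketch the level limit counts hops from the initial fault node, so `max_levels=5` permits at most five propagation steps. How merging and frequency counting interact in a real deployment is a design choice that this sketch does not settle.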
[0061] S106. If there is an action delay event in the complete event chain, the key turning point is determined by comparing the delay duration with the unified timeline.
[0062] Retrieve the complete sequence of action events in the event chain. For each sequence, determine if there are any delayed action events. If so, extract the delay duration of each event. Align and map all event timestamps using a unified timeline. Place the delay durations onto the corresponding coordinates on the unified timeline based on the aligned timestamp positions. Use a timeline coordinate comparison method to determine which delay durations fall within critical intervals. If a delay duration falls within a critical interval, identify the corresponding position as a critical turning point. Output the list of critical turning points in chronological order.
[0063] In one possible implementation, after the complete sequence of action events in the full event chain is obtained, it is first necessary to determine whether each sequence contains action delay events. An action delay event is an operation whose actual occurrence time is significantly later than its expected trigger time; for example, an automatic isolation action that the system should execute immediately after an alarm but that is executed late.
[0064] Specifically, when the timestamp of an action event is delayed by more than a preset threshold compared to the end time of its preceding causal event, it is determined to be a delayed event, and the specific duration of the delay is extracted.
[0065] For example, in a power dispatching system, the chain sequence is "line overload alarm → protection device operation → automatic load disconnection → backup power supply activation". The "automatic load disconnection" should ideally be completed within 5 seconds of the protection device's operation, but if it actually takes 18 seconds, it is considered a delayed event with a delay duration of 13 seconds.
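The delay determination in the example above can be sketched as follows. This is an illustrative sketch only: it assumes each event is instantaneous, so that the timestamp of the preceding event stands in for its end time, and the `find_delayed` helper, the per-event allowance table, and all timestamps other than the 5-second and 18-second figures are hypothetical.

```python
def find_delayed(chain, allowed_s):
    """Return (event name, delay in seconds) for events that exceed their allowance.

    chain: list of (name, timestamp_s) tuples in causal order.
    allowed_s: maps an event name to the maximum expected lag after its predecessor.
    Events without an entry in allowed_s are never flagged.
    """
    delayed = []
    for (_, prev_ts), (name, ts) in zip(chain, chain[1:]):
        excess = (ts - prev_ts) - allowed_s.get(name, float("inf"))
        if excess > 0:
            delayed.append((name, excess))
    return delayed


# Load disconnection is expected within 5 s of the protection operation
# but actually occurs 18 s after it, giving a 13 s delay.
CHAIN = [
    ("line overload alarm", 0),
    ("protection device operation", 2),
    ("automatic load disconnection", 20),
    ("backup power supply activation", 23),
]
ALLOWED = {"automatic load disconnection": 5}

print(find_delayed(CHAIN, ALLOWED))
```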
[0066] It should be noted that aligning and mapping all event timestamps using a unified timeline is a crucial step. Projecting the timestamp of each event in the chain onto the same continuous timeline eliminates the minor deviations caused by the different log sampling frequencies of different devices and creates a unified relative coordinate system.
[0067] For example, the earliest alarm event is taken as the zero point of the time axis, and other events are mapped to their relative second positions in sequence.
[0068] Preferably, each delay duration is placed at a corresponding coordinate on a unified timeline based on the aligned timestamp position. Specifically, the expected start time of the delayed event is used as the starting point for placement, and the corresponding delay duration is extended forward to form a delay interval.
[0069] For example, if an action is expected to be triggered at the 120th second of the timeline, but actually occurs at the 145th second, then a delay interval of 25 seconds is marked at the 120th second position.
[0070] A timeline coordinate comparison method is then used to determine which delay durations fall within critical intervals. Critical intervals are typically predefined by domain experts, such as the fault propagation acceleration phase or the critical stabilization window.
[0071] For example, in the above power scenario, the period from the 90th to the 150th second is defined as the critical interval, because an excessively long delay within this interval is very likely to cause cascading trips to escalate.
[0072] In one embodiment, if the delay duration falls within a critical interval, the corresponding position is determined to be a critical turning point.
[0073] For example, a 13-second load shedding delay falling within the interval of 120 to 145 seconds is marked as a critical turning point. The list of critical turning points is then output in chronological order, such as "Critical Turning Point 1: 127 seconds, load shedding delay 13 seconds; Critical Turning Point 2: 210 seconds, backup power supply activation delay 8 seconds." Identifying these critical turning points reveals the time window for fault escalation and the bottlenecks in human or system response, thus providing a precise basis for subsequent targeted optimization.
[0074] In one possible implementation, this method helps operations and maintenance personnel quickly locate slow response points, reduces the probability of similar events recurring, and provides data support for developing minute-level emergency response plans.
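The placement of delay intervals on the unified timeline and their comparison against critical intervals can be sketched as follows. This sketch rests on several assumptions: "falls within" is interpreted as any overlap between the delay interval and a critical interval, the expected and actual times of the two events are chosen to reproduce the example list, and the second critical interval (200 to 240 seconds) is hypothetical, since only the 90 to 150 second interval is given in the text.

```python
from dataclasses import dataclass


@dataclass
class DelayedEvent:
    name: str
    expected_s: int  # expected trigger time, seconds from timeline zero
    actual_s: int    # actual occurrence time, seconds from timeline zero


def critical_turning_points(events, critical_intervals):
    """Return (position_s, name, delay_s) for delays overlapping a critical interval."""
    points = []
    for ev in events:
        start, end = ev.expected_s, ev.actual_s  # delay interval on the timeline
        if end <= start:
            continue  # not delayed
        if any(start < hi and end > lo for lo, hi in critical_intervals):
            points.append((start, ev.name, end - start))
    return sorted(points)  # chronological order by position


EVENTS = [
    DelayedEvent("backup power supply activation", 210, 218),
    DelayedEvent("load shedding", 127, 140),
    DelayedEvent("operator acknowledgement", 300, 310),  # outside all critical intervals
]
CRITICAL = [(90, 150), (200, 240)]  # second interval is a hypothetical addition

for position, name, delay in critical_turning_points(EVENTS, CRITICAL):
    print(f"{position} s: {name} delayed by {delay} s")
```

A stricter reading, in which the entire delay interval must lie inside a critical interval, would only require changing the overlap test to a containment test.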
[0075] S107. Based on the key inflection points, obtain historical fault data, and use causal reasoning algorithms to analyze the disturbance instability modes before and after the inflection points to obtain root cause classification results.
[0076] Step 1: Acquire historical fault data, perform preliminary screening based on key turning points, and extract relevant records within the time periods before and after the turning points to obtain a turning point correlation dataset.
Step 2: Based on the turning point correlation dataset, analyze the disturbance and instability patterns in the data before and after the turning points, and use statistical tools to quantify the pattern changes to determine the pattern change feature set.
Step 3: For the pattern change feature set, apply causal inference algorithms to deduce the correlation between disturbance and instability patterns, determine the causal relationships between each pattern, and obtain a causal relationship network.
Step 4: If significant correlation paths exist in the causal relationship network, extract the core nodes in the paths, and perform comparative analysis with the fault mode data to determine the potential root cause set.
Step 5: Based on the potential root cause set and the turning point analysis results, classify the root causes by grouping them according to preset classification rules to obtain preliminary classification results.
Step 6: For the preliminary classification results, use data analysis methods to verify the consistency of the classification results. If the verification reveals classification deviations, readjust the classification boundaries to obtain the final root cause classification results.
Step 7: Based on the final root cause classification results, generate structured data records and store them in a preset database for subsequent business queries and analysis.
[0077] After the historical fault data is obtained, preliminary screening is first performed based on the times of the key turning points.
[0078] For example, in a power system voltage drop event, the turning point was determined to be around 17:23:15. All telemetry records and alarm logs from 5 minutes before to 3 minutes after this time point were then extracted to form the turning point correlation dataset. This dataset contains key information such as voltage, current, active power, reactive power, and protection device operation signals, providing a complete foundation for subsequent analysis. The core step performed on this dataset is the analysis of disturbance modes and instability modes.
[0079] Specifically, before the turning point, the system exhibited a small-amplitude oscillation pattern with frequencies fluctuating between 0.1 Hz and 0.5 Hz. After the turning point, it rapidly evolved into a large-amplitude low-frequency oscillation, with the amplitude increasing from 0.02 per unit to 0.18 per unit. Statistical tools were used to quantify parameters such as amplitude, period, and damping ratio, ultimately yielding the pattern change feature set. The damping ratio dropped sharply from a positive value of 0.12 to a negative value of -0.09, indicating that the system moved from a stable state into an unstable region. Based on this pattern change feature set, a causal inference algorithm was applied to further deduce the correlation between disturbances and instability.
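One common way to quantify a damping ratio from measured oscillation peaks is the logarithmic decrement method, sketched below. This application does not state which statistical tool is used, so the method, the `damping_ratio` helper, and the intermediate peak values in the example are assumptions for illustration. A positive result indicates a decaying oscillation, and a negative result indicates a growing one.

```python
import math


def damping_ratio(peaks):
    """Estimate the damping ratio from successive positive peak amplitudes.

    Uses the logarithmic decrement delta = ln(x_0 / x_n) / n and
    zeta = delta / sqrt(4*pi^2 + delta^2).
    """
    if len(peaks) < 2 or min(peaks) <= 0:
        raise ValueError("need at least two positive peak amplitudes")
    n = len(peaks) - 1
    delta = math.log(peaks[0] / peaks[-1]) / n
    return delta / math.sqrt(4 * math.pi ** 2 + delta ** 2)


# Hypothetical peak series growing from 0.02 to 0.18 per unit after the turning point.
growing = [0.02, 0.035, 0.06, 0.105, 0.18]
zeta = damping_ratio(growing)
print("unstable" if zeta < 0 else "stable", round(zeta, 3))
```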
[0080] In one embodiment, a structural causal model is used to conduct correlation analysis on three types of events: voltage drop, generator output change, and load tripping. It is found that voltage drop is the direct cause of the sharp drop in damping ratio, while generator output change is the upstream cause of voltage drop, forming a clear causal chain, thereby constructing a causal relationship network containing seven nodes.
[0081] Preferably, when a significant correlation path exists in the causal relationship network, such as the path of voltage drop → damping ratio decrease → power oscillation amplification, it is determined to be a strong correlation. In this case, the core nodes on the path, namely the voltage drop node and the damping ratio node, are extracted. A comparison with a historical fault mode database is then performed, and it is found that this voltage drop characteristic closely matches the known transient imbalance after a line overload disconnection. The potential root cause set is therefore determined to mainly include two situations: line overload protection maloperation and excessive power flow transfer between adjacent lines. Based on the potential root cause set and the results of the turning point analysis, the root causes are classified. Through preset classification rules, the root causes are divided into three groups: equipment-related, operation mode-related, and relay protection-related. The equipment-related group corresponds to line overload protection maloperation, and the operation mode-related group corresponds to excessive power flow transfer. These groups form the preliminary classification results, whose consistency is then verified using data analysis methods.
[0082] For example, comparing the power flow distribution curves before and after the turning point with the protection setting curves revealed a 92% probability of protection maloperation, while the probability of excessive power flow transfer was only 65%. Verification showed a deviation in the classification boundary, so the thresholds were readjusted, lowering the confidence level for the operation mode category. The final root cause classification result was primarily equipment-related, accounting for approximately 85%. Based on the final root cause classification results, structured records containing fields such as event time, root cause category, confidence level, and correlation path were generated and stored directly in the fault analysis database. This process enables subsequent queries to quickly locate the technical characteristics of similar events and provides operators with targeted improvement measures, such as optimizing protection settings or adjusting system operating modes, thereby reducing the probability of similar faults recurring and improving the overall stability of the power grid.
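The structured record and its storage can be sketched using Python's built-in `sqlite3` module. The table name, column names, and the stored values below are hypothetical choices that mirror the fields listed above (event time, root cause category, confidence level, correlation path); this application does not specify a schema or a database product.

```python
import json
import sqlite3


def store_record(conn, event_time, category, confidence, path):
    """Insert one root cause record; the correlation path is stored as JSON text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS root_cause "
        "(event_time TEXT, category TEXT, confidence REAL, path TEXT)"
    )
    conn.execute(
        "INSERT INTO root_cause VALUES (?, ?, ?, ?)",
        (event_time, category, confidence, json.dumps(path)),
    )
    conn.commit()


def query_by_category(conn, category):
    """Return (event_time, confidence, path) tuples for one root cause category."""
    rows = conn.execute(
        "SELECT event_time, confidence, path FROM root_cause "
        "WHERE category = ? ORDER BY event_time",
        (category,),
    ).fetchall()
    return [(t, c, json.loads(p)) for t, c, p in rows]


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
store_record(
    conn,
    "17:23:15",
    "equipment-related",
    0.85,  # illustrative confidence value
    ["voltage drop", "damping ratio decrease", "power oscillation amplification"],
)
print(query_by_category(conn, "equipment-related"))
```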
[0083] S108. Based on the root cause classification results, generate an evolutionary path report, identify missing links in the report, and obtain supplementary preventive measures recommendations.
[0084] By organizing the root cause classification results, key category information is extracted to form an initial classification dataset. Based on the classification dataset, a pre-established path construction model is used to generate corresponding evolutionary path data and identify the main nodes in the path. If there are breaks in the connections between nodes in the generated evolutionary path data, the missing node information is obtained through comparison with historical data to obtain the complete path structure. For the complete path structure, the logical relationships between each node are analyzed to identify uncovered missing links and potential risk points. Starting from the identified missing links, combined with risk point data, a preliminary plan for preventive measures is constructed, and targeted response strategies are determined. The strategy content in the preliminary plan is obtained and matched with a historical case database to propose specific supplementary suggestions, resulting in a final list of measures. Based on the final list of measures, an analysis report containing the evolutionary path and preventive strategies is generated, and the completeness of the report content is ensured.
[0085] For example, in the root cause classification results of power system voltage instability, key category information was extracted, including load surge type, line disconnection type, and generator tripping type, forming an initial classification dataset. This dataset records the main category to which each fault event belongs and its frequency percentage.
[0086] In one possible implementation, a pre-established path construction model is used to process the classification dataset and generate corresponding evolutionary path data.
[0087] Specifically, for the load surge type, the model output path is: initial small disturbance → rapid load increase → continuous voltage drop → low voltage protection action → system disconnection, with the main nodes being the load node, voltage monitoring point and protection action point in sequence.
[0088] It should be noted that if there are interruptions in the connections between nodes in the generated evolution path data, such as a lack of intermediate transition between a voltage drop node and a protection action node, historical data comparison can be used to retrieve records of insufficient reactive power compensation or line power flow transfer in similar events. This allows for the supplementation of missing nodes and the acquisition of a complete path structure, such as adding a stage where reactive power reserves are depleted. For the complete path structure, analyzing the logical relationships between nodes can identify uncovered missing links. For example, in a load surge path, the lag in dynamic reactive power support response may not be fully reflected. This missing link is identified as a potential risk point that could accelerate voltage collapse. Based on the identified missing links and risk point data, a preliminary preventative measure plan is constructed.
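The completion of a broken evolution path through comparison with historical data can be sketched as follows. The `fill_gaps` and `find_bridge` helpers, the representation of confirmed links as a set of node pairs, and the sample historical record are assumptions made for illustration; the path construction model itself is not reproduced here.

```python
def find_bridge(a, b, historical_paths):
    """Return the intermediate nodes between a and b in the first matching historical path."""
    for h in historical_paths:
        if a in h and b in h and h.index(a) < h.index(b):
            return list(h[h.index(a) + 1:h.index(b)])
    return []  # no historical record covers this gap


def fill_gaps(path, known_links, historical_paths):
    """Insert missing intermediate nodes wherever two consecutive nodes lack a confirmed link."""
    completed = [path[0]]
    for nxt in path[1:]:
        prev = completed[-1]
        if (prev, nxt) not in known_links:
            completed.extend(find_bridge(prev, nxt, historical_paths))
        completed.append(nxt)
    return completed


# The generated path lacks a transition between the voltage drop and the protection action.
PATH = ["rapid load increase", "continuous voltage drop", "low voltage protection action"]
KNOWN_LINKS = {("rapid load increase", "continuous voltage drop")}
HISTORY = [
    ["continuous voltage drop", "reactive power reserves depleted", "low voltage protection action"],
]

print(fill_gaps(PATH, KNOWN_LINKS, HISTORY))
```

When no historical record covers a gap, the sketch leaves the path unchanged, so such gaps would still need to be reported as uncovered missing links.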
[0089] Preferably, the strategies for addressing dynamic reactive power response lag include adding a fast reactive power compensation device and optimizing automatic voltage control parameters.
[0090] Specifically, after obtaining the strategy content from the preliminary plan, it is matched with a historical case database.
[0091] For example, a regional power grid experienced a large-scale power outage due to a similar lag problem. After matching against this case, supplementary recommendations were made: reduce the response time of reactive power compensation devices from the current 0.5 seconds to less than 0.2 seconds, and conduct regular online assessments of reactive power margin. In another embodiment, for line disconnection paths, supplementary recommendations may include strengthening relay protection setting verification and introducing wide-area backup protection functions to avoid cascading effects caused by a single component failure. Based on the final list of measures, an analysis report containing the evolution path and prevention strategies is generated. The report clearly marks the complete node sequence of the load surge path, the location of risk points, and the corresponding strategies to ensure completeness. This report can be used to guide operators to intervene in advance and reduce the probability of instability.
[0092] Through path completion and targeted strategy matching, the above process shifts prevention efforts from passive response to proactive control, improving the system stability margin, reducing the number of large-scale failures, and thus supporting the safe and reliable operation of the power grid.
[0093] If the technical solution of this application involves the collection, processing, or application of personal information, the relevant products have, before implementing any personal information processing activities, fully and clearly informed individuals of the processing rules in accordance with the "Personal Information Protection Law of the People's Republic of China" and other current laws and regulations, and obtained their voluntary and explicit consent. If sensitive personal information is involved, the product has obtained the individual's separate consent before processing, and such consent is given in an explicit manner. For example, prominent signs are set up in the area where information collection devices such as cameras are located, clearly indicating "Entering is considered as consent to the collection of personal information"; or through pop-ups, checkboxes, user-initiated uploads, etc., under the premise of clearly listing the processor's identity, processing purpose, processing method, and information type, the user actively completes the authorization operation. The above mechanisms ensure that all personal information processing activities are based on legal authorization and fully comply with national compliance requirements regarding personal information protection.
[0094] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to any specific implementation. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims
1. A knowledge graph-driven system and method for analyzing the root causes of power grid faults, characterized in that the method includes: collecting real-time power grid data through multi-source sensors to obtain event records including electrical transient disturbances, relay protection actions, and manual intervention behaviors, thereby obtaining an initial event sequence; based on the initial event sequence, processing data at different time granularities using a time-series alignment algorithm to obtain an ordered event list on a unified timeline; if the electrical connection strength between adjacent events in the ordered event list exceeds a preset threshold, determining that the events are directly coupled and establishing a preliminary causal chain; for the preliminary causal chain, obtaining a protection logic rule base and extracting the matching protection device action patterns from it to obtain an extended causal relationship; based on the extended causal relationship, constructing a fault propagation network using a graph model and determining the propagation paths between network nodes to obtain a complete event chain; if an action delay event exists in the complete event chain, determining key turning points by comparing the delay duration against the unified timeline; based on the key turning points, obtaining historical fault data and analyzing the disturbance and instability modes before and after the turning points using a causal reasoning algorithm to obtain root cause classification results; and based on the root cause classification results, generating an evolution path report, identifying missing links in the report, and obtaining supplementary preventive measure recommendations.
2. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that collecting real-time power grid data through multi-source sensors to obtain event records including electrical transient disturbances, relay protection actions, and manual intervention behaviors, thereby obtaining an initial event sequence, includes: collecting real-time power grid data using multi-source sensors to obtain the initial event sequence; performing timestamp alignment on the initial event sequence to obtain an aligned event sequence; extracting transient disturbance fragments and protection action fragments from the aligned event sequence to obtain a disturbance fragment set and an action fragment set; matching the time ranges of the disturbance fragment set against the time ranges of the action fragment set, and, if the difference between the end time of a disturbance fragment and the start time of an action fragment is within a preset time window, determining that they form a pair of related events, thereby obtaining a set of related event pairs; for each event pair in the set of related event pairs, obtaining the corresponding manual intervention behavior record to obtain a set of event pairs with intervention annotations; extracting disturbance features and action features from the set of event pairs with intervention annotations; classifying the disturbance features and action features using a decision tree algorithm to obtain disturbance type labels and action type labels; grouping the set of event pairs with intervention annotations according to the disturbance type labels and the action type labels to obtain a grouped event set; and, for each group in the grouped event set, counting the frequency of manual intervention behaviors to obtain the intervention statistics for each group.
3. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that processing data at different time granularities using a time-series alignment algorithm based on the initial event sequence to obtain an ordered event list on a unified timeline includes: Step 1: Obtain the initial event sequence data from the repository, and perform timestamp standardization on data from different sources to obtain a preliminary event set; Step 2: Based on the preliminary event set, use a time-series alignment algorithm to match and adjust data at different time granularities to determine a unified time reference point; Step 3: Using the unified time reference point, arrange all events in chronological order to generate an ordered event structure based on a timeline; Step 4: For the ordered event structure, check whether there are missing or duplicate events on the timeline, and, if an anomaly is detected, fill in the missing data using interpolation methods to obtain a complete sequence record; Step 5: Based on the complete sequence record, analyze the distribution of time intervals between events to determine whether there are any abnormal intervals, and, if an interval exceeds the preset threshold, mark it as a node to be processed; Step 6: Obtain the data of the node to be processed, correct it by combining it with the context event information, and adjust the abnormal interval using a smoothing method to obtain the final ordered event list; Step 7: Generate a structured time-series data file from the final ordered event list, store it in the designated database, and complete the data integration process.
4. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that determining a direct coupling and establishing a preliminary causal chain if the electrical connection strength between adjacent events in the ordered event list exceeds a preset threshold includes: obtaining the timestamps and electrical parameters of all events in the ordered event list; for each pair of adjacent events, calculating the electrical connection strength between them; if the electrical connection strength is greater than the preset threshold, determining that the adjacent event pair has a direct coupling relationship; connecting the adjacent event pairs determined to be directly coupled sequentially in chronological order to obtain the preliminary causal chain; obtaining the electrical connection strength value for each coupling relationship in the preliminary causal chain; traversing the preliminary causal chain and, if there is direct coupling between three consecutive events, marking the chain segment as a strongly correlated path; and extracting key event nodes from the strongly correlated paths to form a simplified causal subchain.
5. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that obtaining a protection logic rule base for the preliminary causal chain and extracting the matching protection device action patterns from it to obtain an extended causal relationship includes: receiving the preliminary causal chain from an external system; identifying the key components in the preliminary causal chain and obtaining all action patterns of the corresponding protection devices from the protection logic rule base; comparing each obtained protection device action pattern with the causal descriptions in the preliminary causal chain using an exact string matching method; if the match is successful, adding the triggering condition and consequence event corresponding to the action pattern to the end of the causal chain to obtain a first extended causal chain; based on the newly added consequence events in the first extended causal chain, retrieving the associated next-level protection device action patterns from the protection logic rule base; determining through pattern matching whether a newly retrieved action pattern forms a causal connection with the end event of the first extended causal chain, and, if the connection is established, appending the action pattern as the next event to obtain a second extended causal chain; and performing a loop extension judgment on the second extended causal chain, and, when there is no matching next-level action pattern in the protection logic rule base, outputting the final extended causal relationship.
6. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that, The process involves extending causal relationships, constructing a fault propagation network using a graph model, determining the propagation paths between network nodes, and obtaining a complete event chain, including: Obtain system operation logs and alarm records; Initial causal relationship pairs are extracted using a pre-established causal knowledge base; A graph model is used to transform causal pairs into a directed graph structure, where nodes represent fault events and directed edges represent the direction of causal propagation. For each node in the graph model, expand its outgoing edges to obtain the set of directly downstream faulty nodes; Determine if each downstream node has a known triggering condition; if so, add the downstream node to the propagation path sequence. Obtain the last node in the current propagation path sequence, and continue to expand its outgoing edges according to the graph model to obtain the next layer of fault nodes; Repeat the expansion and judgment process until the current node no longer generates new downstream nodes or reaches the maximum propagation level preset by the graph model. Traverse all propagation path sequences initiated by the initial fault nodes and merge path segments that have an inclusion relationship; The most frequent complete chain sequence among all propagation path sequences is identified to obtain the main failure event chain.
7. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that, If a delayed event exists in the complete event chain, the key turning point is determined by comparing the delay duration with a unified timeline, including: Retrieve the complete sequence of action events in the full event chain; Check each element of the sequence for any action delay events; If there are action delay events, extract the delay duration of each delay event; All event timestamps are aligned and mapped using a unified timeline; Place the delay duration onto the corresponding coordinates of the unified timeline based on the aligned timestamp position; The time axis coordinate comparison method is used to determine which delay durations fall within the critical intervals; If the delay time falls within the critical interval, the corresponding position is determined to be a critical turning point; After obtaining the list of key turning points, output them in chronological order.
8. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that acquiring historical fault data based on the key inflection points, analyzing the disturbance and instability modes before and after the inflection points using a causal reasoning algorithm, and obtaining the root cause classification results comprises: Step 1: obtaining historical fault data, performing preliminary screening based on the key inflection points, and extracting the relevant records within the time period before and after each inflection point to obtain an inflection point association dataset; Step 2: based on the inflection point association dataset, analyzing the disturbance and instability modes in the data before and after the inflection point, quantifying the mode changes using statistical tools, and determining a feature set of mode changes; Step 3: for the feature set of mode changes, applying the causal reasoning algorithm to infer the correlation between the disturbance modes and the instability modes, and determining the causal relationship between the modes to obtain a causal relationship network; Step 4: if significant correlation paths exist in the causal relationship network, extracting the core nodes on those paths and comparing them with fault mode data to determine a set of potential root causes; Step 5: based on the set of potential root causes and the results of the inflection point analysis, classifying the root causes and grouping them according to preset classification rules to obtain preliminary classification results; Step 6: based on the preliminary classification results, verifying the consistency of the classification using data analysis methods and, if the verification finds a classification deviation, readjusting the classification boundaries to obtain the final root cause classification results; Step 7: based on the final root cause classification results, generating structured data records and storing them in a preset database for subsequent business queries and analysis.
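Steps 1, 2, and 5 of claim 8 can be illustrated with a short Python sketch. It covers only the windowing around an inflection point, a statistical quantification of the mode change, and a rule-based grouping; the causal reasoning algorithm of Steps 3 and 4 is not specified in the claim and is deliberately left out. The chosen statistics (mean shift and spread ratio), the thresholds, and the class labels are assumptions made for illustration.

```python
from statistics import mean, pstdev

def window(records, t0, half_width):
    """Split (time, value) records into the spans before and after inflection time t0."""
    before = [v for t, v in records if t0 - half_width <= t < t0]
    after = [v for t, v in records if t0 <= t <= t0 + half_width]
    return before, after

def mode_change_features(before, after):
    """Quantify the change across the inflection point (assumed feature set)."""
    spread_before = pstdev(before)
    return {
        "mean_shift": mean(after) - mean(before),
        "spread_ratio": (pstdev(after) / spread_before
                         if spread_before > 0 else float("inf")),
    }

def classify(features, shift_limit=5.0, spread_limit=2.0):
    """Group by preset rules; thresholds and labels are hypothetical."""
    if features["spread_ratio"] >= spread_limit:
        return "oscillation growth"
    if abs(features["mean_shift"]) >= shift_limit:
        return "sustained deviation"
    return "no significant change"

# Hypothetical historical records: (time in s, frequency deviation in mHz).
records = [(0, 10), (1, -10), (2, 10), (3, -10),
           (4, 40), (5, -20), (6, 40), (7, -20)]

before, after = window(records, t0=4, half_width=4)
features = mode_change_features(before, after)
label = classify(features)
print(features, label)
```

A real system would replace `classify` with the preset classification rules of Step 5 and would follow it with the consistency verification of Step 6.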
9. The knowledge graph-driven power grid fault root cause analysis system and method according to claim 1, characterized in that generating the evolution path report based on the root cause classification results, identifying missing links in the report, and obtaining supplementary preventive measure recommendations comprises: organizing the root cause classification results and extracting key category information from them to form an initial classification dataset; based on the classification dataset, using a pre-established path construction model to generate corresponding evolution path data and determine the main nodes in the path; if the connections between nodes in the generated evolution path data are interrupted, obtaining the missing node information by comparison with historical data to obtain the complete path structure; for the complete path structure, analyzing the logical relationships between the nodes, identifying the missing links that are not covered, and determining potential risk points; starting from the identified missing links and combining the risk point data, constructing a preliminary plan of preventive measures and determining targeted response strategies; obtaining the strategy content from the preliminary plan, matching it against a historical case database, and proposing specific supplementary suggestions to obtain the final list of measures; and using the final list of measures to generate an analysis report that includes the evolution paths and prevention strategies, ensuring the completeness of the report content.
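The path-completion step of claim 9 can be sketched in Python as follows. This is one possible reading, not the claimed method: an "interruption" is assumed to be a pair of consecutive nodes with no known causal edge, historical data is assumed to be a list of previously observed paths, and a "missing link" is assumed to be an interruption that history cannot fill. The function and event names are hypothetical.

```python
def find_segment(a, b, historical_paths):
    """Return the nodes that history places between a and b, or None if unseen."""
    for past in historical_paths:
        if a in past and b in past:
            i, j = past.index(a), past.index(b)
            if i < j:
                return list(past[i + 1:j])
    return None

def complete_path(path, known_edges, historical_paths):
    """Fill interrupted connections from history and report the gaps that remain."""
    completed = [path[0]]
    missing_links = []
    for a, b in zip(path, path[1:]):
        if (a, b) not in known_edges:          # connection is interrupted
            filler = find_segment(a, b, historical_paths)
            if filler is None:
                missing_links.append((a, b))   # not covered: potential risk point
            else:
                completed.extend(filler)       # supplement the missing nodes
        completed.append(b)
    return completed, missing_links

# Hypothetical evolution path, causal edges, and historical record.
path = ["ground_fault", "breaker_failure", "bus_outage",
        "load_loss", "frequency_drop"]
known_edges = {("ground_fault", "breaker_failure"),
               ("bus_outage", "load_loss")}
historical_paths = [
    ["ground_fault", "breaker_failure", "backup_trip", "bus_outage"],
]

completed, missing_links = complete_path(path, known_edges, historical_paths)
print(completed)
print(missing_links)
```

The returned `missing_links` list is the natural input for the later steps of the claim, where each uncovered link is matched against a historical case database to propose preventive measures.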