Operation and maintenance event analysis method, device, and storage medium based on root cause aggregation

By combining multi-dimensional alarm aggregation with root cause analysis, the method closes the gap between alarm aggregation and root cause analysis. It enables efficient operation and maintenance event analysis, generates clear views of the causal chain and impact scope, and supports operation and maintenance decision-making and knowledge accumulation.

CN122247834A · Pending · Publication Date: 2026-06-19 · CHANJET INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHANJET INFORMATION TECH CO LTD
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing alarm aggregation methods struggle to accurately capture dynamically changing system dependencies, leading to missed aggregations and false aggregations. Root cause analysis is disconnected from event aggregation, so the shared root cause of multiple derivative events cannot be identified automatically, and operation and maintenance personnel find it difficult to grasp the overall picture of a fault.

Method used

Raw alarm information and topology metadata are collected in real time from monitoring tools, the CMDB, and call chain systems. Multi-dimensional aggregation analysis based on time, topology, and text semantics, using a dynamic exponential backoff algorithm and natural language processing, generates preliminary events. Root cause analysis then aggregates the preliminary events into main events, which are output as a tree structure or hierarchical list.

Benefits of technology

It significantly improves the accuracy of alarm aggregation, enables global root cause localization, accurately locates core issues, provides clear output results, reduces the amount of information interference for operation and maintenance personnel, and forms a knowledge base for operation and maintenance, providing data support for fault prediction and intelligent decision-making.


Abstract

This invention proposes a method, device, and storage medium for operation and maintenance event analysis based on root cause aggregation, relating to the field of system performance maintenance technology. The method includes: real-time acquisition of raw alarm information and related topology metadata, forming an operation and maintenance event set; real-time aggregation analysis of the operation and maintenance event set to generate multiple preliminary events; independent root cause analysis of each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event; aggregation analysis of multiple preliminary events based on the root cause of each preliminary event to aggregate each preliminary event into the corresponding root cause's main event; and visualization of the main events in the form of a tree structure or hierarchical list. This invention integrates root cause analysis and clustering, significantly improving the accuracy of alarm aggregation, achieving global root cause localization, and accurately locating core issues.

Description

Technical Field

[0001] This invention relates to the field of system performance maintenance technology, specifically to a method, apparatus, and storage medium for operation and maintenance event analysis based on root cause aggregation.

Background Technology

[0002] With the widespread adoption of cloud-native and microservice architectures, the complexity of enterprise IT systems has increased dramatically, leading to a massive influx of alerts from monitoring systems and overwhelming operations and maintenance personnel. Existing technologies suffer from the following main shortcomings:

Limitations of alarm aggregation techniques: Existing alarm aggregation methods mostly rely on fixed time windows, simple text matching, or pre-configured static rules. These methods struggle to accurately capture dynamically changing system dependencies, easily leading to missed aggregations (failure to aggregate related alarms with the same root cause) and incorrect aggregations (mistakenly merging unrelated alarms), causing subsequent analysis to be based on flawed information.

[0003] Root cause analysis is disconnected from event aggregation: Traditional root cause analysis tools typically perform isolated analysis on individual alarms or aggregated events. When large-scale failures occur, the analysis results are often fragmented, failing to automatically identify and present multiple derivative events caused by a single root cause, making it difficult for operations and maintenance personnel to grasp the full picture of the failure from a global perspective.

Summary of the Invention

[0004] In view of one or more technical defects in the prior art, the present invention proposes the following technical solution.

[0005] A method for analyzing operation and maintenance events based on root cause aggregation, the method comprising: a collection step of collecting, in real time, raw alarm information and related topology metadata from the system's monitoring tools, configuration management database (CMDB), and call chain system, and combining the raw alarm information and related topology metadata into a set of operation and maintenance events; a first aggregation step of performing real-time aggregation analysis on the set of operation and maintenance events to generate multiple preliminary events; a root cause analysis step of performing an independent root cause analysis on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event; a second aggregation step of performing aggregation analysis on the multiple preliminary events based on the root cause of each preliminary event, so as to aggregate each preliminary event into the main event of the corresponding root cause; and an output step of visualizing the main events in the form of a tree structure or a hierarchical list.

[0006] The output clearly reveals the complete causal chain and scope of impact in the form "root cause → affected preliminary event → original alarm".

[0007] Furthermore, the first aggregation step involves: aggregating the operation and maintenance event set based on the time dimension, using a dynamic exponential backoff algorithm to determine the aggregation time window, capturing and aggregating original alarm information with temporal correlation to obtain first aggregated data; aggregating the operation and maintenance event set based on the topology dimension, identifying the dependencies between alarm objects based on CMDB and call chain data, and aggregating the original alarm information based on topology metadata to obtain second aggregated data; aggregating the operation and maintenance event set based on text semantics, using natural language processing technology to calculate the semantic similarity of the text of the original alarm information, and merging the original alarm information with semantic similarity greater than a first threshold to obtain third aggregated data; and the aggregation decision engine comprehensively adjudicating the first, second, and third aggregated data according to a preset strategy to generate multiple preliminary events, each of which contains a list of alarm information determined to be related and its preliminary assessed impact range.

[0008] Furthermore, the aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy to generate multiple preliminary events, as follows: the intersection of the first, second, and third aggregated data is taken as the first preliminary event set; the pairwise intersections of the first, second, and third aggregated data are taken as the second, third, and fourth preliminary event sets; the first preliminary event set is subtracted from the second, third, and fourth preliminary event sets respectively to obtain the fifth, sixth, and seventh preliminary event sets; from the fifth, sixth, and seventh preliminary event sets, the events whose corresponding original data have a topological dependency relationship and whose corresponding original alarm text has a semantic similarity greater than a second threshold are selected as the eighth preliminary event set; and the union of the first and eighth preliminary event sets is calculated, the events in the union being used as the multiple preliminary events.

[0009] Furthermore, the root cause analysis step is performed as follows: First, query the change management system to perform change correlation analysis, checking whether any relevant configuration item was changed before the event corresponding to the original alarm information occurred. If so, the changed configuration item is taken as the root cause of the preliminary event. If not, combine performance indicators and error logs, and trace the topology dependency relationship based on the topology metadata to determine the source of the original alarm information as the root cause of the preliminary event.

[0010] Furthermore, the second aggregation step is as follows: based on the root cause generated by each preliminary event, search for a main event with the same root cause in the main event set. If the search result shows that a corresponding main event exists, then associate the preliminary event with the main event. If the search result is empty, then determine whether there is another preliminary event with the same root cause as the preliminary event. If so, then create a new main event and associate the preliminary event and the other preliminary event together with the new main event.

[0011] Furthermore, where the root cause of a preliminary event points to an underlying fault entity or fault mode, the operation and maintenance event set is searched for other original alarm information with the same underlying fault entity or the same fault mode, and the events corresponding to that other original alarm information are associated with the main event that has the same root cause.

[0012] This invention also proposes an operation and maintenance event analysis device based on root cause aggregation, the device comprising: a data acquisition unit that collects, in real time, raw alarm information and related topology metadata from the system's monitoring tools, configuration management database (CMDB), and call chain system, and combines the raw alarm information and related topology metadata into a set of operation and maintenance events; a first aggregation unit that performs real-time aggregation analysis on the set of operation and maintenance events to generate multiple preliminary events; a root cause analysis unit that performs an independent root cause analysis on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event; a second aggregation unit that performs aggregation analysis on the multiple preliminary events based on the root cause of each preliminary event, so as to aggregate each preliminary event into the main event of the corresponding root cause; and an output unit that visualizes the main events in the form of a tree structure or a hierarchical list.

[0013] The output clearly reveals the complete causal chain and scope of impact in the form "root cause → affected preliminary event → original alarm".

[0014] Furthermore, the first aggregation unit operates as follows: It aggregates the operation and maintenance event set based on the time dimension, using a dynamic exponential backoff algorithm to determine the aggregation time window, capturing and aggregating original alarm information with temporal correlation to obtain first aggregated data; it aggregates the operation and maintenance event set based on the topology dimension, identifying dependencies between alarm objects based on CMDB and call chain data, and aggregating the original alarm information based on topology metadata to obtain second aggregated data; it aggregates the operation and maintenance event set based on text semantics, using natural language processing technology to calculate the semantic similarity of the text of the original alarm information, merging original alarm information with semantic similarity greater than a first threshold to obtain third aggregated data; the aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy, generating multiple preliminary events, each preliminary event containing a list of alarm information determined to be related and its preliminary assessed impact range.

[0015] Furthermore, the aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy to generate multiple preliminary events, as follows: the intersection of the first, second, and third aggregated data is taken as the first preliminary event set; the pairwise intersections of the first, second, and third aggregated data are taken as the second, third, and fourth preliminary event sets; the first preliminary event set is subtracted from the second, third, and fourth preliminary event sets respectively to obtain the fifth, sixth, and seventh preliminary event sets; from the fifth, sixth, and seventh preliminary event sets, the events whose corresponding original data have a topological dependency relationship and whose corresponding original alarm text has a semantic similarity greater than a second threshold are selected as the eighth preliminary event set; and the union of the first and eighth preliminary event sets is calculated, the events in the union being used as the multiple preliminary events.

[0016] Furthermore, the root cause analysis unit operates as follows: First, it queries the change management system to perform change correlation analysis, checking whether any relevant configuration item was changed before the event corresponding to the original alarm information occurred. If so, the changed configuration item is taken as the root cause of the preliminary event. If not, it combines performance indicators and error logs, and traces topology dependencies based on the topology metadata to determine the source of the original alarm information as the root cause of the preliminary event.

[0017] Furthermore, the operation of the second aggregation unit is as follows: based on the root cause generated by each preliminary event, search for a main event with the same root cause in the main event set; if the search result shows that a corresponding main event exists, then associate the preliminary event with the main event; if the search result is empty, then determine whether there is another preliminary event with the same root cause as the preliminary event; if so, then create a new main event and associate the preliminary event and the other preliminary event together with the new main event.

[0018] The present invention also proposes a computer-readable storage medium storing computer program code, which, when executed by a computer, performs any of the methods described above.

[0019] The technical effect of this invention is as follows: This invention provides a method, apparatus, and storage medium for operation and maintenance event analysis based on root cause aggregation. The method includes: a collection step S101, collecting raw alarm information and related topology metadata from the system's monitoring tools, configuration management database (CMDB), and call chain system in real time, and assembling the raw alarm information and related topology metadata into an operation and maintenance event set; a first aggregation step S102, performing real-time aggregation analysis on the operation and maintenance event set to generate multiple preliminary events; a root cause analysis step S103, performing independent root cause analysis on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event; a second aggregation step S104, performing aggregation analysis on the multiple preliminary events based on the root cause of each preliminary event to aggregate each preliminary event into the main event corresponding to the root cause; and an output step S105, visually presenting the main events in the form of a tree structure or hierarchical list. Through hierarchical display, it can be displayed in the form of "root cause → affected preliminary event → raw alarm," thereby showing the complete causal chain and scope of impact, facilitating users to determine the cause, fault point, and scope of impact of the problem. This invention integrates root cause analysis and clustering. Specifically, after real-time aggregation analysis of the operational event set to generate multiple preliminary events, independent root cause analysis is performed on each preliminary event to determine its root cause. Then, based on the root cause, aggregation is performed again, aggregating the multiple preliminary events to their corresponding root cause's main event. 
Other unaggregated events are then identified based on the root cause. This allows all events with the same cause to be aggregated into the corresponding root cause's main event, and the results are visualized in a tree structure or hierarchical list. This significantly improves the accuracy of alarm aggregation, enables global root cause localization, and accurately identifies core issues. The output results are clear and highly interpretable.

Attached Figure Description

[0020] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings.

[0021] Figure 1 is a flowchart of an operation and maintenance event analysis method based on root cause aggregation according to an embodiment of the present invention.

[0022] Figure 2 is a structural diagram of an operation and maintenance event analysis device based on root cause aggregation according to an embodiment of the present invention.

Detailed Implementation

[0023] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0024] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0025] Figure 1 illustrates a method for analyzing operation and maintenance events based on root cause aggregation according to the present invention, the method comprising: In the collection step S101, raw alarm information and related topology metadata are collected in real time from the system's monitoring tools, configuration management database (CMDB), and call chain system. The raw alarm information and related topology metadata are then combined into a set of operation and maintenance events. The monitoring tools can be Prometheus, Zabbix, etc., and are used to monitor system performance, for example that of a database system.

[0026] In the first aggregation step S102, the set of operation and maintenance events is aggregated and analyzed in real time to generate multiple preliminary events. In the root cause analysis step S103, an independent root cause analysis is performed on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event. In the second aggregation step S104, aggregation analysis is performed on the multiple preliminary events based on the root cause of each preliminary event, so as to aggregate each preliminary event into the main event of the corresponding root cause. In the output step S105, the main events are visualized in the form of a tree structure or a hierarchical list. Through hierarchical display, the result can be shown as "root cause → affected preliminary event → original alarm", thus displaying the complete causal chain and scope of impact and helping users identify the cause, fault point, and scope of impact of the problem.
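The hierarchical display described for step S105 can be illustrated with a minimal Python sketch. The class names `MainEvent` and `PreliminaryEvent`, their fields, and the indented text layout are assumptions made for illustration; the source only states that the output is a tree structure or hierarchical list.

```python
from dataclasses import dataclass, field


@dataclass
class PreliminaryEvent:
    # Hypothetical container: one first-level aggregation result.
    event_id: str
    alarms: list[str] = field(default_factory=list)


@dataclass
class MainEvent:
    # Hypothetical container: all preliminary events sharing one root cause.
    root_cause: str
    preliminary_events: list[PreliminaryEvent] = field(default_factory=list)


def render_hierarchy(main_events: list[MainEvent]) -> str:
    """Render main events as an indented list:
    root cause -> affected preliminary event -> original alarm."""
    lines = []
    for main in main_events:
        lines.append(f"Root cause: {main.root_cause}")
        for pre in main.preliminary_events:
            lines.append(f"  Preliminary event: {pre.event_id}")
            for alarm in pre.alarms:
                lines.append(f"    Alarm: {alarm}")
    return "\n".join(lines)
```

Each indentation level corresponds to one level of the causal chain, so a reader can move from a root cause down to the individual alarms it explains.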

[0027] The key inventive concept of this invention addresses the deficiencies in the prior art by integrating root cause analysis and clustering. Specifically, after real-time aggregation analysis of the operation and maintenance event set to generate multiple preliminary events, independent root cause analysis is performed on each preliminary event to determine its root cause. Then, aggregation is performed again based on the root cause, aggregating each preliminary event into the corresponding root cause's main event. Other unaggregated events are then identified based on the root cause, allowing all events with the same cause to be aggregated into the corresponding root cause's main event, visualized in a tree structure or hierarchical list. This significantly improves the accuracy of alarm aggregation: by introducing a second level of root cause-based deep aggregation, the "missed aggregation" problem that may occur with the first level of real-time aggregation is effectively solved, ensuring maximum merging of related alarms and greatly reducing the amount of interfering information for operations personnel. It achieves global root cause localization: overcoming the limitations of traditional isolated analysis of individual events, it can automatically identify and integrate multiple fault events caused by the same root cause, helping operations personnel quickly grasp the overall impact of the fault and accurately locate the core problem. The output is clear and highly interpretable: the final generated main event provides a complete "root cause-phenomenon" view with a clear structure, greatly improving the efficiency and accuracy of operation and maintenance decisions.
It also fosters the accumulation of operation and maintenance knowledge: the results of the second-level aggregation (i.e., the mapping relationship between root causes and phenomenon sets) can be automatically recorded and accumulated in a knowledge base, providing data support for future fault prediction, automatic repair, and intelligent decision-making, enabling the operation and maintenance system to continuously evolve. This is one of the key inventive concepts of this invention.

[0028] In one embodiment, the first aggregation step S102 operates as follows: The operation and maintenance event set is aggregated based on the time dimension, and a dynamic exponential backoff algorithm is used to determine the aggregation time window to capture and aggregate original alarm information with temporal correlation to obtain first aggregated data; the operation and maintenance event set is aggregated based on the topology dimension, and based on CMDB and call chain data, the dependencies between alarm objects are identified, and the original alarm information is aggregated based on topology metadata to obtain second aggregated data; the operation and maintenance event set is aggregated based on text semantics, and natural language processing technology is used to calculate the semantic similarity of the text of the original alarm information, and original alarm information with semantic similarity greater than a first threshold is merged to obtain third aggregated data; the aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy, generating multiple preliminary events, each preliminary event containing a list of alarm information determined to be related and its preliminary assessed impact range.
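The source does not spell out how the dynamic exponential backoff algorithm sets the aggregation time window. The Python sketch below shows one plausible reading, under an assumed policy: a batch starts with a short window, each alarm that arrives inside the current window extends the batch and multiplies the window by a factor up to a cap, and a longer gap closes the batch. The function name and the parameters `base`, `factor`, and `max_window` are illustrative, not taken from the patent.

```python
def group_by_backoff_window(timestamps, base=30.0, factor=2.0, max_window=600.0):
    """Group alarm timestamps (in seconds) into time-correlated batches.

    Assumed policy: a new batch starts with a window of `base` seconds.
    An alarm arriving within the current window of the previous alarm joins
    the batch and grows the window by `factor`, capped at `max_window`.
    A larger gap closes the batch and resets the window to `base`.
    """
    groups = []
    current = []
    window = base
    for ts in sorted(timestamps):
        if current and ts - current[-1] <= window:
            current.append(ts)
            window = min(window * factor, max_window)
        else:
            if current:
                groups.append(current)
            current = [ts]
            window = base
    if current:
        groups.append(current)
    return groups
```

With the defaults, alarms at 0, 20, 70, and 180 seconds fall into one batch because the window grows from 30 to 60 to 120 seconds as the burst continues, whereas a fixed 30-second window would have split them.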

[0029] In this invention, a preliminary aggregation operation is first performed based on three dimensions: time dimension, CMDB and call chain data, and text semantics. Then, the output results are comprehensively evaluated to generate a final preliminary event set, thereby ensuring that the preliminary events are related and that events caused by the same reason can be identified in the future. This is another important inventive point of this application.
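For the topology dimension, one simple way to aggregate alarms is to group alarming objects that are connected through dependency edges. The sketch below assumes the CMDB and call chain data have already been reduced to `(caller, callee)` pairs and that only alarming objects are traversed; both are simplifying assumptions, and the function name `group_by_topology` is hypothetical.

```python
from collections import defaultdict


def group_by_topology(alarm_objects, dependencies):
    """Group alarms whose objects are linked by dependency edges.

    `alarm_objects` maps alarm id -> object name. `dependencies` is an
    iterable of (caller, callee) pairs, assumed to be derived from CMDB and
    call chain data. Returns a list of sets of alarm ids.
    """
    neighbours = defaultdict(set)
    for a, b in dependencies:
        neighbours[a].add(b)
        neighbours[b].add(a)

    alarmed = set(alarm_objects.values())
    component_of = {}
    for start in sorted(alarmed):
        if start in component_of:
            continue
        component_of[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in neighbours[node]:
                # Only walk through objects that are themselves alarming.
                if nxt in alarmed and nxt not in component_of:
                    component_of[nxt] = start
                    stack.append(nxt)

    groups = defaultdict(set)
    for alarm_id, obj in alarm_objects.items():
        groups[component_of[obj]].add(alarm_id)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

Restricting the traversal to alarming objects keeps healthy intermediate services from bridging unrelated faults; a real system might relax this, which is a design choice the source leaves open.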

[0030] In one embodiment, the aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy and generates multiple preliminary events as follows: The intersection of the first, second, and third aggregated data is taken as the first preliminary event set. The pairwise intersections of the first, second, and third aggregated data are taken as the second, third, and fourth preliminary event sets. The first preliminary event set is subtracted from the second, third, and fourth preliminary event sets respectively to obtain the fifth, sixth, and seventh preliminary event sets. From the fifth, sixth, and seventh preliminary event sets, the events whose corresponding original data have a topological dependency relationship and whose corresponding original alarm text has a semantic similarity greater than a second threshold are selected as the eighth preliminary event set. The union of the first and eighth preliminary event sets is calculated, and the events in the union are used as the multiple preliminary events.

[0031] In this application, the results from the three dimensions (time, CMDB and call chain data, and text semantics) are processed as follows to ensure that the preliminary events are accurate. Events present in all three output sets are necessarily preliminary events and are designated as the first preliminary event set. Next, the intersection of each pair of output sets is calculated, and the intersection of all three sets is subtracted from it, leaving the events that appear in exactly two of the three output sets. From these remaining events, those that have topological dependencies and whose corresponding original alarm text has a semantic similarity greater than a second threshold are designated as the eighth preliminary event set. The union of the first and eighth preliminary event sets is then calculated, and the events in the union are used as the multiple preliminary events. This method keeps the number of preliminary events from becoming excessive, which preserves the speed of subsequent root cause analysis. This is one of the important inventive points of this application.
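The set operations of the aggregation decision engine can be sketched in Python as follows. The sketch assumes each dimension's output is a set of candidate events, with each event represented as a `frozenset` of alarm ids so that events can be compared across dimensions. The callables `has_dependency` and `similarity` are hypothetical stand-ins for the topology check and the semantic similarity computation, which the source does not define in detail.

```python
def adjudicate(time_events, topo_events, text_events,
               has_dependency, similarity, second_threshold):
    """Combine the three aggregation results into preliminary events.

    Each *_events argument is a set of candidate events (frozensets of alarm
    ids). `has_dependency(event)` and `similarity(event)` are assumed helper
    callables supplied by the caller.
    """
    # First set: events confirmed by all three dimensions.
    first = time_events & topo_events & text_events

    # Second to fourth sets: pairwise intersections.
    second = time_events & topo_events
    third = time_events & text_events
    fourth = topo_events & text_events

    # Fifth to seventh sets: events confirmed by exactly two dimensions.
    fifth, sixth, seventh = second - first, third - first, fourth - first

    # Eighth set: two-dimension events that also pass both extra checks.
    eighth = {
        event
        for event in fifth | sixth | seventh
        if has_dependency(event) and similarity(event) > second_threshold
    }

    return first | eighth
```

Events found by only one dimension are dropped, and events found by two dimensions must pass the stricter dependency and similarity check, which is how the procedure limits the number of preliminary events passed to root cause analysis.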

[0032] In one embodiment, the root cause analysis step S103 operates as follows: First, the change management system is queried to perform change correlation analysis, checking whether any relevant configuration item was changed before the event corresponding to the original alarm information occurred. If so, the changed configuration item is taken as the root cause of the preliminary event. If not, performance metrics and error logs are combined, and topology dependencies are traced based on the topology metadata to determine the source of the original alarm information as the root cause of the preliminary event. This approach overcomes the limitations of traditional isolated analysis of single events, automatically identifying and integrating multiple fault events caused by the same root cause. This helps maintenance personnel quickly grasp the overall impact of the fault and accurately locate the core problem. This is another inventive concept of this application.
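The two-stage decision in step S103 (change correlation first, topology tracing as the fallback) can be sketched as below. Several points are assumptions for illustration: a "relevant" configuration item is taken to be one of the alarming objects, changes are only considered within a `lookback` window before the event, and the topology trace picks an alarming object with no alarming dependency of its own. The use of performance indicators and error logs mentioned in the source is omitted from this sketch.

```python
def find_root_cause(event_objects, event_start, changes, depends_on, lookback=3600.0):
    """Return the assumed root cause of one preliminary event, or None.

    event_objects: set of object names that raised alarms in the event.
    event_start:   event start time in seconds.
    changes:       list of (config_item, change_time) from change management.
    depends_on:    dict mapping an object to the objects it depends on.
    lookback:      assumed window (seconds) in which a change counts.
    """
    # Stage 1: change correlation. Prefer the most recent relevant change.
    recent = [
        (changed_at, item)
        for item, changed_at in changes
        if item in event_objects and event_start - lookback <= changed_at <= event_start
    ]
    if recent:
        return max(recent)[1]

    # Stage 2: topology trace. The source is taken to be an alarming object
    # none of whose own dependencies are alarming.
    sources = [
        obj
        for obj in sorted(event_objects)
        if not any(dep in event_objects for dep in depends_on.get(obj, ()))
    ]
    return sources[0] if sources else None
```

If the alarming objects form a dependency cycle, no source is found and the function returns `None`; how such cases should be resolved is not covered by the source.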

[0033] In one embodiment, the operation of the second aggregation step S104 is as follows: based on the root cause generated by each preliminary event, search for a main event with the same root cause in the main event set; if the search result is that a corresponding main event exists, then associate the preliminary event with the main event; if the search result is empty, then determine whether there is another preliminary event with the same root cause as the preliminary event; if so, then create a new main event and associate the preliminary event and the other preliminary event together with the new main event.
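The lookup-then-create logic of step S104 can be sketched in Python as follows. The source does not say what happens to a preliminary event whose root cause matches neither an existing main event nor another preliminary event; this sketch keeps such events in a `pending` map, which is an assumption. The function name and the plain dictionary representation are also illustrative.

```python
def aggregate_by_root_cause(preliminary, main_events=None):
    """Attach preliminary events to main events that share their root cause.

    preliminary: list of (event_id, root_cause) pairs.
    main_events: optional dict mapping root cause -> list of event ids.
    Returns (main_events, pending), where `pending` maps a root cause to the
    single preliminary event that has not yet found a partner (assumed handling).
    """
    main_events = {cause: list(ids) for cause, ids in (main_events or {}).items()}
    pending = {}
    for event_id, cause in preliminary:
        if cause in main_events:
            # A main event with this root cause exists: associate with it.
            main_events[cause].append(event_id)
        elif cause in pending:
            # Another preliminary event shares the root cause: create a new
            # main event holding both.
            main_events[cause] = [pending.pop(cause), event_id]
        else:
            pending[cause] = event_id
    return main_events, pending
```

For example, given an existing main event for root cause "cfg" and preliminary events P1 and P3 both traced to "db", the call adds a new "db" main event containing P1 and P3, while an event with a unique root cause stays pending.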

[0034] In one embodiment, where the root cause of a preliminary event points to an underlying fault entity or fault mode, the operation and maintenance event set is searched for other original alarm information with the same underlying fault entity or the same fault mode, and the events corresponding to that other original alarm information are associated with the main event that has the same root cause. This process uses the identified root cause to recover events missed in the first aggregation operation, ensuring maximum merging of associated alarms. This is another important inventive point of this application.

[0035] In this invention, by introducing a second-level, root-cause-based deep aggregation, the "missed aggregation" problem that may occur in the first-level real-time aggregation is effectively solved, ensuring the maximum merging of associated alarms and greatly reducing the amount of interfering information for operation and maintenance personnel. The results of the second-level aggregation (i.e., the mapping relationship between root causes and phenomenon sets) can be automatically recorded and stored in the knowledge base, providing data support for future fault prediction, automatic repair, and intelligent decision-making, enabling the operation and maintenance system to have continuous evolution capabilities.

[0036] Figure 2 illustrates an operation and maintenance event analysis device based on root cause aggregation according to the present invention, the device comprising: The acquisition unit 201 collects raw alarm information and related topology metadata from the system's monitoring tools, configuration management database (CMDB), and call chain system in real time, and combines the raw alarm information and related topology metadata into a set of operation and maintenance events. The monitoring tools can be Prometheus, Zabbix, etc., and are used to monitor system performance, for example that of a database system.

[0037] The first aggregation unit 202 performs real-time aggregation analysis on the set of operation and maintenance events to generate multiple preliminary events. The root cause analysis unit 203 performs independent root cause analysis on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event. The second aggregation unit 204 performs aggregation analysis on the multiple preliminary events based on the root cause of each preliminary event, so as to aggregate each preliminary event into the main event of the corresponding root cause. The output unit 205 visualizes the main events in the form of a tree structure or hierarchical list. Through hierarchical display, the result can be shown as "root cause → affected preliminary event → original alarm", thereby displaying the complete causal chain and scope of impact and helping users identify the cause, fault point, and scope of impact of the problem.

[0038] The key inventive concept of this invention is to address the deficiencies of the prior art by integrating root cause analysis with alarm aggregation. Specifically, after real-time aggregation analysis of the operation and maintenance event set generates multiple preliminary events, independent root cause analysis is performed on each preliminary event to determine its root cause. Aggregation is then performed again based on the root cause, so that each preliminary event is aggregated into the main event of the corresponding root cause. Other events not yet aggregated are also identified based on the root cause, so that all events with the same cause are aggregated into the main event of the corresponding root cause, which is visualized as a tree structure or hierarchical list. This brings four benefits. First, it significantly improves the accuracy of alarm aggregation: the second-level, root-cause-based deep aggregation effectively solves the "missed aggregation" problem that may occur in the first-level real-time aggregation, ensuring maximum merging of related alarms and greatly reducing the amount of interfering information presented to operation and maintenance personnel. Second, it achieves global root cause localization: breaking the limitation of traditional approaches that analyze individual events in isolation, it automatically identifies and integrates multiple fault events caused by the same root cause, helping operation and maintenance personnel quickly grasp the overall impact of the fault and accurately locate the core problem. Third, the output is clear and highly interpretable: the final main event provides a complete, clearly structured "root cause-phenomenon" view, greatly improving the efficiency and accuracy of operation and maintenance decisions. Fourth, it fosters the accumulation of operation and maintenance knowledge: the results of the second-level aggregation (i.e., the mapping between root causes and phenomenon sets) can be automatically recorded and accumulated in a knowledge base, providing data support for future fault prediction, automatic repair, and intelligent decision-making, and enabling the operation and maintenance system to evolve continuously. This is one of the key inventive concepts of this invention.

[0039] In one embodiment, the first aggregation unit 202 operates as follows: it aggregates the operation and maintenance event set along the time dimension, using a dynamic exponential backoff algorithm to determine the aggregation time window, and captures and aggregates temporally correlated original alarm information to obtain first aggregated data; it aggregates the operation and maintenance event set along the topology dimension, identifying dependencies between alarm objects based on CMDB and call chain data, and aggregating the original alarm information based on topology metadata to obtain second aggregated data; it aggregates the operation and maintenance event set based on text semantics, using natural language processing technology to calculate the semantic similarity of the text of the original alarm information, and merging original alarm information whose semantic similarity is greater than a first threshold to obtain third aggregated data; an aggregation decision engine then comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy, generating multiple preliminary events, each of which contains a list of alarm information determined to be related and its preliminarily assessed impact range.
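The text does not spell out the dynamic exponential backoff algorithm, so the following is only one plausible reading, offered as a sketch: the aggregation window starts at a base length, is multiplied by a factor each time another alarm arrives before the window expires, is capped at a maximum, and resets once a quiet period closes the group. The function name and the parameters `base`, `factor`, and `max_window` are assumptions, not values from the specification.

```python
from typing import List


def group_by_backoff_window(
    timestamps: List[float],
    base: float = 30.0,        # assumed initial window length in seconds
    factor: float = 2.0,       # assumed growth factor per additional alarm
    max_window: float = 600.0  # assumed upper bound on the window length
) -> List[List[float]]:
    """Group alarm timestamps using a window that grows exponentially
    while alarms keep arriving and resets after a quiet period."""
    groups: List[List[float]] = []
    current: List[float] = []
    window = base
    deadline = 0.0
    for ts in sorted(timestamps):
        if current and ts <= deadline:
            # Alarm arrived inside the open window: keep it and back off.
            current.append(ts)
            window = min(window * factor, max_window)
        else:
            # Window expired (or first alarm): close the group and reset.
            if current:
                groups.append(current)
            current = [ts]
            window = base
        deadline = ts + window
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    print(group_by_backoff_window([0, 10, 50, 200, 1000]))
```

With the default parameters, the alarms at 0, 10, and 50 seconds fall into one group because each arrival extends the window, while the alarms at 200 and 1000 seconds each start a new group.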

[0040] In this invention, a preliminary aggregation operation is first performed along three dimensions: time, topology (CMDB and call chain data), and text semantics. The three outputs are then comprehensively adjudicated to generate the final set of preliminary events, thereby ensuring that the alarms within each preliminary event are genuinely related and that events arising from the same cause can be identified in the subsequent steps. This is another important inventive point of this application.
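To illustrate the text-semantics dimension, the sketch below merges alarm texts whose pairwise similarity exceeds a first threshold. Token-set Jaccard similarity is used here purely as a stand-in for the unspecified natural language processing technique; the function names and the default threshold of 0.5 are assumptions.

```python
import re
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a simple stand-in for semantic similarity."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def merge_similar(texts: List[str], first_threshold: float = 0.5) -> List[List[int]]:
    """Return groups of alarm indices whose texts are transitively similar."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if jaccard(texts[i], texts[j]) > first_threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


if __name__ == "__main__":
    alarms = [
        "db connection timeout on orders",
        "db connection timeout on payments",
        "disk usage above 90 percent",
    ]
    print(merge_similar(alarms))
```

A production system would more likely compare sentence embeddings than token sets; the grouping logic around the similarity function would stay the same.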

[0041] In one embodiment, the aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy to generate multiple preliminary events, as follows. The intersection of the first, second, and third aggregated data is computed, and the result is used as a first preliminary event set. The pairwise intersections of the first, second, and third aggregated data are computed to obtain a second, a third, and a fourth preliminary event set. The first preliminary event set is subtracted from each of the second, third, and fourth preliminary event sets to obtain a fifth, a sixth, and a seventh preliminary event set. From the fifth, sixth, and seventh preliminary event sets, those events whose underlying original data have a topological dependency relationship and whose original alarm text has a semantic similarity greater than a second threshold are selected, and the result is used as an eighth preliminary event set. The union of the first and eighth preliminary event sets is computed, and the events in the union are used as the multiple preliminary events.

[0042] In this application, the results from the three dimensions (time, CMDB and call chain data, and text semantics) are processed as follows to ensure that the preliminary aggregated events are accurate. Events present in all three output sets are necessarily preliminary events and are designated as the first preliminary event set. The intersection of each pair of output sets is then computed, and the three-way intersection is subtracted from each, leaving the events that appear in exactly two of the three output sets. From these, the events that have a topological dependency relationship and whose original alarm text has a semantic similarity greater than a second threshold are designated as the eighth preliminary event set. The union of the first and eighth preliminary event sets is computed, and the events in the union are taken as the multiple preliminary events. This method keeps the number of preliminary events from becoming excessive, thereby preserving the speed of the subsequent root cause analysis. This is one of the important inventive points of this application.
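The set operations described above can be sketched directly with Python sets. This is an illustrative sketch under stated assumptions: a candidate event is modeled as a `frozenset` of alarm ids so that identical groupings from different dimensions compare equal, and the two filters (`has_topology_dependency`, `text_similarity`) are hypothetical callables supplied by the caller, since the specification does not define their interfaces.

```python
from typing import Callable, FrozenSet, Set

Event = FrozenSet[str]  # assumed model: an event is a frozen set of alarm ids


def adjudicate(
    time_set: Set[Event],
    topo_set: Set[Event],
    text_set: Set[Event],
    has_topology_dependency: Callable[[Event], bool],
    text_similarity: Callable[[Event], float],
    second_threshold: float = 0.8,  # assumed value
) -> Set[Event]:
    # First set: events found by all three dimensions.
    first = time_set & topo_set & text_set
    # Second to fourth sets: pairwise intersections.
    second = time_set & topo_set
    third = time_set & text_set
    fourth = topo_set & text_set
    # Fifth to seventh sets: events found by exactly two dimensions.
    fifth, sixth, seventh = second - first, third - first, fourth - first
    # Eighth set: two-dimension events that pass both additional checks.
    eighth = {
        e for e in fifth | sixth | seventh
        if has_topology_dependency(e) and text_similarity(e) > second_threshold
    }
    return first | eighth


if __name__ == "__main__":
    e1, e2, e3, e4 = (frozenset({n}) for n in ("a1", "a2", "a3", "a4"))
    scores = {e2: 0.9, e3: 0.95, e4: 0.5}
    result = adjudicate(
        {e1, e2, e3}, {e1, e2, e4}, {e1, e3, e4},
        has_topology_dependency=lambda e: e != e3,
        text_similarity=lambda e: scores.get(e, 0.0),
    )
    print(sorted(sorted(e) for e in result))
```

In the usage example, `e1` is kept because all three dimensions agree on it, `e2` is kept because it passes both checks, `e3` fails the topology check, and `e4` falls below the similarity threshold.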

[0043] In one embodiment, the root cause analysis unit 203 operates as follows. First, it queries the change management system to perform change correlation analysis, checking whether any related configuration item was changed before the event corresponding to the original alarm information occurred. If so, the changed configuration item is taken as the root cause of the preliminary event. If not, it combines performance indicators and error logs and traces the topology dependencies based on the topology metadata to determine the source of the original alarm information, which is taken as the root cause of the preliminary event. This approach breaks the limitation of traditional approaches that analyze single events in isolation: it automatically identifies and integrates multiple fault events caused by the same root cause, helping operation and maintenance personnel quickly grasp the overall impact of the fault and accurately locate the core problem. This is another inventive concept of this application.
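The two-stage procedure above (change correlation first, topology tracing as the fallback) might look like the following. This is a simplified sketch: the `Change` record, the `lookback` window, the choice of the most recent matching change, and the `unhealthy` set (standing in for whatever the performance indicators and error logs flag) are all assumptions introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Change:
    config_item: str  # configuration item that was changed
    time: float       # time of the change, in seconds


def find_root_cause(
    alarm_objects: Set[str],
    event_start: float,
    changes: List[Change],
    depends_on: Dict[str, List[str]],  # object -> objects it depends on
    unhealthy: Set[str],               # objects flagged by metrics or logs
    lookback: float = 3600.0,          # assumed correlation window
) -> Optional[str]:
    # Step 1: change correlation over the alarm objects and their dependencies.
    related = set(alarm_objects)
    stack = list(alarm_objects)
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, []):
            if dep not in related:
                related.add(dep)
                stack.append(dep)
    recent = [
        c for c in changes
        if c.config_item in related
        and event_start - lookback <= c.time <= event_start
    ]
    if recent:
        return max(recent, key=lambda c: c.time).config_item

    # Step 2: follow unhealthy dependencies until no further one is found.
    sources: Counter = Counter()
    for start in sorted(alarm_objects):
        node, seen = start, {start}
        while True:
            nxt = next(
                (d for d in depends_on.get(node, [])
                 if d in unhealthy and d not in seen),
                None,
            )
            if nxt is None:
                break
            seen.add(nxt)
            node = nxt
        sources[node] += 1
    return sources.most_common(1)[0][0] if sources else None


if __name__ == "__main__":
    topo = {"web": ["api"], "api": ["db"], "db": []}
    print(find_root_cause({"web", "api"}, 100.0, [], topo, {"web", "api", "db"}))
```

In the usage example there is no recent change, so both alarm objects are traced along unhealthy dependencies to `db`, which is returned as the source.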

[0044] In one embodiment, the second aggregation unit 204 operates as follows: based on the root cause determined for each preliminary event, it searches the main event set for a main event with the same root cause; if a corresponding main event is found, the preliminary event is associated with that main event; if the search result is empty, it determines whether another preliminary event has the same root cause as the preliminary event; if so, a new main event is created, and the preliminary event and the other preliminary event are both associated with the new main event.
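The lookup-then-create logic above can be sketched with a dictionary keyed by root cause. The data shapes are assumptions: `root_causes` maps a preliminary event id to its root cause, and `main_events` maps a root cause to the preliminary event ids already attached to it. The text does not say what happens to a preliminary event whose root cause is shared by no other event, so this sketch simply returns such events separately.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def second_level_aggregate(
    root_causes: Dict[str, str],
    main_events: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Attach preliminary events to main events that share their root cause."""
    merged = {rc: list(evs) for rc, evs in main_events.items()}  # do not mutate input
    by_root: Dict[str, List[str]] = defaultdict(list)
    for event_id, rc in root_causes.items():
        by_root[rc].append(event_id)

    unmerged: List[str] = []
    for event_id, rc in root_causes.items():
        if rc in merged:
            # A main event with this root cause exists: associate with it.
            if event_id not in merged[rc]:
                merged[rc].append(event_id)
        elif len(by_root[rc]) > 1:
            # Another preliminary event shares the root cause: create a
            # new main event and associate all of them with it.
            merged[rc] = list(by_root[rc])
        else:
            unmerged.append(event_id)
    return merged, unmerged


if __name__ == "__main__":
    roots = {"p1": "db", "p2": "db", "p3": "cache", "p4": "net"}
    print(second_level_aggregate(roots, {"cache": ["p0"]}))
```

In the usage example, `p3` joins the existing `cache` main event, `p1` and `p2` form a new `db` main event, and `p4` remains unmerged.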

[0045] In one embodiment, the root cause of a preliminary event points to an underlying fault entity or a fault mode. Other original alarm information involving the same underlying fault entity or the same fault mode is searched for within the operation and maintenance event set, and the events corresponding to this other original alarm information are associated with the main event of the same root cause. Through root cause identification, this process recovers events missed in the first aggregation operation, ensuring maximum merging of associated alarms. This is another important inventive point of this application.
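The sweep for alarms missed by the first aggregation could be sketched as a filter over the operation and maintenance event set. The `Alarm` record and its `fault_entity` and `fault_mode` fields are hypothetical; the specification does not say how these attributes are stored on an alarm.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    fault_entity: str  # assumed field: underlying entity the alarm points to
    fault_mode: str    # assumed field: fault mode label


def find_missed_alarms(
    root_entity: str,
    root_mode: str,
    event_set: List[Alarm],
    already_attached: Set[str],
) -> List[Alarm]:
    """Return alarms not yet attached to the main event that share its
    underlying fault entity or its fault mode."""
    return [
        a for a in event_set
        if a.alarm_id not in already_attached
        and (a.fault_entity == root_entity or a.fault_mode == root_mode)
    ]


if __name__ == "__main__":
    alarms = [
        Alarm("a1", "db-01", "disk_full"),
        Alarm("a2", "db-01", "slow_query"),
        Alarm("a3", "cache-02", "disk_full"),
        Alarm("a4", "web-03", "http_5xx"),
    ]
    missed = find_missed_alarms("db-01", "disk_full", alarms, {"a1"})
    print([a.alarm_id for a in missed])
```

The returned alarms would then be associated with the main event for that root cause.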

[0046] In this invention, the introduction of a second-level, root-cause-based deep aggregation effectively solves the "missed aggregation" problem that may occur in the first-level real-time aggregation, ensuring maximum merging of associated alarms and greatly reducing the amount of interfering information presented to operation and maintenance personnel. The results of the second-level aggregation (i.e., the mapping between root causes and phenomenon sets) can be automatically recorded and stored in the knowledge base, providing data support for future fault prediction, automatic repair, and intelligent decision-making, and enabling the operation and maintenance system to evolve continuously.

[0047] One embodiment of the present invention provides a computer storage medium storing a computer program. When the computer program on the computer storage medium is executed by a processor, the above-described method is implemented. The computer storage medium may be a hard disk, DVD, CD, flash memory, or other storage device.

[0048] For ease of description, the above device is described in terms of separate functional units. In implementing this application, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.

[0049] From the above description of the embodiments, those skilled in the art will clearly understand that this application can be implemented by means of software together with the necessary general-purpose hardware platform. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the various embodiments, or in some parts of the embodiments, of this application.

[0050] Finally, it should be noted that the above embodiments are for illustration only and not for limiting the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the present invention without departing from the spirit and scope of the present invention. Any modifications or partial substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A method for operation and maintenance event analysis based on root cause aggregation, characterized in that the method includes: a data collection step of collecting raw alarm information and related topology metadata in real time from the system's monitoring tools, configuration management database (CMDB), and call chain system, and combining the raw alarm information and related topology metadata into a set of operation and maintenance events; a first aggregation step of performing real-time aggregation analysis on the set of operation and maintenance events to generate multiple preliminary events; a root cause analysis step of performing independent root cause analysis on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event; a second aggregation step of performing aggregation analysis on the multiple preliminary events based on the root cause determined for each preliminary event, so as to aggregate each preliminary event into the main event of the corresponding root cause; and an output step of visualizing the main event in the form of a tree structure or a hierarchical list.

2. The method of claim 1, wherein the first aggregation step involves: aggregating the operation and maintenance event set along the time dimension, using a dynamic exponential backoff algorithm to determine the aggregation time window so as to capture temporally correlated original alarm information, and aggregating it to obtain first aggregated data; aggregating the operation and maintenance event set along the topology dimension, identifying the dependencies between alarm objects based on CMDB and call chain data, and aggregating the original alarm information based on topology metadata to obtain second aggregated data; aggregating the operation and maintenance event set based on text semantics, using natural language processing technology to calculate the semantic similarity of the original alarm information text, and merging the original alarm information whose semantic similarity is greater than a first threshold to obtain third aggregated data; and comprehensively adjudicating, by an aggregation decision engine, the first, second, and third aggregated data according to a preset strategy to generate multiple preliminary events, each of which contains a list of alarm information determined to be related and its preliminarily assessed impact range.

3. The method of claim 2, wherein the root cause analysis step is as follows: first, the change management system is queried to perform change correlation analysis, checking whether any related configuration item was changed before the event corresponding to the original alarm information occurred; if so, the changed configuration item is taken as the root cause of the preliminary event; if not, performance indicators and error logs are combined, and the topology dependency relationship is traced based on the topology metadata to determine the source of the original alarm information, which is taken as the root cause of the preliminary event.

4. The method of claim 3, wherein the second aggregation step is as follows: based on the root cause determined for each preliminary event, a main event with the same root cause is searched for in the main event set; if a corresponding main event is found, the preliminary event is associated with that main event; if the search result is empty, it is determined whether another preliminary event has the same root cause as the preliminary event; if so, a new main event is created, and the preliminary event and the other preliminary event with the same root cause are both associated with the new main event.

5. The method of claim 4, wherein the root cause of each preliminary event points to an underlying fault entity or a fault mode; other original alarm information involving the same underlying fault entity or the same fault mode is searched for in the operation and maintenance event set, and the events corresponding to the other original alarm information are associated with the main event with the same root cause.

6. An operation and maintenance event analysis device based on root cause aggregation, characterized in that the device includes: an acquisition unit, which collects raw alarm information and related topology metadata in real time from the system's monitoring tools, configuration management database (CMDB), and call chain system, and combines the raw alarm information and related topology metadata into a set of operation and maintenance events; a first aggregation unit, which performs real-time aggregation analysis on the set of operation and maintenance events to generate multiple preliminary events; a root cause analysis unit, which performs independent root cause analysis on each preliminary event generated by the first-level aggregation to determine the root cause of each preliminary event; a second aggregation unit, which performs aggregation analysis on the multiple preliminary events based on the root cause determined for each preliminary event, so as to aggregate each preliminary event into the main event of the corresponding root cause; and an output unit, which visualizes the main event in the form of a tree structure or a hierarchical list.

7. The device according to claim 6, characterized in that the first aggregation unit operates as follows: it aggregates the operation and maintenance event set along the time dimension, using a dynamic exponential backoff algorithm to determine the aggregation time window so as to capture temporally correlated original alarm information, and aggregates it to obtain first aggregated data; it aggregates the operation and maintenance event set along the topology dimension, identifying the dependencies between alarm objects based on CMDB and call chain data, and aggregating the original alarm information based on topology metadata to obtain second aggregated data; it aggregates the operation and maintenance event set based on text semantics, using natural language processing technology to calculate the semantic similarity of the original alarm information text, and merging original alarm information whose semantic similarity is greater than a first threshold to obtain third aggregated data; and an aggregation decision engine comprehensively adjudicates the first, second, and third aggregated data according to a preset strategy, generating multiple preliminary events, each of which contains a list of alarm information determined to be related and its preliminarily assessed impact range.

8. The device according to claim 7, characterized in that the root cause analysis unit operates as follows: first, it queries the change management system to perform change correlation analysis, checking whether any related configuration item was changed before the event corresponding to the original alarm information occurred; if so, the changed configuration item is taken as the root cause of the preliminary event; if not, it combines performance indicators and error logs, and traces the topology dependencies based on the topology metadata to determine the source of the original alarm information, which is taken as the root cause of the preliminary event.

9. The device according to claim 8, characterized in that the second aggregation unit operates as follows: based on the root cause determined for each preliminary event, it searches the main event set for a main event with the same root cause; if a corresponding main event is found, the preliminary event is associated with that main event; if the search result is empty, it determines whether another preliminary event has the same root cause as the preliminary event; if so, a new main event is created, and the preliminary event and the other preliminary event with the same root cause are both associated with the new main event.

10. A computer storage medium, characterized in that the computer storage medium stores a computer program which, when executed by a processor, implements the method described in any one of claims 1-5.