Anomaly cause analysis method and apparatus, device, and storage medium
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2025-11-21
- Publication Date
- 2026-06-18
Abstract
Description
Anomaly cause analysis method, apparatus, device, and storage medium
[0001] This application claims priority to Chinese Patent Application No. 202411853941.X, filed on December 13, 2024, entitled “Method, Apparatus, Device and Storage Medium for Anomaly Cause Analysis”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of network technology, and in particular to methods, apparatus, devices and storage media for anomaly cause analysis.
Background Technology
[0003] Network anomalies refer to a state in which a network is unable to provide normal services or has reduced service quality due to hardware problems, software problems, network attacks, or other reasons.
[0004] In related technologies, a series of predefined anomaly rules are used to identify device operation data reported by network devices. If the device operation data meets the conditions of the anomaly rules, it is marked as a network anomaly. Then, the relationship between the network anomaly and the network device behavior is manually analyzed to determine the root cause of the network anomaly.
[0005] However, the above-mentioned scheme of setting anomaly rules can only analyze the cause of anomalies in the device operation data of a single functional area, and it is difficult to process device operation data across multiple functional areas; moreover, the anomaly rules cannot cover all anomalies, resulting in low accuracy of the anomaly cause analysis results.
Summary of the Invention
[0006] This application provides a method, apparatus, device, and storage medium for anomaly cause analysis, which can process device operation data across domains and improve the accuracy of anomaly cause analysis results.
[0007] To achieve the above objectives, this application adopts the following technical solution:
[0008] In a first aspect, an anomaly cause analysis method is provided. The method includes: identifying abnormal data in the device operation data of each of multiple domains to obtain network anomaly events, where the multiple domains include at least two different functional areas of a network environment; and searching a multi-domain knowledge graph based on these network anomaly events to obtain the anomaly causes that led to the network anomaly events.
[0009] A network anomaly event indicates an abnormal behavior or an abnormal network event in a single domain. The multi-domain knowledge graph includes historical network anomaly events in multiple domains, information on the causes of the historical network anomaly events, and the cascading relationships of historical network anomaly events between different domains.
[0010] The solution provided in this application aggregates historical network anomaly events and their causes from multiple domains into a single multi-domain knowledge graph, and reflects the cascading relationships between historical network anomaly events in different domains within the multi-domain knowledge graph. During anomaly cause analysis, network anomaly events in a single domain can be comprehensively analyzed within this multi-domain knowledge graph to obtain the causes of anomalies involving one or more domains. That is, when analyzing the causes of network anomalies in a particular domain, not only can the cause information within that domain be determined, but also the cause information in other domains, achieving cross-domain anomaly cause analysis. Furthermore, the multi-domain knowledge graph includes various network anomaly events from multiple domains, providing broader coverage, and the correlation between network anomaly events in different domains improves the accuracy of anomaly cause analysis results. When this solution is applied to multi-source, multi-modal scenarios, the device operation data from multiple sources and modalities only needs to be reflected in the multi-domain knowledge graph. This avoids the feature loss, and the resulting inaccurate cause analysis, that merging multi-source, multi-modal data would cause.
[0011] In one possible implementation, nodes in the multi-domain knowledge graph record historical network anomaly events, and an edge between a first node and a second node indicates a cascading relationship between the historical network anomaly events recorded by those two nodes. The first and second nodes can be any nodes in the multi-domain knowledge graph. Searching the multi-domain knowledge graph based on the network anomaly events to obtain the anomaly causes can be implemented as follows: using the node that records each network anomaly event as a starting node, perform a reverse search within the multi-domain knowledge graph to obtain destination nodes; then fuse the cause information corresponding to each destination node to obtain the anomaly cause. Because the reverse search runs over a knowledge graph that spans multiple domains, it collects more comprehensive cause information and avoids the blind spots that arise when anomaly cause analysis relies on isolated information from a single domain. Fusing the cause information obtained from the reverse search yields a more accurate anomaly cause, which supports the development of more effective network anomaly resolution strategies.
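The reverse search and fusion described above can be illustrated with a minimal Python sketch. It is not the implementation claimed in this application: the event names, the `EDGES` and `CAUSES` data, the choice of a breadth-first walk, the reading of "destination node" as an upstream node with no further predecessor, and fusion as an order-preserving union are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy graph: each edge (upstream, downstream) records a cascading relationship.
EDGES = [
    ("network:link_down", "storage:io_timeout"),
    ("storage:io_timeout", "compute:job_failed"),
    ("network:link_down", "compute:service_unreachable"),
]

# Hypothetical cause information attached to some nodes.
CAUSES = {
    "network:link_down": ["optical fiber broken"],
    "storage:io_timeout": ["storage backend unreachable"],
}


def reverse_search(edges, start_nodes):
    """Walk edges backwards from the start nodes.

    Returns the reached nodes that have no predecessor (assumed here to be
    the "destination nodes").
    """
    predecessors = {}
    for upstream, downstream in edges:
        predecessors.setdefault(downstream, []).append(upstream)

    seen = set(start_nodes)
    queue = deque(start_nodes)
    destinations = []
    while queue:
        node = queue.popleft()
        parents = predecessors.get(node, [])
        if not parents and node not in destinations:
            destinations.append(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return destinations


def fuse_causes(causes, destinations):
    """Fuse cause information as an order-preserving union (an assumed fusion rule)."""
    fused = []
    for node in destinations:
        for cause in causes.get(node, []):
            if cause not in fused:
                fused.append(cause)
    return fused
```

In this sketch, starting from the two computing-domain events leads back to a single network-domain node, so the fused cause comes from a different domain than the one in which the anomalies were observed.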
[0012] In another possible implementation, the anomaly cause analysis method provided in this application further includes: if the multi-domain knowledge graph contains no node recording a first network anomaly event, constructing a third node to record the first network anomaly event and connecting the third node to a fourth node, where the first network anomaly event is any one of the network anomaly events. Specifically, the entity that caused the historical network anomaly event indicated by the fourth node is the same as, or has a connection relationship with, the entity that caused the first network anomaly event; or the event correlation between the historical network anomaly event indicated by the fourth node and the first network anomaly event is greater than or equal to a threshold value.
[0013] When no node records the first network anomaly event, constructing a new node promptly fills this information gap and keeps the multi-domain knowledge graph complete. At the same time, connecting the new node to the existing nodes that are related to it attaches the new node to the relevant part of the multi-domain knowledge graph. This allows the multi-domain knowledge graph to adapt to different network anomaly events, enhancing its overall flexibility and adaptability.
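The node-construction rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `attach_event`, the dictionary and set data structures, the edge direction (existing node to new node), the `correlation` callback, and the default threshold of 0.8 are all hypothetical and are not specified in this application.

```python
def attach_event(nodes, edges, event, entity, linked_entities, correlation, threshold=0.8):
    """Add `event` to the graph if it is missing and connect it to qualifying nodes.

    nodes:            dict mapping event name -> entity that caused the event
    edges:            set of (existing_event, new_event) pairs (direction is an assumption)
    linked_entities:  set of frozenset({a, b}) for entities with a connection relationship
    correlation:      callable (existing_event, new_event) -> float (hypothetical scorer)
    Returns the list of existing events the new node was connected to.
    """
    if event in nodes:
        return []  # a node already records this event; nothing to construct

    attached = []
    for other, other_entity in nodes.items():
        same_or_linked = (
            other_entity == entity
            or frozenset((other_entity, entity)) in linked_entities
        )
        if same_or_linked or correlation(other, event) >= threshold:
            edges.add((other, event))
            attached.append(other)

    nodes[event] = entity  # construct the new node after scanning the existing ones
    return attached
```

The two conditions are checked with a logical "or", matching the text: either the causing entities are the same or connected, or the event correlation reaches the threshold.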
[0014] In another possible implementation, the anomaly cause analysis method provided in this application further includes: acquiring historical network anomaly events and cause information for each individual domain in multiple domains; determining the cascading relationships between historical network anomaly events in multiple domains; and generating a multi-domain knowledge graph based on the historical network anomaly events and cause information for each individual domain, as well as the cascading relationships. Determining the cascading relationships between historical network anomaly events in different domains reveals how network anomaly events in different domains influence one another, which makes cross-domain determination of anomaly causes possible.
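As a rough illustration of the three steps above (acquire per-domain events and causes, determine cascading relationships, generate the graph), the following Python sketch assembles a graph from already-collected inputs. The function name `build_graph`, the input shapes, and the example events are assumptions made for illustration only.

```python
def build_graph(domain_reports, cascades):
    """Assemble a simple multi-domain graph.

    domain_reports: {domain: {event: [cause information, ...]}}
    cascades:       [(upstream_event, downstream_event), ...] already determined elsewhere
    """
    nodes = {}
    for domain, events in domain_reports.items():
        for event, causes in events.items():
            nodes[event] = {"domain": domain, "causes": list(causes)}

    # Keep only cascading relationships whose endpoints are both known events.
    edges = [(a, b) for a, b in cascades if a in nodes and b in nodes]
    return {"nodes": nodes, "edges": edges}


# Hypothetical reports from two domains and one cross-domain cascade.
reports = {
    "network": {"communication signal interruption": ["optical fiber broken"]},
    "computing": {"application service failure": []},
}
graph = build_graph(
    reports,
    [("communication signal interruption", "application service failure"),
     ("unknown event", "application service failure")],
)
```

Dropping cascades that reference unknown events is one possible design choice; an implementation could instead create placeholder nodes for them.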
[0015] In another possible implementation, the multiple domains include a storage domain, a computing domain, or a network domain. Determining the cascading relationships between historical network anomaly events across multiple domains can be implemented by acquiring the operational relationships corresponding to the historical network anomaly events of each single domain, and treating historical network anomaly events whose operational relationships are related as having a cascading relationship. The operational relationships include at least one of the following: job scheduling chain relationships, which describe the flow path of historical network anomaly events across different domains; program call relationships, which describe the mutual invocation relationships of historical network anomaly events across different domains; and communication records, which describe the communication relationships of historical network anomaly events across different domains. Analyzing these operational relationships reveals potential correlations between network anomaly events in different domains, thereby continuously improving and enriching the structure of the multi-domain knowledge graph so that it contains more comprehensive information.
[0016] In another possible implementation, treating historical network anomaly events whose operational relationships are related as having a cascading relationship can be implemented as follows: based on job scheduling chain relationships, related historical network anomaly events in the storage domain and the computing domain are treated as having a cascading relationship; based on program call relationships, related historical network anomaly events in the network domain and the storage domain are treated as having a cascading relationship; and based on communication records and job scheduling chain relationships, related historical network anomaly events in the network domain and the computing domain are treated as having a cascading relationship. In other words, according to the characteristics of each operational relationship, network anomaly events in the corresponding domains are selected for potential correlation analysis, which connects network anomaly events across domains.
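The mapping between relationship types and domain pairs described above can be sketched as a small lookup table in Python. The identifiers, the tuple format of a relationship record, and the reading that either a communication record or a job scheduling chain relationship is sufficient to link the network and computing domains are assumptions made for illustration.

```python
# Domain pairs that each relationship type is allowed to link (assumed reading of the text).
ALLOWED = {
    "job_scheduling_chain": {
        frozenset({"storage", "compute"}),
        frozenset({"network", "compute"}),
    },
    "program_call": {frozenset({"network", "storage"})},
    "communication_record": {frozenset({"network", "compute"})},
}


def cascading_pairs(relations, domain_of):
    """Return event pairs treated as cascading.

    relations: iterable of (relationship_type, event_a, event_b) records
    domain_of: dict mapping event -> domain name
    """
    pairs = set()
    for kind, a, b in relations:
        domain_pair = frozenset({domain_of[a], domain_of[b]})
        if domain_pair in ALLOWED.get(kind, set()):
            pairs.add((a, b))
    return pairs
```

A record whose relationship type does not match its domain pair, such as a program call between the network and computing domains, is ignored in this sketch.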
[0017] Another possible implementation is that each node in the multi-domain knowledge graph corresponds to a root cause tree. The root cause tree of the fifth node indicates a set of causal information that led to the historical network anomaly recorded by the fifth node. By fusing one or more sets of causal information corresponding to the destination node, the anomaly cause can be obtained. Specifically, this can be achieved by fusing the root cause trees corresponding to the destination nodes. By fusing the root cause trees corresponding to the destination nodes, the fused root cause tree can locate the anomaly cause more quickly, improving the efficiency of anomaly cause analysis.
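One way to picture root cause tree fusion is a recursive merge of nested dictionaries, shown in the Python sketch below. The nested-dictionary representation, the function name `fuse_trees`, and the example causes are assumptions; this application does not specify the tree format or the fusion rule.

```python
def fuse_trees(trees):
    """Merge root cause trees given as nested dicts {cause: subtree}.

    Branches that share the same cause label are merged recursively;
    branches that appear in only one tree are kept as they are.
    """
    fused = {}
    for tree in trees:
        for cause, subtree in tree.items():
            fused[cause] = fuse_trees([fused.get(cause, {}), subtree])
    return fused


# Hypothetical root cause trees of two destination nodes.
tree_a = {"link down": {"fiber broken": {}}}
tree_b = {"link down": {"port disabled": {}}, "disk full": {}}
merged = fuse_trees([tree_a, tree_b])
```

Merging shared branches means a cause that appears under several destination nodes shows up once in the fused tree, which is one plausible reason a fused tree could be searched faster than the separate trees.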
[0018] In another possible implementation, acquiring historical network anomaly events and cause information for each individual domain in multiple domains can be implemented by receiving the historical network anomaly events and cause information reported by each individual domain. The historical network anomaly events and cause information are reported either in response to historical information collection requests or proactively. With request-based reporting, the historical information collection requests can be customized according to specific analytical needs, for example by controlling the timing and frequency of data collection, so that the historical network anomaly events and cause information can be better utilized. With proactive reporting, historical network anomaly events can be reported in real time, reducing manual intervention and improving automation.
[0019] Another possible implementation involves identifying anomalous data in the device operation data of each of the multiple domains to obtain network anomaly events. Specifically, this can be achieved by identifying anomalous data in the device operation data of each of the multiple domains, resulting in anomalous operation data for each domain. Based on the anomaly type and severity of the anomalous operation data, network anomaly events for each domain are generated. By directly identifying device operation data, potential network anomalies can be detected in time before they occur or worsen. Furthermore, the anomaly type and severity provided by the anomaly data identification can more accurately determine network anomaly events.
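The step of turning abnormal operation data into network anomaly events by anomaly type and severity can be sketched as follows. The metric names, the two-level thresholds, and the `AnomalyEvent` structure are hypothetical; this application does not prescribe a particular identification technique, and a simple threshold check is used here only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnomalyEvent:
    domain: str
    anomaly_type: str
    severity: str


# Hypothetical thresholds: metric -> (warning level, critical level).
THRESHOLDS = {
    "response_time_ms": (200, 1000),
    "packet_loss_pct": (1, 5),
}


def identify_events(domain, samples):
    """Generate one event per metric whose value exceeds a threshold."""
    events = []
    for metric, value in samples.items():
        if metric not in THRESHOLDS:
            continue  # no rule for this metric in the sketch
        warning, critical = THRESHOLDS[metric]
        if value > critical:
            severity = "critical"
        elif value > warning:
            severity = "warning"
        else:
            continue  # value is within the normal range
        events.append(AnomalyEvent(domain, metric, severity))
    return events
```

Each domain would run such a step on its own device operation data, and the resulting events then serve as starting points for the knowledge graph search.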
[0020] In another possible implementation, the anomaly cause analysis method provided in this application further includes: receiving device operation data reported by each of multiple domains. The device operation data reported by each domain is reported either in response to an operation data collection request or proactively. With request-based reporting, the operation data collection request can be customized according to specific analysis needs, so that the device operation data can be better utilized. With proactive reporting, device operation data can be reported in real time, reducing manual intervention and improving automation.
[0021] In a second aspect, an anomaly cause analysis device is provided, which includes an identification module and a search module, wherein:
[0022] The identification module is used to identify abnormal data in the device operation data of each of the multiple domains to obtain network anomaly events. A network anomaly event indicates an abnormal behavior or an abnormal network event in a single domain; the multiple domains include at least two different functional areas in the network environment.
[0023] The search module is used to search in a multi-domain knowledge graph based on network anomaly events to obtain the anomaly causes that led to the network anomaly events. The multi-domain knowledge graph includes historical network anomaly events in multiple domains, information on the causes of historical network anomaly events, and the cascading relationships of historical network anomaly events between different domains.
[0024] One possible implementation involves using nodes in a multi-domain knowledge graph to record historical network anomaly events. An edge between a first node and a second node indicates a cascading relationship between the historical network anomaly events indicated by each node. The first and second nodes can be any nodes in the multi-domain knowledge graph. The search module further performs the following: using the node recording each network anomaly event in the multi-domain knowledge graph as the starting node, it performs a reverse search within the multi-domain knowledge graph to obtain the destination node. Then, it merges one or more sets of cause information corresponding to the destination node to obtain the cause of the anomaly.
[0025] In another possible implementation, the anomaly cause analysis device provided in this application further includes: a construction module. The construction module is used to: if there is no node in the multi-domain knowledge graph for recording the first network anomaly event, construct a third node for recording the first network anomaly event and connect it to the fourth node; the first network anomaly event is any one of the network anomaly events.
[0026] Here, the entity that caused the historical network anomaly event indicated by the fourth node is the same as, or has a connection relationship with, the entity that caused the first network anomaly event; or the correlation between the historical network anomaly event indicated by the fourth node and the first network anomaly event is greater than or equal to a threshold value.
[0027] In another possible implementation, the aforementioned construction module is further used to: acquire historical network anomaly events and cause information for each individual domain across multiple domains; determine the cascading relationships between historical network anomaly events across multiple domains; and generate a multi-domain knowledge graph based on the historical network anomaly events and cause information for each individual domain, as well as the cascading relationships.
[0028] In another possible implementation, the multiple domains include a storage domain, a computing domain, or a network domain. The aforementioned construction module is further used to: obtain the operational relationships corresponding to the historical network anomaly events of each single domain, and treat historical network anomaly events whose operational relationships are related as having a cascading relationship. The operational relationships include at least one of the following: a job scheduling chain relationship describing the flow path of historical network anomaly events across different domains; a program call relationship describing the mutual invocation relationship of historical network anomaly events across different domains; and communication records describing the communication relationship of historical network anomaly events across different domains.
[0029] In another possible implementation, the aforementioned construction module is further used to: based on job scheduling chain relationships, treat related historical network anomaly events in the storage domain and the computing domain as having a cascading relationship; based on program call relationships, treat related historical network anomaly events in the network domain and the storage domain as having a cascading relationship; and based on communication records and job scheduling chain relationships, treat related historical network anomaly events in the network domain and the computing domain as having a cascading relationship.
[0030] Another possible implementation is that each node in the multi-domain knowledge graph corresponds to a root cause tree, and the root cause tree of the fifth node indicates a set of cause information that led to the historical network anomaly recorded by the fifth node; the search module mentioned above is also used to: merge the root cause trees corresponding to the destination nodes to obtain the anomaly cause.
[0031] Another possible implementation is that the aforementioned construction module is also used to: receive historical network anomaly events and cause information reported by each individual domain in multiple domains. The historical network anomaly events and cause information are reported based on historical information collection requests, or proactively reported.
[0032] In another possible implementation, the aforementioned identification module is further used to: identify abnormal data in the device operation data of each of the multiple domains, thereby obtaining abnormal operation data for each domain. Based on the anomaly type and severity of the abnormal operation data, network anomaly events are generated for each domain.
[0033] In another possible implementation, the aforementioned identification module is further configured to: receive device operation data reported by each of the multiple domains. The device operation data reported by each domain is either based on an operation data collection request or is reported proactively.
[0034] For the technical effects of any implementation of the second aspect, refer to the technical effects of the corresponding implementation of the first aspect described above; they are not repeated here.
[0035] In a third aspect, a computer device is provided, comprising a processor and a memory, wherein the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the anomaly cause analysis method as described above.
[0036] In a fourth aspect, a computer-readable storage medium is provided, wherein at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to implement the anomaly cause analysis method as described above.
[0037] In a fifth aspect, a computer program product is provided, comprising a computer program or instructions which, when executed by a processor, implement the anomaly cause analysis method described above.
[0038] In a sixth aspect, embodiments of this application provide a chip system including at least one processor and at least one interface circuit. The at least one interface circuit is used to perform transceiver functions and send instructions to the at least one processor. When the at least one processor executes the instructions, the at least one processor implements the anomaly cause analysis method as described above.
[0039] The solutions provided in the third through sixth aspects above are used to implement the method provided in any implementation of the first aspect, and their specific implementations are not described in detail here. For the technical effects of any implementation of the third through sixth aspects, refer to the technical effects of the corresponding implementation of the first aspect; they are not repeated here.
[0040] It should be noted that any of the possible implementations of any of the above aspects can be combined, provided that the solutions do not contradict each other.
Attached Figure Description
[0041] Figure 1 is a schematic diagram of the execution of a job task provided by an exemplary embodiment;
[0042] Figure 2 is a schematic diagram of the architecture of a computer system provided in an exemplary embodiment of this application;
[0043] Figure 3 is a schematic diagram of the architecture of a multi-domain operation and maintenance device provided in an exemplary embodiment of this application;
[0044] Figure 4 is a flowchart of constructing a multi-domain knowledge graph provided by an exemplary embodiment of this application;
[0045] Figure 5 is a schematic diagram of a multi-domain knowledge graph provided in an exemplary embodiment of this application;
[0046] Figure 6 is a flowchart of an anomaly cause analysis method provided in an exemplary embodiment of this application;
[0047] Figure 7 is a flowchart of an anomaly cause analysis method provided in another exemplary embodiment of this application;
[0048] Figure 8 is a schematic diagram showing the display effect of the abnormal cause provided in an exemplary embodiment of this application;
[0049] Figure 9 is a schematic diagram of anomaly cause analysis provided by an exemplary embodiment of this application;
[0050] Figure 10 is a schematic diagram of an anomaly cause analysis method provided in another exemplary embodiment of this application;
[0051] Figure 11 is a schematic diagram of the structure of an anomaly cause analysis device provided in an exemplary embodiment of this application;
[0052] Figure 12 is a schematic diagram of the structure of a computer device provided in an exemplary embodiment of this application.
Detailed Implementation
[0053] In the embodiments of this application, to describe the technical solutions clearly, the terms "first" and "second" are used to distinguish identical or similar items that have essentially the same function and effect. Those skilled in the art will understand that the terms "first" and "second" do not limit quantity or execution order, and that items labeled "first" and "second" are not necessarily different. The technical features described by "first" and "second" have no order of sequence or magnitude.
[0054] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a specific manner to facilitate understanding.
[0055] In the embodiments of this application, "at least one" can also be described as "one or more", and "multiple" can be two, three, four, or more; this application imposes no restriction in this regard.
[0056] Furthermore, the network architecture and scenarios described in the embodiments of this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. As those skilled in the art will know, with the evolution of network architecture and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
[0057] It should be noted that all information (including but not limited to device information, personal information of the target, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application have been authorized by the target or fully authorized by all parties, and the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the device operation data, historical network anomaly events, and cause information involved in this application were all obtained with full authorization.
[0058] For example, the industry-standard solutions for analyzing the causes of anomalies include the following two methods, which are briefly described below.
[0059] Method 1: For device operation data in each domain, set specific anomaly rules. These rules are typically based on experience and define the conditions under which device operation data can be considered abnormal. Analyze the device operation data in a single domain based on these rules. If the device operation data meets the conditions of the anomaly rule, it is marked as a network anomaly. For example, a preset anomaly rule might be: "If the server response time exceeds 200 milliseconds, it is considered abnormal." Obtain server device operation data; when a server's response time exceeds this threshold, a network anomaly event is generated according to the preset rule. Then, manually analyze the relationship between the network anomaly event and server behavior to determine the root cause of the server response time delay.
[0060] Figure 1 illustrates the execution of a task. As shown in Figure 1(a), when task 10 begins, the computer checks the corresponding operating environment 20, specifically the network element data in the software stack, computing domain, network domain, and storage domain. A network element is a hardware or software entity that performs a specific function in the network. Targeted anomaly rules are manually configured for the network elements 30 to monitor their operation. Simultaneously with the check of the operating environment 20, task 40 is started. If the network element data is abnormal, i.e., the operating environment 20 is abnormal, task 50 is interrupted, and anomaly detection 60, anomaly delimitation 70, and manual isolation 80 are performed. After resolving the operating environment anomaly, task 90 is resumed.
[0061] Specifically, after the computer equipment determines that the operating environment 20 is abnormal and interrupts the task 50, it identifies the network anomaly event that caused the abnormality of the operating environment 20 through anomaly detection 60. After identifying the network anomaly event through anomaly detection 60, anomaly delimitation 70 is performed to determine the location and scope of the network anomaly event. Subsequently, technicians take physical or logical actions to isolate the anomaly location.
[0062] Optionally, physical operations include: replacing malfunctioning network element devices, reconnecting cables, etc. Logical operations include: reconfiguring network element devices, disabling network element device ports, etc.
[0063] Figure 1(b) illustrates the results of anomaly cause analysis. The industry-standard anomaly cause analysis scheme covers most network anomaly events, but unidentified network anomaly events (i.e., unknown anomalies) still account for approximately 22%. For the identified network anomaly events, anomaly cause analysis is performed, and approximately 9% of the anomaly analysis results are incorrect.
[0064] Figure 1(c) illustrates the anomaly cause analysis process. In conventional solutions, a timeout detection mechanism is used during the anomaly detection process. For example, a 30-minute timeout detection means that if no update or signal is received from a process in a task within 30 minutes, the process may be considered abnormal. After a process has been abnormal for 30 minutes, a 30-minute continuation training is initiated; that is, another process is started or a notification is issued after 30 minutes. After identifying a network anomaly, anomaly cause analysis is required, and the computer equipment is controlled to execute recovery decisions. Figure 1(c) shows a table corresponding to some anomaly cause analysis methods and their recovery decisions.
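The timeout detection mechanism described above amounts to comparing the time since a process last reported against a fixed window. The Python sketch below shows that check; the function name `timed_out` and the use of `datetime` values for the last update are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def timed_out(last_update, now, timeout=timedelta(minutes=30)):
    """Return True if no update has been received within the timeout window."""
    return now - last_update >= timeout


# Hypothetical timestamps for a process in a task.
last_seen = datetime(2025, 1, 1, 12, 0)
print(timed_out(last_seen, datetime(2025, 1, 1, 12, 29)))  # still inside the window
print(timed_out(last_seen, datetime(2025, 1, 1, 12, 30)))  # window has elapsed
```

The sketch also makes the drawback visible: an anomaly that occurs right after an update cannot be flagged until the full 30-minute window has passed.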
[0065] However, the above method can only analyze the causes of anomalies in device operation data within a single functional area, making it difficult to process device operation data across multiple functional areas. Furthermore, the anomaly rules cannot cover all anomalies, resulting in low accuracy in the anomaly cause analysis results. In addition, the anomaly detection process requires a 30-minute wait, making it time-consuming and inefficient.
[0066] Method 2: Merge device operation data from different domains and analyze the merged data to determine the cause of the anomaly. While theoretically feasible, this method requires encoding the device operation data to merge it, which can lead to the loss of key features and lower accuracy in anomaly analysis. Therefore, it is rarely used in Artificial Intelligence for IT Operations (AIOps) scenarios.
[0067] Therefore, the two methods for analyzing the causes of anomalies described above have low accuracy and low efficiency.
[0068] Based on this, this application provides an anomaly cause analysis method. It identifies anomaly data in the device operation data of each domain within at least two different functional areas in a network environment to obtain network anomaly events. Based on these network anomaly events, a search is performed in a multi-domain knowledge graph to obtain the anomaly causes leading to the network anomaly events. The multi-domain knowledge graph includes historical network anomaly events from multiple domains, information on the causes of these historical anomaly events, and the cascading relationships between historical network anomaly events across different domains. During anomaly cause analysis, network anomaly events from a single domain can be comprehensively analyzed within this multi-domain knowledge graph to obtain anomaly causes involving one or more domains, achieving cross-domain anomaly cause analysis. Furthermore, the multi-domain knowledge graph includes various anomaly events from multiple domains, providing broader coverage, and the correlation between network anomaly events across different domains improves the accuracy of the anomaly cause analysis results.
[0069] The solutions provided by the embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0070] Figure 2 shows a schematic diagram of the computer system architecture. The computer system includes a multi-domain operation and maintenance device 203 and a multi-domain network. The multi-domain network includes at least two different functional areas in the network environment, such as a computing domain 201 and a network domain 202. A domain can consist of one or more network element devices. The multi-domain operation and maintenance device 203 is used to monitor and analyze the operation status of network element devices in computing domain 201 and network domain 202, the links between network element devices, and the services running on the network element devices.
[0071] For example, the multi-domain operation and maintenance device 203 can simultaneously monitor and analyze multiple network element devices in multiple domains. For instance, the multi-domain operation and maintenance device 203 monitors the device operating parameters of network element devices in multiple domains. It identifies a network anomaly event "communication signal interruption" in the device operating parameters of network domain 202, and simultaneously identifies a network anomaly event "application service failure" in the device operating parameters of computing domain 201. The multi-domain operation and maintenance device 203 analyzes the causes of the identified network anomalies "communication signal interruption" and "application service failure," concluding that the network anomaly event "communication signal interruption" in network domain 202 caused the network anomaly event "application service failure" in computing domain 201. Through further analysis, the multi-domain operation and maintenance device 203 concludes that the cause of the network anomaly event "communication signal interruption" in network domain 202 is that the optical fiber between network element device 1 and network element device 2 in network domain 202 is broken.
[0072] Optionally, the device operation data includes at least one of logs, key performance indicators (KPIs), infrastructure topology information, and alarm information, but is not limited thereto, and the embodiments of this application do not specifically limit this. Infrastructure topology information is a graphical representation of the infrastructure, showing the connections and dependencies between system components (such as servers, network devices, storage, and applications).
[0073] In some embodiments, the multi-domain operation and maintenance device 203 may be a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data.
[0074] Specifically, as shown in Figure 3, the architecture of the multi-domain operation and maintenance device includes a multi-domain business system 301, an anomaly data identification system 302, and an anomaly cause analysis system 303. The following description takes three domains (a computing domain, a storage domain, and a network domain) as an example.
[0075] The multi-domain business system 301 is used to schedule and manage job tasks executed in multiple domains, and at the same time, to monitor and collect device operation data in each domain. For example, the multi-domain business system 301 schedules job tasks executed on network element devices through a job scheduler, and manages job tasks executed on network element devices through a job information manager.
[0076] After device operation data is collected from multiple domains, the anomaly data identification system 302 performs anomaly identification on the device operation data of each domain. Anomaly identification methods include first-error node analysis, whitelist log matching, and unknown fault detection. First-error node analysis involves analyzing device operation data to identify the first node where an anomaly occurred, thus helping to quickly locate the source of the anomaly. Whitelist log matching involves filtering out normal or expected device operation data, leaving only data that does not conform to the whitelist. Unknown fault detection uses a generative pre-trained transformer (GPT) to identify abnormal operation data within the device operation data.
[0077] After abnormal operation data is identified in the device operation data, network anomaly events are generated for each domain based on the anomaly type and severity of the abnormal operation data. Optionally, after the anomaly type and severity of the abnormal operation data are determined, the network anomaly event corresponding to the abnormal operation data can be obtained by calling a large language model (LLM) to search a network anomaly event database. After the network anomaly events are obtained, the anomaly cause analysis system 303 performs anomaly cause analysis on the network anomaly events using the multi-domain knowledge graph to obtain the anomaly causes leading to the network anomaly events. After the anomaly causes are determined, corresponding repair instructions are formulated based on the anomaly causes, and the network element devices in the computing domain, storage domain, or network domain are controlled to execute the repair instructions to keep the device operation data in a normal state.
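As an illustration only, whitelist log matching and first-error node analysis could be sketched as follows. The `LogRecord` type, the `WHITELIST` patterns, and the rule "earliest non-whitelisted record wins" are assumptions made for this sketch and are not specified by this application.

```python
import re
from dataclasses import dataclass


@dataclass
class LogRecord:
    timestamp: float  # seconds since an arbitrary epoch
    node: str         # network element device that emitted the line
    message: str


# Hypothetical whitelist: patterns describing normal, expected log lines.
WHITELIST = [re.compile(r"heartbeat ok"), re.compile(r"job \d+ started")]


def filter_by_whitelist(records, whitelist=WHITELIST):
    """Keep only the records that match no whitelist pattern."""
    return [r for r in records
            if not any(p.search(r.message) for p in whitelist)]


def first_error_node(records):
    """Return the node of the earliest non-whitelisted record, or None."""
    suspicious = filter_by_whitelist(records)
    if not suspicious:
        return None
    return min(suspicious, key=lambda r: r.timestamp).node


records = [
    LogRecord(10.0, "node-a", "heartbeat ok"),
    LogRecord(12.5, "node-b", "link down on port 3"),
    LogRecord(11.2, "node-c", "read timeout"),
    LogRecord(13.0, "node-a", "job 42 started"),
]
print(first_error_node(records))
```

In this sketch the two whitelisted lines are dropped first, and the node with the earliest remaining record is reported as the first-error node.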
[0078] Before performing anomaly analysis on network anomalies in each of multiple domains, a multi-domain knowledge graph needs to be constructed. The multi-domain knowledge graph includes historical network anomalies across multiple domains, information on the causes of these anomalies, and the cascading relationships between historical network anomalies across different domains. Nodes in the multi-domain knowledge graph record historical network anomalies. Edges between the first and second nodes indicate the cascading relationships between the historical network anomalies indicated by each node. The first and second nodes can be any nodes in the multi-domain knowledge graph.
[0079] Figure 4 is a flowchart illustrating the construction of a multi-domain knowledge graph according to an exemplary embodiment of this application. This method can be executed by a computer device. For example, the computer device can be the multi-domain operation and maintenance device 203 illustrated in Figure 2 or the multi-domain operation and maintenance device illustrated in Figure 3. As shown in Figure 4, the method for constructing a multi-domain knowledge graph provided in this embodiment of the application may include:
[0080] Step 401: The computer device obtains historical network anomaly events and their causes.
[0081] A network anomaly event is used to indicate abnormal behavior or abnormal network events within a single domain. Alternatively, a network anomaly event is used to indicate events that cause the network to be unable to provide normal service or to degrade the quality of service.
[0082] Optionally, network anomaly events include network data anomaly events and network failure events. The scope or degree of impact of network data anomaly events and network failure events differs.
[0083] Network data anomalies refer to events that cause unexpected behavior in network services, but do not affect normal user operation. Examples include increased response time in storage systems, increased network traffic, and packet delays.
[0084] A network failure event is an event that causes a complete or partial interruption of network services, affecting normal user experience. Examples include network devices becoming completely unresponsive or network connections being interrupted.
[0085] The cause information includes the reason for the network anomaly event, or the remediation experience recorded for that network anomaly event. For example, the network anomaly event is: the network service is completely unresponsive. The corresponding cause information includes reasons such as damaged network equipment (such as switches, routers, or servers), a faulty network interface, or a damaged or poorly connected cable, as well as remediation experience such as restarting the network service or applying patches.
[0086] In some embodiments, a computer device acquires historical network anomaly events and cause information for each individual domain in multiple domains.
[0087] Optionally, the multiple domains include a storage domain, a computing domain, or a network domain. A storage domain refers to the functional area used for storing and managing data, such as hard disk drives (HDDs), solid-state drives (SSDs), and storage area networks (SANs). A computing domain refers to the functional area that provides data processing and computational capabilities, such as servers, processors, and processor accelerators. A network domain refers to the functional area responsible for data transmission and communication, such as switches, routers, gateways, and firewalls.
[0088] For example, the computer device receives historical network anomaly events and cause information reported by each individual domain in multiple domains; wherein the historical network anomaly events and cause information are reported based on historical information collection requests or are reported proactively.
[0089] For example, a computer device sends a historical information collection request to each of multiple domains. Upon receiving the request, each domain reports historical network anomaly events and their causes to the computer device based on the request. Optionally, the historical information collection request can define the time period corresponding to the reported historical network anomaly events and their causes; for example, it can report historical network anomaly events and their causes from the past year.
[0090] For example, the computer device receives historical network anomaly events and cause information actively reported by each single domain of the multiple domains, or periodically reported by each single domain.
[0091] Step 402: The computer device determines the cascading relationships between historical network anomaly events in multiple domains.
[0092] A cascading relationship refers to the association between historical network anomaly events in the same domain or in different domains.
[0093] Optionally, the cascading relationship includes at least one of the following: causal relationship, temporal relationship, and influence relationship, but is not limited thereto. The embodiments of this application do not specifically limit this.
[0094] A causal relationship refers to the necessary connection between a first historical network anomaly event and a second historical network anomaly event. For example, excessive application usage leads to processor overload, or data loss in the storage system leads to an increase in storage system response events.
[0095] A temporal relationship refers to the chronological order between the first and second historical network anomaly events. For example, the first historical network anomaly event occurred before the second historical network anomaly event.
[0096] An influence relationship refers to the influence of a first historical network anomaly event on a second historical network anomaly event. For example, processor overload affects database access speed.
[0097] In some embodiments, the computer device acquires the operational relationships corresponding to historical network anomaly events for each single domain. Historical network anomaly events that are related in the operational relationships are considered as historical network anomaly events with a cascading relationship.
[0098] Optionally, the operational relationship includes at least one of the following: job scheduling chain relationship, program call relationship, and communication record.
[0099] The job scheduling chain relationship describes the flow path relationship of historical network anomaly events between different domains. The program call relationship describes the mutual call relationship of historical network anomaly events between different domains. The communication record describes the communication relationship of historical network anomaly events between different domains.
[0100] For example, a computer device, based on job scheduling chain relationships, treats historical network anomaly events that are related in the storage domain and the computing domain as historical network anomaly events with cascading relationships.
[0101] For example, the job scheduling chain is as follows: a task on computing node A needs to read data from storage server B. A historical network anomaly event in the storage domain is: storage server B experienced a network outage. A historical network anomaly event in the computing domain is: computing node A was unable to execute the task. Because storage server B experienced a network outage, the task on computing node A was unable to read data, resulting in task execution failure. Therefore, there is a causal relationship between the historical network anomaly events in the computing domain and the storage domain; that is, a cascading relationship exists.
[0102] The computer device, based on program call relationships, treats historical network anomaly events that are related in the network domain and the storage domain as historical network anomaly events with cascading relationships.
[0103] For example, the program call relationship is as follows: the result data of task C needs to be stored in storage server B. The historical network anomaly event in the storage domain is: data storage on storage server B failed. The historical network anomaly event in the network domain is: the network switch malfunctioned. It is evident that the failure of the network switch caused the data storage failure on storage server B. Therefore, there is a causal relationship between the historical network anomaly events in the network domain and the historical network anomaly events in the storage domain; that is, there is a cascading relationship.
[0104] The computer device, based on communication records and job scheduling chain relationships, treats historical network anomaly events that are related in the network domain and the computing domain as historical network anomaly events with cascading relationships.
[0105] For example, the communication records and job scheduling chain relationship are as follows: computing task A runs on server 1 and needs to read data from the database on server 2. The historical network anomaly event in the computing domain is: computing task A failed to read data. The historical network anomaly event in the network domain is: the network switch malfunctioned. The network switch malfunction caused the communication failure between server 1 and server 2. Therefore, there is a causal relationship between the historical network anomaly events in the network domain and the historical network anomaly events in the computing domain; that is, there is a cascading relationship.
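The derivation of cascading relationships from operational relationships in step 402 can be sketched roughly as below. This is a minimal sketch under stated assumptions: the `HistoricalEvent` type, the `depends_on` pair format, and the rule that an event on the provider entity is treated as the upstream event are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalEvent:
    domain: str
    entity: str        # entity on which the event was observed
    description: str


def find_cascades(events, depends_on):
    """Pair up events whose entities are linked by an operational relationship.

    depends_on is a set of (dependent_entity, provider_entity) pairs, for
    example taken from a job scheduling chain. Assumption: an event on the
    provider is treated as the upstream event of the pair.
    """
    by_entity = {}
    for event in events:
        by_entity.setdefault(event.entity, []).append(event)
    cascades = []
    for dependent, provider in sorted(depends_on):
        for cause in by_entity.get(provider, []):
            for effect in by_entity.get(dependent, []):
                cascades.append((cause, effect))
    return cascades


# The job scheduling example: a task on computing node A reads from storage server B.
outage = HistoricalEvent("storage", "storage server B", "network outage")
failed = HistoricalEvent("computing", "computing node A", "unable to execute task")
depends_on = {("computing node A", "storage server B")}

cascades = find_cascades([outage, failed], depends_on)
```

Here `cascades` holds a single pair linking the storage-domain outage to the computing-domain task failure. A fuller version could also check the temporal order of the two events before accepting the pair.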
[0106] Step 403: The computer device generates a multi-domain knowledge graph based on the historical network anomaly events and cause information of each single domain, as well as the cascading relationships.
[0107] For example, when constructing a multi-domain knowledge graph, the nodes that record historical network anomalies in different domains are connected based on the cascading relationship of historical network anomalies between different domains to obtain a multi-domain knowledge graph.
[0108] Figure 5 shows a schematic diagram of the multi-domain knowledge graph. Figure 5 includes the following historical network anomaly events: application service failure and application errors in computing domain 501; switch failure in network domain 502; and storage service unavailability in storage domain 503. Because storage service unavailability in storage domain 503 causes application errors in computing domain 501, there is a cascading relationship between these two events. Therefore, in the multi-domain knowledge graph, the node recording storage service unavailability is connected to the node recording application errors. Similarly, because the switch failure in network domain 502 causes application errors in computing domain 501, there is a cascading relationship between these two events, and the node recording the switch failure is connected to the node recording application errors.
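One possible in-memory representation of such a graph is sketched below, populated with the events of Figure 5. The class and method names, the cause strings, and the label "switch failure" for the network-domain event are assumptions of this sketch; the application does not prescribe a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class EventNode:
    domain: str
    event: str
    causes: list = field(default_factory=list)  # cause information for this event


class MultiDomainKnowledgeGraph:
    def __init__(self):
        self.nodes = {}      # (domain, event) -> EventNode
        self.caused_by = {}  # effect key -> set of upstream (cause) keys

    def add_event(self, domain, event, causes=()):
        key = (domain, event)
        node = self.nodes.setdefault(key, EventNode(domain, event))
        for cause in causes:
            if cause not in node.causes:
                node.causes.append(cause)
        self.caused_by.setdefault(key, set())
        return key

    def add_cascade(self, cause_key, effect_key):
        """Connect two existing nodes with a cascading relationship."""
        if cause_key not in self.nodes or effect_key not in self.nodes:
            raise KeyError("both events must be added before connecting them")
        self.caused_by[effect_key].add(cause_key)


graph = MultiDomainKnowledgeGraph()
app_failure = graph.add_event("computing", "application service failure")
app_errors = graph.add_event("computing", "application errors")
# The cause strings below are hypothetical examples.
switch_failure = graph.add_event("network", "switch failure", ["port fault"])
storage_down = graph.add_event("storage", "storage service unavailable",
                               ["disk hardware failure"])

graph.add_cascade(storage_down, app_errors)
graph.add_cascade(switch_failure, app_errors)
```

Storing edges keyed by the effect makes a later search from an observed event toward its upstream events a direct lookup.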
[0109] In summary, when constructing a multi-domain knowledge graph, by determining the cascading relationships between historical network anomalies in different domains, we can reveal the potential connections between network anomalies in different domains, thereby continuously improving and enriching the structure of the multi-domain knowledge graph and enabling it to contain more comprehensive information.
[0110] The above embodiments describe the construction process of a multi-domain knowledge graph. The following will describe the method for analyzing the causes of anomalies.
[0111] Figure 6 is a flowchart of an anomaly cause analysis method provided in an exemplary embodiment of this application. This method can be executed by a computer device. For example, the computer device can be the multi-domain operation and maintenance device 203 shown in Figure 2 or the multi-domain operation and maintenance device shown in Figure 3. The multi-domain knowledge graph can be the multi-domain knowledge graph constructed by the construction method described in Figure 4 above, or it can be a multi-domain knowledge graph constructed in other ways; this application does not limit this. As shown in Figure 6, the anomaly cause analysis method provided in this embodiment of the application may include:
[0112] Step 602: The computer device identifies abnormal data in the device operation data of each of the multiple domains to obtain network anomaly events.
[0113] The device operation data reflects the operating status, performance, and health status of devices within the functional area of the network environment.
[0114] Anomaly data identification refers to identifying abnormal data in the equipment's operating data.
[0115] Optionally, the methods for identifying abnormal data include: determining whether the device operation data is abnormal by setting a threshold or confidence interval; or, identifying the device operation data by using a pre-trained neural network model, but not limited thereto, and the embodiments of this application do not specifically limit this.
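The threshold-based option can be sketched in a few lines. The function name, the sample values, and the bounds are hypothetical and serve only to illustrate the idea.

```python
def flag_by_threshold(samples, lower, upper):
    """Return (index, value) pairs for samples outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(samples) if not lower <= v <= upper]


# Hypothetical CPU utilization samples (percent) with an assumed upper bound of 90.
cpu_util = [41.0, 39.5, 44.2, 97.8, 40.1]
abnormal = flag_by_threshold(cpu_util, lower=0.0, upper=90.0)
```

With these values only the fourth sample (97.8) is flagged.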
[0116] It can be understood that each of the multiple domains may have zero, one, or more network anomaly events. That is, if the devices in a domain are operating normally, no network anomaly event may be identified in that domain.
[0117] Step 604: The computer device searches the multi-domain knowledge graph based on the network anomaly event to obtain the anomaly cause that led to the network anomaly event.
[0118] The anomaly cause refers to the reason that led to the network anomaly event.
[0119] Optionally, the anomaly cause can be one piece of cause information in the multi-domain knowledge graph or multiple pieces of cause information in the multi-domain knowledge graph. The embodiments of this application do not specifically limit this.
[0120] For example, after identifying a network anomaly event, a search is performed in a multi-domain knowledge graph that includes historical network anomaly events and cause information from multiple domains. If the search meets the search termination condition, one or more sets of cause information corresponding to the last node in the search process are taken as the anomaly cause that led to the network anomaly event.
[0121] Optionally, the search termination conditions include at least one of the following: the search reaches the endpoint in the search path, the search process reaches a preset depth, the computing resources occupied by the search process reach a resource threshold, and the time taken by the search process reaches a time threshold, but are not limited thereto, and the embodiments of this application do not specifically limit this.
[0122] For example, if the search termination condition is "the end point in the search path is found", the search is performed in the multi-domain knowledge graph based on network anomaly events. The search terminates when the current node has no edge, that is, when the current node is the last node on the search path.
[0123] If the search termination condition is "the search process has reached a preset depth", assuming the preset depth is five nodes, the search is performed in the multi-domain knowledge graph based on network anomaly events, and the search terminates when the fifth node in the search path is found.
[0124] If the search termination condition is "the computing resources used in the search process reach the resource threshold", assuming the resource threshold is 50%, the search is performed in the multi-domain knowledge graph based on network anomaly events. The search terminates when the CPU utilization corresponding to the search process reaches 50%.
[0125] If the search termination condition is "the time taken for the search process reaches the time threshold", assuming the time threshold is 30 seconds, the search is performed in the multi-domain knowledge graph based on network anomaly events, and the search terminates after 30 seconds.
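The termination conditions above can be illustrated with a simple walk along one search path. This is a sketch under assumptions: `SearchLimits`, `walk_until_stop`, and the injectable `clock` are hypothetical names, and the resource-threshold condition is omitted because measuring CPU utilization is platform specific.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchLimits:
    max_depth: Optional[int] = None        # number of nodes allowed on the path
    max_seconds: Optional[float] = None    # wall-clock budget for the search


def walk_until_stop(start, next_node, limits, clock=time.monotonic):
    """Follow next_node() from start and return (last_node, stop_reason)."""
    started = clock()
    node, depth = start, 1
    visited = {start}
    while True:
        if limits.max_depth is not None and depth >= limits.max_depth:
            return node, "depth"
        if limits.max_seconds is not None and clock() - started >= limits.max_seconds:
            return node, "time"
        nxt = next_node(node)
        # No further edge (or a cycle): the current node is the endpoint.
        if nxt is None or nxt in visited:
            return node, "endpoint"
        visited.add(nxt)
        node, depth = nxt, depth + 1


# Hypothetical search path e1 -> e2 -> ... -> e7.
chain = {f"e{i}": f"e{i + 1}" for i in range(1, 7)}
print(walk_until_stop("e1", chain.get, SearchLimits()))
print(walk_until_stop("e1", chain.get, SearchLimits(max_depth=5)))
```

Without limits the walk stops at the endpoint `e7`; with a preset depth of five nodes it stops at `e5`, the fifth node on the path.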
[0126] In some embodiments, a specified search scope is determined in response to a search scope selection operation; the computer device searches within the specified search scope in the multi-domain knowledge graph based on the network anomaly event to obtain the anomaly cause that led to the network anomaly event.
[0127] Optionally, the specified search scope may include, but is not limited to, the entire multi-domain knowledge graph, a portion of a domain within the multi-domain knowledge graph, or a specified range within the multi-domain knowledge graph.
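Restricting the search to selected domains could look like the sketch below, assuming nodes are keyed as `(domain, event)` tuples and edges are stored as a mapping from each event to its upstream events. Both conventions, and the event names, are assumptions of this sketch.

```python
def restrict_to_domains(caused_by, allowed_domains):
    """Return a copy of the edge map that only contains nodes in allowed_domains."""
    return {
        node: [u for u in upstream if u[0] in allowed_domains]
        for node, upstream in caused_by.items()
        if node[0] in allowed_domains
    }


caused_by = {
    ("computing", "application errors"): [
        ("storage", "storage service unavailable"),
        ("network", "switch failure"),
    ],
    ("storage", "storage service unavailable"): [],
    ("network", "switch failure"): [],
}

scoped = restrict_to_domains(caused_by, {"computing", "storage"})
```

A search run on `scoped` would never visit the network-domain node, which corresponds to searching only part of the multi-domain knowledge graph.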
[0128] In summary, the solution provided in this embodiment proposes an anomaly cause analysis method. This method aggregates historical network anomaly events and their causes from multiple domains into a single multi-domain knowledge graph, and reflects the cascading relationships between these events. During anomaly cause analysis, network anomaly events from a single domain can be comprehensively analyzed within this multi-domain knowledge graph to obtain the causes of anomalies involving one or more domains, thus achieving cross-domain anomaly cause analysis. Furthermore, the multi-domain knowledge graph includes various anomaly events from multiple domains, providing broader coverage, and the correlation between network anomaly events across different domains improves the accuracy of the anomaly cause analysis results.
[0129] Figure 7 is a flowchart of an anomaly cause analysis method provided in an exemplary embodiment of this application. This method can be executed by a computer device. For example, the computer device can be the multi-domain operation and maintenance device 203 shown in Figure 2 or the multi-domain operation and maintenance device shown in Figure 3. The multi-domain knowledge graph can be the multi-domain knowledge graph constructed by the construction method described in Figure 4 above, or it can be a multi-domain knowledge graph constructed in other ways; this application does not limit this. As shown in Figure 7, the anomaly cause analysis method provided in this embodiment of the application may include:
[0130] Step 701: The computer device acquires device operation data.
[0131] For example, the computer device receives device operation data reported by each of the multiple domains.
[0132] The device operation data reported by each domain is either based on an operation data collection request or is reported proactively.
[0133] For example, a computer device sends a runtime data collection request to each of multiple domains. Upon receiving the runtime data collection request, each domain reports its runtime data to the computer device based on the request. Optionally, the runtime data collection request can specify a time period, such as reporting runtime data for the past hour.
[0134] For example, each domain actively reports device operation data to the computer device, or each domain periodically reports device operation data to the computer device.
[0135] Step 702: The computer device identifies abnormal data in the device operation data of each of the multiple domains to obtain network anomaly events.
[0136] For example, the computer device identifies abnormal data in the device operation data of each of the multiple domains, obtaining abnormal operation data for each domain. Based on the anomaly type and severity of the abnormal operation data, the computer device generates network anomaly events for each domain.
[0137] Abnormal operating data refers to abnormal data in the equipment operating data; or, abnormal operating data refers to data in the equipment operating data that exceeds a set threshold.
[0138] After device operation data is acquired for each domain (e.g., logs, KPIs, infrastructure topology information, alarm information, etc.), anomaly detection algorithms are used to identify abnormal operation data. The abnormal operation data is then analyzed to determine its anomaly type (e.g., abnormal traffic, abnormal latency, abnormal error rate, etc.) and severity (e.g., a tenfold increase in traffic indicates a high severity). Based on the anomaly type and severity, corresponding network anomaly events are generated for each domain. For example, if network traffic increases tenfold at a certain moment, the generated network anomaly event would be: Abnormal Increase in Traffic.
[0139] Optionally, the anomaly detection algorithm may employ the standard deviation method, the box plot method, or a generative adversarial network model to detect abnormal operation data.
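The standard deviation method and the box plot method, followed by event generation from anomaly type and severity, can be sketched with the Python standard library. The multiplier `k=2.0`, the whisker factor `1.5`, the severity cut-offs other than the tenfold example, and the event dictionary layout are assumptions of this sketch.

```python
import statistics


def stddev_outliers(samples, k=2.0):
    """Indices of samples more than k population standard deviations from the mean."""
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    if spread == 0:
        return []
    return [i for i, v in enumerate(samples) if abs(v - mean) > k * spread]


def boxplot_outliers(samples, whisker=1.5):
    """Indices of samples outside the box plot fences Q1 - w*IQR and Q3 + w*IQR."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, v in enumerate(samples) if v < low or v > high]


def make_event(domain, metric, value, baseline):
    """Build a network anomaly event from an abnormal sample (increase only)."""
    ratio = value / baseline
    # Tenfold increase -> high, as in the example above; the 2x cut-off is assumed.
    severity = "high" if ratio >= 10 else "medium" if ratio >= 2 else "low"
    return {"domain": domain,
            "event": f"abnormal increase in {metric}",
            "severity": severity}


# Hypothetical traffic samples with one tenfold spike.
traffic = [100, 102, 98, 101, 99, 1000, 100, 97]
baseline = statistics.median(traffic)
events = [make_event("network", "traffic", traffic[i], baseline)
          for i in boxplot_outliers(traffic)]
```

Both detectors flag only the spike at index 5 for this data, and the resulting event carries high severity because the value is ten times the median baseline.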
[0140] Step 703: The computer device takes the node that records each network anomaly event in the multi-domain knowledge graph as the starting node, and performs a reverse search in the multi-domain knowledge graph to obtain the destination node.
[0141] In this context, a destination node refers to an entity in a multi-domain knowledge graph that serves as the endpoint or target node of a relationship.
[0142] For example, a destination node refers to the endpoint in the reverse search process. Alternatively, a destination node refers to a node without edges found during the reverse search process. Alternatively, a destination node refers to a node found when a predetermined search depth is reached during the reverse search process. Alternatively, a destination node refers to the last node in the search path when the reverse search terminates. Alternatively, a destination node refers to the target node in the search path corresponding to the reverse search process, where the target node can be selected according to actual needs.
[0143] Optionally, if the search termination condition is "the end point in the search path is found", the destination node is the end point on that search path.
[0144] If the search termination condition is "the search process has reached the preset depth", assuming the preset depth is five nodes, the destination node is the fifth node in the search path.
[0145] If the search termination condition is "the computing resources used in the search process reach the resource threshold", assuming the resource threshold is 50%, the destination node is the node found when the CPU utilization corresponding to the search process reaches 50%.
[0146] If the search termination condition is "the time taken for the search process reaches the time threshold", assuming the time threshold is 30 seconds, the destination node is the node found when the search time reaches 30 seconds.
[0147] For example, if there is a node in the multi-domain knowledge graph that records the first network anomaly event, the node that records the first network anomaly event in the multi-domain knowledge graph is taken as the starting node, and a reverse search is performed in the multi-domain knowledge graph. That is, the search is performed in reverse along the relationships in the multi-domain knowledge graph to obtain the destination node.
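A reverse search of this kind can be sketched as a breadth-first traversal from the nodes of the observed events toward their upstream events. The `caused_by` mapping (each event mapped to the events that lead to it), the function name, and the optional depth limit are assumptions made for this sketch.

```python
from collections import deque


def reverse_search(caused_by, start_nodes, max_depth=None):
    """Follow edges from effect to cause and return the destination nodes.

    A node is a destination when it has no upstream edge, or when the
    optional depth limit (counted in nodes, start node = 1) is reached.
    """
    destinations = []
    seen = set(start_nodes)
    queue = deque((node, 1) for node in start_nodes)
    while queue:
        node, depth = queue.popleft()
        upstream = caused_by.get(node, [])
        at_limit = max_depth is not None and depth >= max_depth
        if not upstream or at_limit:
            destinations.append(node)
            continue
        for parent in upstream:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return destinations


# Events from the earlier example: the interruption leads to the service failure.
caused_by = {
    "application service failure": ["communication signal interruption"],
    "communication signal interruption": [],
}
found = reverse_search(
    caused_by,
    ["application service failure", "communication signal interruption"],
)
```

Starting from both observed events, the search converges on the single destination node "communication signal interruption", whose cause information would then be used in the next step.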
[0148] If there is no node in the multi-domain knowledge graph that records the first network anomaly event, a third node is constructed in the multi-domain knowledge graph to record the first network anomaly event and connected to the fourth node; the first network anomaly event is any one of the network anomaly events.
[0149] In this context, the entity that triggered the historical network anomaly event indicated by the fourth node is the same as, or has a connection relationship with, the entity that triggered the first network anomaly event. For example, if the entity that triggered the first network anomaly event is a processor, a node in the multi-domain knowledge graph whose historical network anomaly event was also triggered by the processor entity is selected as the fourth node and connected to the newly constructed third node. Alternatively, if the entity that triggered the first network anomaly event is a processor, a node in the multi-domain knowledge graph whose historical network anomaly event was triggered by the entity "accelerator", which has a connection relationship with the processor, is selected as the fourth node and connected to the newly constructed third node.
[0150] Alternatively, the historical network anomaly event indicated by the fourth node has an event correlation degree with the first network anomaly event that is greater than or equal to a threshold value. If there is no node in the multi-domain knowledge graph that records the first network anomaly event, a third node is constructed in the multi-domain knowledge graph to record the first network anomaly event. By calculating the event correlation degree between the third node and the node containing the historical network anomaly event in the multi-domain knowledge graph, nodes with an event correlation degree greater than or equal to the threshold value are designated as the fourth node and connected to the newly constructed third node.
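The application does not define how the event correlation degree is computed. Purely as an illustration, the sketch below uses word-level Jaccard overlap between event descriptions as a stand-in measure and treats the matching historical nodes as upstream of the new node; the measure, the threshold of 0.5, and the edge direction are all assumptions.

```python
def token_overlap(a, b):
    """Jaccard overlap of the word sets of two event descriptions (stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def attach_event(caused_by, new_event, threshold=0.5):
    """Add new_event as a node and connect it to sufficiently correlated nodes."""
    if new_event in caused_by:
        return caused_by[new_event]
    related = [event for event in caused_by
               if token_overlap(event, new_event) >= threshold]
    caused_by[new_event] = related
    return related


# Hypothetical historical events already recorded in the graph.
caused_by = {
    "processor overload": [],
    "storage service unavailable": [],
    "processor temperature high": [],
}
linked = attach_event(caused_by, "processor overload warning")
```

Only "processor overload" reaches the threshold here (overlap 2/3), so it plays the role of the fourth node for the newly constructed third node.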
[0151] Step 704: The computer device merges one or more sets of cause information corresponding to the destination node to obtain the cause of the anomaly.
[0152] Specifically, each node in the multi-domain knowledge graph records a set of cause information. In step 704, the one or more sets of cause information recorded by the one or more destination nodes obtained in step 703 are fused (merged) to form the above-mentioned anomaly cause.
[0153] Fusion refers to integrating one or more sets of causal information together.
[0154] For example, fusion can be used to remove duplicate or similar cause information from one or more sets of cause information. Optionally, if the cause information also corresponds to a confidence value, fusion can also be used to delete cause information from one or more sets of cause information whose confidence value is lower than a confidence threshold.
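A minimal sketch of such fusion is shown below, assuming each piece of cause information is a `(text, confidence)` pair. Only duplicates that are identical after case and whitespace normalization are removed; detecting merely similar causes would need a similarity measure that this sketch does not attempt. The cause strings and confidence values are hypothetical.

```python
def fuse_causes(cause_sets, confidence_threshold=None):
    """Merge several sets of (text, confidence) causes into one deduplicated list."""
    fused = {}
    for causes in cause_sets:
        for text, confidence in causes:
            if confidence_threshold is not None and confidence < confidence_threshold:
                continue  # drop low-confidence cause information
            key = " ".join(text.lower().split())
            # Keep the highest-confidence copy of each duplicated cause.
            if key not in fused or confidence > fused[key][1]:
                fused[key] = (text, confidence)
    return list(fused.values())


set_a = [("Disk hardware failure", 0.9), ("Network configuration error", 0.4)]
set_b = [("disk hardware  failure", 0.7), ("Fiber broken", 0.8)]
fused = fuse_causes([set_a, set_b], confidence_threshold=0.5)
```

For these inputs the duplicate disk failure is kept once with its higher confidence, and the low-confidence configuration error is removed.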
[0155] When the cause information is represented in the form of a root cause tree, fusion refers to merging multiple root cause trees and removing duplicates of the same or similar nodes in the root cause trees.
[0156] For example, a node records a set of cause information, which can be multi-level cause information, and multi-level cause information can be represented in the form of a root cause tree.
[0157] Optionally, each node in the multi-domain knowledge graph corresponds to a root cause tree. The root cause tree of the fifth node indicates a set of cause information leading to the historical network anomaly event recorded by the fifth node. After a destination node is found, the destination node also corresponds to a root cause tree. The intermediate nodes in the root cause tree are the direct or indirect causes of the network anomaly, and the leaf nodes in the root cause tree are the root causes of the network anomaly.
[0158] For example, if the historical network anomaly event corresponding to the destination node is "data loss", and the destination node corresponds to a root cause tree, then the first-level causes in the root cause tree are: disk performance degradation and storage network congestion. The second-level causes in the root cause tree are: disk hardware failure and network configuration error. Among them, the disk hardware failure node in the root cause tree is a child node of the node in the root cause tree that represents disk performance degradation; the network configuration error node in the root cause tree is a child node of the node in the root cause tree that represents storage network congestion.
[0159] Correspondingly, after obtaining the destination node, the root cause tree corresponding to the destination node can be merged to obtain the cause of the anomaly.
[0160] When the same destination node is found in different domains, the same destination nodes in different domains are merged, and the same cause information in the root cause tree corresponding to the destination node is deduplicated to obtain the final cause of the anomaly.
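The merging of root cause trees can be illustrated with the following sketch. It assumes that a root cause tree is stored as a nested dictionary of the form `{cause: children}` and that identical cause names denote the same node; handling of merely similar nodes is omitted. The first tree reproduces the "data loss" example above, while the second tree, including the cause "insufficient network bandwidth", is hypothetical and only serves to show deduplication.

```python
def merge_trees(*trees):
    """Recursively merge root cause trees given as nested dicts {cause: children}."""
    merged = {}
    for tree in trees:
        for cause, children in tree.items():
            # Identical causes at the same position are merged, not duplicated.
            merged[cause] = merge_trees(merged.get(cause, {}), children)
    return merged


def root_causes(tree):
    """Collect the leaf nodes of a tree, which represent the root causes."""
    leaves = []
    for cause, children in tree.items():
        leaves.extend(root_causes(children) if children else [cause])
    return leaves


# Root cause tree of the "data loss" destination node from the example above.
tree_a = {
    "disk performance degradation": {"disk hardware failure": {}},
    "storage network congestion": {"network configuration error": {}},
}
# Hypothetical tree of the same destination node found in another domain.
tree_b = {
    "storage network congestion": {
        "network configuration error": {},
        "insufficient network bandwidth": {},
    },
}
merged = merge_trees(tree_a, tree_b)
leaves = root_causes(merged)
```

After merging, "storage network congestion" appears once with two children, and the leaves of the merged tree are the candidate root causes.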
[0161] In some embodiments, after the root cause trees are fused to obtain the cause of the anomaly, the final anomaly cause is rendered using topology (Topo) linked rendering technology and displayed on the user interface, thus clearly showing the cause of the anomaly. Figure 8 shows a schematic diagram of the display effect of the anomaly cause. The cause information shown in the figure includes: switch 16, switch 68, compute node 435, compute node 519, and neural processing unit 1 (NPU1), where NPU1 is the most fundamental cause of the anomaly. The user interface not only displays the anomaly cause in schematic form but also shows the details of the anomaly cause and the corresponding anomaly handling solution. Reconfiguring NPU1 according to the proposed solution can resolve the network anomaly.
[0162] In summary, the solution provided in this embodiment proposes an anomaly cause analysis method. By utilizing a multi-domain knowledge graph, when performing anomaly cause analysis, network anomaly events can be comprehensively analyzed based on single-domain network anomaly events within the multi-domain knowledge graph. This allows for the identification of anomaly causes involving one or more domains, achieving cross-domain anomaly cause analysis. Furthermore, the multi-domain knowledge graph includes various anomaly events from multiple domains, providing broader coverage. The correlation between network anomaly events across different domains also enhances the accuracy of anomaly cause analysis results.
[0163] For example, Figure 9 shows a schematic diagram of anomaly cause analysis. This method can be executed by a computer device. For example, the computer device can be the multi-domain operation and maintenance device 203 shown in Figure 2 or the multi-domain operation and maintenance device shown in Figure 3.
[0164] The computer device obtains device operation data from each of the multiple domains through the Customer Contact Center Agent (CC Agent) 901 and processes this data using the ZooKeeper-based Kafka messaging system 902. This processing includes data cleaning and data transformation. Data cleaning removes invalid, erroneous, or incomplete data from the device operation data. Data transformation converts the data into a standardized format, for example by standardizing timestamps or unifying units.
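The cleaning and transformation steps can be sketched as follows. The record fields (`device`, `timestamp`, `value`, `unit`), the choice of epoch seconds as the input timestamp format, and the conversion of all values to bytes are assumptions made for illustration; the message transport through Kafka is omitted.

```python
from datetime import datetime, timezone

# Assumed unit table used to unify measurements to bytes.
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def clean_and_transform(records):
    """Drop invalid records, then standardize timestamps and units."""
    cleaned = []
    for rec in records:
        try:
            device = rec["device"]
            ts = datetime.fromtimestamp(float(rec["timestamp"]), tz=timezone.utc)
            value = float(rec["value"]) * UNIT_FACTORS[rec["unit"]]
        except (KeyError, TypeError, ValueError):
            continue  # incomplete or erroneous record: discard
        if value < 0:
            continue  # invalid measurement: discard
        cleaned.append(
            {"device": device, "timestamp": ts.isoformat(), "value_bytes": value}
        )
    return cleaned


# Illustrative records: one valid, one with a bad timestamp, one missing a unit.
records = [
    {"device": "switch16", "timestamp": 0, "value": 2, "unit": "KB"},
    {"device": "node435", "timestamp": "bad", "value": 1, "unit": "B"},
    {"device": "node519", "timestamp": 60, "value": 1},
]
standardized = clean_and_transform(records)
```

Only the first record survives: the second has an unparsable timestamp and the third lacks a unit, so both are removed during cleaning.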
[0165] After processing the device operation data for each domain, anomaly data identification 903 is performed on the device operation data for each domain to obtain network anomaly event 904. After obtaining network anomaly event 904, a search is performed in the multi-domain knowledge graph 908 based on network anomaly event 904 to obtain the anomaly cause 909 that caused network anomaly event 904.
[0166] Specifically, after network anomaly event 904 is obtained, it is mounted in the multi-domain knowledge graph 908. That is, the node in the multi-domain knowledge graph 908 that records network anomaly event 904 is taken as the starting node, and a reverse search is performed in the graph to obtain the destination node. Then, the root cause trees corresponding to the destination nodes are merged 907 to finally obtain the anomaly cause 909 that led to network anomaly event 904.
[0167] The following example illustrates the anomaly cause analysis process. Figure 10 shows a schematic diagram of the anomaly cause analysis method. This method can be executed by a computer device. For example, the computer device can be the multi-domain operation and maintenance device 203 shown in Figure 2 or the multi-domain operation and maintenance device shown in Figure 3.
[0168] The computer device acquires device operation data from computing domain 1001 and identifies abnormal data within this data, resulting in a network anomaly event in computing domain 1001: "Online service interruption." Next, it searches the multi-domain knowledge graph for a node recording "Online service interruption." If such a node exists, it uses it as the starting point and performs a reverse search within the multi-domain knowledge graph to find the destination nodes: a destination node in network domain 1002 recording "Network connectivity problem" and a destination node in storage domain 1003 recording "Storage service unavailable."
[0169] After obtaining the destination node, the one or more sets of cause information corresponding to the destination node are merged to obtain the cause of the anomaly. That is, the primary causes of the final anomaly are: insufficient network bandwidth, router failure, and hard drive damage; the secondary causes are: hardware failure, configuration failure, disk aging, and disk read / write errors. Among them, insufficient network bandwidth, router failure, hardware failure, and configuration failure belong to the cause information in network domain 1002, while hard drive damage, disk aging, and disk read / write errors belong to the cause information in storage domain 1003.
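The example above can be reproduced with the following sketch. It assumes that edges in the multi-domain knowledge graph point from a cause event to the event it triggers, and that a destination node is an upstream node with no further upstream cause; both are assumptions made for illustration. Because the example does not state which secondary cause belongs to which primary cause, the cause information is grouped by level rather than stored as a tree.

```python
from collections import deque

# Edges point from cause event to effect event (assumed direction).
EDGES = [
    ("Network connectivity problem", "Online service interruption"),
    ("Storage service unavailable", "Online service interruption"),
]
# Cause information per destination node, grouped by level as in the example.
CAUSES = {
    "Network connectivity problem": {
        1: ["insufficient network bandwidth", "router failure"],
        2: ["hardware failure", "configuration failure"],
    },
    "Storage service unavailable": {
        1: ["hard drive damage"],
        2: ["disk aging", "disk read/write errors"],
    },
}


def reverse_search(edges, start):
    """Walk edges backwards from start; return upstream nodes with no parent."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, []).append(cause)
    seen, queue, destinations = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        upstream = parents.get(node, [])
        if not upstream and node != start:
            destinations.append(node)
        for parent in upstream:
            if parent not in seen:  # guards against cycles
                seen.add(parent)
                queue.append(parent)
    return destinations


def merge_by_level(causes, nodes):
    """Merge the cause information of the given nodes, deduplicated per level."""
    merged = {}
    for node in nodes:
        for level, items in causes.get(node, {}).items():
            bucket = merged.setdefault(level, [])
            for item in items:
                if item not in bucket:
                    bucket.append(item)
    return merged


destinations = reverse_search(EDGES, "Online service interruption")
anomaly_cause = merge_by_level(CAUSES, destinations)
```

With this data, the search returns the two destination nodes in the network domain and the storage domain, and the merged result lists the three primary causes at level 1 and the four secondary causes at level 2.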
[0170] In summary, the multi-domain knowledge graph not only demonstrates the cascading relationships between historical network anomalies within a single domain, but also the cascading relationships between historical network anomalies across domains. When conducting anomaly cause analysis, it is possible to comprehensively analyze network anomalies based on historical network anomaly events and cause information from multiple domains, ultimately obtaining the anomaly causes involving different domains, thus realizing cross-domain anomaly cause analysis and improving the accuracy of anomaly cause analysis results.
[0171] The foregoing mainly describes the solution provided in this application. Accordingly, this application also provides an anomaly cause analysis device, which is used to implement the above-described method embodiments.
[0172] In some embodiments, the anomaly cause analysis device includes hardware structures and / or software modules corresponding to the execution of each function in order to achieve the above-described functions. Those skilled in the art will readily recognize that, based on the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0173] This application embodiment can divide the anomaly cause analysis device into functional modules according to the above method embodiment. For example, each function can be divided into its own functional module, or two or more functions can be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the module division in this application embodiment is illustrative and only represents one logical functional division. In actual implementation, there may be other division methods.
[0174] In some embodiments, this application provides an anomaly cause analysis device, which is used to implement the functions of the anomaly cause analysis device in the above-described anomaly cause analysis method embodiments. Figure 11 shows a schematic diagram of the anomaly cause analysis device. The anomaly cause analysis device may include an identification module 1101, a search module 1102, and a construction module 1103.
[0175] Specifically, the identification module 1101 is used to execute step 602 in the method illustrated in Figure 6, and steps 701 and 702 in the method illustrated in Figure 7. The search module 1102 is used to execute step 604 in the method illustrated in Figure 6, and steps 703 and 704 in the method illustrated in Figure 7. The construction module 1103 is used to execute steps 401, 402, and 403 in the method illustrated in Figure 4.
[0176] As shown in Figure 12, the computer device provided in this embodiment may include a processor 1201, a bus 1202, a communication interface 1203, and a memory 1204. The processor 1201, the memory 1204, and the communication interface 1203 communicate with each other via the bus 1202. It should be understood that this application does not limit the number of processors and memories in the computer device.
[0177] Bus 1202 can be a PCI bus, an Extended Industry Standard Architecture (EISA) bus, or a UB bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 12, but this does not imply that there is only one bus or one type of bus. Bus 1202 can include pathways for transmitting information between various components of the computer device (e.g., memory 1204, processor 1201, communication interface 1203).
[0178] Processor 1201 may include any one or more processors such as CPU, graphics processing unit (GPU), microprocessor (MP), or digital signal processor (DSP).
[0179] The memory 1204 may include volatile memory, such as random access memory (RAM). The memory 1204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).
[0180] The communication interface 1203 uses transceiver modules, such as, but not limited to, network interface cards and transceivers, to enable communication between computer devices and other devices or communication networks.
[0181] The memory 1204 stores executable program code, and the processor 1201 executes the executable program code to implement the functions of the anomaly cause analysis device or the CPU core in the aforementioned method embodiments. That is, the memory 1204 stores instructions for executing the above-mentioned anomaly cause analysis method.
[0182] In another aspect, a computer-readable storage medium is provided, wherein at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to implement the anomaly cause analysis method as provided in the above-described method embodiments.
[0183] In another aspect, a computer program product is provided, which includes a computer program or instructions that, when executed by a processor, implement the anomaly cause analysis method described above.
[0184] In another aspect, a chip system is provided, including at least one processor and at least one interface circuit, wherein the at least one interface circuit is used to perform transceiver functions and send instructions to the at least one processor, and when the at least one processor executes the instructions, the at least one processor implements the anomaly cause analysis method as described above.
[0185] The method steps in this embodiment can be implemented in hardware or by a processor executing software instructions. The software instructions can consist of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, portable hard disks, CD-ROMs, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and storage medium can reside in an application-specific integrated circuit (ASIC). In addition, the ASIC can reside in a computer device. Of course, the processor and storage medium can also exist as discrete components in the computer device.
[0186] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are performed entirely or partially. The computer can be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user equipment, or other programmable device. The computer program or instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another. For example, the computer program or instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium, such as a floppy disk, hard disk, or magnetic tape; it can also be an optical medium, such as a digital video disc (DVD); or it can be a semiconductor medium, such as a solid-state drive (SSD).
The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for analyzing the causes of anomalies, characterized in that, The method includes: Abnormal data identification is performed on the device operation data of each of the multiple domains to obtain network anomaly events. Each network anomaly event indicates an abnormal behavior or an abnormal network event in a single domain. The multiple domains include at least two different functional areas in the network environment. Based on the network anomaly event, a search is performed in a multi-domain knowledge graph to obtain the anomaly cause of the network anomaly event. The multi-domain knowledge graph includes historical network anomaly events of the multiple domains, information on the causes of the historical network anomaly events, and the cascading relationship of historical network anomaly events between different domains.
2. The method according to claim 1, characterized in that, The nodes in the multi-domain knowledge graph are used to record the historical network anomaly events. The edge between the first node and the second node indicates the cascading relationship between the historical network anomaly events indicated by the first node and the second node, and the first node and the second node are any nodes in the multi-domain knowledge graph. The step of searching a multi-domain knowledge graph based on the network anomaly event to obtain the anomaly cause of the network anomaly event includes: Using the node that records each network anomaly event in the multi-domain knowledge graph as the starting node, a reverse search is performed in the multi-domain knowledge graph to obtain the destination node; The cause of the anomaly is obtained by fusing one or more sets of cause information corresponding to the destination node.
3. The method according to claim 2, characterized in that, The method further includes: If there is no node in the multi-domain knowledge graph for recording the first network anomaly event, a third node for recording the first network anomaly event is constructed and connected to the fourth node; the first network anomaly event is any one of the network anomaly events. Wherein, the entity that caused the historical network anomaly event indicated by the fourth node is the same as or has a connection relationship with the entity that caused the first network anomaly event; or, the correlation between the historical network anomaly event indicated by the fourth node and the first network anomaly event is greater than or equal to a threshold value.
4. The method according to any one of claims 1 to 3, characterized in that, The method further includes: Obtain historical network anomaly events and their causes for each of the multiple domains; Determine the cascading relationships among the historical network anomaly events in the multiple domains; The multi-domain knowledge graph is generated based on the historical network anomaly events and their causes in each single domain, as well as the cascading relationships.
5. The method according to claim 4, characterized in that, The plurality of domains includes a storage domain, a computing domain, or a network domain; determining the cascading relationships among the historical network anomaly events in the plurality of domains includes: Obtain the operational relationship corresponding to the historical network anomaly event for each single domain. The operational relationship includes at least one of the following: job scheduling chain relationship, program call relationship, and communication record. The job scheduling chain relationship is used to describe the flow path relationship of the historical network anomaly event between different domains. The program call relationship is used to describe the mutual call relationship of the historical network anomaly event between different domains. The communication record is used to describe the communication relationship of the historical network anomaly event between different domains. Historical network anomaly events that are related in the operational relationship are considered as historical network anomaly events with a cascading relationship.
6. The method according to claim 5, characterized in that, The step of treating historical network anomaly events that are associated in the operational relationship as historical network anomaly events with a cascading relationship includes: Based on the job scheduling chain relationship, historical network anomaly events that are related in the storage domain and the computing domain are regarded as historical network anomaly events with a cascading relationship. Based on the program call relationship, historical network anomaly events that are associated in the network domain and the storage domain are regarded as historical network anomaly events with a cascading relationship. Based on the communication records and the job scheduling chain relationship, historical network anomaly events that are related in the network domain and the computing domain are regarded as historical network anomaly events with a cascading relationship.
7. The method according to any one of claims 2 to 6, characterized in that, The nodes in the multi-domain knowledge graph correspond to a root cause tree, and the root cause tree of the fifth node indicates a set of cause information that led to the historical network anomaly event recorded by the fifth node. The step of fusing one or more sets of cause information corresponding to the destination node to obtain the cause of the anomaly includes: The root cause tree corresponding to the destination node is merged to obtain the cause of the anomaly.
8. The method according to any one of claims 4 to 6, characterized in that, The step of obtaining historical network anomaly events and cause information for each of the multiple domains includes: The system receives the historical network anomaly events and their causes reported by each of the multiple domains. The historical network anomaly events and the cause information are reported based on historical information collection requests or proactively reported.
9. The method according to any one of claims 1 to 8, characterized in that, The process of identifying abnormal data in the device operation data of each of the multiple domains to obtain network anomaly events includes: Abnormal data identification is performed on the device operation data in each of the multiple domains to obtain abnormal operation data in each domain; Based on the anomaly type and severity of the abnormal operation data, the network anomaly event in each domain is generated.
10. The method according to any one of claims 1 to 9, characterized in that, The method further includes: Receive device operation data reported by each of the multiple domains respectively; The device operation data reported by each domain is either reported based on an operation data collection request or reported proactively.
11. An anomaly cause analysis device, characterized in that, The device includes: The identification module is used to identify abnormal data in the device operation data of each of the multiple domains to obtain network anomaly events. Each network anomaly event indicates an abnormal behavior or an abnormal network event in a single domain. The multiple domains include at least two different functional areas in the network environment. The search module is used to search in a multi-domain knowledge graph based on the network anomaly event to obtain the anomaly cause of the network anomaly event. The multi-domain knowledge graph includes historical network anomaly events of the multiple domains, cause information of the historical network anomaly events, and cascading relationships of historical network anomaly events between different domains.
12. A computer device, characterized in that, The computer device includes a processor and a memory, wherein the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the anomaly cause analysis method as described in any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores at least one computer program, which is loaded and executed by a processor to implement the anomaly cause analysis method as described in any one of claims 1 to 10.
14. A computer program product, characterized in that, The computer program product includes a computer program or instructions that, when executed by a processor, implement the anomaly cause analysis method as described in any one of claims 1 to 10.