Anomaly detection processing method and apparatus

By acquiring abnormal events and dependencies in a distributed architecture, identifying common dependent components, and performing state indicator detection, the problem of low efficiency in anomaly detection is solved, achieving efficient and accurate anomaly detection.

CN122240370A · Pending · Publication Date: 2026-06-19 · BEIJING QIYI CENTURY SCI & TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING QIYI CENTURY SCI & TECH CO LTD
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In distributed architectures, anomaly detection is inefficient, and existing methods cannot effectively identify whether different abnormal system components originate from the same dependent component, causing the investigation direction to deviate from the root cause, resulting in low efficiency and a high risk of misjudgment.

Method used

The set of abnormal events and the dependencies between system components in the distributed architecture are obtained, the common dependent components of multiple abnormal system components are identified, and status indicator detection is performed on those common dependent components to generate anomaly detection results.

Benefits of technology

It improves the efficiency and accuracy of anomaly detection by determining whether the anomalies of different abnormal system components originate from the same dependent component, which reduces false positives and improves repair efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides an anomaly detection and processing method and apparatus. The method includes: acquiring a set of anomaly events under a distributed architecture and the dependencies between system components under the distributed architecture; determining multiple abnormal system components in the distributed architecture based on the anomaly events in the set of anomaly events; reading the dependent components of each abnormal system component based on the dependencies; determining the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component; performing state indicator detection on the common dependent components; identifying the common dependent components that fail the state indicator detection as abnormal dependent components; and generating anomaly detection results for the distributed architecture based on the abnormal dependent components, thereby improving the efficiency of anomaly detection in the distributed architecture.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to an anomaly detection and processing method and apparatus.

Background Technology

[0002] With the popularization of microservices and cloud-native architectures, software systems are becoming increasingly complex. A single business request often has to be completed collaboratively by multiple independently deployed service systems. While this architecture improves system elasticity and development efficiency, it also makes abnormal issues extremely difficult to locate and troubleshoot. Currently, when an online business experiences anomalies, such as failed user requests or a surge in response latency, troubleshooting still relies on isolated investigation and manual source tracing, analyzing the anomaly from the perspective of a single system. Specifically, to improve investigation efficiency, a troubleshooting reminder can be sent to at least one investigation team; after each team checks logs, metrics, and configurations, the investigation scope is gradually narrowed through information synchronization. However, isolated investigation and manual source tracing are inefficient. Because there is no global dependency view, when a shared resource fails, all services that depend on it report errors simultaneously, and each team, operating in an information silo, may mistakenly believe that the problem lies in its own system and conduct a full self-check, steering the investigation away from the true root cause. Furthermore, since each team analyzes only the anomaly logs of its own system and has no effective means of correlation analysis, the errors are easily misjudged as multiple independent anomaly events.

Summary of the Invention

[0003] The purpose of this invention is to provide an anomaly detection processing method and apparatus to solve the problem of low anomaly detection efficiency.

[0004] In a first aspect of this invention, an anomaly detection and processing method is provided, comprising: obtaining a set of abnormal events under a distributed architecture, as well as the dependencies between system components under the distributed architecture; determining multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set; reading the dependent components of each abnormal system component based on the dependencies, and determining the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component; and performing status indicator detection on the common dependent components, identifying the common dependent components that fail the status indicator detection as abnormal dependent components, and generating an anomaly detection result for the distributed architecture based on the abnormal dependent components.

[0005] In a second aspect of the present invention, an anomaly detection and processing apparatus is also provided, comprising: an acquisition unit, configured to acquire a set of abnormal events under a distributed architecture, as well as the dependencies between system components under the distributed architecture; and a processing unit, configured to determine multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set. The processing unit is further configured to read the dependent components of each abnormal system component based on the dependencies, and to determine the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component. The processing unit is further configured to perform status indicator detection on the common dependent components, identify the common dependent components that fail the status indicator detection as abnormal dependent components, and generate an anomaly detection result for the distributed architecture based on the abnormal dependent components.

[0006] In another aspect of the present invention, an electronic device is also provided, comprising: a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store computer programs; and the processor, when executing the program stored in the memory, implements any of the above-described anomaly detection and processing methods.

[0007] In another aspect of the present invention, a computer-readable storage medium is also provided, wherein instructions are stored therein, which, when executed on a computer, cause the computer to perform any of the above-described anomaly detection processing methods.

[0008] In another aspect of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the above-described anomaly detection and processing methods.

[0009] The anomaly detection and processing method provided in this embodiment of the invention first identifies multiple abnormal system components under a distributed architecture based on the abnormal events in an abnormal event set. Then, using the dependencies between system components under the distributed architecture, it obtains the dependent components of each abnormal system component. Next, it determines the common dependent components of the multiple abnormal system components, performs status indicator detection on them, and identifies those that fail the detection as abnormal dependent components. Finally, it generates an anomaly detection result for the distributed architecture based on the abnormal dependent components. Because the dependencies between system components are configured in advance, the dependent components of each abnormal system component can be read directly from them and the common dependent components identified before any status indicator detection is performed. As a result, a dependent component shared by several abnormal system components undergoes status indicator detection only once, which improves detection efficiency, and it can be accurately determined whether the anomalies of different abnormal system components originate from the same dependent component, which improves detection accuracy.

Brief Description of the Drawings

[0010] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.

[0011] Figure 1 is a flowchart of the anomaly detection and processing method provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the implementation environment of the anomaly detection and processing method provided in an embodiment of the present invention; Figure 3 is a schematic diagram of an anomaly detection process applied to an anomaly detection scenario in a distributed architecture, provided by an embodiment of the present invention; Figure 4 is a schematic diagram of the anomaly detection and processing apparatus provided in an embodiment of the present invention; Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0012] The technical solutions of the embodiments of the present invention will now be described with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the embodiments and features of the present invention can be combined with each other.

[0013] Numerous specific details are set forth in the following description in order to provide a full understanding of the invention, but the invention may also be practiced in other ways different from those described herein. Obviously, the embodiments in the specification are only some, not all, of the embodiments of the invention.

[0014] In embodiments of the present invention, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design described as "exemplary" or "for example" in embodiments of the present invention should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner. Furthermore, in the description of embodiments of the present invention, unless otherwise stated, "a plurality of" means two or more.

[0015] As shown in Figure 1, the anomaly detection and processing method provided in this embodiment of the invention includes the following steps S101 to S104.

[0016] Step S101: Obtain the set of abnormal events under the distributed architecture, as well as the dependencies between system components under the distributed architecture.

[0017] In practical applications, a business request often requires the collaboration of multiple independently deployed systems. Therefore, this embodiment provides a distributed architecture to achieve efficient business response. The distributed architecture in this embodiment refers to a pattern where the functionality of an application system is broken down into multiple loosely coupled system components with independent deployment capabilities. These components can interact and collaborate via network protocols to jointly achieve business goals.

[0018] In practical implementation, the set of abnormal events under the distributed architecture and the dependencies between system components under the distributed architecture are obtained. The system components in this embodiment include functional units corresponding to independently deployable systems. Optionally, the system components include at least one of the following: service nodes, databases, caches, message middleware, and load balancers.

[0019] It should be noted that the system components in this embodiment may include service nodes and shared resources. Shared resources may include databases, caches, message brokers, and / or load balancers. The dependencies in this embodiment include upstream and downstream dependencies between system components in a distributed architecture.

[0020] In this embodiment, the abnormal event set is a set made up of abnormal events. In specific execution, to improve the efficiency of anomaly detection and to avoid determining too many common dependent components from the abnormal event set, which would slow down the subsequent status indicator detection, the abnormal event set can be restricted to highly correlated abnormal events, that is, a set of associated abnormal events. On this basis, the abnormal event set can be replaced with an associated abnormal event set.

[0021] Specifically, in obtaining the set of abnormal events under a distributed architecture, event correlation detection can be performed on candidate abnormal event sets, and the candidate abnormal event sets that pass the event correlation detection are determined as the abnormal event set. Specifically, a sliding window-based clustering algorithm can be used to partition multiple abnormal events under the distributed architecture to obtain at least one candidate abnormal event set. For example, the T-Dense (Time-based Dense) clustering algorithm can be used to partition multiple abnormal events under the distributed architecture to obtain at least one candidate abnormal event set.

[0022] To improve the validity of the candidate abnormal event sets, and to avoid building a candidate set from abnormal events that are far apart in time (which would make the anomaly detection result inaccurate once that candidate set is determined to be the abnormal event set), this embodiment can divide the multiple abnormal events according to the time information of each abnormal event under the distributed architecture to obtain at least one candidate abnormal event set. On this basis, in an optional implementation of this embodiment, the following operations are performed when obtaining the abnormal event set under the distributed architecture: obtaining multiple abnormal events under the distributed architecture, and sorting them according to the time information of each abnormal event to obtain an abnormal event list; dividing the abnormal events in the abnormal event list according to a preset time window to obtain at least one candidate abnormal event set; and performing event correlation detection on each candidate abnormal event set, and determining a candidate abnormal event set that passes the event correlation detection as the abnormal event set.

[0023] In the specific execution process, after obtaining multiple abnormal events under the distributed architecture, a time window-based clustering algorithm is used to divide the multiple abnormal events to obtain at least one candidate abnormal event set. Then, event correlation detection is performed on each candidate abnormal event set, and the candidate abnormal event set that passes the event correlation detection is determined as the abnormal event set.

[0024] Specifically, the multiple abnormal events are sorted according to their time information to obtain an abnormal event list. Then, the abnormal events in the abnormal event list are divided according to a preset time window to obtain at least one candidate abnormal event set. Finally, event correlation detection is performed on each candidate abnormal event set, and the candidate abnormal event set that passes the event correlation detection is determined as the abnormal event set.

[0025] For example, abnormal events include e1, e2, ..., en. Each abnormal event is sorted according to time information. Then, according to a preset time window W (5 minutes), a candidate abnormal event set is constructed based on the abnormal events falling within the same preset time window W. After obtaining the candidate abnormal event set, event association detection is performed on at least one candidate abnormal event set, and the candidate abnormal event set that passes the event association detection is determined as the abnormal event set.
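The sorting and window-division steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the `AbnormalEvent` class, its field names, and the rule that a window opens at the first event of each candidate set are assumptions made for the example; only the 5-minute window W comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class AbnormalEvent:
    # Hypothetical event shape; field names are assumptions for illustration.
    service_name: str
    timestamp: datetime
    error_message: str

def partition_by_window(events, window=timedelta(minutes=5)):
    """Sort events by time and group them into candidate abnormal event sets."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    candidate_sets = []
    for event in ordered:
        # Open a new candidate set when the event lies more than `window`
        # after the first event of the current candidate set.
        if not candidate_sets or event.timestamp - candidate_sets[-1][0].timestamp > window:
            candidate_sets.append([event])
        else:
            candidate_sets[-1].append(event)
    return candidate_sets

t0 = datetime(2026, 1, 1, 12, 0)
events = [
    AbnormalEvent("service-b", t0 + timedelta(minutes=2), "db timeout"),
    AbnormalEvent("service-a", t0, "db timeout"),
    AbnormalEvent("service-c", t0 + timedelta(minutes=9), "cache miss storm"),
]
candidate_sets = partition_by_window(events)
```

Here the first two events fall within the same 5-minute window and form one candidate set, while the third event starts a second candidate set.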

[0026] In specific implementation, low correlation between the abnormal events in an abnormal event set would yield a large number of common dependent components and slow down the subsequent status indicator detection. To avoid this, in this embodiment the event association detection of a candidate abnormal event set is determined to be passed if the event similarity of most pairs of abnormal events in that set is greater than a similarity threshold; this is considered to hold when the number of event similarities greater than the similarity threshold exceeds a preset similarity number. In an optional implementation provided in this embodiment, when performing event association detection on a first candidate abnormal event set among the at least one candidate abnormal event set, the event similarity of every pair of abnormal events in the first candidate abnormal event set is first calculated to obtain multiple event similarities, and then the number of those similarities that are greater than the similarity threshold (the similarity number) is counted. If the similarity number meets a first preset condition, the event association detection of the first candidate abnormal event set is determined to be passed; otherwise, it is determined to be failed. The first candidate abnormal event set is any one of the at least one candidate abnormal event set.

[0027] Specifically, to calculate the event similarity between any two abnormal events, the first abnormal information of the first abnormal event and the second abnormal information of the second abnormal event can first be extracted, and the similarity between the first and second abnormal information is then calculated as the event similarity. To make this calculation more efficient, features can first be extracted from the first and second abnormal information to obtain a first feature and a second feature, and the similarity between the two features is then calculated. Feature extraction converts the text into vector form, and calculating the similarity between the first and second feature vectors as the event similarity improves the efficiency of the similarity calculation.

[0028] In this embodiment, feature extraction models can be used to extract features from the first abnormal information and the second abnormal information respectively.

[0029] For example, the first and second anomalous information can be input into the Bag-of-Words Model or TF-IDF (Term Frequency - Inverse Document Frequency) model in an NLP (Natural Language Processing) model for feature extraction to obtain the first and second features. Then, the cosine similarity between the first and second features can be calculated as the event similarity between the first and second anomalous events.
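As a rough illustration of this feature extraction and similarity step, the sketch below uses a plain bag-of-words term count (one of the two options named above) and cosine similarity, written with the Python standard library only. The tokenization rule and the function names are assumptions made for the example; a TF-IDF weighting could be substituted for the raw counts.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Convert abnormal information (e.g. an error message) into a term-count vector."""
    # Assumed tokenization: lowercase alphanumeric runs.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def event_similarity(first_info, second_info):
    """Event similarity = cosine similarity of the two feature vectors."""
    return cosine_similarity(bag_of_words(first_info), bag_of_words(second_info))

same = event_similarity("connection timeout", "connection timeout")
different = event_similarity("connection timeout", "disk full")
```

Identical messages yield a similarity of approximately 1, and messages that share no terms yield 0.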

[0030] After calculating the event similarity between any two abnormal events and obtaining multiple event similarities, if the number of similarities greater than a similarity threshold among the multiple event similarities satisfies a first preset condition, the event association detection of the first candidate abnormal event set is determined to be successful. The first preset condition includes a similarity number greater than a preset similarity number; the preset similarity number can be pre-configured or calculated based on the number of abnormal events included in the first candidate abnormal event set and a preset ratio; this embodiment does not impose any limitations on this.

[0031] For example, with a similarity threshold of 0.7, the number of event similarities in the first candidate abnormal event set that are greater than 0.7 is counted. With a pre-configured similarity number m, the event association detection for the first candidate abnormal event set passes if the counted number is greater than m and fails otherwise. Alternatively, if the first candidate abnormal event set contains n abnormal events and the pre-configured ratio is 80%, the detection passes if the counted number is greater than 0.8n and fails otherwise. If the event association detection passes, the first candidate abnormal event set is determined to be an abnormal event set. If it fails, no action is taken, or event association detection is performed on the next candidate abnormal event set.
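The pass/fail rule in this example can be sketched as below. The threshold 0.7 and the 80% ratio come from the text; the `token_jaccard` function is only a stand-in similarity chosen to keep the example self-contained (it is an assumption, not the similarity described above), and all function and parameter names are hypothetical.

```python
from itertools import combinations

def token_jaccard(a, b):
    """Stand-in similarity (assumption): token-set overlap between two messages."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def passes_association_detection(messages, similarity=token_jaccard,
                                 threshold=0.7, preset_count=None, ratio=0.8):
    """Return True if enough pairwise event similarities exceed the threshold."""
    similarities = [similarity(a, b) for a, b in combinations(messages, 2)]
    similarity_number = sum(1 for s in similarities if s > threshold)
    # Use the pre-configured count m if given, otherwise ratio * n,
    # where n is the number of abnormal events in the candidate set.
    required = preset_count if preset_count is not None else ratio * len(messages)
    return similarity_number > required

related = ["mysql connection timeout", "mysql connection timeout",
           "mysql connection timeout m1"]
unrelated = ["mysql connection timeout", "disk full", "cache eviction failed"]
related_passes = passes_association_detection(related)
unrelated_passes = passes_association_detection(unrelated)
```

One consequence of the ratio rule as stated is that a set of n events has n(n-1)/2 pairs, so a three-event set passes only if all three pairs exceed the threshold; implementers may want to choose the ratio or the preset count with the expected set sizes in mind.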

[0032] The above describes the event association detection process for the first candidate abnormal event set. For any candidate abnormal event set in at least one candidate abnormal event set, the above method can be used for event association detection. Please refer to the above relevant content for details. This embodiment will not repeat the details here.

[0033] In this embodiment, abnormal events are clustered based on a preset time window, and an abnormal event set constructed from highly correlated abnormal events is obtained by calculating event similarity. This improves the effectiveness of the abnormal event set under the distributed architecture and also improves the processing efficiency of subsequent processing based on the abnormal event set.

[0034] In this embodiment, the multiple abnormal events under the distributed architecture include abnormal events generated from system service data obtained from multiple data acquisition system components, which makes the obtained abnormal events more comprehensive. In one optional implementation of this embodiment, the multiple abnormal events under the distributed architecture are obtained as follows: acquiring system service data collected by at least one data acquisition system component, and checking whether a system operation indicator in the system service data is less than an operation indicator threshold; if so, generating an abnormal event based on the system service data; otherwise, performing no processing.

[0035] The data acquisition system components may include monitoring system components, logging system components, and / or status metric acquisition components. The monitoring system components collect business and system metrics from each service node as system service data. For example, business metrics may include QPS (Queries Per Second), success rate, and / or latency; system metrics may include CPU (Central Processing Unit) and / or memory. The logging system components centrally collect and index application logs and error logs from all system components as system service data. The status metric acquisition components can be deployed on shared resources to collect the real-time health status of these resources as system service data, such as MySQL (an open-source relational database management system) connection count and CPU utilization.

[0036] After acquiring system service data collected by at least one data acquisition system component, it is checked whether the system service data is abnormal. If so, an abnormal event is generated based on the abnormal system service data; otherwise, no processing is performed. To improve the convenience of system service data detection and to ensure that the abnormal system service data is comprehensive and accurate enough in the subsequent abnormal event generation process, in this embodiment, after acquiring the system service data, the system service data is first standardized to obtain event objects. Then, anomaly detection is performed on the event objects, and abnormal events are determined based on the event objects that fail the anomaly detection.

[0037] During standardization of the system service data, data cleaning and formatting can be performed, and tags such as a timestamp and a service name can be added. After an abnormal event is generated from the system service data, the timestamp can then be read as the time information and the abnormal system component can be identified from the service name, which makes both easier to determine. In specific execution, timestamp and service name tags can be added to the system service data to obtain the event object; the timestamp and service name can be determined from the system service data.

[0038] In this embodiment, the event object may include at least one of the following: service name, timestamp, error type, error message, and resource indicators. The resource indicators may include one or more items of the aforementioned system service data, such as QPS, success rate, latency, CPU, memory, application logs, error logs, associated database address, MySQL connection count, and / or CPU utilization. It should be noted that the abnormal information mentioned above may include the error type and / or the error message.

[0039] During anomaly detection of event objects, it can be checked whether the success rate, a system performance indicator for the event object, is less than a success rate threshold. If so, the anomaly detection for that event object is determined to have failed, and the event object is identified as an abnormal event. The success rate threshold can be 99%.
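A minimal sketch of the standardization and success-rate check might look like this. The 99% threshold and the event object fields are taken from the text; the raw record layout (`service`, `ts`, `success_rate` keys) and the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone

SUCCESS_RATE_THRESHOLD = 0.99  # 99%, as in the example above

def standardize(raw):
    """Clean a raw record and attach timestamp and service-name tags."""
    # The raw keys used here are hypothetical; real collectors will differ.
    return {
        "service_name": raw["service"].strip().lower(),
        "timestamp": datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        "error_type": raw.get("error_type"),
        "error_message": raw.get("error_message"),
        "resource_indicators": {"success_rate": float(raw["success_rate"])},
    }

def detect_abnormal_events(raw_records, threshold=SUCCESS_RATE_THRESHOLD):
    """Return event objects whose success rate is below the threshold."""
    event_objects = [standardize(record) for record in raw_records]
    return [e for e in event_objects
            if e["resource_indicators"]["success_rate"] < threshold]

raw_records = [
    {"service": " Service-A ", "ts": 1767268800, "success_rate": "0.95",
     "error_type": "Timeout", "error_message": "mysql connection timeout"},
    {"service": "service-b", "ts": 1767268860, "success_rate": 0.999},
]
abnormal_events = detect_abnormal_events(raw_records)
```

Only the first record falls below the threshold, so it alone becomes an abnormal event, with its service name normalized to `service-a`.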

[0040] It should be noted that the above method is used to obtain multiple abnormal events under a distributed architecture.

[0041] In practical implementation, the data acquisition system components may also include a dependency management system component, which stores the dependencies between system components in a distributed architecture. Obtaining the dependencies between system components in a distributed architecture is essentially obtaining the dependencies stored by the dependency management system component.

[0042] It should be noted that the stored dependencies can be pre-configured, or they can be derived from real-time call-chain data, which enables automatic discovery and updating of dependencies and improves their accuracy. The real-time call-chain data can be collected with, for example, OpenTelemetry.

[0043] It should also be noted that, in order to improve the convenience and efficiency of reading dependent components based on dependencies, dependencies can be stored in the form of a graph database, such as Neo4j (a graph-oriented database management system).

[0044] Step S102: Based on the abnormal events in the abnormal event set, identify multiple abnormal system components of the distributed architecture.

[0045] After the set of abnormal events under the distributed architecture and the dependencies between system components under the distributed architecture are obtained, multiple abnormal system components of the distributed architecture are identified according to the abnormal events in the abnormal event set.

[0046] The abnormal system components in this embodiment include abnormal service nodes.

[0047] In the specific execution process, to avoid the impact of identifying duplicate abnormal system components on processing efficiency, this embodiment provides an optional implementation method. In the process of determining multiple abnormal system components of the distributed architecture based on abnormal events in the abnormal event set, the initial abnormal system component corresponding to each abnormal event in the abnormal event set is first obtained. Then, the initial abnormal system components are deduplicated to obtain multiple abnormal system components. Furthermore, in this embodiment, the determined abnormal system component can also be a single component; that is, the multiple abnormal system components determined in this embodiment can be replaced by at least one abnormal system component.

[0048] Specifically, when the abnormal system components are abnormal service nodes, the initial abnormal service node corresponding to each abnormal event in the abnormal event set is obtained, and the initial abnormal service nodes are deduplicated to obtain multiple abnormal service nodes. Deduplication prevents the same initial abnormal service node from being processed repeatedly in later steps, which improves processing efficiency.

[0049] For example, read the service name of each abnormal event in the abnormal event set to obtain the initial abnormal service node corresponding to each abnormal event, and then perform deduplication on the initial abnormal service nodes to obtain multiple abnormal service nodes.
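The read-and-deduplicate step can be sketched in a few lines. The dictionary-based event shape and the `service_name` key are assumptions for the example; preserving first-seen order is a design choice of this sketch, not a requirement stated in the text.

```python
def abnormal_service_nodes(abnormal_event_set):
    """Read each event's service name and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(event["service_name"] for event in abnormal_event_set))

abnormal_event_set = [
    {"service_name": "service-a", "error_message": "mysql connection timeout"},
    {"service_name": "service-b", "error_message": "mysql connection timeout"},
    {"service_name": "service-a", "error_message": "mysql connection refused"},
    {"service_name": "service-c", "error_message": "mysql connection timeout"},
]
nodes = abnormal_service_nodes(abnormal_event_set)
```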

[0050] Step S103: Read the dependent components of each abnormal system component based on the dependency relationship, and determine the common dependent components of multiple abnormal system components based on the dependent components of each abnormal system component.

[0051] In practice, after identifying multiple abnormal system components in the distributed architecture, the dependent components of each abnormal system component are read based on their dependencies, and the common dependent components of the abnormal system components are determined based on the dependent components of each abnormal system component.

[0052] The dependent components in this embodiment include shared resources. That is, step S103 can be replaced by: reading the dependent resources of each abnormal service node based on the dependency relationship, and determining the common dependent resources of the multiple abnormal service nodes based on the dependent resources of each abnormal service node. This replacement, together with the other processing procedures provided in this embodiment, forms a new implementation.

[0053] The dependent components of each abnormal system component include the direct upstream dependent components of each abnormal system component. The common dependent components of the determined multiple abnormal system components can be the common dependent components of one or more abnormal system components. That is, step S103 can be replaced by reading the dependent components of each abnormal system component based on the dependency relationship, and determining the common dependent components based on the dependent components of each abnormal system component, and forming a new implementation method with other processing procedures provided in this embodiment.

[0054] To improve the accuracy of identifying common dependent components, in one optional implementation of this embodiment, common dependent components are obtained by calculating the intersection of the dependent components of each abnormal system component.

[0055] For example, if the abnormal service nodes include service node A, service node B, and service node C, and the dependency relationship shows that service node A, service node B, and service node C all depend on MySQL instance M1, then MySQL instance M1 is identified as a common dependent component.
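The intersection calculation can be sketched as follows, assuming the dependency relationship is available as a mapping from each service node to the set of components it directly depends on. The mapping layout and the names `cache-1` and `mq-1` are illustrative assumptions.

```python
def common_dependencies(abnormal_nodes, dependencies):
    """Intersect the direct dependency sets of all abnormal nodes."""
    dep_sets = [set(dependencies.get(node, ())) for node in abnormal_nodes]
    if not dep_sets:
        return set()
    return set.intersection(*dep_sets)


# Hypothetical dependency relationship: node -> directly depended-on components.
dependencies = {
    "A": {"M1", "cache-1"},
    "B": {"M1", "mq-1"},
    "C": {"M1"},
}
common = common_dependencies(["A", "B", "C"], dependencies)
```

With this data, only MySQL instance M1 is shared by all three abnormal service nodes, which matches the example above.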

[0056] It should be noted that the common dependent component in this embodiment can be one or more, that is, the common dependent component can be replaced by a list of common dependent components. For example, the list of common dependent components is R, R={r1, r2, ..., rn}; where r is a common dependent component.

[0057] Step S104: Perform status indicator detection on common dependent components, obtain common dependent components that fail the status indicator detection as abnormal dependent components, and generate anomaly detection results for the distributed architecture based on the abnormal dependent components.

[0058] In this embodiment, the status indicators may include the component status indicators of commonly dependent components collected by the status indicator collection component.

[0059] In practice, after the common dependent components are identified, status indicator detection is performed on them. Common dependent components that fail the status indicator detection are identified as abnormal dependent components, and the anomaly detection results for the distributed architecture are generated based on these abnormal dependent components.

[0060] In the specific execution process, in order to ensure that the obtained component status indicators of the common dependent components are real-time, in an optional implementation of this embodiment, corresponding second preset conditions can be configured in advance for different common dependent components. During the process of detecting the status indicators of the common dependent components, the component status indicators of the common dependent components collected by the status indicator collection component are first obtained, and then it is detected whether the component status indicators meet the corresponding second preset conditions. If yes, it is determined that the status indicator detection of the common dependent components has passed; if no, it is determined that the status indicator detection of the common dependent components has failed.

[0061] The second preset condition in this embodiment can be that the component status indicator is greater than the corresponding indicator threshold. That is, if the component status indicator of a commonly dependent component is greater than the corresponding indicator threshold, the status indicator detection of the commonly dependent component is determined to have passed; if the component status indicator of a commonly dependent component is less than or equal to the corresponding indicator threshold, the status indicator detection of the commonly dependent component is determined to have failed. Alternatively, the second preset condition can be that the number of MySQL connections is full and the CPU utilization rate is greater than the utilization rate threshold. It should be noted that the above description of the second preset condition is exemplary, and the specific second preset condition can be configured according to the actual scenario. This embodiment does not limit it here.
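As a minimal sketch of this check, the second preset condition can be modeled as a predicate configured per common dependent component. The metric name `indicator`, the threshold 0.8, and the component names are assumptions made only for illustration. Following the description above, a component passes when its condition is met and is otherwise treated as an abnormal dependent component.

```python
def detect_abnormal_dependencies(common, metrics_by_component, conditions):
    """Return the components whose metrics do not meet their configured condition."""
    abnormal = []
    for component in common:
        condition = conditions[component]        # configured in advance
        metrics = metrics_by_component[component]  # from the collection component
        if not condition(metrics):               # detection failed
            abnormal.append(component)
    return abnormal


# Hypothetical configuration: indicator must be greater than a threshold of 0.8.
conditions = {
    "M1": lambda m: m["indicator"] > 0.8,
    "mq-1": lambda m: m["indicator"] > 0.8,
}
metrics_by_component = {
    "M1": {"indicator": 0.35},
    "mq-1": {"indicator": 0.93},
}
abnormal = detect_abnormal_dependencies(["M1", "mq-1"], metrics_by_component, conditions)
```

Because the condition is an ordinary callable, compound conditions over several metrics can be configured in the same way.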

[0062] It should be noted that, when the commonly dependent component is a shared service resource that provides external connection access, the second preset condition for the commonly dependent component may include reaching the maximum number of connections; wherein, the shared service resource that provides external connection access may be a database (e.g., a MySQL instance) and / or a cache. When the commonly dependent component is a shared resource whose core service capabilities are affected by CPU computing power, the second preset condition for the commonly dependent component may include a CPU utilization rate greater than a utilization threshold; wherein, the shared resource whose core service capabilities are affected by CPU computing power may include a MySQL instance, message middleware, and / or a load balancer.

[0063] During the specific execution process, it can be determined whether the number of MySQL connections of the common dependent components collected by the status metric collection component is full. If so, it is determined that the status metric detection of the common dependent components has passed; otherwise, it is determined that the status metric detection has failed. And / or, it can be determined whether the CPU utilization of the common dependent components collected by the status metric collection component is greater than the utilization threshold (95%). If so, it is determined that the status metric detection has passed; otherwise, it is determined that the status metric detection has failed.

[0064] For example, if the common dependent component is MySQL instance M1, the component status indicators of MySQL instance M1 may include the number of connections. During the status indicator detection of MySQL instance M1, it is checked whether the number of connections of MySQL instance M1 is full. If yes, it is determined that the status indicator detection of the common dependent component MySQL instance M1 has passed; otherwise, it is determined that the status indicator detection of the common dependent component MySQL instance M1 has failed.

[0065] For example, if the common dependent component is a message middleware, the component status indicators of the message middleware may include CPU utilization. During the status indicator detection of the message middleware, it is checked whether the CPU utilization of the message middleware is greater than the utilization threshold of 95%. If it is, it is determined that the status indicator detection of the message middleware, as the common dependent component, has passed; otherwise, it is determined that the status indicator detection of the message middleware has failed.

[0066] In addition, the second preset condition for the common dependent component may also include the MySQL connection count being full and the CPU utilization rate being greater than the utilization rate threshold; for example, the common dependent component is MySQL instance M1, and the component status indicators of MySQL instance M1 may include the connection count and CPU utilization rate; during the status indicator detection of MySQL instance M1, it is detected whether the connection count of MySQL instance M1 is full and whether the CPU utilization rate is greater than the utilization rate threshold of 95%. If yes, it is determined that the status indicator detection of the common dependent component MySQL instance M1 has passed; if no, it is determined that the status indicator detection of the common dependent component MySQL instance M1 has failed.

[0067] In the specific execution process of detecting the status indicators of common dependent components and identifying common dependent components that fail the status indicator detection as abnormal dependent components, if the number of connections of a common dependent component is not full and / or the CPU utilization is less than or equal to the utilization threshold, then the common dependent component is regarded as an abnormal dependent component.

[0068] It should be noted that the above descriptions of detecting the status indicators of the MySQL instance by checking whether the number of connections of the MySQL instance is full when the common dependent component is a MySQL instance, and detecting the status indicators of the message middleware by checking whether the CPU utilization of the message middleware is greater than the utilization threshold when the common dependent component is a message middleware, are merely illustrative. In the actual execution process, the common dependent component can also be other shared system components, such as service nodes. The corresponding component status indicators and the second preset conditions can be configured according to actual needs, and this embodiment does not limit them here.

[0069] In practical implementation, to improve the efficiency of status indicator detection, in this embodiment, when a list of common dependent components or multiple common dependent components are obtained, the common dependent components can first be sorted in descending order according to the number of abnormal system components corresponding to each common dependent component to obtain the order information of each common dependent component. Then, status indicator detection is performed on the common dependent components according to the order information of each common dependent component. The number of abnormal system components corresponding to a common dependent component includes the number of system components that depend on the common dependent component and have experienced anomalies.
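The ordering step can be sketched as follows, again assuming a mapping from each service node to its dependent components. Breaking ties by component name is an assumption of this sketch, added only to make the order deterministic.

```python
def order_for_detection(common, abnormal_nodes, dependencies):
    """Sort common dependent components by how many abnormal nodes depend on them."""
    counts = {
        dep: sum(1 for node in abnormal_nodes if dep in dependencies.get(node, ()))
        for dep in common
    }
    # Descending by count; ties are broken by name (an assumption of this sketch).
    return sorted(common, key=lambda dep: (-counts[dep], dep))


# Hypothetical data: M1 is used by three abnormal nodes, cache-1 by two.
dependencies = {
    "A": {"M1", "cache-1"},
    "B": {"M1", "cache-1"},
    "C": {"M1"},
}
ordered = order_for_detection(["cache-1", "M1"], ["A", "B", "C"], dependencies)
```

Checking the most widely shared component first means the most likely cause is examined earliest.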

[0070] In specific implementation, based on the status indicator detection of commonly dependent components, commonly dependent components that fail the status indicator detection are identified as abnormal dependent components, and anomaly detection results for the distributed architecture are generated based on these abnormal dependent components. In an optional implementation provided in this embodiment, to improve the comprehensiveness of the generated anomaly detection results, during the process of generating anomaly detection results for the distributed architecture based on abnormal dependent components, the target abnormal event corresponding to the abnormal dependent component is first obtained from the abnormal event set, then the time information of the target abnormal event is read, and finally, anomaly detection results containing at least the time information and component information of the abnormal dependent component are generated.
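One possible shape for such a result is sketched below. How a target abnormal event is matched to an abnormal dependent component is not fixed by the description above; this sketch assumes that an event corresponds to the component when the service that raised it depends on that component. The dictionary keys are hypothetical.

```python
def build_detection_result(abnormal_dependency, event_set, dependencies):
    """Collect time and component information for one abnormal dependent component."""
    # Assumption: an event is a target event if its service depends on the component.
    target_events = [
        event for event in event_set
        if abnormal_dependency in dependencies.get(event["service"], ())
    ]
    times = [event["time"] for event in target_events]
    return {
        "abnormal_dependency": abnormal_dependency,
        "first_seen": min(times, default=None),
        "last_seen": max(times, default=None),
        "affected_services": sorted({event["service"] for event in target_events}),
    }


dependencies = {"A": {"M1"}, "B": {"M1"}, "D": {"cache-1"}}
event_set = [
    {"service": "A", "time": 100},
    {"service": "B", "time": 105},
    {"service": "D", "time": 110},
]
result = build_detection_result("M1", event_set, dependencies)
```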

[0071] To further enhance user awareness and comprehensiveness of anomaly detection results, the results may also include component information of the abnormal system components and error logs corresponding to commonly dependent components. These error logs can be collected through a log system. This embodiment does not limit the data included in the anomaly detection results.

[0072] The above describes the anomaly detection process in detail. It should be noted that this embodiment organically integrates logs, monitoring, dependencies, and status indicators. Through multi-dimensional cross-validation, it significantly improves the accuracy and reliability of anomaly cause localization, guides the detection direction to commonly dependent components, and improves the efficiency of anomaly detection.

[0073] Furthermore, this embodiment can also predict potentially affected system components by analyzing the status indicators of commonly dependent components, thus achieving preventative anomaly avoidance. Specifically, the target commonly dependent component is obtained based on the dependency relationship, and its component status indicators are acquired. A trend analysis of these indicators is generated. If the trend is downward, the downstream system components of the target commonly dependent component are read, and a fault prediction alert is generated and sent to the user based on these downstream system components. The target commonly dependent component includes system components configured with downstream system components.
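A minimal sketch of this prediction step follows. The description does not specify how the trend is computed; a least-squares slope over equally spaced samples is used here purely as an assumption, and the alert structure and names are hypothetical.

```python
def slope(values):
    """Least-squares slope of equally spaced samples (0.0 if fewer than two)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def predict_affected(component, history, downstream):
    """Return a fault prediction alert if the indicator trends downward, else None."""
    if slope(history) < 0:
        return {"component": component, "at_risk": sorted(downstream.get(component, ()))}
    return None


# Hypothetical data: services A and B are downstream of M1.
downstream = {"M1": {"A", "B"}}
alert = predict_affected("M1", [0.9, 0.8, 0.6, 0.5], downstream)
no_alert = predict_affected("M1", [0.5, 0.5, 0.5], downstream)
```

A production system would likely smooth the series or require the decline to exceed a tolerance before alerting; that tuning is outside what the description specifies.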

[0074] It should be noted that the above-mentioned anomaly detection and handling methods can be integrated as a core functional module into application performance monitoring, observability platforms, and / or intelligent operation and maintenance products. This will provide efficient anomaly detection for complex distributed architectures, significantly reduce anomaly repair time, and improve the efficiency of anomaly detection and repair.

[0075] The anomaly detection and processing method provided in this embodiment of the invention is applicable to the implementation environment of an anomaly detection system shown in Figure 2. The implementation environment includes at least an anomaly detection engine 201, which includes an abnormal event set acquisition module 201-1, an anomaly detection module 201-2, and an anomaly detection result generation module 201-3. The abnormal event set acquisition module 201-1 is used to acquire the abnormal event set. The anomaly detection module 201-2 is used to determine multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set, read the dependent components of each abnormal system component based on the dependency relationship, determine the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component, perform status indicator detection on the common dependent components, and identify the common dependent components that fail the status indicator detection as abnormal dependent components. The anomaly detection result generation module 201-3 is used to generate the anomaly detection result of the distributed architecture based on the abnormal dependent components. In addition, the anomaly detection engine 201 may also include an abnormal event set generation module 201-4, which is used to determine an abnormal event set from multiple abnormal events under the distributed architecture.

[0076] In addition, the implementation environment may also include a data acquisition layer 202, which includes a monitoring system component 202-1, a log system component 202-2, a status indicator collection component 202-3, and a dependency management system component 202-4. The monitoring system component 202-1 is used to collect business indicators and system indicators of each service node as system service data; the log system component 202-2 is used to centrally collect and index the application logs and error logs of all system components as system service data; the status indicator collection component 202-3 can be deployed on shared resources to collect the health status of shared resources in real time as system service data; and the dependency management system component 202-4 is used to store dependency relationships.

[0077] In this implementation environment, after the abnormal event set generation module 201-4 obtains the system service data collected by the monitoring system component 202-1, the log system component 202-2, and the status indicator collection component 202-3, it first generates multiple abnormal events under the distributed architecture based on the system service data. It then builds, from these abnormal events, an abnormal event set composed of highly correlated abnormal events. The abnormal event set acquisition module 201-1 obtains the abnormal event set generated by the abnormal event set generation module 201-4 and sends it to the anomaly detection module 201-2. The anomaly detection module 201-2 performs anomaly detection based on the abnormal event set and the dependency relationship stored by the dependency management system component 202-4, and identifies the common dependent components that fail the status indicator detection as abnormal dependent components. The anomaly detection result generation module 201-3 generates anomaly detection results based on the abnormal dependent components.

[0078] The following example illustrates the application of the anomaly detection and processing method provided in this embodiment to an anomaly detection scenario in a distributed architecture. As shown in Figure 3, the method applied to this scenario specifically includes the following steps.

[0079] Step S301: Obtain multiple abnormal events under the distributed architecture, and cluster the multiple abnormal events according to the time window-based clustering algorithm to obtain at least one candidate abnormal event set.

[0080] Step S302: Perform event correlation detection on at least one candidate abnormal event set, and determine the candidate abnormal event set that passes the event correlation detection as the abnormal event set.

[0081] Step S303: Based on the abnormal events in the abnormal event set, identify multiple abnormal service nodes in the distributed architecture.

[0082] Step S304: Obtain the dependency relationships between system components under the distributed architecture, and read the dependent resources of each abnormal service node based on the dependency relationships.

[0083] Step S305: Calculate the intersection of the dependent resources of each abnormal service node to obtain the common dependent resources.

[0084] Step S306: Perform status indicator detection on common dependent resources to identify abnormal dependent resources that fail the detection.

[0085] Step S307: Generate anomaly detection results for the distributed architecture based on abnormal dependent resources.

[0086] It should be noted that any one or more of steps S301 to S307 can be combined with any one or more of steps S101 to S104 to form a new implementation method according to the needs of implementation and deployment. In addition, according to the actual deployment needs, any one or more technical features in steps S301 to S307 can be selected and combined with any one or more technical features provided in steps S101 to S104 to form a new implementation method. Alternatively, any one or more technical features in steps S301 to S307 can be replaced with any one or more technical features provided in steps S101 to S104 to form a new implementation method according to the actual deployment needs. These will not be elaborated on here.

[0087] Based on the same inventive concept, as an implementation of the above method, this embodiment of the invention also provides an anomaly detection and processing device. This embodiment corresponds to the aforementioned method embodiment. For ease of reading, this embodiment will not repeat the details of the aforementioned method embodiment one by one, but it should be clear that the anomaly detection and processing device in this embodiment can implement all the contents of the aforementioned method embodiment.

[0088] This invention provides an anomaly detection and processing device. Figure 4 is a schematic diagram of the anomaly detection and processing device. As shown in Figure 4, the device includes: an acquisition unit 401, used to acquire the abnormal event set under the distributed architecture, as well as the dependency relationships between system components under the distributed architecture; and a processing unit 402, used to determine multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set. The processing unit 402 is further configured to read the dependent components of each abnormal system component based on the dependency relationship, and determine the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component. The processing unit 402 is further configured to perform status indicator detection on the common dependent components, identify the common dependent components that fail the status indicator detection as abnormal dependent components, and generate the anomaly detection result of the distributed architecture based on the abnormal dependent components.

[0089] As an optional implementation of this invention, the acquisition unit 401 is specifically used to: obtain multiple abnormal events under the distributed architecture, and sort the multiple abnormal events according to the time information of each abnormal event to obtain an abnormal event list; divide the abnormal events in the abnormal event list according to a preset time window to obtain at least one candidate abnormal event set; and perform event association detection on each of the at least one candidate abnormal event set, and determine the candidate abnormal event set that passes the detection as the abnormal event set.
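The sorting and time-window division can be sketched as follows. The description does not state how window boundaries are placed; this sketch assumes fixed-width windows, each opened by the first event that falls outside the previous window. The event layout and the 60-second window are illustrative.

```python
def split_by_time_window(events, window_seconds):
    """Sort events by time and group them into fixed-width candidate sets."""
    ordered = sorted(events, key=lambda event: event["time"])
    candidate_sets = []
    window_start = None
    for event in ordered:
        # Open a new window for the first event, or once the current one has elapsed.
        if window_start is None or event["time"] - window_start >= window_seconds:
            candidate_sets.append([])
            window_start = event["time"]
        candidate_sets[-1].append(event)
    return candidate_sets


events = [{"time": t} for t in (100, 130, 20, 10, 135)]
candidate_sets = split_by_time_window(events, window_seconds=60)
grouped_times = [[event["time"] for event in group] for group in candidate_sets]
```

Each resulting group is a candidate abnormal event set that would then go through event association detection.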

[0090] As an optional implementation of this invention, the processing unit 402 performs event association detection on a first candidate abnormal event set, which is any one of the at least one candidate abnormal event set, in the following manner: calculating the event similarity between any two abnormal events in the first candidate abnormal event set to obtain multiple event similarities; counting the event similarities that are greater than a similarity threshold; and, if this count meets the first preset condition, determining that the first candidate abnormal event set passes the event association detection.
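The pairwise check can be sketched as follows. The similarity measure and the first preset condition are not specified in this passage, so both are assumptions here: Jaccard similarity over message tokens, and a condition requiring that at least half of all pairs exceed the similarity threshold.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity between two messages (assumed measure)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def passes_association_check(messages, similarity_threshold=0.5, min_ratio=0.5):
    """Count pairs above the threshold and compare against the assumed condition."""
    pairs = list(combinations(messages, 2))
    if not pairs:
        return False
    similar = sum(1 for a, b in pairs if jaccard(a, b) > similarity_threshold)
    # Assumed first preset condition: at least min_ratio of all pairs are similar.
    return similar >= min_ratio * len(pairs)


related = passes_association_check([
    "db connection timeout",
    "db connection timeout on M1",
    "db connection timeout",
])
unrelated = passes_association_check([
    "db connection timeout",
    "disk full",
    "cache miss storm",
])
```

Since every pair is compared, the cost grows quadratically with the size of the candidate set, which is one reason to bound the set with a time window first.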

[0091] As an optional implementation of this invention, when the processing unit 402 determines multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set, it is specifically used to: obtain the initial abnormal system component corresponding to each abnormal event in the abnormal event set; and deduplicate the initial abnormal system components to obtain the multiple abnormal system components.

[0092] As an optional implementation of this invention, when the processing unit 402 determines the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component, it is specifically used to: calculate the intersection of the dependent components of each abnormal system component to obtain the common dependent components of the multiple abnormal system components.

[0093] As an optional implementation of this invention, when the processing unit 402 performs status indicator detection on the common dependent components, it is specifically used to: obtain the component status indicators of the common dependent components collected by the status indicator collection component, and check whether the component status indicators meet the corresponding second preset conditions; if yes, determine that the status indicator detection of the common dependent component has passed; if no, determine that the status indicator detection of the common dependent component has failed.

[0094] As an optional implementation of this invention, the system components include at least one of the following: service node, database, cache, message middleware, and load balancer.

[0095] As an optional implementation of this invention, when the processing unit 402 generates the anomaly detection result of the distributed architecture based on the abnormal dependent components, it is specifically used to: obtain the target abnormal event corresponding to the abnormal dependent component from the abnormal event set, and read the time information of the target abnormal event; and generate an anomaly detection result that includes at least the time information and the component information of the abnormal dependent component.

[0096] The anomaly detection and processing device provided in this embodiment of the invention can execute the anomaly detection and processing method provided in any of the above embodiments. Its implementation principle and technical effect are similar, and will not be described again here.

[0097] This invention also provides an electronic device. As shown in Figure 5, the electronic device includes a processor 501, a communication interface 502, a memory 503, and a communication bus 504, wherein the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504. The memory 503 is used to store computer programs. When the processor 501 executes the program stored in the memory 503, it performs the following steps: obtaining the abnormal event set under the distributed architecture, as well as the dependency relationships between system components under the distributed architecture; determining multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set; reading the dependent components of each abnormal system component based on the dependency relationships, and determining the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component; and performing status indicator detection on the common dependent components, identifying the common dependent components that fail the status indicator detection as abnormal dependent components, and generating the anomaly detection results of the distributed architecture based on the abnormal dependent components.

[0098] The communication bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0099] The communication interface is used for communication between the aforementioned electronic device and other devices.

[0100] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0101] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0102] The electronic device provided in the embodiments of the present invention can execute the anomaly detection and processing method provided in any of the above embodiments. Its implementation principle and technical effect are similar, and will not be described again here.

[0103] In another embodiment of the present invention, a computer-readable storage medium is also provided, which stores instructions that, when executed on a computer, cause the computer to perform any of the anomaly detection processing methods described in the above embodiments.

[0104] In another embodiment of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the anomaly detection processing methods described in the above embodiments.

[0105] The above embodiments can be implemented, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).

[0106] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0107] The embodiments in this specification are described in a related manner, and identical or similar parts of the embodiments may be cross-referenced. Each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the descriptions of the method embodiments.

[0108] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.

Claims

1. An anomaly detection and processing method, characterized in that it includes: obtaining the abnormal event set under a distributed architecture, as well as the dependency relationships between system components under the distributed architecture; determining multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set; reading the dependent components of each abnormal system component based on the dependency relationships, and determining the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component; and performing status indicator detection on the common dependent components, identifying the common dependent components that fail the status indicator detection as abnormal dependent components, and generating anomaly detection results of the distributed architecture based on the abnormal dependent components.

2. The method according to claim 1, characterized in that obtaining the set of abnormal events under the distributed architecture comprises: obtaining multiple abnormal events under the distributed architecture, and sorting the multiple abnormal events according to the time information of each abnormal event to obtain an abnormal event list; dividing the abnormal events in the abnormal event list according to a preset time window to obtain at least one candidate abnormal event set; and performing event association detection on each of the at least one candidate abnormal event set, and determining the candidate abnormal event set that passes the detection as the abnormal event set.
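Illustrative note on claim 2 (not part of the claim): the sorting and windowing steps might be sketched as follows. This is a minimal sketch under stated assumptions. The `AbnormalEvent` fields, the function names, and the choice to start each window at the first event it contains are all hypothetical; the claim only requires that events be sorted by time and divided by a preset time window. The association check is passed in as a callable because its details belong to claim 3.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AbnormalEvent:
    component: str    # hypothetical field: component that reported the event
    timestamp: float  # hypothetical field: event time in seconds
    message: str = ""


def split_by_time_window(events: List[AbnormalEvent],
                         window_seconds: float) -> List[List[AbnormalEvent]]:
    """Sort events by time, then cut the sorted list into candidate sets.

    Assumption: a new window opens at the first event that falls outside
    the current window, so empty windows are never produced.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    candidate_sets: List[List[AbnormalEvent]] = []
    window_start = None
    for event in ordered:
        if window_start is None or event.timestamp - window_start >= window_seconds:
            candidate_sets.append([])
            window_start = event.timestamp
        candidate_sets[-1].append(event)
    return candidate_sets


def select_event_sets(events: List[AbnormalEvent],
                      window_seconds: float,
                      passes_association: Callable[[List[AbnormalEvent]], bool]
                      ) -> List[List[AbnormalEvent]]:
    """Keep only the candidate sets that pass event association detection."""
    return [s for s in split_by_time_window(events, window_seconds)
            if passes_association(s)]
```

A fixed grid of windows (for example, aligned to wall-clock minutes) would satisfy the same claim language; anchoring on the first event is simply one way to avoid splitting a burst that straddles a grid boundary.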

3. The method according to claim 2, characterized in that event association detection on a first candidate abnormal event set of the at least one candidate abnormal event set is performed as follows: calculating the event similarity between every two abnormal events in the first candidate abnormal event set to obtain multiple event similarities, the first candidate abnormal event set being any one of the at least one candidate abnormal event set; counting the number of event similarities, among the multiple event similarities, that are greater than a similarity threshold; and if that number meets a first preset condition, determining that the first candidate abnormal event set passes the event association detection.
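Illustrative note on claim 3 (not part of the claim): the pairwise similarity count might look like the sketch below. The claim does not name a similarity measure or state the first preset condition, so both are assumptions here: similarity is taken to be Jaccard overlap of the whitespace-separated tokens of hypothetical event messages, and the condition is taken to be "at least `min_similar_pairs` pairs exceed the threshold".

```python
from itertools import combinations
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for any event similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def passes_association(messages: Sequence[str],
                       similarity_threshold: float,
                       min_similar_pairs: int,
                       similarity: Callable[[str, str], float] = jaccard) -> bool:
    """Count the pairs whose similarity exceeds the threshold and compare
    that count against the (assumed) first preset condition."""
    similar = sum(
        1
        for a, b in combinations(messages, 2)
        if similarity(a, b) > similarity_threshold
    )
    return similar >= min_similar_pairs
```

Because the number of pairs grows quadratically with the set size, a ratio of similar pairs to all pairs may be a more stable condition than an absolute count when window sizes vary; either reading fits the claim wording.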

4. The method according to claim 1, characterized in that determining multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set comprises: obtaining the initial abnormal system component corresponding to each abnormal event in the abnormal event set; and deduplicating the initial abnormal system components to obtain the multiple abnormal system components.
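Illustrative note on claim 4 (not part of the claim): a minimal sketch of the deduplication step, assuming each abnormal event is represented as a hypothetical `(component, message)` tuple. First-seen order is preserved here only for readability of the output; the claim does not require any particular order.

```python
from typing import Iterable, List, Tuple


def abnormal_components(events: Iterable[Tuple[str, str]]) -> List[str]:
    """Map each event to its reporting component, then drop duplicates."""
    initial = [component for component, _message in events]
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(initial))
```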

5. The method according to claim 1, characterized in that determining the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component comprises: computing the intersection of the dependent components of each abnormal system component to obtain the common dependent components of the multiple abnormal system components.
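Illustrative note on claim 5 (not part of the claim): the intersection step can be sketched with plain sets. The dependency relationships are assumed to be available as a dictionary from component name to the names of its direct dependencies; that representation, and all component names in the usage example, are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Set


def common_dependencies(abnormal: Sequence[str],
                        dependencies: Dict[str, Iterable[str]]) -> Set[str]:
    """Intersect the dependency sets of all abnormal components.

    Assumption: a component missing from `dependencies` has no known
    dependencies, which makes the intersection empty.
    """
    dep_sets = [set(dependencies.get(c, ())) for c in abnormal]
    if not dep_sets:
        return set()
    return set.intersection(*dep_sets)


deps = {
    "order-service": ["mysql-main", "redis-cache"],
    "pay-service": ["mysql-main", "kafka"],
    "user-service": ["redis-cache"],
}
shared = common_dependencies(["order-service", "pay-service"], deps)
```

In this example `shared` contains only `mysql-main`, the one component that both abnormal services depend on.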

6. The method according to claim 1, characterized in that performing status indicator detection on the common dependent components comprises: collecting the component status indicators of the common dependent components through a status indicator collection component, and checking whether the component status indicators meet the corresponding second preset conditions; if so, determining that the common dependent component passes the status indicator detection; if not, determining that the common dependent component fails the status indicator detection.
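Illustrative note on claim 6 (not part of the claim): a minimal sketch of the status indicator check. The status indicator collection component is modeled as a hypothetical `collect` callable returning a dictionary of indicator values, and each second preset condition is modeled as a predicate on one indicator. The indicator names and the decision to treat a missing indicator as a failure are assumptions, not something the claim specifies.

```python
from typing import Callable, Dict, Iterable, List

Metrics = Dict[str, float]
Rule = Callable[[float], bool]


def find_abnormal_dependencies(common: Iterable[str],
                               collect: Callable[[str], Metrics],
                               rules: Dict[str, Rule]) -> List[str]:
    """Return the common dependent components that fail any rule."""
    abnormal = []
    for component in common:
        metrics = collect(component)
        # Assumption: an indicator that could not be collected counts as failing.
        passed = all(
            name in metrics and rule(metrics[name])
            for name, rule in rules.items()
        )
        if not passed:
            abnormal.append(component)
    return abnormal
```

In practice `collect` would wrap whatever monitoring backend is in use; keeping it as an injected callable lets the check be exercised with canned data.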

7. The method according to claim 1, characterized in that the system components include at least one of the following: a service node, a database, a cache, message middleware, and a load balancer.

8. The method according to claim 1, characterized in that generating the anomaly detection result of the distributed architecture based on the abnormal dependent components comprises: obtaining, from the abnormal event set, the target abnormal event corresponding to the abnormal dependent component, and reading the time information of the target abnormal event; and generating an anomaly detection result that includes at least the time information and the component information of the abnormal dependent component.
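Illustrative note on claim 8 (not part of the claim): one possible shape for the result-generation step. The claim does not define how a target abnormal event "corresponds" to an abnormal dependent component; this sketch assumes an event corresponds when the component that reported it depends directly on the abnormal dependent component. Events are hypothetical `(component, timestamp)` tuples and the result layout is likewise an assumption.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def build_results(abnormal_deps: Sequence[str],
                  events: Sequence[Tuple[str, float]],
                  dependencies: Dict[str, Iterable[str]]) -> List[dict]:
    """Attach the times of the corresponding events to each abnormal dependency."""
    results = []
    for dep in abnormal_deps:
        times = sorted(
            timestamp
            for component, timestamp in events
            if dep in dependencies.get(component, ())
        )
        results.append({"component": dep, "event_times": times})
    return results
```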

9. An anomaly detection and processing device, characterized in that it comprises: an acquisition unit, configured to acquire a set of abnormal events under a distributed architecture, and the dependency relationships between system components under the distributed architecture; and a processing unit, configured to determine multiple abnormal system components of the distributed architecture based on the abnormal events in the abnormal event set; the processing unit being further configured to read the dependent components of each abnormal system component based on the dependency relationships, and to determine the common dependent components of the multiple abnormal system components based on the dependent components of each abnormal system component; the processing unit being further configured to perform status indicator detection on the common dependent components, to identify the common dependent components that fail the status indicator detection as abnormal dependent components, and to generate an anomaly detection result of the distributed architecture based on the abnormal dependent components.

10. An electronic device, characterized in that it comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the steps of the method according to any one of claims 1-8 when executing the program stored in the memory.