An abnormality delay tracing method, device and electronic equipment
By acquiring logs and device aggregation results from the latency anomaly interface, and utilizing multi-dimensional attribution criteria to automatically locate the root cause of latency anomalies, the inefficiency of existing technologies is solved, and rapid fault recovery is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING YOUTEJIE INFORMATION TECH
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-26
AI Technical Summary
In distributed systems, response latency issues occur frequently. Existing technologies rely on manual troubleshooting, which is inefficient, time-consuming, and makes it difficult to quickly locate the root cause, thus affecting the speed of fault recovery.
By acquiring the target logs and device aggregation results of the latency anomaly interface, generating hypothetical root causes using multi-dimensional attribution criteria, and determining the actual root causes based on the scores, automated latency anomaly tracing is achieved.
It improves the efficiency of tracing the source of delay anomalies, increases the speed of fault recovery, reduces reliance on experience, and shortens the troubleshooting time.
Figure CN122285348A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent operation and maintenance, and in particular to a method, apparatus and electronic device for tracing the source of delay anomalies.
Background Technology
[0002] With the widespread deployment of distributed systems, microservice architectures, and cloud-native applications, system call chains are becoming increasingly complex, service call layers are deepening, and the number of dependent components is growing. In high-concurrency, high-traffic scenarios, response latency anomalies occur frequently, seriously affecting system stability and user experience.
[0003] Currently, maintenance personnel typically need to manually connect multiple isolated subsystems to locate the root cause of latency anomalies. The troubleshooting process is therefore highly dependent on experience, inefficient, and time-consuming, which makes it difficult to locate the root cause quickly and seriously slows fault recovery.
Summary of the Invention
[0004] This invention provides a method, apparatus, and electronic device for tracing delayed anomalies, which improves the efficiency of tracing delayed anomalies and increases the speed of fault recovery.
[0005] In a first aspect, embodiments of the present invention provide a method for tracing the source of delayed anomalies, including:
When it is determined that there is a latency anomaly in the target system, obtain the latency anomaly service, the latency anomaly interface under the latency anomaly service, and the target time window in which the latency anomaly interface occurs;
Obtain the target log aggregation results and target device aggregation results for the latency anomaly interface within the target time window;
Based on the target log aggregation results and target device aggregation results, determine the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the latency anomaly interface;
Based on each attribution basis, generate multiple hypothetical root causes leading to the latency anomaly, and based on each attribution basis, determine the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause;
Based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause, determine the actual root cause causing the latency anomaly among the hypothetical root causes.
[0006] In a second aspect, embodiments of the present invention also provide a delayed anomaly tracing device, comprising:
an anomaly information acquisition module, used to acquire the latency anomaly service, the latency anomaly interface under the latency anomaly service, and the target time window in which the latency anomaly interface occurs, when it is determined that there is a latency anomaly in the target system;
a target aggregation result acquisition module, used to acquire the target log aggregation result and target device aggregation result of the latency anomaly interface within the target time window;
an attribution basis determination module, used to determine the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the latency anomaly interface based on the target log aggregation result and target device aggregation result;
a multidimensional score determination module, used to generate multiple hypothetical root causes of the latency anomaly based on each attribution basis, and to determine the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause based on each attribution basis;
an actual root cause determination module, used to determine the actual root cause causing the latency anomaly among the hypothetical root causes based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause.
[0007] In a third aspect, embodiments of the present invention also provide an electronic device, the electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to execute the delay anomaly tracing method provided in any embodiment of the present invention.
[0008] The technical solution of this invention, when a latency anomaly is determined in the target system, acquires the target log aggregation result and target device aggregation result of the latency anomaly interface within a target time window; based on the target log aggregation result and target device aggregation result, determines the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the latency anomaly interface; based on each attribution basis, determines the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause; based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause, determines the actual root cause causing the latency anomaly among each hypothetical root cause. This technical means solves the problem that existing technologies rely heavily on experience when investigating root causes, resulting in low efficiency, long processing time, and difficulty in quickly locating root causes, thus seriously affecting the speed of fault recovery. This improves the efficiency of latency anomaly tracing and enhances the speed of fault recovery.
[0009] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.
Attached Figure Description
[0010] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0011] Figure 1 is a flowchart of a delayed anomaly tracing method provided in Embodiment 1 of the present invention;
Figure 2 is a flowchart of another delayed anomaly tracing method provided in Embodiment 2 of the present invention;
Figure 3 is a schematic diagram of a delayed anomaly tracing device provided according to Embodiment 3 of the present invention;
Figure 4 is a schematic diagram of the structure of an electronic device provided in Embodiment 4 of the present invention.
Detailed Implementation
[0012] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0013] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0014] Embodiment 1
Figure 1 is a flowchart of a delayed anomaly tracing method according to Embodiment 1 of the present invention. This embodiment is applicable to root cause localization of delayed anomalies. The method can be executed by a delayed anomaly tracing device, which can be implemented in hardware and/or software and can be configured in an electronic device such as a computer.
[0015] As shown in Figure 1, the method for tracing the source of delayed anomalies in this embodiment includes:
S110. When it is determined that there is a delay anomaly in the target system, obtain the delay anomaly service, the delay anomaly interface under the delay anomaly service, and the target time window in which the delay anomaly interface occurs.
[0016] In this embodiment, the target system can be understood as a system that needs to perform latency anomaly detection and root cause localization and maintenance processing on the detected latency anomalies. A latency anomaly can be understood as an anomaly where the response time exceeds a normal threshold. The target time window can be understood as the time window during which the target system experiences a latency anomaly. The latency anomaly service can be understood as the service in the target system where a latency anomaly occurs. The latency anomaly interface can be understood as the interface within the latency anomaly service where a latency anomaly exists.
[0017] In this step, specifically, the target system is determined to have a latency anomaly when an alarm message indicating such an anomaly is received. The alarm message is then parsed to obtain the latency anomaly service, the latency anomaly interface, and the target time window in which the anomaly occurred on that service and interface.
[0018] S120. Obtain the target log aggregation result and target device aggregation result of the delay exception interface within the target time window.
[0019] In this embodiment, the target log aggregation result can be understood as the log aggregation result corresponding to the latency anomaly interface and the target time window. The target device aggregation result can be understood as the device aggregation result corresponding to the target time window.
[0020] In this step, specifically, the target log aggregation results for the delayed exception interface within the target time window can be obtained from the log aggregation results corresponding to the target system. The log aggregation results can be understood as the results obtained by aggregating the original call logs corresponding to the target system according to time window, service, interface, instance, and status code.
[0021] Simultaneously, the target device aggregation results for the latency anomaly interface within the target time window can be obtained from the aggregation results of each device corresponding to the target system. The device aggregation results can be understood as the results of aggregating the various physical devices supporting the operation of the target system according to the time window.
[0022] S130. Based on the target log aggregation results and target device aggregation results, determine the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the latency anomaly interface.
[0023] In this embodiment, the instance-level attribution basis can be understood as the feature data used as the dividing perspective when tracing the source of latency anomalies. It can be used to attribute latency anomalies to a specific instance that provides the latency anomaly interface. The device-level attribution basis can be understood as the feature data used as the dividing perspective when tracing the source of latency anomalies. It can be used to attribute latency anomalies to a specific physical device associated with the latency anomaly interface.
[0024] Topology-based attribution can be understood as the characteristic data used to divide latency anomalies from the perspective of service call topology, when tracing the source of latency anomalies. It can be used to attribute latency anomalies to specific services or instances. Change-based attribution can be understood as the characteristic data used to divide latency anomalies from the perspective of change, when tracing the source of latency anomalies. It can be used to attribute latency anomalies to specific change events.
[0025] In this step, specifically, based on the target log aggregation results, the instance-level and topology-level attribution criteria corresponding to the latency anomaly interface can be determined. Based on the target log aggregation results and target device aggregation results, the device-level and change-level attribution criteria corresponding to the latency anomaly interface can be determined.
[0026] S140. Based on each attribution basis, generate multiple hypothetical root causes that lead to latency anomalies, and based on each attribution basis, determine the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause.
[0027] Here, the hypothetical root cause can be understood as a possible root cause that has not yet been verified and is proposed during the process of tracing the source of delayed anomalies.
[0028] In this step, specifically, multiple hypothetical root causes of latency anomalies can be generated based on the type and specific content of each attribution basis. These hypothetical root causes can be of various types, such as those caused by problems with the instance itself, those caused by physical device resource contention, those caused by upstream service issues, those caused by network problems, and those caused by change events.
[0029] After generating multiple root cause hypotheses, the instance dimension score corresponding to each root cause can be determined based on the instance dimension attribution criteria corresponding to the latency anomaly interface and the root cause description of each root cause. The device dimension score corresponding to each root cause can be determined based on the device dimension attribution criteria corresponding to the latency anomaly interface and the root cause description of each root cause. The topology dimension score corresponding to each root cause can be determined based on the topology dimension attribution criteria corresponding to the latency anomaly interface and the root cause description of each root cause. The change dimension score corresponding to each root cause can be determined based on the change dimension attribution criteria corresponding to the latency anomaly interface and the root cause description of each root cause.
[0030] S150. Based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause, determine the actual root cause causing the latency anomaly among each hypothetical root cause.
[0031] In this embodiment, the actual root cause can be understood as the real underlying reason for the delay anomaly.
[0032] In this step, specifically, a comprehensive score can be determined based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause. Then, based on the comprehensive score corresponding to each hypothetical root cause, the actual root cause causing the latency anomaly can be determined from each hypothetical root cause.
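As an illustration of this step, the following is a minimal Python sketch. The text does not state how the four dimension scores are combined, so the weighted sum with equal weights is an assumption, as are the names `HypotheticalRootCause`, `comprehensive_score`, and `pick_actual_root_cause`, and all score values in the example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class HypotheticalRootCause:
    """One unverified candidate root cause with its four dimension scores."""
    description: str
    instance_score: float
    device_score: float
    topology_score: float
    change_score: float


def comprehensive_score(
    hypothesis: HypotheticalRootCause,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    # Assumption: the comprehensive score is a weighted sum of the four
    # dimension scores; the source does not specify the combination rule.
    scores = (
        hypothesis.instance_score,
        hypothesis.device_score,
        hypothesis.topology_score,
        hypothesis.change_score,
    )
    return sum(w * s for w, s in zip(weights, scores))


def pick_actual_root_cause(
    hypotheses: Sequence[HypotheticalRootCause],
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> HypotheticalRootCause:
    # The hypothesis with the highest comprehensive score is taken as the
    # actual root cause.
    if not hypotheses:
        raise ValueError("at least one hypothetical root cause is required")
    return max(hypotheses, key=lambda h: comprehensive_score(h, weights))


# Hypothetical scores, for illustration only.
candidates = [
    HypotheticalRootCause("instance order-pod-a1 is faulty", 0.9, 0.2, 0.1, 0.0),
    HypotheticalRootCause("physical device resource contention", 0.3, 0.6, 0.1, 0.0),
    HypotheticalRootCause("recent change event", 0.1, 0.1, 0.1, 0.4),
]
actual = pick_actual_root_cause(candidates)
print(actual.description)
```

Passing different `weights` would let one dimension count more than another; whether the method weights the dimensions at all is not stated in the text.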
[0033] The technical solution of this embodiment, when a latency anomaly is determined in the target system, obtains the target time window, latency anomaly service, and latency anomaly interface under the latency anomaly service corresponding to the latency anomaly; obtains the target log aggregation result and target device aggregation result of the latency anomaly interface under the target time window; determines the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the latency anomaly interface based on the target log aggregation result and target device aggregation result; generates multiple hypothetical root causes causing the latency anomaly based on each attribution basis, and determines the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause based on each attribution basis; and determines the actual root cause causing the latency anomaly among each hypothetical root cause based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause. This technical means solves the problem that the existing technology relies heavily on experience when investigating root causes, resulting in low efficiency, long time consumption, and difficulty in quickly locating root causes, which seriously affects the speed of fault recovery. This improves the efficiency of latency anomaly tracing and enhances the speed of fault recovery.
[0034] Embodiment 2
Figure 2 is a flowchart of another delayed anomaly tracing method provided by Embodiment 2 of the present invention. This embodiment is a further optimization and extension based on the above embodiments and can be combined with various optional technical solutions in the above embodiments.
[0035] As shown in Figure 2, the method for tracing the source of delayed anomalies in this embodiment includes:
S210. From the log aggregation results corresponding to the target system, obtain each interface corresponding to the target system, each instance under each interface, and the instance latency time of each instance.
[0036] In this step, specifically, the following information can be obtained from the log aggregation results corresponding to the target system: the services included in the target system, the interfaces included in each service, the instances under each interface, and the instance latency of each instance. Instance latency can be understood as the time it takes for an instance to process a request.
[0037] Optionally, before obtaining the interfaces, the instances under each interface, and the instance latency of each instance from the log aggregation results corresponding to the target system, the method further includes:
obtaining the original call logs of the target system within each time window, wherein each original call log includes the service, interface, instance, status code, instance latency, and number of calls;
aggregating the original call logs within each time window according to the service, interface, instance, and status code in each original call log to obtain the sets of similar call logs corresponding to each time window;
determining the instance latency of each set of similar call logs based on the instance latency of each original call log in that set;
determining the number of calls of each set of similar call logs based on the number of original call logs contained in that set;
generating the log aggregation result corresponding to each time window based on the sets of similar call logs corresponding to that time window and the instance latency and number of calls of each set, wherein the log aggregation results are used to determine whether the target system has a latency anomaly and the actual root cause of the latency anomaly;
obtaining the physical devices corresponding to each instance within different time windows and aggregating these physical devices to obtain the device aggregation result corresponding to each time window, wherein the device aggregation results are used to determine the actual root cause of the latency anomaly.
[0038] Specifically, the 99th percentile instance latency and/or 50th percentile instance latency of each similar call log set can be determined from the instance latencies of the original call logs in that set. The number of original call logs contained in each similar call log set is taken as the number of calls of that set. The number of errors of each similar call log set is determined from the status codes of the original call logs in that set. After the instance latency, number of calls, and number of errors of each similar call log set have been determined, the log aggregation result corresponding to each time window can be generated from the similar call log sets corresponding to that time window together with the latency, number of calls, and number of errors of each set. Optionally, each original call log may also include the upstream service of the service corresponding to the original call log.
[0039] In the process of generating log aggregation results corresponding to each time window, the physical devices corresponding to each instance in different time windows, as well as the CPU and memory usage of each physical device, can be obtained. Then, the physical devices and related information in different time windows are aggregated to obtain the device aggregation results corresponding to each time window.
[0040] Take the aggregation of the target system's data within one time window as an example. As shown in Table 1, the target system generated 10 original call logs within the time window from 10:00:00 to 10:00:59 on October 27, 2023. Consider the first type of original call log, whose service, interface, instance, and status code are order-svc, /api/v1/order, order-pod-a1, and 200 respectively. Since this type contains 5 entries and none of them reported an error, its number of calls and number of errors are 5 and 0 respectively. Its latency list is [45, 480, 510, 490, 520] ms, which sorted in ascending order gives [45, 480, 490, 510, 520] ms.
[0041] Then, the percentile index of the sorted latency list can be computed; the stated values correspond to multiplying the list length by the percentile. For the 99th percentile, the index is initially 5 × 0.99 = 4.95. Rounding 4.95 up gives the 5th position of the sorted latency list, so the 99th percentile instance latency of the first type of original call log is 520 ms. Likewise, the 50th percentile index is initially 5 × 0.5 = 2.5. Rounding 2.5 up gives the 3rd position of the sorted latency list, so the 50th percentile instance latency of the first type of original call log is 490 ms.
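The percentile arithmetic in this example can be checked with a short Python sketch. It assumes the index is the list length multiplied by the percentile and rounded up to a 1-based position (the nearest-rank method), which reproduces the values 4.95 → 5 and 2.5 → 3 stated above; the function name `nearest_rank_percentile` is hypothetical.

```python
import math
from typing import Sequence


def nearest_rank_percentile(latencies_ms: Sequence[int], percent: int) -> int:
    """Return the value at the rounded-up percentile position of the sorted list."""
    if not latencies_ms:
        raise ValueError("latency list must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(latencies_ms)
    # 1-based position: list length times percentile, rounded up.
    position = math.ceil(len(ordered) * percent / 100)
    return ordered[position - 1]


# Latency list of the first type of original call log from the example.
first_type_ms = [45, 480, 510, 490, 520]
p99 = nearest_rank_percentile(first_type_ms, 99)
p50 = nearest_rank_percentile(first_type_ms, 50)
print(p99, p50)
```

Other percentile definitions (for example, ones that interpolate between neighbouring values) would give different results on such a short list, so the rounding rule matters when comparing against a threshold.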
[0042] Similarly, the number of calls, number of errors, and instance latency can be determined for the second type of original call log, whose service, interface, instance, and status code are order-svc, /api/v1/order, order-pod-b2, and 200 respectively, and for the third type of original call log, whose service, interface, instance, and status code are order-svc, /api/v1/order, order-pod-a1, and 500 respectively, thus obtaining the log aggregation results shown in Table 2. While generating the log aggregation results shown in Table 2, the physical devices supporting the target system, the instances running on each physical device, and the CPU and memory usage of each physical device within the time window from 10:00:00 to 10:00:59 on October 27, 2023 can be obtained. The physical devices and related information within this time window can then be aggregated to obtain the device aggregation results shown in Table 3. With the above settings, massive amounts of high-cardinality raw detail data are pre-aggregated in the background into low-cardinality, time-aligned time-series indicators, which avoids the performance bottleneck of querying massive data in real time and provides a foundation for second-level analysis.
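The pre-aggregation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the record field names, the function names, and the rule that a status code of 500 or above counts as an error are hypothetical. The latencies for order-pod-a1 come from the example; those for order-pod-b2 are inferred by removing them from the normal latency list given later in this embodiment; the latency of the single status-500 record is a placeholder, because Table 1 is not reproduced here.

```python
import math
from collections import defaultdict


def nearest_rank(latencies_ms, percent):
    # Rounded-up percentile position in the sorted list (1-based).
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(len(ordered) * percent / 100) - 1]


def aggregate_call_logs(logs):
    """Group raw call logs by (window, service, interface, instance, status)."""
    groups = defaultdict(list)
    for log in logs:
        key = (log["window"], log["service"], log["interface"],
               log["instance"], log["status"])
        groups[key].append(log["latency_ms"])
    result = {}
    for key, latencies in groups.items():
        status = key[4]
        result[key] = {
            "calls": len(latencies),
            # Assumption: status codes >= 500 are counted as errors.
            "errors": len(latencies) if status >= 500 else 0,
            "p99_ms": nearest_rank(latencies, 99),
            "p50_ms": nearest_rank(latencies, 50),
        }
    return result


def make_logs(instance, status, latencies_ms):
    return [{"window": "2023-10-27T10:00", "service": "order-svc",
             "interface": "/api/v1/order", "instance": instance,
             "status": status, "latency_ms": ms} for ms in latencies_ms]


PLACEHOLDER_ERROR_LATENCY_MS = 100  # not from the source; Table 1 is unavailable

raw_logs = (
    make_logs("order-pod-a1", 200, [45, 480, 510, 490, 520])
    + make_logs("order-pod-b2", 200, [52, 50, 55, 48])
    + make_logs("order-pod-a1", 500, [PLACEHOLDER_ERROR_LATENCY_MS])
)
aggregated = aggregate_call_logs(raw_logs)
for key, stats in sorted(aggregated.items()):
    print(key[3], key[4], stats)
```

Keying the result by the full tuple keeps one row per similar call log set, so a later step can filter by interface and status code without rescanning the raw logs.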
[0043] S220. Based on the instance latency of each instance contained in each interface, determine the interface latency of each interface, and when the latency of any interface is greater than the preset latency threshold, determine that there is a latency anomaly in the target system.
[0044] In this embodiment, the preset delay time threshold can be understood as the critical time boundary point for the target system to transition from a normal state to an abnormal state. It can be determined based on historical experience or user needs. For example, the preset delay time threshold can be set to 200ms.
[0045] In this step, specifically, a monitoring system configured with the log aggregation results and device aggregation results corresponding to the target system can be used to determine whether the target system has a latency anomaly. In one specific implementation, the interface latency of each interface can be determined from the instance latencies of the instances under that interface whose status code is normal. When the latency of any interface is greater than the preset latency threshold, it can be determined that the target system has a latency anomaly. When the latencies of all interfaces are less than or equal to the preset latency threshold, it can be determined that the target system does not have a latency anomaly, and the method returns to aggregating data and monitoring the target system for latency anomalies until a latency anomaly is detected or an instruction to stop monitoring is received.
[0046] For example, assuming the monitoring system reads the 99th percentile instance latency of the /api/v1/order interface under the order-svc service at 10:00:00, it can be determined that the number of normal calls (status code 200) corresponding to the /api/v1/order interface is 9.
[0047] Then, the normal instance latency list corresponding to the /api/v1/order interface can be determined to be [45, 52, 480, 510, 50, 490, 55, 48, 520] ms. Next, the 99th percentile index of this list can be determined to be 9 (9 × 0.99 = 8.91, rounded up), so the 99th percentile instance latency corresponding to the /api/v1/order interface is the largest value in the list, 520 ms. Finally, with a preset latency threshold of 200 ms, since the 99th percentile instance latency corresponding to the /api/v1/order interface is greater than 200 ms, an alarm message can be triggered stating that "the 99th percentile instance latency of order-svc /api/v1/order reached 520 ms at 10:00".
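The detection check in this example can be sketched in Python as follows. The function name `detect_latency_anomaly` and the shape of the returned alarm record are assumptions for illustration; the threshold of 200 ms and the latency list are taken from the example, and the percentile uses the same round-up rule as the aggregation example.

```python
import math
from typing import Optional, Sequence

PRESET_LATENCY_THRESHOLD_MS = 200  # example threshold from the text


def detect_latency_anomaly(
    service: str,
    interface: str,
    window: str,
    normal_latencies_ms: Sequence[int],
    threshold_ms: int = PRESET_LATENCY_THRESHOLD_MS,
) -> Optional[dict]:
    """Return an alarm record if the interface p99 exceeds the threshold."""
    if not normal_latencies_ms:
        return None  # nothing to evaluate in this window
    ordered = sorted(normal_latencies_ms)
    # 99th percentile position: list length * 0.99, rounded up (1-based).
    position = math.ceil(len(ordered) * 99 / 100)
    p99_ms = ordered[position - 1]
    if p99_ms > threshold_ms:
        return {"service": service, "interface": interface,
                "window": window, "p99_ms": p99_ms}
    return None


alarm = detect_latency_anomaly(
    "order-svc", "/api/v1/order", "10:00",
    [45, 52, 480, 510, 50, 490, 55, 48, 520],
)
print(alarm)
```

Returning a structured record rather than a formatted string keeps the service, interface, and time window directly available to the step that parses the alarm.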
[0048] S230. When it is determined that there is a delay anomaly in the target system, obtain the target time window, delay anomaly service, and delay anomaly interface under the delay anomaly service corresponding to the delay anomaly.
[0049] S240. Obtain the target log aggregation result and target device aggregation result of the delay exception interface within the target time window.
[0050] Specifically, in this step, the target log aggregation results, that is, the entries with a normal status code that correspond to the latency anomaly service, the latency anomaly interface, and the target time window, can be obtained from the log aggregation results corresponding to the target system. The target device aggregation results corresponding to the target time window can be obtained from the device aggregation results corresponding to the target system.
[0051] S250. Based on the target log aggregation results, determine the attribution basis for the instance dimension corresponding to the delay exception interface.
[0052] In this step, specifically, based on the target log aggregation results, the abnormal and normal instances corresponding to the latency-abnormal interface can be identified, along with the call count and instance latency time corresponding to the normal and abnormal instances. Then, the abnormal and normal instances, along with the call count and instance latency time corresponding to the normal and abnormal instances, can be used as the attribution basis for the instance dimension corresponding to the latency-abnormal interface.
[0053] Optionally, based on the target log aggregation results, determine the instance dimension attribution basis corresponding to the delay exception interface, including: determining the number of times each instance is called and the instance delay time under the delay exception interface based on the target log aggregation results; determining the exception instance and normal instance corresponding to the delay exception interface based on the number of times each instance is called and the instance delay time under the delay exception interface; and using the exception instance, normal instance, and the number of times they are called and the instance delay time corresponding to the normal instance and the exception instance as the instance dimension attribution basis corresponding to the delay exception interface.
[0054] Specifically, if the instance latency of a certain instance under the latency exception interface exceeds a preset latency threshold, the instance latency of all other instances under the latency exception interface is less than or equal to the preset latency threshold, and the number of times this instance is invoked is greater than the number of times other instances under the latency exception interface are invoked, then this instance can be designated as an exception instance corresponding to the latency exception interface. Alternatively, instances under the latency exception interface whose latency exceeds the preset latency threshold can be directly designated as exception instances corresponding to the latency exception interface; instances under the latency exception interface whose latency is less than or equal to the preset latency threshold can be designated as normal instances corresponding to the latency exception interface.
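A minimal Python sketch of the simpler of the two rules, in which each instance is compared directly against the preset latency threshold, is given below. The function name `classify_instances` and the layout of the returned attribution record are assumptions; the per-instance values are those of the order-svc example in this embodiment.

```python
from typing import Dict


def classify_instances(
    instance_stats: Dict[str, Dict[str, int]],
    threshold_ms: int = 200,
) -> Dict[str, object]:
    """Split instances into abnormal and normal by their p99 latency."""
    abnormal = sorted(
        name for name, stats in instance_stats.items()
        if stats["p99_ms"] > threshold_ms
    )
    normal = sorted(
        name for name, stats in instance_stats.items()
        if stats["p99_ms"] <= threshold_ms
    )
    # The instance dimension attribution basis keeps both groups together
    # with their call counts and latencies.
    return {"abnormal": abnormal, "normal": normal, "stats": instance_stats}


basis = classify_instances({
    "order-pod-a1": {"p99_ms": 520, "calls": 5},
    "order-pod-b2": {"p99_ms": 55, "calls": 4},
})
print(basis["abnormal"], basis["normal"])
```

The stricter first rule additionally requires that exactly one instance exceeds the threshold and that it has the highest call count; that check could be layered on top of this classification.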
[0055] Taking the alarm message "The 99th percentile instance latency of order-svc /api/v1/order reached 520ms at 10:00" as an example, the target log aggregation result with status code 200 corresponding to order-svc:/api/v1/order can be obtained. From this result, it can be found that the instances under order-svc:/api/v1/order are order-pod-a1 and order-pod-b2, whose 99th percentile instance latencies are 520ms and 55ms, with 5 and 4 calls, respectively. Since the 99th percentile instance latency of order-pod-a1 is far above the preset latency threshold, that of order-pod-b2 is far below it, and order-pod-a1 is called more often than order-pod-b2, it can be assumed that the anomalies all originate from order-pod-a1. Therefore, order-pod-a1 is regarded as the abnormal instance and order-pod-b2 as the normal instance.
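The threshold-based variant of this classification can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the `InstanceStats` type, the `classify_instances` helper, and the 200 ms threshold are assumptions introduced here, since the text only speaks of a preset latency threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceStats:
    """Per-instance figures read from the target log aggregation result."""
    name: str
    p99_latency_ms: float
    call_count: int


def classify_instances(stats, threshold_ms):
    """Return (abnormal, normal); abnormal means p99 latency above the threshold."""
    abnormal = [s for s in stats if s.p99_latency_ms > threshold_ms]
    normal = [s for s in stats if s.p99_latency_ms <= threshold_ms]
    return abnormal, normal


# Values from the order-svc example; the 200 ms threshold is hypothetical.
stats = [
    InstanceStats("order-pod-a1", 520.0, 5),
    InstanceStats("order-pod-b2", 55.0, 4),
]
abnormal, normal = classify_instances(stats, threshold_ms=200.0)

# Instance dimension attribution basis: both groups with their raw figures.
instance_basis = {"abnormal": abnormal, "normal": normal}
```

With the example values, only order-pod-a1 lands in the abnormal group, and the call counts and latencies travel with each instance so that later steps can reuse them.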
[0056] After identifying the abnormal and normal instances corresponding to order-svc:/api/v1/order, as well as the number of calls and instance latency corresponding to the normal and abnormal instances, the following can be added to the attribution table shown in Table 4: "Attribution Dimension: Instance-level attribution; Specific findings: Anomalies are concentrated in the order-pod-a1 instance; Original data values: The 99th percentile instance latency and number of calls for the order-pod-a1 instance are 520ms and 5 times respectively, and the 99th percentile instance latency and number of calls for the order-pod-b2 instance are 55ms and 4 times respectively; Impact level judgment: Anomalies are highly concentrated".
S260. Based on the target device aggregation results and instance dimension attribution criteria, determine the device dimension attribution criteria corresponding to the latency anomaly interface.
[0057] In this step, specifically, based on the instance dimension attribution criteria, the abnormal and normal instances corresponding to the latency anomaly interface can be obtained, and based on the aggregation results of the abnormal instances, normal instances, and target devices, the device dimension attribution criteria corresponding to the latency anomaly interface can be determined.
[0058] Optionally, based on the target device aggregation results and instance dimension attribution criteria, determine the device dimension attribution criteria corresponding to the latency anomaly interface, including: based on the target device aggregation results, obtain the abnormal physical device corresponding to the abnormal instance, the normal physical device corresponding to the normal instance, and the resource utilization rate corresponding to the normal physical device and the abnormal physical device respectively; and use the normal physical device, the abnormal physical device, and the resource utilization rate corresponding to the normal physical device and the abnormal physical device respectively as the device dimension attribution criteria corresponding to the latency anomaly interface.
[0059] Resource utilization can include CPU utilization and memory utilization, among others.
[0060] For example, from the target device aggregation results corresponding to order-svc /api/v1/order, the abnormal physical device corresponding to the order-pod-a1 instance can be identified as host-78, and the normal physical device corresponding to the order-pod-b2 instance can be identified as host-79. The CPU utilization rates corresponding to host-78 and host-79 are 95% and 30%, respectively. Then, since the CPU utilization rate of the abnormal physical device where the order-pod-a1 instance is located is much higher than the preset CPU utilization rate threshold, the following can be added to the attribution basis table shown in Table 4: "Attribution basis dimension: Device dimension attribution basis; Specific findings: CPU utilization rate of host-78 is 95%; Original data values: CPU utilization rate of host-78 is 95%, and CPU utilization rate of host-79 is 30%; Impact level judgment: Severe resource anomaly".
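The device dimension lookup can be sketched as a join between the instance groups and per-host metrics. The helper name `device_attribution`, the dictionary layout, and the 80% CPU threshold are assumptions made for illustration; the text only refers to a preset CPU utilization rate threshold.

```python
def device_attribution(instance_to_host, host_cpu, abnormal_instances,
                       normal_instances, cpu_threshold):
    """Build a device dimension attribution basis from instance groups and host CPU data."""
    abnormal_hosts = {instance_to_host[i] for i in abnormal_instances}
    # A host counts as normal only if it runs no abnormal instance.
    normal_hosts = {instance_to_host[i] for i in normal_instances} - abnormal_hosts
    return {
        "abnormal_devices": {h: host_cpu[h] for h in sorted(abnormal_hosts)},
        "normal_devices": {h: host_cpu[h] for h in sorted(normal_hosts)},
        "over_threshold": sorted(
            h for h in abnormal_hosts if host_cpu[h] > cpu_threshold
        ),
    }


# Values from the example; cpu_threshold=0.80 is a hypothetical setting.
device_basis = device_attribution(
    instance_to_host={"order-pod-a1": "host-78", "order-pod-b2": "host-79"},
    host_cpu={"host-78": 0.95, "host-79": 0.30},
    abnormal_instances=["order-pod-a1"],
    normal_instances=["order-pod-b2"],
    cpu_threshold=0.80,
)
```

Memory utilization or other resource metrics could be carried in the same structure by replacing the single CPU value with a small record per host.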
[0061] S270. Based on the instance dimension attribution criteria, determine the topology dimension attribution criteria corresponding to the delay exception interface.
[0062] Specifically, in this step, the target upstream service corresponding to the latency-anomaly service can be obtained, and two values can be determined: the first latency time, observed when the target upstream service is called by the abnormal instance, and the second latency time, observed when it is called by the normal instance. The first latency time and the second latency time are then used as the topology dimension attribution basis corresponding to the latency-anomaly interface.
[0063] In one specific implementation, the target upstream service corresponding to the delayed abnormal service can be obtained according to a predefined service dependency graph. Then, based on the log aggregation results of distributed tracing, the first latency time when the target upstream service is called by the abnormal instance and the second latency time when the target upstream service is called by the normal instance can be determined.
[0064] For example, based on a predefined service dependency graph, the target upstream service for the delayed service `order-svc` can be determined to be `user-svc`. Then, the metrics for the `user-svc` service at 10:00:00 can be queried, revealing a 99th percentile latency of 40ms, which is normal. However, a specific check reveals that the 99th percentile latency of `user-svc` when called by the `order-pod-a1` instance is as high as 450ms, while the latency is normal when called by the `order-pod-b2` instance. Based on this, it can be assumed that the same upstream service is only slow to respond to the delayed instance. Combined with high host CPU usage, it is speculated that the `order-pod-a1` instance itself is experiencing resource issues, resulting in slow request processing and queuing of downstream calls, manifesting as increased response time from the `user-svc` service. In summary, the following can be added to the attribution criteria table as shown in Table 4: "Attribution Dimension: Topology dimension attribution; Specific findings: user-svc only responds slowly to abnormal instances; Original data values: user-svc delays 450ms for order-pod-a1 and 40ms for order-pod-b2; Impact assessment: The problem lies with the caller".
[0065] S280. Based on the device-level attribution criteria, determine the change-level attribution criteria corresponding to the delay anomaly interface.
[0066] Specifically, this step involves obtaining change records for abnormal physical devices or services with latency issues within a defined historical time period, along with the time these changes were generated. These change records and their generation times are then used as the basis for attributing changes to the corresponding latency-related interfaces.
[0067] For example, change records for the host-78 host or the order-svc service within one hour before 10:00:00 can be retrieved. Then, if a change record is found indicating that a new data export batch job was deployed on host-78 at 09:45:00 on 2023-10-27, it can be considered that a potential disruptive event was deployed before the anomaly occurred. In summary, the following can be added to the attribution criteria table shown in Table 4: "Attribution Criteria Dimension: Change Dimension Attribution Criteria; Specific Finding: Batch job deployed on host-78 15 minutes ago; Original Data Value: Change time is 09:45; Impact Level Judgment: Potential triggering event exists."
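The change lookup described here amounts to filtering a change log by target and by a look-back window. The sketch below assumes each change record is a dictionary with `target`, `time`, and `summary` keys; that layout, the `recent_changes` helper, and the two extra records are hypothetical.

```python
from datetime import datetime, timedelta


def recent_changes(records, targets, anomaly_time, lookback):
    """Return change records for the given targets inside the look-back window."""
    window_start = anomaly_time - lookback
    hits = [
        r for r in records
        if r["target"] in targets and window_start <= r["time"] <= anomaly_time
    ]
    return sorted(hits, key=lambda r: r["time"])


anomaly_time = datetime(2023, 10, 27, 10, 0, 0)
records = [
    {"target": "host-78", "time": datetime(2023, 10, 27, 9, 45, 0),
     "summary": "new data export batch job deployed"},
    # The two records below are invented to show what gets filtered out.
    {"target": "host-79", "time": datetime(2023, 10, 27, 9, 50, 0),
     "summary": "hypothetical unrelated change"},
    {"target": "host-78", "time": datetime(2023, 10, 27, 8, 30, 0),
     "summary": "hypothetical change outside the window"},
]

change_basis = recent_changes(
    records, {"host-78", "order-svc"}, anomaly_time, timedelta(hours=1)
)
```

Only the 09:45 deployment on host-78 survives the filter, which matches the change dimension entry added to the attribution table.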
[0068] S290. Based on each attribution basis, generate multiple hypothetical root causes that lead to the delay anomaly, and based on each attribution basis, determine the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause.
[0069] Specifically, as shown in Table 5, four hypothetical root causes can be generated: host CPU contention, instance-specific issues, upstream service issues, and network issues. Since the instance-level, device-level, topology-level, and change-level attribution criteria all support the hypothesis of host CPU contention, the corresponding quantitative scoring rules determine the instance-level, device-level, topology-level, and change-level scores to be 100, 100, 100, and 80 points, respectively. Since only the instance-level attribution criteria support the hypothesis of instance-specific issues, the corresponding instance-level, device-level, topology-level, and change-level scores are determined to be 100, 0, 0, and 0 points, respectively.
[0070] Since there is no attribution evidence to support the hypothesis of an upstream service problem as the root cause, the instance, device, topology, and change dimension scores corresponding to this hypothesis are all 0. Since only the instance and topology dimension attribution evidence can support the hypothesis of a network problem as the root cause, the instance, device, topology, and change dimension scores corresponding to this hypothesis are 100, 0, 100, and 0, respectively. To illustrate the process of determining the scores for each dimension in detail, the hypothetical root cause of host CPU contention in Table 5 is taken as an example. The quantitative scoring rule corresponding to the instance dimension score is as follows. First, the anomaly contribution of each instance is calculated as: instance contribution = (99th percentile instance latency - 99th percentile service baseline latency) × number of calls. The 99th percentile service baseline latency is the latency value at the 99th percentile after sorting the historical latency data of the latency-anomaly service in ascending order. Then, the instance contributions of all instances under the latency-anomaly interface are summed to obtain the total contribution corresponding to the latency-anomaly interface. Next, the anomaly concentration is calculated as: anomaly concentration = maximum instance contribution / total contribution. Finally, if the anomaly concentration is greater than 90%, the instance dimension score of the hypothetical root cause is 100; if the anomaly concentration is greater than 70% but not greater than 90%, the instance dimension score is 80; otherwise, the instance dimension score is equal to the product of the anomaly concentration and 100.
Therefore, when the 99th percentile service baseline latency is 50ms, the instance contributions of the order-pod-a1 and order-pod-b2 instances are (520 - 50) × 5 = 2350ms·times and (55 - 50) × 4 = 20ms·times, respectively, so the maximum contribution and total contribution corresponding to the latency-anomaly interface are 2350ms·times and 2370ms·times, respectively. The anomaly concentration is then 2350 / 2370 ≈ 99.2%, and since it is greater than 90%, the instance dimension score of the hypothetical root cause of host CPU contention is determined to be 100.
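The worked figures above (2350 and 20 ms·times, 99.2%) are consistent with a contribution of excess latency times call count and a concentration of maximum over total. The sketch below uses that reading. The function names are hypothetical, and clamping negative contributions to zero is an added assumption for instances faster than the baseline.

```python
def instance_contribution(p99_latency_ms, baseline_p99_ms, call_count):
    """Excess latency over the service baseline, weighted by calls (ms·times)."""
    # Assumption: an instance below the baseline contributes nothing.
    return max(p99_latency_ms - baseline_p99_ms, 0.0) * call_count


def anomaly_concentration(contributions):
    """Share of the total contribution held by the largest contributor."""
    total = sum(contributions)
    return max(contributions) / total if total > 0 else 0.0


def instance_dimension_score(concentration):
    """Map anomaly concentration (0..1) to the instance dimension score."""
    if concentration > 0.9:
        return 100.0
    if concentration > 0.7:
        return 80.0
    return concentration * 100.0


# order-pod-a1: 520 ms over 5 calls; order-pod-b2: 55 ms over 4 calls; baseline 50 ms.
contributions = [
    instance_contribution(520, 50, 5),
    instance_contribution(55, 50, 4),
]
concentration = anomaly_concentration(contributions)
score = instance_dimension_score(concentration)
```

Running this on the example yields contributions of 2350 and 20, a concentration of about 0.992, and an instance dimension score of 100.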
[0071] The quantitative scoring rule corresponding to the device dimension score is as follows: the CPU anomaly score is calculated by a formula [formula omitted]. If the CPU anomaly score is greater than 0.5, the device dimension score of the hypothetical root cause is 100; if the CPU anomaly score is greater than 0.3 but not greater than 0.5, the device dimension score is 80; otherwise, the device dimension score is equal to the product of the CPU anomaly score and 100. Therefore, for order-svc /api/v1/order, the calculated CPU anomaly score exceeds 0.5, and the device dimension score of the hypothetical root cause is determined to be 100.
[0072] The quantitative scoring rule corresponding to the topology dimension score is as follows: if the target upstream service is generally normal but responds slowly only to the abnormal instance, the problem is determined to be on the caller side, and the topology dimension score of the hypothetical root cause is 100 points; if the target upstream service responds slowly to all callers, the problem is determined to be in the target upstream service itself, and the topology dimension score of the hypothetical root cause is 0 points. Therefore, since the overall 99th percentile latency of user-svc is normal and its 99th percentile latency when called by order-pod-b2 is normal, while its 99th percentile latency when called by order-pod-a1 is abnormal, a topology dimension score of 100 points is obtained.
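The topology rule can be sketched as a comparison of the two latencies against a notion of "slow". The function name and the 200 ms `slow_threshold_ms` are assumptions. The text does not say how to score the case where the upstream service is fast for every caller, so the sketch returns `None` there instead of inventing a value.

```python
def topology_dimension_score(first_latency_ms, second_latency_ms, slow_threshold_ms):
    """Score the topology dimension for a caller-side hypothesis.

    first_latency_ms:  upstream latency when called by the abnormal instance.
    second_latency_ms: upstream latency when called by the normal instance.
    """
    slow_for_abnormal = first_latency_ms > slow_threshold_ms
    slow_for_normal = second_latency_ms > slow_threshold_ms
    if slow_for_abnormal and not slow_for_normal:
        return 100  # upstream is slow only for the abnormal caller
    if slow_for_abnormal and slow_for_normal:
        return 0    # upstream is slow for every caller
    return None     # case not covered by the rule in the text


# user-svc: 450 ms when called by order-pod-a1, 40 ms when called by order-pod-b2.
topology_score = topology_dimension_score(450, 40, slow_threshold_ms=200)
```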
[0073] The quantitative scoring rule corresponding to the change dimension score is as follows: the time proximity between the change and the anomaly is calculated by a formula [formula omitted]. If the time proximity is less than 5 minutes, the change dimension score of the hypothetical root cause is 100 points; if the time proximity is greater than or equal to 5 minutes and less than 30 minutes, the score is 80 points; if the time proximity is greater than or equal to 30 minutes and less than 60 minutes, the score is 60 points; otherwise, the score is 0 points. Therefore, since the batch job was deployed at 09:45, 15 minutes before the anomaly at 10:00, the change dimension score of the hypothetical root cause is 80 points.
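The banded rule for the change dimension can be sketched as below. Because the original formula is not reproduced in the text, the time proximity is assumed here to be the anomaly time minus the change time, in minutes. Scoring a change that happens after the anomaly as 0 is likewise an assumption.

```python
from datetime import datetime


def time_proximity_minutes(change_time, anomaly_time):
    """Minutes elapsed between the change and the anomaly (assumed definition)."""
    return (anomaly_time - change_time).total_seconds() / 60.0


def change_dimension_score(proximity_min):
    """Map time proximity in minutes to the change dimension score."""
    if proximity_min < 0:
        return 0  # assumption: a change after the anomaly cannot have caused it
    if proximity_min < 5:
        return 100
    if proximity_min < 30:
        return 80
    if proximity_min < 60:
        return 60
    return 0


proximity = time_proximity_minutes(
    datetime(2023, 10, 27, 9, 45, 0), datetime(2023, 10, 27, 10, 0, 0)
)
change_score = change_dimension_score(proximity)
```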
[0074] S2100. Based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause, determine the actual root cause causing the latency anomaly among each hypothetical root cause.
[0075] Specifically, this step involves obtaining the instance dimension weight corresponding to the instance dimension score, the device dimension weight corresponding to the device dimension score, the topology dimension weight corresponding to the topology dimension score, and the change dimension weight corresponding to the change dimension score. The instance dimension weight, device dimension weight, topology dimension weight, and change dimension weight can be determined based on user needs and historical experience. For example, the instance dimension weight, device dimension weight, topology dimension weight, and change dimension weight can be set to 0.3, 0.4, 0.2, and 0.1, respectively.
[0076] Then, the instance dimension weight, device dimension weight, topology dimension weight, and change dimension weight can be added together to obtain the total weight. Based on the weight of each dimension, the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause can be weighted and summed to obtain the initial score corresponding to each hypothetical root cause. Next, the initial score corresponding to each hypothetical root cause can be divided by the total weight to obtain the comprehensive score corresponding to each hypothetical root cause. Finally, the highest and second-highest comprehensive scores are determined, and the confidence level is calculated from them by a formula [formula omitted]. When the confidence level is greater than or equal to a preset confidence threshold, the hypothetical root cause with the highest comprehensive score is taken as the actual root cause; when the confidence level is less than the preset confidence threshold, the result is automatically labeled as having a medium confidence level and is transferred to manual review. The preset confidence threshold can be determined based on user needs and historical experience; for example, it can be set to 60%.
[0077] Taking instance dimension weights, device dimension weights, topology dimension weights, and change dimension weights of 0.3, 0.4, 0.2, and 0.1 respectively as an example, the comprehensive score for the hypothetical root cause of host CPU contention is calculated to be 98 points, the comprehensive score for the hypothetical root cause of instance-specific issues is 30 points, the comprehensive score for the hypothetical root cause of upstream service issues is 0 points, and the comprehensive score for the hypothetical root cause of network issues is 50 points. Since the confidence level of each comprehensive score is 49%, which is less than 60%, the comprehensive score can be automatically labeled as having a medium confidence level and then transferred to manual review.
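The weighted scoring and the confidence check can be sketched as follows. The confidence formula is not reproduced in the text. The sketch assumes confidence = (highest - second highest) / highest, which reproduces the 49% figure from the scores 98 and 50 but remains an inference. All function names are hypothetical.

```python
WEIGHTS = {"instance": 0.3, "device": 0.4, "topology": 0.2, "change": 0.1}

# Dimension scores per hypothetical root cause, as given in the description.
SCORES = {
    "host CPU contention": {"instance": 100, "device": 100, "topology": 100, "change": 80},
    "instance-specific issue": {"instance": 100, "device": 0, "topology": 0, "change": 0},
    "upstream service issue": {"instance": 0, "device": 0, "topology": 0, "change": 0},
    "network issue": {"instance": 100, "device": 0, "topology": 100, "change": 0},
}


def comprehensive_score(scores, weights):
    """Weighted sum of dimension scores divided by the total weight."""
    total_weight = sum(weights.values())
    initial = sum(weights[d] * scores[d] for d in weights)
    return initial / total_weight


def confidence(composites):
    """Assumed formula: relative gap between the top two comprehensive scores."""
    ranked = sorted(composites, reverse=True)
    top, second = ranked[0], ranked[1]
    return (top - second) / top if top > 0 else 0.0


def decide(score_table, weights, threshold):
    """Return (comprehensive scores, confidence, verdict)."""
    composites = {
        name: comprehensive_score(s, weights) for name, s in score_table.items()
    }
    conf = confidence(list(composites.values()))
    best = max(composites, key=composites.get)
    verdict = best if conf >= threshold else "manual review"
    return composites, conf, verdict


composites, conf, verdict = decide(SCORES, WEIGHTS, threshold=0.6)
```

On the example data, the comprehensive scores round to 98, 30, 0, and 50, the confidence rounds to 0.49, and the verdict is manual review because 0.49 is below the 0.6 threshold.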
[0078] Optionally, after identifying the actual root cause of the latency anomaly, a solution recommendation can be generated corresponding to the actual root cause. Then, referring to a predefined root cause analysis report template, a root cause analysis report is generated based on the identification process of the actual root cause and the solution recommendations, so that operations and maintenance personnel can quickly handle the fault.
[0079] Taking order-svc /api/v1/order as an example, the following root cause analysis report can be obtained: I. Alarm Summary: Alarm time: 2023-10-27 10:00:05; Alarm target: order-svc /api/v1/order; Alarm indicator: 99th percentile instance latency greater than 200ms; Current value: 520ms.
[0080] II. Root cause localization (confidence level: 49%): The most likely root cause is that the CPU utilization of the host machine host-78, where the order service instance order-pod-a1 is located in data center DC-01, is too high (95%), which leads to a decrease in the instance's processing capacity, resulting in an increase in the latency of its call to the user-svc service, and thus causing the overall latency of the interface to spike.
[0081] III. Key Chain of Evidence: (1) Anomalies are highly concentrated in a single instance. Data: The 99th percentile instance latency of order-pod-a1 is 520ms (5 calls), and the 99th percentile instance latency of order-pod-b2 is 55ms (4 calls). Analysis: order-pod-a1 contributed 99.2% of the latency increase, indicating that the problem is instance-specific.
[0082] (2) Infrastructure indicators match the anomaly. Data: The CPU utilization of host-78, where order-pod-a1 is located, is 95%, while the CPU utilization of host-79, where order-pod-b2 is located, is 30%. Analysis: Abnormal instances are strongly correlated with high CPU utilization, with a difference of 65 percentage points.
[0083] (3) Topological propagation characteristics point to a problem on the caller side. Data: The overall 99th percentile latency of the target upstream service user-svc is normal (i.e., 40ms), but its latency when called by order-pod-a1 is 450ms, while its latency when called by order-pod-b2 is only 40ms. Analysis: The fact that the same upstream service responds slowly only to the abnormal instance rules out a problem with the upstream service itself and points to an issue in the caller's environment.
[0084] (4) Recent changes may trigger resource competition. Data: A new data export batch job was deployed on host-78 15 minutes ago (09:45); Analysis: Batch processing jobs often consume a lot of CPU resources, which is close to the abnormal time window.
[0085] IV. Recommended Actions: [High Priority] Immediately log in to host-78, check the resource usage of batch jobs, and consider temporarily limiting their CPU quota; [Medium Priority] Migrate the order-pod-a1 instance to a less loaded host (such as host-79) and observe whether the latency recovers; [Low Priority] Check the GC logs and thread stacks of order-pod-a1 to confirm whether there is any blocking caused by resource contention.
[0086] V. Analysis Metadata: Analysis number: RCA-20231027-100005-ORDER-001; Analysis time: 2023-10-27 10:00:10 (time taken: 5 seconds); Time window: 2023-10-27 09:59:00 - 10:00:00; Related data sources: application metrics (i.e., instance latency, number of errors, and number of requests), physical device metrics (i.e., CPU utilization and memory utilization), topology data (i.e., the upstream service corresponding to the latency-abnormal service), and change logs (i.e., a new data export batch job was deployed on host-78).
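Filling a predefined report template with the analysis results can be sketched with the standard-library `string.Template`. The template wording, the field names, and the `render_report` helper are illustrative assumptions. The description does not specify the template format.

```python
from string import Template

# Hypothetical report template; section titles follow the example report.
REPORT_TEMPLATE = Template(
    "I. Alarm Summary: Alarm time: $alarm_time; Alarm target: $target; "
    "Current value: $current_value\n"
    "II. Root cause localization (confidence level: $confidence): $root_cause\n"
    "III. Key Chain of Evidence:\n$evidence\n"
    "IV. Recommended Actions:\n$actions"
)


def render_report(fields, evidence, actions):
    """Substitute scalar fields and numbered evidence into the template."""
    return REPORT_TEMPLATE.substitute(
        fields,
        evidence="\n".join(f"({i}) {e}" for i, e in enumerate(evidence, 1)),
        actions="\n".join(actions),
    )


report = render_report(
    {
        "alarm_time": "2023-10-27 10:00:05",
        "target": "order-svc /api/v1/order",
        "current_value": "520ms",
        "confidence": "49%",
        "root_cause": "CPU utilization of host-78 is too high (95%)",
    },
    evidence=[
        "Anomalies are highly concentrated in order-pod-a1",
        "CPU utilization is 95% on host-78 and 30% on host-79",
    ],
    actions=["[High Priority] Check the resource usage of batch jobs on host-78"],
)
```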
[0087] The technical solution of this embodiment determines the instance dimension attribution basis corresponding to the latency anomaly interface based on the target log aggregation results; determines the device dimension attribution basis corresponding to the latency anomaly interface based on the target device aggregation results and the instance dimension attribution basis; determines the topology dimension attribution basis corresponding to the latency anomaly interface based on the instance dimension attribution basis; determines the change dimension attribution basis corresponding to the latency anomaly interface based on the device dimension attribution basis; generates multiple hypothetical root causes causing latency anomalies based on each attribution basis; and determines the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause based on each attribution basis; and determines the actual root cause causing latency anomalies from each hypothetical root cause based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause. This allows for a comprehensive multi-dimensional perspective in locating the actual root cause of latency anomalies, improving the accuracy of determining the actual root cause.
[0088] To implement the technical solutions of any embodiment of the present invention, the present invention also provides a latency anomaly tracing system, which includes a pre-aggregation component, a service dependency graph component, a correlation analysis engine, a change event handling component, and a report generator. The pre-aggregation component can be used to generate the log aggregation results and device aggregation results corresponding to the target system, and to store each aggregation result in a time-series database. The service dependency graph component can be used to maintain real-time, accurate service governance or link tracing metadata, from which the following are obtained: the target upstream service corresponding to the latency-anomaly service, the first latency time when the target upstream service is called by the abnormal instance, and the second latency time when the target upstream service is called by the normal instance. The correlation analysis engine has a built-in rule engine and simple algorithms, such as calculating the contribution of each dimension. The change event handling component can be used to ensure that all operation and maintenance operations (such as deployment, scaling, and configuration changes) are published to a unified event stream. The report generator can be used to render the analysis results into natural language reports by means of templates.
[0089] Example 3. Figure 3 is a schematic diagram of a delay anomaly tracing device according to Embodiment 3 of the present invention. This embodiment is applicable to the case of root cause localization of delay anomalies. The delay anomaly tracing device can be implemented in hardware and/or software and can be configured in electronic devices such as computers.
[0090] As shown in Figure 3, the delay anomaly tracing device disclosed in this embodiment includes: the abnormal information acquisition module 31, which is used to acquire, when it is determined that there is a delay abnormality in the target system, the delay abnormal service, the delay abnormal interface under the delay abnormal service, and the target time window in which the delay abnormal interface appears; the target aggregation result acquisition module 32, which is used to acquire the target log aggregation result and target device aggregation result of the delay abnormal interface within the target time window; the attribution basis determination module 33, which is used to determine the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the delay abnormal interface based on the target log aggregation result and the target device aggregation result; the multidimensional scoring determination module 34, which is used to generate multiple hypothetical root causes that lead to the delay anomaly based on each attribution basis, and to determine the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause based on each attribution basis; and the actual root cause determination module 35, which is used to determine the actual root cause causing the delay anomaly among the hypothetical root causes based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause.
[0091] The technical solution in this embodiment, through the cooperation of the abnormal information acquisition module 31, the target aggregation result acquisition module 32, the attribution basis determination module 33, the multidimensional scoring determination module 34, and the actual root cause determination module 35, solves the problem that the existing technology relies heavily on experience when investigating root causes, resulting in low efficiency, long time consumption, and difficulty in quickly locating root causes, which seriously affects the speed of fault recovery. It improves the efficiency of delay anomaly tracing and increases the speed of fault recovery.
[0092] Optionally, the device also includes a data aggregation module, which is used for: obtaining the raw call logs of the target system within each time window, where each raw call log entry includes the service, interface, instance, status code, instance latency time, and number of calls; aggregating the raw call logs within each time window based on the service, interface, instance, and status code in each raw call log, to obtain the sets of same-type call logs corresponding to each time window; determining the instance latency time of each same-type call log set based on the instance latency time of each raw call log in that set, and determining the number of calls of each same-type call log set based on the number of raw call logs it contains; generating the log aggregation results corresponding to each time window based on the same-type call log sets corresponding to that time window and the instance latency time and number of calls of each set, so that whether there is a latency anomaly in the target system, and the actual root cause of the latency anomaly, can be determined based on the log aggregation results; and obtaining the physical devices corresponding to each instance within different time windows and aggregating them to obtain the device aggregation results corresponding to each time window, so that the actual root cause of the latency anomaly can be determined based on the device aggregation results.
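The log aggregation performed by the data aggregation module can be sketched as a group-by over (service, interface, instance, status code). The raw latency values below are invented so that the aggregates match the running example (520 ms over 5 calls, 55 ms over 4 calls). Using a nearest-rank 99th percentile as the set's instance latency is an assumption, as are the field and function names.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def aggregate_window(raw_logs):
    """Group raw call logs of one time window and summarize each group."""
    groups = defaultdict(list)
    for log in raw_logs:
        key = (log["service"], log["interface"], log["instance"], log["status_code"])
        groups[key].append(log["latency_ms"])
    return {
        key: {
            "p99_latency_ms": nearest_rank_percentile(latencies, 99),
            "call_count": len(latencies),  # number of raw logs in the set
        }
        for key, latencies in groups.items()
    }


def make_logs(instance, latencies):
    """Build hypothetical raw logs for one instance of order-svc."""
    return [
        {"service": "order-svc", "interface": "/api/v1/order",
         "instance": instance, "status_code": 200, "latency_ms": ms}
        for ms in latencies
    ]


raw_logs = (make_logs("order-pod-a1", [480, 500, 510, 515, 520])
            + make_logs("order-pod-b2", [40, 45, 50, 55]))
aggregated = aggregate_window(raw_logs)
```

In a production setting the per-window results would typically be written to a time-series store, as the pre-aggregation component in the description suggests.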
[0093] Optionally, the anomaly information acquisition module 31 is specifically used to: obtain from the log aggregation results each interface corresponding to the target system, each instance under each interface, and the instance latency time of each instance; determine the interface latency time of each interface based on the instance latency time of each instance under each interface; and determine that there is a latency anomaly in the target system when the latency time of any interface is greater than the preset latency time threshold.
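The detection step can be sketched as below. The text does not state how the interface latency is derived from the instance latencies. Taking the maximum instance latency is an assumption chosen only because it matches the 520 ms alarm value in the example. The 200 ms threshold and the function name are also hypothetical.

```python
def find_latency_anomalies(aggregated, threshold_ms):
    """Return interfaces whose derived latency exceeds the preset threshold."""
    interface_latency = {}
    for (service, interface, _instance, _status), stats in aggregated.items():
        key = (service, interface)
        # Assumption: interface latency = worst instance latency under it.
        interface_latency[key] = max(
            interface_latency.get(key, 0.0), stats["p99_latency_ms"]
        )
    return {k: v for k, v in interface_latency.items() if v > threshold_ms}


# Log aggregation result for one time window, using the example values.
aggregated_window = {
    ("order-svc", "/api/v1/order", "order-pod-a1", 200):
        {"p99_latency_ms": 520, "call_count": 5},
    ("order-svc", "/api/v1/order", "order-pod-b2", 200):
        {"p99_latency_ms": 55, "call_count": 4},
}
anomalies = find_latency_anomalies(aggregated_window, threshold_ms=200)
```

A non-empty result indicates that the target system has a latency anomaly, and its keys identify the latency-anomaly service and interface handed to the later attribution steps.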
[0094] Optionally, the attribution basis determination module 33 includes: the instance attribution basis determination unit, which is used to determine the instance dimension attribution basis corresponding to the delay exception interface based on the target log aggregation results; the device attribution basis determination unit, which is used to determine the device dimension attribution basis corresponding to the latency anomaly interface based on the target device aggregation results and the instance dimension attribution basis; the topology attribution basis determination unit, which is used to determine the topology dimension attribution basis corresponding to the delay exception interface based on the instance dimension attribution basis; and the change attribution basis determination unit, which is used to determine the change dimension attribution basis corresponding to the delay anomaly interface based on the device dimension attribution basis.
[0095] Optionally, the instance attribution basis determination unit is specifically used to: determine the number of times each instance is called and the instance latency time under the latency anomaly interface based on the target log aggregation results; determine the abnormal instance and normal instance corresponding to the latency anomaly interface based on the number of times each instance is called and the instance latency time under the latency anomaly interface; and use the abnormal instance, normal instance, and the number of times they are called and the instance latency time corresponding to the normal instance and abnormal instance as the instance dimension attribution basis corresponding to the latency anomaly interface.
[0096] Optionally, the device attribution basis determination unit is specifically used to: obtain, based on the target device aggregation results, the abnormal physical devices corresponding to the abnormal instances, the normal physical devices corresponding to the normal instances, and the resource utilization rates corresponding to the normal physical devices and the abnormal physical devices respectively; and use the normal physical devices, the abnormal physical devices, and the resource utilization rates corresponding to the normal physical devices and the abnormal physical devices respectively as the device dimension attribution basis corresponding to the latency anomaly interface.
[0097] Optionally, the topology attribution basis determination unit is specifically used to: obtain the target upstream service corresponding to the delayed abnormal service, and determine the first delay time when the target upstream service is called by the abnormal instance, and the second delay time when the target upstream service is called by the normal instance; and use the first delay time and the second delay time as the topology dimension attribution basis corresponding to the delayed abnormal interface.
[0098] Optionally, the change attribution basis determination unit is specifically used to: obtain the change records of the abnormal physical device within a set historical time period, as well as the time when the change records were generated; and use the change records and the time when the change records were generated as the change dimension attribution basis corresponding to the delay abnormal interface.
[0099] The delay anomaly tracing device provided in this embodiment of the invention can execute the delay anomaly tracing method provided in any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the method execution. Content not described in detail in this embodiment can be referred to the description in any method embodiment of this application.
[0100] Example 4. Figure 4 shows a schematic diagram of the structure of an electronic device 10 that can be used to implement embodiments of the present invention. As shown in Figure 4, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
[0101] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0102] Processor 11 can be any of a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as the delay anomaly tracing method.
[0103] In some embodiments, the delay anomaly tracing method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the delay anomaly tracing method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the delay anomaly tracing method by any other suitable means (e.g., by means of firmware).
[0104] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0105] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0106] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0107] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0108] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0109] A computing system can include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship is created by computer programs that run on the respective computers and have a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system intended to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.
[0110] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and no limitation is imposed herein.
[0111] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A method for tracing the source of delay anomalies, characterized in that the method includes:
when it is determined that the target system has a delay anomaly, obtaining the delay anomaly service, the delay anomaly interface under the delay anomaly service, and the target time window in which the delay anomaly interface appears;
obtaining the target log aggregation result and the target device aggregation result of the delay anomaly interface within the target time window;
determining, based on the target log aggregation result and the target device aggregation result, the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the delay anomaly interface;
generating, based on each attribution basis, multiple hypothetical root causes of the delay anomaly, and determining, based on each attribution basis, the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause; and
determining, based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause, the actual root cause of the delay anomaly from among the hypothetical root causes.
2. The method according to claim 1, characterized in that, before it is determined that the target system has a delay anomaly, the method further includes:
obtaining the original call logs of the target system within each time window, wherein each original call log includes the service, interface, instance, status code, instance delay time, and number of calls;
aggregating, based on the service, interface, instance, and status code in each original call log, the original call logs within each time window to obtain the same-type call log sets corresponding to each time window;
determining the instance delay time of each same-type call log set based on the instance delay time of each original call log in that set, and determining the number of calls of each same-type call log set based on the number of original call logs contained in that set;
generating the log aggregation result corresponding to each time window based on the same-type call log sets corresponding to that time window and on the instance delay time and number of calls of each same-type call log set, so that whether the target system has a delay anomaly, and the actual root cause of the delay anomaly, can be determined based on the log aggregation result; and
obtaining the physical devices corresponding to each instance within the different time windows, and aggregating the physical devices within the different time windows to obtain the device aggregation result corresponding to each time window, so that the actual root cause of the delay anomaly can be determined based on the device aggregation result.
3. The method according to claim 2, characterized in that determining that the target system has a delay anomaly includes:
obtaining, from the log aggregation result, each interface corresponding to the target system, each instance under each interface, and the instance delay time of each instance;
determining the interface delay time of each interface based on the instance delay time of each instance under that interface; and
when the interface delay time of any interface exceeds a preset delay threshold, determining that the target system has a delay anomaly.
4. The method according to claim 2, characterized in that determining, based on the target log aggregation result and the target device aggregation result, the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the delay anomaly interface includes:
determining, based on the target log aggregation result, the instance dimension attribution basis corresponding to the delay anomaly interface;
determining, based on the target device aggregation result and the instance dimension attribution basis, the device dimension attribution basis corresponding to the delay anomaly interface;
determining, based on the instance dimension attribution basis, the topology dimension attribution basis corresponding to the delay anomaly interface; and
determining, based on the device dimension attribution basis, the change dimension attribution basis corresponding to the delay anomaly interface.
5. The method according to claim 4, characterized in that determining, based on the target log aggregation result, the instance dimension attribution basis corresponding to the delay anomaly interface includes:
determining, based on the target log aggregation result, the number of calls and the instance delay time of each instance under the delay anomaly interface;
determining, based on the number of calls and the instance delay time of each instance under the delay anomaly interface, the abnormal instance and the normal instance corresponding to the delay anomaly interface; and
using the abnormal instance, the normal instance, and the number of calls and instance delay time corresponding to each of the normal instance and the abnormal instance as the instance dimension attribution basis corresponding to the delay anomaly interface.
6. The method according to claim 5, characterized in that determining, based on the target device aggregation result and the instance dimension attribution basis, the device dimension attribution basis corresponding to the delay anomaly interface includes:
obtaining, based on the target device aggregation result, the abnormal physical device corresponding to the abnormal instance, the normal physical device corresponding to the normal instance, and the resource utilization rates corresponding to the normal physical device and the abnormal physical device respectively; and
using the normal physical device, the abnormal physical device, and the resource utilization rates corresponding to the normal physical device and the abnormal physical device as the device dimension attribution basis corresponding to the delay anomaly interface.
7. The method according to claim 5, characterized in that determining, based on the instance dimension attribution basis, the topology dimension attribution basis corresponding to the delay anomaly interface includes:
obtaining the target upstream service corresponding to the delay anomaly service, and determining a first delay time when the target upstream service is called by the abnormal instance and a second delay time when the target upstream service is called by the normal instance; and
using the first delay time and the second delay time as the topology dimension attribution basis corresponding to the delay anomaly interface.
8. The method according to claim 6, characterized in that determining, based on the device dimension attribution basis, the change dimension attribution basis corresponding to the delay anomaly interface includes:
obtaining the change records of the abnormal physical device within a set historical time period, together with the time at which each change record was generated; and
using the change records and the times at which they were generated as the change dimension attribution basis corresponding to the delay anomaly interface.
9. A device for tracing the source of delay anomalies, characterized in that the device includes:
an anomaly information acquisition module, used to obtain, when it is determined that the target system has a delay anomaly, the delay anomaly service, the delay anomaly interface under the delay anomaly service, and the target time window in which the delay anomaly interface appears;
a target aggregation result acquisition module, used to obtain the target log aggregation result and the target device aggregation result of the delay anomaly interface within the target time window;
an attribution basis determination module, used to determine, based on the target log aggregation result and the target device aggregation result, the instance dimension attribution basis, device dimension attribution basis, topology dimension attribution basis, and change dimension attribution basis corresponding to the delay anomaly interface;
a multi-dimensional score determination module, used to generate, based on each attribution basis, multiple hypothetical root causes of the delay anomaly, and to determine, based on each attribution basis, the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause; and
an actual root cause determination module, used to determine, based on the instance dimension score, device dimension score, topology dimension score, and change dimension score corresponding to each hypothetical root cause, the actual root cause of the delay anomaly from among the hypothetical root causes.
10. An electronic device, characterized in that the electronic device includes:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the delay anomaly tracing method according to any one of claims 1-8.
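For illustration only, the log aggregation of claim 2 and the threshold check of claim 3 can be sketched as follows. The `CallLog` type, the function names, the use of a mean as the instance delay time of a same-type call log set, and the use of a call-weighted mean as the interface delay time are all assumptions; the claims do not fix these formulas.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CallLog:
    """Hypothetical original call log; one record per call."""
    window: int        # index of the time window the call falls into
    service: str
    interface: str
    instance: str
    status_code: int
    latency_ms: float  # instance delay time of this call


Key = Tuple[int, str, str, str, int]


def aggregate_logs(logs: Iterable[CallLog]) -> Dict[Key, dict]:
    """Group logs by (window, service, interface, instance, status code).

    Assumption: the instance delay time of a same-type set is the mean of
    its members; the number of calls is the number of logs in the set.
    """
    groups: Dict[Key, List[float]] = defaultdict(list)
    for log in logs:
        key = (log.window, log.service, log.interface, log.instance, log.status_code)
        groups[key].append(log.latency_ms)
    return {
        key: {"latency_ms": mean(values), "calls": len(values)}
        for key, values in groups.items()
    }


def anomalous_interfaces(
    aggregated: Dict[Key, dict], window: int, threshold_ms: float
) -> List[Tuple[str, str]]:
    """Return (service, interface) pairs whose delay exceeds the threshold.

    Assumption: the interface delay time is the call-weighted mean of the
    instance delay times under that interface within the given window.
    """
    totals: Dict[Tuple[str, str], List[float]] = defaultdict(lambda: [0.0, 0])
    for (w, service, interface, _instance, _status), stats in aggregated.items():
        if w != window:
            continue
        acc = totals[(service, interface)]
        acc[0] += stats["latency_ms"] * stats["calls"]
        acc[1] += stats["calls"]
    return sorted(
        pair for pair, (total, calls) in totals.items() if total / calls > threshold_ms
    )
```

In this sketch, a non-empty return value corresponds to the determination that the target system has a delay anomaly, and each returned pair identifies a delay anomaly service and a delay anomaly interface for the given time window.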