Microservice system root cause positioning method and system for hybrid deployment scenarios
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN UNIV
- Filing Date
- 2023-05-17
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies cannot accurately pinpoint the root causes of microservice systems in hybrid deployment scenarios. In particular, the increased scale and complexity of microservice systems lead to high costs and increased complexity in manual error detection.
An unsupervised learning method is used to construct an anomaly graph for a single microservice system. Frequent itemsets and causal inference algorithms are used to construct anomaly graphs for multiple microservice systems. A random walk algorithm is used to identify the root cause microservice, and weights are updated by combining business-level and container-level metrics.
It improves the accuracy of root cause analysis, can handle mixed deployment scenarios of multiple microservice systems, and integrates business-level and container-level metrics to comprehensively reflect the health status of the microservice operating environment.
Figure CN116737436B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computers, and in particular to a root cause localization method and system for microservice systems in hybrid deployment scenarios.
Background Technology
[0002] In recent years, as software companies have experienced continuous business growth, increased user scale, and greater data diversity, the costs of designing, developing, deploying, testing, and maintaining software using traditional monolithic architectures have become increasingly high. In microservice systems, applications are decomposed into small-scale, component-based, loosely coupled, autonomous, and decentralized services. Services communicate with each other through lightweight communication mechanisms such as HTTP, and automated deployment mechanisms using continuous integration and continuous deployment significantly reduce the complexity of the work for developers and operations personnel.
[0003] As software companies' requirements grow and their operations expand, modern software systems are becoming larger and more numerous. CPU and memory resources are valuable, yet their utilization is often low. Hybrid deployment has emerged as an important means to improve resource utilization, reduce costs for software companies, and break down resource barriers between departments. However, with the increasing scale of services, the growing complexity of inter-service dependencies, and the use of agile development and DevOps tools, code commits and version updates can reach hundreds per day. The cost and complexity of manually detecting errors and locating potential root causes increase accordingly. Automating root cause analysis is therefore a critical task.
[0004] In recent years, academia and industry have done a great deal of work on root cause localization in microservice systems. Chinese patent document CN115576732A, "Root Cause Localization Method and System," proposes to filter candidate virtual machines in a virtual machine cluster based on traffic change information, load the historical data associated with the failure time points of the candidate virtual machines, determine the anomaly information of the candidate virtual machines in a preset root cause localization dimension based on the historical data, and finally determine the target virtual machine from the candidates based on the anomaly information. This method is not suitable for root cause localization in microservice systems: it works at host-machine granularity, which is too coarse to isolate the true root cause microservice. Chinese patent document CN115756919A, "A Root Cause Localization Method and System for Multidimensional Data," proposes to acquire and preprocess data within a window before and after an anomaly occurs, predict the expected value of each attribute combination, calculate the deviation score of each attribute combination based on the actual value and the predicted expected value, cluster the data, determine whether the data conforms to the ripple effect, and select a root cause localization algorithm accordingly. This method focuses on scenarios in which a single microservice system is deployed and does not address hybrid deployments of multiple microservice systems. Therefore, how to construct a microservice root cause localization method suitable for hybrid deployment scenarios remains a challenge for cloud-native intelligent operations and maintenance.
Summary of the Invention
[0005] To address the problem that existing technologies cannot accurately pinpoint the root causes of microservices in hybrid deployment scenarios, this invention proposes a root cause localization method for microservice systems in hybrid deployment scenarios. This method uses unsupervised learning to construct an anomaly graph for a single microservice system based on the indicator data generated by the microservice system, and uses frequent itemsets and causal inference algorithms to construct anomaly graphs for multiple microservice systems. Finally, the weights of the edges in the graphs are updated, and a random walk algorithm is used to obtain the final root cause list. The microservice ranked first is the root cause microservice.
[0006] The technical solution of this invention is as follows:
[0007] The first aspect provides a root cause analysis method for microservice systems in hybrid deployment scenarios, including:
[0008] S1: Conduct preliminary chaos engineering experiments on microservice systems for hybrid deployment scenarios, collect chaos engineering datasets, and continuously monitor and collect business-level and container-level metrics through monitoring tools.
[0009] S2: An unsupervised learning algorithm is used to obtain the calling relationship of different microservice systems in a hybrid deployment scenario, and a single-system abnormal service dependency graph is constructed for each microservice system. In the single-system abnormal service dependency graph, the nodes are abnormal services and the edges represent the calling relationship between abnormal services.
[0010] S3: The frequent itemset mining algorithm and the causal inference algorithm are used to obtain the relationship between different microservices in the hybrid deployment scenario, and construct the abnormal service dependency graph of the multi-system. The nodes in the abnormal service dependency graph of the multi-system are abnormal services, and the edges represent the dependency relationship between abnormal services.
[0011] S4: Update the weights of the abnormal service dependency graph of multiple systems based on the correlation between container-level and business-level metrics of two services under the same microservice system;
[0012] S5: Perform a personalized random walk algorithm on the abnormal service dependency graph of the multi-system after weight update to achieve root cause localization.
[0013] In one implementation, step S1 includes:
[0014] Chaos engineering datasets are collected by injecting exceptions into microservice system instances for hybrid deployment scenarios using chaos engineering tools. The types of injected exceptions include instance exceptions, network exceptions, file system exceptions, and stress exceptions.
[0015] Business-level metrics include average latency, P90 latency, and P99 latency for each microservice, while container-level metrics include CPU, memory, network, and file system metrics during microservice operation.
[0016] In one implementation, step S2 includes:
[0017] An unsupervised learning algorithm is used to perform cluster analysis on the P90 latency index between different microservices to find the candidate set of abnormal services. If the input latency data is clustered into one class, the collected latency data is considered to be stable. If the input latency data is clustered into multiple classes, the collected latency data is considered to be discrete. In this case, the latency data between microservices is regarded as abnormal latency, and the calls between microservices are regarded as abnormal calls.
[0018] Construct the single-system abnormal service dependency graph by treating abnormal services as nodes and the call relationships between abnormal services as edges.
[0019] In one implementation, step S3 includes:
[0020] S3.1: Construct frequent itemsets using the Apriori algorithm based on the chaos engineering dataset;
[0021] S3.2: Mining strong relationships between different microservice systems based on the constructed frequent itemsets;
[0022] S3.3: Use the Granger causality test algorithm to test the causal relationship between strongly associated abnormal services and construct an abnormal service dependency graph for multiple systems.
[0023] In one implementation, step S3.1 includes:
[0024] Scan all anomalous microservices in the chaos engineering dataset, treating different microservices as different items, and generate the candidate 1-itemsets, each of which belongs to the set $C_1$;
[0025] Count the occurrences of each item and remove from the 1-itemsets those that do not meet the minimum support, thus obtaining the set $L_1$ of frequent 1-itemsets;
[0026] Apply a self-join and pruning strategy to $L_1$ to generate the set $C_2$ of candidate 2-itemsets. Scan the chaos engineering dataset, count each itemset in $C_2$, and delete the itemsets that do not meet the minimum support, thus obtaining the set $L_2$ of frequent 2-itemsets. Repeat this process: apply the self-join and pruning strategy to $L_{k-1}$ to generate the set $C_k$ of candidate k-itemsets, scan the transaction set and count each itemset in $C_k$, and remove the itemsets that do not meet the minimum support, thus obtaining the set $L_k$ of frequent k-itemsets.
[0027] In one implementation, step S3.2 includes:
[0028] For each frequent k-itemset, generate all of its non-empty subsets;
[0029] Let X and Y be two itemsets. An association rule is written $X \Rightarrow Y$, meaning that itemset Y can be derived from itemset X. The confidence of the rule $X \Rightarrow Y$ is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted $\mathrm{conf}(X \Rightarrow Y) = \mathrm{support}(X \cup Y) / \mathrm{support}(X)$. When $\mathrm{conf}(X \Rightarrow Y) \geq \mathrm{conf}_{min}$, the strong association rule $X \Rightarrow Y$ is obtained. Here $\mathrm{conf}(X \Rightarrow Y)$ represents the probability, or confidence, that the occurrence of itemset X leads to the occurrence of itemset Y, and $\mathrm{conf}_{min}$ is the minimum confidence.
[0030] In one implementation, step S3.3 includes:
[0031] Detect whether the abnormal services in the dependency graph of multiple single systems appear in the frequent itemset set. If several abnormal services appear in the frequent itemset set, use the Granger causality test to test the container-level indicators in the hybrid deployment scenario. If the change of a container-level indicator has a causal relationship, it indicates that the service anomalies between different microservice systems have a causal relationship.
[0032] If no abnormal service appears in the set of frequent itemsets, then a causal test is performed on all abnormal services across different microservice systems. When a causal relationship is found in the change of a certain container-level metric, it indicates that there is a causal relationship between the abnormal services in different systems.
[0033] In one implementation, step S4 includes:
[0034] Extract all container-level metrics and P90 latency data between two services in the same microservice system;
[0035] The Pearson correlation coefficient between the extracted container-level metrics and P90 latency data is calculated. The value of the largest positive correlation coefficient is used as the weight of the directed edge between services in the same microservice system, and the weights of the abnormal service dependency graph of the multi-system are updated.
[0036] In one implementation, step S5 includes:
[0037] S5.1: Define the basic transition matrix M of the multi-system abnormal service dependency graph MSDG. M is represented by formula (1):
[0038] $M = [m_{ij}]_{n \times n}$ (1)
[0039] For each node v in the MSDG, assume it has k outgoing edges that connect to nodes $u_1, u_2, \ldots, u_k$. The element in the i-th row and j-th column of M is set to the weight $w_{ij}$ of the corresponding edge divided by the out-degree k of node v, as represented by formula (2):
[0040] $m_{ij} = w_{ij} / k$ (2)
[0041] Each element in M represents the transition probability from one node to another;
[0042] S5.2: Introduce a completely random transition matrix E, which means that the transition probability from one node to any other node is 1 / n, where n is the number of nodes in the MSDG. Define a damping factor d to control the ratio between M and E, where 0≤d≤1.
[0043] S5.3: By weighting M and E, the complete transition matrix P of the MSDG is obtained, i.e., $P = dM + (1-d)E$. P is used as the transition matrix of the Markov chain of a general random walk and is applied iteratively. In each iteration, the current state vector is multiplied by P to obtain a new state vector. This process is repeated until the state vector converges, at which point the stationary distribution R is reached. R is an n-dimensional vector whose components sum to 1. Each component is the score of the corresponding node in the MSDG, i.e., its PageRank value, which represents the importance and influence of the node in the MSDG. R is represented by formula (3):
[0044] $R = [PR(v_1), PR(v_2), \ldots, PR(v_n)]$ (3)
[0045] where $PR(v_1)$ and $PR(v_n)$ are the PageRank values of node $v_1$ and node $v_n$, respectively;
[0046] S5.4: Sort the PageRank values of nodes in MSDG in descending order, and take the service ranked first as the root cause microservice, that is, the microservice most likely to cause the abnormal situation.
[0047] Based on the same inventive concept, a second aspect of the present invention provides a root cause localization system for microservice systems in hybrid deployment scenarios, comprising:
[0048] The data collection module is used to conduct preliminary chaos engineering experiments on microservice systems for hybrid deployment scenarios, collect chaos engineering datasets, and continuously monitor and collect business-level and container-level metrics through monitoring tools.
[0049] The single-application anomaly graph construction module is used to obtain the call relationships of different microservice systems in a hybrid deployment scenario using an unsupervised learning algorithm, and to construct a single-system abnormal service dependency graph for each microservice system. In the single-system abnormal service dependency graph, the nodes are abnormal services, and the edges represent the call relationships between abnormal services.
[0050] The multi-application anomaly graph construction module is used to apply frequent itemset mining and causal inference algorithms to determine the relationships between different microservices in a hybrid deployment scenario, and to construct the multi-system abnormal service dependency graph. Nodes in the multi-system abnormal service dependency graph are abnormal services, and edges represent the dependencies between abnormal services.
[0051] The comprehensive ranking module is used to update the weights of the multi-system abnormal service dependency graph based on the correlation between container-level and business-level metrics of two services under the same microservice system.
[0052] The comprehensive ranking module is further used to perform a personalized random walk algorithm on the weight-updated multi-system abnormal service dependency graph to achieve root cause localization.
[0053] Compared with the prior art, the technical solution provided by the present invention has at least the following technical effects:
[0054] The root cause localization method proposed in this invention constructs abnormal service dependency graphs for single systems and multiple systems, respectively. On the one hand, it can handle the situation of multiple microservice systems being deployed in a mixed manner. On the other hand, by integrating business-level metrics and container-level metrics of the microservices, it can comprehensively reflect the health status of the microservices' operating environment, thereby improving the accuracy of root cause localization.
Description of the Drawings
[0055] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0056] Figure 1 is a flowchart of a root cause localization method for microservice systems in hybrid deployment scenarios provided by an embodiment of the present invention;
[0057] Figure 2 is a framework diagram of a microservice system for hybrid deployment scenarios in an embodiment of the present invention;
[0058] Figure 3 is a schematic diagram of generating multiple SSDGs based on collected business-level metrics in a specific embodiment of the method of the present invention;
[0059] Figure 4 is a schematic diagram of constructing an MSDG from multiple SSDGs and assigning weights to the MSDG in a specific embodiment of the method of the present invention;
[0060] Figure 5 shows the experimental results on Online-Boutique, Sock-Shop, and Train-Ticket in a specific embodiment of the method of the present invention.
Detailed Implementation
[0061] This invention proposes a root cause localization method for microservice systems in hybrid deployment scenarios. The method includes the following steps: First, data collection is performed, including a preliminary chaos engineering experiment to collect a fault dataset for the hybrid deployment system; second, container-level and business-level metrics of different microservice systems in the hybrid deployment scenario are collected; then, an unsupervised learning algorithm is used to derive the call relationships between different microservice systems in the hybrid deployment scenario and construct an anomaly service dependency graph for each microservice system; next, frequent itemset mining and causal inference algorithms are used to derive the connections between different microservices in the hybrid deployment scenario and construct an anomaly service dependency graph for multiple systems; the anomaly weights in the multi-system anomaly service dependency graph are updated; finally, a personalized random walk algorithm is used to rank the anomaly services of the multiple systems to achieve root cause localization.
[0062] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0063] Example 1
[0064] This invention discloses a root cause localization method for microservice systems in hybrid deployment scenarios. Referring to Figure 1, the method includes:
[0065] S1: Conduct preliminary chaos engineering experiments on microservice systems for hybrid deployment scenarios, collect chaos engineering datasets, and continuously monitor and collect business-level and container-level metrics through monitoring tools.
[0066] S2: An unsupervised learning algorithm is used to obtain the calling relationship of different microservice systems in a hybrid deployment scenario, and a single-system abnormal service dependency graph is constructed for each microservice system. In the single-system abnormal service dependency graph, the nodes are abnormal services and the edges represent the calling relationship between abnormal services.
[0067] S3: The frequent itemset mining algorithm and the causal inference algorithm are used to obtain the relationship between different microservices in the hybrid deployment scenario, and construct the abnormal service dependency graph of the multi-system. The nodes in the abnormal service dependency graph of the multi-system are abnormal services, and the edges represent the dependency relationship between abnormal services.
[0068] S4: Update the weights of the abnormal service dependency graph of multiple systems based on the correlation between container-level and business-level metrics of two services under the same microservice system;
[0069] S5: Perform a personalized random walk algorithm on the abnormal service dependency graph of the multi-system after weight update to achieve root cause localization.
[0070] The single-system abnormal service dependency graph is also called the single-application service dependency graph, abbreviated as SSDG. The multi-system abnormal service dependency graph is also called the multi-application service dependency graph, abbreviated as MSDG.
[0071] Compared with the prior art, the present invention has the following advantages and technical effects:
[0072] 1. The invention provides a root cause localization method for microservice systems in hybrid deployment scenarios, which can handle situations where multiple microservice systems are deployed in a hybrid manner. Current research mainly focuses on single deployment scenarios, while hybrid deployment scenarios have received little attention.
[0073] 2. The proposed root cause localization method for microservice systems in hybrid deployment scenarios integrates business-level metrics with container-level metrics, the latter including CPU, memory, network, and file system metrics. Taken together, business-level and container-level metrics comprehensively reflect the health of the microservices' operating environment, thereby improving the accuracy of root cause localization.
[0074] In one implementation, step S1 includes:
[0075] Chaos engineering datasets are collected by injecting exceptions into microservice system instances for hybrid deployment scenarios using chaos engineering tools. The types of injected exceptions include instance exceptions, network exceptions, file system exceptions, and stress exceptions.
[0076] Business-level metrics include average latency, P90 latency, and P99 latency for each microservice, while container-level metrics include CPU, memory, network, and file system metrics during microservice operation.
[0077] Specifically, the chaos engineering experiment used chaos engineering tools to inject exceptions into hybrid-deployed microservice instances. Instance exceptions included instance failure and instance killing; network exceptions included network partitioning, network packet loss, network latency, sending duplicate packets, and network packet errors; file system exceptions included latency in file system calls and file system return errors; and stress exceptions included CPU full load stress and memory full load stress.
[0078] Continuously monitoring and collecting business-level and container-level metrics through monitoring tools refers to collecting business-level and container-level metrics through Prometheus.
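As a rough illustration of the business-level metrics named above, the following sketch computes the average, P90, and P99 latency from raw per-request latency samples using the nearest-rank percentile definition. The function names and sample values are hypothetical. In the embodiment these metrics are collected through Prometheus rather than computed by hand, so this only shows what the three numbers mean.

```python
def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_metrics(samples):
    """Return the three business-level metrics for one call path."""
    return {
        "avg": sum(samples) / len(samples),
        "p90": percentile(samples, 90),
        "p99": percentile(samples, 99),
    }


# Hypothetical request latencies in milliseconds for one caller/callee pair.
samples = [12, 15, 11, 14, 13, 16, 12, 250, 13, 14]
metrics = latency_metrics(samples)
print(metrics)
```

A single slow request barely moves the P90 value here but dominates P99 and the average, which is one reason to track several latency metrics side by side.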
[0079] In one implementation, step S2 includes:
[0080] An unsupervised learning algorithm is used to perform cluster analysis on the P90 latency index between different microservices to find candidate sets of abnormal services. If the input latency data is clustered into one class, the collected latency data is considered to be stable. If the input latency data is clustered into multiple classes, the collected latency data is considered to be discrete. In this case, the latency data between microservices is regarded as abnormal latency, and the calls between microservices are regarded as abnormal calls.
[0081] Construct the single-system abnormal service dependency graph by treating abnormal services as nodes and the call relationships between abnormal services as edges.
[0082] Specifically, the latency data is the P90 latency metric. Figure 3 illustrates the abnormal microservice call topologies generated for multiple microservice systems from the collected business-level metrics. As shown in Figure 3, the first column represents the P90 latency from one microservice to another. For example, `frontend_adservice&P90` indicates the P90 latency between the `frontend` microservice and `adservice`. Based on the latency data of the multiple microservice systems, the abnormal service dependency graphs of these systems can be obtained using the BIRCH algorithm.
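The decision rule of step S2 (one cluster means stable latency, several clusters mean an abnormal call) can be sketched as follows. This is not BIRCH: to stay dependency-free it substitutes a simple one-pass, distance-threshold clustering, and the `threshold` value, the helper names, and the latency series are all assumptions made for illustration. The key format `caller_callee&P90` follows the example in the text, and the sketch assumes the caller name contains no underscore.

```python
def count_clusters(values, threshold):
    """One-pass 1-D clustering (a stand-in for BIRCH): each value joins the
    nearest cluster whose centroid is within `threshold`, else starts one."""
    clusters = []  # each cluster is stored as [sum_of_values, count]
    for v in values:
        best = None
        for c in clusters:
            dist = abs(v - c[0] / c[1])
            if dist <= threshold and (best is None or dist < best[0]):
                best = (dist, c)
        if best is None:
            clusters.append([v, 1])
        else:
            best[1][0] += v
            best[1][1] += 1
    return len(clusters)


def build_ssdg(p90_series, threshold):
    """Return (nodes, edges) of the abnormal service dependency graph."""
    nodes, edges = set(), set()
    for key, values in p90_series.items():
        if count_clusters(values, threshold) > 1:  # discrete => abnormal call
            caller, callee = key.split("&")[0].split("_", 1)
            nodes.update((caller, callee))
            edges.add((caller, callee))
    return nodes, edges


# Hypothetical P90 latency series (ms) for two call paths of one system.
p90_series = {
    "frontend_adservice&P90": [20, 22, 21, 180, 190, 23],
    "frontend_cartservice&P90": [30, 31, 29, 30, 32, 31],
}
nodes, edges = build_ssdg(p90_series, threshold=25)
print(nodes, edges)
```

Only the `frontend` to `adservice` call splits into two latency clusters, so it is the only edge kept in the resulting graph.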
[0083] In one implementation, step S3 includes:
[0084] S3.1: Construct frequent itemsets using the Apriori algorithm based on the chaos engineering dataset;
[0085] S3.2: Mining strong relationships between different microservice systems based on the constructed frequent itemsets;
[0086] S3.3: Use the Granger causality test algorithm to test the causal relationship between strongly associated abnormal services and construct an abnormal service dependency graph for multiple systems.
[0087] In one implementation, step S3.1 includes:
[0088] Scan all anomalous microservices in the chaos engineering dataset, treating different microservices as different items, and generate the candidate 1-itemsets, each of which belongs to the set $C_1$;
[0089] Count the occurrences of each item and remove from the 1-itemsets those that do not meet the minimum support, thus obtaining the set $L_1$ of frequent 1-itemsets;
[0090] Apply a self-join and pruning strategy to $L_1$ to generate the set $C_2$ of candidate 2-itemsets. Scan the chaos engineering dataset, count each itemset in $C_2$, and delete the itemsets that do not meet the minimum support, thus obtaining the set $L_2$ of frequent 2-itemsets. Repeat this process: apply the self-join and pruning strategy to $L_{k-1}$ to generate the set $C_k$ of candidate k-itemsets, scan the transaction set and count each itemset in $C_k$, and remove the itemsets that do not meet the minimum support, thus obtaining the set $L_k$ of frequent k-itemsets.
[0091] In the specific implementation, $L_k$ is the set of frequent k-itemsets and $C_k$ is the set of candidate k-itemsets.
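The level-wise procedure of step S3.1 can be sketched in a few lines. The sketch assumes that each transaction is the set of services found abnormal during one fault-injection run and that minimum support is given as an absolute count. Both are assumptions, as are the service names; the text does not specify the transaction format.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # C1: every distinct item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan the dataset and count each candidate in C_k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # L_k: keep candidates that meet the minimum support.
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Self-join L_{k-1} to form k-itemsets ...
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune those with an infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical runs: services of systems A and B found abnormal together.
transactions = [
    {"A.frontend", "A.cart", "B.orders"},
    {"A.cart", "B.orders"},
    {"A.frontend", "A.cart"},
    {"A.cart", "B.orders", "B.payment"},
]
frequent = apriori(transactions, min_support=2)
print(frequent)
```

In this toy dataset the pair `A.cart` and `B.orders` is frequent across systems, which is the kind of cross-system co-occurrence the later steps examine.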
[0092] In one implementation, step S3.2 includes:
[0093] For each frequent k-itemset, generate all of its non-empty subsets;
[0094] Let X and Y be two itemsets. An association rule is written $X \Rightarrow Y$, meaning that itemset Y can be derived from itemset X. The confidence of the rule $X \Rightarrow Y$ is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted $\mathrm{conf}(X \Rightarrow Y) = \mathrm{support}(X \cup Y) / \mathrm{support}(X)$. When $\mathrm{conf}(X \Rightarrow Y) \geq \mathrm{conf}_{min}$, the strong association rule $X \Rightarrow Y$ is obtained. Here $\mathrm{conf}(X \Rightarrow Y)$ represents the probability, or confidence, that the occurrence of itemset X leads to the occurrence of itemset Y, and $\mathrm{conf}_{min}$ is the minimum confidence.
[0095] Specifically, constructing strong associations between different system services based on frequent itemsets refers to using the set of frequent k-itemsets $L_k$ to build strong relationships between different services.
[0096] All non-empty subsets of a frequent itemset are also frequent itemsets, thus ensuring that all generated strong association rules are related to frequent k-itemsets and their subsets.
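A minimal sketch of the rule-generation step follows, assuming the support counts of the frequent itemsets are already available as a dictionary. The counts, the service names, and the `min_conf` value are hypothetical. The lookup `support[x]` relies on the property stated above that every non-empty subset of a frequent itemset is itself frequent.

```python
from itertools import combinations


def strong_rules(support, min_conf):
    """Return (X, Y, confidence) for every rule X => Y whose confidence,
    support(X u Y) / support(X), is at least min_conf."""
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for r in range(1, len(itemset)):
            for x in combinations(sorted(itemset), r):
                x = frozenset(x)
                y = itemset - x
                conf = count / support[x]
                if conf >= min_conf:
                    rules.append((x, y, conf))
    return rules


# Hypothetical support counts for frequent itemsets.
support = {
    frozenset({"A.cart"}): 4,
    frozenset({"B.orders"}): 3,
    frozenset({"A.cart", "B.orders"}): 3,
}
rules = strong_rules(support, min_conf=0.8)
print(rules)
```

Here the rule from `B.orders` to `A.cart` has confidence 3/3 and is kept, while the reverse rule has confidence 3/4 and falls below the assumed threshold.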
[0097] In one implementation, step S3.3 includes:
[0098] Detect whether the abnormal services in the SSDG dependency graph of multiple single systems appear in the frequent itemset set. If several abnormal services appear in the frequent itemset set, use the Granger causality test to test the container-level indicators in the hybrid deployment scenario. If the change of a container-level indicator has a causal relationship, it indicates that the service anomalies between different microservice systems have a causal relationship.
[0099] If no abnormal service appears in the set of frequent itemsets, then a causal test is performed on all abnormal services across different microservice systems. When a causal relationship is found in the change of a certain container-level metric, it indicates that there is a causal relationship between the abnormal services in different systems.
[0100] Specifically, using the Granger causality test algorithm to test the causal relationship between strongly associated abnormal services and constructing a multi-application service dependency graph means that before building the MSDG, it is necessary to determine the causal relationship between different system services.
[0101] In the specific implementation, as shown in part (a) of Figure 4, if the abnormal service $S_A$ of microservice system A is the cause of the abnormal service $S_B$ of microservice system B, an edge pointing from $S_A$ to $S_B$ is added to the MSDG. The edge indicates the direction of the causal relationship, which can be interpreted as a dependency between the services.
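For reference, the Granger causality test is commonly formulated as a comparison between two autoregressive models of the same series. The notation below is introduced here for illustration and does not appear in the original text: $x_t$ and $y_t$ stand for a container-level metric of $S_A$ and of $S_B$, $p$ is the lag order, and $T$ is the number of observations used in the regressions.

```latex
\begin{align}
  % Restricted model: y is explained by its own past only
  y_t &= a_0 + \sum_{i=1}^{p} a_i\, y_{t-i} + \varepsilon_t \\
  % Unrestricted model: past values of x are added as regressors
  y_t &= a_0 + \sum_{i=1}^{p} a_i\, y_{t-i} + \sum_{i=1}^{p} b_i\, x_{t-i} + u_t \\
  % F statistic for the null hypothesis b_1 = \dots = b_p = 0
  F &= \frac{(RSS_r - RSS_u)/p}{RSS_u/(T - 2p - 1)}
\end{align}
```

Here $RSS_r$ and $RSS_u$ are the residual sums of squares of the restricted and unrestricted models. Under the null hypothesis that all $b_i$ are zero, $F$ follows an $F(p,\, T-2p-1)$ distribution. If the null hypothesis is rejected at the chosen significance level, $x$ is said to Granger-cause $y$, which under the scheme above would correspond to adding the edge from $S_A$ to $S_B$.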
[0102] In one implementation, step S4 includes:
[0103] Extract all container-level metrics and P90 latency data between two services in the same microservice system;
[0104] The Pearson correlation coefficient between the extracted container-level metrics and P90 latency data is calculated. The value of the largest positive correlation coefficient is used as the weight of the directed edge between services in the same microservice system, and the weights of the abnormal service dependency graph of the multi-system are updated.
[0105] Specifically, for edges between services within the same microservice system, it's necessary to extract all container-level metrics for both services, as well as the P90 latency data between them, and calculate the Pearson correlation coefficient between the extracted container-level metrics and P90 latency data. The value of the largest positive correlation coefficient will be used as the weight of the directed edge between services within the same microservice system. In practice, when calculating the Pearson correlation coefficient, a positive result indicates a positive correlation between the two types of data, and the stronger the correlation, the larger the correlation coefficient r will be. Therefore, by calculating the Pearson correlation coefficients of all service container-level metrics and the latency data between services, the correlation between services within the same microservice system can be found and used as the weight of the directed edge.
[0106] In one implementation, the edges between services in different microservice applications are assigned a fixed weight α (α∈[0,1]). α can be optimized and adjusted by the developers, and is typically set to 0.4.
[0107] The MSDG after the weight update is shown in part (b) of Figure 4, which illustrates the weighted MSDG of systems A and B.
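The weighting rule of step S4 can be sketched as below. Within one system, the edge weight is the largest positive Pearson coefficient between any container-level metric and the P90 latency of the call; across systems, it is the fixed value α. The function names and metric series are hypothetical. Returning 0.0 when no coefficient is positive, or when a series is constant, is an assumption, since the text does not cover those cases.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # assumption: a constant series counts as uncorrelated
    return cov / (std_x * std_y)


def edge_weight(container_metrics, p90_latency, same_system, alpha=0.4):
    """Weight of one directed MSDG edge."""
    if not same_system:
        return alpha  # fixed weight for edges between different systems
    coefficients = (pearson(m, p90_latency) for m in container_metrics.values())
    positive = [r for r in coefficients if r > 0]
    return max(positive, default=0.0)  # assumption: 0.0 if none is positive


# Hypothetical series: CPU rises with latency, memory falls.
p90 = [10, 20, 30, 40]
container_metrics = {"cpu": [1, 2, 3, 4], "memory": [4, 3, 2, 1]}
w_same = edge_weight(container_metrics, p90, same_system=True)
w_cross = edge_weight(container_metrics, p90, same_system=False)
print(w_same, w_cross)
```

The negatively correlated memory series is ignored, so the same-system edge takes the CPU coefficient, which is approximately 1.0.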
[0108] In one implementation, step S5 includes:
[0109] S5.1: Define the basic transition matrix M of the multi-system abnormal service dependency graph MSDG. M is represented by formula (1):
[0110] $M = [m_{ij}]_{n \times n}$ (1)
[0111] For each node v in the MSDG, assume it has k outgoing edges that connect to nodes $u_1, u_2, \ldots, u_k$. The element in the i-th row and j-th column of M is set to the weight $w_{ij}$ of the corresponding edge divided by the out-degree k of node v, as represented by formula (2):
[0112] $m_{ij} = w_{ij} / k$ (2)
[0113] Each element in M represents the transition probability from one node to another;
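As an illustration of formulas (1) and (2), the following sketch builds M from a list of weighted directed edges. The function name `transition_matrix`, the edge-tuple layout, and the example node names are assumptions made for the example; the sketch follows formula (2) literally and does not renormalize rows, so a row of M need not sum to 1.

```python
def transition_matrix(nodes, weighted_edges):
    """Build M = [m_ij] (n x n) with m_ij = w_ij / k.

    nodes: list of node names; position in the list is the matrix index.
    weighted_edges: list of (source, target, weight) tuples, assumed to
                    contain at most one edge per ordered node pair.
    k is the out-degree of the source node, as in formula (2).
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # Count the outgoing edges of every node.
    out_degree = [0] * n
    for src, _dst, _w in weighted_edges:
        out_degree[index[src]] += 1

    # Fill in m_ij = w_ij / k; entries without an edge stay 0.
    M = [[0.0] * n for _ in range(n)]
    for src, dst, w in weighted_edges:
        i, j = index[src], index[dst]
        M[i][j] = w / out_degree[i]
    return M

if __name__ == "__main__":
    # Hypothetical MSDG: a1 and a2 belong to system A, b1 to system B.
    # The cross-system edges carry the fixed weight alpha = 0.4.
    nodes = ["a1", "a2", "b1"]
    edges = [("a1", "a2", 0.8), ("a1", "b1", 0.4), ("a2", "b1", 0.4)]
    for row in transition_matrix(nodes, edges):
        print(row)
```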
[0114] S5.2: Introduce a completely random transition matrix E, which means that the transition probability from one node to any other node is 1 / n, where n is the number of nodes in the MSDG. Define a damping factor d to control the ratio between M and E, i.e., the linear combination coefficient, where 0≤d≤1.
[0115] S5.3: By weighting M and E, the complete transition matrix P of the MSDG is obtained, i.e., P = dM + (1-d)E. P is used as the transition matrix of the Markov chain of a general random walk and is applied iteratively. In each iteration, the current state vector is multiplied by P to obtain a new state vector. This process is repeated until the state vector converges to a stationary distribution R. R is an n-dimensional vector whose components sum to 1. Each component is the score of the corresponding node in the MSDG, i.e., its PageRank value, which represents the importance and influence of that node in the MSDG. R is represented by formula (3):
[0116] $R = (PR(v_1), PR(v_2), \ldots, PR(v_n))$ (3)
[0117] where $PR(v_1)$ and $PR(v_n)$ represent the PageRank values of node $v_1$ and node $v_n$, respectively;
[0118] S5.4: Sort the PageRank values of nodes in MSDG in descending order, and take the service ranked first as the root cause microservice, that is, the microservice most likely to cause the abnormal situation.
[0119] Specifically, each node in the MSDG corresponds one-to-one with a microservice in the microservice system. The PageRank values of the nodes in the MSDG are sorted to identify the microservices with higher scores, thus pinpointing potential root causes. In this process, the service with the highest score is typically considered the root cause microservice, i.e., the microservice most likely to cause the anomaly.
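Steps S5.2 to S5.4 can be sketched as a simple power iteration. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `rank_nodes`, the default d = 0.85 (a conventional PageRank value, not taken from this document), the uniform starting vector, the convergence tolerance, and the per-iteration renormalization (added so that the components of R sum to 1 even when the rows of M do not) are all choices made for the example.

```python
def rank_nodes(nodes, M, d=0.85, tol=1e-10, max_iter=1000):
    """Rank MSDG nodes by iterating the state vector with P = d*M + (1-d)*E.

    nodes: list of node names, in the same order as the rows of M.
    M: basic transition matrix as a list of lists (n x n).
    d: damping factor; d < 1 is assumed so every entry of P is positive.
    Returns (node, score) pairs sorted by score in descending order.
    """
    n = len(nodes)
    # Every entry of E is 1/n, so P[i][j] = d*M[i][j] + (1-d)/n.
    P = [[d * M[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

    R = [1.0 / n] * n  # uniform initial state vector (assumption)
    for _ in range(max_iter):
        # Multiply the current state vector (as a row vector) by P.
        new_R = [sum(R[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Renormalize so the components sum to 1 (assumption, see lead-in).
        total = sum(new_R)
        new_R = [x / total for x in new_R]
        converged = sum(abs(a - b) for a, b in zip(new_R, R)) < tol
        R = new_R
        if converged:
            break

    return sorted(zip(nodes, R), key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical 3-node MSDG; M follows m_ij = w_ij / k.
    nodes = ["a1", "a2", "b1"]
    M = [[0.0, 0.4, 0.2],
         [0.0, 0.0, 0.4],
         [0.0, 0.0, 0.0]]
    ranking = rank_nodes(nodes, M)
    print("root cause candidate:", ranking[0][0])
    for name, score in ranking:
        print(name, round(score, 4))
```

In this hypothetical graph, b1 receives weight from both other nodes and is therefore ranked first, which corresponds to taking the top-ranked service as the root cause microservice in S5.4.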
[0120] In this embodiment, the present invention was tested on the open-source microservice systems Online-Boutique, Sock-Shop, and Train-Ticket.
[0121] Figure 5 presents the experimental results of this invention in the Online-Boutique, Sock-Shop, and Train-Ticket environments, where HybridMRCL denotes this invention. This invention is compared with the Random Walk (RW), MicroRCA, MicroRCA*, FRL-MFPG, and FRL-MFPG* methods. For MicroRCA and FRL-MFPG, connections between the systems and their nodes are added manually so that these methods can operate in a hybrid deployment environment. MicroRCA* and FRL-MFPG* use the method of this invention to construct the relationships between multiple microservice systems. The experimental results show that this invention achieves higher accuracy than the existing methods.
[0122] Example 2
[0123] Based on the same inventive concept, this embodiment discloses a root cause localization system for microservice systems in hybrid deployment scenarios. Referring to Figure 2, the system includes:
[0124] The data collection module is used to conduct preliminary chaos engineering experiments on microservice systems for hybrid deployment scenarios, collect chaos engineering datasets, and continuously monitor and collect business-level and container-level metrics through monitoring tools.
[0125] The single-application anomaly graph construction module is used to obtain the call relationships of the different microservice systems in a hybrid deployment scenario using an unsupervised learning algorithm, and to construct a single-system abnormal service dependency graph for each microservice system. In the single-system abnormal service dependency graph, the nodes are abnormal services, and the edges represent the call relationships between abnormal services.
[0126] The multi-application anomaly graph construction module is used to obtain the relationships between different microservices in a hybrid deployment scenario using a frequent itemset mining algorithm and a causal inference algorithm, and to construct the multi-system abnormal service dependency graph. In the multi-system abnormal service dependency graph, the nodes are abnormal services, and the edges represent the dependencies between abnormal services.
[0127] The comprehensive ranking module is used to update the weights of the abnormal service dependency graph of multiple systems based on the correlation between container-level and business-level metrics of two services under the same microservice system.
[0128] The comprehensive ranking module further executes a personalized random walk algorithm on the multi-system abnormal service dependency graph after the weight update to achieve root cause localization.
[0129] Since the system described in Embodiment 2 of this invention is the system used to implement the root cause localization method for microservice systems in hybrid deployment scenarios in Embodiment 1 of this invention, those skilled in the art can understand the specific structure and variations of this system based on the method described in Embodiment 1 of this invention, and therefore will not be repeated here. All systems used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.
[0130] Example 3
[0131] Based on the same inventive concept, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed, implements the method described in Embodiment 1.
[0132] Since the computer-readable storage medium described in Embodiment 3 of this invention is the same computer-readable storage medium used in implementing the root cause localization method for microservice systems in hybrid deployment scenarios in Embodiment 1 of this invention, those skilled in the art can understand the specific structure and variations of this computer-readable storage medium based on the method described in Embodiment 1 of this invention, and therefore will not be repeated here. All computer-readable storage media used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.
[0133] Example 4
[0134] Based on the same inventive concept, this application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method in Embodiment 1.
[0135] Since the computer device described in Embodiment 4 of this invention is the same computer device used to implement the root cause localization method for microservice systems in hybrid deployment scenarios in Embodiment 1 of this invention, those skilled in the art can understand the specific structure and variations of this computer device based on the method described in Embodiment 1 of this invention, and therefore will not be repeated here. All computer devices used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.
[0136] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0137] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0138] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various modifications and variations to the embodiments of the invention without departing from the spirit and scope of the invention. Thus, if these modifications and variations of the embodiments of the invention fall within the scope of the claims of the invention and their equivalents, the invention also intends to include these modifications and variations.
Claims
1. A root cause localization method for microservice systems in hybrid deployment scenarios, characterized in that the method comprises: S1: conducting preliminary chaos engineering experiments on the microservice systems in a hybrid deployment scenario, collecting chaos engineering datasets, and continuously monitoring and collecting business-level and container-level metrics through monitoring tools; S2: using an unsupervised learning algorithm to obtain the call relationships of the different microservice systems in the hybrid deployment scenario, and constructing a single-system abnormal service dependency graph for each microservice system, wherein in the single-system abnormal service dependency graph the nodes are abnormal services and the edges represent the call relationships between abnormal services; S3: using a frequent itemset mining algorithm and a causal inference algorithm to obtain the relationships between different microservices in the hybrid deployment scenario, and constructing the multi-system abnormal service dependency graph, wherein in the multi-system abnormal service dependency graph the nodes are abnormal services and the edges represent the dependencies between abnormal services; S4: updating the weights of the multi-system abnormal service dependency graph based on the correlation between the container-level and business-level metrics of two services under the same microservice system; S5: executing a personalized random walk algorithm on the multi-system abnormal service dependency graph after the weight update to achieve root cause localization.
2. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 1, characterized in that step S1 includes: collecting chaos engineering datasets by injecting anomalies into microservice system instances in the hybrid deployment scenario using chaos engineering tools, wherein the types of injected anomalies include instance anomalies, network anomalies, file system anomalies, and stress anomalies; the business-level metrics include the average latency, P90 latency, and P99 latency of each microservice, and the container-level metrics include CPU, memory, network, and file system metrics during microservice operation.
3. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 2, characterized in that step S2 includes: using an unsupervised learning algorithm to perform cluster analysis on the P90 latency metric between different microservices to find a candidate set of abnormal services, wherein if the input latency data is clustered into one class, the collected latency data is considered stable, and if the input latency data is clustered into multiple classes, the collected latency data is considered discrete, in which case the latency data between the microservices is regarded as abnormal latency and the calls between the microservices are regarded as abnormal calls; and constructing the single-system abnormal service dependency graph by treating abnormal services as nodes and the call relationships between abnormal services as edges.
4. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 1, characterized in that step S3 includes: S3.1: constructing frequent itemsets using the Apriori algorithm based on the chaos engineering dataset; S3.2: mining strong relationships between different microservice systems based on the constructed frequent itemsets; S3.3: using the Granger causality test algorithm to test the causal relationships between strongly associated abnormal services, and constructing the multi-system abnormal service dependency graph.
5. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 4, characterized in that step S3.1 includes: scanning all abnormal microservices in the chaos engineering dataset, where different microservices are treated as different items, and the items are arranged and combined to generate 1-itemsets, each of which belongs to the set $C_1$; counting each item and removing from all 1-itemsets those that do not meet the minimum support, thus obtaining the set $L_1$ of frequent 1-itemsets; applying a self-join and pruning strategy to $L_1$ to generate the set $C_2$ of 2-itemsets, scanning the chaos engineering dataset, counting each itemset in $C_2$, and deleting the itemsets that do not meet the minimum support, thus obtaining the set $L_2$ of frequent 2-itemsets; and repeating this process, in which a self-join and pruning strategy is applied to $L_{k-1}$ to generate the set $C_k$ of k-itemsets, the transaction set is scanned, each itemset in $C_k$ is counted, and the itemsets that do not meet the minimum support are removed, thus obtaining the set $L_k$ of frequent k-itemsets.
6. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 5, characterized in that step S3.2 includes: for each frequent k-itemset, generating all of its non-empty subsets; letting X and Y be two itemsets, an association rule is defined as $X \Rightarrow Y$, meaning that itemset Y can be derived from itemset X; for an association rule $X \Rightarrow Y$, its confidence is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted $conf(X \Rightarrow Y)$; when $conf(X \Rightarrow Y) \geq conf_{min}$, the association rule $X \Rightarrow Y$ is obtained, where $conf(X \Rightarrow Y)$ represents the probability, or confidence, that the occurrence of itemset X leads to the occurrence of itemset Y, and $conf_{min}$ represents the minimum confidence.
7. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 4, characterized in that step S3.3 includes: detecting whether the abnormal services in the multiple single-system abnormal service dependency graphs appear in the set of frequent itemsets; if several abnormal services appear in the set of frequent itemsets, using the Granger causality test to test their container-level metrics in the hybrid deployment scenario, wherein if the change of a container-level metric exhibits a causal relationship, the service anomalies between the different microservice systems have a causal relationship; and if no abnormal service appears in the set of frequent itemsets, performing the causality test on all abnormal services across the different microservice systems, wherein if the change of a container-level metric exhibits a causal relationship, the abnormal services in the different systems have a causal relationship.
8. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 1, characterized in that step S4 includes: extracting all container-level metrics of two services in the same microservice system and the P90 latency data between them; and calculating the Pearson correlation coefficient between the extracted container-level metrics and the P90 latency data, wherein the largest positive correlation coefficient is used as the weight of the directed edge between the services in the same microservice system, and the weights of the multi-system abnormal service dependency graph are updated accordingly.
9. The root cause localization method for microservice systems in hybrid deployment scenarios as described in claim 1, characterized in that step S5 includes: S5.1: defining the basic transition matrix M of the multi-system abnormal service dependency graph MSDG, where M is represented by formula (1): $M = [m_{ij}]_{n \times n}$ (1); for each node v in the MSDG, assume it has k outgoing edges that connect to nodes $u_1, u_2, \ldots, u_k$; the element in the i-th row and j-th column of M is set to the weight $w_{ij}$ of the corresponding edge divided by the out-degree k of node v, as represented by formula (2): $m_{ij} = w_{ij} / k$ (2); each element in M represents the transition probability from one node to another; S5.2: introducing a completely random transition matrix E, meaning that the transition probability from one node to any other node is 1/n, where n is the number of nodes in the MSDG, and defining a damping factor d to control the ratio between M and E, where 0≤d≤1; S5.3: weighting M and E to obtain the complete transition matrix P of the MSDG, i.e., P = dM + (1-d)E, where P is used as the transition matrix of the Markov chain of a general random walk and is applied iteratively; in each iteration, the current state vector is multiplied by P to obtain a new state vector, and this process is repeated until the state vector converges to a stationary distribution R; R is an n-dimensional vector whose components sum to 1, and each component is the score of the corresponding node in the MSDG, i.e., its PageRank value, which represents the importance and influence of that node in the MSDG; R is represented by formula (3): $R = (PR(v_1), PR(v_2), \ldots, PR(v_n))$ (3), where $PR(v_1)$ and $PR(v_n)$ represent the PageRank values of node $v_1$ and node $v_n$, respectively; S5.4: sorting the PageRank values of the nodes in the MSDG in descending order, and taking the service ranked first as the root cause microservice, that is, the microservice most likely to have caused the anomaly.
10. A root cause localization system for microservice systems in hybrid deployment scenarios, characterized in that the system comprises: a data collection module, used to conduct preliminary chaos engineering experiments on the microservice systems in a hybrid deployment scenario, collect chaos engineering datasets, and continuously monitor and collect business-level and container-level metrics through monitoring tools; a single-application anomaly graph construction module, used to obtain the call relationships of the different microservice systems in the hybrid deployment scenario using an unsupervised learning algorithm, and to construct a single-system abnormal service dependency graph for each microservice system, wherein in the single-system abnormal service dependency graph the nodes are abnormal services and the edges represent the call relationships between abnormal services; a multi-application anomaly graph construction module, used to obtain the relationships between different microservices in the hybrid deployment scenario using a frequent itemset mining algorithm and a causal inference algorithm, and to construct the multi-system abnormal service dependency graph, wherein in the multi-system abnormal service dependency graph the nodes are abnormal services and the edges represent the dependencies between abnormal services; and a comprehensive ranking module, used to update the weights of the multi-system abnormal service dependency graph based on the correlation between the container-level and business-level metrics of two services under the same microservice system, and to execute a personalized random walk algorithm on the multi-system abnormal service dependency graph after the weight update to achieve root cause localization.