Method for local updating of cloud environment service fault propagation graph based on service call graph

By analyzing service call graphs and service operation data in the cloud computing environment, the update scope of the service fault propagation graph is identified and locally updated, solving the problem of rapid identification and updating of service fault propagation graphs in the cloud computing environment and improving the efficiency of fault propagation analysis.

CN116527474B · Active · Publication Date: 2026-06-26 · KUNMING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: KUNMING UNIV OF SCI & TECH
Filing Date: 2023-04-27
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to quickly identify partial updates to service failure propagation graphs in cloud computing environments, resulting in inefficient failure propagation analysis.

Method used

By analyzing changes in the service call graph and combining it with service operation data in a cloud computing environment, the update scope of the service fault propagation graph is identified, and the service fault propagation graph is locally updated to construct a local fault propagation graph.

Benefits of technology

It reduces the number of services involved in causal inference and improves the efficiency of constructing service failure propagation graphs.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a cloud environment service fault propagation graph local updating method based on a service call graph, which comprises the following steps: analyzing service call graph changes, screening out services with changed current call relationship; identifying service fault propagation graph updating range, combining service running data in a cloud computing environment, analyzing whether service call relationship changes result in service fault relationship changes, updating the service list with changed call relationship, and obtaining an updating range list; and locally updating the service fault propagation graph, and updating the original service fault propagation graph. The application measures service abnormal events and infers event causality relationship by using service running data in a cloud computing environment to construct a service fault propagation graph. When service call relationship changes, the method for determining the updating range of the service fault propagation graph reduces the number of services involved in causality inference, and improves the efficiency of constructing the service fault propagation graph.

Description

Technical Field

[0001] This invention relates to a method for locally updating a service fault propagation graph in a cloud environment based on a service call graph, and belongs to the field of service fault diagnosis in cloud computing.

Background Technology

[0002] In recent years, cloud computing applications have become increasingly widespread. Because the composition of distributed services is inherently uncertain, and the behavior of services is dynamically combined and interconnected, with dependencies between them, it is difficult to detect the abnormal state of the entire distributed service through the state of a single service.

[0003] Yu Feng et al. proposed a system fault modeling method integrating Petri nets and fault trees. This integrated model can effectively perform qualitative and quantitative analysis of fault trees. Sun proposed a Gaussian Bayesian network fault propagation path identification and reasoning algorithm based on parent node filtering. It calculates the maximum conditional probability of possible subsets of the reduced parent set based on parameter weights and reduction of the parent node set, and determines the fault propagation path network through layer-by-layer reasoning. Chen Hao et al., targeting the execution trajectory between services in a distributed system, used an injection proxy to intercept and forward traffic. By collecting service call information through a distributed tracing system, they were able to effectively detect anomalies in the service call path. Wang Tao et al. proposed a microservice fault diagnosis method oriented towards anomaly propagation. This method monitors microservice metrics, inter-microservice call behavior, and abnormal microservices, constructs a microservice dependency graph, and obtains a fault propagation subgraph based on the service dependency graph and the set of abnormal services.

[0004] Most existing studies describe the propagation of service failures by constructing a directed graph of service failure propagation. Because service call relationships change constantly, an urgent problem is how to combine service operation data with service call relationships to locally update the service failure propagation graph, so that service failure propagation can be identified quickly.

Summary of the Invention

[0005] This invention provides a method for partially updating a service fault propagation graph in a cloud environment based on a service call graph. The method identifies the update range of the service fault propagation graph according to the changes in the call graph, and combines service operation data in the cloud computing environment to partially update the service fault propagation graph.

[0006] The technical solution of this invention is:

[0007] According to one aspect of the present invention, a method for partially updating a service fault propagation graph in a cloud environment based on a service call graph is provided, comprising: Step 1, analyzing changes in the service call graph: analyzing the service call graph and filtering out services whose current call relationships have changed; Step 2, identifying the update scope of the service fault propagation graph: combining service operation data in the cloud computing environment, analyzing whether changes in service call relationships lead to changes in service fault relationships, updating the list of services whose call relationships have changed, and obtaining an update scope list; Step 3, partially updating the service fault propagation graph: constructing a local fault propagation graph and updating the original service fault propagation graph.

[0008] Step 1 includes: inputting the service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service fault propagation graph node list ServerList, and the service fault propagation graph generation time t; reading the first service call graph adjacency matrix Mnew and the first node list MnewList for the interval from the first time to the update trigger time, where the first time is the update trigger time minus the time period; and reading the second service call graph adjacency matrix M and the second node list MList for the interval from the second time to the service fault propagation graph generation time, where the second time is the service fault propagation graph generation time minus the time period.

Based on the first node list MnewList and the second node list MList, a list of services that have stopped running, StopServer, and a list of newly emerging services, NewServer, are constructed: services that are included in MList but not present in MnewList are stored in StopServer, and services that are included in MnewList but not present in MList are stored in NewServer. Based on the first service call graph adjacency matrix Mnew and the services directly connected to the services in NewServer, a list of correspondences between newly emerging services and their directly connected services, NewServerCorr, and a list of services directly connected to newly emerging services, NewServerCorrList, are established. Based on NewServer, Mnew and MnewList are updated; based on StopServer, M and MList are updated. Based on StopServer, the services whose structure differs between Mnew and M are filtered out to establish the list of services with structural changes, ChangeServer. NewServer, NewServerCorrList, and ChangeServer are then merged to establish the list of services whose call relationships have changed, PartServer.

If the residual correlation coefficient matrix Ecorr is empty, the abnormal event measurement sequences of the n services in ServerList are obtained and the global abnormal event measurement matrix AbnormEvents is established; the regression coefficient matrix wj is obtained based on W and AbnormEvents; W is updated with wj; the first residual matrix E is obtained based on the updated W and AbnormEvents; the updated residual correlation coefficient matrix Ecorr is obtained based on E; and ServerList and NewServer are merged and stored as the service fault propagation graph new node list ServerListNew. If Ecorr is not empty, ServerList and NewServer are merged directly and stored as ServerListNew.

Finally, the list of services whose call relationships have changed PartServer, the residual correlation coefficient matrix Ecorr, the service fault propagation graph new node list ServerListNew, the newly emerging service list NewServer, the list of services with structural changes ChangeServer, the list of stopped service names StopServer, the list of services directly connected to newly emerging services NewServerCorrList, the list of correspondences between newly emerging services and their directly connected services NewServerCorr, the service fault propagation graph adjacency matrix W, the time period period, and the update trigger time time are output, and Step 2 is executed.
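
As a non-limiting illustration, the list construction in Step 1 can be sketched as set operations. The sketch below is not the patented implementation: it represents each call graph as a set of directed (caller, callee) edges rather than an adjacency matrix, and it reads "updating" Mnew and M as restricting both graphs to the services present in both windows before comparing them. The function name `diff_call_graphs` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # directed call edge (caller, callee)


def diff_call_graphs(old_nodes: List[str], old_edges: Set[Edge],
                     new_nodes: List[str], new_edges: Set[Edge]) -> Dict[str, object]:
    """Compare two call graphs and build the Step 1 lists (illustrative only)."""
    old_set, new_set = set(old_nodes), set(new_nodes)
    stop_server = sorted(old_set - new_set)  # in MList, not in MnewList
    new_server = sorted(new_set - old_set)   # in MnewList, not in MList

    # Services directly connected to each new service in the new call graph.
    new_server_corr: Dict[str, List[str]] = {}
    for s in new_server:
        neighbours = {b for a, b in new_edges if a == s}
        neighbours |= {a for a, b in new_edges if b == s}
        new_server_corr[s] = sorted(neighbours - {s})
    new_server_corr_list = sorted({n for ns in new_server_corr.values() for n in ns})

    # Assumption: structure is compared only over services present in both windows.
    common = old_set & new_set

    def restrict(edges: Set[Edge]) -> Set[Edge]:
        return {(a, b) for a, b in edges if a in common and b in common}

    changed_edges = restrict(old_edges) ^ restrict(new_edges)
    change_server = sorted({s for edge in changed_edges for s in edge})

    part_server = sorted(set(new_server) | set(new_server_corr_list) | set(change_server))
    return {
        "StopServer": stop_server,
        "NewServer": new_server,
        "NewServerCorr": new_server_corr,
        "NewServerCorrList": new_server_corr_list,
        "ChangeServer": change_server,
        "PartServer": part_server,
    }
```

With an edge-set representation the structural comparison reduces to a symmetric difference; with adjacency matrices the same comparison would be an element-wise inequality over the aligned rows and columns.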

[0009] The process of obtaining the abnormal event measurement sequence of n services in the service failure propagation graph node list ServerList and establishing a global abnormal event measurement matrix AbnormEvents includes: reading the monitoring data NowData of the nth service in ServerList from the second time to the service failure propagation graph generation time; reading the monitoring data NormData of the nth service in ServerList during normal operation; for any time between the second time and the service failure propagation graph generation time, if the value corresponding to NowData is not empty, assign the value of NowData at that time to vector d, and calculate the mean vector u and covariance matrix cov of NormData; obtaining the service abnormal event measurement value T* based on vector d, mean vector u, and covariance matrix cov; if the value corresponding to NowData is empty, then the service abnormal event measurement value T* = 0; and establishing an abnormal event measurement sequence based on the service abnormal event measurement value to obtain the abnormal event measurement sequence AbnormEvents of n services in ServerList.

[0010] The expression for obtaining the service abnormal event measurement value T* based on the vector d, the mean vector u, and the covariance matrix cov is: $T^{*} = (d - u)\,\mathrm{cov}^{-1}\,(d - u)^{T}$, where the superscript T represents transpose.
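
As a non-limiting illustration, the measurement value T* (a squared Mahalanobis distance of the current sample from the normal-operation data) could be computed as sketched below. The function name `anomaly_metric`, the use of `None` for an empty NowData value, and the use of a pseudo-inverse in place of a plain inverse are assumptions made for this sketch, not details taken from the invention.

```python
import numpy as np


def anomaly_metric(d, norm_data):
    """Return T* = (d - u) cov^{-1} (d - u)^T for one monitoring sample d.

    norm_data has shape (samples, metrics) and holds normal-operation data.
    A missing sample (d is None) yields T* = 0, as in the description.
    """
    if d is None:
        return 0.0
    norm_data = np.asarray(norm_data, dtype=float)
    u = norm_data.mean(axis=0)                              # mean vector u
    cov = np.atleast_2d(np.cov(norm_data, rowvar=False))    # covariance matrix cov
    diff = np.asarray(d, dtype=float) - u
    # Assumption: pinv is used so that a singular covariance does not raise.
    return float(diff @ np.linalg.pinv(cov) @ diff)
```

Evaluating this function at every time point of the window, service by service, gives one row of the abnormal event measurement matrix per service.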

[0011] The process of obtaining the regression coefficient matrix wj based on the service failure propagation graph adjacency matrix W and the global abnormal event metric matrix AbnormEvents includes: for each (j+1)th column of W with a non-zero value, recording the row number of the (j+1)th column of W with a non-zero value and storing it as a list Xnum; using the row with row number Xnum of AbnormEvents as the independent variable and the (j+1)th row of AbnormEvents as the dependent variable, obtaining the regression coefficient matrix wj through a multiple linear regression parameter estimation method; where j is less than the length of the list ServerList.
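
As a non-limiting illustration, the column-by-column estimation of wj can be sketched with ordinary least squares. The sketch assumes that AbnormEvents has one row per service and one column per time point, that W[i, j] is the weight of the edge from service i to service j, and that the regression has no intercept; the function name `refit_weights` is hypothetical.

```python
import numpy as np


def refit_weights(W, abnorm_events):
    """Re-estimate the non-zero entries of W, one column at a time."""
    W = np.array(W, dtype=float)                  # copy, so the input is not modified
    A = np.asarray(abnorm_events, dtype=float)    # shape (services, time points)
    for j in range(W.shape[1]):
        xnum = np.flatnonzero(W[:, j])            # row numbers with non-zero values (Xnum)
        if xnum.size == 0:
            continue                              # service j has no parents in the graph
        X = A[xnum].T                             # independent variables, (time, parents)
        y = A[j]                                  # dependent variable
        wj, *_ = np.linalg.lstsq(X, y, rcond=None)
        W[xnum, j] = wj                           # write the coefficients back into W
    return W
```

Only entries that are already non-zero are refitted, so the structure of the graph is preserved and only the edge weights change.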

[0012] Based on the updated service failure propagation graph adjacency matrix W and the global abnormal event metric matrix AbnormEvents, the first residual matrix E is obtained, where $E = \mathrm{AbnormEvents} - W^{T} \cdot \mathrm{AbnormEvents}$, and the superscript T represents transpose.
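
As a non-limiting illustration, the residual matrix and a residual correlation coefficient matrix could be computed as sketched below. The description does not state which correlation coefficient is used; the Pearson coefficient is assumed here, constant residual rows are mapped to a correlation of 0, and the function name `residual_corr` is hypothetical.

```python
import numpy as np


def residual_corr(W, abnorm_events):
    """Return (E, Ecorr) with E = A - W^T A and Ecorr the row-wise correlation of E."""
    W = np.asarray(W, dtype=float)
    A = np.asarray(abnorm_events, dtype=float)   # shape (services, time points)
    E = A - W.T @ A                              # residual of each service given its parents
    # Assumption: Pearson correlation between residual rows; NaN (constant row) becomes 0.
    with np.errstate(invalid="ignore", divide="ignore"):
        Ecorr = np.corrcoef(E)
    return E, np.nan_to_num(Ecorr)
```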

[0013] Step 2 includes: inputting the list of services whose call relationships have changed PartServer, the residual correlation coefficient matrix Ecorr, the service fault propagation graph new node list ServerListNew, the newly emerging service list NewServer, the list of services with structural changes ChangeServer, the list of stopped service names StopServer, the list of services directly connected to newly emerging services NewServerCorrList, the list of correspondences between newly emerging services and their directly connected services NewServerCorr, the service fault propagation graph adjacency matrix W, the time period period, and the update trigger time time.

For the interval from the first time to the update trigger time, the abnormal event measurement sequences of the n services in PartServer are obtained and the local abnormal event measurement matrix AbnormPartEvents is established, where the first time is the update trigger time minus the time period. The length L of the list NewServer is calculated, and L rows and L columns are appended to the end of both the adjacency matrix W and the residual correlation coefficient matrix Ecorr, with the added values initialized to 0. The rows and columns of W corresponding to the services in PartServer are selected and named PartWold, the local service fault propagation graph before the update. The second residual matrix Enew is obtained based on AbnormPartEvents and PartWold, and the new residual correlation coefficient matrix Enewcorr is obtained based on Enew.

ChangeServer is traversed: if, for a service in ChangeServer, none of the values in its corresponding row and column of Enewcorr is higher than the values in the corresponding row and column of Ecorr, the service name is deleted from ChangeServer and the row corresponding to that service name is deleted from AbnormPartEvents; otherwise, no action is taken. NewServerCorr is traversed, and the independence of the two row vectors of AbnormPartEvents corresponding to a newly emerging service and its directly connected service is checked: if they are independent and the directly connected service is not itself a newly emerging service, the name of the directly connected service is deleted from NewServerCorr and NewServerCorrList and the row corresponding to that service name is deleted from AbnormPartEvents; otherwise, no action is taken. If a service in NewServer has no directly connected service left in NewServerCorr, its service name is deleted from NewServer and the row corresponding to that service name is deleted from AbnormPartEvents; otherwise, no action is taken.

The lists NewServer, NewServerCorrList, and ChangeServer are merged, duplicates are removed, and the result is saved as the service fault propagation graph update range list PartServer1. The service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the update range list PartServer1, the local abnormal event measurement matrix AbnormPartEvents, the new node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer are output, and Step 3 is executed.
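
As a non-limiting illustration, extending W for the newly emerging services and extracting the pre-update local graph PartWold can be sketched as follows. The sketch assumes that the rows and columns of W follow the order of ServerList and that the new services are appended in the order of NewServer; the function name `pad_and_select` is hypothetical.

```python
import numpy as np


def pad_and_select(W, server_list, new_server, part_server):
    """Append L zero rows/columns to W and return the PartServer submatrix."""
    W = np.asarray(W, dtype=float)
    L = len(new_server)
    # Add L rows and L columns at the end, initialised to 0.
    W_padded = np.pad(W, ((0, L), (0, L)), mode="constant", constant_values=0.0)
    server_list_new = list(server_list) + list(new_server)
    # Rows and columns of the services in PartServer form the local graph PartWold.
    idx = [server_list_new.index(s) for s in part_server]
    part_w_old = W_padded[np.ix_(idx, idx)]
    return W_padded, server_list_new, part_w_old
```

The same padding applies to Ecorr, so that both matrices stay aligned with ServerListNew.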

[0014] Step 3 includes: inputting the service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service fault propagation graph update range list PartServer1, the local abnormal event measurement matrix AbnormPartEvents, the service fault propagation graph new node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer; resetting the rows and columns of the adjacency matrix W corresponding to the services in StopServer to 0; obtaining the local range fault propagation graph PartW based on AbnormPartEvents; obtaining the local residual matrix PartE based on PartW and AbnormPartEvents; obtaining the local residual correlation coefficient matrix PartEcorr based on PartE; replacing the values at the corresponding positions in W and Ecorr with PartW and PartEcorr, respectively; setting the service fault propagation graph generation time t = time; setting the service fault propagation graph node list ServerList = ServerListNew; and outputting the updated service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service fault propagation graph node list ServerList, and the service fault propagation graph generation time t.
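
As a non-limiting illustration, the write-back part of Step 3 can be sketched as follows. The local graph PartW is taken as an input, because the description does not fix the causal inference procedure that produces it. The sketch assumes that W is already aligned with ServerListNew and that the rows and columns of PartW follow the order of PartServer1; the function name `write_back` is hypothetical.

```python
import numpy as np


def write_back(W, server_list_new, stop_server, part_server1, part_w):
    """Zero out stopped services, then overwrite the local block of W with PartW."""
    W = np.array(W, dtype=float)                  # copy, so the input is not modified
    pos = {s: i for i, s in enumerate(server_list_new)}
    for s in stop_server:
        if s in pos:
            W[pos[s], :] = 0.0                    # outgoing edges of the stopped service
            W[:, pos[s]] = 0.0                    # incoming edges of the stopped service
    idx = [pos[s] for s in part_server1]
    W[np.ix_(idx, idx)] = np.asarray(part_w, dtype=float)
    return W
```

The same indexing would apply when PartEcorr replaces the corresponding block of Ecorr.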

[0015] According to another aspect of the present invention, a cloud environment service fault propagation graph partial update system based on service call graph is provided, comprising: a service call graph change analysis module, used to analyze the service call graph and filter out services whose current call relationships have changed; a service fault propagation graph update range identification module, used to combine service operation data in the cloud computing environment to analyze whether changes in service call relationships lead to changes in service fault relationships, update the list of services whose call relationships have changed, and obtain an update range list; and a partial update service fault propagation graph module, used to construct a local fault propagation graph and update the original service fault propagation graph.

[0016] According to another aspect of the present invention, a processor is provided for running a program, wherein the program, when running, executes the cloud environment service fault propagation graph local update method based on service call graph as described in any one of the above.

[0017] The beneficial effects of this invention are: by measuring service anomaly events and inferring causal relationships from service operation data in a cloud computing environment, this invention constructs a service fault propagation graph. When service call relationships change, determining the update range of the service fault propagation graph reduces the number of services involved in causal inference and improves the efficiency of constructing the service fault propagation graph.

Attached Figure Description

[0018] Figure 1 is the overall flowchart of the present invention;

[0019] Figure 2 shows the specific process of Step 1 in Figure 1;

[0020] Figure 3 is partial view 1 of Step 1 in Figure 2;

[0021] Figure 4 is partial view 2 of Step 1 in Figure 1;

[0022] Figure 5 shows the specific process of Step 2 in Figure 1;

[0023] Figure 6 shows the specific process of Step 3 in Figure 1.

Detailed Implementation

[0024] The invention will be further described below with reference to the accompanying drawings and embodiments, but the scope of the invention is not limited to the description.

[0025] Example 1: As shown in Figures 1-6, according to one aspect of the present invention, a method for partially updating a service fault propagation graph in a cloud environment based on a service call graph is provided, comprising: Step 1, analyzing changes in the service call graph: analyzing the service call graph and filtering out services whose current call relationships have changed; Step 2, identifying the update scope of the service fault propagation graph: combining service operation data in the cloud computing environment, analyzing whether changes in service call relationships lead to changes in service fault relationships, updating the list of services whose call relationships have changed, and obtaining an update scope list; Step 3, partially updating the service fault propagation graph: constructing a local fault propagation graph and updating the original service fault propagation graph.

[0026] Further, Step 1 includes: inputting the service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service fault propagation graph node list ServerList, and the service fault propagation graph generation time t; reading the first service call graph adjacency matrix Mnew and the first node list MnewList for the interval from the first time to the update trigger time, where the first time is the update trigger time minus the time period; and reading the second service call graph adjacency matrix M and the second node list MList for the interval from the second time to the service fault propagation graph generation time, where the second time is the service fault propagation graph generation time minus the time period.

Based on the first node list MnewList and the second node list MList, a list of services that have stopped running, StopServer, and a list of newly emerging services, NewServer, are constructed: services that are included in MList but not present in MnewList are stored in StopServer, and services that are included in MnewList but not present in MList are stored in NewServer. Based on the first service call graph adjacency matrix Mnew and the services directly connected to the services in NewServer, a list of correspondences between newly emerging services and their directly connected services, NewServerCorr, and a list of services directly connected to newly emerging services, NewServerCorrList, are established. Based on NewServer, Mnew and MnewList are updated; based on StopServer, M and MList are updated. Based on StopServer, the services whose structure differs between Mnew and M are filtered out to establish the list of services with structural changes, ChangeServer. NewServer, NewServerCorrList, and ChangeServer are then merged to establish the list of services whose call relationships have changed, PartServer.

If the residual correlation coefficient matrix Ecorr is empty, the abnormal event measurement sequences of the n services in ServerList are obtained and the global abnormal event measurement matrix AbnormEvents is established; the regression coefficient matrix wj is obtained based on W and AbnormEvents; W is updated with wj; the first residual matrix E is obtained based on the updated W and AbnormEvents; the updated residual correlation coefficient matrix Ecorr is obtained based on E; and ServerList and NewServer are merged and stored as the service fault propagation graph new node list ServerListNew. If Ecorr is not empty, ServerList and NewServer are merged directly and stored as ServerListNew.

Finally, the list of services whose call relationships have changed PartServer, the residual correlation coefficient matrix Ecorr, the service fault propagation graph node list ServerList, the service fault propagation graph new node list ServerListNew, the newly emerging service list NewServer, the list of services with structural changes ChangeServer, the list of stopped service names StopServer, the list of services directly connected to newly emerging services NewServerCorrList, the list of correspondences between newly emerging services and their directly connected services NewServerCorr, the service fault propagation graph adjacency matrix W, the time period period, and the update trigger time time are output, and Step 2 is executed.

[0027] Further, the step of obtaining the abnormal event measurement sequence of n services in the service failure propagation graph node list ServerList and establishing a global abnormal event measurement matrix AbnormEvents includes: reading the monitoring data NowData of the nth service in ServerList from the second time to the service failure propagation graph generation time; reading the monitoring data NormData of the nth service in ServerList during normal operation; from the second time to the service failure propagation graph generation time, for any time, if the value corresponding to NowData is not empty, assign the value of NowData at that time to vector d, and calculate the mean vector u and covariance matrix cov of NormData; based on vector d, mean vector u and covariance matrix cov, obtain the service abnormal event measurement value T*; if the value corresponding to NowData is empty, then the service abnormal event measurement value T* = 0; based on the service abnormal event measurement value, establish an abnormal event measurement sequence to obtain the abnormal event measurement sequence AbnormEvents of n services in ServerList.

[0028] Furthermore, the expression for obtaining the service abnormal event measurement value T* based on the vector d, the mean vector u, and the covariance matrix cov is: $T^{*} = (d - u)\,\mathrm{cov}^{-1}\,(d - u)^{T}$, where the superscript T represents transpose.

[0029] Further, the step of obtaining the regression coefficient matrix wj based on the service failure propagation graph adjacency matrix W and the global abnormal event metric matrix AbnormEvents includes: for each (j+1)th column of W with a non-zero value, recording the row number of the (j+1)th column of W with a non-zero value and storing it as a list Xnum; using the row with row number Xnum of AbnormEvents as the independent variable and the (j+1)th row of AbnormEvents as the dependent variable, obtaining the regression coefficient matrix wj through a multiple linear regression parameter estimation method; where j is less than the length of the list ServerList.

[0030] Furthermore, based on the updated service failure propagation graph adjacency matrix W and the global abnormal event metric matrix AbnormEvents, the first residual matrix E is obtained, where $E = \mathrm{AbnormEvents} - W^{T} \cdot \mathrm{AbnormEvents}$, and the superscript T represents transpose.

[0031] Further, Step 2 includes: inputting the list of services whose call relationships have changed PartServer, the residual correlation coefficient matrix Ecorr, the service fault propagation graph node list ServerList, the service fault propagation graph new node list ServerListNew, the newly emerging service list NewServer, the list of services with structural changes ChangeServer, the list of stopped service names StopServer, the list of services directly connected to newly emerging services NewServerCorrList, the list of correspondences between newly emerging services and their directly connected services NewServerCorr, the service fault propagation graph adjacency matrix W, the time period period, and the update trigger time time.

For the interval from the first time to the update trigger time, the abnormal event measurement sequences of the n services in PartServer are obtained and the local abnormal event measurement matrix AbnormPartEvents is established, where the first time is the update trigger time minus the time period. Specifically: the current monitoring data NowData of the nth service in PartServer is read for the interval from the first time to the update trigger time time; the monitoring data NormData of the nth service in PartServer during normal operation is read; for any moment in the interval, if the value corresponding to NowData is not empty, the value of NowData at that moment is assigned to vector d, the mean vector u and covariance matrix cov of NormData are calculated, and the service abnormal event measurement value T* is obtained based on d, u, and cov; if the value corresponding to NowData is empty, T* = 0; the abnormal event measurement sequence is established from the measurement values, giving the abnormal event measurement sequences AbnormPartEvents of the n services in PartServer.

The length L of the list NewServer is calculated, and L rows and L columns are appended to the end of both the adjacency matrix W and the residual correlation coefficient matrix Ecorr, with the added values initialized to 0. The rows and columns of W corresponding to the services in PartServer are selected and named PartWold, the local service fault propagation graph before the update. The second residual matrix Enew is obtained based on AbnormPartEvents and PartWold, and the new residual correlation coefficient matrix Enewcorr is obtained based on Enew.

ChangeServer is traversed: if, for a service in ChangeServer, none of the values in its corresponding row and column of Enewcorr is higher than the values in the corresponding row and column of Ecorr, the service name is deleted from ChangeServer and the row corresponding to that service name is deleted from AbnormPartEvents; otherwise, no action is taken. NewServerCorr is traversed, and the independence of the two row vectors of AbnormPartEvents corresponding to a newly emerging service and its directly connected service is checked: if they are independent and the directly connected service is not itself a newly emerging service, the name of the directly connected service is deleted from NewServerCorr and NewServerCorrList and the row corresponding to that service name is deleted from AbnormPartEvents; otherwise, no action is taken. If a service in NewServer has no directly connected service left in NewServerCorr, its service name is deleted from NewServer and the row corresponding to that service name is deleted from AbnormPartEvents; otherwise, no action is taken.

The lists NewServer, NewServerCorrList, and ChangeServer are merged, duplicates are removed, and the result is saved as the service fault propagation graph update range list PartServer1. The service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the update range list PartServer1, the local abnormal event measurement matrix AbnormPartEvents, the new node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer are output, and Step 3 is executed. Further, the second residual matrix is $E_{new} = \mathrm{AbnormPartEvents} - \mathrm{PartWold}^{T} \cdot \mathrm{AbnormPartEvents}$.

[0032] Further, Step 3 includes: inputting the service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service fault propagation graph update range list PartServer1, the local anomaly event metric matrix AbnormPartEvents, the service fault propagation graph new node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer; resetting the rows and columns corresponding to StopServer in the adjacency matrix W to 0; obtaining the local range fault propagation graph PartW based on the local anomaly event metric matrix AbnormPartEvents; obtaining the local residual matrix PartE based on the local range fault propagation graph PartW and AbnormPartEvents; obtaining the local residual correlation coefficient matrix PartEcorr based on the local residual matrix PartE; replacing the values at the corresponding positions in W and Ecorr with PartW and PartEcorr; setting the service fault propagation graph generation time t = time; setting the service fault propagation graph node list ServerList = ServerListNew; and outputting the updated service fault propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service fault propagation graph node list ServerList, and the service fault propagation graph generation time t. Furthermore, the local residual matrix $\mathrm{PartE} = \mathrm{AbnormPartEvents} - \mathrm{PartW}^{T} \times \mathrm{AbnormPartEvents}$.
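
The write-back of the local result into W and Ecorr described for Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this method: the helper name write_back_block is hypothetical, NumPy is assumed, and the matrices are filled with values from the worked example of this embodiment.

```python
import numpy as np

def write_back_block(global_mat, local_mat, node_list, part_services):
    """Return a copy of global_mat whose rows/columns for part_services
    are overwritten by local_mat (hypothetical helper)."""
    idx = [node_list.index(s) for s in part_services]
    updated = np.array(global_mat, dtype=float)  # copy, input stays untouched
    updated[np.ix_(idx, idx)] = np.asarray(local_mat, dtype=float)
    return updated

server_list_new = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel']
part_server1 = ['Plane', 'Hotel', 'OnlineCar', 'Cinema']

# Global graph after padding and after the stopped service was cleared.
W = np.zeros((6, 6))
W[3, 2] = 0.24250067

# Local graph in PartServer1 order, as listed in the worked example.
part_w = np.zeros((4, 4))
part_w[1, 0] = 7.46769979

W_updated = write_back_block(W, part_w, server_list_new, part_server1)
```

Because the block is addressed by service name rather than by position, entries that involve services outside PartServer1 keep their existing values.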

[0033] According to another aspect of the present invention, a cloud environment service fault propagation graph partial update system based on service call graph is provided, comprising: a service call graph change analysis module, used to analyze the service call graph and filter out services whose current call relationships have changed; a service fault propagation graph update range identification module, used to combine service operation data in the cloud computing environment to analyze whether changes in service call relationships lead to changes in service fault relationships, update the list of services whose call relationships have changed, and obtain an update range list; and a partial update service fault propagation graph module, used to construct a local fault propagation graph and update the original service fault propagation graph.

[0034] According to another aspect of the present invention, a processor is provided, the processor being configured to run a program, wherein the program, when running, executes the cloud environment service fault propagation graph local update method based on service call graph as described in any one of the above embodiments.

[0035] Example 2: As shown in Figures 1-6, the cloud environment service fault propagation graph local update method based on the service call graph executes Step 1, Step 2, and Step 3 in sequence until execution terminates.

[0036] The service operation data attribute table is shown in Table 1. The table gives the meaning of the service operation data attributes and the category label of the service operation data.

[0037] Table 1 Service Operation Data Attribute Table

[0038]
Attribute | Meaning
time | Service runtime data acquisition time
servername | Service name
cpuuser | Percentage of CPU used by the service process in user space
memuser | Percentage of memory used by the service process
system | Percentage of CPU used by the service process in kernel space

[0039] The service operation data table is shown in Table 2:

[0040] Table 2 Service Normal Operation Data Table

[0041]

[0042] The service operation data table is shown in Table 3:

[0043] Table 3 Service Operation Data Table

[0044]

[0045] Continued from Table 3: Service Operation Data Table

[0046]

[0047] Furthermore, the specific steps of the method can be set as follows:

[0048] Step 1 is described in detail below:

[0049] Step 1.1: Input the adjacency matrix W, residual correlation coefficient matrix Ecorr, node list ServerList, and service failure propagation graph generation time t, then execute Step 1.2;

[0050] The adjacency matrix of the input service fault propagation graph is W = [[0,0,0,1],[0,0,0,0],[0,0,0,0],[0,0,1,0]].

[0051] The residual correlation coefficient matrix Ecorr = []

[0052] Node list ServerList = ['Train', 'Restaurant', 'Cinema', 'OnlineCar']

[0053] Service failure propagation graph was generated at time t = '2022-02-02 08:00:04'

[0054] Step 1.2: Initialize n = 0, the time period period = 4s, and an empty list AbnormEvents, then execute Step 1.3;

[0055] Step 1.3: Get the current time, save it as "time", and then proceed to Step 1.4.

[0056] Current time: '2022-02-02 08:00:07'

[0057] Step 1.4: Read the adjacency matrix Mnew and node list MnewList of the service call topology graph for the current time period from time-period to time, and then execute Step 1.5;

[0058] Table 4 Service Invocation Data Attribute Table

[0059]
Attribute | Meaning
time | Service call request initiation time
user | Service call request initiator
server | Service call request receiver

[0060] Table 5 Service Invocation Data Table

[0061]
time | user | server
2022-02-02 08:00:00 | OnlineCar | Cinema
2022-02-02 08:00:02 | Train | OnlineCar
2022-02-02 08:00:03 | OnlineCar | Restaurant
2022-02-02 08:00:05 | Hotel | Plane
2022-02-02 08:00:05 | Hotel | OnlineCar
2022-02-02 08:00:06 | Cinema | OnlineCar

[0062] From 2022-02-02 08:00:03 to 2022-02-02 08:00:07, the service operation data table contains the operation data of the services 'Plane', 'Restaurant', 'Cinema', 'OnlineCar', and 'Hotel'. Therefore, the node list MnewList = ['Plane', 'Restaurant', 'Cinema', 'OnlineCar', 'Hotel']. In the service call data table, there are call relationships such as OnlineCar->Restaurant, Hotel->Plane, Hotel->OnlineCar, and Cinema->OnlineCar. Therefore, the adjacency matrix of the service call topology graph for the current time period is Mnew = [[0,0,0,0,0], [0,0,0,0,0], [0,0,0,1,0], [0,1,0,0,0], [1,0,0,1,0]].
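
The construction of the call topology matrix from the call records can be sketched as follows. This is only an illustration: the function name build_call_matrix is hypothetical, the node list is taken as given (in the example it comes from the service operation data table), rows are assumed to denote the caller and columns the callee, and the first record is spelled 'OnlineCar' to match the node lists.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

def build_call_matrix(calls, nodes, start, end):
    """Row = caller, column = callee; entry is 1 if at least one call
    between two listed services falls inside [start, end]."""
    t0, t1 = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    index = {name: k for k, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for ts, caller, callee in calls:
        t = datetime.strptime(ts, FMT)
        if t0 <= t <= t1 and caller in index and callee in index:
            matrix[index[caller]][index[callee]] = 1
    return matrix

# Records of Table 5.
calls = [
    ('2022-02-02 08:00:00', 'OnlineCar', 'Cinema'),
    ('2022-02-02 08:00:02', 'Train', 'OnlineCar'),
    ('2022-02-02 08:00:03', 'OnlineCar', 'Restaurant'),
    ('2022-02-02 08:00:05', 'Hotel', 'Plane'),
    ('2022-02-02 08:00:05', 'Hotel', 'OnlineCar'),
    ('2022-02-02 08:00:06', 'Cinema', 'OnlineCar'),
]

mnew_list = ['Plane', 'Restaurant', 'Cinema', 'OnlineCar', 'Hotel']
mnew = build_call_matrix(calls, mnew_list,
                         '2022-02-02 08:00:03', '2022-02-02 08:00:07')

m_list = ['Train', 'Restaurant', 'Cinema', 'OnlineCar']
m = build_call_matrix(calls, m_list,
                      '2022-02-02 08:00:00', '2022-02-02 08:00:04')
```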

[0063] Step 1.5: Read the adjacency matrix M and node list MList of the service call topology graph from time t-period to t, and then execute Step 1.6;

[0064] From 2022-02-02 08:00:00 to 2022-02-02 08:00:04, the service operation data table contains operation data for the services 'Train', 'Restaurant', 'Cinema', and 'OnlineCar'. Therefore, the node list MList = ['Train', 'Restaurant', 'Cinema', 'OnlineCar']. In the service call data table, there are the call relationships OnlineCar->Cinema, Train->OnlineCar, and OnlineCar->Restaurant. Therefore, the service call topology M = [[0,0,0,1],[0,0,0,0],[0,0,0,0],[0,1,1,0]].

[0065] Step 1.6: Save the service names that are contained in MList but do not exist in MnewList as a list StopServer, and then execute Step 1.7;

[0066] Since MList contains the service name 'Train', but MnewList does not, StopServer = ['Train'].

[0067] Step 1.7: Store the service names that are included in MnewList but not in MList as a list NewServer, and then proceed to Step 1.8;

[0068] Since MList does not contain the service names 'Hotel' and 'Plane', but MnewList does, NewServer = ['Plane', 'Hotel'].
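
Steps 1.6 and 1.7 amount to two order-preserving list differences. A minimal sketch, assuming plain Python lists of service names (the function name diff_services is hypothetical):

```python
def diff_services(m_list, mnew_list):
    """Return (StopServer, NewServer) as order-preserving list differences."""
    stop_server = [s for s in m_list if s not in mnew_list]
    new_server = [s for s in mnew_list if s not in m_list]
    return stop_server, new_server

m_list = ['Train', 'Restaurant', 'Cinema', 'OnlineCar']
mnew_list = ['Plane', 'Restaurant', 'Cinema', 'OnlineCar', 'Hotel']
stop_server, new_server = diff_services(m_list, mnew_list)
```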

[0069] Step 1.8: Locate the services in graph Mnew that are directly connected to the services in NewServer, save the correspondence between the services in NewServer and their directly connected services as NewServerCorr, and save the directly connected services as a list NewServerCorrList. Then execute Step 1.9.

[0070] NewServer contains services Plane and Hotel. According to the node list MnewList, Plane corresponds to the first row and first column in Mnew, and Hotel corresponds to the fifth row and fifth column. Since the value of the first column of the fifth row in Mnew is 1, and the value of the fourth column of the fifth row is also 1, and the fourth column corresponds to the service OnlineCar, therefore NewServerCorr = {'Hotel':['Plane','OnlineCar'],'Plane':['Hotel']}, and NewServerCorrList = ['Plane','Hotel','OnlineCar']
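
The neighbor lookup of Step 1.8 can be sketched as below. It assumes that "directly connected" means an edge in either direction in Mnew, which is consistent with the example (Plane lists Hotel although the call goes from Hotel to Plane). The function name direct_neighbors is hypothetical, and the element order of the merged list may differ from the order printed in the example.

```python
def direct_neighbors(matrix, nodes, new_server):
    """Map each new service to the services it shares an edge with
    (either direction), and collect all such neighbors without duplicates."""
    corr = {}
    for s in new_server:
        i = nodes.index(s)
        corr[s] = [nodes[k] for k in range(len(nodes))
                   if k != i and (matrix[i][k] or matrix[k][i])]
    corr_list = list(dict.fromkeys(n for s in new_server for n in corr[s]))
    return corr, corr_list

mnew_list = ['Plane', 'Restaurant', 'Cinema', 'OnlineCar', 'Hotel']
mnew = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,1,0],[0,1,0,0,0],[1,0,0,1,0]]
new_server_corr, new_server_corr_list = direct_neighbors(
    mnew, mnew_list, ['Plane', 'Hotel'])
```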

[0071] Step 1.9: Remove the rows and columns corresponding to the services in NewServer from Mnew, and remove the rows and columns corresponding to the services in StopServer from M; remove the service names in NewServer from MnewList, and remove the service names in StopServer from MList. Then execute Step 1.10;

[0072] From the node list MnewList, we know that the Hotel service in NewServer corresponds to the fifth row and fifth column of graph Mnew, and Plane corresponds to the first row and first column of Mnew. Graph Mnew does not contain Train. From the node list MList, we know that graph M does not contain Hotel or Plane. The Train service in StopServer corresponds to the first row and first column of M. Therefore, removing the first row and first column from M, and removing the first row and first column, and the fifth row and fifth column from Mnew, we get Mnew = [[0,0,0],[0,0,1],[1,0,0]], MnewList = ['Restaurant','Cinema','OnlineCar'], M = [[0,0,0],[0,0,0],[1,1,0]], MList = ['Restaurant','Cinema','OnlineCar']

[0073] Step 1.10: Filter the services with different structures in Mnew and M, save their service names as a list ChangeServer, and then proceed to Step 1.11;

[0074] Since the call direction is Cinema->OnlineCar in graph Mnew but OnlineCar->Cinema in graph M, ChangeServer = ['Cinema', 'OnlineCar'];
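
Steps 1.9 and 1.10 can be sketched as follows. The comparison rule is an assumption: a service is treated as structurally changed when its row or its column differs between the two reduced matrices, which presumes that both reduced node lists are in the same order (as they are in the example). The function names are hypothetical.

```python
def drop_services(matrix, nodes, to_drop):
    """Remove the rows/columns of the named services; names not present are ignored."""
    keep = [k for k, name in enumerate(nodes) if name not in to_drop]
    reduced = [[matrix[r][c] for c in keep] for r in keep]
    return reduced, [nodes[k] for k in keep]

def changed_services(m_old, m_new, nodes):
    """Services whose outgoing (row) or incoming (column) edges differ."""
    n = len(nodes)
    changed = []
    for k in range(n):
        row_differs = m_old[k] != m_new[k]
        col_differs = any(m_old[r][k] != m_new[r][k] for r in range(n))
        if row_differs or col_differs:
            changed.append(nodes[k])
    return changed

mnew = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,1,0],[0,1,0,0,0],[1,0,0,1,0]]
mnew_list = ['Plane', 'Restaurant', 'Cinema', 'OnlineCar', 'Hotel']
m = [[0,0,0,1],[0,0,0,0],[0,0,0,0],[0,1,1,0]]
m_list = ['Train', 'Restaurant', 'Cinema', 'OnlineCar']

mnew_red, mnew_list_red = drop_services(mnew, mnew_list, ['Plane', 'Hotel'])
m_red, m_list_red = drop_services(m, m_list, ['Train'])
change_server = changed_services(m_red, mnew_red, m_list_red)
```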

[0075] Step 1.11: Merge the lists NewServer, NewServerCorrList, and ChangeServer, remove duplicates, and save them as PartServer, a list of services whose call relationships have changed. Then execute Step 1.12.

[0076] PartServer=['Plane','Hotel','OnlineCar','Cinema'];
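
The merge with duplicate removal in Step 1.11 can be sketched with an order-preserving idiom. The helper name merge_unique is hypothetical; keeping first-seen order is an assumption that happens to reproduce the order shown in the example.

```python
def merge_unique(*lists):
    """Concatenate the lists and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(name for lst in lists for name in lst))

new_server = ['Plane', 'Hotel']
new_server_corr_list = ['Plane', 'Hotel', 'OnlineCar']
change_server = ['Cinema', 'OnlineCar']
part_server = merge_unique(new_server, new_server_corr_list, change_server)
```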

[0077] Step 1.12: If the service list PartServer with changed call relationships is not empty, proceed to Step 1.13; otherwise, end the process.

[0078] PartServer contains 4 elements, so Step 1.13 is executed;

[0079] Step 1.13: If the residual correlation coefficient matrix Ecorr is empty, proceed to Step 1.14; otherwise, proceed to Step 1.37.

[0080] Ecorr = [], which is empty, so Step 1.14 is executed;

[0081] Step 1.14: If n is less than the length of the list ServerList, proceed to Step 1.15; otherwise, proceed to Step 1.28.

[0082] Step 1.15: Initialize i = 0, empty list AbnormEvent, and execute Step 1.16;

[0083] Step 1.16: Read the monitoring data NowData of the nth service in ServerList from time t-period to t, and then execute Step 1.17;

[0084] (The following explanation mainly uses data from the 'Train' service);

[0085] When n=0, the Train's data from 2022-02-02 08:00:00 to 2022-02-02 08:00:04 is NowData=[[0.24,0.12,1.00],[0.87,0.12,0.82],[0.21,0.11,0.57],[],[]]

[0086] Step 1.17: Read the monitoring data NormData of the nth service in ServerList when it is running normally, and then execute Step 1.18;

[0087] Train data during normal operation

[0088] NormData=[[0.25,0.31,0.22],[0.11,0.37,0.11],[0.44,0.25,0.15],[0.25,0.25,0.10],[0.16,0.26,0.26],[0.14,0.41,0.10]]

[0089] Step 1.18: If i is less than or equal to period, proceed to Step 1.19; otherwise, proceed to Step 1.26.

[0090] When i = 0, 0 <= 4, execute Step 1.19;

[0091] When i = 1, 1 <= 4, execute Step 1.19;

[0092] When i = 2, 2 <= 4, execute Step 1.19;

[0093] When i = 3, 3 <= 4, execute Step 1.19;

[0094] When i = 4, 4 <= 4, execute Step 1.19;

[0095] When i = 5, 5 is not less than or equal to 4, so execute Step 1.26;

[0096] Step 1.19: If the value of NowData at time t-period+i is not empty, execute Step 1.20; otherwise, execute Step 1.23.

[0097] When i = 0, the value of NowData corresponding to '2022-02-02 08:00:00' is [0.24, 0.12, 1.00], which is not empty, so Step 1.20 is executed;

[0098] When i = 1, the value of NowData corresponding to '2022-02-02 08:00:01' is [0.87, 0.12, 0.82], which is not empty, so Step 1.20 is executed;

[0099] When i = 2, the value of NowData corresponding to '2022-02-02 08:00:02' is [0.21, 0.11, 0.57], which is not empty, so Step 1.20 is executed;

[0100] When i = 3, the value of NowData corresponding to '2022-02-02 08:00:03' is [], which is empty. Execute Step 1.23.

[0101] When i = 4, the value of NowData corresponding to '2022-02-02 08:00:04' is [], which is empty. Execute Step 1.23.

[0102] Step 1.20: Assign the value of NowData at time t-period+i to vector d, and then execute Step 1.21;

[0103] When i = 0, d = [0.24, 0.12, 1.00];

[0104] When i = 1, d = [0.87, 0.12, 0.82];

[0105] When i = 2, d = [0.21, 0.11, 0.57];

[0106] Step 1.21: Calculate the mean vector u and covariance matrix cov of NormData, then proceed to Step 1.22;

[0107] Calculate the average of the three columns of NormData and save it as u, u = [0.225000, 0.308333, 0.156667].

[0108] Calculate the NormData covariance matrix cov = [[0.01443, -0.00531, 0.00044], [-0.00531, 0.00465667, -0.00196667], [0.00044, -0.00196667, 0.00466667]];

[0109] Step 1.22: Service anomaly event metric $T^{*} = (d - u) \times \mathrm{cov}^{-1} \times (d - u)^{T}$; proceed to Step 1.24;

[0110] where $\mathrm{cov}^{-1}$ is the inverse of the covariance matrix cov, and $(d - u)^{T}$ is the transpose of the vector $(d - u)$.

[0111] When i = 0, T* = 163.059196;

[0112] When i = 1, T* = 162.405193;

[0113] When i = 2, T* = 37.750970;

[0114] Step 1.23: Service exception event metric T* = 0, proceed to Step 1.24;

[0115] When i = 3, T* = 0;

[0116] When i = 4, T* = 0;

[0117] Step 1.24: Add the metric T* of the nth service exception event in ServerList to the end of AbnormEvent, and then execute Step 1.25;

[0118] When i = 0, AbnormEvent = [163.059196];

[0119] When i = 1, AbnormEvent = [163.059196, 162.405193];

[0120] When i = 2, AbnormEvent = [163.059196, 162.405193, 37.750970];

[0121] When i = 3, AbnormEvent = [163.059196, 162.405193, 37.750970, 0];

[0122] When i = 4, AbnormEvent = [163.059196, 162.405193, 37.750970, 0, 0];
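
The loop of Steps 1.18 to 1.25 for one service can be sketched as below. This is an illustration rather than the original implementation: NumPy is assumed, the function name anomaly_metric_sequence is hypothetical, and the covariance is taken as the sample covariance (NumPy's default normalization), which agrees with the cov values listed above. With the Train data it should give a value close to the T* listed for i = 0.

```python
import numpy as np

def anomaly_metric_sequence(now_data, norm_data):
    """Return one T* per time slot: (d-u) cov^-1 (d-u)^T, or 0 for an empty slot."""
    norm = np.asarray(norm_data, dtype=float)
    u = norm.mean(axis=0)                                # mean vector u
    cov_inv = np.linalg.inv(np.cov(norm, rowvar=False))  # inverse sample covariance
    sequence = []
    for sample in now_data:
        if len(sample) == 0:
            sequence.append(0.0)                         # no data at this moment
        else:
            diff = np.asarray(sample, dtype=float) - u
            sequence.append(float(diff @ cov_inv @ diff))
    return sequence

now_data = [[0.24, 0.12, 1.00], [0.87, 0.12, 0.82], [0.21, 0.11, 0.57], [], []]
norm_data = [[0.25, 0.31, 0.22], [0.11, 0.37, 0.11], [0.44, 0.25, 0.15],
             [0.25, 0.25, 0.10], [0.16, 0.26, 0.26], [0.14, 0.41, 0.10]]
abnorm_event = anomaly_metric_sequence(now_data, norm_data)
```

Computing u and the inverse covariance once per service, rather than once per time slot as the step numbering suggests, does not change the result because NormData is fixed within the loop.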

[0123] Step 1.25, i = i + 1, execute Step 1.18;

[0124] Step 1.26: Add the abnormal event metric sequence AbnormEvent of the nth service in ServerList to the end of AbnormEvents, and then execute Step 1.27;

[0125] When n = 3, AbnormEvents = [[163.05919588,162.40519278,37.75096995,0,0],[208.89393042,35.58486225,16.62017,139.43848329,240.65016523],[87.84212687,15.16263657,35.35738332,4.71780382,68.61560346],[665.28015379,449.88898173,363.4043527,453.99482535,560.52651552]];

[0126] Step 1.27, n = n + 1, execute Step 1.14;

[0127] Step 1.28: Initialize j = 0, then execute Step 1.29;

[0128] Step 1.29: If j is less than the length of the list ServerList, proceed to Step 1.30; otherwise, proceed to Step 1.35.

[0129] When j = 0, 0 < 4, execute Step 1.30;

[0130] When j = 1, 1 < 4, so execute Step 1.30;

[0131] When j = 2, 2 < 4, so execute Step 1.30;

[0132] When j = 3, 3 < 4, so execute Step 1.30;

[0133] When j = 4, 4 is not less than 4, so execute Step 1.35;

[0134] Step 1.30: If column j+1 of W has a non-zero value, execute Step 1.31; otherwise, execute Step 1.34.

[0135] Since W = [[0,0,0,1],[0,0,0,0],[0,0,0,0],[0,0,1,0]];

[0136] When j = 0, there are no non-zero values in the first column, so execute Step 1.34;

[0137] When j = 1, there are no non-zero values in the second column, so execute Step 1.34;

[0138] When j=2, the value in the 4th row of the 3rd column is non-zero, so Step1.31 is executed;

[0139] When j=3, the first row of the fourth column has a non-zero value, so Step1.31 is executed;

[0140] Step 1.31: Record the row number of column j+1 of W that has a non-zero value, store it as a list Xnum, and then execute Step 1.32;

[0141] When j = 2, Xnum = [4];

[0142] When j = 3, Xnum = [1];

[0143] Step 1.32: Using the row with row number Xnum in AbnormEvents as the independent variable and the (j+1)th row in AbnormEvents as the dependent variable, obtain the regression coefficient matrix wj through the multiple linear regression parameter estimation method, and then execute Step 1.33.

[0144] When j = 2, the 4th row of AbnormEvents is used as the independent variable, and the 3rd row is used as the dependent variable. The regression coefficient matrix wj = [0.24250067] is calculated.

[0145] When j = 3, the first row of AbnormEvents is used as the independent variable, and the fourth row is used as the dependent variable. The regression coefficient matrix wj = [0.50554396] is calculated.

[0146] Step 1.33: Replace the non-zero value in column j+1 of W with the value of wj, and then execute Step 1.34;

[0147] Replace the non-zero values of W, so W = [[0,0,0,0.50554396],[0,0,0,0],[0,0,0,0],[0,0,0.24250067,0]]
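
Steps 1.29 to 1.34 can be sketched as a per-column least-squares fit. The text names only a "multiple linear regression parameter estimation method"; the sketch below assumes ordinary least squares with an intercept whose value is discarded, which appears consistent with the coefficients in the example. NumPy is assumed and the function name fit_edge_weights is hypothetical.

```python
import numpy as np

def fit_edge_weights(W, X):
    """Replace each non-zero W[i, j] by the regression slope of row i
    (independent variable) when explaining row j (dependent variable)."""
    W = np.array(W, dtype=float)
    X = np.asarray(X, dtype=float)
    for j in range(W.shape[1]):
        parents = np.flatnonzero(W[:, j])        # the list Xnum, zero-based
        if parents.size == 0:
            continue
        # Design matrix: one column per parent plus an intercept column.
        design = np.column_stack([X[parents].T, np.ones(X.shape[1])])
        coef, *_ = np.linalg.lstsq(design, X[j], rcond=None)
        W[parents, j] = coef[:-1]                # keep slopes, drop intercept
    return W

W0 = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
abnorm_events = [
    [163.05919588, 162.40519278, 37.75096995, 0, 0],
    [208.89393042, 35.58486225, 16.62017, 139.43848329, 240.65016523],
    [87.84212687, 15.16263657, 35.35738332, 4.71780382, 68.61560346],
    [665.28015379, 449.88898173, 363.4043527, 453.99482535, 560.52651552],
]
W_fit = fit_edge_weights(W0, abnorm_events)
```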

[0148] Step 1.34, j = j + 1, execute Step 1.29;

[0149] Step 1.35: Residual matrix $E = \mathrm{AbnormEvents} - W^{T} \times \mathrm{AbnormEvents}$; execute Step 1.36;

[0150] E=[[163.05919588,162.40519278,37.75096995,0,0],[208.89393042,35.58486225,16.62017,139.43848329,240.65016523],[-73.48875829,-93.93574436,-52.76841685,-105.37624695,-67.3124539],[582.84656294,367.78601819,344.31957803,453.99482535,560.52651552]]

[0151] Step 1.36: Calculate the Pearson correlation coefficients between the row vectors of the residual matrix E, and take the absolute value of the correlation coefficients to obtain the residual correlation coefficient matrix Ecorr. Then proceed to Step 1.37.

[0152] Ecorr=[[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01],[9.80100049e-02,9.60003300e-03,1.00000000e+00,3.82054910e-02],[2.50426100e-16,9.77601204e-01,3.82054910e-02,1.00000000e+00]]
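
Steps 1.35 and 1.36 can be sketched in a few lines. NumPy is assumed and the function name residual_correlation is hypothetical; the inputs are the fitted W and AbnormEvents from the example.

```python
import numpy as np

def residual_correlation(W, X):
    """Return E = X - W^T X and the absolute Pearson correlations between rows of E."""
    W = np.asarray(W, dtype=float)
    X = np.asarray(X, dtype=float)
    E = X - W.T @ X
    return E, np.abs(np.corrcoef(E))  # corrcoef treats each row as one variable

W = [[0, 0, 0, 0.50554396], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0.24250067, 0]]
abnorm_events = [
    [163.05919588, 162.40519278, 37.75096995, 0, 0],
    [208.89393042, 35.58486225, 16.62017, 139.43848329, 240.65016523],
    [87.84212687, 15.16263657, 35.35738332, 4.71780382, 68.61560346],
    [665.28015379, 449.88898173, 363.4043527, 453.99482535, 560.52651552],
]
E, ecorr = residual_correlation(W, abnorm_events)
```

A near-zero entry such as the one between the first and fourth rows is expected here: the fourth row of E is the residual of a least-squares fit on the first row, and such residuals are uncorrelated with the regressor.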

[0153] Step 1.37: Merge lists ServerList and NewServer, save as a list ServerListNew, and then execute Step 1.38;

[0154] ServerListNew=['Train','Restaurant','Cinema','OnlineCar','Plane','Hotel'];

[0155] Step 1.38: Output the following service list with changed call relationships: PartServer; residual correlation coefficient matrix: Ecorr; updated node list: ServerListNew; newly emerging service list: NewServer; service list with structural changes: ChangeServer; service name list: StopServer; service list directly connected to the newly emerging service: NewServerCorrList; correspondence between the newly emerging service and its directly connected service: NewServerCorr; service fault propagation graph: W; time period: period; and update trigger time: time. Then execute Step 2.

[0156] Output PartServer = ['Plane','Hotel','OnlineCar','Cinema'];

[0157] Ecorr=[[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01],[9.80100049e-02,9.60003300e-03,1.00000000e+00,3.82054910e-02],[2.50426100e-16,9.77601204e-01,3.82054910e-02,1.00000000e+00]]

[0158] ServerListNew=['Train','Restaurant','Cinema','OnlineCar','Plane','Hotel'];

[0159] NewServer=['Plane','Hotel'];

[0160] ChangeServer=['Cinema','OnlineCar'];

[0161] StopServer = ['Train'];

[0162] NewServerCorrList=['Plane','Hotel','OnlineCar'];

[0163] NewServerCorr={'Hotel':['Plane','OnlineCar'],'Plane':['Hotel']};

[0164] W=[[0,0,0,0.50554396],[0,0,0,0],[0,0,0,0],[0,0,0.24250067,0]];

[0165] period = 4s;

[0166] time='2022-02-02 08:00:07'.

[0167] Step 2 is described in detail below:

[0168] Step 2.1: Input the following service list with changed call relationships: PartServer; residual correlation coefficient matrix: Ecorr; updated node list: ServerListNew; newly emerging service list: NewServer; service list with structural changes: ChangeServer; service name list: StopServer; service list directly connected to the newly emerging service: NewServerCorrList; correspondence between the newly emerging service and its directly connected service: NewServerCorr; service fault propagation graph: W; time period: period; and update trigger time: time. Then execute Step 2.2.

[0169] PartServer=['Plane','Hotel','OnlineCar','Cinema'];

[0170] Ecorr=[[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01],[9.80100049e-02,9.60003300e-03,1.00000000e+00,3.82054910e-02],[2.50426100e-16,9.77601204e-01,3.82054910e-02,1.00000000e+00]]

[0171] ServerListNew=['Train','Restaurant','Cinema','OnlineCar','Plane','Hotel'];

[0172] NewServer=['Plane','Hotel'];

[0173] ChangeServer=['Cinema','OnlineCar'];

[0174] StopServer = ['Train'];

[0175] NewServerCorrList=['Plane','Hotel','OnlineCar'];

[0176] NewServerCorr={'Hotel':['Plane','OnlineCar'],'Plane':['Hotel']};

[0177] W=[[0,0,0,0.50554396],[0,0,0,0],[0,0,0,0],[0,0,0.24250067,0]];

[0178] period = 4s;

[0179] time='2022-02-02 08:00:07';

[0180] Step 2.2: Initialize n = 0, empty list AbnormPartEvents, and execute Step 2.3;

[0181] Step 2.3: If n is less than the length of the list PartServer, proceed to Step 2.4; otherwise, proceed to Step 2.17.

[0182] Step 2.4: Initialize i = 0, empty list AbnormPartEvent, and execute Step 2.5;

[0183] Step 2.5: Read the current monitoring data NowData of the nth service in PartServer from time-period to time, and then execute Step 2.6;

[0184] (The following explanation mainly uses data from the 'Plane' service);

[0185] When n=0, Plane's data from 2022-02-02 08:00:03 to 2022-02-02 08:00:07 is NowData=[[],[],[0.47,0.25,0.30],[0.79,0.25,0.86],[0.72,0.24,0.07]]

[0186] Step 2.6: Read the monitoring data NormData of the nth service in PartServer when it is running normally, and then execute Step 2.7;

[0187] Data during normal Plane operation:

[0188] NormData=[[0.02,0.30,0.96],[0.04,0.34,0.66],[0.03,0.20,0.74],[0.10,0.25,0.80],[0.03,0.26,0.59],[0.07,0.21,0.79]];

[0189] Step 2.7: If i is less than or equal to period, proceed to Step 2.8; otherwise, proceed to Step 2.15.

[0190] When i = 0, 0 <= 4, execute Step 2.8;

[0191] When i = 1, 1 <= 4, so execute Step 2.8;

[0192] When i = 2, 2 <= 4, so execute Step 2.8;

[0193] When i = 3, 3 <= 4, execute Step 2.8;

[0194] When i = 4, 4 <= 4, so execute Step 2.8;

[0195] When i = 5, 5 is not less than or equal to 4, so execute Step 2.15;

[0196] Step 2.8: If the value of NowData at time-period+i is not empty, proceed to Step 2.9; otherwise, proceed to Step 2.12.

[0197] When i = 0, the value of NowData corresponding to '2022-02-02 08:00:03' is [], which is empty. Execute Step 2.12.

[0198] When i = 1, the value of NowData corresponding to '2022-02-02 08:00:04' is [], which is empty. Execute Step 2.12.

[0199] When i = 2, the value of NowData corresponding to '2022-02-02 08:00:05' is [0.47, 0.25, 0.30], which is not empty, so Step 2.9 is executed;

[0200] When i = 3, the value of NowData corresponding to '2022-02-02 08:00:06' is [0.79, 0.25, 0.86], which is not empty, so Step 2.9 is executed;

[0201] When i = 4, the value of NowData corresponding to '2022-02-02 08:00:07' is [0.72, 0.24, 0.07], which is not empty, so Step 2.9 is executed;

[0202] Step 2.9: Assign the value of NowData at time-period+i to vector d, and then execute Step 2.10;

[0203] When i = 2, d = [0.47, 0.25, 0.30];

[0204] When i = 3, d = [0.79, 0.25, 0.86];

[0205] When i = 4, d = [0.72, 0.24, 0.07];

[0206] Step 2.10: Calculate the mean vector u and covariance matrix cov of NormData, then proceed to Step 2.11;

[0207] Calculate the average of the three columns of NormData and save it as u, u = [0.04833333, 0.26, 0.75666667];

[0208] Calculate the NormData covariance matrix cov = [[0.00093667, -0.00046, 0.00027333], [-0.00046, 0.00284, -0.00014], [0.00027333, -0.00014, 0.01634667]];

[0209] Step 2.11: Service anomaly event metric $T^{*} = (d - u) \times \mathrm{cov}^{-1} \times (d - u)^{T}$; proceed to Step 2.13;

[0210] where $\mathrm{cov}^{-1}$ is the inverse of the covariance matrix cov, and $(d - u)^{T}$ is the transpose of the vector $(d - u)$.

[0211] When i = 2, T* = 225.312216;

[0212] When i = 3, T* = 636.040490;

[0213] When i = 4, T* = 566.189485;

[0214] Step 2.12, Service exception event metric T* = 0, proceed to Step 2.13;

[0215] When i = 0, T* = 0;

[0216] When i = 1, T* = 0;

[0217] Step 2.13: Add the metric T* of the nth service exception event of PartServer to the end of AbnormPartEvent, and then execute Step 2.14;

[0218] When i = 0, AbnormPartEvent = [0];

[0219] When i = 1, AbnormPartEvent = [0, 0];

[0220] When i = 2, AbnormPartEvent = [0, 0, 225.312216];

[0221] When i = 3, AbnormPartEvent = [0, 0, 225.312216, 636.040490];

[0222] When i = 4, AbnormPartEvent = [0, 0, 225.312216, 636.040490, 566.189485];

[0223] Step 2.14, i = i + 1, execute Step 2.7;

[0224] Step 2.15: Add the abnormal event metric sequence AbnormPartEvent of the nth service of PartServer to the end of AbnormPartEvents, and then execute Step 2.16;

[0225] When n=3

[0226] AbnormPartEvents=[[0,0,225.31221619,636.0404899,566.18948537],[0,0,26.13297609,30.80327851,23.0272247],[453.99482535,560.52651552,473.72742308,563.95925359,826.18916333],[4.71780382,68.61560346,28.85878836,8.41985733,13.88152483]];

[0227] Step 2.16, n = n + 1, execute Step 2.3;

[0228] Step 2.17: Calculate the length L of the list NewServer, then execute Step 2.18;

[0229] L = 2;

[0230] Step 2.18: Add L rows and L columns to matrices W and Ecorr simultaneously, initializing the added values to 0, and then execute Step 2.19;

[0231] W=[[0,0,0,0.50554396,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0.24250067,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]];

[0232] Ecorr=[[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16,0,0],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01,0,0],[9.80100049e-02,9.60003300e-03,1.00000000e+00,3.82054910e-02,0,0],[2.50426100e-16,9.77601204e-01,3.82054910e-02,1.00000000e+00,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]];

[0233] Step 2.19: Select the row and column corresponding to the service in PartServer in matrix W, name it PartWold, and execute Step 2.20;

[0234] Given ServerListNew = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel'], whose order corresponds to the row and column order of W, and PartServer = ['Plane', 'Hotel', 'OnlineCar', 'Cinema'], rows 5, 6, 4, and 3 and columns 5, 6, 4, and 3 of W are selected in that order, resulting in PartWold = [[0,0,0,0],[0,0,0,0],[0,0,0,0.24250067],[0,0,0,0]].
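
Steps 2.17 to 2.19 can be sketched as a zero-padding followed by a name-based submatrix selection. NumPy is assumed and the function name pad_and_select is hypothetical.

```python
import numpy as np

def pad_and_select(W, node_list, part_server, n_new):
    """Append n_new zero rows/columns to W, then pick the rows/columns
    of part_server (in that order) as the local graph PartWold."""
    W_padded = np.pad(np.asarray(W, dtype=float), ((0, n_new), (0, n_new)))
    idx = [node_list.index(s) for s in part_server]
    return W_padded, W_padded[np.ix_(idx, idx)]

W = [[0, 0, 0, 0.50554396], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0.24250067, 0]]
server_list_new = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel']
part_server = ['Plane', 'Hotel', 'OnlineCar', 'Cinema']
new_server = ['Plane', 'Hotel']

W_padded, part_w_old = pad_and_select(W, server_list_new, part_server, len(new_server))
```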

[0235] Step 2.20: Calculate the residual matrix $\mathrm{Enew} = \mathrm{AbnormPartEvents} - \mathrm{PartWold}^{T} \times \mathrm{AbnormPartEvents}$; execute Step 2.21;

[0236] Enew=[[0,0,225.31221619,636.0404899,566.18948537],[0,0,26.13297609,30.80327851,23.0272247],[453.99482535,560.52651552,473.72742308,563.95925359,826.18916333],[-105.3762455,-67.31245211,-86.02042913,-128.34063951,-186.46990082]];

[0237] Step 2.21: Calculate the Pearson correlation coefficients between the row vectors of the residual matrix Enew, and take the absolute value of the correlation coefficients to obtain the residual correlation coefficient matrix Enewcorr. Then proceed to Step 2.22.

[0238] Enewcorr=[[1,0.8698132,0.61770636,0.7660864],[0.8698132,1,0.31059409,0.48310262],[0.61770636,0.31059409,1,0.82505566],[0.7660864,0.48310262,0.82505566,1]];

[0239] Step 2.22: Traverse ChangeServer. If, for a service in ChangeServer, none of the values in its row and column of Enewcorr is higher than the corresponding values in Ecorr, delete that service name from ChangeServer and delete the corresponding row from AbnormPartEvents. Then execute Step 2.23.

[0240] Because ChangeServer = ['Cinema', 'OnlineCar'];

[0241] The service names in PartServer = ['Plane', 'Hotel', 'OnlineCar', 'Cinema'] correspond to the row and column order of Enewcorr and the row order of AbnormPartEvents;

[0242] The service names in ServerListNew = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel'] correspond to the row and column order in Ecorr;

[0243] Since the values at the corresponding positions in Enewcorr are all higher than the values in Ecorr, the service names and rows corresponding to the Cinema and OnlineCar services are not deleted from ChangeServer and AbnormPartEvents.
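
One possible reading of the comparison in Step 2.22 is sketched below. The interpretation is an assumption: a service is kept when at least one correlation with another service in PartServer is strictly higher in Enewcorr than at the matching position in the padded Ecorr, and the diagonal is skipped. NumPy is assumed and the function name filter_change_server is hypothetical.

```python
import numpy as np

def filter_change_server(change_server, enewcorr, part_server, ecorr, node_list):
    """Keep a service only if some off-diagonal correlation increased."""
    enewcorr = np.asarray(enewcorr, dtype=float)
    ecorr = np.asarray(ecorr, dtype=float)
    kept = []
    for s in change_server:
        i_loc, i_glob = part_server.index(s), node_list.index(s)
        increased = any(
            enewcorr[i_loc, part_server.index(o)] > ecorr[i_glob, node_list.index(o)]
            for o in part_server if o != s
        )
        if increased:
            kept.append(s)
    return kept

part_server = ['Plane', 'Hotel', 'OnlineCar', 'Cinema']
server_list_new = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel']
enewcorr = [[1, 0.8698132, 0.61770636, 0.7660864],
            [0.8698132, 1, 0.31059409, 0.48310262],
            [0.61770636, 0.31059409, 1, 0.82505566],
            [0.7660864, 0.48310262, 0.82505566, 1]]
ecorr4 = [[1.0, 1.81829599e-01, 9.80100049e-02, 2.50426100e-16],
          [1.81829599e-01, 1.0, 9.60003300e-03, 9.77601204e-01],
          [9.80100049e-02, 9.60003300e-03, 1.0, 3.82054910e-02],
          [2.50426100e-16, 9.77601204e-01, 3.82054910e-02, 1.0]]
ecorr = np.pad(np.array(ecorr4), ((0, 2), (0, 2)))  # rows/columns for Plane, Hotel

change_server = filter_change_server(
    ['Cinema', 'OnlineCar'], enewcorr, part_server, ecorr, server_list_new)
```

Since both matrices are symmetric, scanning a service's row covers its column as well. Removing the matching rows of AbnormPartEvents is left out of the sketch.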

[0244] Step 2.23: Traverse NewServerCorr and check the independence of the two row vectors corresponding to the newly added service and its directly connected service in AbnormPartEvents. If they are independent and the directly connected service is not the newly added service, delete the service name of the directly connected service in NewServerCorr and NewServerCorrList, delete the row corresponding to the service name in AbnormPartEvents, and execute Step 2.24.

[0245] Since NewServerCorr = {'Hotel':['Plane','OnlineCar'],'Plane':['Hotel']}, the newly added service Hotel corresponds to row 2 of AbnormPartEvents, the directly connected service Plane corresponds to row 1, and the directly connected service OnlineCar corresponds to row 3. The independence checks for rows 2 and 3, rows 2 and 1, and rows 1 and 2 of AbnormPartEvents all failed, that is, no pair was found to be independent; therefore, no directly connected service name is deleted from NewServerCorr or NewServerCorrList, and row 3, which corresponds to OnlineCar, is not deleted from AbnormPartEvents.

[0246] Step 2.24: If a service in NewServer does not have a service directly connected to it, delete the service name from NewServer, delete the row corresponding to the service name from AbnormPartEvents, and then execute Step 2.25.

[0247] Since NewServerCorr = {'Hotel':['Plane','OnlineCar'],'Plane':['Hotel']}, no service names are deleted from NewServer, and no rows are deleted from AbnormPartEvents;

[0248] Step 2.25: Merge the lists NewServer, NewServerCorrList, and ChangeServer, remove duplicates, and save them as the update range list PartServer1. Then execute Step 2.26.

[0249] PartServer1=['Plane','Hotel','OnlineCar','Cinema'];

[0250] Step 2.26: Output the service failure propagation graph W, the residual correlation coefficient matrix Ecorr, the update range list PartServer1, the local abnormal event metric matrix AbnormPartEvents, the updated node list ServerListNew, and the list of stopped service names StopServer, then execute Step 3;

[0251] Output W = [[0,0,0,0.50554396,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0.24250067,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]];

[0252] Ecorr = [[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16,0,0],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01,0,0],[9.80100049e-02,9.60003300e-03,1.00000000e+00,3.82054910e-02,0,0],[2.50426100e-16,9.77601204e-01,3.82054910e-02,1.00000000e+00,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]];

[0253] PartServer1 = ['Plane', 'Hotel', 'OnlineCar', 'Cinema'];

[0254] AbnormPartEvents = [[0,0,225.31221619,636.0404899,566.18948537],[0,0,26.13297609,30.80327851,23.0272247],[453.99482535,560.52651552,473.72742308,563.95925359,826.18916333],[4.71780382,68.61560346,28.85878836,8.41985733,13.88152483]];

[0255] ServerListNew = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel'];

[0256] time = '2022-02-02 08:00:07';

[0257] StopServer = ['Train'];

[0258] Step 3 is described in detail below:

[0259] Step 3.1: Input the service failure propagation graph W, the residual correlation coefficient matrix Ecorr, the update range list PartServer1, the local abnormal event metric matrix AbnormPartEvents, the updated node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer, and then execute Step 3.2.

[0260] Input W = [[0,0,0,0.50554396,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0.24250067,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]];

[0261] Ecorr=[[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16,0,0],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01,0,0],[9.80100049e-02,9.60003300e-03,1.00000000e+00,3.82054910e-02,0,0],[2.50426100e-16,9.77601204e-01,3.82054910e-02,1.00000000e+00,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]];

[0262] PartServer1=['Plane','Hotel','OnlineCar','Cinema'];

[0263] AbnormPartEvents = [[0,0,225.31221619,636.0404899,566.18948537],[0,0,26.13297609,30.80327851,23.0272247],[453.99482535,560.52651552,473.72742308,563.95925359,826.18916333],[4.71780382,68.61560346,28.85878836,8.41985733,13.88152483]];

[0264] ServerListNew=['Train','Restaurant','Cinema','OnlineCar','Plane','Hotel'];

[0265] time = '2022-02-02 08:00:07';

[0266] StopServer = ['Train'];

[0267] Step 3.2: Reset the row and column corresponding to StopServer in W to 0, then proceed to Step 3.3;

[0268] Since StopServer = ['Train'] and ServerListNew = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel'], Train corresponds to the first row and first column of W. Therefore, the values in the first row and first column of W are reset to 0.

[0269] W=[[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0.24250067,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]]
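As a non-limiting illustration, Step 3.2 may be sketched in Python as follows. The function name reset_stopped is hypothetical and is used only for this example; the matrix and lists are the ones given above.

```python
def reset_stopped(W, server_list, stop_server):
    """Return a copy of W in which the row and column of every stopped service are 0."""
    out = [row[:] for row in W]
    for name in stop_server:
        k = server_list.index(name)  # position of the stopped service in the node list
        for i in range(len(out)):
            out[k][i] = 0  # clear row k
            out[i][k] = 0  # clear column k
    return out


server_list_new = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel']
W = [
    [0, 0, 0, 0.50554396, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0.24250067, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
W_reset = reset_stopped(W, server_list_new, ['Train'])
```

Only the entries in the row and column of 'Train' change; the entry 0.24250067 between OnlineCar and Cinema is kept.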

[0270] Step 3.3: Substitute the service anomaly event metric matrix AbnormPartEvents into the DirectLiNGAM algorithm to obtain the local service fault propagation graph PartW, and then execute Step 3.4;

[0271] PartW=[[0,0,0,0],[7.46769979,0,0,0],[0,0,0,0],[0,0,0,0]];

[0272] Step 3.4: Calculate the local residual matrix PartE = AbnormPartEvents - PartW^{T} × AbnormPartEvents, where T denotes the transpose, then execute Step 3.5;

[0273] PartE = [[0,0,30.1589961,406.01085339,394.22908428],[0,0,26.13297609,30.80327851,23.0272247],[453.99482535,560.52651552,473.72742308,563.95925359,826.18916333],[4.71780382,68.61560346,28.85878836,8.41985733,13.88152483]];
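By way of example only, the residual calculation of Step 3.4 may be sketched in plain Python using the PartW and AbnormPartEvents values given above. The helper names transpose, matmul and local_residual are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # Plain triple-loop matrix product, written with zip for brevity.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def local_residual(part_w, events):
    """PartE = events - part_w^T x events (rows are services, columns are time points)."""
    predicted = matmul(transpose(part_w), events)
    return [[e - p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(events, predicted)]


part_w = [[0, 0, 0, 0], [7.46769979, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
abnorm_part_events = [
    [0, 0, 225.31221619, 636.0404899, 566.18948537],
    [0, 0, 26.13297609, 30.80327851, 23.0272247],
    [453.99482535, 560.52651552, 473.72742308, 563.95925359, 826.18916333],
    [4.71780382, 68.61560346, 28.85878836, 8.41985733, 13.88152483],
]
part_e = local_residual(part_w, abnorm_part_events)
```

Because the only non-zero entry of PartW is in its second row and first column, only the first row of AbnormPartEvents is changed: 7.46769979 times the second row is subtracted from it.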

[0274] Step 3.5: Calculate the Pearson correlation coefficients between the row vectors of the local residual matrix PartE, and take the absolute value of the correlation coefficients to obtain the local residual correlation coefficient matrix PartEcorr. Then proceed to Step 3.6.

[0275] PartEcorr = [[1,0.8698132,0.61770636,0.7660864],[0.8698132,1,0.31059409,0.48310262],[0.61770636,0.31059409,1,0.82505566],[0.7660864,0.48310262,0.82505566,1]];
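The absolute Pearson correlation between row vectors described in Step 3.5 may, for illustration, be sketched as follows. The function name abs_pearson_matrix and the small input matrix are hypothetical, and returning 0 for a constant row (whose correlation is undefined) is an assumption of this sketch rather than something stated in the method.

```python
from math import sqrt


def abs_pearson_matrix(rows):
    """Absolute Pearson correlation between every pair of row vectors."""
    centered = []
    for r in rows:
        mean = sum(r) / len(r)
        centered.append([x - mean for x in r])
    norms = [sqrt(sum(x * x for x in c)) for c in centered]
    n = len(rows)
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0 or norms[j] == 0:
                corr[i][j] = 0.0  # assumption: constant row, correlation undefined
            else:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = abs(dot / (norms[i] * norms[j]))
    return corr


# Hypothetical residual rows chosen so the result is easy to check by hand.
toy_residuals = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [5, 5, 5]]
toy_corr = abs_pearson_matrix(toy_residuals)
```

In this toy input the second row is a multiple of the first and the third is its reverse, so both have an absolute correlation of 1 with the first row.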

[0276] Step 3.6: Replace the values at the corresponding positions of W and Ecorr with PartW and PartEcorr, then proceed to Step 3.7;

[0277] The service name of PartServer1 corresponds to the row and column order of PartW and PartEcorr, and ServerListNew corresponds to the row and column order of W and Ecorr, so after the replacement:

[0278] W=[[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0.24250067,0,0,0],[0,0,0,0,0,0],[0,0,0,0,7.46769979,0]];

[0279] Ecorr = [[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16,0,0],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01,0,0],[9.80100049e-02,9.60003300e-03,1.00000000e+00,0.82505566,0.7660864,0.48310262],[2.50426100e-16,9.77601204e-01,0.82505566,1.00000000e+00,0.61770636,0.31059409],[0,0,0.7660864,0.61770636,1,0.8698132],[0,0,0.48310262,0.31059409,0.8698132,1]];
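The name-based position mapping used in Step 3.6 may be illustrated with the following sketch, applied here to Ecorr and PartEcorr only. The function name replace_block is hypothetical, and overwriting every entry of the selected sub-block is an assumption about how "corresponding positions" is to be read.

```python
def replace_block(full, part, full_names, part_names):
    """Copy `part` into the rows and columns of `full` that are named in `part_names`."""
    out = [row[:] for row in full]
    idx = [full_names.index(name) for name in part_names]  # local index -> global index
    for a, i in enumerate(idx):
        for b, j in enumerate(idx):
            out[i][j] = part[a][b]
    return out


server_list_new = ['Train', 'Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel']
part_server1 = ['Plane', 'Hotel', 'OnlineCar', 'Cinema']
ecorr = [
    [1.00000000e+00, 1.81829599e-01, 9.80100049e-02, 2.50426100e-16, 0, 0],
    [1.81829599e-01, 1.00000000e+00, 9.60003300e-03, 9.77601204e-01, 0, 0],
    [9.80100049e-02, 9.60003300e-03, 1.00000000e+00, 3.82054910e-02, 0, 0],
    [2.50426100e-16, 9.77601204e-01, 3.82054910e-02, 1.00000000e+00, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
part_ecorr = [
    [1, 0.8698132, 0.61770636, 0.7660864],
    [0.8698132, 1, 0.31059409, 0.48310262],
    [0.61770636, 0.31059409, 1, 0.82505566],
    [0.7660864, 0.48310262, 0.82505566, 1],
]
ecorr_new = replace_block(ecorr, part_ecorr, server_list_new, part_server1)
```

For instance, 'Cinema' is the fourth name in PartServer1 and the third in ServerListNew, and 'OnlineCar' is the third and fourth respectively, so PartEcorr[3][2] = 0.82505566 is written to Ecorr[2][3].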

[0280] Step 3.7: Set the service failure propagation graph generation time t = time, then proceed to Step 3.8;

[0281] t = '2022-02-02 08:00:07'

[0282] Step 3.8: Set the node list ServerList = ServerListNew, then proceed to Step 3.9;

[0283] ServerList=['Train','Restaurant','Cinema','OnlineCar','Plane','Hotel']

[0284] Step 3.9: Output the service failure propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the node list ServerList, and the service failure propagation graph generation time t;

[0285] Output W = [[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0.24250067,0,0,0],[0,0,0,0,0,0],[0,0,0,0,7.46769979,0]];

[0286] Ecorr = [[1.00000000e+00,1.81829599e-01,9.80100049e-02,2.50426100e-16,0,0],[1.81829599e-01,1.00000000e+00,9.60003300e-03,9.77601204e-01,0,0],[9.80100049e-02,9.60003300e-03,1.00000000e+00,0.82505566,0.7660864,0.48310262],[2.50426100e-16,9.77601204e-01,0.82505566,1.00000000e+00,0.61770636,0.31059409],[0,0,0.7660864,0.61770636,1,0.8698132],[0,0,0.48310262,0.31059409,0.8698132,1]];

[0287] ServerList=['Train','Restaurant','Cinema','OnlineCar','Plane','Hotel'];

[0288] t = '2022-02-02 08:00:07'.

[0289] This invention determines the local update scope of the service failure propagation graph by comprehensively analyzing changes in the service call graph. It also combines service operation data from a cloud computing environment to measure service anomalies and uses a causal inference algorithm to infer the causal relationships of these local anomalies, thereby updating the service failure propagation graph. Since updates are only required for services whose call relationships in the service failure propagation graph have changed, the efficiency of constructing the service failure propagation graph is improved.

[0290] The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.

Claims

1. A method for local updating of a service fault propagation graph in a cloud environment based on a service call graph, characterized in that it comprises: Step 1: Analyze changes in the service call graph: analyze the service call graph and filter out services whose current call relationships have changed; Step 2: Identify the update scope of the service failure propagation graph: combine service operation data in the cloud computing environment to analyze whether changes in service call relationships lead to changes in service failure relationships, update the list of services whose call relationships have changed, and obtain the update scope list; Step 3: Partially update the service failure propagation graph: construct a local failure propagation graph and update the original service failure propagation graph; Step 1 includes: Input the service failure propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service failure propagation graph node list ServerList, and the service failure propagation graph generation time t; Read the first service call graph adjacency matrix Mnew and the first node list MnewList from the first time to the update trigger time; where the first time represents the difference between the update trigger time and the time period; Read the second service call graph adjacency matrix M and the second node list MList from the second time to the service failure propagation graph generation time; where the second time represents the difference between the service failure propagation graph generation time and the time period; Based on the first node list MnewList and the second node list MList, construct the list of stopped service names StopServer and the list of newly emerging services NewServer; wherein service names that are not in the first node list MnewList but are included in the second node list MList are stored as the list of stopped service names StopServer, and service names that are included in the first node list MnewList but are not in the second node list MList are stored as the list of newly emerging services NewServer; Based on the services that are directly connected to the services in NewServer in the first service call graph adjacency matrix Mnew, establish a list NewServerCorr of correspondences between the newly emerging services and their directly connected services, and a list NewServerCorrList of services directly connected to the newly emerging services; Based on the list of newly emerging services NewServer, update the adjacency matrix Mnew and the first node list MnewList of the first service call graph; based on the list of stopped service names StopServer, update the adjacency matrix M and the second node list MList of the second service call graph; By filtering out the services whose structure differs between the first service call graph adjacency matrix Mnew and the second service call graph adjacency matrix M, establish a list ChangeServer of services whose structure has changed;
Merge the newly emerging service list NewServer, the service list NewServerCorrList directly connected to the newly emerging service, and the service list ChangeServer whose structure has changed, and establish the service list PartServer whose calling relationship has changed; If the residual correlation coefficient matrix Ecorr is empty, obtain the abnormal event measurement sequences of n services in the service failure propagation graph node list ServerList, and establish a global abnormal event measurement matrix AbnormEvents; obtain the regression coefficient matrix wj based on the service failure propagation graph adjacency matrix W and the global abnormal event measurement matrix AbnormEvents; update the service failure propagation graph adjacency matrix W using the regression coefficient matrix wj; obtain the first residual matrix E based on the updated service failure propagation graph adjacency matrix W and the global abnormal event measurement matrix AbnormEvents; obtain the updated residual correlation coefficient matrix Ecorr based on the first residual matrix E; merge the service failure propagation graph node list ServerList and the newly appearing service list NewServer, and store it as the new node list ServerListNew of the service failure propagation graph; if the residual correlation coefficient matrix Ecorr is not empty, directly merge the service failure propagation graph node list ServerList and the newly appearing service list NewServer, and store it as the new node list ServerListNew of the service failure propagation graph. 
Output the following: list of services whose call relationships have changed (PartServer), residual correlation coefficient matrix (Ecorr), list of new nodes in the service failure propagation graph (ServerListNew), list of newly emerging services (NewServer), list of services with structural changes (ChangeServer), list of services that have stopped running (StopServer), list of services directly connected to the newly emerging services (NewServerCorrList), list of correspondences between the newly emerging services and their directly connected services (NewServerCorr), adjacency matrix W of the service failure propagation graph, time period (period), and update trigger time (time). Then execute Step 2. Step 2 includes: The input includes the following: the list of services whose call relationships have changed (PartServer), the residual correlation coefficient matrix (Ecorr), the list of new nodes in the service failure propagation graph (ServerListNew), the list of newly emerging services (NewServer), the list of services with structural changes (ChangeServer), the list of services that have stopped running (StopServer), the list of services directly connected to the newly emerging services (NewServerCorrList), the list of correspondences between the newly emerging services and their directly connected services (NewServerCorr), the adjacency matrix of the service failure propagation graph (W), the time period (period), and the update trigger time (time). 
From the first time to the update trigger time, obtain the abnormal event measurement sequences of the n services in the service list PartServer whose call relationships have changed, and establish a local abnormal event measurement matrix AbnormPartEvents; where the first time represents the difference between the update trigger time and the time period; Calculate the length L of the list NewServer; Add L rows and L columns to the end of both the adjacency matrix W and the residual correlation coefficient matrix Ecorr, and initialize the added values to 0; Select the rows and columns corresponding to the services in PartServer in the adjacency matrix W, and name the result PartWold, the local service failure propagation graph before the update; Based on the abnormal event measurement matrix AbnormPartEvents of the n services in PartServer and the local service failure propagation graph PartWold before the update, the second residual matrix Enew is obtained; Based on the second residual matrix Enew, a new residual correlation coefficient matrix Enewcorr is obtained; Iterate through ChangeServer: if none of the values in the row and column of Enewcorr corresponding to a service in ChangeServer is higher than the values in the corresponding row and column of Ecorr, delete the name of that service from ChangeServer and delete the row corresponding to that service name from AbnormPartEvents; otherwise, do nothing; Iterate through NewServerCorr and check the independence of the two row vectors in AbnormPartEvents corresponding to a newly emerging service and its directly connected service: if they are independent and the directly connected service is not itself a newly emerging service, delete the name of the directly connected service from NewServerCorr and NewServerCorrList, and delete the row corresponding to that service name from AbnormPartEvents; otherwise, do nothing;
If a service in NewServer has no service directly connected to it, delete the name of that service from NewServer and delete the corresponding row from AbnormPartEvents; otherwise, do nothing; Merge the lists NewServer, NewServerCorrList, and ChangeServer, remove duplicates, and save the result as the service failure propagation graph update scope list PartServer1; Output the service failure propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service failure propagation graph update scope list PartServer1, the local abnormal event metric matrix AbnormPartEvents, the service failure propagation graph new node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer, and execute Step 3; Step 3 includes: Input the service failure propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service failure propagation graph update scope list PartServer1, the local abnormal event metric matrix AbnormPartEvents, the service failure propagation graph new node list ServerListNew, the update trigger time time, and the list of stopped service names StopServer;
Reset the row and column corresponding to StopServer in the adjacency matrix W to 0; Based on the local abnormal event metric matrix AbnormPartEvents, the local fault propagation graph PartW is obtained; Based on the local fault propagation graph PartW and AbnormPartEvents, the local residual matrix PartE is obtained; Based on the local residual matrix PartE, the local residual correlation coefficient matrix PartEcorr is obtained; Replace the values at the corresponding positions of W and Ecorr with PartW and PartEcorr; Set the service failure propagation graph generation time t = time; Set the service failure propagation graph node list ServerList = ServerListNew; Output the updated service failure propagation graph adjacency matrix W, the residual correlation coefficient matrix Ecorr, the service failure propagation graph node list ServerList, and the service failure propagation graph generation time t.
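For illustration only, and without limiting the claim, the construction of StopServer, NewServer and NewServerCorr recited in Step 1 may be sketched as follows. The function names and the input lists and matrix are hypothetical example values, and treating a non-zero entry in either direction as "directly connected" is an assumption of this sketch.

```python
def diff_services(mnew_list, m_list):
    """Split services into stopped ones and newly emerging ones."""
    stop_server = [s for s in m_list if s not in mnew_list]
    new_server = [s for s in mnew_list if s not in m_list]
    return stop_server, new_server


def direct_neighbours(mnew, mnew_list, new_server):
    """Map each newly emerging service to the services it is directly connected to."""
    corr = {}
    for name in new_server:
        k = mnew_list.index(name)
        corr[name] = [
            mnew_list[i]
            for i in range(len(mnew_list))
            if i != k and (mnew[k][i] != 0 or mnew[i][k] != 0)
        ]
    return corr


# Hypothetical node lists and call graph, for illustration only.
m_list = ['Train', 'Restaurant', 'Cinema', 'OnlineCar']
mnew_list = ['Restaurant', 'Cinema', 'OnlineCar', 'Plane', 'Hotel']
mnew = [[0] * 5 for _ in range(5)]
mnew[2][3] = 1  # OnlineCar calls Plane
mnew[3][4] = 1  # Plane calls Hotel

stop_server, new_server = diff_services(mnew_list, m_list)
new_server_corr = direct_neighbours(mnew, mnew_list, new_server)
```

With these inputs, 'Train' is the only stopped service and 'Plane' and 'Hotel' are the newly emerging services.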

2. The method for partial updating of cloud environment service fault propagation graph based on service call graph as described in claim 1, characterized in that obtaining the abnormal event measurement sequences of the n services in the service failure propagation graph node list ServerList and establishing the global abnormal event measurement matrix AbnormEvents includes: Read the monitoring data NowData of the nth service in ServerList from the second time to the service failure propagation graph generation time; Read NormData, the monitoring data of the nth service in ServerList when it is running normally; For any given time from the second time to the service failure propagation graph generation time, if the value of NowData at that time is not empty, assign the value of NowData at that time to the vector d, and calculate the mean vector u and the covariance matrix cov of NormData; based on the vector d, the mean vector u and the covariance matrix cov, obtain the service anomaly event metric T*; if the value of NowData at that time is empty, the service anomaly event metric T* = 0; Based on the service anomaly event metrics, establish an abnormal event measurement sequence, so as to obtain the abnormal event measurement matrix AbnormEvents for the n services in ServerList.

3. The method for partial updating of cloud environment service fault propagation graph based on service call graph as described in claim 2, characterized in that the expression for obtaining the service anomaly event metric T* based on the vector d, the mean vector u, and the covariance matrix cov is: T* = (d - u) × cov^{-1} × (d - u)^{T}; where T represents the transpose.
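As an illustrative sketch only, the metric T* may be computed as shown below. The function names, the sample data, and the use of the sample covariance (divisor n - 1) are assumptions of this sketch; the claim does not fix the normalisation. The linear system is solved by Gaussian elimination instead of forming cov^{-1} explicitly.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def anomaly_metric(d, norm_data):
    """T* = (d - u) cov^{-1} (d - u)^T; returns 0 when the observation is missing."""
    if d is None:
        return 0.0
    n, k = len(norm_data), len(d)
    u = [sum(row[j] for row in norm_data) / n for j in range(k)]
    # Assumption: sample covariance with divisor n - 1.
    cov = [[sum((row[i] - u[i]) * (row[j] - u[j]) for row in norm_data) / (n - 1)
            for j in range(k)] for i in range(k)]
    diff = [d[j] - u[j] for j in range(k)]
    return sum(x * y for x, y in zip(diff, solve(cov, diff)))


# Hypothetical normal-operation samples (rows are time points, columns are metrics).
norm_data = [[1, 2], [2, 1], [3, 4], [4, 3]]
t_far = anomaly_metric([3.5, 3.5], norm_data)
t_mean = anomaly_metric([2.5, 2.5], norm_data)
t_missing = anomaly_metric(None, norm_data)
```

An observation equal to the mean of the normal data gives T* = 0, and the value grows as the observation moves away from the normal operating range.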

4. The method for partial updating of cloud environment service fault propagation graph based on service call graph according to claim 1, characterized in that, The process of obtaining the regression coefficient matrix wj based on the service failure propagation graph adjacency matrix W and the global abnormal event metric matrix AbnormEvents includes: for each (j+1)th column of W with a non-zero value, recording the row number of the (j+1)th column of W with a non-zero value and storing it as a list Xnum; using the row with row number Xnum of AbnormEvents as the independent variable and the (j+1)th row of AbnormEvents as the dependent variable, obtaining the regression coefficient matrix wj through a multiple linear regression parameter estimation method; where j is less than the length of the list ServerList.
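The per-column regression described above may be sketched, by way of example only, as follows. The function names and input values are hypothetical, and fitting without an intercept through the normal equations is an assumption of this sketch; the claim only requires a multiple linear regression parameter estimation method.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def refit_weights(W, events):
    """Re-estimate every non-zero entry of W from the event matrix."""
    n = len(W)
    out = [row[:] for row in W]
    for j in range(n):
        xnum = [i for i in range(n) if W[i][j] != 0]  # rows with a non-zero value in column j
        if not xnum:
            continue
        # Normal equations (no intercept, an assumption): (X X^T) w = X y
        gram = [[dot(events[p], events[q]) for q in xnum] for p in xnum]
        rhs = [dot(events[p], events[j]) for p in xnum]
        for p, coef in zip(xnum, solve(gram, rhs)):
            out[p][j] = coef
    return out


# Hypothetical data: the second row is exactly twice the first.
events = [[1, 2, 3, 4], [2, 4, 6, 8], [0, 1, 0, 1]]
W_old = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
W_refit = refit_weights(W_old, events)
```

Only positions that are already non-zero in W are re-estimated, so the structure of the graph is kept and only the edge weights change.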

5. The method for partial updating of cloud environment service fault propagation graph based on service call graph according to claim 1, characterized in that the first residual matrix E is obtained based on the updated service failure propagation graph adjacency matrix W and the global abnormal event metric matrix AbnormEvents as E = AbnormEvents - W^{T} × AbnormEvents; where T represents the transpose.

6. A cloud environment service fault propagation graph partial update system based on service call graph for implementing the method of claim 1, characterized in that it comprises: a service call graph change analysis module, used to analyze the service call graph and filter out services whose current call relationships have changed; a service failure propagation graph update scope identification module, used to combine service operation data in the cloud computing environment to analyze whether changes in service call relationships lead to changes in service failure relationships, update the list of services whose call relationships have changed, and obtain the update scope list; and a service failure propagation graph partial update module, used to construct a local fault propagation graph and update the original service fault propagation graph.

7. A processor, characterized in that, The processor is used to run a program, wherein the program executes the cloud environment service fault propagation graph local update method based on service call graph as described in any one of claims 1-5.