Chaos testing method, apparatus, device, medium, and program product
By using a risk prediction model and a fault knowledge graph in chaos testing, the application addresses the inaccurate fault injection and reliance on manual intervention found in existing techniques, enabling more automated fault injection, broader test coverage, and more accurate and efficient testing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INDUSTRIAL AND COMMERCIAL BANK OF CHINA
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing chaos testing methods suffer from low accuracy in fault injection, rely on manual intervention which is inefficient, struggle to cover all potential faults, and lack automation in complex distributed systems.
The multi-source heterogeneous operating data of the system at the current moment is input into a risk prediction model trained on a fault knowledge graph and historical data. The model predicts a risk value, from which the fault injection strategy is determined; the target fault is then located from the fault link and injected into the system for testing.
It improves the accuracy and efficiency of fault injection, enabling more precise simulation of system faults and enhancing the automation and coverage of testing.
Figure CN122240475A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of big data, and more specifically to a chaos testing method, apparatus, device, medium, and program product.
Background Technology
[0002] Chaos testing is a testing method that verifies the stability and recovery capability of a system by simulating system failures. Generally, chaos testing has the following problems: (1) The accuracy of the injected faults is not high, resulting in poor test results. (2) Fault injection depends on manual intervention, which is inefficient and makes it difficult to cover all potential faults. (3) The testing process lacks automation, making it difficult to conduct chaos testing in complex distributed systems.
Summary of the Invention
[0003] In view of the above problems, embodiments of this application provide a chaos testing method, apparatus, device, medium, and program product.
[0004] According to a first aspect of this application, a chaos testing method is provided, comprising: inputting multi-source heterogeneous operating data of the system at the current moment into a risk prediction model; analyzing the multi-source heterogeneous operating data using the risk prediction model and obtaining a predicted risk value; the multi-source heterogeneous operating data includes code defect information, performance indicators, and system log information; the risk prediction model is used to predict faults that will occur in the system at future moments; the risk prediction model is trained based on a fault knowledge graph and historical multi-source heterogeneous operating data; determining a fault injection strategy based on the relationship between the predicted risk value and a preset risk threshold; obtaining fault links from the fault knowledge graph based on the multi-source heterogeneous operating data; entities in the fault knowledge graph include historical multi-source heterogeneous operating data, which includes historical code defects, historical performance indicators, and historical system log information; the relationships between entities represent the causal associations of historical faults; determining a target fault from the fault links based on the fault injection strategy; and injecting the target fault into the system to perform chaos testing.
[0005] According to an embodiment of this application, the risk prediction model is used to analyze multi-source heterogeneous operation data and obtain a predicted risk value, including: performing time-series analysis on the multi-source heterogeneous operation data and obtaining anomaly scores based on historical multi-source heterogeneous operation data; tracing back the fault knowledge graph from the current node to determine the graph propagation score, where the graph propagation score represents the probability of the occurrence of the historical fault that has the greatest impact on the current node; and fusing the anomaly score and the graph propagation score to obtain the predicted risk value.
[0006] According to an embodiment of this application, an anomaly score is obtained based on historical multi-source heterogeneous operation data, including: determining the deviation between each multi-source heterogeneous operation data and its historical multi-source heterogeneous operation data; obtaining the weight of each multi-source heterogeneous operation data based on the degree of influence of each historical multi-source heterogeneous operation data on historical faults; weighting the deviation of the multi-source heterogeneous operation data and its weight, and fusing the weighted values to obtain the anomaly score.
[0007] According to an embodiment of this application, obtaining fault links from a fault knowledge graph based on multi-source heterogeneous operational data includes: mapping the multi-source heterogeneous operational data to entities in the fault knowledge graph and marking abnormal nodes; traversing the fault knowledge graph starting from the abnormal nodes to obtain candidate fault links; scoring the candidate fault links based on their impact range, the duration of the fault in the candidate fault links, and the recovery time, and sorting them from high to low scores; and selecting the highest-scoring candidate fault link as the fault link.
[0008] According to an embodiment of this application, a fault injection strategy is determined based on the relationship between the predicted risk value and a preset risk threshold, including: if the predicted risk value is greater than a first preset risk threshold, then a combined fault injection strategy is executed; if the predicted risk value is less than the first preset risk threshold but greater than a second preset risk threshold, then a single fault injection strategy is executed; if the predicted risk value is less than the second preset risk threshold, then no fault injection operation is performed and the operating status of the system is monitored.
[0009] According to an embodiment of this application, determining a target fault from a fault link based on a fault injection strategy includes: determining key fault nodes from the fault link based on a fault knowledge graph, obtaining the propagation probability of the key fault nodes and sorting them from high to low according to the probability value, where the propagation probability represents the probability that the fault is generated by the key fault node; if a combined fault injection strategy is executed, the top N key fault nodes are taken as target faults, where N is an integer greater than 1; if a single fault injection strategy is executed, the key fault node ranked first is taken as the target fault.
[0010] According to an embodiment of this application, the method further includes: if the system availability index after the chaos test exceeds its corresponding benchmark value, and the excess part is greater than a preset threshold, then the target fault and its fault link are stored in the fault knowledge graph to update the fault knowledge graph, wherein the system availability index includes fault recovery time, fault-tolerant switching success rate and business completion rate; and the risk prediction model is optimized based on the updated fault knowledge graph.
[0011] According to a second aspect of this application, a chaos testing device is provided, comprising: a fault prediction module, used to input multi-source heterogeneous operating data of the system at the current moment into a risk prediction model, analyze the multi-source heterogeneous operating data using the risk prediction model and obtain a predicted risk value, wherein the multi-source heterogeneous operating data includes code defect information, performance indicators and system log information, the risk prediction model is used to predict faults that will occur in the system at future moments, and the risk prediction model is trained based on a fault knowledge graph and historical multi-source heterogeneous operating data; a fault injection strategy determination module, used to determine a fault injection strategy based on the relationship between the predicted risk value and a preset risk threshold; a fault link acquisition module, used to acquire fault links from the fault knowledge graph based on the multi-source heterogeneous operating data, wherein the entities in the fault knowledge graph include historical multi-source heterogeneous operating data, which includes historical code defects, historical performance indicators and historical system log information, and the relationships between entities represent the causal associations of historical faults; a target fault determination module, used to determine a target fault from the fault links based on the fault injection strategy; and a fault injection module, used to inject the target fault into the system for chaos testing.
[0012] According to a third aspect of this application, an electronic device is provided, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method described above.
[0013] According to a fourth aspect of this application, a computer-readable storage medium is also provided, on which a computer program or instructions are stored, wherein the computer program or instructions, when executed by a processor, implement the steps of the above-described method.
[0014] According to a fifth aspect of this application, a computer program product is also provided, including a computer program or instructions that, when executed by a processor, implement the steps of the above-described method.
Attached Figure Description
[0015] The above-mentioned contents, other objects, features and advantages of this application will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:
[0016] Figure 1 schematically illustrates an application scenario of the chaos testing method, apparatus, device, medium, and program product according to embodiments of this application;
[0017] Figure 2 schematically illustrates a flowchart of a chaos testing method according to an embodiment of this application;
[0018] Figure 3 schematically illustrates a complete flowchart of a chaos testing method according to an embodiment of this application;
[0019] Figure 4 schematically illustrates a method for obtaining predicted risk values according to an embodiment of this application;
[0020] Figure 5 schematically illustrates a method for obtaining a fault link according to an embodiment of this application;
[0021] Figure 6 schematically illustrates a method for determining a fault injection strategy according to an embodiment of this application;
[0022] Figure 7 schematically illustrates a fault knowledge graph updating method according to an embodiment of this application;
[0023] Figure 8 schematically illustrates a chaos testing apparatus according to an embodiment of this application;
[0024] Figure 9 schematically illustrates a block diagram of an electronic device suitable for implementing a chaos testing method according to an embodiment of this application.
Detailed Implementation
[0025] The embodiments of this application will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of this application. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of this application for ease of explanation. However, it will be apparent that one or more embodiments may be implemented without these specific details. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts of this application.
[0026] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0027] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.
[0028] When using expressions such as "at least one of A, B and C", they should generally be interpreted in accordance with the meaning that is commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" should include, but is not limited to, a system having A alone, a system having B alone, a system having C alone, a system having A and B, a system having A and C, a system having B and C, and / or a system having A, B and C, etc.).
[0029] As used herein, the term "model" refers to a model that learns the relationship between inputs and outputs from training data, enabling it to generate corresponding outputs for a given input after training. Model generation can be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs using multiple layers of processing units. A neural network model is an example of a deep learning-based model. Herein, "model" may also be referred to as a "machine learning model," "learning model," "machine learning network," or "learning network," and these terms are used interchangeably.
[0030] Figure 1 schematically illustrates an application scenario of the chaos testing method according to an embodiment of this application. As shown in Figure 1, application scenario 100 according to an embodiment of this application may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 serves as a medium for providing a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables. For example, a user can use the first terminal device 101, the second terminal device 102, and the third terminal device 103 to interact with the server 105 through the network 104 to receive or send information, etc.
[0031] The first terminal device 101, the second terminal device 102, and the third terminal device 103 can be electronic devices such as smartphones, wearable devices, personal computers, intelligent voice interaction devices, smart home appliances, intelligent vehicles, in-vehicle terminals, aircraft, unmanned vending terminals, and extended reality devices. Extended reality devices can include virtual reality devices, augmented reality devices, and mixed reality devices. A client application for the target application can be installed and run on the terminal devices. This target application can include, but is not limited to, financial transaction applications, payment applications, shopping applications, web browser applications, search applications, instant messaging tools, email clients, and social media platform software (these are just examples). Furthermore, this application embodiment does not limit the form of the target application, and it can include, but is not limited to, applications, mini-programs, etc., installed on the terminal devices, and can also be in the form of web pages.
[0032] Server 105 can be a server providing various services, such as a backend management server supporting websites browsed by users using the first terminal device 101, the second terminal device 102, and the third terminal device 103 (this is just an example). The backend management server can analyze and process received user requests and other data, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and basic cloud computing services such as big data. The server can be the backend server of the aforementioned target application, used to provide backend services to the clients of the target application.
[0033] It should be noted that the chaos testing method provided in this application embodiment can generally be executed by server 105 and / or terminal devices 101-103. Accordingly, the chaos testing device provided in this application embodiment can generally be set in server 105 and / or terminal devices 101-103.
[0034] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0035] Figure 2 schematically illustrates a flowchart of a chaos testing method according to an embodiment of this application. As shown in Figure 2, the chaos testing method 200 according to an embodiment of this application may include steps S210 to S250.
[0036] In step S210, the multi-source heterogeneous operation data of the system at the current moment is input into the risk prediction model, and the risk prediction model is used to analyze the multi-source heterogeneous operation data and obtain the predicted risk value.
[0037] In step S220, a fault injection strategy is determined based on the relationship between the predicted risk value and the preset risk threshold.
[0038] In step S230, fault links are obtained from the fault knowledge graph based on multi-source heterogeneous operation data.
[0039] In step S240, the target fault is determined from the fault link based on the fault injection strategy.
[0040] In step S250, the target fault is injected into the system to perform a chaos test.
[0041] In some embodiments, in step S210, the multi-source heterogeneous operational data includes code defect information, performance metrics, and system log information. The risk prediction model is used to predict future system failures, and the risk prediction model is trained based on a fault knowledge graph and historical multi-source heterogeneous operational data. Further, the code defect information includes fault types, conditions triggering the fault, etc.; the performance metrics include memory fluctuation trends, data read / write interaction trends between the system and disk storage devices, etc.; and the system log information includes error code frequency and abnormal call chains, etc.
[0042] In some embodiments, in step S230, the fault knowledge graph stores the causal relationships of historical faults, such as high CPU utilization leading to high process load, and database response timeout leading to downstream service timeout.
[0043] According to the embodiments of this application, a fault injection strategy is first determined based on the risk value predicted by the risk prediction model. Then, based on multi-source heterogeneous operating data, a fault link is determined from the fault knowledge graph. The fault injection strategy is then used to further determine the target fault to be injected from the fault link. By determining the specific target fault step by step in this way, the system fault can be accurately predicted, the accuracy of the target fault can be improved, and thus the accuracy of the test can be improved.
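The step order S210 to S250 can be sketched as a small orchestration function. This is only an illustrative sketch, not the implementation of this application: the names `ChaosTestPlan`, `plan_chaos_test`, and `run_chaos_test`, and the idea of passing each step in as a callable, are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ChaosTestPlan:
    """Outcome of steps S210-S240 (field names are hypothetical)."""
    risk_value: float
    strategy: str
    fault_link: List[str]
    target_faults: List[str]


def plan_chaos_test(
    operating_data: Any,
    predict_risk: Callable[[Any], float],
    choose_strategy: Callable[[float], str],
    get_fault_link: Callable[[Any], List[str]],
    pick_targets: Callable[[List[str], str], List[str]],
) -> ChaosTestPlan:
    """Run steps S210-S240 in order; each step is supplied by the caller."""
    risk_value = predict_risk(operating_data)           # S210: predicted risk value
    strategy = choose_strategy(risk_value)              # S220: fault injection strategy
    fault_link = get_fault_link(operating_data)         # S230: fault link from the graph
    target_faults = pick_targets(fault_link, strategy)  # S240: target fault(s)
    return ChaosTestPlan(risk_value, strategy, fault_link, target_faults)


def run_chaos_test(plan: ChaosTestPlan, inject: Callable[[str], None]) -> int:
    """S250: inject every target fault and return how many were injected."""
    for fault in plan.target_faults:
        inject(fault)
    return len(plan.target_faults)
```

Keeping each step behind a callable mirrors the step-by-step narrowing described above: the strategy depends only on the risk value, and the target fault depends only on the fault link and the strategy.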
[0044] Figure 3 schematically illustrates the complete flowchart of a chaos testing method according to an embodiment of this application.
[0045] As shown in Figure 3, the complete chaos testing method includes steps S310 to S3100. Specifically, in step S310, the test objective is set first, for example, testing the system's disaster recovery capability.
[0046] In step S320, the system status is monitored in real time, or in other words, the multi-source heterogeneous operating data of the system is acquired in real time, so as to determine the fault to be injected using method 200, that is, to generate a fault scenario, such as determining that the fault to be injected is network latency, node failure, etc.
[0047] In step S330, it is determined whether the predicted risk value obtained by method 200 in step S320 is greater than a preset risk threshold. If not, the process returns to step S320; if so, step S340 is executed.
[0048] In step S340, a fault injection strategy is generated, for example, whether to inject a combination of faults or a single fault.
[0049] In step S350, a test environment is selected according to requirements, including a pre-release test environment and other test environments.
[0050] In step S360, the fault to be injected is injected through the automated platform to perform a chaos test.
[0051] In step S370, multi-dimensional monitoring is performed. That is, the performance indicators of the system after a fault is injected are monitored from multiple dimensions.
[0052] In step S380, test results are analyzed to assess the system's recovery capability and performance loss.
[0053] In step S390, the fault knowledge graph is updated based on the analysis results. For example, if a new fault is identified from the test results, information related to the new fault is added to the fault knowledge graph to update it. Based on the updated fault knowledge graph, step S320 is then executed to regenerate the fault to be injected according to the test objective.
[0054] In step S3100, the test environment is restored to ensure that the test environment can perform chaos tests again.
[0055] Figure 4 schematically illustrates a method for obtaining predicted risk values according to an embodiment of this application.
[0056] In some embodiments, the risk prediction model is used to analyze multi-source heterogeneous operation data and obtain predicted risk values, including: performing time-series analysis on multi-source heterogeneous operation data and obtaining anomaly scores for multi-source heterogeneous operation data based on historical multi-source heterogeneous operation data; tracing back the fault knowledge graph from the current node to determine the graph propagation score, where the graph propagation score represents the probability of the occurrence of the historical fault that has the greatest impact on the current node; and fusing the anomaly score and the graph propagation score to obtain the predicted risk value.
[0057] As shown in Figure 4, step S410 is executed to perform time-series analysis on the multi-source heterogeneous operation data 401 and obtain anomaly scores, to trace back the fault knowledge graph 402 from the current node and determine the graph propagation score 403, and to obtain the predicted risk value 404 based on the anomaly scores and the graph propagation score 403.
[0058] Furthermore, the anomaly scores include code defect risk scores, performance anomaly scores, and log anomaly scores. Specifically, the code defect risk score represents the degree to which the code defect density deviates from its baseline and characterizes the risk level of the code defects. It is obtained by multiplying the deviation between the code defect density and its baseline by the weight corresponding to the code defect density; this weight can be obtained through supervised learning based on the impact of historical code defect density on historical failures. The performance anomaly score represents the degree to which a performance metric deviates from its normal range. It is corrected using a Gaussian correction method to suppress the impact of occasional fluctuations on the score, where the Gaussian correction value is determined based on the fluctuation range of the current performance metric, the historical fluctuation range of the performance metric, and a tolerance parameter. The log anomaly score represents the degree to which the frequency of errors in the log deviates from the historical baseline value. In its calculation formula, a represents the weight, b and c represent the frequency of errors in the current log and the baseline value of the log error frequency, respectively, and d represents the upper-limit cutoff value, whose purpose is to avoid interference from single-point extreme values by bounding the maximum value of the score.
[0059] Furthermore, the graph propagation score can be understood as the maximum probability that an upstream fault, predicted based on the fault knowledge graph, will cause the current system failure. For example, if the current system failure is due to an application programming interface (API) response timeout, there may be multiple reasons for the API response timeout. Among them, database table locking has the highest probability of causing the API response timeout. This probability value is the graph propagation score.
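The API timeout example can be sketched as a lookup over upstream causes. This is only an illustrative sketch: the dictionary layout, the node names, the probability values, and the function name `graph_propagation_score` are all assumptions made here, since the text does not specify how the fault knowledge graph is stored.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical slice of a fault knowledge graph: for each fault, the upstream
# faults that can cause it, each with an assumed propagation probability.
UPSTREAM: Dict[str, List[Tuple[str, float]]] = {
    "api_response_timeout": [
        ("database_table_lock", 0.6),
        ("network_congestion", 0.25),
        ("thread_pool_exhausted", 0.15),
    ],
}


def graph_propagation_score(
    upstream: Dict[str, List[Tuple[str, float]]], node: str
) -> Tuple[Optional[str], float]:
    """Return the most likely upstream cause of `node` and its probability.

    The probability of that cause is taken as the graph propagation score.
    A node with no recorded upstream cause yields (None, 0.0).
    """
    causes = upstream.get(node, [])
    if not causes:
        return None, 0.0
    return max(causes, key=lambda cause: cause[1])
```

With the assumed values above, `graph_propagation_score(UPSTREAM, "api_response_timeout")` selects database table locking, matching the example in the text.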
[0060] Furthermore, the normal range of performance indicators can be determined by unsupervised learning based on historical performance indicators, and the baseline value of log error frequency can be obtained by deep learning based on historical log information.
[0061] According to the embodiments of this application, the risk value of system failure is predicted from multiple dimensions. By combining the abnormal scores of multi-source heterogeneous operating data and the maximum probability that the system may fail, the predicted risk value is determined, which can improve the prediction accuracy.
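The text does not state how the anomaly score and the graph propagation score are fused. One simple possibility, assumed here purely for illustration, is a weighted average controlled by a hypothetical parameter `alpha`, with both scores taken to lie in the range 0 to 1:

```python
def fuse_risk(anomaly_score: float, propagation_score: float, alpha: float = 0.5) -> float:
    """Fuse two scores into a predicted risk value in [0, 1].

    Assumption: a convex combination; `alpha` is the weight given to the
    anomaly score and (1 - alpha) the weight of the graph propagation score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    fused = alpha * anomaly_score + (1.0 - alpha) * propagation_score
    # Clamp so that out-of-range inputs cannot push the result outside [0, 1].
    return min(max(fused, 0.0), 1.0)
```

Other fusion rules (for example, taking the maximum of the two scores, or a learned combination) would fit the description equally well; the weighted average is chosen only because it is easy to inspect.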
[0062] In some embodiments, obtaining anomaly scores for multi-source heterogeneous operation data based on historical multi-source heterogeneous operation data includes: determining the deviation between each multi-source heterogeneous operation data and its historical multi-source heterogeneous operation data; obtaining the weight of each multi-source heterogeneous operation data based on the degree of influence of each historical multi-source heterogeneous operation data on historical faults; weighting the deviations and weights of the multi-source heterogeneous operation data and fusing the weighted values to obtain anomaly scores.
[0063] According to the embodiments of this application, using historical multi-source heterogeneous operation data to measure the current multi-source heterogeneous operation data can quantify the abnormality of the current multi-source heterogeneous operation data, thereby improving the accuracy and reliability of the predicted risk value.
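The deviation-weight-fusion procedure can be sketched as follows. This is a sketch under stated assumptions: the relative-deviation formula, the cap of 1.0 on each deviation, the weight normalization, and all metric names and values are hypothetical choices made here, and the weights are simply given rather than learned from historical faults.

```python
from typing import Dict


def anomaly_score(
    current: Dict[str, float],
    baseline: Dict[str, float],
    weights: Dict[str, float],
    cap: float = 1.0,
) -> float:
    """Weighted fusion of per-metric deviations from their historical baseline."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for name, weight in weights.items():
        cur, base = current[name], baseline[name]
        if base != 0:
            deviation = abs(cur - base) / abs(base)  # assumed: relative deviation
        else:
            deviation = 0.0 if cur == 0 else cap
        # Cap each deviation so a single extreme metric cannot dominate.
        score += (weight / total_weight) * min(deviation, cap)
    return score
```

For example, with hypothetical baselines `{"defect_density": 1.0, "memory_mb": 1000.0, "error_rate": 0.02}`, current values `{"defect_density": 1.5, "memory_mb": 1200.0, "error_rate": 0.04}`, and weights `{"defect_density": 0.5, "memory_mb": 0.3, "error_rate": 0.2}`, the per-metric deviations are 0.5, 0.2, and 1.0, giving a score of about 0.51.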
[0064] Figure 5 schematically illustrates a method for obtaining fault links according to an embodiment of this application.
[0065] In some embodiments, obtaining fault links from a fault knowledge graph based on multi-source heterogeneous operational data includes: mapping the multi-source heterogeneous operational data to entities in the fault knowledge graph and marking abnormal nodes; traversing the fault knowledge graph starting from the abnormal nodes to obtain candidate fault links; scoring the candidate fault links based on their impact range, the duration of the fault in the candidate fault links, and the recovery time, and sorting them from high to low scores; and selecting the candidate fault link corresponding to the highest score as the fault link.
[0066] As shown in Figure 5, the method for obtaining the fault link includes steps S510 to S540.
[0067] In step S510, the multi-source heterogeneous operational data is mapped to entities in the fault knowledge graph, and abnormal nodes are marked.
[0068] Specifically, after mapping multi-source heterogeneous operational data to entities in a fault knowledge graph, abnormal nodes are determined from the mapped entities based on the fault nodes in the fault knowledge graph.
[0069] In step S520, starting from the abnormal node, the fault knowledge graph is traversed to obtain candidate fault links.
[0070] In step S530, the candidate fault links are scored based on their impact range, the duration of the fault in the candidate fault link, and the recovery time, and then sorted from high to low scores.
[0071] In step S540, the highest-scoring candidate fault link is selected as the fault link.
[0072] According to the embodiments of this application, the candidate fault links are scored based on their impact range, the duration of the fault in the candidate fault link, and the recovery time. The link with the highest score is selected as the fault link. This can accurately pinpoint the fault propagation link that has the greatest impact on the core business of the system, avoid indiscriminate fault injection, make chaos testing more targeted, reduce the cost of ineffective testing, and improve testing efficiency and the validity of results.
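Steps S520 to S540 can be sketched as a path enumeration followed by scoring. This is an illustrative sketch only: the graph layout, the `FaultStats` fields and their units, the linear scoring weights, and every node name and number are assumptions introduced here, and the abnormal nodes marked in step S510 are taken as an input.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FaultStats:
    impact_range: int   # e.g. number of affected services (hypothetical unit)
    duration_s: float   # how long the fault lasted, in seconds
    recovery_s: float   # how long recovery took, in seconds


def candidate_links(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """S520: enumerate simple paths from `start` to nodes with no successors."""
    paths, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        successors = [n for n in graph.get(node, []) if n not in path]
        if not successors:
            paths.append(path)
        for nxt in successors:
            stack.append((nxt, path + [nxt]))
    return paths


def link_score(path: List[str], stats: Dict[str, FaultStats],
               w_impact: float = 1.0, w_duration: float = 0.01,
               w_recovery: float = 0.01) -> float:
    """S530: assumed linear score over the faults on the link."""
    return sum(
        w_impact * stats[n].impact_range
        + w_duration * stats[n].duration_s
        + w_recovery * stats[n].recovery_s
        for n in path
    )


def select_fault_link(graph: Dict[str, List[str]], stats: Dict[str, FaultStats],
                      abnormal_nodes: List[str]) -> List[str]:
    """S540: rank all candidate links and return the highest-scoring one."""
    candidates = [p for a in abnormal_nodes for p in candidate_links(graph, a)]
    ranked = sorted(candidates, key=lambda p: link_score(p, stats), reverse=True)
    return ranked[0] if ranked else []
```

The `n not in path` check keeps the traversal from looping if the causal graph contains a cycle.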
[0073] Figure 6 schematically illustrates a fault injection strategy determination method according to an embodiment of this application.
[0074] In some embodiments, a fault injection strategy is determined based on the relationship between the predicted risk value and a preset risk threshold, including: if the predicted risk value is greater than a first preset risk threshold, then a combined fault injection strategy is executed; if the predicted risk value is less than the first preset risk threshold but greater than a second preset risk threshold, then a single fault injection strategy is executed; if the predicted risk value is less than the second preset risk threshold, then no fault injection operation is performed and the operating status of the system is monitored.
[0075] As shown in Figure 6, in operation S610, it is determined whether the predicted risk value 601 is greater than the first preset risk threshold. If so, operation S620 is executed to execute the combined fault injection strategy. If not, operation S630 is executed to determine whether the predicted risk value 601 is greater than the second preset risk threshold. If so, operation S640 is executed to execute the single fault injection strategy. If not, operation S650 is executed: no fault injection operation is performed, and the operating status of the system is monitored.
[0076] Furthermore, the combined fault in the combined fault injection strategy can be network latency together with a data read / write interaction failure between the system and the disk storage device (the corresponding fault type is CPU overload); the single fault in the single fault injection strategy can be an application programming interface error (the corresponding fault type is node crash) or a message queue backlog (the corresponding fault type is network partition).
[0077] According to the embodiments of this application, when the predicted risk value is large, a combined fault injection method is used; when the predicted risk value is moderate, a single fault injection method is used; and when the predicted risk value is small, the probability of the system failing is small, so no fault is injected for the time being and the system status is continuously monitored. Using different fault injection strategies for different predicted risk values allows faults to be injected in a targeted manner and improves testing efficiency.
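The two-threshold decision can be sketched as a small function. The names `Strategy` and `choose_strategy` and the threshold values in the usage note are hypothetical. The text does not say what happens when the predicted risk value exactly equals a threshold; the sketch assumes that an equal value falls into the lower band.

```python
from enum import Enum


class Strategy(Enum):
    COMBINED = "combined"          # inject several faults together
    SINGLE = "single"              # inject one fault
    MONITOR_ONLY = "monitor_only"  # inject nothing, keep monitoring


def choose_strategy(risk: float, first_threshold: float, second_threshold: float) -> Strategy:
    """Map a predicted risk value to a fault injection strategy."""
    if second_threshold > first_threshold:
        raise ValueError("first_threshold must not be below second_threshold")
    if risk > first_threshold:
        return Strategy.COMBINED
    if risk > second_threshold:
        return Strategy.SINGLE
    # Assumption: a risk equal to a threshold is treated as the lower band.
    return Strategy.MONITOR_ONLY
```

With assumed thresholds of 0.8 and 0.5, a risk value of 0.9 selects the combined strategy, 0.6 selects the single strategy, and 0.3 leads to monitoring only.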
[0078] In some embodiments, determining the target fault from the fault link based on the fault injection strategy includes: determining key fault nodes from the fault link based on the fault knowledge graph, obtaining the propagation probability of the key fault nodes and sorting them from high to low according to the probability value, where the propagation probability represents the probability that the fault is generated by the key fault node; if the combined fault injection strategy is executed, the top N key fault nodes are taken as the target faults, where N is an integer greater than 1; if the single fault injection strategy is executed, the key fault node ranked first is taken as the target fault.
[0079] According to the embodiments of this application, combined fault injection can simulate the chain propagation effect of fault links and accurately verify the system's fault tolerance and shock resistance to complex faults; single fault injection can focus on the root cause of the fault, quickly locate the most critical fault point, and has a lower testing cost, enabling rapid verification of the system's basic stability.
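The ranking and selection step can be illustrated with a short sketch. It assumes the propagation probabilities of the key fault nodes have already been obtained from the fault knowledge graph and are supplied as a plain dictionary; the function name `select_target_faults` and the node names in the usage note are hypothetical.

```python
from typing import Dict, List


def select_target_faults(propagation: Dict[str, float], strategy: str,
                         n: int = 2) -> List[str]:
    """Pick target faults from key fault nodes by propagation probability.

    propagation maps each key fault node to its propagation probability.
    strategy is "combined" (top N nodes, N > 1) or "single" (top node only).
    """
    if strategy not in ("combined", "single"):
        raise ValueError("strategy must be 'combined' or 'single'")
    if strategy == "combined" and n <= 1:
        raise ValueError("N must be an integer greater than 1")
    # Sort key fault nodes from highest to lowest propagation probability.
    ranked = sorted(propagation, key=lambda node: propagation[node],
                    reverse=True)
    return ranked[:n] if strategy == "combined" else ranked[:1]
```

With probabilities such as `{"db_timeout": 0.7, "api_error": 0.9, "queue_backlog": 0.4}`, the combined strategy with N = 2 would return `api_error` and `db_timeout`, while the single strategy would return only `api_error`.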
[0080] Figure 7 shows a schematic diagram of a fault knowledge graph updating method according to an embodiment of this application.
[0081] In some embodiments, the method further includes: if the system availability index after the chaos test exceeds its corresponding benchmark value, and the excess part is greater than a preset threshold, then the target fault and its fault link are stored in the fault knowledge graph to update the fault knowledge graph, wherein the system availability index includes fault recovery time, fault-tolerant switching success rate and service completion rate; and the risk prediction model is optimized based on the updated fault knowledge graph.
[0082] As shown in Figure 7, operation S710 is executed to determine whether the system availability index 701 exceeds its corresponding benchmark value. If so, operation S720 is executed to further determine whether the excess portion of the system availability index 701 is greater than a preset threshold. If so, operation S730 is executed to store the target fault and its fault link in the fault knowledge graph and update the fault knowledge graph. Then, operation S740 is executed to optimize the risk prediction model based on the updated fault knowledge graph. If the system availability index 701 does not exceed its corresponding benchmark value, or its excess portion does not exceed the preset threshold, operation S750 is executed and the fault knowledge graph is not updated.
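The decision flow of operations S710 to S750 can be sketched as below. This is an illustration under stated assumptions: the fault knowledge graph is reduced to a dictionary from a fault to its fault link, only a single availability index for which a larger value is worse (such as fault recovery time in seconds) is checked, and the function name `maybe_update_graph` is hypothetical. Re-optimizing the risk prediction model (S740) is left to the caller.

```python
from typing import Dict, List


def maybe_update_graph(graph: Dict[str, List[str]], target_fault: str,
                       fault_link: List[str], index_value: float,
                       benchmark: float, preset_threshold: float) -> bool:
    """Store the target fault and its link if the index exceeds its benchmark
    by more than the preset threshold. Returns True when the graph changed."""
    excess = index_value - benchmark
    # S710 / S720: the index must exceed the benchmark, and the excess
    # portion must be greater than the preset threshold.
    if excess <= 0 or excess <= preset_threshold:
        return False  # S750: leave the graph unchanged
    graph[target_fault] = list(fault_link)  # S730: store fault and link
    return True  # the caller would then re-optimize the model (S740)
```

For example, with a hypothetical recovery time benchmark of 30 seconds and a preset threshold of 10 seconds, a measured recovery time of 45 seconds would trigger an update, while 35 seconds would not.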
[0083] For example, if the injected fault is CPU resource exhaustion, the system repairs itself as follows: a circuit breaker is automatically triggered to terminate the high-CPU-consuming process, and the service layer is simultaneously downgraded to switch to the backup database. After the CPU resources return to normal, the node automatically restores its service capabilities. During the repair process, after the circuit breaker is triggered, the system monitors whether the CPU resources are still overloaded. If not, system availability indicators such as system recovery time are collected. If the system availability indicators exceed their corresponding benchmark values, this indicates that a new fault has occurred. The new fault is added to the fault knowledge graph to update the fault knowledge graph, and the weights of the risk prediction model are optimized based on the updated fault knowledge graph. Further, the optimized weights of the risk prediction model are calculated as follows:
[0084]
[0085] In the formula, the terms represent, respectively, the weights of the risk prediction model before optimization, the current system recovery time, and the recovery time baseline value.
[0086] According to an embodiment of this application, if the system availability index after the chaos test exceeds its corresponding benchmark value and the excess is greater than a preset threshold, it indicates that the system has generated a new fault. The new fault needs to be added to the fault knowledge graph to optimize the risk prediction model. The fault knowledge graph can be continuously updated with the test, and the risk prediction model can also be continuously optimized accordingly. This method can improve the accuracy of model prediction. This feedback mechanism is conducive to quickly locating faults and improving the accuracy of chaos testing.
[0087] Based on the above-described chaos testing method, embodiments of this application also provide a chaos testing apparatus. The apparatus is described in detail below with reference to Figure 8.
[0088] Figure 8 shows a schematic block diagram of a chaos testing apparatus according to an embodiment of this application.
[0089] As shown in Figure 8, the chaos testing device 800 of this embodiment includes a fault prediction module 810, a fault injection strategy determination module 820, a fault link acquisition module 830, a target fault determination module 840, and a fault injection module 850.
[0090] The fault prediction module 810 is used to input the current multi-source heterogeneous operating data of the system into the risk prediction model, analyze the multi-source heterogeneous operating data using the risk prediction model, and obtain the predicted risk value. The multi-source heterogeneous operating data includes code defect information, performance indicators, and system log information. The risk prediction model is used to predict faults that will occur in the system in the future. The risk prediction model is trained based on a fault knowledge graph and historical multi-source heterogeneous operating data. In one embodiment, the fault prediction module 810 can be used to execute step S210 described above, which will not be repeated here.
[0091] The fault injection strategy determination module 820 is used to determine a fault injection strategy based on the relationship between the predicted risk value and a preset risk threshold. In one embodiment, the fault injection strategy determination module 820 can be used to execute step S220 described above, which will not be repeated here.
[0092] The fault link acquisition module 830 is used to acquire fault links from a fault knowledge graph based on multi-source heterogeneous operational data. Entities in the fault knowledge graph include historical multi-source heterogeneous operational data, which includes historical code defects, historical performance indicators, and historical system log information. The relationships between entities represent the causal associations of historical faults. In one embodiment, the fault link acquisition module 830 can be used to execute step S230 described above, which will not be repeated here.
[0093] The target fault determination module 840 is used to determine the target fault from the faulty link based on the fault injection strategy. In one embodiment, the target fault determination module 840 can be used to perform step S240 described above, which will not be repeated here.
[0094] The fault injection module 850 is used to inject a target fault into the system for chaos testing. In one embodiment, the fault injection module 850 can be used to execute step S250 described above, which will not be repeated here.
[0095] According to the embodiments of this application, the device 800 can predict the probability of a system failure and determine a fault injection strategy based on the probability. Then, based on multi-source heterogeneous operating data, a fault link is determined from the fault knowledge graph. The fault injection strategy is used to further determine the target fault to be injected from the fault link. By determining the specific target fault step by step, the system failure can be accurately predicted, the accuracy of the target fault can be improved, and the accuracy of the test can be improved.
[0096] In some embodiments, the fault prediction module 810 is specifically used to: perform time-series analysis on multi-source heterogeneous operation data, and obtain anomaly scores of multi-source heterogeneous operation data based on historical multi-source heterogeneous operation data; trace back the fault knowledge graph from the current node to determine the graph propagation score, the graph propagation score representing the probability of the occurrence of the historical fault that has the greatest impact on the current node; and fuse the anomaly score and the graph propagation score to obtain the predicted risk value.
[0097] In some embodiments, the fault prediction module 810 is further configured to: determine the deviation between each multi-source heterogeneous operating data and its historical multi-source heterogeneous operating data; obtain the weight of each multi-source heterogeneous operating data based on the degree of influence of each historical multi-source heterogeneous operating data on historical faults; weight the deviation of the multi-source heterogeneous operating data and its weight, and fuse the weighted values to obtain an anomaly score.
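The weighted-deviation anomaly score and its fusion with the graph propagation score can be illustrated as follows. The text does not specify how deviations are measured or how the two scores are fused, so everything concrete here is an assumption: deviation is taken as the relative absolute difference from a historical baseline, the weighted values are merged by a weighted average, and fusion is a convex combination controlled by a hypothetical parameter `alpha`. The function and metric names are also hypothetical.

```python
from typing import Dict


def anomaly_score(current: Dict[str, float], baseline: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted average of per-metric deviations from the historical baseline.

    Assumption: deviation is |current - baseline| / |baseline|, falling back
    to |current| when the baseline is zero.
    """
    total_weight = sum(weights[name] for name in current)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = 0.0
    for name, value in current.items():
        base = baseline[name]
        deviation = abs(value - base) / abs(base) if base != 0 else abs(value)
        weighted += weights[name] * deviation
    return weighted / total_weight


def fuse(anomaly: float, graph_propagation: float, alpha: float = 0.5) -> float:
    """Assumed fusion rule: convex combination of the two scores."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * anomaly + (1.0 - alpha) * graph_propagation
```

In this sketch, a metric whose historical data had more influence on past faults receives a larger weight, so the same relative deviation contributes more to the anomaly score.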
[0098] In some embodiments, the fault link acquisition module 830 is specifically used to: map multi-source heterogeneous operating data into entities in a fault knowledge graph and mark abnormal nodes; traverse the fault knowledge graph starting from the abnormal nodes to obtain candidate fault links; score the candidate fault links based on their influence range, duration of faults in the candidate fault links, and recovery time, and sort them from high to low scores; and select the candidate fault link corresponding to the highest score as the fault link.
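The traversal and scoring of candidate fault links can be sketched as follows. The fault knowledge graph is modeled here as a directed adjacency list, candidate links are the simple paths starting at an abnormal node, and each link is scored as a weighted sum of its nodes' impact range, fault duration, and recovery time. The scoring formula, the weight tuple `w`, and all names are assumptions made for illustration; the text only names the three scoring factors.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]
# Per-node statistics: (impact_range, fault_duration, recovery_time)
Stats = Dict[str, Tuple[float, float, float]]


def candidate_links(graph: Graph, start: str) -> List[List[str]]:
    """Enumerate simple paths from an abnormal node until no unvisited
    successor remains (iterative depth-first traversal)."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        successors = [n for n in graph.get(path[-1], []) if n not in path]
        if not successors:
            paths.append(path)
        for nxt in successors:
            stack.append(path + [nxt])
    return paths


def best_link(graph: Graph, start: str, stats: Stats,
              w: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> List[str]:
    """Score every candidate link and return the highest-scoring one."""
    def score(path: List[str]) -> float:
        return sum(w[0] * stats[n][0] + w[1] * stats[n][1] + w[2] * stats[n][2]
                   for n in path)

    ranked = sorted(candidate_links(graph, start), key=score, reverse=True)
    return ranked[0]
```

Excluding nodes already on the current path keeps the traversal finite even when the graph contains cycles.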
[0099] In some embodiments, the fault injection strategy determination module 820 is specifically used to: execute a combined fault injection strategy if the predicted risk value is greater than a first preset risk threshold; execute a single fault injection strategy if the predicted risk value is less than the first preset risk threshold but greater than a second preset risk threshold; and not perform a fault injection operation and monitor the system's operating status if the predicted risk value is less than the second preset risk threshold.
[0100] In some embodiments, the target fault determination module 840 is specifically used to: determine key fault nodes from the fault chain based on the fault knowledge graph, obtain the propagation probability of the key fault nodes and sort them from high to low according to the probability value, the propagation probability representing the probability that the fault is generated by the key fault node; if a combined fault injection strategy is executed, the first N key fault nodes are taken as target faults, where N is an integer greater than 1; if a single fault injection is executed, the first-ranked key fault node is taken as the target fault.
[0101] In some embodiments, the device 800 is further configured to: if the system availability index after the chaos test exceeds its corresponding benchmark value, and the excess part is greater than a preset threshold, store the target fault and its fault link in the fault knowledge graph to update the fault knowledge graph, wherein the system availability index includes fault recovery time, fault-tolerant switching success rate and service completion rate; and optimize the risk prediction model based on the updated fault knowledge graph.
[0102] According to embodiments of this application, any multiple modules among the fault prediction module 810, fault injection strategy determination module 820, fault link acquisition module 830, target fault determination module 840, and fault injection module 850 can be combined into one module, or any one of these modules can be split into multiple modules. Alternatively, at least some of the functions of one or more of these modules can be combined with at least some of the functions of other modules and implemented in one module. According to embodiments of this application, at least one of the fault prediction module 810, fault injection strategy determination module 820, fault link acquisition module 830, target fault determination module 840, and fault injection module 850 can be at least partially implemented as hardware circuits, such as field-programmable gate arrays, programmable logic arrays, systems-on-a-chip, systems-on-a-substrate, systems-on-package, application-specific integrated circuits, or other reasonable means of integrating or packaging circuits, or implemented in software, hardware, and firmware, or in any suitable combination of any of these three implementation methods. Alternatively, at least one of the fault prediction module 810, fault injection strategy determination module 820, fault link acquisition module 830, target fault determination module 840, and fault injection module 850 can be at least partially implemented as a computer program module, which can perform corresponding functions when the computer program module is run.
[0103] Figure 9 schematically illustrates a block diagram of an electronic device suitable for implementing a chaos testing method according to an embodiment of this application.
[0104] As shown in Figure 9, an electronic device 900 according to an embodiment of this application includes a processor 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory 902 or a program loaded from a storage portion 908 into a random access memory 903. The processor 901 may include, for example, a general-purpose microprocessor, an instruction set processor and / or an associated chipset and / or a dedicated microprocessor. The processor 901 may also include onboard memory for caching purposes. The processor 901 may include a single processing unit or multiple processing units for executing different steps of the method flow according to an embodiment of this application.
[0105] Random access memory 903 stores various programs and data required for the operation of electronic device 900. Processor 901, read-only memory 902, and random access memory 903 are interconnected via bus 904. Processor 901 executes various steps of the method flow according to embodiments of this application by executing programs stored in read-only memory 902 and / or random access memory 903. It should be noted that the programs may also be stored in one or more memories other than read-only memory 902 and random access memory 903. Processor 901 may also execute various steps of the method flow according to embodiments of this application by executing programs stored in said one or more memories.
[0106] According to embodiments of this application, the electronic device 900 may further include an input / output interface 905, which is also connected to a bus 904. The electronic device 900 may also include one or more of the following components connected to the input / output interface 905: an input section 906 including a keyboard, mouse, etc.; an output section 907 including a cathode ray tube, liquid crystal display, etc., and a speaker, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card, such as a local area network card, modem, etc. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the input / output interface 905 as needed. A removable medium 911, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 910 as needed so that computer programs read from it can be installed into the storage section 908 as needed.
[0107] Embodiments of this application also provide a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs, which, when executed, implement the method according to the embodiments of this application.
[0108] According to embodiments of this application, the computer-readable storage medium can be a non-volatile computer-readable storage medium, such as including but not limited to: portable computer disks, hard disks, random access memory, read-only memory, erasable programmable read-only memory, portable compact disk read-only memory, optical storage devices, magnetic storage devices, or any suitable combination thereof. In embodiments of this application, the computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this application, the computer-readable storage medium may include the read-only memory 902 described above, and / or random access memory 903, and / or one or more memories other than read-only memory 902 and random access memory 903.
[0109] Embodiments of this application also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code is used to cause the computer system to implement the methods provided in the embodiments of this application.
[0110] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and downloaded and installed via the communication section 909, and / or installed from a removable medium 911. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.
[0111] In embodiments of this application, the computer program can be downloaded and installed from a network via communication section 909, and / or installed from removable medium 911. When the computer program is executed by processor 901, it performs the functions defined in the system of embodiments of this application. According to embodiments of this application, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0112] According to embodiments of this application, program code for executing the computer programs provided in the embodiments of this application can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. The program code can be executed entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0113] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0114] Those skilled in the art will understand that the features described in the various embodiments of this application can be combined and / or integrated in various ways, even if such combinations or integrations are not explicitly described in this application. In particular, the features described in the various embodiments of this application can be combined and / or integrated in various ways without departing from the spirit and teachings of this application. All such combinations and / or integrations fall within the scope of this application.
Claims
1. A chaos testing method, characterized in that, The method includes: The system's current multi-source heterogeneous operating data is input into the risk prediction model. The risk prediction model is then used to analyze the multi-source heterogeneous operating data and obtain the predicted risk value. The multi-source heterogeneous operating data includes code defect information, performance indicators, and system log information. The risk prediction model is used to predict the failures that will occur in the system in the future. The risk prediction model is trained based on a fault knowledge graph and historical multi-source heterogeneous operating data. Based on the relationship between the predicted risk value and the preset risk threshold, a fault injection strategy is determined; Based on the multi-source heterogeneous operation data, fault links are obtained from the fault knowledge graph. The entities in the fault knowledge graph include the historical multi-source heterogeneous operation data, which includes historical code defects, historical performance indicators, and historical system log information. The relationships between the entities represent the causal associations of historical faults. Based on the fault injection strategy, the target fault is determined from the fault link; The target fault is injected into the system to perform a chaos test.
2. The method according to claim 1, characterized in that, The step of analyzing the multi-source heterogeneous operating data using the risk prediction model and obtaining the predicted risk value includes: Time series analysis is performed on the multi-source heterogeneous operation data, and anomaly scores are obtained based on the historical multi-source heterogeneous operation data. The fault knowledge graph is traced back from the current node to determine the graph propagation score, which represents the probability of the occurrence of the historical fault that has the greatest impact on the current node among the historical faults. The predicted risk value is obtained by fusing the anomaly score and the graph propagation score.
3. The method according to claim 2, characterized in that, The step of obtaining anomaly scores for the multi-source heterogeneous operation data based on the historical multi-source heterogeneous operation data includes: Determine the deviation between each multi-source heterogeneous operation data point and its historical multi-source heterogeneous operation data; Based on the degree of influence of each historical multi-source heterogeneous operation data on the historical fault, the weights of each multi-source heterogeneous operation data are obtained respectively. The deviations and weights of the multi-source heterogeneous operating data are weighted and the weighted values are merged to obtain the anomaly score.
4. The method according to claim 1, characterized in that, The step of obtaining fault links from the fault knowledge graph based on the multi-source heterogeneous operational data includes: The multi-source heterogeneous operational data is mapped to the entities in the fault knowledge graph, and abnormal nodes are marked. Starting from the abnormal node, traverse the fault knowledge graph to obtain candidate fault links; Based on the impact range of the candidate fault links, the duration of the fault in the candidate fault links, and the recovery time, the candidate fault links are scored and sorted from high to low according to the scores; The candidate faulty link corresponding to the score with the highest ranking is selected as the faulty link.
5. The method according to claim 1, characterized in that, The step of determining the fault injection strategy based on the relationship between the predicted risk value and the preset risk threshold includes: If the predicted risk value is greater than the first preset risk threshold, then the combined fault injection strategy is executed; If the predicted risk value is less than the first preset risk threshold and greater than the second preset risk threshold, then a single fault injection strategy is executed. If the predicted risk value is less than the second preset risk threshold, then no fault injection operation will be performed and the operating status of the system will be monitored.
6. The method according to claim 5, characterized in that, The step of determining the target fault from the fault link based on the fault injection strategy includes: Based on the fault knowledge graph, key fault nodes are identified from the fault link, the propagation probability of the key fault nodes is obtained and sorted from high to low according to the probability value, and the propagation probability represents the probability that the fault is caused by the key fault node. If the combined fault injection strategy is executed, the first N key fault nodes will be used as the target faults, where N is an integer greater than 1. If the single fault injection strategy is executed, the key fault node ranked first will be used as the target fault.
7. The method according to claim 1, characterized in that, The method further includes: If the system availability index after the chaos test exceeds its corresponding benchmark value, and the excess is greater than a preset threshold, then the target fault and its fault link are stored in the fault knowledge graph to update the fault knowledge graph. The system availability index includes fault recovery time, fault-tolerant switching success rate, and service completion rate. The risk prediction model is optimized based on the updated fault knowledge graph.
8. A chaos testing device, characterized in that, The device includes: The fault prediction module is used to input the multi-source heterogeneous operating data of the system at the current moment into the risk prediction model, and use the risk prediction model to analyze the multi-source heterogeneous operating data and obtain the predicted risk value. The multi-source heterogeneous operating data includes code defect information, performance indicators and system log information. The risk prediction model is used to predict the faults that will occur in the system at future moments. The risk prediction model is trained based on the fault knowledge graph and historical multi-source heterogeneous operating data. The fault injection strategy determination module is used to determine the fault injection strategy based on the relationship between the predicted risk value and the preset risk threshold. The fault link acquisition module is used to acquire fault links from the fault knowledge graph based on the multi-source heterogeneous operation data. The entities in the fault knowledge graph include the historical multi-source heterogeneous operation data, which includes historical code defects, historical performance indicators, and historical system log information. The relationships between the entities represent the causal associations of historical faults. The target fault determination module is used to determine the target fault from the fault link based on the fault injection strategy; The fault injection module is used to inject the target fault into the system for chaos testing.
9. An electronic device, comprising: One or more processors; A memory, used to store one or more computer programs, characterized in that, The one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program or instructions stored thereon, characterized in that, When the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 7.
11. A computer program product, comprising a computer program or instructions, characterized in that, When the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 7.