Causal DAG discovery method with fused soft priors for online service systems
By combining observational data with textual meta-knowledge in cloud-native/online service systems, and optimizing the causal DAG discovery method using large language models and conditional independence tests, the problems of instability and interpretability in causal structure identification are solved, achieving efficient root cause analysis and fault location.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for causal DAG discovery in root cause analysis of cloud-native/online service systems suffer from unstable causal structure identification, difficulty in utilizing textual meta-knowledge, lack of reliability calibration and engineering interpretability, and are unable to meet the needs of industrial applications.
By acquiring observational data and textual meta-knowledge, natural language descriptions of variables are generated. A large language model is used to obtain causal probability vectors. By combining conditional independence tests and linear regression, a mixed scoring function is optimized, a causal directed acyclic graph is constructed, and a counterfactual self-consistency penalty is introduced to achieve causal DAG discovery.
It improves the accuracy and interpretability of causal DAG discovery, can run in ordinary processor environments, adapts to root cause analysis and fault location in cloud-native/online service systems, and enhances the efficiency and accuracy of intelligent operation and maintenance systems.
Figure CN122242707A_ABST
Description
Technical Field
[0001] This application relates to the field of intelligent operation and maintenance technology, and in particular to a causal DAG discovery method with integrated soft priors for online service systems.
Background Technology
[0002] At the intersection of statistical causal discovery and machine learning, structure learning of causal directed acyclic graphs (DAGs) is a core technology for uncovering causal relationships between variables. It is widely used in social sciences, medical follow-up, industrial process control, and user behavior analysis. However, root cause analysis (RCA) and fault location in intelligent operation and maintenance (AIOps) of cloud-native / online service systems, as core industrial applications, place higher demands on the accuracy, stability, and engineering deployability of causal DAG discovery. Traditional causal DAG structure learning methods fall into two main categories: constraint-based and score-based. Constraint-based methods infer the graph skeleton and edge directions through a large number of conditional independence tests. When samples are sufficient and the tests are reliable, they have theoretical guarantees. In AIOps scenarios, however, fault samples are scarce, monitoring data is noisy, and frequent system changes cause distribution drift, so the error of a single test is easily amplified as it cascades through later decisions, resulting in unstable causal structure identification. Score-based methods transform structure learning into a score maximization problem over the DAG space. They make full use of data fitting information, but they struggle to distinguish edge directions within a Markov equivalence class. In AIOps scenarios with complex microservice dependencies, they cannot accurately determine the causal direction between indicators and services, which directly affects the accuracy of root cause localization.
[0003] With the rapid development of large language models (LLMs), related technologies have attempted to incorporate LLMs as a source of causal knowledge into causal DAG discovery, aiming to integrate textual meta-knowledge to compensate for the lack of observational data. This approach is a natural fit for AIOps scenarios: cloud-native systems possess a wealth of textual knowledge, such as architecture documents, runbooks, and fault debriefings, which can serve as knowledge input for an LLM. Existing fusion solutions mainly fall into three categories: first, allowing the LLM to directly generate complete causal graphs or edge lists; second, using the edge existence and direction scores output by the LLM as priors and superimposing them onto traditional scoring algorithms; and third, adopting a pipeline structure that first obtains candidate causal structures through statistical methods and then uses the LLM for rule-based or local modifications. Some solutions also construct causal modeling agent frameworks, combining LLMs with deep structural causal models for graph space exploration. However, when these methods are applied to AIOps root cause analysis scenarios, they reveal significant engineering and modeling deficiencies and fail to meet the needs of industrial applications.
[0004] Existing technical solutions combining LLMs with statistical causal discovery suffer from three major shortcomings when adapted to core industrial scenarios such as AIOps. First, the modules are loosely coupled through prompts and lack a single, explicit, and analyzable objective function. This makes it difficult to deeply integrate the fitting information from observed data, statistical conditional independence constraints, and textual causal knowledge from the LLM, and hinders the simultaneous use of well-structured monitoring metrics and rich textual meta-knowledge in AIOps scenarios. Second, there is no formal probabilistic modeling or reliability calibration mechanism for LLM outputs, so LLM outputs are used directly as black-box information. AIOps scenarios offer gold standard edges, such as service dependency topologies and classic fault causal chains, that could be used for calibration, but existing methods do not calibrate the causal probabilities output by the LLM, which can easily introduce biased priors and lead to structural misjudgments. Third, there is no unified conflict resolution and trade-off mechanism between conditional independence information, LLM textual priors, and data likelihood, so the identifiability and consistency tendency of causal structures cannot be discussed theoretically. In AIOps root cause analysis, accidental correlations in the data may contradict the causal mechanism described by textual knowledge, resulting in discovered causal structures that lack engineering interpretability and are difficult to apply in practice. In addition, existing technologies do not consider closed-world scenarios in industrial settings where there is no numerical observation data, such as alarm event sequences and release pipeline processes in AIOps. Causal DAGs cannot be constructed for such scenarios, which further limits the industrial adaptability of the technology.
Summary of the Invention
[0005] Therefore, it is necessary to provide a causal DAG discovery method based on integrated soft priors for online service systems, which can improve the efficiency and accuracy of intelligent operation and maintenance systems and address the aforementioned technical problems.
[0006] A causal DAG discovery method based on fused soft priors for online service systems, the method comprising:
[0007] Acquire the observation dataset and textual meta-knowledge of the cloud-native / online service system, preprocess the observation dataset to obtain a standardized sample set, identify the type and semantic description of each variable in the standardized sample set, and output the standardized sample set and the variable type semantic information.
Generate natural language descriptions of the variables based on the variable type semantic information. Issue a unified-template query to the large language model for each ordered pair of variables. Parse the response of the large language model to obtain a probability vector over three types of causal situations. Calibrate and correct the probability vectors using the gold standard edge set or a calibration technique to obtain the edge-level prior probabilities.
Select a conditional independence test method that suits the variable type, perform conditional independence tests on the variable pairs and condition sets in the standardized sample set, divide the test results into a high-confidence conditionally independent statement set and a high-confidence conditionally related statement set according to a threshold on the test probability value, map the test probability value to the weight of each test result, and output the two statement sets and the test result weights.
Using any candidate directed acyclic graph as the initial structure, perform linear regression with each variable in the standardized sample set as the dependent variable and the set of parent nodes of that variable in the candidate directed acyclic graph as the independent variables. Estimate the structural equation parameters and noise variance, and calculate the data fitting score from them.
Calculate the language prior score based on the edge-level prior probabilities. At the same time, use the graph separation algorithm to check whether each test result in the high-confidence conditionally independent statement set and the high-confidence conditionally related statement set is satisfied, and calculate the conditional independence penalty term. If the counterfactual consistency constraint is to be introduced, calculate the counterfactual consistency penalty term by combining the structural equation parameters with the large language model's response to the intervention statement. Output the language prior score, conditional independence penalty term, and counterfactual consistency penalty term.
Integrate the data fitting score, language prior score, conditional independence penalty term, and counterfactual consistency penalty term into a hybrid scoring function. Define neighborhood operations on candidate directed acyclic graphs under the acyclicity constraint. Optimize the hybrid scoring function over the space of causal directed acyclic graphs through a discrete optimization strategy, and select the directed acyclic graph structure with the highest hybrid score as the output result.
[0008] The above causal DAG discovery method with fused soft priors for online service systems unifies the fitting information of the observed data, the statistical conditional independence constraints, and the edge-level soft prior knowledge of the large language model into a single hybrid scoring function. It also introduces an optional counterfactual self-consistency penalty. This addresses the structural instability and direction-identification difficulty of traditional causal DAG discovery methods under limited samples, as well as the shortcomings of existing LLM fusion schemes, such as loosely coupled modules, lack of a unified objective function, and lack of reliability calibration. Because it is optimized through a lightweight statistical module and a discrete search algorithm, it requires no graphics processing unit, can run on ordinary processors, and retains good interpretability and engineering deployability. In AIOps root cause analysis and fault location scenarios of cloud-native / online service systems, it suppresses structural misjudgments, uncovers causal relationships between indicators and services, provides causal evidence for root cause localization, and improves the efficiency and accuracy of intelligent operation and maintenance systems.
Attached Figure Description
[0009] Figure 1 is a flowchart of a causal DAG discovery method with fused soft priors for online service systems in one embodiment. Figure 2 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0010] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0011] In one embodiment, as shown in Figure 1, a causal DAG discovery method based on fused soft priors for online service systems is provided, including the following steps:
Step 102: Obtain the observation dataset and textual meta-knowledge of the cloud-native / online service system, preprocess the observation dataset to obtain a standardized sample set, identify the type and semantic description of each variable in the standardized sample set, and output the standardized sample set and variable type semantic information.
[0012] The observation dataset comprises observable service health metrics in cloud-native / online service systems, specifically service component-level metrics, dependency chain metrics, and critical event configurations. Service component-level metrics include p95 latency, QPS, error rate, CPU utilization, memory utilization, GC time, thread pool count, connection pool count, queue depth, and cache hit rate. Dependency chain metrics include RPC call latency, timeout rate, retry rate, and downstream return code distribution. Critical event configurations include deployment version, feature flag, rate limiting threshold, scaling actions, and alarm events. Textual meta-knowledge consists of textual information related to system causality that can be parsed by a large language model, specifically system architecture documents, runbooks, fault replay reports, alarm description texts, microservice dependency descriptions, fault tickets, change logs, and SOPs (Standard Operating Procedures). Preprocessing is a data processing operation that prepares the data for subsequent statistical analysis and model calculation. It includes one or more of the following: sampling by sliding time window, metric standardization, outlier correction, and discrete event encoding. The sliding time window can be set to units such as 1 minute or 5 minutes according to the system monitoring frequency. Preprocessing yields a standardized sample set with regular dimensions. At the same time, the type (continuous / discrete) and semantic description (meaning of the metric, service to which it belongs, unit, upstream and downstream relationships, etc.) of each variable in the sample set are identified and recorded, providing a foundation for the subsequent large language model queries and statistical tests.
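The preprocessing operations named above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the median-absolute-deviation clipping rule, and the one-hot event encoding are assumptions chosen for the example.

```python
from statistics import mean, median, pstdev

def sliding_window_means(series, window, step=None):
    """Aggregate a raw metric series into window means (step defaults to the window size)."""
    step = step or window
    return [mean(series[i:i + window])
            for i in range(0, len(series) - window + 1, step)]

def clip_outliers(values, k=3.0):
    """Clip values farther than k median absolute deviations from the median (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    lo, hi = med - k * mad, med + k * mad
    return [min(max(v, lo), hi) for v in values]

def standardize(values):
    """Z-score a metric so that different units become comparable."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def one_hot(events):
    """Encode a discrete event column as 0/1 indicator columns (assumed encoding)."""
    levels = sorted(set(events))
    return {lvl: [1 if e == lvl else 0 for e in events] for lvl in levels}

# Hypothetical p95 latency samples with one spike, aggregated in windows of two samples.
latency = [100, 102, 98, 101, 99, 5000, 100, 103]
clean = standardize(clip_outliers(sliding_window_means(latency, 2)))
```

Clipping before standardizing keeps a single spike from dominating the mean and standard deviation used for scaling.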
[0013] Step 104: Generate natural language descriptions of variables based on variable type semantic information. Initiate a query for a unified template to the large language model for each pair of ordered variables. Parse the response of the large language model to obtain probability vectors for three types of causal situations. Use the gold standard edge set or calibration technique to scale and correct the probability vectors to obtain edge-level prior probabilities.
[0014] The natural language description of a variable is a textual expression, generated by combining the variable type and semantic description, that a large language model can understand. It must clearly state the meaning of the variable, the system module it belongs to, and its relationships to other variables. For each ordered pair of variables $(u, v)$, the unified-template query asks a standardized question for the probabilities of three types of causal scenarios: no direct causal relationship; $u \to v$ ($u$ is the cause and $v$ is the effect); and $v \to u$ ($v$ is the cause and $u$ is the effect). The probability vector over the three scenarios is obtained by parsing the response from the large language model. The gold standard edge set is the set of edges with clearly defined causal directions in cloud-native / online service systems, specifically one or more of the following: static service dependency graph edges, distributed tracing topology edges, and classic fault causal chain edges confirmed by SRE. Calibration techniques are methods for scaling and correcting the original probability vectors output by the large language model. By using the gold standard edge set or calibration techniques to perform reliability calibration on the original probability vectors, the output bias of the large language model is corrected and the large language model is treated as a calibratable stochastic "soft expert". The result is the calibrated edge-level prior probability that the edge $u \to v$ exists, which turns the large language model output into a formal prior term.
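The query, parsing, and calibration steps can be sketched as below. Everything specific here is an assumption for illustration: the prompt wording, the JSON reply format, and the use of temperature scaling fitted on gold standard edges as the calibration technique. The actual call to a large language model is left out.

```python
import json
import math

# Hypothetical unified template; the patent does not disclose the exact wording.
TEMPLATE = (
    "Variable u: {u}\nVariable v: {v}\n"
    "Give probabilities for: (a) no direct causal relationship, "
    "(b) u causes v, (c) v causes u. "
    'Answer as JSON: {{"none": p, "u_to_v": p, "v_to_u": p}}'
)

def build_query(u_desc, v_desc):
    """Fill the template with the natural language descriptions of an ordered pair."""
    return TEMPLATE.format(u=u_desc, v=v_desc)

def parse_response(text):
    """Parse the JSON reply into a normalized (none, u->v, v->u) probability vector."""
    raw = json.loads(text)
    vec = [max(float(raw[k]), 1e-6) for k in ("none", "u_to_v", "v_to_u")]
    total = sum(vec)
    return [x / total for x in vec]

def temperature_scale(vec, T):
    """Sharpen (T < 1) or flatten (T > 1) a probability vector."""
    powered = [x ** (1.0 / T) for x in vec]
    total = sum(powered)
    return [x / total for x in powered]

def fit_temperature(gold, grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pick the T with the lowest negative log-likelihood on gold standard edges.
    gold is a list of (probability_vector, true_class_index) pairs."""
    def nll(T):
        return -sum(math.log(temperature_scale(vec, T)[k]) for vec, k in gold)
    return min(grid, key=nll)

def edge_prior(vec, T):
    """Calibrated prior probability that the edge u -> v exists."""
    return temperature_scale(vec, T)[1]
```

If the model is confidently wrong on the gold standard edges, the fitted temperature rises and the priors are flattened, so the language prior carries less weight against the data.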
[0015] Step 106: Select a conditional independence test method that suits the variable type, perform conditional independence tests on the variable pairs and condition sets in the standardized sample set, divide the test results into a high-confidence conditionally independent statement set and a high-confidence conditionally related statement set according to a threshold on the test probability value, map the test probability value to the weight of each test result, and output the two statement sets and the test result weights.
[0016] The appropriate conditional independence test method depends on whether the variables are continuous or discrete. For continuous indicators in the preprocessed standardized sample set, one or more of partial correlation tests and kernel conditional independence tests can be selected. In time-sensitive scenarios, lag feature expansion of the variables can be performed first, followed by conditional independence testing, so that the directed acyclic graph constraint is met. Conditional independence tests are then performed on all possible variable pairs and their corresponding condition sets in the standardized sample set to obtain the probability value (p-value) of each test. A threshold on the probability value is set: results whose probability values meet the independence criterion are classified into the high-confidence conditionally independent statement set, and results that meet the dependence criterion are classified into the high-confidence conditionally related statement set. A monotonic function $g(\cdot)$ converts the test probability value $p$ into the corresponding weight $w$, that is, $w = g(p)$; the higher the confidence of the test, the greater the weight. This weight measures the importance of each statement in the subsequent conditional independence penalty term. The two output sets and their weights provide the statistical constraints from which the conditional independence penalty term is built.
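As an illustration of this step for continuous variables, the sketch below runs a partial correlation test with a Fisher z-transform and sorts the result into one of the two statement sets. The thresholds (`alpha_indep`, `alpha_dep`) and the two monotonic weight mappings are assumptions; the patent only requires that more confident tests receive larger weights.

```python
import math
from statistics import mean

def corr(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(data, x, y, z):
    """Partial correlation of columns x and y given the columns in z (recursive formula)."""
    if not z:
        return corr(data[x], data[y])
    z0, rest = z[0], z[1:]
    rxy = partial_corr(data, x, y, rest)
    rxz = partial_corr(data, x, z0, rest)
    ryz = partial_corr(data, y, z0, rest)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def ci_test(data, x, y, z=()):
    """Fisher z-test; returns the two-sided p-value for 'x independent of y given z'."""
    n = len(data[x])
    r = max(min(partial_corr(data, x, y, list(z)), 0.999999), -0.999999)
    stat = math.sqrt(n - len(z) - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2))

def classify(p, alpha_indep=0.5, alpha_dep=0.01):
    """Assign a test to a statement set and attach a weight g(p); thresholds are assumed."""
    if p >= alpha_indep:
        return "independent", p        # larger p, stronger evidence of independence
    if p <= alpha_dep:
        return "related", 1.0 - p      # smaller p, stronger evidence of dependence
    return None, 0.0                   # ambiguous result, not used as a constraint
```

Leaving a gap between the two thresholds means that borderline tests produce no constraint at all, which limits the cascading of single-test errors described in the background section.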
[0017] Step 108: Using any candidate directed acyclic graph as the initial structure, perform linear regression with each variable in the standardized sample set as the dependent variable and the set of parent nodes of each variable in the candidate directed acyclic graph as the independent variable to estimate the structural equation parameters and noise variance, and calculate the data fitting score based on the structural equation parameters and noise variance.
[0018] Candidate directed acyclic graphs (DAGs) are candidate solutions for the causal graph structure. The initial structure can be an empty graph, a full graph, or a graph structure output by classical statistical methods, and must strictly satisfy the constraint of having no directed cycles. The structural equation model is a linear Gaussian structural equation model. Each variable $X_i$ in the candidate DAG is taken as the dependent variable, and the variables in its parent set $\mathrm{Pa}(X_i)$ are taken as the independent variables, to construct a linear regression equation. The least squares method gives analytical estimates of the regression coefficients $\beta_i$ and the noise variance $\sigma_i^2$ of the structural equation, where the noise term $\varepsilon_i$ follows a Gaussian distribution with mean 0 and variance $\sigma_i^2$. The data fit score is either the log-likelihood score of the linear Gaussian structural equation model or an information criterion score obtained by adding a complexity penalty. This score characterizes how well the candidate directed acyclic graph structure fits the standardized sample set: the better the fit, the higher the data fit score.
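A minimal sketch of the per-node regression and a BIC-style data fit score follows. It assumes the columns are already centered (as they would be after standardization), so no intercept is fitted; the helper names and the specific penalty of half of log n per parameter are choices made for this example.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_node(data, child, parents):
    """Least-squares coefficients and noise variance for one structural equation."""
    y = data[child]
    n = len(y)
    X = [data[p] for p in parents]
    beta = []
    if parents:
        A = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
        rhs = [sum(a * b for a, b in zip(xi, y)) for xi in X]
        beta = solve(A, rhs)                      # normal equations
    resid = [y[t] - sum(b * X[j][t] for j, b in enumerate(beta)) for t in range(n)]
    sigma2 = max(sum(r * r for r in resid) / n, 1e-12)
    return beta, sigma2

def data_fit_score(data, dag):
    """Gaussian log-likelihood minus a BIC-style complexity penalty.
    dag maps each variable to the list of its parents."""
    n = len(next(iter(data.values())))
    score = 0.0
    for child, parents in dag.items():
        _, sigma2 = fit_node(data, child, parents)
        loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
        score += loglik - 0.5 * math.log(n) * (len(parents) + 1)
    return score
```

Because the score is a sum over nodes, a local change to one node's parent set only requires refitting that node.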
[0019] Step 110: Calculate the language prior score based on the edge-level prior probability. Simultaneously, use the graph separation algorithm to check the constraint satisfaction of each test result in the high-confidence conditionally independent statement set and the high-confidence conditionally related statement set, calculate the conditional independence penalty term, and if counterfactual consistency constraints need to be introduced, calculate the counterfactual consistency penalty term by combining the structural equation model parameters and the large language model's response to the intervention statement. Output the language prior score, conditional independence penalty term, and counterfactual consistency penalty term.
[0020] The language prior score is obtained from the calibrated edge-level prior probabilities by summing, in Bayesian prior form, over all ordered variable pairs in the candidate directed acyclic graph. It characterizes the degree to which the soft prior knowledge of the large language model supports the candidate graph structure. The graph separation algorithm is the standard d-separation algorithm. For each test result in the high-confidence conditionally independent statement set and the high-confidence conditionally related statement set, it determines whether the corresponding variable pair in the candidate directed acyclic graph satisfies the separation constraint under the given condition set. The inconsistency between the candidate graph structure and the statistical test results is turned into a discrete conditional independence penalty term; the higher the degree of constraint violation, the larger the penalty term. The counterfactual self-consistency constraint is an optional cross-domain consistency constraint. If it is introduced, the sign of the local causal effect for each variable pair $(u, v)$ is first extracted from the structural equation parameters. An intervention-statement query for this variable pair is then sent to the large language model, and the large language model's distribution over effect directions for the intervention question is obtained by parsing the response. A distribution-difference function measures the inconsistency between the two, and the inconsistencies of all variable pairs are summed to obtain the counterfactual consistency penalty term, so that the direction of local causal effects in the data world stays consistent with the intervention narrative in the language world. The distribution-difference function can be one or more of a count of sign inconsistencies and a divergence measure.
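The d-separation check and the resulting penalty can be sketched as follows. The sketch uses the moralized ancestral graph criterion, which is one standard way to decide d-separation; representing each statement as a tuple `(x, y, z, weight)` and summing the weights of violated statements are assumptions made for this example.

```python
def ancestors(dag, nodes):
    """The given nodes plus all of their ancestors; dag maps node -> list of parents."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(dag[v])
    return seen

def d_separated(dag, x, y, z):
    """True if x and y are d-separated by the set z in the DAG."""
    keep = ancestors(dag, {x, y} | set(z))
    adj = {v: set() for v in keep}
    for child in keep:
        parents = list(dag[child])
        for p in parents:                       # keep parent-child links, undirected
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(parents):         # moralize: connect co-parents
            for q in parents[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    blocked, seen, stack = set(z), {x}, [x]
    while stack:                                # search for a path that avoids z
        v = stack.pop()
        for w in adj[v]:
            if w == y:
                return False
            if w not in seen and w not in blocked:
                seen.add(w)
                stack.append(w)
    return True

def ci_penalty(dag, independent, related):
    """Sum of the weights of the test statements that the candidate graph contradicts."""
    penalty = 0.0
    for x, y, z, w in independent:
        if not d_separated(dag, x, y, z):
            penalty += w
    for x, y, z, w in related:
        if d_separated(dag, x, y, z):
            penalty += w
    return penalty
```

For example, a chain x → z → y satisfies both "x independent of y given z" and "x related to y", while a collider x → z ← y violates both and is penalized by the sum of their weights.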
[0021] Step 112: Integrate the data fitting score, language prior score, conditional independence penalty term, and counterfactual consistency penalty term into a hybrid scoring function. Define the neighborhood operations of the candidate directed acyclic graph under the acyclicity constraint. Optimize the hybrid scoring function over the space of causal directed acyclic graphs through a discrete optimization strategy. Select the directed acyclic graph structure with the highest hybrid score as the output result.
[0022] The hybrid scoring function is a single scalar objective function that integrates the four scoring terms through weighted aggregation. Non-negative adjustable weight coefficients are applied to the language prior score, the conditional independence penalty, and the counterfactual consistency penalty, achieving a fine-grained trade-off between data fitting information, soft prior knowledge of the large language model, statistical conditional independence constraints, and counterfactual consistency constraints within a unified framework. Neighborhood operations are local modifications to the candidate directed acyclic graph, including one or more of adding an edge, deleting an edge, and reversing an edge. Before any neighborhood operation is applied, it is checked for whether it would create a directed cycle; if it would, the operation is discarded, so that every neighboring graph still satisfies the acyclicity constraint. The discrete optimization strategy is an algorithm for finding the optimal solution in the space of causal directed acyclic graphs. One or more of greedy search, random restart, and simulated annealing can be used. To improve optimization efficiency, the data fitting score, language prior score, and other terms can be cached, and the hybrid score can be updated incrementally when a local modification is made to the candidate graph, without recalculating all scores. By maximizing the hybrid scoring function over the space of causal directed acyclic graphs with the discrete optimization strategy, the directed acyclic graph structure with the highest hybrid score is selected as the output result, which is the causal relationship structure between the variables.
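The search described above can be sketched as a greedy hill climb over add, delete, and reverse moves. The sign convention in `hybrid_score` (the prior is added, the penalties are subtracted), the default weights, and all function names are assumptions for this example; the sketch also recomputes the score for every neighbor rather than caching and updating it incrementally.

```python
def hybrid_score(fit, prior, ci_pen, cf_pen, lam_prior=1.0, lam_ci=1.0, lam_cf=0.0):
    """Weighted combination of the four terms (assumed form; weights are non-negative)."""
    return fit + lam_prior * prior - lam_ci * ci_pen - lam_cf * cf_pen

def creates_cycle(dag, parent, child):
    """True if adding parent -> child would close a directed cycle.
    dag maps each node to the list of its parents."""
    stack, seen = [parent], set()
    while stack:                       # walk upward through the ancestors of `parent`
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(dag[v])
    return False

def neighbors(dag):
    """Yield every DAG reachable by adding, deleting, or reversing one edge."""
    nodes = list(dag)
    for child in nodes:
        for parent in nodes:
            if parent == child:
                continue
            if parent in dag[child]:
                removed = {v: [p for p in ps if not (v == child and p == parent)]
                           for v, ps in dag.items()}
                yield removed                                   # delete parent -> child
                if not creates_cycle(removed, child, parent):
                    flipped = {v: list(ps) for v, ps in removed.items()}
                    flipped[parent].append(child)
                    yield flipped                               # reverse the edge
            elif not creates_cycle(dag, parent, child):
                added = {v: list(ps) for v, ps in dag.items()}
                added[child].append(parent)
                yield added                                     # add parent -> child

def greedy_search(start, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current, best = start, score(start)
    while True:
        cand = max(neighbors(current), key=score, default=None)
        if cand is None or score(cand) <= best:
            return current, best
        current, best = cand, score(cand)
```

In use, `score` would be a function of the candidate graph that computes the four terms and passes them to `hybrid_score`. Random restarts or simulated annealing can wrap the same `neighbors` generator to escape local optima.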
[0023] The above causal DAG discovery method with fused soft priors for online service systems unifies the fitting information of the observed data, the statistical conditional independence constraints, and the edge-level soft prior knowledge of the large language model into a single hybrid scoring function. It also introduces an optional counterfactual self-consistency penalty. This addresses the structural instability and direction-identification difficulty of traditional causal DAG discovery methods under limited samples, as well as the shortcomings of existing LLM fusion schemes, such as loosely coupled modules, lack of a unified objective function, and lack of reliability calibration. Because it is optimized through a lightweight statistical module and a discrete search algorithm, it requires no graphics processing unit, can run on ordinary processors, and retains good interpretability and engineering deployability. In AIOps root cause analysis and fault location scenarios of cloud-native / online service systems, it suppresses structural misjudgments, uncovers causal relationships between indicators and services, provides causal evidence for root cause localization, and improves the efficiency and accuracy of intelligent operation and maintenance systems.
[0024] In one embodiment, the observation dataset includes service component-level metrics, dependency link metrics, and key event configuration quantities of cloud-native / online service systems. The textual meta-knowledge includes system architecture documents, runbooks, fault replay reports, and alarm description texts. Preprocessing includes one or more of the following: sampling by sliding time window, metric standardization, outlier correction, and discrete event encoding.
[0025] Specifically, service component-level metrics are the operational status metrics of each independent service and component in the cloud-native / online service system, covering performance metrics (p95 latency, QPS, error rate), resource metrics (CPU, memory, GC time, thread pool, connection pool), and business metrics (queue depth, cache hit ratio), etc. Dependency link metrics are the operational metrics of the call links between services and components in the system, including RPC A→B latency, timeout rate, retry rate, downstream return code distribution, etc. Critical event configuration quantities are discrete events and configuration parameters during system operation, including deployment version, feature flag, rate limiting threshold, scaling actions, alarm events, etc. These quantities can be used directly as discrete variables in the analysis or processed after encoding. Textual meta-knowledge refers to textual information inherent in cloud-native / online service systems that relates to causal relationships. Besides system architecture documents, runbooks, fault review reports, and alarm descriptions, it can also include microservice dependency descriptions, fault tickets, change logs, and standard operating procedures (SOPs). This type of text can be parsed directly by large language models and is the source of the soft prior knowledge. The core purpose of preprocessing is to transform raw observation data into a standardized sample set suitable for subsequent statistical analysis and model calculations. Sampling by a sliding time window transforms time-series data into a regular sample matrix; the window size can be set according to the system monitoring frequency. Metric standardization removes the influence of units and scales, improving the accuracy of linear regression and statistical tests. Outlier correction reduces the interference of noisy data with the analysis results. Discrete event encoding transforms discrete key event configuration quantities into numerical variables that the algorithms can process. By clearly defining the observation dataset and textual meta-knowledge and by standardizing the preprocessing operations, this embodiment provides high-quality input data for subsequent causal DAG discovery and ensures that the method is suited to cloud-native / online service system scenarios. At the same time, the rich textual meta-knowledge provides sufficient information for constructing the soft priors from the large language model.
[0026] In one embodiment, the three causal scenarios are: no direct causal relationship, the former pointing to the latter, and the latter pointing to the former. The gold standard edge set includes one or more of the following: static service dependency graph edges of cloud-native / online service systems, distributed tracing topology edges, and classic fault causal chain edges confirmed by SRE. Language prior scores are calculated based on edge-level prior probabilities:

$$S_{\mathrm{LLM}}(G) = \sum_{(u,v)} \Big[ \mathbb{1}(u \to v \in G)\,\log p_{\mathrm{LLM}}(u \to v) + \big(1 - \mathbb{1}(u \to v \in G)\big)\,\log\big(1 - p_{\mathrm{LLM}}(u \to v)\big) \Big]$$

where $p_{\mathrm{LLM}}(u \to v)$ is the prior probability of the edge, $S_{\mathrm{LLM}}(G)$ is the language prior score, $G$ is the candidate directed acyclic graph, $\mathrm{LLM}$ denotes the large language model, and $\mathbb{1}(\cdot)$ is the edge existence indicator function, whose value is 1 if the directed edge $u \to v$ belongs to the edge set of the candidate directed acyclic graph and 0 otherwise.
[0027] Specifically, for each ordered variable pair $(X_i, X_j)$, the unified template query sent to the large language model asks only about three core causal scenarios: no direct causal relationship, $X_i \to X_j$, and $X_j \to X_i$. Restricting the query in this way avoids interference from irrelevant information and ensures that the parsed probability vector has a clear causal orientation. The gold standard edge set is a set of edges in cloud-native / online service systems that have been verified in practice and whose causal direction is known. It serves as an important basis for calibrating the reliability of the large language model output. Among them, static service dependency graph edges are service call edges determined during the system design phase, distributed tracing topology edges are actual call edges obtained through link tracing technology, and classic fault causal chain edges confirmed by SRE are causal edges verified by operations and maintenance personnel and summarized from fault cases. This gold standard edge set can effectively correct the output bias of the large language model and reduce the risk of structural misjudgment caused by biased priors. The language prior score is a sum of log-likelihood terms in Bayesian prior form. The edge existence indicator function distinguishes edges that are present in the candidate directed acyclic graph from edges that are absent: for a present edge $X_i \to X_j$, the logarithm of its calibrated prior probability $\pi_{ij}$ is included; for an absent edge, the logarithm of $(1 - \pi_{ij})$ is included. Summing over all ordered variable pairs gives the final language prior score. This score directly represents the degree to which the soft prior knowledge of the large language model supports the candidate graph structure. The larger the prior probability $\pi_{ij}$ of a present edge $X_i \to X_j$, the greater its contribution to the language prior score; the smaller the prior probability, the smaller its contribution. By clearly defining the three causal scenarios, calibrating reliability against the gold standard edge set, and using a formal formula for the language prior score, this embodiment transforms the output of the large language model into formal prior terms that can be analyzed and optimized, rather than uncontrollable black-box information. This achieves a deep integration of the soft prior knowledge of the large language model with causal DAG discovery, while retaining all the soft information in the output of the large language model.
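The language prior score described above can be illustrated with a minimal Python sketch. The function name `language_prior_score`, the dictionary layout of the priors, the clipping constant `eps`, and the example service names are assumptions made for illustration; none of them appears in the application.

```python
import math

def language_prior_score(edges, prior, eps=1e-9):
    """Sum log(p) for edges present in the graph and log(1 - p) for absent ones.

    edges: set of directed edges (i, j) in the candidate DAG.
    prior: dict mapping each ordered pair (i, j) to its calibrated prior probability.
    eps:   hypothetical clipping constant that keeps the logarithm finite.
    """
    score = 0.0
    for pair, p in prior.items():
        p = min(max(p, eps), 1.0 - eps)
        score += math.log(p) if pair in edges else math.log(1.0 - p)
    return score

# Hypothetical calibrated priors for two services.
prior = {("db", "api"): 0.9, ("api", "db"): 0.05}
forward = language_prior_score({("db", "api")}, prior)
backward = language_prior_score({("api", "db")}, prior)
print(forward > backward)  # the graph that agrees with the prior scores higher
```

Clipping the probabilities is a design choice of this sketch: it prevents a prior of exactly 0 or 1 from producing an infinite score that would act as a hard constraint rather than a soft one.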
[0028] In one embodiment, the conditional independence test method includes one or more of partial correlation tests and kernel conditional independence tests, and the test probability values are mapped to weights through a monotonic function, i.e., $w_{ij \mid Z} = g(p_{ij \mid Z})$; where $w_{ij \mid Z}$ is the weight of the test result for the variable pair $(X_i, X_j)$ under the condition set $Z$, $g(\cdot)$ is a monotonic function, and $p_{ij \mid Z}$ is the test probability value for the variables $X_i$ and $X_j$ under the condition set $Z$.
[0029] Specifically, this embodiment addresses the fact that the preprocessed standardized sample set in cloud-native / online service systems consists primarily of continuous indicators. It selects partial correlation tests and kernel conditional independence tests as the core conditional independence test methods, both of which are suitable for continuous variables. Partial correlation tests offer higher computational efficiency and are suitable for low-dimensional condition sets, while kernel conditional independence tests have stronger nonlinear fitting capabilities and are suitable for high-dimensional condition sets. The choice can be made flexibly based on the actual variable dimensionality and condition set size. For time-sensitive cloud-native / online service system scenarios, lag feature expansion of the variables can be performed first, so that data from adjacent time points in the time series are correlated before the conditional independence test is executed. This breaks feedback loops in the system and satisfies the constraints of a directed acyclic graph. The test probability value $p_{ij \mid Z}$ is the statistical probability value obtained from the conditional independence test and represents the confidence level of independence / correlation of the variables $X_i$ and $X_j$ under the condition set $Z$. The monotonic function $g(\cdot)$ maps the test probability value to the corresponding weight $w_{ij \mid Z}$. The monotonic function should be chosen so that the higher the confidence level of the test, the greater the corresponding weight. In other words, a more reliable statistical test result carries a greater weight in the subsequent conditional independence penalty term and thus exerts a stronger constraint on the candidate graph structure. 
This embodiment transforms the statistical conditional independence test results into quantifiable statistical constraints by selecting an appropriate conditional independence test method and a monotonic mapping from test probability values to weights. This provides a reliable basis for constructing the subsequent conditional independence penalty term and is adapted to the temporal characteristics and data type features of cloud-native / online service systems, ensuring the effectiveness of the statistical constraints.
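As a rough illustration of the partial correlation test and the mapping from test probability values to weights, the sketch below handles a single conditioning variable and uses the Fisher z-transform. The two mapping functions `weight_dependent` and `weight_independent`, the cap of 10, and the synthetic data are assumptions of this sketch; the application only requires that the mapping be monotonic.

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr_pvalue(x, y, z):
    """Two-sided p-value for X independent of Y given one conditioning variable Z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    r = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    r = max(min(r, 0.999999), -0.999999)          # keep the Fisher transform finite
    n = len(x)
    stat = math.sqrt(n - 1 - 3) * math.atanh(r)   # Fisher z with |Z| = 1
    return math.erfc(abs(stat) / math.sqrt(2))    # equals 2 * (1 - Phi(|stat|))

def weight_dependent(p, cap=10.0):
    """Hypothetical g for dependence statements: smaller p gives a larger weight in [0, 1]."""
    return min(1.0, max(0.0, -math.log10(max(p, 1e-300)) / cap))

def weight_independent(p):
    """Hypothetical g for independence statements: larger p gives a larger weight."""
    return p

# Synthetic chain x -> z -> y, plus an unrelated variable u.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
z = [a + 0.5 * rng.gauss(0, 1) for a in x]
y = [b + 0.5 * rng.gauss(0, 1) for b in z]
u = [rng.gauss(0, 1) for _ in range(500)]

p_given_z = partial_corr_pvalue(x, y, z)   # x and y are independent given z
p_given_u = partial_corr_pvalue(x, y, u)   # x and y remain dependent given u
print(p_given_z, weight_independent(p_given_z))
print(p_given_u, weight_dependent(p_given_u))
```

Two mappings are used here because a small probability value signals confident dependence while a large one is (weaker) evidence of independence; each mapping is monotonic in its own direction.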
[0030] In one embodiment, the structural equation takes the form of: $X_j = \sum_{k \in \mathrm{Pa}_G(j)} \beta_{jk} X_k + \varepsilon_j$, $\varepsilon_j \sim \mathcal{N}(0, \sigma_j^2)$; where $X_j$ is the $j$-th variable, $\mathrm{Pa}_G(j)$ is the set of parent nodes of the $j$-th variable in the candidate directed acyclic graph $G$, $\beta_{jk}$ are the regression coefficients, and $\varepsilon_j$ is the noise term, which follows a normal distribution with mean 0 and variance $\sigma_j^2$. The data fit score is the log-likelihood score or information criterion score under the linear Gaussian structural equation model.
[0031] Specifically, this embodiment uses a linear Gaussian structural equation model as the core model for data fitting. This model assumes that each variable $X_j$ in the candidate directed acyclic graph $G$ can be expressed as a linear combination of the variables in its parent node set $\mathrm{Pa}_G(j)$ plus noise, where the noise term $\varepsilon_j$ follows a normal distribution with mean 0 and variance $\sigma_j^2$. This assumption matches the approximately linear behavior of preprocessed monitoring metrics in cloud-native / online service systems within a local time window. For each variable $X_j$, linear regression is performed with the variables in its parent node set $\mathrm{Pa}_G(j)$ as independent variables. The regression coefficients $\beta_{jk}$ and the noise variance $\sigma_j^2$ can be estimated analytically by the least squares method, which is computationally efficient, requires no graphics processing unit, and can run on ordinary processors, meeting the requirements of engineering deployability. The data fitting score is calculated from the parameter estimates of the linear Gaussian structural equation model. Either the log-likelihood score or an information criterion score with an added complexity penalty can be used. The log-likelihood score directly characterizes how well the candidate graph structure fits the standardized sample set, while the information criterion score adds a model complexity penalty to the fit to avoid overfitting. The choice can be made flexibly according to the requirements of the actual scenario. Through analytical estimation of the linear Gaussian structural equation model and calculation of the data fitting score, reliable data fitting information is provided for causal DAG discovery. 
At the same time, the characteristics of analytical estimation ensure the computational efficiency and engineering deployability of the method, making it suitable for industrial application scenarios in cloud-native / online service systems.
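The analytical estimation described above can be sketched in plain Python without any numerical library. The sketch assumes zero-mean (standardized) columns so that no intercept is needed, and uses BIC as one possible information criterion; the function names and the synthetic data are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_node(data, j, parents):
    """Least-squares fit of column j on its parent columns; returns (beta, sigma2)."""
    n = len(data)
    resid = [row[j] for row in data]
    beta = []
    if parents:
        X = [[row[k] for k in parents] for row in data]
        m = len(parents)
        A = [[sum(x[a] * x[b] for x in X) for b in range(m)] for a in range(m)]
        rhs = [sum(x[a] * yi for x, yi in zip(X, resid)) for a in range(m)]
        beta = solve(A, rhs)  # normal equations (X^T X) beta = X^T y
        resid = [yi - sum(bk * xk for bk, xk in zip(beta, x)) for x, yi in zip(X, resid)]
    sigma2 = max(sum(r * r for r in resid) / n, 1e-12)  # maximum-likelihood noise variance
    return beta, sigma2

def data_fit_score(data, parent_sets, use_bic=True):
    """Gaussian log-likelihood of the graph, optionally with a BIC complexity penalty."""
    n, total = len(data), 0.0
    for j, parents in parent_sets.items():
        _, sigma2 = fit_node(data, j, parents)
        total += -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
        if use_bic:
            total -= 0.5 * math.log(n) * (len(parents) + 1)
    return total

# Synthetic data in which column 1 depends linearly on column 0.
rng = random.Random(1)
data = []
for _ in range(200):
    x0 = rng.gauss(0, 1)
    data.append([x0, 0.8 * x0 + 0.5 * rng.gauss(0, 1)])

beta, sigma2 = fit_node(data, 1, [0])
with_edge = data_fit_score(data, {0: [], 1: [0]})
no_edge = data_fit_score(data, {0: [], 1: []})
print(beta, sigma2, with_edge > no_edge)
```

Because the score decomposes into one term per node, changing the parents of a single node only requires refitting that node, which is what makes cached, incremental updates possible during search.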
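[0032] In one embodiment, the conditional independence penalty term is: $P_{\mathrm{CI}}(G; D) = \sum_{(i,j,Z) \in \mathcal{I}} w_{ij \mid Z} \left[ 1 - \mathrm{sep}_G(X_i, X_j \mid Z) \right] + \sum_{(i,j,Z) \in \mathcal{C}} w_{ij \mid Z} \, \mathrm{sep}_G(X_i, X_j \mid Z)$; where $P_{\mathrm{CI}}(G; D)$ is the conditional independence penalty term, $D$ is the standardized sample set, $\mathcal{I}$ is the set of high-confidence conditionally independent statements, $\mathcal{C}$ is the set of high-confidence conditionally related statements, $w_{ij \mid Z}$ is the weight of the test result, and $\mathrm{sep}_G(X_i, X_j \mid Z)$ is the separation decision function, which takes the value 1 when the variables $X_i$ and $X_j$ are separated by the condition set $Z$ in the candidate directed acyclic graph $G$, and 0 when they are not.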
[0033] Specifically, the core design goal of the conditional independence penalty term is to convert inconsistencies between the candidate directed acyclic graph $G$ and the statistical conditional independence test results into discrete penalty values. A standard d-separation algorithm is used to check whether the candidate graph structure satisfies each constraint. The separation decision function $\mathrm{sep}_G(X_i, X_j \mid Z)$ is a 0-1 function that indicates whether the variable pair $X_i$, $X_j$ is separated by the condition set $Z$ in the candidate graph. For each result in the set $\mathcal{I}$ of high-confidence conditionally independent statements, the corresponding variable pair should be separated under the condition set ($\mathrm{sep}_G = 1$); if it is not separated ($\mathrm{sep}_G = 0$), then $[1 - \mathrm{sep}_G]$ equals 1 and a penalty equal to the weight $w_{ij \mid Z}$ is incurred. For each result in the set $\mathcal{C}$ of high-confidence conditionally related statements, the corresponding variable pair should not be separated under the condition set ($\mathrm{sep}_G = 0$); if it is separated ($\mathrm{sep}_G = 1$), a penalty equal to the weight $w_{ij \mid Z}$ is incurred directly. Summing the penalty values of all statements yields the final conditional independence penalty term. The more the candidate graph structure violates the statistical constraints, the larger the penalty term and the lower the corresponding mixed score. Without relying on differentiable neural networks, a discrete conditional independence penalty term is constructed from the separation relationships on the graph, achieving a deep fusion of statistical conditional independence constraints and causal DAG discovery. At the same time, introducing the test result weights $w_{ij \mid Z}$ allows statistical test results with different confidence levels to constrain the candidate graph structure to different degrees, ensuring that the statistical constraints are reasonable and effective.
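The penalty computation can be sketched as follows. The d-separation check uses the textbook moralized ancestral graph construction, which is one standard way to decide separation and is not necessarily the variant the applicants use; the data layout (a dict from node to parent set, statements as `(i, j, Z, weight)` tuples) is an assumption of this sketch.

```python
def d_separated(parents, x, y, z):
    """Return True if x and y are d-separated by the set z in the DAG `parents`.

    parents: dict mapping every node to the set of its parent nodes.
    """
    # 1. Collect x, y, z and all of their ancestors.
    relevant, stack = set(), [x, y, *z]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: link each node to its parents and link co-parents together.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = list(parents.get(v, ()))
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for a in ps:
            for b in ps:
                if a != b:
                    adj[a].add(b)
    # 3. Remove z and test whether y is still reachable from x.
    blocked, seen, stack = set(z), {x}, [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        for nb in adj[v]:
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return True

def ci_penalty(parents, indep, dep):
    """Weighted count of test statements that the candidate graph contradicts."""
    penalty = 0.0
    for i, j, z, w in indep:          # should be separated in the graph
        if not d_separated(parents, i, j, z):
            penalty += w
    for i, j, z, w in dep:            # should NOT be separated in the graph
        if d_separated(parents, i, j, z):
            penalty += w
    return penalty

# Hypothetical statements: a and c are independent given b, and dependent marginally.
indep = [("a", "c", {"b"}, 0.9)]
dep = [("a", "c", set(), 0.7)]
chain = {"a": set(), "b": {"a"}, "c": {"b"}}            # a -> b -> c
collider = {"a": set(), "b": set(), "c": {"a", "b"}}    # a -> c <- b
print(ci_penalty(chain, indep, dep), ci_penalty(collider, indep, dep))
```

The chain graph agrees with both statements and receives no penalty, while the collider graph makes a and c adjacent and so contradicts the independence statement.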
[0034] In one embodiment, the counterfactual consistency penalty is: $P_{\mathrm{CF}}(G) = \sum_{(i,j)} h(s_{ij}, q_{ij})$; where $P_{\mathrm{CF}}(G)$ is the counterfactual self-consistency penalty term, $s_{ij}$ is the local effect sign of variable $X_i$ on variable $X_j$ estimated by the structural equation model, $q_{ij}$ is the directional distribution given by the large language model for the intervention question about the variable pair, and $h(s_{ij}, q_{ij})$ is a function that measures the difference between the two, including one or more of the following: the number of sign inconsistencies and a divergence.
[0035] Specifically, the counterfactual consistency penalty term is an optional penalty term and a core module for achieving cross-domain consistency constraints between the data world and the language world. Its design goal is to ensure that the direction of local causal effects estimated by the structural equation model aligns with the narrative direction of the large language model regarding the intervention question. The local effect sign $s_{ij}$ is extracted from the regression coefficients of the linear Gaussian structural equation model and characterizes the direction of the causal effect of variable $X_i$ on variable $X_j$ (positive correlation / negative correlation / no effect). The directional distribution $q_{ij}$ of the intervention question is obtained by sending intervention statements about the variable pair $X_i$, $X_j$ to the large language model. These statements represent actual operational questions in cloud-native / online service systems, such as "If we increase the DB connection pool from 100 to 200, will it reduce p95 latency?" and "If we increase the cache capacity, will it reduce DB CPU utilization?". The corresponding directional distribution is obtained by parsing the responses of the large language model. The function $h(s_{ij}, q_{ij})$ quantifies the inconsistency between the local effect sign $s_{ij}$ and the intervention direction distribution $q_{ij}$, and can use one or more measures such as the number of sign inconsistencies and a divergence. The number of sign inconsistencies is a simple and intuitive measure, while a divergence provides a more precise measure of the difference between distributions. The final counterfactual consistency penalty term is obtained by summing the inconsistencies over all variable pairs. The higher the inconsistency between the two, the larger the penalty term and the lower the corresponding mixed score. 
By constructing a counterfactual self-consistency penalty term, the consistency constraint between the causal effect obtained from data fitting and the text intervention knowledge provided by the large language model is achieved. This effectively avoids the algorithm learning false causal edges that are "accidentally related to data but have unreasonable mechanisms," and improves the rationality and interpretability of the causal DAG structure. This design has not been seen in the prior art and is one of the important innovations of this application.
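A minimal sketch of the counterfactual consistency penalty follows. It assumes that the large language model answers have already been parsed into a distribution over the three directions `"+"`, `"-"`, and `"0"`; the dead-zone tolerance `tol`, the two difference functions, and the metric names are hypothetical choices of this sketch.

```python
import math

def effect_sign(beta, tol=1e-3):
    """Map a regression coefficient to '+', '-' or '0' (tol is a hypothetical dead zone)."""
    if beta > tol:
        return "+"
    if beta < -tol:
        return "-"
    return "0"

def h_mismatch(sign, q):
    """Sign-inconsistency count: 1 if the most likely LLM direction differs from the sign."""
    return 0.0 if max(q, key=q.get) == sign else 1.0

def h_divergence(sign, q, eps=1e-9):
    """KL divergence from a point mass on the data sign to q, which equals -log q[sign]."""
    return -math.log(max(q.get(sign, 0.0), eps))

def counterfactual_penalty(betas, llm_dirs, h=h_mismatch):
    """Sum h over the variable pairs that have both a coefficient and an LLM answer."""
    return sum(h(effect_sign(b), llm_dirs[pair])
               for pair, b in betas.items() if pair in llm_dirs)

# Hypothetical coefficients (cause, effect) and parsed LLM direction distributions.
betas = {("db_pool_size", "p95_latency"): -0.4, ("cache_size", "db_cpu"): 0.3}
llm_dirs = {
    ("db_pool_size", "p95_latency"): {"+": 0.1, "-": 0.8, "0": 0.1},
    ("cache_size", "db_cpu"): {"+": 0.1, "-": 0.7, "0": 0.2},
}
count_pen = counterfactual_penalty(betas, llm_dirs)                   # one sign disagrees
div_pen = counterfactual_penalty(betas, llm_dirs, h=h_divergence)
print(count_pen, div_pen)
```

In this example the data and the language model agree that a larger connection pool lowers latency, but disagree on the cache effect, so only the second pair is penalized under the mismatch count.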
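[0036] In one embodiment, the hybrid scoring function is: $S(G) = S_{\mathrm{data}}(G; D) + \lambda_1 S_{\mathrm{LM}}(G) - \lambda_2 P_{\mathrm{CI}}(G; D) - \lambda_3 P_{\mathrm{CF}}(G)$; where $S(G)$ is the mixed score, $S_{\mathrm{data}}(G; D)$ is the data fit score, $\lambda_1$ is the weighting coefficient for the language prior score, $S_{\mathrm{LM}}(G)$ is the language prior score, $\lambda_2$ is the weighting coefficient for the conditional independence penalty term, $P_{\mathrm{CI}}(G; D)$ is the conditional independence penalty term, $\lambda_3$ is the weighting coefficient for the counterfactual consistency penalty term, and $P_{\mathrm{CF}}(G)$ is the counterfactual self-consistency penalty term.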
[0037] Specifically, the hybrid scoring function is one of the core innovations of this application. It integrates four types of information, namely the data fitting score $S_{\mathrm{data}}(G; D)$, the language prior score $S_{\mathrm{LM}}(G)$, the conditional independence penalty term $P_{\mathrm{CI}}(G; D)$, and the counterfactual self-consistency penalty term $P_{\mathrm{CF}}(G)$, into a single scalar objective function. This realizes, within a unified framework, a fine trade-off among the fitting information of the observed data, the edge-level soft prior knowledge of the large language model, the statistical conditional independence constraints, and the counterfactual self-consistency constraints. This differs from existing technologies that simply reweight an existing scoring function or apply the various constraints in isolation within a pipeline. The coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ are all non-negative, adjustable weighting coefficients, which can be tuned according to the actual application scenarios of cloud-native / online service systems. For example, in scenarios where fault samples are scarce and data noise is high, the value of $\lambda_1$ can be appropriately increased to strengthen the constraint of the soft prior knowledge of the large language model; in scenarios with sufficient samples and reliable statistical test results, the value of $\lambda_2$ can be appropriately increased to strengthen the statistical conditional independence constraints; if the counterfactual consistency constraint is introduced, the value of $\lambda_3$ can be adjusted according to the required consistency between data and linguistic knowledge. The larger the mixed score $S(G)$, the better the overall fit of the corresponding candidate directed acyclic graph structure. The optimization objective of this application is to find, among all graphs that satisfy the directed acyclic constraint, the graph structure that maximizes the mixed score $S(G)$. 
This embodiment solves the problem of the lack of a unified conflict resolution and trade-off mechanism for various types of information in the prior art by using a unified hybrid scoring function. As the sample size increases and the error of the large language model is controllable, the hybrid scoring can reach its maximum in the intersection of the true causal graph or the true equivalence class and the high-confidence prior set, thereby possessing identifiability and consistency in terms of tendency.
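The sign convention of the hybrid scoring function (scores are added, penalties are subtracted) can be made concrete with a short sketch. The function name, the default coefficient values, and the numbers for the two candidate graphs are all hypothetical.

```python
import math

def hybrid_score(s_data, s_lm, p_ci, p_cf=0.0, lam1=1.0, lam2=1.0, lam3=0.0):
    """Combine the four terms into one scalar: higher is better.

    s_data: data fit score; s_lm: language prior score;
    p_ci: conditional independence penalty; p_cf: counterfactual penalty.
    lam1, lam2, lam3: non-negative weighting coefficients.
    """
    if min(lam1, lam2, lam3) < 0:
        raise ValueError("weighting coefficients must be non-negative")
    return s_data + lam1 * s_lm - lam2 * p_ci - lam3 * p_cf

# Two hypothetical candidate graphs described by their four component values.
graph_a = dict(s_data=-120.0, s_lm=-3.0, p_ci=0.9, p_cf=1.0)
graph_b = dict(s_data=-118.0, s_lm=-9.0, p_ci=2.4, p_cf=0.0)

weights = dict(lam1=0.5, lam2=2.0, lam3=1.0)
score_a = hybrid_score(**graph_a, **weights)
score_b = hybrid_score(**graph_b, **weights)
data_only_a = hybrid_score(**graph_a, lam1=0.0, lam2=0.0, lam3=0.0)
data_only_b = hybrid_score(**graph_b, lam1=0.0, lam2=0.0, lam3=0.0)
print(score_a > score_b, data_only_a > data_only_b)
```

With these made-up numbers, graph B fits the data slightly better on its own, but graph A wins once the language prior and the statistical constraints are weighted in, which is the kind of trade-off the coefficients are meant to control.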
[0038] In one embodiment, the neighborhood operation includes one or more of adding edges, deleting edges, and reversing the edge direction. When performing the neighborhood operation, operations that generate directed cycles are eliminated. The discrete optimization strategy includes one or more of greedy search, random restart, and simulated annealing. During the discrete optimization process, the scores of each part are cached to achieve incremental updates of the mixed scores of the neighborhood graph.
[0039] Specifically, neighborhood operations are the core operations for locally modifying candidate directed acyclic graphs (DAGs) to generate new candidate graphs. They include only three basic operations: adding edges, deleting edges, and reversing edge direction, ensuring simplicity and interpretability. Before executing any neighborhood operation, a graph algorithm is used to check for directed cycles; if a directed cycle is generated, the operation is discarded, strictly ensuring that all candidate graph structures satisfy the DAG constraint. The discrete optimization strategy is the core algorithm for finding the optimal solution that maximizes the mixed score in the causal DAG space. It uses one or more of greedy search, random restart, and simulated annealing. Greedy search is a local search algorithm based on the initial graph, iterating towards higher-scoring graph structures in the neighborhood until a local optimum is reached. Random restart involves repeatedly executing the greedy search under multiple different initial graph structures to avoid local optima caused by a single initial graph. Simulated annealing is a global optimization algorithm that introduces random factors, allowing for the acceptance of lower-scoring neighborhood graphs with a certain probability, further enhancing the ability to escape local optima. To improve the computational efficiency of discrete optimization, scores for data fitting, language priors, conditional independence penalties, and counterfactual consistency penalties are cached during the optimization process. When performing local neighborhood operations on the candidate graph, only the affected portion of the score is incrementally updated, eliminating the need to recalculate the entire score and significantly reducing computational load. Through lightweight neighborhood operations and discrete optimization strategies, efficient optimization of the hybrid scoring function is achieved. 
All algorithms are implemented based on analytical statistical models and discrete graph operations, requiring no graphics processing unit and running on ordinary processors. This meets the engineering deployability requirements of cloud-native / online service systems while maintaining good interpretability, facilitating understanding and verification by operations and maintenance personnel.
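The neighborhood operations and greedy search described above can be sketched as below. For clarity the sketch recomputes the full score for each neighbor instead of caching per-part scores, and it uses a toy scoring function built from hypothetical edge priors; a real implementation would plug in the hybrid score and update it incrementally.

```python
import math
from itertools import permutations

def has_cycle(nodes, edges):
    """Depth-first search for a directed cycle in the graph (nodes, edges)."""
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
    state = {v: 0 for v in nodes}   # 0 = unvisited, 1 = on the DFS stack, 2 = finished

    def visit(v):
        state[v] = 1
        for c in children[v]:
            if state[c] == 1 or (state[c] == 0 and visit(c)):
                return True
        state[v] = 2
        return False

    return any(state[v] == 0 and visit(v) for v in nodes)

def neighbors(nodes, edges):
    """Yield every acyclic graph reachable by one add, delete, or reverse operation."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                       # delete edge
            reversed_graph = (edges - {(u, v)}) | {(v, u)}
            if not has_cycle(nodes, reversed_graph):     # reverse edge
                yield reversed_graph
        elif (v, u) not in edges:
            added = edges | {(u, v)}
            if not has_cycle(nodes, added):              # add edge
                yield added

def greedy_search(nodes, score, start=frozenset()):
    """Hill-climb from `start` until no neighbor improves the score."""
    current = frozenset(start)
    best = score(current)
    while True:
        candidate = max(neighbors(nodes, current), key=score, default=None)
        if candidate is None or score(candidate) <= best:
            return current, best
        current, best = frozenset(candidate), score(candidate)

# Toy score: hypothetical priors that, taken alone, would favor the cycle a -> b -> c -> a.
toy_prior = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "a"): 0.7}

def toy_score(edges):
    total = 0.0
    for pair in permutations("abc", 2):
        p = toy_prior.get(pair, 0.1)
        total += math.log(p) if pair in edges else math.log(1 - p)
    return total

best_edges, best_score = greedy_search(["a", "b", "c"], toy_score)
print(sorted(best_edges))
```

The search keeps the two strongest edges and never proposes `c -> a`, because that move is filtered out by the cycle check before it is scored. Random restart would call `greedy_search` from several different `start` graphs and keep the best result.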
[0040] In one embodiment, when no numerical observation data is available, script constraints are extracted from the alarm event sequence, release pipeline, and automated remediation script of the cloud-native / online service system. A soft causal matrix is constructed by combining these constraints with the large language model. A global sequential search is then used to maximize the weighted objective of the soft causal matrix under the constraints of a directed acyclic graph, thus restoring the process causal directed acyclic graph. The objective function of the global sequential search is: $G^{*} = \arg\max_{G \in \mathcal{G}} \sum_{u \neq v} (W_{uv} - \tau) \, \mathbb{1}[(u \to v) \in E(G)]$; where $\mathcal{G}$ is the set of all graphs that satisfy the directed acyclic constraint, $W_{uv}$ is the soft score in the soft causal matrix for variable $u$ pointing to variable $v$, $\tau$ is the automatically adjusted edge penalty parameter, and $\mathbb{1}[\cdot]$ is the edge existence indicator function, which takes the value 1 when the directed edge $u \to v$ belongs to the edge set $E(G)$ of the candidate directed acyclic graph, and 0 otherwise.
[0041] Specifically, this embodiment is the minimum feasible implementation for closed-world activity script scenarios in cloud-native / online service systems. It is applicable to scenarios where there is no numerical observation data, only event sequences and script constraints. These include alarm event sequences, such as AlertA→AlertB→Change Event C→Fault D; deployment pipelines, such as build→deploy→warmup→trafficshift→errorspike; and automated repair scripts. First, script constraints are extracted from the event sequences and scripts of this type of scenario, including a candidate unordered pair set, a prohibited directed edge set, and a termination event set. The candidate unordered pair set consists of event pairs that may have a causal relationship; the prohibited directed edge set consists of edges that clearly do not have a causal relationship or have the opposite causal direction; and the termination event set consists of the final events of the process. A soft causal matrix $W$ is then constructed from the answers of the large language model. Specifically, for each candidate event pair, a direction-choice question and a two-way intervention comparison question are posed. The probability weights for each direction are parsed from the responses of the large language model and aggregated; after normalization, the soft causal matrix $W$ is obtained. Each element $W_{uv} \in [0,1]$ represents the soft score of "event $u$ is a direct cause of event $v$", and the entries for prohibited directed edges are set directly to 0. Global sequential search is a discrete optimization strategy suitable for closed-world script scenarios. It transforms causal graph search into a search over all possible node permutations, allowing only edges that point from earlier nodes to later nodes in the order. Maximizing the objective function selects the optimal node permutation and its corresponding edge set. 
The automatically adjusted edge penalty parameter $\tau$ is used to control the number of edges and suppress weak signals: only edges for which the soft score $W_{uv}$ minus $\tau$ is positive are retained. This embodiment realizes causal DAG discovery in closed-world scenarios without numerical observation data, filling the gap in existing technologies for this type of scenario. The recovered causal DAG can directly explain the fault propagation chain, narrow down the investigation scope, and provide a basis for automated handling and prioritization. It has important engineering application value in AIOps root cause analysis of cloud-native / online service systems.
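The global sequential search can be illustrated with an exhaustive sketch over node permutations. Enumerating all permutations grows factorially and is only practical for the small event sets typical of a single script; the soft scores, the fixed penalty value `tau=0.4`, and the function name are hypothetical, and the automatic adjustment of the penalty is not modeled.

```python
import math
from itertools import permutations

def order_search(events, W, tau, forbidden=frozenset()):
    """Find the node order and forward edge set maximizing the sum of (W[u, v] - tau).

    events:    list of event names.
    W:         dict mapping (u, v) to a soft score in [0, 1]; missing pairs count as 0.
    tau:       edge penalty; only edges with W[u, v] - tau > 0 are kept.
    forbidden: directed edges whose soft score is treated as 0.
    """
    best_score, best_order, best_edges = float("-inf"), None, set()
    for order in permutations(events):
        edges, total = set(), 0.0
        for a in range(len(order)):
            for b in range(a + 1, len(order)):      # only earlier -> later edges
                u, v = order[a], order[b]
                gain = (0.0 if (u, v) in forbidden else W.get((u, v), 0.0)) - tau
                if gain > 0:
                    edges.add((u, v))
                    total += gain
        if total > best_score:
            best_score, best_order, best_edges = total, order, edges
    return best_order, best_edges, best_score

# Hypothetical soft scores for the deployment pipeline example.
events = ["errorspike", "warmup", "build", "trafficshift", "deploy"]
W = {
    ("build", "deploy"): 0.9, ("deploy", "warmup"): 0.8,
    ("warmup", "trafficshift"): 0.85, ("trafficshift", "errorspike"): 0.9,
    ("deploy", "build"): 0.2, ("errorspike", "trafficshift"): 0.3,
}
best_order, best_edges, best_score = order_search(events, W, tau=0.4)
print(best_order)
print(sorted(best_edges), best_score)
```

Restricting edges to point forward in a fixed order guarantees acyclicity by construction, so no separate cycle check is needed inside the search.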
[0042] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some of the steps in Figure 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. Their execution order is not necessarily sequential; they can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0043] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 2. The computer device includes a processor, memory, network interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external terminals via a network connection. When executed by the processor, the computer program implements a causal DAG discovery method based on fused soft priors for online service systems. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen; buttons, a trackball, or a touchpad mounted on the computer device casing; or an external keyboard, touchpad, or mouse.
[0044] Those skilled in the art will understand that the structure shown in Figure 2 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0045] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0046] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0047] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A causal DAG discovery method based on fused soft priors for online service systems, characterized in that, The method includes: Obtain observation datasets and textual meta-knowledge of cloud-native / online service systems, preprocess the observation datasets to obtain standardized sample sets, identify the type and semantic description of each variable in the standardized sample sets, and output the standardized sample sets and variable type semantic information. Based on the semantic information of the variable type, a natural language description of the variable is generated. For each pair of ordered variables, a query with a unified template is initiated to the large language model. The answer of the large language model is parsed to obtain the probability vectors of three causal situations. The probability vectors are calibrated and corrected by the gold standard edge set or calibration technology to obtain the edge-level prior probability. Select a conditional independence test method that is suitable for the variable type, perform conditional independence tests on the variable pairs and condition sets in the standardized sample set, divide the test results into a high-confidence conditional independence statement set and a high-confidence conditional related statement set according to the test probability value threshold, map the test probability value to the weight of each test result, and output the high-confidence conditional independence statement set, the high-confidence conditional related statement set, and the test result weights. Using any candidate directed acyclic graph as the initial structure, linear regression is performed with each variable in the standardized sample set as the dependent variable and the set of parent nodes of each variable in the candidate directed acyclic graph as the independent variable to estimate the structural equation parameters and noise variance. 
The data fitting score is then calculated based on the structural equation parameters and noise variance. Based on the edge-level prior probabilities, the language prior score is calculated. At the same time, the graph separation algorithm is used to check the constraint satisfaction of each test result in the set of high-confidence conditionally independent statements and the set of high-confidence conditionally related statements, and the conditional independence penalty term is calculated. If counterfactual consistency constraints need to be introduced, the counterfactual consistency penalty term is calculated by combining the structural equation parameters and the large language model's response to the intervention statement. The language prior score, conditional independence penalty term, and counterfactual consistency penalty term are output. The data fitting score, language prior score, conditional independence penalty term, and counterfactual consistency penalty term are integrated into a hybrid scoring function. Neighborhood operations of candidate directed acyclic graphs are defined under directed acyclic graph constraints. The hybrid scoring function is optimized in the causal directed acyclic graph space through a discrete optimization strategy. The directed acyclic graph structure with the highest hybrid score is selected as the output result.
2. The method according to claim 1, characterized in that, The observation dataset includes service component-level metrics, dependency link metrics, and key event configuration quantities of cloud-native / online service systems. The textual meta-knowledge includes system architecture documents, runbooks, fault review reports, and alarm description texts. The preprocessing includes one or more of the following: sliding time window sampling, metric standardization, outlier correction, and discrete event encoding.
3. The method according to claim 1, characterized in that, The three causal scenarios are: no direct causal relationship, the former pointing to the latter, and the latter pointing to the former. The gold standard edge set includes one or more of the following: static service dependency graph edges of cloud-native / online service systems, distributed tracing topology edges, and classic fault causal chain edges confirmed by SRE. The language prior score is calculated from the edge-level prior probabilities as $S_{\mathrm{LM}}(G) = \sum_{i \neq j} \left[ \mathbb{1}[(X_i \to X_j) \in E(G)] \log \pi_{ij} + \left(1 - \mathbb{1}[(X_i \to X_j) \in E(G)]\right) \log (1 - \pi_{ij}) \right]$; where $\pi_{ij}$ is the edge-level prior probability of the edge $X_i \to X_j$, $S_{\mathrm{LM}}(G)$ is the language prior score, $G$ is the candidate directed acyclic graph, the subscript $\mathrm{LM}$ denotes the large language model, and $\mathbb{1}[\cdot]$ is the edge existence indicator function, which takes the value 1 when the directed edge $X_i \to X_j$ belongs to the edge set $E(G)$ of the candidate directed acyclic graph, and 0 otherwise.
4. The method according to claim 1, characterized in that, The conditional independence test method includes one or more of partial correlation tests and kernel conditional independence tests, and the test probability values are mapped to weights through a monotonic function, i.e., $w_{ij \mid Z} = g(p_{ij \mid Z})$; where $w_{ij \mid Z}$ is the weight of the test result for the variable pair $(X_i, X_j)$ under the condition set $Z$, $g(\cdot)$ is a monotonic function, and $p_{ij \mid Z}$ is the test probability value for the variables $X_i$ and $X_j$ under the condition set $Z$.
5. The method according to claim 1, characterized in that, The structural equation is in the form of: $X_j = \sum_{k \in \mathrm{Pa}_G(j)} \beta_{jk} X_k + \varepsilon_j$, $\varepsilon_j \sim \mathcal{N}(0, \sigma_j^2)$; where $X_j$ is the $j$-th variable, $\mathrm{Pa}_G(j)$ is the set of parent nodes of the $j$-th variable in the candidate directed acyclic graph $G$, $\beta_{jk}$ are the regression coefficients, and $\varepsilon_j$ is the noise term, which follows a normal distribution with mean 0 and variance $\sigma_j^2$. The data fit score is a log-likelihood score or an information criterion score under a linear Gaussian structural equation model.
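6. The method according to claim 1, characterized in that, The conditional independence penalty term is: $P_{\mathrm{CI}}(G; D) = \sum_{(i,j,Z) \in \mathcal{I}} w_{ij \mid Z} \left[ 1 - \mathrm{sep}_G(X_i, X_j \mid Z) \right] + \sum_{(i,j,Z) \in \mathcal{C}} w_{ij \mid Z} \, \mathrm{sep}_G(X_i, X_j \mid Z)$; where $P_{\mathrm{CI}}(G; D)$ is the conditional independence penalty term, $D$ is the standardized sample set, $\mathcal{I}$ is the set of high-confidence conditionally independent statements, $\mathcal{C}$ is the set of high-confidence conditionally related statements, $w_{ij \mid Z}$ is the weight of the test result, and $\mathrm{sep}_G(X_i, X_j \mid Z)$ is the separation decision function, which takes the value 1 when the variables $X_i$ and $X_j$ are separated by the condition set $Z$ in the candidate directed acyclic graph $G$, and 0 when they are not.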
7. The method according to claim 1, characterized in that, The counterfactual consistency penalty is as follows: $P_{\mathrm{CF}}(G) = \sum_{(i,j)} h(s_{ij}, q_{ij})$; where $P_{\mathrm{CF}}(G)$ is the counterfactual self-consistency penalty term, $s_{ij}$ is the local effect sign of variable $X_i$ on variable $X_j$ estimated by the structural equation model, $q_{ij}$ is the directional distribution given by the large language model for the intervention question about the variable pair, and $h(s_{ij}, q_{ij})$ is a function that measures the difference between the two, including one or more of the following: the number of sign inconsistencies and a divergence.
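8. The method according to claim 1, characterized in that, The hybrid scoring function is: $S(G) = S_{\mathrm{data}}(G; D) + \lambda_1 S_{\mathrm{LM}}(G) - \lambda_2 P_{\mathrm{CI}}(G; D) - \lambda_3 P_{\mathrm{CF}}(G)$; where $S(G)$ is the mixed score, $S_{\mathrm{data}}(G; D)$ is the data fit score, $\lambda_1$ is the weighting coefficient for the language prior score, $S_{\mathrm{LM}}(G)$ is the language prior score, $\lambda_2$ is the weighting coefficient for the conditional independence penalty term, $P_{\mathrm{CI}}(G; D)$ is the conditional independence penalty term, $\lambda_3$ is the weighting coefficient for the counterfactual consistency penalty term, and $P_{\mathrm{CF}}(G)$ is the counterfactual self-consistency penalty term.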
9. The method according to claim 1, characterized in that, The neighborhood operations include one or more of adding edges, deleting edges, and reversing the edge direction. When performing neighborhood operations, operations that generate directed cycles are eliminated. The discrete optimization strategy includes one or more of greedy search, random restart, and simulated annealing. During the discrete optimization process, the scores of each part are cached to achieve incremental updates of the mixed scores of the neighborhood graph.
10. The method according to claim 1, characterized in that, The method further includes: When no numerical observation data is available, script constraints are extracted from alarm event sequences, release pipelines, and automated remediation scripts of cloud-native / online service systems. These constraints are then combined with a large language model to construct a soft causal matrix. A global sequential search is then used to maximize the weighted objective of the soft causal matrix under the constraints of a directed acyclic graph, thus restoring the process causal directed acyclic graph. The objective function of the global sequential search is: $G^{*} = \arg\max_{G \in \mathcal{G}} \sum_{u \neq v} (W_{uv} - \tau) \, \mathbb{1}[(u \to v) \in E(G)]$; where $\mathcal{G}$ is the set of all graphs that satisfy the directed acyclic constraint, $W_{uv}$ is the soft score in the soft causal matrix for variable $u$ pointing to variable $v$, $\tau$ is the automatically adjusted edge penalty parameter, and $\mathbb{1}[\cdot]$ is the edge existence indicator function, which takes the value 1 when the directed edge $u \to v$ belongs to the edge set $E(G)$ of the candidate directed acyclic graph, and 0 otherwise.