Cue-driven population segment and behavior co-analysis system

A prompt-driven system for the joint analysis of user population composition and behavior addresses cluster instability and causal effect estimation bias in multi-source, mixed-variable data, supporting efficient user behavior analysis and decision-making.

CN121434835B · Active · Publication Date: 2026-06-16 · HANGZHOU DAZHU YUNZHI TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU DAZHU YUNZHI TECH CO LTD
Filing Date
2025-11-06
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately capture subtle structural differences between samples in a high-dimensional sparse feature space when processing user behavior data with multiple sources and mixed variables. This results in unstable clustering results, blurred cluster boundaries, and biased causal effect estimations, making it difficult to form accurate and actionable conclusions.

Method used

A cue-driven joint analysis system for population composition and behavior is adopted. By parsing the cue words and constructing Copula space units, the input data is uniformly mapped into a pseudo-observation table of unit intervals, eliminating marginal distribution differences and preserving dependency structure. Combined with sparse subspace clustering and intra-cluster causal Uplift estimation and dual correction, an actionable population is generated and a joint analysis report is output.

Benefits of technology

Accurately identifying local dependency structures and potential population subgroups in high-dimensional mixed data enables stable causal score estimation and generation of actionable populations, thereby improving the stability of data analysis and the operability of decision-making.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to the technical field of data analysis, and in particular to a prompt-word-driven system for the joint analysis of population composition and behavior. The system comprises: a prompt word parsing and Copula space construction unit, which receives prompt words containing a target behavior field name, an intervention field name, a variable value range description and an output field name; a Copula space sparse subspace clustering unit, which constructs a data matrix from the unit interval pseudo-observation table output by the prompt word parsing and Copula space construction unit; an intra-cluster causal uplift estimation and dual correction unit, which independently executes its process for each low-dimensional dependency cluster obtained by the Copula space sparse subspace clustering unit; and an actionable population generation and joint analysis report output unit, which sorts the individual causal uplift scores obtained in each cluster by the intra-cluster causal uplift estimation and dual correction unit from high to low. The present application achieves the collaborative fusion of distribution uniformity, structural sparsity and causal consistency, and significantly improves the accuracy of population division.

Description

Technical Field

[0001] This invention belongs to the field of data analysis technology, specifically relating to a cue word-driven joint analysis system for population composition and behavior.

Background Technology

[0002] Existing technologies typically rely on feature engineering and machine learning algorithms to discover potential patterns through aggregated statistics and supervised learning of user behavior data. For example, common user segmentation methods include clustering algorithms based on K-means or Gaussian mixture models, which divide users into groups by minimizing distance or maximizing likelihood functions in the feature space. However, these methods often assume similar marginal distributions of input variables, ignoring the complex dependency structures between variables. When input data comes from different sources or heterogeneous systems, such as online behavioral data and offline transaction data, their distributions differ significantly, leading to unstable clustering results and blurred group boundaries.

[0003] Another class of existing technologies identifies intervention effects through causal inference methods, such as propensity score matching, inverse probability weighting, and doubly robust estimation. These methods perform well when the data follow a single distribution, but in multi-source mixed-variable scenarios they often struggle to distinguish spurious correlations caused by dependency structures from genuine causal effects. Furthermore, traditional causal estimation methods typically assume a densely populated covariate space and therefore fail to capture subtle structural differences between samples in high-dimensional sparse feature spaces. For complex behavioral data, nonlinear dependencies between individuals and potential subgroup structures are ignored, which biases causal effect estimates and makes it difficult to form precise, actionable conclusions. In recent years, some studies have attempted to bridge the gap between distribution unification and causal estimation. For example, some studies have used Copula theory to model dependencies between variables, mapping data with different marginal distributions to a unified dependency space and then performing clustering or correlation analysis within that space. However, existing Copula modeling is mostly used for financial risk or biostatistical analysis and lacks an efficient, feasible implementation path for large-scale, multi-dimensional mixed-variable user behavior data. Especially in scenarios with many categorical variables, time variables, and missing items, Copula structural inference is complex and computationally expensive, making it difficult to form an applicable pipeline.

Summary of the Invention

[0004] The main objective of this invention is to provide a cue-driven joint analysis system for population composition and behavior.

[0005] To solve the above problems, the technical solution of the present invention is implemented as follows:

[0006] This system is a prompt word (cue word) driven joint analysis system for population composition and behavior. The system comprises a computing device and a storage medium storing executable program instructions that cause the computing device to operate the following units: a prompt word parsing and Copula space construction unit, configured to: receive prompt words containing a target behavior field name, an intervention field name, a variable value range description, and an output field name; locate columns in the input data table based on the prompt words; and map each column into a unified unit interval pseudo-observation table to eliminate marginal distribution differences while preserving the dependency structure; a Copula space sparse subspace clustering unit, configured to: construct a data matrix from the unit interval pseudo-observation table output by the prompt word parsing and Copula space construction unit; and obtain multiple low-dimensional dependency clusters through self-expression solving and spectral clustering; an intra-cluster causal uplift estimation and dual correction unit, configured to: independently execute its process for each low-dimensional dependency cluster obtained by the Copula space sparse subspace clustering unit; and obtain the individual causal uplift score of each sample and the cluster-level strength of each cluster; and an actionable population generation and joint analysis report output unit, configured to: sort the individual causal uplift scores obtained in each cluster by the intra-cluster causal uplift estimation and dual correction unit from high to low; extract samples to form a candidate set; filter and merge the samples according to the execution constraints in the prompt words to obtain the actionable population; and output the joint analysis report.

[0007] Furthermore, the prompt word parsing and Copula space construction unit is further configured to execute a unit interval mapping process when processing continuous variables. This process includes: counting the total number of samples of the continuous variable; generating an index sequence and sorting it from smallest to largest according to the values of the continuous variable; when identical values are encountered, recording the first and last positions of those values in the sorted sequence and setting the rank of all samples sharing the value to the average of the two recorded positions; and dividing the rank of each sample by the total number of samples plus one to obtain the unit interval value.

[0008] Furthermore, the prompt word parsing and Copula space construction unit is further configured to perform a unit interval mapping process when processing categorical variables. This process includes: extracting all deduplicated category labels of the categorical variable; performing a normalization step on each deduplicated category label, which converts the label to Unicode standard form and then to a UTF-8 byte sequence, and sorting the labels by bytes from smallest to largest to obtain a deterministic category order; assigning a sequence number starting from 1 to each category in that order; replacing each sample's category with the corresponding sequence number; counting, within the categorical variable, the number of samples whose sequence number is less than that of the currently processed sample and the number of samples whose sequence number is equal to it; and adding the count of samples with smaller sequence numbers to half of the count of samples with equal sequence numbers, then dividing the sum by the total number of samples plus one to obtain the unit interval value.

[0009] Furthermore, the prompt word parsing and Copula space construction unit is further configured such that, when handling missing items, the process includes: placing the missing items at the beginning of the sorted sequence when sorting the original variable, and assigning each of them a unit interval value of 1 divided by the sum of the total number of samples of the corresponding variable and one; filling the missing indicator column corresponding to the original variable with 1 for missing samples and 0 for non-missing samples; and applying the categorical variable process to the missing indicator column to obtain its unit interval values.

[0010] Furthermore, the prompt word parsing and Copula space construction unit is also configured to execute a dependency description generation and Copula family selection process. This process includes: for any two unit interval variables, enumerating all sample pairs without missing values; if the difference directions of the current sample pair in the two columns are consistent, the pair is considered coordinated (concordant); if the directions are opposite, it is considered repulsive (discordant); sample pairs with the same value are skipped; the difference between the number of coordinated pairs and the number of repulsive pairs is divided by the number of sample pairs involved in the calculation to obtain the rank correlation score; at the same time, the frequency of co-occurrence above 0.95 and the frequency of co-occurrence below 0.05 are counted and recorded as the upper tail co-occurrence rate and the lower tail co-occurrence rate, respectively, to generate a dependency signature for the variable pair; based on the generated dependency signature and a preset rule mapping, a Copula family matching the dependency signature is selected from a set of Copula families including the Frank, Gumbel, Clayton, t, and Gaussian families; and the selection result, along with the rank correlation score, the upper tail co-occurrence rate, and the lower tail co-occurrence rate, is recorded as a structured dependency description.

[0011] Furthermore, the Copula space sparse subspace clustering unit is configured to perform the following self-expression coefficient estimation process when executing self-expression solving: construct a data matrix in which each sample is a column vector and each variable is a row; perform orthogonal matching pursuit on each sample column vector to obtain sparse reconstruction coefficients, specifically including: a) initializing the coefficient vector of the current column to all zeros and fixing the diagonal element to zero; b) calculating the current residual, which is the difference between the target column vector and the weighted linear combination of the other column vectors; c) among the columns not yet selected, finding the column index whose inner product with the current residual has the largest absolute value and adding it to the selected set; d) using the matrix formed by the selected column vectors as the design matrix, calling a least squares solver based on QR decomposition with column pivoting to obtain the coefficient values that minimize the sum of squares of the current residual, and updating the current residual; e) repeating steps c) to d) until the number of selected columns reaches 8 or the Euclidean norm of the current residual decreases by less than 1e-6; f) writing the coefficient values corresponding to the selected columns back to the coefficient vector of the current column, keeping the diagonal element zero.
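Steps a) to f) can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `omp_self_expression` is hypothetical, and `np.linalg.lstsq` (an SVD-based solver) stands in for the column-pivoted QR least squares solver named in the text. `scipy.linalg.lstsq` with `lapack_driver="gelsy"` would be a closer match if SciPy is available.

```python
import numpy as np

def omp_self_expression(X, max_atoms=8, tol=1e-6):
    """Sparse self-expression by orthogonal matching pursuit.

    X has one row per variable and one column per sample. Returns C with a
    zero diagonal such that X[:, j] is approximated by X @ C[:, j].
    """
    n = X.shape[1]
    C = np.zeros((n, n))
    for j in range(n):
        target = X[:, j]
        residual = target.copy()
        available = np.ones(n, dtype=bool)
        available[j] = False              # a sample never represents itself
        selected = []
        coef = np.zeros(0)
        prev_norm = np.linalg.norm(residual)
        while len(selected) < max_atoms and available.any():
            # Step c: best-correlated column among those not yet selected.
            scores = np.where(available, np.abs(X.T @ residual), -np.inf)
            k = int(np.argmax(scores))
            selected.append(k)
            available[k] = False
            # Step d: least squares refit on the selected columns
            # (assumption: SVD-based lstsq replaces pivoted QR).
            coef, *_ = np.linalg.lstsq(X[:, selected], target, rcond=None)
            residual = target - X[:, selected] @ coef
            norm = np.linalg.norm(residual)
            # Step e: stop when the residual norm barely decreases.
            if prev_norm - norm < tol:
                break
            prev_norm = norm
        if selected:
            C[selected, j] = coef         # step f: write back, diagonal stays 0
    return C

# Toy data: samples 0-2 lie on one line, samples 3-5 on an orthogonal line.
X = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]])
C = omp_self_expression(X)
```

In the toy data each sample is reconstructed from samples on its own line, so the coefficients linking the two groups stay at (numerically) zero, which is the property the later spectral clustering step relies on.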

[0012] Furthermore, the Copula space sparse subspace clustering unit is configured such that, when performing spectral clustering, its process includes: taking the absolute value of the coefficient matrix composed of the sparse reconstruction coefficients obtained from self-expression solving, and adding it to its transpose to obtain a symmetric adjacency matrix; calculating the 20 smallest eigenvalues and the corresponding eigenvectors of the symmetric normalized Laplacian matrix of the symmetric adjacency matrix; calculating the ratio of adjacent eigenvalues in ascending order, finding the breakpoint with the largest ratio, and using its index as the number of clusters; performing k-means clustering with k-means++ initialization on the eigenvectors corresponding to the determined number of clusters, with the maximum number of iterations set to 300 and the center change threshold set to 1e-6; and outputting the low-dimensional dependency cluster label of each sample.
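The adjacency construction and the eigenvalue-ratio rule for choosing the number of clusters can be sketched as below. The function name `estimate_cluster_count` and the small constant `eps` are assumptions: the text does not say how the ratio is guarded when an eigenvalue is zero, so `eps` is added to numerator and denominator here. The k-means step is left out of the sketch.

```python
import numpy as np

def estimate_cluster_count(C, n_eigs=20, eps=1e-10):
    """Return (cluster count, spectral embedding) from a coefficient matrix."""
    A = np.abs(C)
    W = A + A.T                                   # symmetric adjacency matrix
    deg = W.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, eps)), 0.0)
    # Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2).
    L = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    m = min(n_eigs, len(vals))
    vals = np.clip(vals[:m], 0.0, None)           # remove tiny negative round-off
    # Ratio of each eigenvalue to its predecessor; eps is an assumed guard
    # against division by zero.
    ratios = (vals[1:] + eps) / (vals[:-1] + eps)
    k = int(np.argmax(ratios)) + 1                # eigenvalues before the break
    return k, vecs[:, :k]

# Toy coefficient matrix: three groups of three mutually connected samples.
block = np.ones((3, 3)) - np.eye(3)
C = np.kron(np.eye(3), block)
k, embedding = estimate_cluster_count(C)
```

The rows of `embedding` would then be clustered with k-means. With scikit-learn, `KMeans(n_clusters=k, init="k-means++", max_iter=300, tol=1e-6)` is one plausible configuration, although its `tol` is not necessarily defined the same way as the center change threshold in the text.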

[0013] Furthermore, the intra-cluster causal uplift estimation and dual correction unit is configured such that, during CATE (conditional average treatment effect) estimation, the process includes: randomly splitting the low-dimensional dependency cluster into 5 folds, specifying the intervention identifier column and the outcome column, and treating the remaining columns as covariates; training a random forest regression model on the intervention subset to predict outcomes under intervention conditions, and training a random forest regression model on the control subset to predict outcomes under control conditions, where each random forest regression model has 500 trees, a maximum depth of 30, and a minimum of 5 samples per leaf node, considers at each split a random subset of features whose size is the square root of the number of features, and uses out-of-bag error for early stopping; alternating training and validation across the 5 folds so that the predictions on each validation fold are obtained through cross-prediction; within the intervention subset, obtaining the pseudo-effect by subtracting the cross-prediction of the control outcome model from the actual outcome, and training a random forest effect regression model with the covariates as input; within the control subset, obtaining the pseudo-effect by subtracting the actual outcome from the cross-prediction of the intervention outcome model, and training a second random forest effect regression model; fitting the intervention propensity score on the entire cluster sample using logistic regression with L2 regularization; and determining the CATE estimate of each sample by threshold selection: when the intervention propensity score is greater than or equal to 0.5, the prediction of the intervention effect regression model on that sample is taken; when the intervention propensity score is less than 0.5, the prediction of the control effect regression model on that sample is taken.
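The pseudo-effect and threshold selection rules can be illustrated in isolation, taking the cross-predictions and effect-model predictions as given inputs. The sketch below does not train any random forest or logistic regression model; the function names and all numbers are hypothetical.

```python
def pseudo_effects(outcomes, treated, control_pred, treated_pred):
    """Per-sample pseudo-effects from cross-predicted outcome models.

    Intervention samples: actual outcome minus control-model prediction.
    Control samples: intervention-model prediction minus actual outcome.
    """
    effects = []
    for y, is_treated, mu0, mu1 in zip(outcomes, treated, control_pred, treated_pred):
        effects.append(y - mu0 if is_treated else mu1 - y)
    return effects

def select_cate(propensity, tau_intervention, tau_control, threshold=0.5):
    """Pick the intervention effect model's prediction when the propensity
    score is at least the threshold, otherwise the control effect model's."""
    return [t if e >= threshold else c
            for e, t, c in zip(propensity, tau_intervention, tau_control)]

# Hypothetical values for four samples (two intervention, two control).
outcomes = [5.0, 3.0, 4.0, 1.0]
treated = [True, True, False, False]
control_pred = [3.0, 2.0, 3.5, 1.5]   # cross-predictions of the control outcome model
treated_pred = [5.5, 3.5, 6.0, 2.0]   # cross-predictions of the intervention outcome model
effects = pseudo_effects(outcomes, treated, control_pred, treated_pred)

cate = select_cate(
    propensity=[0.7, 0.5, 0.49, 0.2],
    tau_intervention=[2.0, 1.0, 1.8, 0.9],
    tau_control=[1.5, 0.8, 2.0, 1.0],
)
```

Note that the selection is a hard switch at 0.5 rather than a propensity-weighted blend of the two effect models, which is what the text describes.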

[0014] Furthermore, the intra-cluster causal uplift estimation and dual correction unit is further configured such that, when performing dual correction, the process includes: first, calculating the cross-prediction residuals for the intervention subset and the control subset respectively, and training a regression tree of depth 3 on the covariates to obtain the residual correction function of each group; then, using the difference between the outputs of the two groups' residual correction functions on the sample as the correction term and adding it to the determined CATE estimate to obtain the individual causal uplift score; subsequently, at the validation layer, using the individual causal uplift score of each sample as the independent variable, and the sign consistency between the predicted difference of the intervention and control outcome models and the difference in actual results as the dependent variable, training an independent isotonic (order-preserving) regression calibrator for each cluster and performing monotonic calibration on the individual causal uplift scores.
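The monotonic calibration step can be illustrated with a small pool-adjacent-violators routine, which is the standard way to fit an isotonic regression. This is a sketch under assumptions: the 0/1 sign-consistency labels are taken as given, tied scores are not pooled, prediction is a step function rather than an interpolation, and the function names are hypothetical.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns a list of (right_edge, level) pairs ordered by score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block holds [sum of labels, count, largest score]
    for i in order:
        blocks.append([labels[i], 1, scores[i]])
        # Merge backwards while the previous block's mean exceeds the new one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right_edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right_edge
    return [(b[2], b[0] / b[1]) for b in blocks]

def isotonic_predict(steps, score):
    """Calibrated value for one score; scores beyond the last edge get the top level."""
    for right_edge, level in steps:
        if score <= right_edge:
            return level
    return steps[-1][1]

# Hypothetical uplift scores and sign-consistency labels for one cluster.
uplift_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
consistent = [0, 1, 0, 1, 1]
steps = isotonic_fit(uplift_scores, consistent)
calibrated = [isotonic_predict(steps, s) for s in uplift_scores]
```

Here the out-of-order labels at scores 0.2 and 0.3 are pooled to a common level of 0.5, so the calibrated values never decrease as the uplift score increases.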

[0015] Furthermore, the actionable population generation and joint analysis report output unit is configured to execute the following processes: sorting the individuals within each cluster by causal uplift score from highest to lowest, and extracting the samples in the top 20 percent of each cluster to form a candidate set; filtering out unreachable samples based on the execution constraints in the prompt words, and then merging the remaining samples to obtain the actionable population; and outputting the joint analysis report, which includes population component identifiers, a structured dependency description summary, a sparse geometric profile, intra-cluster CATE estimation results, and causal-uplift-driven intervention suggestions. The intra-cluster causal uplift estimation and dual correction unit is also configured to: use the median of the individual causal uplift scores within each cluster as the cluster-level strength; and calculate the top 20 variables by permutation importance in the effect regression model, constructing and outputting a list of key covariates.
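The candidate extraction, reachability filter, and cluster-level strength can be sketched as follows. The function name, the data layout (a dictionary of per-cluster score dictionaries and a set of reachable sample identifiers), and the choice to round the top 20 percent upward are all assumptions made for illustration.

```python
from statistics import median

def build_actionable_population(scores_by_cluster, reachable, top_percent=20):
    """Return (actionable sample ids, cluster-level strength per cluster)."""
    actionable, strength = [], {}
    for cluster, scores in scores_by_cluster.items():
        # Sort sample ids by uplift score, highest first.
        ranked = sorted(scores, key=scores.get, reverse=True)
        # Integer ceiling of top_percent of the cluster size (assumed rounding).
        keep = (len(ranked) * top_percent + 99) // 100
        # Keep only candidates the execution constraints allow us to reach.
        actionable.extend(s for s in ranked[:keep] if s in reachable)
        strength[cluster] = median(scores.values())
    return actionable, strength

# Hypothetical scores: cluster A has 10 samples, cluster B has 5.
scores_by_cluster = {
    "A": {f"a{i}": i / 10 for i in range(1, 11)},
    "B": {f"b{i}": i / 10 for i in range(1, 6)},
}
reachable = {"a10", "b5", "a1"}
actionable, strength = build_actionable_population(scores_by_cluster, reachable)
```

In this example the candidates are `a10` and `a9` from cluster A and `b5` from cluster B; `a9` is dropped because it is not reachable, and `a1` is reachable but never enters the candidate set.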

[0016] The cue-driven joint analysis system for population composition and behavior of this invention has the following beneficial effects. First, at the data level, this invention systematically eliminates the marginal distribution differences of the original data by uniformly mapping input variables to a unit interval pseudo-observation table, while preserving the dependency structure between variables. This processing allows data from different sources or of different types to be compared and analyzed in the same statistical space, fundamentally improving the stability of population structure identification. Secondly, at the structure identification level, this invention discovers low-dimensional dependency clusters by performing sparse subspace clustering in the Copula space, combining self-expression solving and spectral clustering, which can accurately identify local dependency structures and potential population subgroups in high-dimensional mixed data. This process is based on sparse geometric relationships, significantly reducing the ambiguity and randomness of traditional clustering algorithms on heterogeneous data. Thirdly, at the causal estimation level, this invention independently performs causal uplift estimation and dual correction within each low-dimensional dependency cluster. Through cross-prediction, residual correction, and isotonic (order-preserving) regression calibration, stable and unbiased individual causal uplift scores are obtained, achieving accurate quantification of intra-cluster effects. This intra-cluster local causal estimation mechanism tightly integrates statistical structure with intervention effects, so that the estimation results possess both local consistency and global interpretability. Furthermore, this invention combines individual causal uplift scores with cluster-level strength, generates candidate sets by ranking, and forms actionable populations based on the constraints in the prompt words. This allows the model output to be directly executed by the business system, thus achieving a closed loop from data modeling to intervention decision-making. Finally, at the system application level, the joint analysis report integrates the dependency structure, sparse geometric profiles, and causal-uplift-driven suggestions, providing structured decision-making support for product optimization, market launch, and user operations. Compared to existing technologies, this invention simultaneously possesses three major characteristics: distribution uniformity, structural sparsity, and causal consistency. This allows population segmentation, behavioral interpretation, and intervention optimization to be completed collaboratively within a single system, significantly improving the efficiency of complex behavioral data analysis and the operability of decision-making.

Attached Figure Description

[0017] Figure 1 is a schematic diagram illustrating the principle of unit interval mapping provided in an embodiment of the present invention;

[0018] Figure 2 is a comparison diagram of the dependency structure features of different Copula families provided in an embodiment of the present invention;

[0019] Figure 3 is a diagram of the dependency structure space and the Copula family selection regions provided in an embodiment of the present invention.

Detailed Implementation

[0020] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0021] This system is a prompt word (cue word) driven joint analysis system for population composition and behavior. The system comprises a computing device and a storage medium storing executable program instructions that cause the computing device to operate the following units: a prompt word parsing and Copula space construction unit, configured to: receive prompt words containing a target behavior field name, an intervention field name, a variable value range description, and an output field name; locate columns in the input data table based on the prompt words; and map each column into a unified unit interval pseudo-observation table to eliminate marginal distribution differences while preserving the dependency structure; a Copula space sparse subspace clustering unit, configured to: construct a data matrix from the unit interval pseudo-observation table output by the prompt word parsing and Copula space construction unit; and obtain multiple low-dimensional dependency clusters through self-expression solving and spectral clustering; an intra-cluster causal uplift estimation and dual correction unit, configured to: independently execute its process for each low-dimensional dependency cluster obtained by the Copula space sparse subspace clustering unit; and obtain the individual causal uplift score of each sample and the cluster-level strength of each cluster; and an actionable population generation and joint analysis report output unit, configured to: sort the individual causal uplift scores obtained in each cluster by the intra-cluster causal uplift estimation and dual correction unit from high to low; extract samples to form a candidate set; filter and merge the samples according to the execution constraints in the prompt words to obtain the actionable population; and output the joint analysis report.

[0022] In one embodiment, the prompt word parsing and Copula space construction unit receives prompt words containing the target behavior field name, intervention field name, variable value range description, and output field name. The prompt words are presented as plain text or structured text, and the field names use identifiers consistent with the column names in the input data table. After loading the prompt words, the system first performs standardized text processing, including removing leading and trailing whitespace, collapsing consecutive whitespace into single spaces, standardizing punctuation to half-width characters, and maintaining case sensitivity to ensure a one-to-one correspondence with the data table. Then, a column location process is executed: precise matching is performed on the column name set of the input data table according to the target behavior field name, intervention field name, and output field name; for field names related to the variable value range description, if multiple candidate column names have a literal similarity of 0.9 or higher, the column name with the same root word as the field name that appears most frequently in the prompt words is selected from these candidates. For example, the target behavior field given by the prompt is named purchase_result, the intervention field is named offer_flag, and the variable value range description includes two columns: region and device_type. The system sequentially confirms the position and data type of the four columns purchase_result, offer_flag, region, and device_type in the column name set, and generates a list of column location results as the entry point for subsequent processing.
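The text standardization and exact column location described above can be sketched in Python. This is a minimal illustration under assumptions: Unicode NFKC normalization is used as a stand-in for "standardizing punctuation to half-width characters" (it also folds other compatibility characters), the function names are hypothetical, and the similarity-based tie-breaking for value range fields is not covered.

```python
import re
import unicodedata

def normalize_prompt(text):
    """Fold full-width characters, collapse whitespace, trim; case is preserved."""
    text = unicodedata.normalize("NFKC", text)   # assumption: NFKC for half-width folding
    return re.sub(r"\s+", " ", text).strip()

def locate_columns(field_names, table_columns):
    """Exact, case-sensitive match of each field name against the table's columns."""
    positions = {}
    for name in field_names:
        if name not in table_columns:
            raise KeyError(f"column not found: {name}")
        positions[name] = table_columns.index(name)
    return positions

# Hypothetical prompt text with full-width colon, comma, and space.
raw = "  target\uff1apurchase_result\uff0c\u3000intervention\uff1aoffer_flag  "
clean = normalize_prompt(raw)

table_columns = ["user_id", "region", "device_type", "offer_flag", "purchase_result"]
located = locate_columns(
    ["purchase_result", "offer_flag", "region", "device_type"], table_columns
)
```

The resulting `located` mapping plays the role of the column location results list that the subsequent mapping steps consume.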

[0023] The system maps every column in the column location results list to a unified unit interval pseudo-observation table. This table uses samples as rows and variables as columns, with each entry falling strictly between 0 and 1, endpoints excluded. This eliminates marginal distribution differences without altering the relative order relationships within each variable. The direct effect is that, although the same user group may exhibit different value ranges and scales in data tables from different sources, all variables share a unified value domain once they pass through the unit interval pseudo-observation table. Furthermore, the coordination and repulsion patterns formed by the order relationships between variables remain consistent, providing a stable foundation for subsequent dependency structure modeling and sparse subspace clustering.

[0024] Referring to Figure 1, the principle of unit interval mapping in this invention is illustrated in two parts: continuous variable mapping and categorical variable mapping.

[0025] The upper part of Figure 1 illustrates the rank mapping process for continuous variables. On the original value axis, the original values of five samples are marked from left to right: 2.1, 5.3, 8.7 (tied), 12.4, and 18.9, with the two samples with the value 8.7 forming a tied group. The downward arrow indicates the mapping process, which uses the formula "rank divided by the total number of samples plus one" (labeled "rank / (n+1)"). The lower unit interval axis is marked evenly from 0 to 1 at 0, 0.25, 0.5, 0.75, and 1. The corresponding unit interval values for the five mapped samples are: 0.167, 0.333, 0.500, 0.667, and 0.833. The key features of this mapping process are: first, it preserves the relative order of the original data; second, samples in the tied group are assigned the same rank value; and third, all mapped values fall strictly within the open interval between 0 and 1, avoiding the endpoints.

The lower half of Figure 1 illustrates the frequency mapping process for categorical variables. In the category example area, four rectangles are arranged from left to right, labeled "Asia (30%)", "Europe (25%)", "Americas (35%)", and "Africa (10%)", with the numbers in parentheses representing the frequency of each category in the sample. Downward arrows indicate the mapping transformation. In the unit interval mapping results below, the horizontal axis is also marked from 0 to 1. The intervals corresponding to the four categories are: Africa occupies the interval from 0 to 0.15, Americas occupies the interval from approximately 0.15 to 0.40, Asia occupies the interval from approximately 0.40 to 0.70, and Europe occupies the interval from approximately 0.70 to 0.90. Each interval is marked with a dashed rectangle, and the interval length is proportional to the cumulative frequency of that category. The characteristic of this mapping is that, after the categories are ordered according to deterministic rules, the length of the unit interval occupied by each category reflects its relative frequency in the sample, while the order information between categories is preserved.

The two mapping methods shown in Figure 1 transform different types of variables onto the same unit interval from 0 to 1, laying the foundation for the subsequent Copula space construction. This mapping eliminates marginal distribution differences between variables while preserving the dependency structure information between them.

The unit interval mapping process for continuous variables is centered on the rank. The system counts the total number of samples for the variable, generates an index sequence running from 1 to the total number of samples, and stably sorts the variable's values from smallest to largest. Samples with identical values form a tie group. The system records the first and last positions of the tie group in the sorted sequence and sets the rank of all samples in the group to the average of these two positions. The rank of each sample is then divided by the total number of samples plus one to obtain a unit interval value between 0 and 1. Using the total number of samples plus one as the denominator serves two purposes: first, the unit interval values stay away from the endpoints, which avoids tail probability degradation when a Copula family is fitted later; second, datasets of different sizes remain comparable under this mapping. As a specific numerical example, suppose a continuous variable has 10 samples and, after sorting, the values at the 3rd and 4th positions are the same. The rank of these two samples is then 3.5, and dividing 3.5 by 11 gives approximately 0.3181818181; the mapping result falls between 0 and 1, consistent with the samples' relative position in the population.
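The rank-based mapping for continuous variables can be sketched in plain Python as follows. The function name `unit_interval_continuous` and the sample values are hypothetical; the logic follows the description above (average of the first and last positions for ties, denominator of n + 1).

```python
def unit_interval_continuous(values):
    """Map values into (0, 1) using average ranks divided by (n + 1)."""
    n = len(values)
    # Stable sort of sample indices by value; positions below are 1-based.
    order = sorted(range(n), key=lambda i: values[i])
    mapped = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the tie group while the next sorted value is identical.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Rank of the group: average of its first and last 1-based positions.
        avg_rank = ((start + 1) + (end + 1)) / 2
        for pos in range(start, end + 1):
            mapped[order[pos]] = avg_rank / (n + 1)
        start = end + 1
    return mapped

# Ten samples whose 3rd and 4th sorted positions share the same value.
sample = [1.0, 2.0, 3.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mapped = unit_interval_continuous(sample)
print(mapped[2], mapped[3])  # both equal 3.5 / 11
```

Because the largest possible rank is n, the largest mapped value is n / (n + 1), which is why no value ever reaches the endpoint 1.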

[0026] The unit interval mapping process for categorical variables is centered on a deterministic order. The system first extracts all unique category labels of the variable and performs a normalization step on each label: the label is converted to a standard Unicode normalization form and then to a UTF-8 byte sequence, and the labels are sorted bytewise from smallest to largest to obtain a deterministic category order. This order is independent of the runtime environment and therefore stays consistent across platforms and languages. The system then assigns a sequence number starting from 1 to each category in that order and replaces each sample's category with the corresponding number. Within the variable, for each sample, the system counts the number of samples with a smaller number and the number of samples with an equal number; adding the count of samples with a smaller number to half the count of samples with an equal number, and then dividing by the total number of samples plus one, yields the unit interval value. For example, the categorical variable `region` contains the labels Asia, Europe, Americas, and Africa. After normalization and UTF-8 sorting, the sequence is Africa, Americas, Asia, Europe, corresponding to numbers 1, 2, 3, and 4. The total number of samples is 100, with 20 samples numbered 3 and 30 samples with a number less than 3. For any sample numbered 3, the unit interval value is obtained by adding 30 and half of 20 to get 40, then dividing by 101 to get approximately 0.3960396040. This mapping places each category at a position in the unit interval that depends on its occurrence frequency, while preserving the order relationship and relative density information between categories.
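A minimal Python sketch of the categorical mapping is shown below. The choice of NFC as the normalization form is an assumption (the text only says "Unicode standard form"), the function name is hypothetical, and the split of the 30 lower-numbered samples into 10 Africa and 20 Americas is made up to complete the example.

```python
import unicodedata
from collections import Counter

def unit_interval_categorical(labels, form="NFC"):
    """Map category labels into (0, 1) via a deterministic bytewise order."""
    n = len(labels)
    # Normalize each label, then encode to UTF-8 so sorting is bytewise.
    keys = [unicodedata.normalize(form, s).encode("utf-8") for s in labels]
    ordered = sorted(set(keys))
    number = {key: i + 1 for i, key in enumerate(ordered)}   # numbers start at 1
    counts = Counter(number[key] for key in keys)
    # below[k] = how many samples carry a number smaller than k.
    below, running = {}, 0
    for k in sorted(counts):
        below[k] = running
        running += counts[k]
    return [(below[number[key]] + counts[number[key]] / 2) / (n + 1) for key in keys]

# 100 samples: 30 numbered below Asia, 20 Asia, 50 Europe.
regions = ["Africa"] * 10 + ["Americas"] * 20 + ["Asia"] * 20 + ["Europe"] * 50
u = unit_interval_categorical(regions)
print(u[30])  # an Asia sample: (30 + 20 / 2) / 101
```

All samples of a category receive the same value, located at the midpoint of the share of the unit interval that the category occupies.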

[0027] The handling of missing items addresses both stable ranking and explicit representation of missingness. When ranking an original variable, the system places missing items at the beginning of the ranking sequence and assigns each missing sample a unit interval value of 1 divided by the total number of samples plus one. In addition, the system adds a missing indicator column for each original variable, filled with 1 for missing samples and 0 for non-missing samples. This missing indicator column is then treated as a categorical variable and subjected to the same unit interval mapping. The benefit of this approach is that missing states have a stable position in the unit interval pseudo-observation table, and the missing pattern enters subsequent dependency structure modeling as a separate column, which facilitates the identification of joint changes caused by the missing mechanism. For example, if a variable has 500 samples in total, of which 20 are missing, the unit interval value of each missing sample is 1 divided by 501, or approximately 0.0019960080. After mapping, the missing indicator column takes a unit interval value close to 1 for missing samples and a value slightly below 0.5 for non-missing samples, so that coordinated or conflicting relationships with other columns show up clearly in subsequent dependency structure analysis.
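As a non-limiting illustration, the numbers in the 500-sample example can be reproduced as follows. The helper names are hypothetical, missing entries are represented by `None` as an assumption, and the indicator mapping is the categorical rule specialized to a 0/1 column in which 0 sorts before 1.

```python
def missing_indicator(values):
    """1 for missing samples, 0 for non-missing samples."""
    return [1 if v is None else 0 for v in values]


def indicator_pseudo_obs(flags):
    """Categorical unit interval mapping for a 0/1 column (0 sorts before 1)."""
    n = len(flags)
    n_zero = flags.count(0)
    n_one = n - n_zero
    u_zero = (n_zero / 2.0) / (n + 1)        # nothing below, half of the equals
    u_one = (n_zero + n_one / 2.0) / (n + 1)  # all zeros below, half of the ones
    return [u_one if f == 1 else u_zero for f in flags]


# 500 samples, 20 of them missing (hypothetical data).
values = [None] * 20 + [float(i) for i in range(480)]
n_total = len(values)
missing_value = 1.0 / (n_total + 1)  # value assigned to every missing sample
flags = missing_indicator(values)
u_flags = indicator_pseudo_obs(flags)
print(missing_value, u_flags[0], u_flags[-1])
```

With these counts the indicator column maps to 490/501 for missing samples and 240/501 for non-missing samples, matching the "close to 1" and "below 0.5" positions described above.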

[0028] After completing the unit interval mapping for each column, the system generates the information needed for dependency description and Copula family selection. For any two unit interval variables, the system enumerates all sample pairs in which neither sample has a missing value. For each sample pair, the signs of the differences in the two columns are compared: if the signs are the same, the pair is counted as coordinated (concordant); if the signs are opposite, it is counted as repulsive (discordant); sample pairs with exactly equal values are skipped. The system accumulates the numbers of coordinated and repulsive pairs and divides their difference by the number of sample pairs involved in the calculation to obtain the rank correlation score. This score lies between -1 and 1; the closer it is to 1, the stronger the monotonic coordination, and the closer it is to -1, the stronger the monotonic repulsion. At the same time, the system calculates the upper tail co-occurrence rate and the lower tail co-occurrence rate: a sample whose unit interval values in both columns are greater than 0.95 is counted as an upper tail co-occurrence, and a sample whose unit interval values in both columns are less than 0.05 is counted as a lower tail co-occurrence. Each rate is obtained by dividing the corresponding count by the total number of samples. The rank correlation score reflects global consistency, while the two tail rates characterize the co-occurrence strength in the extreme regions. Together these three quantities form the dependency signature of the variable pair, which distinguishes different dependency patterns on the common unit interval scale. The system selects the Copula family of the variable pair based on the dependency signature and a preset rule mapping, and records the structured dependency description.
An example of the rule mapping is as follows. When the absolute value of the rank correlation score is less than 0.15 and both the upper and lower tail co-occurrence rates are less than 0.02, the Frank family is selected, which suits symmetric dependence with weak tails. When the difference between the two tail co-occurrence rates is 0.05 or more and the upper tail co-occurrence rate is the higher one, the Gumbel family is selected, which suits more frequent joint increases in the upper tail; when the lower tail co-occurrence rate is the higher one and the difference is 0.05 or more, the Clayton family is selected, which suits more frequent joint decreases in the lower tail. When both tail co-occurrence rates are above 0.05 and their difference is less than 0.02, the t family is selected, which suits symmetric and significant dependence in both tails. When the absolute value of the rank correlation score is above 0.5 and both tail co-occurrence rates are between 0.02 and 0.05, the Gaussian family is selected, which suits strong overall linear consistency with moderate tails. The system records the selected Copula family, the rank correlation score, the upper tail co-occurrence rate, the lower tail co-occurrence rate, and the variable name pair as a structured dependency description, which is used for subsequent sparse subspace clustering and for intra-cluster causal Uplift estimation and dual correction.
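As a non-limiting illustration, the dependency signature and the example rule mapping can be sketched as follows. The function names are hypothetical. Several points are assumptions not fixed by the text: a pair is skipped when it is tied in either column, the inputs are already free of missing values, the rules are evaluated in the order listed, "above" is read as strictly greater, and the fallback label `"unassigned"` is used when no rule applies.

```python
def dependency_signature(u, v, upper=0.95, lower=0.05):
    """Return (rank correlation score, upper tail rate, lower tail rate)."""
    n = len(u)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            du, dv = u[i] - u[j], v[i] - v[j]
            if du == 0 or dv == 0:
                continue  # tied pair: skipped (assumed: tie in either column)
            if (du > 0) == (dv > 0):
                concordant += 1
            else:
                discordant += 1
    used = concordant + discordant
    score = (concordant - discordant) / used if used else 0.0
    upper_rate = sum(1 for a, b in zip(u, v) if a > upper and b > upper) / n
    lower_rate = sum(1 for a, b in zip(u, v) if a < lower and b < lower) / n
    return score, upper_rate, lower_rate


def select_copula_family(score, upper_rate, lower_rate):
    """Apply the example rule mapping in the order given in the text."""
    gap = abs(upper_rate - lower_rate)
    if abs(score) < 0.15 and upper_rate < 0.02 and lower_rate < 0.02:
        return "Frank"
    if gap >= 0.05 and upper_rate > lower_rate:
        return "Gumbel"
    if gap >= 0.05 and lower_rate > upper_rate:
        return "Clayton"
    if upper_rate > 0.05 and lower_rate > 0.05 and gap < 0.02:
        return "t"
    if abs(score) > 0.5 and 0.02 <= upper_rate <= 0.05 and 0.02 <= lower_rate <= 0.05:
        return "Gaussian"
    return "unassigned"  # fallback label, an assumption of this sketch


sig = dependency_signature([0.01, 0.5, 0.97, 0.99], [0.02, 0.4, 0.96, 0.98])
print(sig, select_copula_family(*sig))
```

The pairwise loop is quadratic in the number of samples and is meant only to make the counting rule concrete; it is not a statement about how the system computes the score at scale.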

[0029] Referring to Figure 2, Figure 2 compares the dependency structure features of six commonly used Copula families, using contour lines and scatter points to show the density of each family within the unit square. Figure 2 contains six subplots which are, from left to right and from top to bottom, the Gaussian Copula, t-Copula, Gumbel Copula, Clayton Copula, Frank Copula, and independent Copula. The Gaussian Copula subplot in the upper left corner of Figure 2 illustrates a symmetric linear dependency structure. The subplot shows three concentric elliptical contour lines whose line width increases from the outside in, representing density rising from low to high. The major axis of the ellipses runs along the diagonal, indicating a positive correlation between the two variables. The scatter points are evenly distributed within the ellipses, with lower density at the four corners. The subplot is labeled "τ=0.65, weak tail dependency," where τ is the rank correlation coefficient, indicating strong overall correlation between the variables but weaker linkage in the extreme value regions (tails).

[0030] The t-Copula subplot in the upper right corner of Figure 2 also shows a symmetric structure, with contour lines similar to those of the Gaussian Copula but with more scatter points at the four corners, drawn as larger solid circles. These dense corner points indicate that the t-Copula exhibits stronger correlation in the extreme value regions. The subplot is labeled "τ=0.65, strong two-tailed dependence," indicating that although the overall correlation equals that of the Gaussian Copula, this family shows significant joint extreme events in both the upper tail (upper right corner) and the lower tail (lower left corner). The Gumbel Copula subplot in the middle left of Figure 2 shows upper-tail dependence. The contour lines curve towards the upper right, with density gradually increasing from the lower left to the upper right. The scatter points in the upper right region are markedly denser and are drawn as large solid circles, indicating strong correlation when both variables take large values simultaneously. The subplot is labeled "τ=0.58, significant upper-tail correlation," indicating that this family is suitable for characterizing the joint increase of variables in high-value regions. The Clayton Copula subplot in the middle right of Figure 2 shows lower-tail dependence. The contour lines mirror those of the Gumbel Copula, curving towards the lower left, with density gradually increasing from the upper right to the lower left. The dense scatter points in the lower left region, drawn as large solid circles, indicate strong correlation when both variables take low values simultaneously. The subplot is labeled "τ=0.58, significant lower-tail correlation," indicating that this family is suitable for characterizing the joint decrease of variables in low-value regions.
The Frank Copula subplot in the lower left corner of Figure 2 illustrates a symmetric dependency structure with weak tails. The contour lines are circular, with density concentrated in the central region and gradually decreasing outwards. The scatter points are distributed fairly evenly throughout the region, with the density at the four corners comparable to that elsewhere. The subplot is labeled "τ=0.12, tail independent," indicating that this family is suitable for characterizing weak overall correlation with no special linkage in the tails.

[0031] The independent Copula subplot in the lower right corner of Figure 2 illustrates an independent structure. The subplot has no contour lines; the scatter points are distributed completely at random within the unit square, with uniform density across all regions. The subplot is labeled "τ=0.00, Completely Independent," indicating that this family corresponds to the case where the two variables are statistically independent. The bottom of Figure 2 contains a legend area enclosed in a rectangle with three lines of explanation. The first line uses a thick solid line to represent "strongly dependent contour lines," the second line uses a thin solid line to represent "weakly dependent contour lines," and the third line uses a large solid circle to represent "dense tail areas." The text on the right gives the definitions of three indicators: "τ: rank correlation coefficient," "upper tail dependence: λ_U," and "lower tail dependence: λ_L." The comparison of the six Copula families in Figure 2 shows how the geometric features of different dependency structures differ, providing an intuitive reference for selecting an appropriate Copula family based on data characteristics.

[0032] In one optional implementation, the deterministic order of categorical variables is determined by a region-aware sorting strategy: when the variable's value range specification includes language or region information, the system first determines a local order within each language or region by Unicode normalization and UTF-8 byte sorting, and then concatenates these local orders into a global order in ascending order of region code. This approach improves the semantic consistency of category labels in multi-region fusion applications. In another optional implementation, the ranks of tie groups of continuous variables employ an interpolation offset strategy: after the average rank has been assigned to a tie group, the system adds a very small incremental offset to each sample within the group in row number order, for example 0.000001 per sample, to obtain a strictly increasing sequence of unit interval values. This reduces the number of pairs skipped because of equal values in the subsequent dependency signature calculation, thereby improving the resolution of the rank correlation score. In another optional implementation, the thresholds for the upper tail co-occurrence rate and the lower tail co-occurrence rate are configured according to the variable value range description in the prompt words. For example, in financial behavior data with a large proportion of extreme values, the thresholds can be adjusted from 0.95 and 0.05 to 0.9 and 0.1 to improve sensitivity to tail linkage; in scenarios where user behavior is relatively concentrated and extreme values are rare, 0.95 and 0.05 can be retained to ensure robust identification of occasional linkage.

[0033] In practice, the system generates the unit interval pseudo-observation table and the structured dependency descriptions in a streaming manner. Specifically, each column is first mapped to the unit interval independently and written to disk or to a memory-mapped file. Then, during the dependency signature calculation phase, the unit interval values of the two columns are read in fixed-size batches. Integer counters accumulate the numbers of coordinated and repulsive pairs, and the upper tail and lower tail co-occurrence counts are summed block by block. Finally, the rank correlation score is obtained by dividing by the number of sample pairs involved in the calculation, and the two tail co-occurrence rates are obtained by dividing by the total number of samples. Sequential access with fixed batches allows the process to handle sample sizes in the tens of millions while keeping the mapping and the dependency signatures consistent across runs. The final output includes a unit interval pseudo-observation table and a list of structured dependency descriptions covering all variable pairs. Both serve as inputs for the subsequent Copula space sparse subspace clustering and for intra-cluster causal Uplift estimation and dual correction.

[0034] In one embodiment, the data matrix is laid out so that columns represent samples and rows represent behavioral variables. To form a stable geometric representation, two preprocessing steps are performed on each column. The first step is column scaling: the Euclidean norm of the column is calculated and the column is divided by this norm, so that all columns are on the same scale. As a result, directional relationships dominate the similarity calculation, and amplitude differences no longer alter adjacency relationships. The second step is endpoint safety zone processing: unit interval values are shrunk away from the edges by replacing values less than or equal to 0.001 with 0.001 and values greater than or equal to 0.999 with 0.999. This keeps the subsequent inner product and eigenvalue decomposition steps numerically stable and significantly reduces the interference of extreme values with the sparse solution, particularly at scales such as 100,000 samples and 50 variables.

[0035] The self-representation solver operates independently on each sample column vector and aims to reconstruct that column as a linear combination of a few other columns, so that samples lying on the same low-dimensional dependency structure represent one another. In practice, an orthogonal matching pursuit strategy is employed, with the following steps. The coefficient vector of the current column is initialized to all zeros, and its diagonal element is fixed at zero to exclude self-representation. The residual is set equal to the current column, and a stepwise selection loop begins. In each iteration, the absolute value of the inner product between the residual and every candidate column is calculated, and the index of the column with the largest value is added to the selected set. This selection has a clear geometric meaning: the residual represents the direction not yet explained, and the largest inner product identifies the sample column that best explains that direction; adding it helps convergence along the true subspace direction. Next, using the design matrix formed by the selected columns, the coefficients are computed and the residual is updated with a least squares solver based on QR decomposition with column pivoting. QR decomposition combined with pivoting remains well conditioned even when columns are strongly correlated, which makes it suitable for the correlated clusters common in Copula spaces. The loop terminates when either of two conditions holds: the number of selected columns reaches 8, or the Euclidean norm of the residual decreases by less than 0.000001 between two consecutive rounds. The former controls sparsity, and the latter controls approximation accuracy. The coefficients corresponding to the selected columns are written back to the current column of the coefficient matrix, with the diagonal element kept at zero.
To improve throughput, inner product and updates can be performed in batches by column, with a batch size of 1024 columns, thus fully utilizing the vectorization and cache-friendly nature of matrix multiplication.
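As a non-limiting illustration, the column-wise orthogonal matching pursuit can be sketched with NumPy as follows. The function name `omp_self_representation` and the toy data are hypothetical. Two simplifications are assumptions of this sketch: `numpy.linalg.lstsq` stands in for the pivoted-QR least squares solver, and the 1024-column batching is omitted. The sketch also stops early when no remaining candidate has a positive inner product with the residual, which the text does not state.

```python
import numpy as np


def omp_self_representation(X, max_atoms=8, tol=1e-6):
    """X has shape (n_variables, n_samples) with unit-norm columns.

    Returns C of shape (n_samples, n_samples); column j holds the sparse
    coefficients that reconstruct sample j from other samples.
    """
    n = X.shape[1]
    C = np.zeros((n, n))
    for j in range(n):
        target = X[:, j]
        residual = target.copy()
        prev_norm = np.linalg.norm(residual)
        selected, coef = [], np.zeros(0)
        while len(selected) < max_atoms:
            corr = np.abs(X.T @ residual)
            corr[j] = -1.0                # exclude self-representation
            if selected:
                corr[selected] = -1.0     # never pick the same column twice
            k = int(np.argmax(corr))
            if corr[k] <= 0.0:
                break                     # nothing left that explains the residual
            selected.append(k)
            # Least squares on the selected columns (lstsq used in place of pivoted QR).
            coef, *_ = np.linalg.lstsq(X[:, selected], target, rcond=None)
            residual = target - X[:, selected] @ coef
            norm = np.linalg.norm(residual)
            if prev_norm - norm < tol:
                break                     # residual norm no longer decreasing
            prev_norm = norm
        if selected:
            C[selected, j] = coef
    return C


# Toy data: three samples in span(e1, e2) and three in span(e3, e4).
s = 1.0 / np.sqrt(2.0)
X_toy = np.array([[1, 0, s, 0, 0, 0],
                  [0, 1, s, 0, 0, 0],
                  [0, 0, 0, 1, 0, s],
                  [0, 0, 0, 0, 1, s]], dtype=float)
C_toy = omp_self_representation(X_toy)
print(np.round(C_toy, 3))
```

On this toy input the coefficient matrix is block diagonal: each sample is reconstructed only from samples in its own two-dimensional subspace.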

[0036] The sparse solution described above achieves the core characteristic of mutual representation of points within the same cluster. Geometrically, samples on the same low-dimensional dependency structure reside in the same subspace, and the target column can be reconstructed using a small number of basis samples from this subspace. For cross-cluster samples, due to significant directional differences, it is difficult to continuously improve the residual convergence during stepwise selection, thus the coefficients remain close to zero. This mechanism directly leads to a block diagonal adjacency structure, creating a clear segmentation signal for spectral clustering.

[0037] After the sparse coefficients of all columns have been assembled into a coefficient matrix, adjacency construction and spectral embedding begin. First, the element-wise absolute value of the coefficient matrix is taken and added to its transpose to obtain a symmetric adjacency matrix. Taking the absolute value puts the mutual representation strengths on a common scale, while adding the transpose introduces bidirectional connectivity and thereby amplifies the edge weights within the same cluster. Subsequently, the 20 smallest eigenvalues and the corresponding eigenvectors of the symmetric normalized Laplacian matrix are calculated, using a Lanczos-type iterative method and sparse matrix storage to reduce memory usage. To determine the number of clusters, the ratio of each pair of adjacent eigenvalues is calculated in ascending order, and the position with the largest ratio is selected as the number of clusters. A block diagonal adjacency produces a pronounced spectral gap at this position. For example, when the first five eigenvalues are 0.005, 0.007, 0.009, 0.011, and 0.210, the ratio between 0.011 and 0.210 is the largest in the list, so the number of clusters is determined to be 4. The direct benefit of this approach is that the number of clusters adapts to the data, avoiding both over-segmentation and the merging of heterogeneous structures.
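As a non-limiting illustration, the adjacency construction and the eigenvalue-ratio rule can be sketched with NumPy as follows. The function names are hypothetical. The sketch uses a dense eigenvalue solver (`numpy.linalg.eigvalsh`) instead of the Lanczos-type sparse iteration, and it floors each denominator at a small `eps` to avoid dividing by near-zero eigenvalues; both are assumptions of this sketch.

```python
import numpy as np


def symmetric_adjacency(C):
    """Element-wise absolute value of C added to its transpose."""
    A = np.abs(C)
    return A + A.T


def smallest_laplacian_eigenvalues(W, k=20):
    """Smallest k eigenvalues of the symmetric normalized Laplacian of W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard isolated nodes
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(L)[:k]  # eigvalsh returns ascending eigenvalues


def cluster_count_from_eigengap(eigenvalues, eps=1e-12):
    """Position of the largest ratio between adjacent ascending eigenvalues."""
    ratios = [eigenvalues[i + 1] / max(eigenvalues[i], eps)
              for i in range(len(eigenvalues) - 1)]
    return ratios.index(max(ratios)) + 1


# The example from the text: the largest ratio is 0.210 / 0.011.
print(cluster_count_from_eigengap([0.005, 0.007, 0.009, 0.011, 0.210]))  # 4
```

The returned value is the count of eigenvalues that precede the largest jump, which is why the example yields 4 rather than 5.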

[0038] After the number of clusters has been determined, the corresponding number of eigenvectors are row-normalized so that each sample has a comparable directional representation in the embedding space. Mean-based iterative clustering with k-means++ initialization is then performed in this embedding space. During initialization, the initial centers are selected with distance weighting to increase the chance of reaching the global optimum. During iteration, a maximum of 300 iterations and a center change threshold of 0.000001 serve as the stopping conditions. To improve robustness, 10 repeated runs with random starting points can be performed, and the run with the smallest total within-cluster sum of squared distances is selected as the final result. Because this clustering operates on the spectral embedding, it essentially assigns samples along the connectivity structure of the graph and therefore performs well even for non-convex clusters. The output is a low-dimensional dependency cluster label for each sample; in practice, the average adjacency strength and the sample size of each cluster are also recorded to facilitate parallel execution of the subsequent intra-cluster causal uplift estimation and dual calibration unit.

[0039] Under large-scale operation, several implementation points ensure throughput and consistency. First, column-batched orthogonal matching pursuit and block QR decomposition are employed, with a batch size of 1024 columns and 16 parallel threads. Memory usage remains within 16 GB with 100,000 samples, 50 variables, and a sparsity limit of 8. Second, thresholding is performed after the adjacency matrix has been generated: entries whose symmetric strength (the sum of the absolute coefficient and its transposed counterpart) is less than 0.002 are set to zero, which yields a clearer block structure and speeds up the eigendecomposition. Third, sparse matrix multiplication is used to accelerate the spectral decomposition stage, with a fixed number of 2 multiplications per iteration. Combined with a restart strategy, the 20 smallest eigenvalues and the corresponding eigenvectors converge within 100 iterations, and the processing time for typical data remains within 3 minutes. Finally, local triangle inequality pruning is used to accelerate distance evaluation during the mean-based iterative clustering stage, so that the 10 repetitions complete stably.

[0040] The key choices in the above process all revolve around sparse geometric representation. Orthogonal matching pursuit uses the residual direction as a guide to concentrate energy on a small number of representative neighbors; symmetric adjacency enhances the connectivity strength of the same cluster by merging bidirectional relationships; spectral embedding transforms the block diagonal structure into a low-dimensional Euclidean structure, making it easier for mean-based iterative clustering to define boundaries. This link, from self-representation to graph structure to Euclidean embedding, amplifies the consistency signal of the same cluster layer by layer, ultimately forming multiple low-dimensional dependent clusters.

[0041] In one alternative implementation, the self-representation solver is replaced by a strategy that alternates coordinate descent with soft-threshold shrinkage. Specifically, the coefficient vector of each column is updated coordinate by coordinate, and column normalization and diagonal zeroing are performed after each update. The maximum number of iterations is 200, and the convergence threshold is 0.00001. This approach is more likely to obtain a smooth sparse structure when the noise level is high or the number of samples is small. In another alternative implementation, the dictionary size is reduced by nearest neighbor candidate screening before the column-by-column solution: the cosine similarity between the current column and every other column is calculated, and the 500 columns with the highest similarity are retained as candidates before entering the orthogonal matching pursuit loop. This approach significantly reduces the solution time when the number of samples exceeds 200,000, while keeping cluster boundaries consistent. In another optional implementation, the determination of the number of clusters adds a silhouette coefficient verification to the spectral gap method: for each candidate number of clusters from 3 to 12, one spectral embedding and one mean-based iterative clustering are performed, and the average silhouette coefficient in the embedding space is calculated; when the results of the two methods differ by no more than 1, the larger one is taken; when the difference exceeds 1, the result of the spectral gap method is preferred, and the verification process is recorded for auditing.

[0042] For quality assurance, it is recommended to perform a consistency check after the low-dimensional dependency cluster labels have been output. For each cluster, the median and interquartile range of its mutual representation strengths are calculated. If a cluster's median is below 0.01 and its sample size is less than 20, it is considered a sparse cluster and is merged once with a neighboring cluster on the basis of average similarity. After merging, the embedding center of the merged cluster is recalculated and the mean-based iterative clustering assignment is repeated. This step can improve the stability of the subsequent intra-cluster causal uplift estimation and dual correction in extremely sparse scenarios.

[0043] In one embodiment, the intra-cluster causal uplift estimation and dual calibration unit uses the low-dimensional dependency clusters output by the Copula space sparse subspace clustering unit as the processing granularity. Within each low-dimensional dependency cluster, it independently completes data splitting, outcome regression model training, pseudo-effect construction and effect regression model training, intervention propensity score estimation and threshold selection strategy, dual calibration and monotonic calibration, index summarization and key covariate list generation, to obtain the individual causal uplift score corresponding to the sample and the cluster-level strength corresponding to the cluster, and provides stable and consistent input to downstream processes.

[0044] Within each low-dimensional dependency cluster, the intervention identifier column, the outcome column, and the covariate set are first determined. The intervention identifier column takes the values 0 and 1, representing the control and intervention states, respectively; the outcome column is a numerical result; and the covariate set consists of the remaining columns of the unit interval pseudo-observation table. To ensure that the estimation generalizes, a stratified cross-split is performed, randomly dividing the cluster sample into 5 folds of similar size and similar intervention ratio, numbered from 1 to 5. In each round, 4 folds are used as training data and the remaining fold is used as validation data, until all 5 folds have served as validation data, forming a cross-prediction trajectory that spans the entire cluster sample. The direct benefit of the stratified cross-split is that the intervention and control groups keep a similar distribution in every fold, so that subsequent model predictions are based on homogeneous data allocation and the individual-level estimates are more robust.
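As a non-limiting illustration, a stratified assignment of samples to 5 folds can be sketched as follows. The function name `stratified_folds`, the fixed seed, and the round-robin assignment within each arm are assumptions of this sketch; the text only requires folds of similar size and similar intervention ratio.

```python
import random


def stratified_folds(treatment, n_folds=5, seed=0):
    """Assign each sample a fold number 1..n_folds, balancing both arms."""
    rng = random.Random(seed)
    fold_of = [0] * len(treatment)
    for arm in (0, 1):
        # Shuffle the indices of one arm, then deal them out round-robin.
        idx = [i for i, t in enumerate(treatment) if t == arm]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds + 1
    return fold_of


# 40 intervention samples and 60 control samples (hypothetical cluster).
treatment_flags = [1] * 40 + [0] * 60
folds = stratified_folds(treatment_flags)
for f in range(1, 6):
    treated = sum(1 for t, k in zip(treatment_flags, folds) if k == f and t == 1)
    control = sum(1 for t, k in zip(treatment_flags, folds) if k == f and t == 0)
    print(f, treated, control)
```

In each round, the samples whose fold number equals the current fold form the validation data and all others form the training data.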

[0045] The outcome regression models are trained independently on the intervention subset and the control subset. Each subset uses a random forest regression algorithm with 500 trees, a maximum depth of 30, and a minimum of 5 samples per leaf node. The number of features randomly considered at each split is the square root of the number of covariates, and the out-of-bag error is used as a reference for early stopping. Training is conducted within the 5-fold framework: for each sample in the validation fold, the models trained on the remaining 4 folds predict the outcome, yielding cross-predictions for both the intervention and the control state. These cross-predictions are stored together with the actual results, forming two outcome prediction trajectories over the whole cluster sample. The two outcome regression models characterize how the outcome is generated under intervention and under control conditions, respectively. The cross-prediction design ensures that the prediction for each sample comes from a model trained on data that does not include that sample, which reduces the impact of overfitting.

[0046] The construction of pseudo-effects and the training of the effect regression models follow the X-learner process. For a sample in the intervention subset, the control outcome model's cross-prediction for that sample is subtracted from the actual result to obtain the sample's pseudo-effect. For a sample in the control subset, the actual result is subtracted from the intervention outcome model's cross-prediction for that sample to obtain the sample's pseudo-effect. The pseudo-effects formed in this way fill in each individual's outcome in the other state through model extrapolation, approximating the individual effect signal by the difference between the actual observation and the prediction from the other side. Subsequently, two random forest effect regression models are trained, one on the intervention subset and one on the control subset, using the covariates as inputs and the corresponding pseudo-effects as targets. The hyperparameter settings are the same as for the outcome regression models, and pseudo-effect cross-predictions for each validation fold are generated within the 5-fold framework. The two effect regression models learn local approximations of the individual effect from their respective observable distributions. The more reliable side is then selected according to the sample's intervention propensity score, which reduces model dependency bias.
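As a non-limiting illustration, the pseudo-effect construction can be sketched as follows, using the standard X-learner sign convention: actual minus predicted control outcome for intervention samples, and predicted intervention outcome minus actual for control samples. The function name `pseudo_effects` and the numbers are hypothetical; the cross-predictions `mu0_hat` and `mu1_hat` are assumed to have been produced already by the outcome regression models.

```python
def pseudo_effects(treatment, outcome, mu0_hat, mu1_hat):
    """X-learner pseudo-effects from cross-predicted outcomes.

    treatment: 0/1 flags; outcome: observed results;
    mu0_hat / mu1_hat: cross-predictions of the control / intervention
    outcome models for every sample.
    """
    effects = []
    for t, y, m0, m1 in zip(treatment, outcome, mu0_hat, mu1_hat):
        if t == 1:
            effects.append(y - m0)   # observed minus predicted control outcome
        else:
            effects.append(m1 - y)   # predicted intervention outcome minus observed
    return effects


# One intervention sample and one control sample (hypothetical numbers).
d = pseudo_effects(treatment=[1, 0],
                   outcome=[1.0, 0.4],
                   mu0_hat=[0.7, 0.5],
                   mu1_hat=[0.9, 0.6])
print(d)
```

With this convention a positive pseudo-effect always means that the intervention state is estimated to yield the higher outcome, regardless of which subset the sample belongs to.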

[0047] The intervention propensity score is fitted to the entire cluster using L2-regularized logistic regression. The input is a set of covariates, and the output is the probability value falling between 0 and 1. Before training, the covariates are standardized to zero mean and unit variance, with a regularization strength of 1.0, a maximum number of iterations of 1000, and a convergence threshold of 0.000001. Logistic regression characterizes the intervention selection mechanism in the covariate space with a linear boundary, exhibiting fast numerical convergence and stable probability output. Based on the intervention propensity score, a threshold selection strategy is implemented to determine the initial individual effect estimate for each sample: when the intervention propensity score is greater than or equal to 0.5, the cross-prediction of the intervention effect regression model on that sample is used; when the intervention propensity score is less than 0.5, the cross-prediction of the control effect regression model on that sample is used. This threshold selection strategy utilizes the degree of fit between the sample and the intervention or control distribution: samples closer to the intervention side are closer to the data distribution of the intervention subset, and the intervention effect regression model fits more adequately in that region; samples closer to the control side are closer to the data distribution of the control subset, and the control effect regression model is more representative in that region. Taking a specific numerical example, if the intervention propensity score of a certain sample is 0.73, the cross-prediction of the intervention effect regression model is 0.085, and the cross-prediction of the control effect regression model is 0.060, then the initial individual effect estimate is taken as 0.085.
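As a non-limiting illustration, the threshold selection strategy reduces to a single comparison per sample. The function name `initial_effect` is hypothetical, and the propensity score and the two effect cross-predictions are assumed to be available already.

```python
def initial_effect(propensity, tau_intervention, tau_control, threshold=0.5):
    """Pick the effect model whose training distribution the sample resembles."""
    # At or above the threshold: use the intervention effect model's prediction.
    # Below the threshold: use the control effect model's prediction.
    return tau_intervention if propensity >= threshold else tau_control


# The example from the text: propensity 0.73 selects 0.085.
print(initial_effect(0.73, 0.085, 0.060))  # 0.085
```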

[0048] The dual calibration is performed in two steps: residual calibration and monotonic calibration. The first step calculates the cross-prediction residuals of the outcome regression models for the intervention subset and for the control subset. A regression tree of depth 3 is trained on the intervention subset residuals to learn how the systematic bias of the residuals changes with the covariates; similarly, a regression tree of depth 3 is trained on the control subset residuals. For any sample, the output of the intervention residual regression tree minus the output of the control residual regression tree is added to the initial individual effect estimate as a correction term, giving the individual causal uplift score. The residual regression trees feed the systematic errors left unexplained by the outcome models back into the individual effect in an interpretable way; because the correction uses the difference between the two residual predictions, the estimate remains consistent even when both the intervention and the control predictions are biased. The second step trains an order-preserving (isotonic) regression calibrator on the validation fold. The specific approach is as follows. Using the individual causal uplift score as the independent variable, the samples are divided into several equal-frequency intervals, for example 10 intervals. For each sample within each interval, two indicators are calculated: the first is the sign of the difference between the intervention outcome model's cross-prediction and the control outcome model's cross-prediction for that sample; the second is the sign of the difference between the actual result and the reference result of the side (control or intervention) to which the sample belongs. The sample is labeled 1 when the two signs agree and 0 when they disagree.
The order-preserving regression calibrator is trained on the individual causal uplift scores and these labels. The calibrator outputs a monotonic function with values between 0 and 1 that maps the original individual causal uplift scores to calibrated values. This calibration treats observable consistency events as weak supervision signals, and the monotonic constraint ensures that the ranking of scores is preserved while their scale is corrected. For example, the median individual causal uplift score of a certain cluster was 0.042 before calibration and 0.038 after calibration, a slight reduction in scale with the ranking unchanged.
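As a non-limiting illustration, the two calibration steps can be sketched as follows. The function names are hypothetical. The residual tree outputs are passed in as plain numbers, the order-preserving fit uses the pool-adjacent-violators procedure, and the fitted function is evaluated as a step function; these are assumptions of this sketch, as is ignoring tied scores.

```python
import bisect


def residual_corrected(initial, resid_intervention, resid_control):
    """Step 1: add the difference of the two residual-tree outputs."""
    return initial + (resid_intervention - resid_control)


def isotonic_fit(scores, labels):
    """Step 2: non-decreasing fit of labels on scores (pool adjacent violators)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of labels, number of samples]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the last one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [scores[i] for i in order], fitted


def calibrate(score, xs, fitted):
    """Evaluate the fitted step function at a new score."""
    k = bisect.bisect_right(xs, score) - 1
    return fitted[max(k, 0)]


raw = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]   # uplift scores (hypothetical)
agree = [0, 1, 0, 1, 1, 1]                   # sign-consistency labels
xs_fit, y_fit = isotonic_fit(raw, agree)
print(y_fit, calibrate(0.025, xs_fit, y_fit))
```

Because the labels are 0 or 1 and each fitted value is a block mean, the calibrated values stay between 0 and 1 and never decrease as the score increases.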

[0049] After individual-level processing, cluster strength is obtained through aggregation. For each low-dimensional dependency cluster, the median of the individual causal improvement scores of all samples within that cluster is calculated and used as the cluster strength. The median remains stable in the presence of outliers, which makes it suitable for intervention prioritization. For example, in a low-dimensional dependency cluster of 6000 samples, if the 3000th sample after sorting by score has an individual causal improvement score of 0.057, the cluster strength is 0.057. This indicator is used directly for cluster ranking and resource allocation in the actionable population generation and joint analysis report output unit.
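As a small illustration of this aggregation, the following sketch computes a median-based cluster strength per cluster and ranks clusters by it. The cluster identifiers and scores are made up. Note that `statistics.median` averages the two middle values for an even-sized cluster, which may differ slightly from taking a single middle sample as in the numerical example.

```python
from statistics import median


def cluster_strengths(scores_by_cluster):
    """Map each cluster id to the median of its individual scores."""
    return {cid: median(scores) for cid, scores in scores_by_cluster.items()}


# Hypothetical per-cluster individual causal improvement scores.
scores_by_cluster = {
    "cluster_a": [0.031, 0.057, 0.084],
    "cluster_b": [0.010, 0.020, 0.030, 9.999],  # one extreme outlier
}
strengths = cluster_strengths(scores_by_cluster)

# Rank clusters from highest to lowest strength.
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

The outlier in `cluster_b` would dominate a mean but leaves the median close to the bulk of the scores, so the cluster ranking is not distorted by it.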

[0050] The list of key covariates is generated using permutation importance on the effect regression models. For each covariate, the baseline error of the effect regression model on the validation fold, such as the mean squared error, is first recorded. The values of that covariate are then randomly permuted in the validation data while the other covariates are kept unchanged, and the error is recalculated. The increase in error is the importance score of that covariate. The importance score is calculated separately for the intervention effect regression model and the control effect regression model, and the average of the two is taken as the final importance score. The 20 variables with the highest importance scores form the list of key covariates, and the variable names and their normalized percentage importance scores are output. For example, the normalized importance of the variable `device_type` is 21.3, of `region` is 17.8, and of `recent_activity` is 15.4, and so on through the 20th position.
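The permutation procedure can be sketched in pure Python as follows. This is a minimal illustration for a single model; the toy model, the data, and the helper names are assumptions, and the averaging over the intervention and control effect models described above would be a simple mean of two such score lists.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a prediction function on a validation set."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in MSE when one covariate column is shuffled, per covariate."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    baseline = mse(predict, rows, targets)
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)                   # permute only column j
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(mse(predict, permuted, targets) - baseline)
    return scores


def normalized_percent(scores):
    """Clip negative scores to zero and rescale so the scores sum to 100."""
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped)
    return [100.0 * s / total if total else 0.0 for s in clipped]


# Toy validation data: the target depends only on feature 0.
rows = [[float(i), float(i % 3)] for i in range(30)]
targets = [2.0 * r[0] for r in rows]
model = lambda r: 2.0 * r[0]                  # hypothetical fitted model

importance = permutation_importance(model, rows, targets, n_features=2)
percent = normalized_percent(importance)
```

Shuffling a covariate the model ignores leaves the error unchanged, so its importance is zero, while shuffling the covariate that drives the prediction raises the error.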

[0051] In large-scale data environments, parallelization and persistence strategies are employed to ensure throughput and repeatability. The 5-fold cross-prediction process is parallelized fold-wise within clusters, tree model training is parallelized tree-wise, and residual regression tree training and ordinal-preserving regression calibration are parallelized fold-wise. Intermediate products, including the cross-predictions of the two outcome regression models, the cross-predictions of the two effect regression models, the individual causal improvement scores, and the calibration results, are written to disk in columnar storage format. Field order is fixed, and numerical precision is fixed to 8 decimal places for easy review and auditing. For a workload containing 10 low-dimensional dependency clusters, a total sample size of 200,000, and 50 covariates, using 16 computation threads and sequential writes to a solid-state drive, the typical run time is 20 to 30 minutes.

[0052] In one alternative implementation, the outcome regression model and the effect regression model can employ a gradient boosting regression algorithm with 600 trees, a learning rate of 0.05, and a maximum depth of 5, while keeping the same 5-fold cross-validation and cross-prediction procedures. This approach fits more efficiently in scenarios with complex nonlinear boundaries but low noise levels. In another alternative implementation, the intervention propensity score can be estimated using an extremely randomized trees classification algorithm with 400 trees and a maximum depth of 20. The probability output is stabilized through equal-width binning, followed by the threshold selection strategy. This approach provides smoother propensity scores when there are nonlinear interactions between the covariates and the intervention. In yet another alternative implementation, the threshold selection strategy can be refined based on intra-cluster quantiles. For example, the threshold can be set to the median of the intra-cluster intervention propensity score distribution. When the intervention proportion is between 0.3 and 0.4, the median typically falls between 0.3 and 0.4, which better reflects the cluster's distribution structure.

[0053] In one embodiment, the actionable population generation and joint analysis report output unit sorts individuals within each low-dimensional dependency cluster based on their causal improvement scores. For each low-dimensional dependency cluster, it performs four processes: sorting, candidate set extraction, constraint filtering, and output merging. The joint analysis report is generated in a stable and reproducible order. The entire process revolves around prioritizing the conversion of samples with the highest improvement potential into reachable objects. Fixed thresholds and deterministic rules solidify the selection process into clear steps, facilitating reproducibility and auditing.

[0054] Within each low-dimensional dependency cluster, a sequence to be ranked is first constructed. This sequence includes a unique sample identifier, an individual causal improvement score, an intervention identifier column, an outcome column, and a set of covariates. The ranking uses a two-level key: the first level is descending order of individual causal improvement scores, and the second level is ascending order of unique sample identifiers. This two-level key brings two direct benefits. First, samples with higher individual causal improvement scores tend to generate greater expected gains after intervention, so prioritizing their inclusion in the candidate set improves resource utilization. Second, the lexicographical order of unique sample identifiers is stable, arranging samples with the same score in a fixed order and ensuring consistent ranking results across repeated runs. As a concrete numerical example, when a low-dimensional dependency cluster contains 10,000 samples, the ranking places the samples whose individual causal improvement scores are at or above the 95th percentile into the top 500 positions, creating a dense concentration of high-potential samples at the beginning of the sequence and facilitating subsequent truncation.

[0055] The candidate set is extracted using a fixed-ratio strategy. Within each low-dimensional dependency cluster, 20 percent of the total number of samples, rounded up, is taken from the top of the ranked sequence, with a minimum of one sample. For example, 20 samples are taken when the total number of samples is 100, 5 when it is 23, and 1 when it is 1. This fixed-ratio strategy provides consistent selection strength for low-dimensional dependency clusters of different sizes, ensuring fairness in subsequent cross-cluster merging. Rounding up ensures that small low-dimensional dependency clusters still produce candidate sets, avoiding empty sets under extreme conditions. After truncation, a candidate set is obtained for each low-dimensional dependency cluster, and the sorting position index is retained for stable concatenation during the merging phase.
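The two-level ranking and the fixed-ratio truncation can be sketched together as follows. The field names `sample_id` and `score` are hypothetical. Integer arithmetic is used for the rounding-up step so that floating-point error cannot push the count one too high.

```python
def candidate_count(n_samples, percent=20):
    """Ceiling of percent/100 of n_samples, with a minimum of one."""
    # -(-a // b) is integer ceiling division.
    return max(1, -(-n_samples * percent // 100))


def select_candidates(samples, percent=20):
    """Rank by score descending, then sample_id ascending, and keep the top share."""
    ranked = sorted(samples, key=lambda s: (-s["score"], s["sample_id"]))
    return ranked[:candidate_count(len(ranked), percent)]


# Hypothetical cluster with tied scores to show the stable tie-break.
cluster = [
    {"sample_id": "s03", "score": 0.05},
    {"sample_id": "s01", "score": 0.05},
    {"sample_id": "s02", "score": 0.09},
    {"sample_id": "s04", "score": 0.01},
    {"sample_id": "s05", "score": 0.02},
    {"sample_id": "s06", "score": 0.03},
]
candidates = select_candidates(cluster)   # 20 percent of 6, rounded up, is 2
```

Because the sort key is fully determined by the score and the identifier, repeated runs on the same data return the same candidate set in the same order.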

[0056] Execution constraint filtering is performed sample-by-sample at the candidate set level. Rules are derived from the execution constraints in the prompts and evaluated sequentially in a fixed order. The suggested order is as follows: First, check geographic and compliance constraints, such as whether the `region` column belongs to the allowed geographic set; regulated geographic codes are directly eliminated. Second, check communication channel reachability, such as whether email or SMS reachability indicators are in a sendable state. Third, check time window constraints, based on the service time period provided by the prompts and the current local time in the sample's time zone, to determine if it falls within the allowed sending interval, such as 09:00 to 18:00. Fourth, check the contact frequency constraint for the same object in this batch; for example, each object is only selected once per batch, the first matching channel is directly retained, and other channels are marked as delayed in this batch. Fifth, check the capacity limit; for example, if the prompts indicate a batch capacity of 50,000, when the qualified samples exceed the capacity limit, prioritize retaining samples that are ranked higher and come from lower-dimensional dependency clusters with higher cluster strength. The reason for adopting this order is that geographical and compliance constraints are mandatory, channel accessibility directly affects the possibility of reaching users, time windows determine immediate executability, frequency constraints ensure consistent user experience, and capacity limits form a hard boundary on the resource side. The order tightens layer by layer from external conditions to internal resources. To illustrate with a specific numerical example: a candidate set of a low-dimensional dependency cluster contains 2000 samples. 
The geographical compliance pass rate is 95%, the channel accessibility pass rate is 92%, the time window pass rate is 85%, and the retention rate after frequency constraints is 98%. Therefore, the number of samples retained before capacity pruning for this low-dimensional dependency cluster is approximately 2000 multiplied by 0.95 multiplied by 0.92 multiplied by 0.85 multiplied by 0.98, which equals approximately 1456. If there is still capacity available in this batch, all samples are retained; if the capacity is full, pruning is performed based on the sorting position and cluster strength.
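A minimal sketch of the fixed-order filtering follows. All field names (`region`, `sms_ok`, `local_hour`, `object_id`), the allowed region set, and the sample records are hypothetical, and the capacity step is left out because it operates across clusters. The last line evaluates the product of the stated pass rates directly, which comes to about 1456.

```python
def make_once_per_batch():
    """Frequency rule: keep only the first occurrence of each object in a batch."""
    seen = set()

    def passes(sample):
        if sample["object_id"] in seen:
            return False
        seen.add(sample["object_id"])
        return True

    return passes


def filter_candidates(candidates, rules):
    """Apply rules in order; a sample is dropped at the first rule it fails."""
    kept = []
    dropped = {name: 0 for name, _ in rules}
    for sample in candidates:
        for name, passes in rules:
            if not passes(sample):
                dropped[name] += 1
                break
        else:
            kept.append(sample)       # passed every rule
    return kept, dropped


ALLOWED_REGIONS = {"EU", "NA"}        # hypothetical allowed geographic set
rules = [
    ("region", lambda s: s["region"] in ALLOWED_REGIONS),
    ("channel", lambda s: s["sms_ok"]),
    ("time_window", lambda s: 9 <= s["local_hour"] < 18),
    ("frequency", make_once_per_batch()),
]
candidates = [
    {"object_id": "u1", "region": "EU", "sms_ok": True, "local_hour": 10},
    {"object_id": "u2", "region": "XX", "sms_ok": True, "local_hour": 10},
    {"object_id": "u3", "region": "EU", "sms_ok": False, "local_hour": 10},
    {"object_id": "u4", "region": "EU", "sms_ok": True, "local_hour": 22},
    {"object_id": "u1", "region": "EU", "sms_ok": True, "local_hour": 11},
]
kept, dropped = filter_candidates(candidates, rules)

# Expected retained count for the numerical example, before capacity pruning.
expected_retained = 2000 * 0.95 * 0.92 * 0.85 * 0.98
```

Recording the first failed rule per sample also yields the per-rule drop counts from which the pass rates in the report can be derived.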

[0057] The merged output concatenates all low-dimensional dependency clusters from the filtered samples using a unified three-key order. The first-level key is descending order of cluster strength, the second-level key is descending order of individual causal improvement scores, and the third-level key is ascending order of sample unique identifiers. This order prioritizes resource allocation to low-dimensional dependency clusters with higher overall improvement potential while preserving individual-level differentiation. The concatenated output generates an actionable population. To support batch execution, this order can be segmented into fixed batches, for example, 10,000 samples per batch, with batch numbers starting from 1 and incrementing. Segmentation uses stable boundaries, does not cross sorting sequences, and ensures no overlap between batches. For scenarios where prompts involve multiple channels, channels are allocated in a rotating manner within each batch, for example, by cyclically sampling according to the channel list order, forming a balanced channel distribution while maintaining the sorting position.
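The three-key merge, fixed-size batching, and channel rotation can be sketched as below. The field names and the rule that the rotation restarts at the beginning of each batch are assumptions made for this illustration.

```python
def merge_and_batch(samples, batch_size, channels):
    """Order filtered samples across clusters, then assign batch numbers and channels."""
    ordered = sorted(
        samples,
        key=lambda s: (-s["cluster_strength"], -s["score"], s["sample_id"]),
    )
    result = []
    for pos, sample in enumerate(ordered):
        result.append({
            **sample,
            "batch": pos // batch_size + 1,                       # batches start at 1
            "channel": channels[(pos % batch_size) % len(channels)],
        })
    return result


# Hypothetical filtered samples from two clusters.
filtered = [
    {"sample_id": "s2", "score": 0.10, "cluster_strength": 0.05},
    {"sample_id": "s1", "score": 0.10, "cluster_strength": 0.05},
    {"sample_id": "s3", "score": 0.01, "cluster_strength": 0.09},
    {"sample_id": "s4", "score": 0.20, "cluster_strength": 0.05},
]
actionable = merge_and_batch(filtered, batch_size=2, channels=["sms", "email"])
```

Each position in the merged order maps to exactly one batch, so batches cannot overlap, and the channel assignment never changes the sorting position.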

[0058] The joint analysis report takes statistical information on the actionable population and each low-dimensional dependency cluster as input, and outputs structured content covering both overview and cluster-by-cluster views. The overview section includes the following items: number of low-dimensional dependency clusters, total sample size, total candidate set size, number of actionable populations, overall pass rate, capacity reduction, average individual causal uplift score, 25th and 75th percentiles of individual causal uplift scores, and weighted median of cluster strength. The percentiles in these items help quickly assess the distribution pattern, while the capacity reduction helps resource providers assess remaining demand. The cluster-by-cluster section outputs the following items for each low-dimensional dependency cluster: population component identifier, cluster strength, sample size, candidate set size, filtered size, pass rate, unique identifiers and individual causal uplift scores of the top 10 ranked samples, structured dependency description summary, sparse geometric profile, overview indicators and intervals of CATE estimation results within the cluster, and causal uplift-driven intervention recommendations. The structured dependency description summary is derived from the dependency signature generated by cue word parsing and Copula space building blocks, as well as the Copula family selection results. The sparse geometric profile is derived from the adjacency strength and core sample set of the sparse subspace clustering units in the Copula space. The overview index of the intra-cluster CATE estimation results is derived from the output of the intra-cluster causal uplift estimation and dual-calibration unit. The causal uplift-driven intervention suggestions are generated based on the actual reachable channels for the actionable population and the intervention field names given by the cue words. 
For example, for a low-dimensional dependency cluster with a cluster strength of 0.057 and a filtered number of 1444, the report shows that the 75th percentile of the individual causal improvement score is 0.084 and the 25th percentile is 0.031, and provides an intervention suggestion that the sending channel is SMS.

[0059] Referring to Figure 3, a two-dimensional decision space based on the rank correlation score and the tail dependency difference is shown, which guides the automatic selection of Copula families. Figure 3 uses a Cartesian coordinate system, with the horizontal axis representing the rank correlation score τ, ranging from -1.0 to 1.0, and the vertical axis representing the tail dependency difference Δλ, ranging from -0.10 to 0.10. The coordinate system is divided into five regions, corresponding to the Frank, Gumbel, Clayton, t, and Gaussian families, respectively, each marked with a boundary of a different line type. The Frank family region, located near the origin, is marked with a dashed ellipse, centered at (0,0), with its major axis extending along the vertical axis. This region is labeled with "Frank family," "|τ|<0.15," and "weak tail dependency." The boundary conditions indicate that the Frank family is selected when the absolute value of the rank correlation score is less than 0.15 and the tail dependency difference is close to zero. This family is suitable for symmetric dependency structures with weak overall correlation and no special tail linkages. The Gumbel family region, located in the upper right corner of the coordinate system, is marked with a solid-line irregular quadrilateral, roughly covering the area where τ is between 0.5 and 1.0 and Δλ is above 0.05. This region is labeled with the text "Gumbel Family," "Δλ>0.05," and "Upper Tail Linkage." The boundary condition indicates that the Gumbel family is chosen when the tail dependency difference is greater than 0.05, meaning the upper tail dependency is significantly stronger than the lower tail dependency. This family is suitable for situations where variables show a common increase in high-value regions.
The Clayton family region, located in the lower right corner of the coordinate system, is marked with a solid-line irregular quadrilateral, roughly covering the area where τ is between 0.5 and 1.0 and Δλ is below -0.05. This region is labeled with the text "Clayton Family," "Δλ<-0.05," and "Lower Tail Linkage." The boundary condition indicates that the Clayton family is chosen when the tail dependency difference is less than -0.05, meaning the lower tail dependency is significantly stronger than the upper tail dependency. This family of variables is suitable for situations where variables exhibit a common decrease in the low-value region. The t-family region, located in the middle right of the coordinate system, is marked with a solid-line rounded rectangle, roughly covering the area where τ is between 0.5 and 0.8 and Δλ is between -0.02 and 0.02. This region is labeled with "t-family," "|Δλ|<0.02," and "two-tailed symmetry." The boundary conditions indicate that the t-family is chosen when the absolute value of the tail dependency difference is less than 0.02, meaning the upper and lower tail dependencies are roughly equivalent and both reach a certain strength. This family is suitable for situations with two-tailed symmetry and significant tail linkage. The Gaussian family region is marked with a dashed ellipse. The ellipse is relatively flat, with its major axis along the horizontal axis, and its center located at approximately τ = 0.65 and Δλ = 0, covering the outer perimeter of the t-family region. This region is labeled in the lower right corner of the ellipse, containing the text "Gaussian family," "|τ|>0.5," and "moderate tail dependency." Boundary conditions indicate that when the absolute value of the rank correlation score is greater than 0.5, indicating significant overall linear consistency and moderate tail dependence, the Gaussian family should be chosen. 
This family is suitable for cases with strong overall correlation but not extreme tail phenomena. Five solid dots are plotted in Figure 3, each located at a typical position within its respective region, representing the distribution of example data points in the decision space. These example dots demonstrate that samples with different dependency features fall into the corresponding Copula family selection regions. The label at the bottom of Figure 3, "Note: Δλ = λupper tail - λlower tail," clarifies the definition of the vertical axis variable, where λupper tail represents the upper tail co-occurrence rate and λlower tail represents the lower tail co-occurrence rate. Through the decision space partitioning shown in Figure 3, the system can automatically map each pair of variables to the corresponding Copula family selection region based on the rank correlation score and tail dependency difference calculated from the unit interval pseudo-observation table, thereby achieving automated identification and modeling of dependency structures. This decision rule discretizes the continuous dependency feature space into five typical dependency patterns, ensuring both deterministic selection and coverage of the main dependency forms found in practical applications.
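The decision space of Figure 3 can be approximated by a small rule function. The thresholds below come from the region labels, but the order in which overlapping regions are checked, and the choice to return `None` for points that no labeled region covers, are assumptions of this sketch rather than rules stated in the text.

```python
def select_copula_family(tau, delta_lambda):
    """Map (rank correlation score, tail dependency difference) to a Copula family.

    Assumed check order: tail asymmetry first, then weak correlation,
    then the t region, then the Gaussian region.
    """
    if delta_lambda > 0.05:
        return "Gumbel"       # upper tail linkage
    if delta_lambda < -0.05:
        return "Clayton"      # lower tail linkage
    if abs(tau) < 0.15:
        return "Frank"        # weak overall correlation, weak tail dependency
    if 0.5 <= tau <= 0.8 and abs(delta_lambda) < 0.02:
        return "t"            # two-tailed symmetry inside the t rectangle
    if abs(tau) > 0.5:
        return "Gaussian"     # strong correlation, moderate tail dependency
    return None               # no labeled region covers this point


# Hypothetical dependency signatures for five variable pairs.
examples = {
    "weak": select_copula_family(0.0, 0.0),
    "upper": select_copula_family(0.7, 0.08),
    "lower": select_copula_family(0.7, -0.08),
    "symmetric": select_copula_family(0.65, 0.0),
    "strong": select_copula_family(0.9, 0.03),
}
```

A fixed check order makes the mapping deterministic even where the drawn regions overlap, for example where the Gaussian ellipse surrounds the t rectangle.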

[0060] Intervention recommendations center on direct actions driven by causal uplift and are generated according to the following rules. If the cluster strength of a low-dimensional dependency cluster is in the top 30 percent of all low-dimensional dependency clusters, and the channel accessibility pass rate of this cluster is above 80%, it is recommended to execute the action indicated by the intervention field name in the prompt words with this cluster as the first priority. If the cluster strength is in the middle 40 percent, it is recommended to run a trial with a smaller batch, such as executing a batch of 5000 samples first and expanding the scope in the next round of updates. If the cluster strength is in the bottom 30 percent but the pass rate is high, it is recommended to execute it mixed with high-strength low-dimensional dependency clusters, distributing the proportions evenly within the batch in a round-robin manner to improve the overall pass rate. The motivation for this allocation strategy is to concentrate intervention resources where returns are higher, while using low-dimensional dependency clusters with good channel accessibility to increase the expected number of successful contacts.
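These tier rules can be sketched as a small function. The action labels, the field names, and the 0.80 cutoff used to decide that a bottom-tier pass rate is "high" are assumptions; cases the rules do not cover are returned as `"unspecified"` rather than guessed.

```python
def recommend(clusters, pass_threshold=0.80, high_pass_threshold=0.80):
    """Assign a recommendation to each cluster from its strength rank and pass rate."""
    ranked = sorted(clusters, key=lambda c: (-c["strength"], c["cluster_id"]))
    n = len(ranked)
    actions = {}
    for pos, c in enumerate(ranked, start=1):
        if pos * 100 <= 30 * n:                    # top 30 percent by strength
            ok = c["channel_pass_rate"] > pass_threshold
            action = "first_priority" if ok else "unspecified"
        elif pos * 100 <= 70 * n:                  # middle 40 percent
            action = "trial_batch"
        else:                                      # bottom 30 percent
            ok = c["channel_pass_rate"] > high_pass_threshold
            action = "mix_with_high_strength" if ok else "unspecified"
        actions[c["cluster_id"]] = action
    return actions


# Ten hypothetical clusters with strictly decreasing strength.
clusters = [
    {"cluster_id": f"c{i:02d}", "strength": (11 - i) / 100, "channel_pass_rate": 0.9}
    for i in range(1, 11)
]
clusters[1]["channel_pass_rate"] = 0.5             # c02: strong but hard to reach
actions = recommend(clusters)
```

Integer comparisons on the rank position avoid floating-point edge cases at the 30 and 70 percent boundaries.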

[0061] To ensure reproducibility, all decision points employ deterministic rules and stable ranking. Ranking and truncation are based on individual causal uplift scores and unique sample identifiers; filtering is based on direct evaluation of the execution constraint fields in the prompts; and merging is based on cluster strength and ranking position. All statistical indicators in the report are calculated using the same logic in each run, with numerical precision fixed to four decimal places. The data is exported in columnar storage format, with the following field order: population component identifier, cluster strength, sample size, candidate set size, filtered size, pass rate, upper percentile, lower percentile, intervention recommendation, structured dependency description summary, sparse geometric profile, and key covariate list. The key covariate list directly references the top 20 variables generated by permutation importance in the intra-cluster causal uplift estimation and dual correction unit, retaining the variable name and normalized importance percentage. For example, in a report's key covariate list, device_type has a normalized importance of 21.3, region 17.8, and recent_activity 15.4, listed sequentially up to the 20th position.

[0062] When implementing at scale, a batch output and backfilling mechanism is recommended. After the actionable population is generated, each sample is marked with a batch number and a channel. After the current batch is exported, the status columns in the system, such as the "sent" flag and the response window start time column, are backfilled for the evaluation of time windows and frequency constraints in subsequent rounds. For example, with a total capacity of 50,000 and 120,000 qualified samples, 5 batches of 10,000 samples each are generated. After backfilling, the next round selects only from unreached samples to avoid duplicate contact.

[0063] In one alternative implementation, the candidate set cutoff ratio can be adaptively adjusted upwards or downwards from the 20 percent baseline. Specifically, the sequence of adjacent differences of the sorted individual causal improvement score curve within each low-dimensional dependency cluster is calculated. When the adjacent differences show a significant downward step in the early region, for example from 0.020 to 0.008 and then to 0.004, it indicates that high-potential samples are concentrated in the early part of the sequence. In this case, the cutoff ratio is increased from 20 percent to 30 percent to expand the coverage. When the adjacent differences remain approximately flat in the early region, for example from 0.010 to 0.009 and then to 0.009, it indicates that the high-potential samples are relatively evenly distributed, and the 20 percent ratio is maintained. This method improves the capture rate in low-dimensional dependency clusters where high-potential samples are highly concentrated, and maintains resource efficiency in low-dimensional dependency clusters with a more even distribution. In another alternative implementation, the capacity limit can be split into quotas according to the cluster strength ratio during the merging stage, and each low-dimensional dependency cluster is then pruned to its quota in ranking order. For example, with a capacity of 50,000 and 10 low-dimensional dependency clusters whose cluster strength ratios from high to low are approximately 5, 4, 4, 3, 3, 3, 2, 2, 2, 2 (summing to 30), the quotas would be approximately 8,333, 6,667, 6,667, 5,000, 5,000, 5,000, 3,333, 3,333, 3,333, and 3,333. This approach improves the transparency of cross-cluster resource allocation and is suitable for scenarios where multiple teams are executing in parallel. In another optional implementation, the time window used in execution constraint filtering can be determined using region-specific working hours in the local time zone, for example 09:00 to 18:00 for North America, 08:00 to 17:00 for Europe, and 10:00 to 19:00 for the Asia-Pacific region, thereby increasing the response probability during the first contact period.
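The quota alternative can be sketched as a proportional split with largest-remainder rounding, so that the quotas add up exactly to the capacity. The function name and the rounding rule are assumptions, and the weights are taken as integers. With a capacity of 50,000 and ratios summing to 30, a strictly proportional share for the strongest cluster is about 8,333.

```python
def allocate_quotas(capacity, weights):
    """Split capacity in proportion to integer weights, summing exactly to capacity."""
    total = sum(weights)
    quotas = [capacity * w // total for w in weights]        # floor of each share
    leftover = capacity - sum(quotas)
    # Hand out the leftover units to the largest fractional remainders,
    # breaking ties by the original (strength-ordered) position.
    order = sorted(
        range(len(weights)),
        key=lambda i: (-(capacity * weights[i] % total), i),
    )
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


# Cluster strength ratios from the example, ordered from high to low.
ratios = [5, 4, 4, 3, 3, 3, 2, 2, 2, 2]
quotas = allocate_quotas(50_000, ratios)
```

Within each cluster, the quota would then be filled in ranking order, so the strongest samples in every cluster are kept first.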

[0064] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A cue-driven joint analysis system for population composition and behavior, characterized in that the system includes a computing device and a storage medium storing executable program instructions, the instructions causing the computing device to operate a set of units, including: a prompt word parsing and Copula space construction unit, configured to: receive prompt words containing target behavior field names, intervention field names, variable range descriptions, and output field names; locate columns in the input data table based on the prompt words; and map each column uniformly to a unit interval pseudo-observation table to eliminate marginal distribution differences and preserve dependency structure; a Copula space sparse subspace clustering unit, configured to: construct a clustering matrix based on the unit interval pseudo-observation table output by the prompt word parsing and Copula space construction unit, and, based on the matrix, obtain multiple low-dimensional dependency clusters through self-expression solving and spectral clustering; an intra-cluster causal uplift estimation and dual correction unit, configured to: execute independently for each low-dimensional dependency cluster obtained by the Copula space sparse subspace clustering unit, and obtain the individual causal uplift score corresponding to each sample and the cluster-level strength corresponding to each cluster; and an actionable population generation and joint analysis report output unit, configured to: sort the individual causal uplift scores obtained by the intra-cluster causal uplift estimation and dual correction unit from high to low within each cluster; extract samples to form a candidate set; filter and merge the samples according to the execution constraints in the prompt words to obtain the actionable population; and output the joint analysis report.

2. The system according to claim 1, characterized in that the prompt word parsing and Copula space construction unit is further configured to perform a unit interval mapping process when processing continuous variables, the process including: counting the total number of samples of the continuous variable; generating an index sequence and sorting it from smallest to largest according to the values of the continuous variable; when identical values are encountered, recording the first and last positions of the identical values in the sorted sequence, and setting the rank of all samples with that value to the average of the recorded first and last positions; and dividing the rank of each sample by the total number of samples plus one to obtain the unit interval value.
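The continuous-variable mapping can be sketched as follows. The function name is hypothetical, and the divisor is read as the total number of samples plus one, which keeps every value strictly between 0 and 1 and matches the value of 1 divided by the total plus one that claim 4 assigns to missing items.

```python
def unit_interval_ranks(values):
    """Map values to (0, 1) via average ranks divided by (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # indices sorted by value
    u = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based first and last positions of the tied run.
        avg_rank = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            u[order[k]] = avg_rank / (n + 1)
        start = end + 1
    return u


# Hypothetical column with one tie.
pseudo_obs = unit_interval_ranks([30, 20, 10, 20])
```

Tied samples receive the same unit interval value, so the mapping does not depend on the order in which equal values happen to appear in the input.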

3. The system according to claim 1, characterized in that the prompt word parsing and Copula space construction unit is further configured to perform a unit interval mapping process when processing categorical variables, the process including: extracting all deduplicated category labels of the categorical variable; performing a normalization step on each deduplicated category label, which involves converting the label to Unicode standard form and then to a UTF-8 byte sequence, and then sorting by bytes from smallest to largest to obtain a definite category order; assigning a sequence number starting from 1 to each category in that order; replacing each sample category with the corresponding sequence number; counting, within the categorical variable, the number of samples with sequence numbers less than that of the currently processed sample and the number of samples with sequence numbers equal to it; adding half of the equal count to the less-than count ...

4. The system according to claim 1, characterized in that, The prompt word parsing and Copula space construction unit are further configured such that, when handling missing items, the execution process includes: when sorting the original variables, the missing item is fixed at the beginning of the sorting sequence and assigned a unit interval value of 1 divided by the total number of samples of the corresponding variable plus one; at the same time, the missing indicator column corresponding to the original variable is filled with 1 for missing samples and 0 for non-missing samples; and the categorical variable process is executed on the missing indicator column to obtain the unit interval value of the missing indicator column.

5. The system according to claim 1, characterized in that, The prompt word parsing and Copula space construction unit are also configured to execute a dependency description generation and Copula family selection process. This process includes: for any two unit interval variables, enumerating all sample pairs without missing values; if the difference direction of the current sample pair in the two columns is consistent, it is considered coordinated; if the direction is opposite, it is considered repulsive; sample pairs with the same value are skipped; the difference between the number of coordinated and repulsive pairs is calculated and then divided by the number of sample pairs involved in the calculation to obtain the rank correlation score; at the same time, the frequency of co-occurrence in intervals above 0.95 and the frequency of co-occurrence in intervals below 0.05 are counted and recorded as the upper tail co-occurrence rate and the lower tail co-occurrence rate, respectively, to generate a dependency signature for the variable pair; based on the generated dependency signature, using a preset rule mapping, a Copula family matching the dependency signature is selected from the Copula family set including Frank family, Gumbel family, Clayton family, t family, and Gaussian family; the selection result, along with the rank correlation score, upper tail co-occurrence rate, and lower tail co-occurrence rate, are recorded as a structured dependency description.
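The dependency signature can be sketched with a direct pairwise count. The function name and dictionary keys are hypothetical, and the tail rates are taken relative to the number of samples without missing values, which is an assumption because the denominator is not stated. The pairwise loop is quadratic and is only intended for illustration.

```python
def dependency_signature(u, v, upper=0.95, lower=0.05):
    """Rank correlation score and tail co-occurrence rates for two unit interval columns."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    concordant = discordant = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            prod = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            if prod > 0:
                concordant += 1          # same direction in both columns
            elif prod < 0:
                discordant += 1          # opposite directions
            # prod == 0 means a tie in at least one column and is skipped
    used = concordant + discordant
    tau = (concordant - discordant) / used if used else 0.0
    n = len(pairs)
    upper_rate = sum(a > upper and b > upper for a, b in pairs) / n if n else 0.0
    lower_rate = sum(a < lower and b < lower for a, b in pairs) / n if n else 0.0
    return {
        "tau": tau,
        "upper_tail": upper_rate,
        "lower_tail": lower_rate,
        "delta_lambda": upper_rate - lower_rate,
    }


# Hypothetical pseudo-observations for one variable pair.
sig = dependency_signature([0.02, 0.5, 0.97, 0.99], [0.01, 0.4, 0.96, 0.98])
tied = dependency_signature([0.1, 0.1, 0.3], [0.2, 0.5, 0.1])
```

The resulting `tau` and `delta_lambda` are the two coordinates that a rule mapping would use to pick a Copula family.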

6. The system according to claim 1, characterized in that the Copula space sparse subspace clustering unit is configured to perform the following self-expression coefficient estimation process when executing the self-expression solving: constructing a data matrix such that each sample is a column vector and each variable is a row; and performing orthogonal matching pursuit on each sample column vector to obtain sparse reconstruction coefficients, specifically including: a) initializing the coefficient vector of the current column to all zeros and fixing the diagonal element to zero; b) calculating the current residual, which is the difference between the target column vector and a weighted linear combination of other column vectors; c) among the unselected columns, finding the column index with the largest absolute value of the inner product with the current residual and adding it to the selected set; d) using the matrix containing the selected column vectors as the design matrix, calling the QR decomposition least squares solver with column pivoting, obtaining the coefficient values that minimize the sum of squares of the current residual, and updating the current residual; e) repeating steps c) to d) until the number of selected columns reaches 8 or the Euclidean norm of the current residual decreases by less than 1e-6; f) writing the coefficient values corresponding to the selected columns back to the coefficient vector of the current column, keeping the diagonal element zero.

7. The system according to claim 1 or 6, characterized in that the Copula space sparse subspace clustering unit is configured to perform the following process when performing spectral clustering: taking the absolute value of the coefficient matrix composed of the sparse reconstruction coefficients obtained from the self-expression solving and adding it to its transpose to obtain a symmetric adjacency matrix; calculating the 20 smallest eigenvalues and the corresponding eigenvectors of the symmetric normalized Laplacian matrix of the symmetric adjacency matrix; calculating the ratio of adjacent eigenvalues in ascending order of eigenvalues, finding the breakpoint with the largest ratio, and using its index as the number of clusters; performing k-means clustering with k-means++ initialization on the eigenvectors corresponding to the determined number of clusters, with the maximum number of iterations set to 300 and the center change threshold set to 1e-6; and outputting the low-dimensional dependency cluster label for each sample.
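The first two steps of this process, building the symmetric adjacency matrix and its symmetric normalized Laplacian, can be sketched in pure Python. The function names are hypothetical, and the eigen-decomposition, cluster count selection, and k-means steps are omitted here because they would normally rely on a numerical library.

```python
import math


def symmetric_adjacency(coef):
    """W = |C| + |C|^T for a square coefficient matrix C with zero diagonal."""
    n = len(coef)
    return [[abs(coef[i][j]) + abs(coef[j][i]) for j in range(n)] for i in range(n)]


def normalized_laplacian(adj):
    """L = I - D^(-1/2) W D^(-1/2), where D is the diagonal degree matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    # Isolated nodes (zero degree) keep a diagonal of 1 here, a convention
    # chosen for this sketch.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 2x2 sparse reconstruction coefficients with zero diagonal.
coef = [[0.0, 0.5],
        [-0.2, 0.0]]
adj = symmetric_adjacency(coef)
lap = normalized_laplacian(adj)
```

Taking absolute values before symmetrizing means that negative reconstruction coefficients still count as evidence that two samples lie in the same subspace.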

8. The system according to claim 1, characterized in that the intra-cluster causal uplift estimation and dual calibration unit is configured such that, during CATE estimation, the process includes: randomly splitting the samples of each low-dimensional dependency cluster into 5 folds, and specifying the intervention identifier column and the outcome column, with the remaining columns as covariates; training a random forest regression model on the intervention subset to predict outcomes under the intervention condition, and training a random forest regression model on the control subset to predict outcomes under the control condition, wherein each random forest regression model has 500 trees, a maximum depth of 30, and a minimum of 5 samples per leaf node, considers at each split a random subset of features whose size is the square root of the number of features, and uses the out-of-bag error for early stopping; alternating training and validation across the 5 folds, and recording the predictions on each validation fold as cross-predictions; within the intervention subset, obtaining the pseudo-effect by subtracting the cross-prediction of the control outcome model from the actual outcome, and training a random forest effect regression model with the covariates as input; within the control subset, obtaining the pseudo-effect by subtracting the actual outcome from the cross-prediction of the intervention outcome model, and training another random forest effect regression model; fitting the intervention propensity score on the entire cluster sample using logistic regression with L2 regularization; and adopting a threshold selection strategy to determine the CATE estimate of each sample as follows: when the intervention propensity score is greater than or equal to 0.5, the prediction of the intervention effect regression model on that sample is taken; when the intervention propensity score is less than 0.5, the prediction of the control effect regression model on that sample is taken.
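The estimation flow in this claim resembles an X-learner with cross-fitted outcome models. A minimal sketch with scikit-learn follows, under stated assumptions: the names `make_forest` and `cluster_cate` are hypothetical, the out-of-bag early stopping is omitted because scikit-learn's random forest does not provide it, and `LogisticRegression` is used with its default L2 penalty.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def make_forest(n_trees, seed):
    # Hyperparameters follow the claim; OOB early stopping is not reproduced.
    return RandomForestRegressor(n_estimators=n_trees, max_depth=30,
                                 min_samples_leaf=5, max_features="sqrt",
                                 random_state=seed)

def cluster_cate(X, t, y, n_trees=500, n_folds=5, seed=0):
    """Return (cate, propensity) for the samples of one cluster."""
    X, t, y = np.asarray(X, float), np.asarray(t, int), np.asarray(y, float)
    mu0_cv, mu1_cv = np.zeros(len(y)), np.zeros(len(y))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, valid in folds.split(X):
        tr1, tr0 = train[t[train] == 1], train[t[train] == 0]
        # Outcome models are fit on training folds, predicted on the validation fold.
        mu1_cv[valid] = make_forest(n_trees, seed).fit(X[tr1], y[tr1]).predict(X[valid])
        mu0_cv[valid] = make_forest(n_trees, seed).fit(X[tr0], y[tr0]).predict(X[valid])
    treated = t == 1
    d1 = y[treated] - mu0_cv[treated]        # pseudo-effect in the intervention subset
    d0 = mu1_cv[~treated] - y[~treated]      # pseudo-effect in the control subset
    tau1 = make_forest(n_trees, seed).fit(X[treated], d1)
    tau0 = make_forest(n_trees, seed).fit(X[~treated], d0)
    # Propensity score from logistic regression (L2 penalty is the default).
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    # Threshold selection: e >= 0.5 uses the intervention effect model.
    cate = np.where(e >= 0.5, tau1.predict(X), tau0.predict(X))
    return cate, e
```

The hard 0.5 threshold differs from the usual X-learner, which blends the two effect models with the propensity score as a weight; the sketch follows the threshold rule as claimed.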

9. The system according to claim 1 or 8, characterized in that the intra-cluster causal uplift estimation and dual calibration unit is further configured such that, when performing dual calibration, the process includes: first, calculating the cross-prediction residuals for the intervention subset and the control subset respectively, and training a regression tree of depth 3 on the covariates to obtain a residual correction function for each group; then, using the difference between the outputs of the residual correction functions of the corresponding groups on the sample as a correction term, and adding it to the determined CATE estimate to obtain the individual causal uplift score; subsequently, in the validation layer, using the individual causal uplift score of each sample as the independent variable, and using as the dependent variable the sign consistency between the difference in predictions of the intervention outcome model and the control outcome model and the difference in actual outcomes, training an independent isotonic regression calibrator for each cluster, and performing monotonic calibration on the individual causal uplift score.
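One possible reading of the two calibration stages is sketched below with scikit-learn. Several points are assumptions rather than claim text: the function name `dual_calibrate` is hypothetical, the correction term is read as the intervention-group tree output minus the control-group tree output, and the sign-consistency labels `agree` (1 for consistent, 0 otherwise) are taken as a precomputed input because the claim does not fix how they are paired across groups.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.tree import DecisionTreeRegressor

def dual_calibrate(X, t, y, mu0_cv, mu1_cv, cate, agree, seed=0):
    """Return (uplift_score, calibrated_score) for the samples of one cluster."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    mu0_cv, mu1_cv = np.asarray(mu0_cv, float), np.asarray(mu1_cv, float)
    treated = np.asarray(t) == 1
    # Stage 1: depth-3 trees fit to the cross-prediction residuals of each group.
    r1 = DecisionTreeRegressor(max_depth=3, random_state=seed)
    r1.fit(X[treated], (y - mu1_cv)[treated])
    r0 = DecisionTreeRegressor(max_depth=3, random_state=seed)
    r0.fit(X[~treated], (y - mu0_cv)[~treated])
    # Assumed reading of the correction term: difference of the two tree outputs.
    score = np.asarray(cate, float) + (r1.predict(X) - r0.predict(X))
    # Stage 2: per-cluster isotonic (order-preserving) calibration of the score.
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(score, np.asarray(agree, float))
    return score, iso.predict(score)
```

Because isotonic regression is monotone non-decreasing, the calibrated values preserve the ranking of the uplift scores up to ties, which is what the later top-percentile selection depends on.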

10. The system according to claim 1, characterized in that the actionable population generation and joint analysis report output unit is configured to execute the following processes: sorting the individuals within each cluster by causal uplift score from highest to lowest, and extracting the top 20 percent of samples within each cluster to form a candidate set; filtering out unreachable samples based on the execution constraints in the prompt, and then merging the remaining samples to obtain the actionable population; wherein the joint analysis report output includes: population component identifiers, a structured dependency description summary, a sparse geometric profile, intra-cluster CATE estimation results, and causal uplift-driven intervention suggestions; furthermore, the intra-cluster causal uplift estimation and dual calibration unit is configured to: use the median of the causal uplift scores of the individuals within each cluster as the cluster-level strength; and rank the variables by permutation importance in the effect regression model, take the top 20 variables, and construct and output a list of key covariates.
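The selection and cluster-level strength steps of this claim can be sketched in plain Python. The function name `actionable_population` and the record fields `id`, `cluster`, `uplift`, and `reachable` are hypothetical; in particular, `reachable` stands in for whatever execution constraints the prompt supplies, and rounding the top-20-percent count up is an assumption.

```python
import math
from statistics import median

def actionable_population(samples, top_fraction=0.20):
    """samples: list of dicts with keys id, cluster, uplift, reachable.

    Returns (selected_ids, cluster_strength).
    """
    by_cluster = {}
    for s in samples:
        by_cluster.setdefault(s["cluster"], []).append(s)

    selected, strength = [], {}
    for cluster, members in by_cluster.items():
        # Sort by causal uplift score, highest first.
        ranked = sorted(members, key=lambda s: s["uplift"], reverse=True)
        # Assumption: round the top-fraction count up, with at least one candidate.
        n_top = max(1, math.ceil(top_fraction * len(ranked) - 1e-9))
        candidates = ranked[:n_top]
        # Drop candidates that the execution constraints mark as unreachable.
        selected.extend(s["id"] for s in candidates if s["reachable"])
        # Cluster-level strength is the median uplift score in the cluster.
        strength[cluster] = median(s["uplift"] for s in members)
    return selected, strength
```

Filtering happens after the top-percentile cut, as in the claim, so a cluster whose top candidates are all unreachable contributes no one to the merged population.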