A survivorship bias resistant scene graph generation method based on causal inference feature decoupling
By using a causal inference feature decoupling method, the intrinsic feature branch and the false feature branch respectively handle visual evidence and statistical bias, solving the survivor bias problem in scene graph generation models and improving the accuracy and robustness of generated details.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QINGDAO UNIV OF SCI & TECH
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing scene graph generation models are prone to 'survivorship bias', overfitting high-frequency statistical co-occurrences while ignoring low-frequency but realistic visual details, resulting in inaccurate generation results.
We adopt a two-stage cognitive heuristic diffusion generation framework based on causal inference feature decoupling. Visual evidence and statistical bias are handled by intrinsic feature branches and false feature branches respectively. The accuracy of generated details is improved by adversarial correction loss and cross-modal regularization.
It significantly improves the model's inference robustness and the accuracy of generating details in long-tailed complex scenarios, cuts off the reliance on statistical shortcuts, and enhances the ability to understand visual details.
Figure CN122244192A_ABST
Abstract
Description
Technical Field
[0001] This application relates to a survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling, and belongs to the field of computer vision.
Background Technology
[0002] Scene Graph Generation (SGG) aims to transform unstructured visual data into structured semantic graphs, that is, to construct graph structures by identifying object instances and their paired interactions. As a bridge connecting low-level visual perception and high-level semantic cognition, SGG provides crucial structured knowledge support for downstream tasks such as industrial intelligence, smart security, and visual question answering. In these critical applications, the quality of the system's reasoning about entity relationships in complex scenes directly determines the security and reliability of the decision-making system.
[0003] However, existing multi-source information fusion mechanisms are often flawed, making models prone to the "survivorship bias" trap. Because the training data follow a severe long-tail distribution, existing models tend to overfit high-frequency statistical co-occurrences (the "survivors"), producing spurious correlations while ignoring the more essential visual semantics (the "crashers", i.e., long-tail samples ignored by the model). For example, when predicting a triplet such as <man, riding, bike>, models often rely on statistical priors to directly predict high-frequency predicates such as "on" or "riding," while ignoring low-frequency but real visual evidence such as "near" or "walking with" that may be present in the image. Current solutions mainly focus on rebalancing or reweighting strategies, or on simple feature fusion, and fail to explicitly decouple visual features from statistical bias at the root.
[0004] The dual-process theory in cognitive psychology posits that human cognition consists of two systems: System 1, responsible for rapid, intuitive, and heuristic judgments, and System 2, responsible for slow, analytical, and logical reasoning. This theory has resonated widely in the field of deep learning, helping it evolve from simple pattern matching (System 1) toward higher-level cognition with causal reasoning capabilities (System 2), thus improving the generalization ability and robustness of models. In the field of scene graph generation, existing models often fall into the cognitive trap of "System 1." Because of the long-tailed distribution of training data, models tend to use availability heuristics, i.e., they prioritize the most easily extracted statistical regularities (such as <man, on, bike>) when making predictions. This directly leads to the classic "survivorship bias" phenomenon. Like the statisticians in World War II who only counted returning aircraft, the model overemphasizes the head relationships that "survive" in high-frequency statistics, while systematically ignoring the rare visual details that "crash" in the long-tail distribution.
[0005] Summary of the Application
[0006] To address the problems existing in the prior art, this application proposes a scene graph generation method based on causal inference feature decoupling to resist survivor bias. It has a two-stage cognitive heuristic diffusion generation framework, which improves the dual visual conflict through the two-stage process, enabling the model to fully learn more comprehensive text semantics. It can further improve the accuracy of generated details in the process of collaborative optimization, and guide the model to understand the differences in semantics and details in the text-generated image from the perspective of cognitive reasoning and human thinking.
[0007] To address the aforementioned technical problems, the technical solution adopted in this application is a survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling, comprising the following steps:
[0008] 1) Acquire the image to be processed, and use an object detector to extract the objects in the image and their initial visual features;
[0009] 2) Construct an intrinsic feature branch to simulate the direct causal path, and construct a spurious feature branch to simulate the backdoor path;
[0010] 3) Generate fact predictions and counterfactual predictions based on complete feature inputs and biased inputs respectively, and maximize the distribution difference between fact predictions and counterfactual predictions through adversarial correction loss;
[0011] 4) Perform cross-modal regularization, integrate visual causal representation with linguistic contextual representation, and use linguistic common sense to constrain the semantic consistency of decoupled features;
[0012] 5) Train the model based on the comprehensive loss function, and use the trained model to generate scene graphs.
[0013] Preferably, in the above survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling, in step 1), during the input initialization phase, the SGG feature extraction scheme is followed to obtain the detection box $b_i$ of every object in the image and its initial visual feature $o_i$;
[0014] The SGG feature extraction scheme includes the use of a pre-trained object detector with a ResNeXt-101-FPN backbone.
[0015] Preferably, in the above method, in step 2), the initial information extracted from the original image is combined into a joint representation to obtain the interaction information between pairs of entities; the extracted joint features and the initial features are fed together into the intrinsic feature branch and the spurious feature branch to extract visual cues and model statistical bias.
[0016] Preferably, in step 2), when constructing the intrinsic feature branch, a correlation discriminator and a feature amplification mechanism are used to increase the weight of weak visual cues.
[0017] Correlation discrimination is applied to the initial visual features to generate a soft gating mechanism that separates strongly biased features from weak visual cues, and the weak visual cues are amplified to obtain the intrinsic feature representation;
[0018] When constructing the spurious feature branch, the joint confusion factor (confounder) is modeled within a structured sensitive subspace to capture multi-granularity statistical patterns;
[0019] Based on the subject and object category embedding vectors and the coarse-grained predicate logits, a joint confusion factor is constructed and mapped to the sensitive subspace to obtain the spurious feature representation.
[0020] Preferably, in the above method, in step 1), the object extractor is built on the Faster R-CNN backbone network and pre-trained on standard datasets such as Visual Genome; all training datasets are image datasets.
[0021] The model is trained on the Visual Genome dataset to learn intrinsic and spurious features, enabling it to fully learn the entity information and interaction regions associated with each predicate in the dataset. This allows the model to distinguish the different information expressed by different predicates in different scenarios, especially the visual details of low-frequency long-tail samples.
[0022] Preferably, in the above method, in step 2), a correlation discriminator maps the initial visual feature $o_i$ to a dimension-wise bias correlation score $s_i$;
[0023] Using the score $s_i$, the feature is divided into a bias-dominated component $o_i^{b}$ and a cue-dominated component $o_i^{c}$, where
[0024] $o_i^{b} = s_i \odot o_i$
[0025] $o_i^{c} = (1 - s_i) \odot o_i$
[0026] $\odot$ denotes element-wise multiplication;
[0027] $o_i^{b}$ captures high-frequency bias features, and $o_i^{c}$ identifies low-frequency but valuable visual cues;
[0028] A scaling factor $\alpha$ is introduced to amplify the cue-dominated component, giving $\tilde{o}_i^{c}$,
[0029] $\tilde{o}_i^{c} = \alpha \, o_i^{c}$
[0030] The amplified features are then fused with the bias-dominated component to generate the final intrinsic feature $f_{int}$.
[0031] Preferably, in the above method, in step 2), a pre-trained classic SGG model is used, and its predicate logits are extracted as the coarse-grained predicate representation $p_c$;
[0032] The subject category embedding $e_s$, the object category embedding $e_o$ and $p_c$ are concatenated to construct the joint confusion factor $c$;
[0033] A cross-attention mechanism combined with a residual connection aggregates the bias features within the sensitive subspace to generate the spurious feature $f_{spu}$:
[0034]
[0035] where the context term denotes the joint context between the subject and the object, and the three projection matrices are all learnable;
[0036] A contrastive loss $\mathcal{L}_{con}$ is used to impose structured constraints on the sensitive subspace.
[0037] Preferably, in the above method, in step 3), the factual input $x_f$ is constructed for the case where the model has observed all the evidence. The factual input $x_f$ includes the intrinsic feature $f_{int}$ and the spurious feature $f_{spu}$:
[0038]
[0039] The model obtains the factual prediction $y_f$ through the main classifier,
[0040]
[0041] The counterfactual input $x_{cf}$ is constructed by ablating (masking) the intrinsic feature, so that $x_{cf}$ retains only the spurious feature $f_{spu}$;
[0042] The counterfactual prediction $y_{cf}$ is obtained through the main classifier,
[0043]
[0044] The KL divergence between $y_f$ and $y_{cf}$ is computed and serves as the adversarial correction loss $\mathcal{L}_{adv}$, which forces the model's predictions away from simple statistical bias.
[0045] Preferably, in the above method, cross-modal regularization in step 4) proceeds as follows: the visual causal representation $h_v$ is extracted from the penultimate layer of the main classifier;
[0046] A word embedding model extracts and fuses the word vectors of the subject and the object to obtain the language context representation $h_l$;
[0047] $h_v$ and $h_l$ are fused by element-wise multiplication, a cross-modal classifier makes a prediction on the fused result, and the cross-modal regularization loss $\mathcal{L}_{cm}$ is computed.
[0048] Preferably, in the above method, in step 5), the comprehensive loss function $\mathcal{L}$ includes the main scene graph generation loss $\mathcal{L}_{SGG}$ and the joint causal decoupling loss $\mathcal{L}_{CD}$:
[0049] $\mathcal{L} = \mathcal{L}_{SGG} + \mathcal{L}_{CD}$
[0050] where $\mathcal{L}_{SGG}$ includes the cross-entropy loss for object classification and the relationship classification loss based on focal loss;
[0051] The joint causal decoupling loss $\mathcal{L}_{CD}$ is calculated as
[0052] $\mathcal{L}_{CD} = \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{adv} + \lambda_3 \mathcal{L}_{cm}$,
[0053] where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyperparameters, set to 0.8, 1.2, and 1.0 respectively.
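The weighting described above can be illustrated with a minimal Python sketch. The function and argument names are hypothetical, and the pairing of the three weights 0.8, 1.2 and 1.0 with the contrastive, adversarial and cross-modal terms (in that order) is an assumption based on the order in which the text lists the losses; the source does not state the pairing explicitly.

```python
def joint_causal_decoupling_loss(l_contrastive, l_adversarial, l_cross_modal,
                                 weights=(0.8, 1.2, 1.0)):
    """Weighted sum of the three decoupling terms.

    The order of `weights` relative to the terms is an assumption.
    """
    w_con, w_adv, w_cm = weights
    return w_con * l_contrastive + w_adv * l_adversarial + w_cm * l_cross_modal


def comprehensive_loss(l_sgg, l_contrastive, l_adversarial, l_cross_modal):
    """Main scene graph loss plus the joint causal decoupling loss."""
    return l_sgg + joint_causal_decoupling_loss(
        l_contrastive, l_adversarial, l_cross_modal)


if __name__ == "__main__":
    # Illustrative scalar values only, not measured results.
    print(comprehensive_loss(1.0, 1.0, 0.5, 2.0))
```

Because the adversarial term carries the largest weight under this assumed pairing, it would contribute most per unit of loss, which matches the emphasis the text places on cutting off statistical shortcuts.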
[0054] The optimized, survivor bias-resistant scene graph generation method based on causal inference feature decoupling improves the ability of the intrinsic feature branch to extract pairwise object information from visual information and to initially integrate weak visual cues, thereby enhancing the grasp of the true semantics of entity context; the spurious feature branch deeply divides the relationship instances between similar statistical patterns, ensuring that the predicates embedded between entity pairs can capture the discovered joint confusion factors without affecting the semantics.
[0055] Preferably, in the above method, step 4) balances the cross-modal regularization term against the causal intervention loss to improve the model's ability to identify long-tail relationships during inference. The setting of each loss weight also reflects the impact of the corresponding module on model accuracy and on the long-tail distribution.
[0056] The beneficial effects of this application are as follows:
[0057] The technical solution of this application breaks the cognitive trap of survivor bias by integrating visual reasoning and cognitive mechanisms. It designs the spurious feature branch as an intuitive module simulating System 1, specifically designed to capture statistical shortcuts that lead to survivor bias. Simultaneously, causal intervention is designed as the monitor of System 2. When the system detects that the model's predictions are based solely on statistical availability rather than visual evidence, the model forcibly activates an anti-survivor correction process. This mechanism forces the model to slow down when faced with potential survivor bias, cutting off its reliance on statistical shortcuts and instead calling upon its intrinsic feature branches to deeply examine visual details. This not only endows the model with human-like metacognitive abilities but also fundamentally improves its robustness of reasoning in long-tailed and complex scenarios.
[0058] In this application, the technical solution formalizes the long-tail distribution problem in scene graph generation as a cognitive task to combat survivor bias, guided by causal inference theory. It explicitly decouples the inherent visual semantics representing "the world as it appears" from the spurious statistical associations representing "the world is usually as it is." It integrates weak visual cues with strong perturbation statistical co-occurrence patterns to form a dual-branch parallel feature processing architecture, attempting to solve the "survivor bias" dilemma where the model ignores low-frequency tail visual evidence due to overfitting high-frequency head patterns. The process of eliminating bias is explicitly modeled as an adversarial causal intervention task, more clearly demonstrating the logical reasoning process of the model cutting off statistical backdoor paths and returning to the visual essence during inference. Through the amplification and enhancement of intrinsic features and the structured modeling of spurious features, a novel causal inference paradigm based on feature decoupling is proposed to alleviate the long-tail distribution bias problem caused by improper fusion of multi-source information in scene graph generation tasks.
[0059] The technical solution of this application proposes an intrinsic feature branch to improve the ability to extract weak but crucial visual cues from strong statistical noise, and utilizes a correlation-driven discrimination mechanism to enhance the perceptual sensitivity to long-tail categories ("crashers"). Simultaneously, a spurious feature branch is constructed to accurately simulate joint confusion factors within a structured sensitive subspace, improving the ability to capture multi-granularity statistical patterns. To further ensure the robustness of inference, an adversarial causal intervention mechanism and a cross-modal regularization module are designed. Counterfactual reasoning forces the model to maximize the distribution difference between factual predictions and purely biased predictions, and common sense is used to anchor visual semantics, ensuring that the model prevents semantic drift while cutting off spurious statistical shortcuts, achieving accurate inference of entity relationships in complex scenes. Experiments on the Visual Genome dataset demonstrate that this method achieves excellent performance on unbiased metrics while maintaining a balance in overall performance. It effectively achieves a robust fusion of visual perception and statistical knowledge, significantly improving the model's generalization ability in open and dynamic environments.
Attached Figure Description
[0060] Figure 1 is the overall architecture diagram of ACID, the anti-survivorship-bias framework based on causal inference feature decoupling proposed in this application;
[0061] Figure 2 is a comparison diagram of the biased path in traditional SGG and the causal path after decoupling in this application, as shown in the embodiments of this application;
[0062] Figure 3 is a comparison chart of the predicate debiasing results for the high-frequency predicates "on" and "in" in the embodiments of this application;
[0063] Figure 4 is a comparison chart showing the effect of gradually removing survivorship bias under various output conditions in the embodiments of this application.
Detailed Implementation
[0064] The technical features of this application are further illustrated below with reference to specific embodiments.
[0065] In the technical solution of this application, a survivorship-bias-resistant scene graph generation system based on causal inference feature decoupling is first constructed. The main components of the system include:
[0066] The input module is used to receive images and perform object detection;
[0067] The intrinsic feature processing module is used to perform feature amplification and extraction operations, simulating the direct causal path from vision to prediction;
[0068] The fake feature processing module is used to perform sensitive subspace construction and feature extraction, simulating the backdoor path of the joint confusion factor;
[0069] The causal intervention module is used to perform adversarial corrections and cut off false backdoor paths;
[0070] The cross-modal fusion module is used to perform regularization operations to prevent semantic drift.
[0071] Based on the system constructed above, the main process of the survivorship-bias-resistant scene graph generation method of this application includes:
[0072] The original image is input into a pre-trained object detection model to extract object bounding boxes and initial visual features. Pure visual semantics are extracted through intrinsic feature branches, and weak visual cues are enhanced based on a correlation-driven discrimination mechanism to form intrinsic feature representations.
[0073] Meanwhile, coarse-grained predicate logic is combined with the category embedding of subject and object to construct a joint confusion factor. In the spurious feature branch, the sensitive subspace is used to perform structured modeling of statistical co-occurrence patterns to form spurious feature representations.
[0074] Subsequently, the adversarial causal intervention module performs adversarial inference on the outputs of the intrinsic feature branch and the false feature branch at the causal level. By maximizing the distribution difference between factual prediction and counterfactual prediction, it actively cuts off the model's dependence on false statistical backdoors.
[0075] Subsequently, the decoupled features are aligned with linguistic common sense in the cross-modal regularization module. The linguistic context is used as an anchor point to calibrate visual predictions and prevent semantic drift. The model then achieves robust fusion of visual perception and statistical knowledge to generate the final scene graph.
[0076] As shown in Figure 1, this application provides a method and system for generating scene graphs against survivorship bias based on causal inference feature decoupling.
[0077] In one embodiment of this application, firstly, an object extractor is constructed using Faster-RCNN to extract the appearance features of proposed entities and capture their initial visual representations.
[0078] By capturing subtle visual cues in images through intrinsic feature branches and using soft gating mechanisms to separate statistical noise, true visual evidence can be amplified.
[0079] The joint confusion effect between entity pairs is further modeled in the false feature branch, and a structured bias representation is constructed in the sensitive subspace through contrastive learning.
[0080] Finally, we integrate intrinsic features and spurious features to perform adversarial causal intervention, and combine cross-modal regularization to achieve synergistic optimization of semantic consistency constraints and bias removal capabilities.
[0081] Specifically, the steps of the scene graph generation method based on causal inference feature decoupling are as follows:
[0082] 1) Faster R-CNN is used as the object detector to extract initial information. A pre-trained ResNeXt-101-FPN is typically used as the backbone network to generate convolutional feature representations for subsequent modules. For each object detected in the input image, its bounding box $b_i$ and its initial visual feature $o_i$ are predicted. These visual features serve as input to the subsequent intrinsic feature branch, which simulates the direct causal path from vision to predicate.
[0083] 2) The intrinsic feature branch incorporates visual sensitivity and robustness into the task reasoning process. The input initial visual features often mix strong statistical correlations with weak visual cues, and models typically focus on high-frequency statistical features while ignoring low-frequency visual details. Therefore, the intrinsic feature branch first uses a correlation discriminator to compute a dimension-wise bias correlation score $s_i$, so that visual information is allocated sensibly and the model's grasp of essential visual semantics is strengthened. For an input feature $o_i$, the correlation discriminator generates a soft gating signal that decouples the feature into a bias-dominated component $o_i^{b}$ and a cue-dominated component $o_i^{c}$. A scaling factor $\alpha$ is then introduced to amplify the weak cues and increase the salience of "crasher" features.
[0084] $o_i^{b} = s_i \odot o_i$
[0085] $o_i^{c} = (1 - s_i) \odot o_i$
[0086] where $\odot$ denotes element-wise multiplication. Therefore, the final intrinsic feature is defined as follows:
[0087]
[0088] The processed intrinsic feature $f_{int}$ then represents pure visual evidence, unaffected by statistical contamination.
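The soft gating and amplification in step 2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the gate is taken to be a sigmoid over hypothetical per-dimension discriminator outputs (`gate_logits`), the value of the scaling factor `alpha` is arbitrary, and the final fusion is taken to be element-wise addition, which the source does not specify.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def split_and_amplify(features, gate_logits, alpha=2.0):
    """Soft-gate a feature vector into bias- and cue-dominated parts.

    features    : initial visual feature vector (list of floats)
    gate_logits : hypothetical discriminator outputs, one per dimension
    alpha       : hypothetical scaling factor for the weak cues
    """
    # Bias correlation score per dimension, in (0, 1).
    scores = [sigmoid(g) for g in gate_logits]
    # Bias-dominated component: score * feature.
    bias_part = [s * x for s, x in zip(scores, features)]
    # Cue-dominated component: (1 - score) * feature.
    cue_part = [(1.0 - s) * x for s, x in zip(scores, features)]
    # Amplify the weak cues, then fuse by addition (assumed fusion rule).
    intrinsic = [b + alpha * c for b, c in zip(bias_part, cue_part)]
    return scores, bias_part, cue_part, intrinsic


if __name__ == "__main__":
    s, b, c, f = split_and_amplify([1.0, 2.0], [0.0, 0.0], alpha=2.0)
    print(s, b, c, f)
```

Under this gate the two components always sum back to the original feature, so with `alpha = 1` the branch would leave the input unchanged; values above 1 shift weight toward the cue-dominated part.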
[0089] 3) Spurious features are further explicitly modeled and structured here. The joint confusion factor obtained from statistical co-occurrence is treated as a novel spurious feature. First, the coarse-grained predicate logits $p_c$ of the pre-trained model are extracted and concatenated with the subject embedding $e_s$ and the object embedding $e_o$ to construct the joint confusion factor $c$:
[0090] $c = [e_s; e_o; p_c]$
[0091] where $c$ simulates a backdoor path that simultaneously affects the visual features and the prediction result. Then, a cross-attention mechanism maps $c$ to the sensitive subspace to generate the spurious feature $f_{spu}$:
[0092]
[0093] To keep the sensitive subspace structured, a triplet contrastive loss $\mathcal{L}_{con}$ is imposed as a constraint, clustering samples that share the same high-frequency statistical pattern within the feature space.
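As a rough illustration of the triplet constraint on the sensitive subspace, the sketch below uses the standard margin-based triplet loss with Euclidean distance. The distance metric, the margin value, and the way anchor, positive and negative samples are chosen are all assumptions; the source only states that samples with the same high-frequency statistical pattern are clustered together.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss.

    anchor, positive : spurious features assumed to share a statistical pattern
    negative         : spurious feature assumed to follow a different pattern
    margin           : hypothetical margin value
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    # Zero once the negative is at least `margin` farther away than the positive.
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # well separated
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))  # violated
```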
[0094] 4) To make reasonable use of the information extracted by the intrinsic and spurious feature branches, adversarial causal intervention is applied. The model constructs a factual input $x_f$ containing all features and a counterfactual input $x_{cf}$ containing only the spurious features. The factual prediction $y_f$ and the counterfactual prediction $y_{cf}$ are obtained through the main classifier.
[0095]
[0096] Subsequently, the distribution difference between the two is maximized through KL divergence, which forces the model to sever its dependence on spurious features and to focus on intrinsic visual evidence during training. This term is the adversarial correction loss $\mathcal{L}_{adv}$.
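A minimal sketch of the adversarial correction term follows. It assumes that both predictions are softmax distributions over predicate classes, that the divergence is taken as KL(factual || counterfactual), and that "maximizing the difference" is implemented by minimizing the negative divergence. The direction of the divergence and the sign convention are assumptions, and the logits in the usage lines are made-up numbers.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def adversarial_correction_loss(fact_logits, counterfactual_logits):
    """Negative KL, so that minimizing the loss pushes the two apart."""
    p_fact = softmax(fact_logits)
    p_cf = softmax(counterfactual_logits)
    return -kl_divergence(p_fact, p_cf)


if __name__ == "__main__":
    # Counterfactual (bias-only) logits favour the first class; the factual
    # logits favour the last one.
    print(adversarial_correction_loss([0.0, 0.0, 2.0], [2.0, 0.0, 0.0]))
    # Identical predictions give zero divergence, i.e. no reward.
    print(adversarial_correction_loss([1.0, 0.5, 0.2], [1.0, 0.5, 0.2]))
```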
[0097] 5) Finally, to prevent semantic drift caused by excessive adversarial correction, a cross-modal regularization module is introduced. This module fuses the visual causal feature $h_v$ with the language context feature $h_l$, makes a prediction with an auxiliary classifier, and computes the regularization loss $\mathcal{L}_{cm}$.
[0098]
[0099]
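The cross-modal regularization step can be sketched as an element-wise product followed by a classifier and a cross-entropy loss. In this sketch the auxiliary classifier is assumed to be a single linear layer without bias, the loss is assumed to be standard cross-entropy, and all vectors and weights are illustrative placeholders rather than values from the source.

```python
import math


def cross_modal_regularization(visual, language, weights, label):
    """Fuse two representations and score the fused vector.

    visual   : visual causal representation (list of floats)
    language : language context representation of the same length
    weights  : hypothetical linear classifier, one row per predicate class
    label    : index of the ground-truth predicate class
    """
    # Element-wise multiplication fusion, as described in the text.
    fused = [v * l for v, l in zip(visual, language)]
    # Linear auxiliary classifier (assumed form).
    logits = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    # Cross-entropy via a numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    loss = -(logits[label] - log_z)
    return fused, loss


if __name__ == "__main__":
    fused, loss = cross_modal_regularization(
        visual=[1.0, 2.0], language=[3.0, 0.5],
        weights=[[1.0, 0.0], [0.0, 1.0]], label=0)
    print(fused, loss)
```

Multiplicative fusion means a visual dimension only contributes where the language context is also active, which is one way to read the "anchor" role the text assigns to linguistic common sense.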
[0100] The final joint loss of the entire anti-survivorship-bias model can be defined as:
[0101]
[0102] where $\mathcal{L}_{ce}$ is the standard cross-entropy loss on objects; $\mathcal{L}_{con}$ is the triplet loss used to structure the spurious feature branch; $\mathcal{L}_{adv}$ is the adversarial bias correction loss after causal intervention, used to cut off the joint backdoor path; $\mathcal{L}_{cm}$ is the cross-modal alignment loss constructed to ensure semantic consistency and prevent semantic drift; and $\mathcal{L}_{focal}$ is the focal loss, which uses the focusing parameter $\gamma$ to automatically reduce the loss weight of easy samples and is calculated according to the following formula:
[0103] $\mathcal{L}_{focal} = -(1 - p_t)^{\gamma} \log(p_t)$
[0104] where $p_t$ is the predicted probability of the true label.
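The effect of the focusing parameter can be checked numerically with the standard focal loss form, $-(1 - p_t)^{\gamma} \log(p_t)$. The sketch below assumes this standard form without a class-balancing weight; the value `gamma = 2.0` is a common default and is not taken from the source.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one sample.

    p_true : predicted probability of the true label, in (0, 1]
    gamma  : focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        # Compare plain cross-entropy (gamma = 0) with the focal variant.
        print(p, focal_loss(p, gamma=0.0), focal_loss(p, gamma=2.0))
```

Easy samples (high `p_true`) are down-weighted far more strongly than hard ones, so low-frequency predicates that the model still gets wrong keep a comparatively large share of the gradient.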
[0105] Of course, the above description is not intended to limit this application, nor is this application limited to the examples given above. Any changes, modifications, additions, or substitutions made by those skilled in the art within the scope of this application should fall within the protection scope of this application.
Claims
1. A survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling, characterized in that it includes the following steps: 1) acquire the image to be processed and extract the objects and their initial visual features from the image; 2) construct an intrinsic feature branch to simulate the direct causal path, and construct a spurious feature branch to simulate the backdoor path; 3) conduct adversarial causal intervention: factual predictions and counterfactual predictions are generated from complete feature inputs and biased inputs, respectively, and the distribution difference between them is maximized through the adversarial correction loss; 4) perform cross-modal regularization: visual causal representations are fused with linguistic contextual representations, linguistic common sense is used to constrain the semantic consistency of the decoupled features, and the cross-modal regularization loss is calculated; 5) train the model based on the comprehensive loss function, and use the trained model to generate scene graphs.
2. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 1), during the input initialization phase, the SGG feature extraction scheme is followed to obtain the detection box $b_i$ of every object in the image and its initial visual feature $o_i$; the SGG feature extraction scheme includes the use of a pre-trained object detector with a ResNeXt-101-FPN backbone.
3. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 2), the initial information extracted from the original image is combined into a joint representation to obtain the interaction information between pairs of entities; the extracted joint features and the initial features are fed into the intrinsic feature branch and the spurious feature branch to extract visual cues and model statistical bias.
4. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 2), when constructing the intrinsic feature branch, the weight of weak visual cues is increased by using a correlation discriminator and a feature amplification mechanism; correlation discrimination is applied to the initial visual features to generate a soft gating mechanism that separates strongly biased features from weak visual cues, and the weak visual cues are amplified to obtain the intrinsic feature representation; when constructing the spurious feature branch, the joint confusion factor is modeled within a structured sensitive subspace to capture multi-granularity statistical patterns; based on the subject and object category embedding vectors and the coarse-grained predicate logits, a joint confusion factor is constructed and mapped to the sensitive subspace to obtain the spurious feature representation.
5. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 1), the object extractor is built on the Faster R-CNN backbone network and pre-trained on standard datasets such as Visual Genome; all training datasets are image datasets; the model is trained on the Visual Genome dataset to learn intrinsic and spurious features, and learns the entity information and interaction regions associated with each predicate in the dataset, distinguishing the different information expressed by different predicates in different scenarios.
6. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 2), a correlation discriminator maps the initial visual feature $o_i$ to a dimension-wise bias correlation score $s_i$; using the score $s_i$, the feature is divided into a bias-dominated component $o_i^{b} = s_i \odot o_i$ and a cue-dominated component $o_i^{c} = (1 - s_i) \odot o_i$, where $\odot$ denotes element-wise multiplication; $o_i^{b}$ captures high-frequency bias features, and $o_i^{c}$ identifies low-frequency but valuable visual cues; a scaling factor $\alpha$ is introduced to amplify the cue-dominated component, giving $\tilde{o}_i^{c} = \alpha \, o_i^{c}$; the amplified features are then fused with the bias-dominated component to generate the final intrinsic feature $f_{int}$.
7. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 2), a pre-trained classic SGG model is used, and its predicate logits are extracted as the coarse-grained predicate representation $p_c$; the subject category embedding $e_s$, the object category embedding $e_o$ and $p_c$ are concatenated to construct the joint confusion factor $c$; a cross-attention mechanism combined with a residual connection aggregates the bias features within the sensitive subspace to generate the spurious feature $f_{spu}$, using the joint context between the subject and the object and three learnable projection matrices; a contrastive loss $\mathcal{L}_{con}$ is used to impose structured constraints on the sensitive subspace.
8. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 3), the factual input $x_f$ is constructed for the case where the model has observed all the evidence; the factual input $x_f$ includes the intrinsic feature $f_{int}$ and the spurious feature $f_{spu}$; the model obtains the factual prediction $y_f$ through the main classifier; the counterfactual input $x_{cf}$ is constructed by ablating (masking) the intrinsic feature, so that $x_{cf}$ retains only the spurious feature $f_{spu}$; the counterfactual prediction $y_{cf}$ is obtained through the main classifier; the KL divergence between $y_f$ and $y_{cf}$ is computed and serves as the adversarial correction loss $\mathcal{L}_{adv}$.
9. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 4), during cross-modal regularization, the visual causal representation $h_v$ is extracted from the penultimate layer of the main classifier; a word embedding model extracts and fuses the word vectors of the subject and the object to obtain the language context representation $h_l$; $h_v$ and $h_l$ are fused by element-wise multiplication, a cross-modal classifier makes a prediction on the fused result, and the cross-modal regularization loss $\mathcal{L}_{cm}$ is computed.
10. The survivorship-bias-resistant scene graph generation method based on causal inference feature decoupling according to claim 1, characterized in that: in step 5), the comprehensive loss function $\mathcal{L}$ includes the main scene graph generation loss $\mathcal{L}_{SGG}$ and the joint causal decoupling loss $\mathcal{L}_{CD}$: $\mathcal{L} = \mathcal{L}_{SGG} + \mathcal{L}_{CD}$; where $\mathcal{L}_{SGG}$ includes the cross-entropy loss for object classification and the relationship classification loss based on focal loss; the joint causal decoupling loss is calculated as $\mathcal{L}_{CD} = \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{adv} + \lambda_3 \mathcal{L}_{cm}$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyperparameters, set to 0.8, 1.2, and 1.0 respectively.