A multi-modal image panoramic scene graph generation method

By using a dual-stream independent backbone network and an object-aware modal attention (OAMA) module for asymmetric interaction, the method addresses ambiguous fine-grained relationship recognition and performance degradation in harsh environments in RGB-only panoramic scene graph generation, and achieves high-precision panoramic scene graph generation.

CN122244218A · Pending · Publication Date: 2026-06-19 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2026-05-15
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing RGB panoramic scene graph generation methods suffer from ambiguous fine-grained relationship recognition and from performance degradation in harsh environments. Existing RGB-T fusion methods cannot adapt to the functional asymmetry of the panoramic scene graph generation task, and there is no decoupled fusion scheme suited to the RGB and thermal infrared modalities.

Method used

Asymmetric interaction is achieved with a dual-stream independent backbone network and an object-aware modal attention (OAMA) module. In the decoding stage, modality-specific features are preserved through independent branches; in the relationship prediction stage, the modality is selected through dynamic routing so that the two modalities complement each other.

Benefits of technology

It improves the accuracy and robustness of panoramic segmentation in all-weather scenarios, effectively distinguishes fine-grained relationships that are visually similar but have different physical states, and improves the accuracy of relationship recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for generating panoramic scene graphs from multimodal images, belonging to the field of computer vision and multimodal scene understanding. To address the poor robustness and ambiguous fine-grained relationship recognition of existing technologies in complex environments such as low light, the invention first uses a dual-stream independent backbone network to extract multi-scale features from registered RGB and thermal infrared images. In the encoding stage, an object-aware modal attention module performs asymmetric semantic injection and geometric refinement. Object queries and relation-specific pairing queries are then generated. In the decoding stage, a relation-aware modality selection module dynamically computes routing weights according to whether the relation content depends on semantic or geometric cues, selecting the most suitable modality features for relationship reasoning. The result is an RGB-T panoramic scene graph containing object nodes and edges representing the relationships between objects. The invention makes full use of the geometric and thermodynamic cues of thermal infrared images, improving the robustness and recognition accuracy of all-weather panoramic scene graph generation.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and multimodal scene understanding, and more specifically, relates to a method for generating panoramic scene graphs from multimodal images.

Background Technology

[0002] Panoramic scene graph generation (referred to in the English-language literature as panoptic scene graph generation) is a core technology in computer vision for achieving high-level scene understanding. Through panoramic segmentation and semantic relation reasoning, it parses an image into a structured graph representation containing object nodes and edges representing the relationships between objects. This structured representation is of great significance for downstream tasks such as image captioning and visual reasoning.

[0003] Currently, the mainstream panoramic scene graph generation method is represented by Pair-Net, proposed in the paper "Pair Then Relation: Pair-Net for Panoptic Scene Graph Generation" (published in the journal IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 46, Issue 12, 2024, by Wang J et al.). It employs an end-to-end panoramic scene graph generation framework of "pairing first, then inference," and is currently the highest-performing RGB panoramic scene graph generation scheme. However, this type of method, and most existing technologies, rely solely on RGB sensors as input, resulting in significant drawbacks in practical applications: First, fine-grained relationship recognition is ambiguous: RGB visual appearance makes it difficult to distinguish semantically similar relationships. For example, it is difficult to distinguish between "long-term parking" and "temporary parking" of a vehicle based solely on RGB information, while thermal infrared sensors can capture unique clues such as engine heat radiation to infer the vehicle's status. Second, performance drops sharply in harsh environments: RGB sensors experience significant performance degradation under adverse conditions such as low light and severe weather, leading to a large number of missed objects. Thermal infrared sensors, on the other hand, are highly robust to these environmental factors and can effectively compensate for the failure of RGB sensors.

[0004] Existing RGB-T (Thermal) fusion methods are mainly designed for pixel-level or object-level tasks such as semantic segmentation and object detection. They cannot adapt to the functional asymmetry between the RGB and thermal infrared modalities in the panoramic scene graph generation task. Forced fusion easily causes the semantic information of RGB and the geometric information of thermal infrared to interfere with each other, which not only loses the core advantages of each modality but also fails to meet the differing needs of the dual task of "object segmentation + relation reasoning".

[0005] In summary, there is currently no dedicated technical framework for RGB-T panoramic scene graph generation, nor a modality-decoupled fusion solution adapted to this task. There is an urgent need for a technical solution that fully leverages the complementary advantages of the two modalities while meeting the differing needs of object segmentation and relation reasoning, so as to achieve all-weather, high-precision panoramic scene graph generation.

Summary of the Invention

[0006] To address the aforementioned deficiencies or improvement needs of existing technologies, this invention provides a method for generating panoramic scene graphs from multimodal images that avoids ineffective coupling of the two modalities' features: the encoding stage makes the two modalities complement each other through asymmetric interaction rather than forced fusion; the decoding stage retains modality-specific features through independent branches; and the relationship prediction stage selects the modality adaptively through dynamic routing. The method thereby fully exploits the semantic classification advantages of the RGB modality and the geometric robustness and thermodynamic cues of the thermal infrared modality.

[0007] To achieve the above objectives, according to a first aspect of the present invention, a multimodal image panoramic scene graph generation system is provided, comprising:

A dual-stream independent backbone network, consisting of two backbone networks with identical structures but non-shared weights, used to extract M levels of RGB features and M levels of thermal infrared features from the registered RGB image and thermal infrared image, respectively, where M>2;

M-2 object-aware modal attention modules OAMA_1, ..., OAMA_{M-2}, whose inputs are connected one-to-one with the 2nd, ..., (M-1)th feature extraction layers of the dual-stream independent backbone network, and whose outputs are connected to the first and second pixel decoders. Each object-aware modal attention module includes a semantic injection branch, which uses the input RGB features to semantically enhance the input thermal infrared features and obtain enhanced thermal infrared features, and a geometric refinement branch, which uses the input thermal infrared features to perform geometric boundary enhancement on the input RGB features and obtain enhanced RGB features;

A first pixel decoder, used to perform multi-scale feature transformation on the RGB features output by the last feature extraction layer of the dual-stream independent backbone network and the enhanced RGB features output by OAMA_1, ..., OAMA_{M-2}, to obtain RGB memory features;

A second pixel decoder, used to perform multi-scale feature transformation on the thermal infrared features output by the last feature extraction layer of the dual-stream independent backbone network and the enhanced thermal infrared features output by OAMA_1, ..., OAMA_{M-2}, to obtain thermal infrared memory features;

A panoramic segmentation branch, including an object query decoder, which takes the RGB memory features as input and extracts object queries, and a panoramic mask prediction head, which takes the object queries and the RGB memory features as input and extracts the category and segmentation mask information of the objects in the scene;

A relationship prediction branch, including a pairing proposal network, a relation-aware modality selection module, and a relation classifier. The pairing proposal network takes the object queries as input and generates subject-object pairs; the relation-aware modality selection module includes first and second cross-attention layers, a feature fusion layer, a modality router, and a third computation module, and processes each subject-object pair as a query vector to obtain the corresponding relation embedding features;

A panoramic scene graph construction module, used to form a panoramic scene graph by treating each object in the scene as a node and each subject-predicate-object relation triple as a directed edge between the corresponding nodes. Each node corresponds to a segmentation instance and includes the object's category and segmentation mask information.

[0008] According to a second aspect of the present invention, a training method for the multimodal image panoramic scene graph generation system described in the first aspect is provided, comprising: constructing a training set, where the training set includes multiple pairs of labeled RGB images and thermal infrared images of different scenes, and the labels are the standardized subject-predicate-object relation triples of each scene; and training the multimodal image panoramic scene graph generation system in a supervised manner on the training set, where the loss function used in the training is:

$$L_{total} = \lambda_1 L_{sub} + \lambda_2 L_{obj} + \lambda_3 L_{pred} + \lambda_4 L_{pair} + \lambda_5 L_{seg}^{rgb} + \lambda_6 L_{seg}^{t}$$

[0009] where $L_{total}$ is the total loss function; $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, $\lambda_6$ are all weighting coefficients; $L_{sub}$, $L_{obj}$, $L_{pred}$ are the cross-entropy classification losses for the subject, object, and predicate, respectively; $L_{pair}$ is the binary cross-entropy loss of the pairing proposal network; and $L_{seg}^{rgb}$, $L_{seg}^{t}$ are the panoramic segmentation losses for the first and second pixel decoders, respectively.
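As a minimal sketch, assuming the total loss is a plain weighted sum of the six terms listed above (three cross-entropy losses, one binary cross-entropy loss, and two segmentation losses), the combination could look as follows in Python. The class name, field names, dictionary keys, and default weights of 1.0 are hypothetical; the source does not give coefficient values.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Hypothetical names and defaults; the source only states that six
    # weighting coefficients exist, not their values.
    subject: float = 1.0
    obj: float = 1.0
    predicate: float = 1.0
    pair: float = 1.0
    seg_rgb: float = 1.0
    seg_thermal: float = 1.0


def total_loss(losses: dict, w: LossWeights) -> float:
    """Weighted sum of the six per-term losses (assumed form of the total loss)."""
    return (
        w.subject * losses["subject_ce"]
        + w.obj * losses["object_ce"]
        + w.predicate * losses["predicate_ce"]
        + w.pair * losses["pair_bce"]
        + w.seg_rgb * losses["seg_rgb"]
        + w.seg_thermal * losses["seg_thermal"]
    )


# Example: down-weight the thermal segmentation loss by half.
example_losses = {
    "subject_ce": 0.5, "object_ce": 0.25, "predicate_ce": 1.0,
    "pair_bce": 0.125, "seg_rgb": 2.0, "seg_thermal": 4.0,
}
example_total = total_loss(example_losses, LossWeights(seg_thermal=0.5))
```

Keeping the two segmentation losses as separate terms mirrors the two independent pixel decoders: each decoder receives its own supervision signal.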

[0010] According to a third aspect of the present invention, a method for generating a multimodal image panoramic scene graph is provided, comprising: inputting the registered RGB image and thermal infrared image to be processed into the multimodal image panoramic scene graph generation system described in the first aspect to obtain a panoramic scene graph.

[0011] According to a fourth aspect of the present invention, an electronic device is provided, comprising: a computer-readable storage medium and a processor; The computer-readable storage medium is used to store executable instructions; The processor is configured to read executable instructions stored in the computer-readable storage medium and execute the method described in the second or third aspect.

[0012] According to a fifth aspect of the invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to perform the method as described in the second or third aspect.

[0013] According to a sixth aspect of the invention, a computer program product is provided, comprising a computer program or instructions that, when executed by a processor, implement the method as described in the second or third aspect.

[0014] In summary, compared with the prior art, the above-described technical solutions conceived by this invention can achieve the following beneficial effects: 1. This invention is the first to propose a dedicated technical framework (as shown in Figure 2) for RGB-T panoramic scene graph generation (as illustrated in Figure 1). Based on the functional asymmetry of the RGB and thermal infrared modalities, a full-process modality-decoupled fusion architecture is designed, which avoids the modal information interference and loss caused by traditional coupled fusion, effectively exploits the advantages of both modalities, and fills a technological gap in the field.

[0015] 2. The Object-Aware Modal Attention (OAMA) module designed in this invention uses an asymmetric bidirectional attention mechanism so that the two modalities complement each other rather than being forcibly fused: the semantic injection branch compensates for the weak semantic classification ability of the thermal infrared modality, and the geometric refinement branch compensates for the boundary blurring and missed targets of the RGB modality in harsh environments, thereby improving panoramic segmentation accuracy and robustness in all-weather scenes.

[0016] 3. The Relationship-Aware Modality Selection (RAMS) module designed in this invention dynamically allocates modality weights for different types of relationship reasoning tasks through a content-adaptive routing mechanism. It makes full use of the thermodynamic cues of the thermal infrared modality to distinguish fine-grained relationships that are visually similar but differ in physical state, thereby improving the accuracy of relationship recognition.

Description of the Drawings

[0017] Figure 1 is a schematic diagram of the multimodal image panoramic scene graph generation task provided in an embodiment of the present invention.

[0018] Figure 2 is a schematic diagram of the structure of the multimodal image panoramic scene graph generation system provided in an embodiment of the present invention.

[0019] Figure 3 is a schematic diagram of the structure of the object-aware modal attention module provided in an embodiment of the present invention.

[0020] Figure 4 is a schematic diagram of the structure of the relation-aware modality selection module provided in an embodiment of the present invention.

[0021] Figure 5 is the first comparison diagram of simulation results between the method provided in an embodiment of the present invention and existing methods.

[0022] Figure 6 is the second comparison diagram of simulation results between the method provided in an embodiment of the present invention and existing methods.

Detailed Description of Embodiments

[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0024] This invention provides a multimodal image panoramic scene graph generation system, as shown in Figure 2, comprising: a dual-stream independent backbone network consisting of two backbone networks with identical structures but non-shared weights. Each backbone network includes M feature extraction layers. The two backbone networks are used to extract M levels of RGB features and M levels of thermal infrared features from the registered RGB image and thermal infrared image, respectively, where M>2. A feature extraction layer refers to a set of convolutional operations in the backbone network that output feature maps with the same spatial resolution.

[0025] Specifically, the registered RGB image (i.e., visible light image) and thermal infrared image are acquired, and multi-scale feature extraction is performed on the RGB image and the thermal infrared image using a dual-stream independent backbone network to obtain multi-scale RGB features and thermal infrared features.

[0026] The proposed dual-stream independent backbone network employs two identical backbone networks with non-shared weights, each containing M feature extraction layers. These serve as the RGB stream branch and the thermal infrared stream branch, respectively, to extract multi-scale feature maps. Specifically, the RGB stream branch has M RGB feature extraction layers, and the thermal infrared stream branch has M thermal infrared feature extraction layers. The RGB stream branch extracts RGB features rich in texture and semantic information, while the thermal infrared stream branch extracts thermal infrared features that are insensitive to illumination changes and contain thermal radiation and geometric contour information. This independent dual-stream backbone network design effectively avoids the loss of modality-specific information caused by early feature fusion, preserving the core advantages of both modalities to the greatest extent possible.

[0027] The backbone network can be any existing feature extraction backbone network, such as ResNet-50; this embodiment of the invention does not limit it to a particular type.

[0028] Understandably, the number M of feature extraction layers in the backbone network and the stride of each feature extraction layer can be set according to actual needs. Figure 2 is a schematic diagram of the system provided in this embodiment of the invention when M=5; the strides of the feature extraction layers shown for the RGB stream branch and the thermal infrared stream branch are examples. H is the height and W is the width of the RGB image and the thermal infrared image.

[0029] M-2 object-aware modal attention modules OAMA_1, ..., OAMA_{M-2}: their inputs are connected one-to-one with the 2nd, ..., (M-1)th feature extraction layers of the dual-stream independent backbone network, and their outputs are connected to the first and second pixel decoders. Each object-aware modal attention module includes:

A semantic injection branch, used to semantically enhance the input thermal infrared features with the input RGB features to obtain enhanced thermal infrared features. The semantic injection branch includes a global average pooling layer, a first MLP, a first activation function layer, and a first computation module. The input RGB features are processed sequentially by the global average pooling layer, the first MLP, and the first activation function layer and thereby converted into channel attention weights; the first computation module then computes the enhanced thermal infrared features from the channel attention weights and the input thermal infrared features.

A geometric refinement branch, used to perform geometric boundary enhancement on the input RGB features based on the input thermal infrared features to obtain enhanced RGB features. The geometric refinement branch includes a spatial convolutional layer, a second activation function layer, and a second computation module. The input thermal infrared features are processed sequentially by the spatial convolutional layer and the second activation function layer and thereby converted into spatial attention weights; the second computation module then computes the enhanced RGB features from the spatial attention weights and the input RGB features.

[0030] Specifically, Object-Aware Modal Attention (OAMA) modules are deployed at the corresponding scale layers of the RGB features and thermal infrared features. That is, the inputs of the M-2 object-aware modal attention modules OAMA_1, ..., OAMA_{M-2} are respectively connected to the outputs of the 2nd, ..., (M-1)th feature extraction layers of the RGB stream branch and the thermal infrared stream branch in the dual-stream independent backbone network, and the outputs of OAMA_1, ..., OAMA_{M-2} are connected to the first and second pixel decoders. Meanwhile, the output of the last feature extraction layer of the RGB stream branch is connected to the input of the first pixel decoder, and the output of the last feature extraction layer of the thermal infrared stream branch is connected to the input of the second pixel decoder. The OAMA module performs cross-modal feature interaction through an asymmetric bidirectional attention mechanism to obtain enhanced RGB features and enhanced thermal infrared features.
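The connection pattern described above can be summarized with a small helper that lists which backbone levels pass through an OAMA module and which level reaches the pixel decoders directly. This is only an illustrative sketch; the function name and the returned dictionary keys are hypothetical, and levels are numbered from 1 as in the text.

```python
def oama_wiring(num_levels: int) -> dict:
    """Return which backbone levels (1-based) feed OAMA modules and which
    level feeds the pixel decoders directly, for an M-level backbone."""
    if num_levels <= 2:
        raise ValueError("the description requires M > 2")
    oama_levels = list(range(2, num_levels))  # levels 2 .. M-1, one OAMA module each
    return {
        "oama_levels": oama_levels,
        "direct_level": num_levels,           # last level bypasses OAMA
        "num_oama_modules": len(oama_levels), # equals M - 2
    }


wiring_m5 = oama_wiring(5)
```

For the M=5 configuration mentioned for Figure 2, this yields three OAMA modules on levels 2 to 4, with level 5 connected straight to the decoders.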

[0031] That is, the object-aware modal attention module is deployed on the corresponding multi-scale feature layers output by the dual-stream backbone network and includes two parallel asymmetric interaction branches, a semantic injection branch and a geometric refinement branch, as shown in Figure 3. The structure and processing flow of each branch are described in detail below. The semantic injection branch includes a global average pooling layer, a first MLP, a first activation function layer, and a first computation module connected in sequence. Let $F_{rgb}$ denote the input RGB feature map and $F_{t}$ the input thermal infrared feature map. The global average pooling layer applies global average pooling to $F_{rgb}$, compressing the H×W spatial dimensions into a global feature vector of dimension C×1×1, where C is the number of channels in the feature map. This global feature vector is input into the first MLP for a non-linear transformation that first reduces the channel dimension according to the reduction factor K and then restores it to the original number of channels, so that the output global semantic vector has the same dimension as its input (C×1×1). The global semantic vector is input into the first activation function layer, which normalizes its values to the [0,1] interval to obtain the channel attention weights. The first computation module broadcasts the channel attention weights to the same C×H×W dimensions as the thermal infrared feature map and multiplies them channel-wise, element by element, with the original thermal infrared feature map to inject semantic information at the channel level; the product is then added element-wise to the original thermal infrared feature map as a residual to obtain the semantically enhanced thermal infrared features $\hat{F}_{t}$. The calculation formula is:

$$\hat{F}_{t} = F_{t} \odot \sigma_1\big(\mathrm{MLP}_1(\mathrm{GAP}(F_{rgb}))\big) + F_{t}$$

[0032] where $\sigma_1$ is the first activation function, $\mathrm{GAP}$ denotes global average pooling, $\mathrm{MLP}_1$ is the first MLP, and $\odot$ denotes element-wise multiplication with broadcasting.

[0033] Understandably, the structure of the first MLP, the dimensionality reduction factor K, and the first activation function can all be set according to actual needs.
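The semantic injection branch can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the first MLP is taken to be two bias-free linear layers with a ReLU in between, the first activation function is taken to be a sigmoid, and the names `semantic_injection`, `w_down`, `w_up`, and `reduction` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def semantic_injection(f_rgb, f_t, w_down, w_up):
    """Enhance thermal features with channel weights derived from RGB features.

    f_rgb, f_t: arrays of shape (C, H, W)
    w_down:     (C // reduction, C) weight of the assumed first MLP layer
    w_up:       (C, C // reduction) weight of the assumed second MLP layer
    """
    pooled = f_rgb.mean(axis=(1, 2))           # global average pooling -> (C,)
    hidden = np.maximum(w_down @ pooled, 0.0)  # reduce channels, ReLU (assumed)
    weights = sigmoid(w_up @ hidden)           # channel attention weights in [0, 1]
    # Broadcast (C,) -> (C, 1, 1), multiply channel-wise, add the residual.
    return f_t * weights[:, None, None] + f_t


rng = np.random.default_rng(0)
C, H, W, reduction = 8, 4, 5, 4
f_rgb = rng.standard_normal((C, H, W))
f_t = rng.standard_normal((C, H, W))
w_down = rng.standard_normal((C // reduction, C))
w_up = rng.standard_normal((C, C // reduction))
f_t_enh = semantic_injection(f_rgb, f_t, w_down, w_up)
```

Because the attention weights lie in [0, 1] and a residual is added, every thermal value is scaled by a factor between 1 and 2, so the injection can emphasize channels but never zero them out.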

[0034] The geometric refinement branch includes a spatial convolutional layer, a second activation function layer, and a second computation module connected in sequence. Let $F_{t}$ denote the input thermal infrared feature map and $F_{rgb}$ the input RGB feature map. The spatial convolutional layer extracts features from $F_{t}$ and outputs a boundary-sensitive geometric map whose dimensions are consistent with those of the input feature map. The geometric map is input into the second activation function layer, which normalizes its values to the [0,1] interval to obtain the spatial attention weights. The second computation module multiplies the spatial attention weights element-wise with the original RGB feature map, enhancing the features of target boundary regions and suppressing the features of invalid background regions. The product is then added element-wise to the original RGB feature map to obtain the geometrically enhanced RGB features $\hat{F}_{rgb}$. The calculation formula is:

$$\hat{F}_{rgb} = F_{rgb} \odot \sigma_2\big(\mathrm{Conv}(F_{t})\big) + F_{rgb}$$

where $\sigma_2$ is the second activation function and $\mathrm{Conv}$ denotes the spatial convolutional layer.

[0035] Understandably, parameters such as the second activation function, the number of spatial convolutional layers, the kernel size, and the stride can all be set according to actual needs.
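The geometric refinement branch can be sketched in the same style. The assumptions here are a single depthwise 3×3 convolution with zero padding standing in for the spatial convolutional layer (so the geometric map keeps the C×H×W shape of its input) and a sigmoid as the second activation function; the function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 cross-correlation with zero 'same' padding.

    x: (C, H, W); kernels: (C, 3, 3), one kernel per channel.
    """
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + height, dx:dx + width]
    return out


def geometric_refinement(f_rgb, f_t, kernels):
    """Enhance RGB features with spatial weights derived from thermal features."""
    attn = sigmoid(depthwise_conv3x3(f_t, kernels))  # spatial attention in [0, 1]
    return f_rgb * attn + f_rgb                      # weighted features plus residual


rng = np.random.default_rng(1)
C, H, W = 4, 6, 7
f_rgb = rng.standard_normal((C, H, W))
f_t = rng.standard_normal((C, H, W))
kernels = rng.standard_normal((C, 3, 3))
f_rgb_enh = geometric_refinement(f_rgb, f_t, kernels)
```

With all-zero kernels the attention map is uniformly 0.5, so the output is exactly 1.5 times the RGB input, which is a convenient sanity check for an implementation.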

[0036] The first pixel decoder is used to perform multi-scale feature transformation on the RGB features output by the last feature extraction layer of the dual-stream independent backbone network and the enhanced RGB features output by OAMA_1, ..., OAMA_{M-2}, to obtain RGB memory features. The second pixel decoder is used to perform multi-scale feature transformation on the thermal infrared features output by the last feature extraction layer of the dual-stream independent backbone network and the enhanced thermal infrared features output by OAMA_1, ..., OAMA_{M-2}, to obtain thermal infrared memory features.

[0037] Specifically, all enhanced RGB features, together with the RGB features output by the last feature extraction layer of the RGB stream branch, are input into the first pixel decoder, and all enhanced thermal infrared features, together with the thermal infrared features output by the last feature extraction layer of the thermal infrared stream branch, are input into the second pixel decoder. Through multi-scale feature fusion and upsampling, each pixel decoder converts the low-resolution feature maps into high-resolution pixel embeddings, generating RGB memory features that preserve semantic details and thermal infrared memory features that retain geometric information, respectively. Using independent pixel decoders avoids mutual interference between the two modalities' features during decoding, preserves the semantic detail of the RGB memory features and the geometric structure of the thermal infrared memory features, and provides high-quality features for subsequent object segmentation and relation reasoning.

[0038] The first and second pixel decoders can adopt any existing structure, and the embodiments of the present invention do not limit them to a single one. As an example, both the first and second pixel decoders adopt the Mask2Former architecture.

[0039] The panoramic segmentation branch includes an object query decoder and a panoramic mask prediction head connected in sequence. The object query decoder takes the RGB memory features as input and extracts object queries. The panoramic mask prediction head takes the object queries and the RGB memory features as input and extracts the category and segmentation mask information of the objects in the scene.

[0040] Specifically, the panoramic segmentation branch takes the RGB memory features as input. The reason, based on an analysis of the dominant-modality bias, is that the RGB modality has inherent advantages in object semantic classification and panoramic segmentation accuracy, and the RGB memory features have already been injected with thermal infrared geometric boundary information by the OAMA modules. Even in harsh environments such as low light, they therefore recover objects that were missed and boundaries that were blurred in the RGB image, providing a feature basis for panoramic segmentation that balances semantic accuracy and boundary robustness. The RGB memory features are fed into the object query decoder of the panoramic segmentation branch, which uses a multi-layer cross-attention mechanism to extract from the RGB memory features the semantic features related to each object and outputs the refined object queries. The object queries and the RGB memory features are then jointly input into the panoramic mask prediction head, which maps the object queries through a multilayer perceptron to obtain the semantic category of each object, and takes the pixel-by-pixel dot product of the object queries and the RGB memory features to generate the segmentation mask of each object.
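The mask prediction step, in which each object query is compared with every pixel embedding by a dot product, can be sketched as below. As simplifying assumptions, a single linear layer `w_cls` stands in for the multilayer perceptron, and masks are binarized by thresholding the sigmoid of the dot product at 0.5; all names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mask_head(queries, memory, w_cls):
    """Predict a class label and a binary mask for each object query.

    queries: (N, C) object queries
    memory:  (C, H, W) RGB memory features (per-pixel embeddings)
    w_cls:   (num_classes, C) linear classifier standing in for the MLP
    """
    class_logits = queries @ w_cls.T                          # (N, num_classes)
    mask_logits = np.einsum("nc,chw->nhw", queries, memory)   # per-pixel dot product
    labels = class_logits.argmax(axis=1)                      # (N,)
    masks = sigmoid(mask_logits) > 0.5                        # (N, H, W) boolean
    return labels, masks


rng = np.random.default_rng(2)
N, C, H, W, num_classes = 3, 8, 4, 5, 6
queries = rng.standard_normal((N, C))
memory = rng.standard_normal((C, H, W))
w_cls = rng.standard_normal((num_classes, C))
labels, masks = mask_head(queries, memory, w_cls)
```

A pixel belongs to an object's mask when its embedding has a positive dot product with that object's query, which is why the quality of the memory features directly bounds the quality of the masks.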

[0041] The relationship prediction branch includes a pairing proposal network, a relation-aware modality selection module, and a relation classifier connected in sequence. The pairing proposal network takes the object queries as input and generates subject-object pairs. The relation-aware modality selection module includes first and second cross-attention layers, a feature fusion layer, a modality router, and a third computation module, and processes each subject-object pair as a query vector to obtain the corresponding relation embedding features. The processing is as follows: the first and second cross-attention layers perform cross-attention between the query vector and the RGB memory features and the thermal infrared memory features, respectively, yielding the semantic context features of the RGB modality and the structural relationship features of the thermal infrared modality; the feature fusion layer concatenates these two features to obtain the fused features, which the modality router, consisting of a second MLP and a third activation function layer connected in sequence, transforms into a routing weight; the third computation module uses the routing weight and its complement (1 minus the routing weight) as the weights of the RGB semantic context features and the thermal infrared structural relationship features, respectively, and fuses them by weighted summation to obtain the relation embedding features. The relation classifier takes the relation embedding features of each subject-object pair as input and predicts the relationship between the subject and the object, yielding a subject-predicate-object relation triple.

[0042] Specifically, the object queries extracted by the object query decoder in the panoramic segmentation branch are input into the pairing proposal network, which generates subject-object pairs based on the positional relationships and category relevance of the objects.

[0043] As a preferred embodiment, the pairing proposal network further filters and selects pairs after generating them, specifically including: Sort all subject-object pairs according to their confidence level, remove all subject-object pairs except the top K pairs, and use the remaining subject-object pairs as relation-specific pairing queries; where K is a positive integer.

[0044] For example, when K=100, the top-100 subject-object pairs are retained, which eliminates invalid pairs that are too far apart or have no semantic relevance, and the retained subject-object pairs are used as the relation-specific pairing queries.
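The top-K filtering step can be illustrated with a few lines of plain Python. The tuple layout `(subject_index, object_index, confidence)` and the function name are assumptions made for this sketch.

```python
def top_k_pairs(pairs, k):
    """Keep the k subject-object pairs with the highest confidence.

    pairs: list of (subject_index, object_index, confidence) tuples
    k:     positive integer
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Sort by confidence, highest first, and drop everything after the k-th pair.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)[:k]


candidate_pairs = [(0, 1, 0.91), (1, 0, 0.12), (0, 2, 0.55), (2, 1, 0.78)]
kept_pairs = top_k_pairs(candidate_pairs, k=2)
```

If fewer than K candidate pairs exist, all of them are kept, since slicing past the end of a list is harmless in Python.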

[0045] In the relationship prediction branch, the relation-specific pairing queries, the RGB memory features output by the first pixel decoder, and the thermal infrared memory features output by the second pixel decoder are input into the relation-aware modality selection (RAMS) module, which dynamically weights and fuses the features of the two modalities through a content-adaptive routing mechanism. Combined with the relation classifier, this completes predicate prediction and produces the final relation prediction result.

[0046] As shown in Figure 4, the RAMS module works as follows. The relation-specific pairing query is used as the query vector; the pairing proposal network has already encoded the semantic content and spatial location information of the subject-object pair into it. Based on this query vector, the first and second cross-attention layers perform cross-attention over the RGB memory features and the thermal infrared memory features, respectively, to extract the semantic context features of the RGB modality, $F_{sem}$, and the structural relationship features of the thermal infrared modality, $F_{geo}$. The feature fusion layer concatenates the two features along the channel dimension to obtain the fused feature, which is input into the modality router composed of the second MLP and the third activation function to compute the routing weight $\alpha$ of the RGB modality (the weight of the thermal infrared modality is $1-\alpha$). The weight calculation is:

$$\alpha = \sigma_3\big(\mathrm{MLP}_2([F_{sem}; F_{geo}])\big)$$

That is, the routing weight represents the contribution of the RGB modality's features to the current relationship prediction, and the weight of the thermal infrared modality is the difference between 1 and the routing weight. Based on the routing weight, the third computation module performs weighted fusion of $F_{sem}$ and $F_{geo}$ to obtain the final relation embedding features:

$$E_{rel} = \alpha F_{sem} + (1-\alpha) F_{geo}$$

Then $E_{rel}$ is input into the relation classifier to predict the predicate category between the objects and obtain their semantic relationship (such as a vehicle "temporarily parked" on the road or a pedestrian "walking" on the sidewalk), thus completing the prediction of predicate labels.

[0047] This demonstrates that the RAMS module can dynamically allocate modal weights for different types of relation reasoning tasks, effectively solving the problem of ambiguity in fine-grained relation recognition.

[0048] The panoramic scene graph construction module is used to form a panoramic scene graph by treating each object in the scene as a node and using the subject-predicate-object relation triple as directed edges between the corresponding nodes. Each node corresponds to a segmentation instance, including the object's category and segmentation mask information.

[0049] Specifically, the panoramic scene graph construction module constructs an RGB-T panoramic scene graph containing object nodes and inter-object relation edges based on the object category and segmentation mask information output by the panoramic segmentation branch and the relation prediction results (i.e., subject-predicate-object relation triples) output by the relation prediction branch. This includes: obtaining the category and pixel-level segmentation mask information of all identified objects in the scene output by the panoramic segmentation branch, and treating each object as a node in the panoramic scene graph; obtaining the subject-predicate-object relation triples output by the relation prediction branch, and treating the relation triples as directed edges between the corresponding nodes in the panoramic scene graph; and constructing a complete RGB-T panoramic scene graph $G = (O, R)$ from all nodes and directed edges. Here $O = \{o_i\}_{i=1}^{N}$ is the set of objects, $N$ is the total number of objects, and $o_i$ is the $i$-th object, so that $O$ includes all detected object instances in the scene; $R = \{r_{ij}\}$ is the set of relations, where $r_{ij}$ is a directed relation between the $i$-th object and the $j$-th object, labeled with the predicate category of that relation. $R$ includes the semantic relationships between all objects, enabling a high-level, structured understanding of the scene.
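The graph construction described above amounts to collecting nodes and directed, labeled edges. The following sketch shows one possible in-memory representation; the class and function names (`ObjectNode`, `SceneGraph`, `build_scene_graph`) are hypothetical, and dropping triples that point at missing nodes or form self-loops is an assumption added for robustness, not something the patent specifies.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    category: str  # object category from the panoramic segmentation branch
    mask: list     # binary segmentation mask as rows of 0/1


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)


def build_scene_graph(objects, triples):
    """objects: iterable of (category, mask); triples: iterable of (subj, predicate, obj)."""
    graph = SceneGraph()
    for category, mask in objects:
        graph.nodes.append(ObjectNode(category, mask))
    count = len(graph.nodes)
    for subj, predicate, obj in triples:
        # Assumption: ignore edges with out-of-range endpoints or self-loops.
        if 0 <= subj < count and 0 <= obj < count and subj != obj:
            graph.edges.append((subj, predicate, obj))
    return graph
```

For example, two nodes `car` and `road` with the triple `(0, "parked on", 1)` yield a graph with one directed edge from the car node to the road node.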

[0050] This invention provides a training method for a multimodal image panoramic scene generation system as described in any of the above embodiments, comprising: constructing a training set, where the training set includes multiple pairs of labeled RGB images and thermal infrared images of different scenes, and the labels are standardized subject-predicate-object relation triples for each scene; and training the multimodal image panoramic scene generation system in a supervised manner using the training set, where the loss function used in the training is:

$$L_{total} = \lambda_1 L_{sub} + \lambda_2 L_{obj} + \lambda_3 L_{pred} + \lambda_4 L_{ppn} + \lambda_5 L_{seg}^{rgb} + \lambda_6 L_{seg}^{t}$$

[0051] where $L_{total}$ is the total loss function; $\lambda_1, \dots, \lambda_6$ are weighting coefficients; $L_{sub}$, $L_{obj}$, and $L_{pred}$ are the cross-entropy classification losses for the subject, object, and predicate, respectively; $L_{ppn}$ is the binary cross-entropy loss of the pairing proposal network; and $L_{seg}^{rgb}$ and $L_{seg}^{t}$ are the panoramic segmentation losses of the first and second pixel decoders, respectively.
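As a rough illustration of how the six loss terms combine, the sketch below computes a weighted sum over named terms and includes the two elementary losses mentioned in the text. The term names in `LOSS_TERMS` and the function names are hypothetical labels; the weighted-sum form is an assumption inferred from the description of the weighting coefficients, and the segmentation losses are treated as precomputed numbers.

```python
import math

# Hypothetical names for the six terms: subject, object, predicate,
# pairing proposal network, and the two pixel-decoder segmentation losses.
LOSS_TERMS = ("sub", "obj", "pred", "ppn", "seg_rgb", "seg_t")


def cross_entropy(probs, target):
    """Cross-entropy for one sample given class probabilities and a target index."""
    return -math.log(probs[target])


def binary_cross_entropy(p, y):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def total_loss(losses, weights):
    """Weighted sum of the six loss terms; raises KeyError if a term is missing."""
    return sum(weights[name] * losses[name] for name in LOSS_TERMS)
```

Keeping the weights in a separate mapping makes it easy to reuse the coefficients of a baseline while swapping in different loss values per batch.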

[0052] This invention provides a method for generating panoramic scene graphs from multimodal images, comprising: inputting the registered RGB image and thermal infrared image to be processed into the multimodal image panoramic scene generation system as described in any of the above embodiments to obtain a panoramic scene graph.

[0053] The workflow of the multimodal image panoramic scene generation system provided by this invention is described below with a specific example. In this example, the backbone network is ResNet-50, the dual-stream independent backbone network outputs 5 levels of RGB features and thermal infrared features, and the first to third activation functions are all the Sigmoid function. The workflow of this multimodal image panoramic scene generation system includes: S1: The registered RGB and thermal infrared images are input into the dual-stream independent backbone network. This network uses two ResNet-50 feature extraction backbones with identical structure but non-shared weights to extract multi-scale features from the two modalities, yielding RGB feature maps and thermal infrared feature maps. The RGB backbone is pre-trained on the Panoptic Scene Graph (PSG) dataset, while the thermal infrared backbone is initialized with weights pre-trained on the ImageNet dataset, so that the low-level geometric feature extraction capability of the pre-trained weights can be adapted to the characteristics of thermal infrared data.

[0054] S2: Object-aware modal attention (OAMA) modules are deployed on the Stage 2, Stage 3, and Stage 4 feature extraction layers (output strides of 4, 8, and 16, respectively) for the RGB and thermal infrared features. To address the functional asymmetry between the RGB and thermal infrared modalities, each module performs asymmetric bidirectional feature interaction to obtain enhanced RGB features and enhanced thermal infrared features. The workflow consists of two parallel branches: (1) Semantic injection branch (RGB → thermal infrared semantic enhancement). To address the inherent limitations of thermal infrared images, such as a lack of texture information and weak semantic classification capability, this branch injects the strong semantic information of the RGB modality into the thermal infrared features. The specific steps are as follows: global average pooling is applied to the input RGB feature map $F_{rgb}$, compressing the H×W spatial dimensions to produce a global feature vector of dimension C×1×1, where C is the number of channels of the feature map. This global feature vector is input into a cascaded multilayer perceptron for a nonlinear transformation that first reduces the feature dimension to 1/4 of the original number of channels and then restores it to the original number of channels, outputting a global semantic vector with the same dimension as the input. The global semantic vector is passed through the Sigmoid activation function, which normalizes its values to the [0,1] interval, to obtain the channel attention weights. The channel attention weights are then broadcast to the same C×H×W dimensions as the thermal infrared feature map $F_{t}$ and multiplied element-wise with the original thermal infrared feature map to inject the semantic information at the channel level. Finally, the product is added element-wise to the original thermal infrared feature map through a residual connection to obtain the semantically enhanced thermal infrared features $\hat{F}_{t}$. The calculation formula is:

$$\hat{F}_{t} = F_{t} + F_{t} \odot \sigma(\mathrm{MLP}(\mathrm{GAP}(F_{rgb})))$$

[0055] where $\sigma$ is the Sigmoid activation function, $\mathrm{GAP}$ is global average pooling, and $\odot$ is element-wise multiplication with broadcasting.
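The semantic injection branch can be sketched in plain Python on nested lists of shape [C][H][W]. This is a minimal illustration under stated assumptions: the ReLU between the two MLP layers is assumed (the text only says the transformation is nonlinear), and the function and parameter names (`semantic_injection`, `w_down`, `w_up`) are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def linear(x, weight, bias):
    """Dense layer: `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def semantic_injection(f_rgb, f_t, w_down, b_down, w_up, b_up):
    """Return thermal features enhanced by RGB-derived channel attention.

    f_rgb, f_t: feature maps as [C][H][W] nested lists.
    w_down maps C -> C/4 and w_up maps C/4 -> C.
    """
    # Global average pooling: one scalar per RGB channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in f_rgb]
    hidden = [max(0.0, h) for h in linear(pooled, w_down, b_down)]  # ReLU (assumption)
    attention = [sigmoid(v) for v in linear(hidden, w_up, b_up)]    # channel weights
    # Broadcast each channel weight over H x W, multiply, then add the residual.
    return [[[v + v * a for v in row] for row in ch] for ch, a in zip(f_t, attention)]
```

Because of the residual connection, a channel whose attention weight is near 0 passes through unchanged, while a weight near 1 doubles its response; the thermal features are never suppressed below their original values.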

[0056] (2) Geometric refinement branch (thermal infrared → RGB boundary enhancement). To address blurred target boundaries and missed detection of small targets in RGB images under adverse conditions such as low light, rain, fog, and glare, this branch injects the illumination-independent geometric boundary information of the thermal infrared modality into the RGB features. The specific steps are as follows: the input thermal infrared feature map $F_{t}$ is processed by two consecutive 3×3 spatial convolutional layers with a stride of 1 and a padding of 1, producing a boundary-sensitive geometric map with the same dimensions as the input feature map. The geometric map is passed through the Sigmoid activation function, which normalizes its values to the [0,1] interval, to obtain the spatial attention weights. The spatial attention weights are multiplied element-wise with the original RGB feature map $F_{rgb}$ to enhance the features of target boundary regions and suppress the features of invalid background regions. The product is then added element-wise to the original RGB feature map to obtain the geometrically enhanced RGB features $\hat{F}_{rgb}$. The calculation formula is:

$$\hat{F}_{rgb} = F_{rgb} + F_{rgb} \odot \sigma(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(F_{t})))$$

where $\sigma$ is the Sigmoid activation function and $\odot$ is element-wise multiplication.
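The geometric refinement branch can be sketched in the same nested-list style. The sketch assumes no nonlinearity between the two convolutions, since the text does not name one, and uses zero padding; `conv3x3` and `geometric_refinement` are hypothetical names, and the direct loops are written for readability rather than speed.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def conv3x3(x, kernels, biases):
    """3x3 convolution, stride 1, zero padding 1.

    x: [C_in][H][W]; kernels: [C_out][C_in][3][3]; biases: [C_out].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    out = []
    for kernel, bias in zip(kernels, biases):
        plane = [[bias] * w for _ in range(h)]
        for c in range(c_in):
            for i in range(h):
                for j in range(w):
                    acc = 0.0
                    for di in range(3):
                        for dj in range(3):
                            ii, jj = i + di - 1, j + dj - 1
                            if 0 <= ii < h and 0 <= jj < w:  # zero padding
                                acc += kernel[c][di][dj] * x[c][ii][jj]
                    plane[i][j] += acc
        out.append(plane)
    return out


def geometric_refinement(f_rgb, f_t, k1, b1, k2, b2):
    """Return RGB features enhanced by thermal-derived spatial attention."""
    geometry = conv3x3(conv3x3(f_t, k1, b1), k2, b2)  # boundary-sensitive map
    return [[[v + v * sigmoid(g) for v, g in zip(rgb_row, geo_row)]
             for rgb_row, geo_row in zip(rgb_ch, geo_ch)]
            for rgb_ch, geo_ch in zip(f_rgb, geometry)]
```

The second convolution must output as many channels as the RGB feature map so that the attention map and the RGB features align element by element.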

[0057] Through the collaborative interaction of the two asymmetric branches, the OAMA module achieves complementary advantages of dual-modal features rather than forced fusion. It retains the core characteristics of both modalities while making up for their inherent defects, providing a robust feature foundation for subsequent pixel decoding and panoramic segmentation.

[0058] S3: The enhanced bimodal features, together with the RGB and thermal infrared features output from the last feature extraction layer of the dual-stream independent backbone network, are input into two independent pixel decoders (based on the Mask2Former architecture). Through multi-scale feature fusion and upsampling operations, the decoders convert low-resolution features into high-resolution pixel embeddings to generate the RGB memory features and the thermal infrared memory features. The RGB memory features retain rich semantic details, while the thermal infrared memory features retain clear geometric structures.

[0059] S4: Taking the RGB memory features as input, object queries are extracted through the panoramic segmentation branch, and panoramic segmentation and relationship pairing proposals are completed. The specific steps include: (1) The RGB memory features are input into the object query decoder, which uses a multi-layer cross-attention mechanism to extract the semantic features of each object step by step and outputs the refined object queries.

[0060] (2) The object queries and the RGB memory features are jointly input into the panoramic mask prediction head. This head uses a multilayer perceptron to map each object query to a semantic category prediction, and performs a pixel-by-pixel dot product between each object query and the RGB memory features to generate the corresponding segmentation mask.

[0061] (3) At the same time, the object queries are input into the pairing proposal network to generate subject-object candidate pairs. The candidate pairs are sorted by pairing confidence and the top 100 pairs are kept to obtain the relationship-specific pairing query $Q_{pair}$.
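The top-K selection in step (3) can be sketched as follows, assuming the pairing proposal network has already produced an N×N matrix of pairing confidences in which entry (i, j) scores object i as subject and object j as object. The matrix layout, the function name `top_k_pairs`, and the exclusion of the diagonal (an object paired with itself) are assumptions made for illustration.

```python
def top_k_pairs(confidence, k=100):
    """Return the k highest-scoring (subject_idx, object_idx) pairs.

    confidence: N x N nested list of pairing scores.
    """
    scored = [
        (confidence[i][j], i, j)
        for i in range(len(confidence))
        for j in range(len(confidence[i]))
        if i != j  # assumption: an object is never paired with itself
    ]
    # Sort by descending score; the sort is stable, so ties keep row-major order.
    scored.sort(key=lambda item: -item[0])
    return [(i, j) for _, i, j in scored[:k]]
```

With 100 object queries there are 9,900 ordered candidate pairs, so keeping 100 of them reduces the number of relation queries passed to the later stages by roughly two orders of magnitude.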

[0062] S5: The relationship-specific pairing query, together with the RGB memory features and thermal infrared memory features generated in step S3, is input into the relation-aware modality selection (RAMS) module, which performs relation reasoning and predicate prediction through a content-adaptive routing mechanism. Specifically: (1) The relationship-specific pairing query $Q_{pair}$ is used as the query vector. $Q_{pair}$ is generated by the pairing proposal network in step S4 and already encodes the semantic content and spatial location information of each subject-object candidate pair. (2) Using $Q_{pair}$ as the query, cross-attention is performed over the RGB memory features and the thermal infrared memory features separately to extract the RGB semantic context features $F_{sem}$ and the thermal infrared structural relationship features $F_{geo}$. (3) $F_{sem}$ and $F_{geo}$ are concatenated along the channel dimension to obtain the fused feature, which is input into a modal router consisting of a two-layer multilayer perceptron and a Sigmoid function. The router dynamically generates a routing weight $\alpha$ in the range 0 to 1, which represents the contribution of the RGB modality to the current relation prediction: $\alpha = \sigma(\mathrm{MLP}([F_{sem}; F_{geo}]))$. (4) Based on the routing weight, $F_{sem}$ and $F_{geo}$ are fused by weighting to obtain the final relation embedding features: $F_{rel} = \alpha \cdot F_{sem} + (1 - \alpha) \cdot F_{geo}$. (5) The fused relation embedding features $F_{rel}$ are input into the relation classifier to predict the predicate labels and obtain the semantic relationships between objects.
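The cross-attention operation in step (2) can be illustrated with a simplified sketch: a single query attending over a list of memory tokens with scaled dot-product attention. It deliberately omits the learned query, key, and value projections and the multiple heads that a trained cross-attention layer would normally have, so it shows only the core weighting mechanism; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, memory):
    """Single-head scaled dot-product attention of one query over memory tokens.

    query: list of D floats; memory: list of tokens, each a list of D floats.
    Returns the attention-weighted average of the memory tokens.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, token)) / scale for token in memory]
    weights = softmax(scores)
    return [sum(w * token[d] for w, token in zip(weights, memory))
            for d in range(len(query))]
```

Running the same pairing query through this operation twice, once over the RGB memory and once over the thermal infrared memory, produces the two modality-specific feature vectors that the router then weighs against each other.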

[0063] This module dynamically selects the optimal modal features for different types of relationships. For example, for the state relationship "vehicle parked on the road", the model automatically assigns a higher thermal infrared weight and distinguishes the fine-grained relationships "temporary parking" and "long-term parking" by the difference in engine thermal radiation intensity. For the action relationship "pedestrian crossing the road", the model assigns a higher RGB weight and identifies action relationships such as "crossing" from the texture features of the zebra crossing, effectively addressing the ambiguity in fine-grained relationship recognition.

[0064] S6: Based on the object category and segmentation mask information output by the panoramic segmentation branch and the relationship prediction results output by the relationship prediction branch, an RGB-T panoramic scene graph containing object nodes and relationship edges between objects is constructed.

[0065] Specifically, the panoramic segmentation branch outputs the category and pixel-level segmentation mask information of all identified objects in the scene, and each object is treated as a node in the panoramic scene graph; the relation prediction branch outputs subject-predicate-object relation triples, which are treated as directed edges between the corresponding nodes; and a complete RGB-T panoramic scene graph $G = (O, R)$ is constructed from all nodes and directed edges, where $O$ is the set of objects, containing all detected object instances in the scene, and $R$ is the set of relations, containing the semantic relationships between all objects, enabling a high-level, structured understanding of the scene.

[0066] To verify the effectiveness of the method provided in this invention, a dedicated dataset for RGB-T panoramic scene graphs, VISA (Visible and Infrared Scene graph Annotations), was constructed, and comprehensive comparative and ablation experiments were conducted, as detailed below. 1. Construction of the VISA dataset. The VISA dataset is built upon the publicly available FMB dataset for RGB-T semantic segmentation (from the paper "Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation" by Liu JY et al., published in the proceedings of the 2023 IEEE/CVF International Conference on Computer Vision). First, instance segmentation is performed on the basis of the semantic segmentation, and then relationships between different instances are annotated. The dataset contains 1500 pairs of pixel-aligned RGB and thermal infrared images, divided into 743 pairs of normal scenes and 757 pairs of challenging scenes (including harsh environments such as low light, strong light, rain, and fog). The dataset annotation system is as follows: (1) Object annotation: a total of 12 semantic object categories are annotated, including 7 foreground categories (trafficlight, trafficsign, person, car, truck, bus, two-wheeler) and 5 background categories (road, sidewalk, building, vegetation, sky). All objects are annotated with pixel-level instance segmentation masks and category labels. (2) Relationship annotation: a total of 11 semantic predicates are annotated, divided into three categories: ① positional relations: over, in front of, on, beside, attached to; ② action relations: walking on, standing on, crossing; ③ vehicle status relations: driving on, parked on, temporarily stopped on. The annotation results are standardized subject-predicate-object relation triples. The dataset is divided into a training set and a test set at a ratio of 1178:322, using a sequence-level stratified random split to strictly avoid temporal data leakage. Both the training set and the test set cover normal and challenging scenes in a balanced manner.
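A sequence-level stratified split of the kind described above can be sketched as follows. The sketch assumes each sample carries a sequence identifier and a scene type, and that each sequence belongs to a single scene type; the tuple layout, the function name `split_by_sequence`, and the default test fraction are hypothetical and are not taken from the VISA construction procedure.

```python
import random


def split_by_sequence(samples, test_fraction=0.2, seed=0):
    """Split samples so that no sequence appears in both train and test.

    samples: list of (sample_id, sequence_id, scene_type) tuples.
    Sequences are sampled separately within each scene type (stratification).
    """
    rng = random.Random(seed)
    strata = {}
    for _, sequence_id, scene_type in samples:
        strata.setdefault(scene_type, set()).add(sequence_id)

    test_sequences = set()
    for scene_type in sorted(strata):
        sequences = sorted(strata[scene_type])  # sorted for reproducibility
        rng.shuffle(sequences)
        n_test = max(1, round(len(sequences) * test_fraction))
        test_sequences.update(sequences[:n_test])

    train = [s for s in samples if s[1] not in test_sequences]
    test = [s for s in samples if s[1] in test_sequences]
    return train, test
```

Assigning whole sequences rather than individual frames to a split is what prevents near-duplicate neighboring frames from leaking between training and test data.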

[0067] 2. Experimental Setup. The basic architecture of this invention is built on the Pair-Net framework, with the core hyperparameters set to 100 object queries and a feature dimension of 256. Training is conducted end-to-end with the AdamW optimizer, an initial learning rate of ... and a weight decay of ... The total batch size is 8, and the total number of training epochs is 50. The learning rate is reduced to 0.1 times its original value at the 35th and 45th epochs, respectively. The total loss function for training is:

$$L_{total} = \lambda_1 L_{sub} + \lambda_2 L_{obj} + \lambda_3 L_{pred} + \lambda_4 L_{ppn} + \lambda_5 L_{seg}^{rgb} + \lambda_6 L_{seg}^{t}$$

[0068] where $L_{sub}$, $L_{obj}$, and $L_{pred}$ are the cross-entropy classification losses for the subject, object, and predicate, respectively; $L_{ppn}$ is the binary cross-entropy loss of the pairing proposal network; and $L_{seg}^{rgb}$ and $L_{seg}^{t}$ are the panoramic segmentation losses (Dice loss + cross-entropy loss) of the first and second pixel decoders, respectively. The loss weights $\lambda_1, \dots, \lambda_6$ are consistent with Pair-Net.
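The learning-rate schedule in the experimental setup can be sketched as a step decay. The sketch assumes that epochs are counted from 1, that the reduced rate applies from the milestone epoch onward, and that the two reductions compound (the text is ambiguous on this point). Since the initial learning rate is not legible in the text, the example call uses a placeholder base value of 1.0 so that the result reads as a multiplier.

```python
def learning_rate(epoch, base_lr, milestones=(35, 45), gamma=0.1):
    """Step decay: multiply base_lr by gamma once for each milestone reached.

    epoch is 1-indexed; the reduced rate takes effect at the milestone epoch.
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# Multipliers over the 50 training epochs, using a placeholder base_lr of 1.0.
schedule = [learning_rate(epoch, 1.0) for epoch in range(1, 51)]
```

Under these assumptions the multiplier is 1 for epochs 1 to 34, 0.1 for epochs 35 to 44, and 0.01 for epochs 45 to 50.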

[0069] The evaluation uses the standard metrics for panoramic scene graph generation, which fall into two categories. The first category measures relation prediction performance: recall (R@20) and mean recall (mR@20), where mR@20 evaluates relation prediction performance under a long-tailed predicate distribution. The second category measures panoramic segmentation accuracy: Panoptic Quality (PQ), which quantifies the accuracy and completeness of pixel-level object segmentation.
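The two relation metrics can be sketched for a single image as follows. The sketch assumes that predicted subjects and objects have already been matched to ground-truth instances (in practice by mask overlap), so that triples can be compared by index; a dataset-level score would additionally aggregate over images. The function names and tuple layouts are hypothetical.

```python
def _top_k_triples(predictions, k):
    """Keep the k highest-scoring predicted triples as a set of (subj, pred, obj)."""
    ranked = sorted(predictions, key=lambda item: -item[0])[:k]
    return {(subj, pred, obj) for _, subj, pred, obj in ranked}


def recall_at_k(predictions, ground_truth, k=20):
    """Fraction of ground-truth triples recovered among the top-k predictions.

    predictions: list of (score, subj, pred, obj); ground_truth: set of triples.
    """
    if not ground_truth:
        return 0.0
    top = _top_k_triples(predictions, k)
    return sum(1 for triple in ground_truth if triple in top) / len(ground_truth)


def mean_recall_at_k(predictions, ground_truth, k=20):
    """Recall computed separately per predicate class, then averaged over classes."""
    top = _top_k_triples(predictions, k)
    per_predicate = {}
    for triple in ground_truth:
        hits, total = per_predicate.get(triple[1], (0, 0))
        per_predicate[triple[1]] = (hits + (triple in top), total + 1)
    if not per_predicate:
        return 0.0
    return sum(h / t for h, t in per_predicate.values()) / len(per_predicate)
```

Because mean recall gives every predicate class equal weight, a rare predicate such as "crossing" influences the score as much as a frequent one such as "on", which is why it is the more informative metric under a long-tailed distribution.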

[0070] 3. Experimental Results and Analysis. (1) Quantitative results analysis. The method provided in this invention was compared with Pair-Net, currently the strongest panoramic scene graph generation method. As shown in Figure 5, the method provided by this invention achieves the best performance on all evaluation metrics. On the full-scene test set, it achieves an R@20 of 64.3%, an improvement of 6.6% over Pair-Net, and an mR@20 of 42.0%, an improvement of 6.7%, demonstrating its advantage in handling fine-grained relationships. In challenging scenarios such as low light, the performance of Pair-Net, which uses only RGB, degrades significantly, while the method provided by this invention maintains high performance, with an improvement of nearly 10 percentage points. This shows that introducing the thermal infrared modality and combining it with the modality-decoupled fusion mechanism effectively compensates for the failure of RGB in harsh environments. Furthermore, as shown in Figure 6, the method provided by this invention also improves the accuracy of panoramic segmentation. In challenging scenarios such as low light, the PQ (Panoptic Quality) improvement reaches 3.7 percentage points, which is higher than the 1.3 percentage point improvement in normal scenarios. This result indicates that, through the asymmetric geometric refinement of the OAMA module, the method uses the boundary features of thermal infrared images to correct blurred and missed object contours in RGB images, thereby improving panoramic segmentation accuracy in complex scenes and providing a more accurate object basis for panoramic scene graph generation.

[0071] This invention provides an electronic device, including: a computer-readable storage medium and a processor; The computer-readable storage medium is used to store executable instructions; The processor is used to read executable instructions stored in the computer-readable storage medium and execute the training method or generation method as described in any of the above embodiments.

[0072] This invention provides a computer-readable storage medium storing computer instructions for causing a processor to execute a training method or generation method as described in any of the above embodiments.

[0073] This invention provides a computer program product, including a computer program or instructions, which, when executed by a processor, implement the training method or generation method as described in any of the above embodiments.

[0074] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A multimodal image panoramic scene generation system, characterized in that it comprises: a dual-stream independent backbone network consisting of two backbone networks with identical structures but non-shared weights, used to extract M levels of RGB features and M levels of thermal infrared features from the registered RGB image and thermal infrared image, respectively, where M>2; M-2 object-aware modal attention modules OAMA_1, ..., OAMA_{M-2}, whose inputs are connected one-to-one with the 2nd, ..., (M-1)-th feature extraction layers of the dual-stream independent backbone network and whose outputs are connected to the first and second pixel decoders, each object-aware modal attention module including a semantic injection branch, used to semantically enhance the input thermal infrared features based on the input RGB features to obtain enhanced thermal infrared features, and a geometric refinement branch, used to perform geometric boundary enhancement on the input RGB features based on the input thermal infrared features to obtain enhanced RGB features; a first pixel decoder, used to perform multi-scale feature transformation on the RGB features output from the last feature extraction layer of the dual-stream independent backbone network and the enhanced RGB features output by OAMA_1, ..., OAMA_{M-2} to obtain RGB memory features; a second pixel decoder, used to perform multi-scale feature transformation on the thermal infrared features output from the last feature extraction layer of the dual-stream independent backbone network and the enhanced thermal infrared features output by OAMA_1, ..., OAMA_{M-2} to obtain thermal infrared memory features; a panoramic segmentation branch, including an object query decoder, which takes the RGB memory features as input and is used to extract object queries, and a panoramic mask prediction head, which takes the object queries and the RGB memory features as input and is used to extract the object categories and segmentation mask information in the scene; a relationship prediction branch, including a pairing proposal network, a relation-aware modality selection module, and a relation classifier, wherein the pairing proposal network takes the object queries as input and is used to generate subject-object pairs, and the relation-aware modality selection module includes first and second cross-attention layers, a feature fusion layer, a modal router, and a third computation module, and is used to process each subject-object pair as a query vector to obtain the corresponding relation embedding features; and a panoramic scene graph construction module, used to form a panoramic scene graph by treating each object in the scene as a node and using the subject-predicate-object relation triples as directed edges between the corresponding nodes, wherein each node corresponds to a segmentation instance, including the object's category and segmentation mask information.

2. The system as described in claim 1, characterized in that the semantic injection branch includes a global average pooling layer, a first MLP, a first activation function layer, and a first computation module; the input RGB features are processed sequentially by the global average pooling layer, the first MLP, and the first activation function layer and converted into channel attention weights, and the first computation module computes the enhanced thermal infrared features by multiplying the input thermal infrared features element-wise by the channel attention weights and adding the product to the input thermal infrared features; the geometric refinement branch includes a spatial convolutional layer, a second activation function layer, and a second computation module; the input thermal infrared features are processed sequentially by the spatial convolutional layer and the second activation function layer and converted into spatial attention weights, and the second computation module computes the enhanced RGB features by multiplying the input RGB features element-wise by the spatial attention weights and adding the product to the input RGB features.

3. The system as described in claim 1 or 2, characterized in that the processing includes: the first and second cross-attention layers perform cross-attention operations between the query vector and the RGB memory features and the thermal infrared memory features, respectively, to obtain the semantic context features of the RGB modality and the structural relationship features of the thermal infrared modality; the feature fusion layer concatenates the semantic context features and the structural relationship features to obtain the fused feature; the modal router, which includes a second MLP and a third activation function layer connected in sequence, transforms the fused feature into a routing weight; the third computation module uses the routing weight and the difference between 1 and the routing weight as the weights of the semantic context features and the structural relationship features, respectively, and fuses them by weighting to obtain the relation embedding features; and the relation classifier takes the relation embedding features of each subject-object pair as input and predicts the relationship of that pair to obtain a subject-predicate-object relation triple.

4. The system as described in claim 1, characterized in that, after generating the subject-object pairs, the pairing proposal network further sorts all subject-object pairs by confidence and discards all subject-object pairs except the top K pairs, where K is a positive integer.

5. The system as described in claim 1, characterized in that both the first and second pixel decoders adopt the Mask2Former architecture.

6. A training method for the multimodal image panoramic scene generation system as described in any one of claims 1-5, characterized in that it comprises: constructing a training set, wherein the training set includes multiple pairs of labeled RGB images and thermal infrared images of different scenes, and the labels are standardized subject-predicate-object relation triples for each scene; and training the multimodal image panoramic scene generation system in a supervised manner using the training set, wherein the loss function used in the training is:

$$L_{total} = \lambda_1 L_{sub} + \lambda_2 L_{obj} + \lambda_3 L_{pred} + \lambda_4 L_{ppn} + \lambda_5 L_{seg}^{rgb} + \lambda_6 L_{seg}^{t}$$

where $L_{total}$ is the total loss function; $\lambda_1, \dots, \lambda_6$ are weighting coefficients; $L_{sub}$, $L_{obj}$, and $L_{pred}$ are the cross-entropy classification losses for the subject, object, and predicate, respectively; $L_{ppn}$ is the binary cross-entropy loss of the pairing proposal network; and $L_{seg}^{rgb}$ and $L_{seg}^{t}$ are the panoramic segmentation losses of the first and second pixel decoders, respectively.

7. A method for generating panoramic scene graphs from multimodal images, characterized in that it comprises: inputting the registered RGB image and thermal infrared image to be processed into the multimodal image panoramic scene generation system as described in any one of claims 1-5 to obtain a panoramic scene graph.

8. An electronic device, characterized in that it comprises a computer-readable storage medium and a processor; the computer-readable storage medium is used to store executable instructions; and the processor is configured to read the executable instructions stored in the computer-readable storage medium and execute the training method as described in claim 6 or the generation method as described in claim 7.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a processor to perform the training method as described in claim 6 or the generation method as described in claim 7.

10. A computer program product, comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the training method as described in claim 6 or the generation method as described in claim 7.