Fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion

By employing a spatial-semantic jump filtering and fusion method, the problem of differences in the organizational logic of image and text modal information is solved, achieving higher-performance multimodal information processing and enhancing fine-grained alignment and fusion of images and text.

CN122241615A · Pending · Publication Date: 2026-06-19 · KUNMING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KUNMING UNIV OF SCI & TECH
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal task methods perform poorly at the feature fusion level, failing to fully consider the differences in the organizational logic of text and image modal information, resulting in poor fusion effects and difficulty in accurately capturing fine-grained information.

Method used

A spatial-semantic jump filtering and fusion method is adopted, which realizes cross-modal information interaction through bidirectional attention calculation. It combines hierarchical attention and dilated convolution techniques to adaptively filter image and text features and uses a multimodal fusion module for directional fusion.

Benefits of technology

It enhances fine-grained alignment and fusion of images and text, improves the accuracy and robustness of the model, and achieves higher-performance multimodal information processing.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion, belonging to the field of natural language processing. The invention includes: embedding the input text data using a pre-trained language model; extracting visual features from the input image data using a computer vision model; achieving bidirectional flow and fusion of cross-modal information through bidirectional attention computation, breaking down the barriers between the text and image modalities; decomposing long text sequences into overlapping local blocks and, through a hierarchical attention mechanism, interacting deeply with aspect information at the block level to select the most relevant text representations; extracting the fine-grained regions most relevant to the text aspect from global visual features through aspect-guided visual filtering; and adaptively adjusting the fusion intensity of text and image information using a hierarchical multimodal fusion module, guiding the model to focus on the target aspect and perform directional fusion of multimodal information. This invention improves the accuracy and robustness of the model.

Description

Technical Field

[0001] This invention relates to a fine-grained cross-modal fusion method for text and image based on spatial-semantic jump filtering, which belongs to the field of natural language processing.

Background Technology

[0002] In recent years, with the rapid advancement of internet technology and the widespread adoption of smart devices, the self-media industry has experienced unprecedented prosperity. Against this backdrop, the diversity and complexity of user-generated content have also increased significantly, and the content has become distinctly multimodal. Multimodality means that information does not exist in a single form but is expressed through multiple media such as text, images, audio, and video. This trend makes the efficient and accurate extraction of the key content needed for specific tasks from massive amounts of multimodal data a critical research topic for the field of natural language processing, while also bringing new opportunities and challenges to academia.

[0003] However, current mainstream methods for multimodal tasks still have many limitations, especially unsatisfactory performance at the feature fusion level. Most existing multimodal task methods are limited to feature-level fusion paradigms. When dealing with the inherent differences between modalities, they often simply concatenate text and visual features, or fuse them with attention mechanisms, without fully considering the differences in information organization logic between the two. This results in poor fusion and makes it difficult to accurately capture fine-grained information in multimodal data. Regarding differences in basic embeddings, although some methods attempt to map features from different modalities into the same space through complex transformations, they often fail to completely eliminate the gaps between feature representations, which makes it hard for models to learn cross-modal semantic associations. Therefore, this invention proposes a fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion, aiming to enhance fine-grained alignment and fusion of images and text.

Summary of the Invention

[0004] This invention provides a fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering, which addresses the problem that existing methods are mostly limited to feature-level fusion paradigms and do not adequately consider the differences in modal representation when handling image-text complementary fusion, thus affecting the robustness and prediction accuracy of multimodal fusion. This invention enhances fine-grained alignment and fusion of images and text.

[0005] The technical solution of this invention is a fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion, the method comprising:

[0006] Step 1: Obtain the text data and the corresponding image data;

[0007] Step 2, Text and Visual Embedding Representation: Use a pre-trained language model to embed and represent the input text data; use a computer vision model based on the Transformer architecture to extract visual features from the input image data;

[0008] Step 3, Cross-modal feature interaction between text and image: Achieve bidirectional flow and fusion of cross-modal information through bidirectional attention computation, breaking down the modal barriers between text and image;

[0009] Step 4, Adaptive filtering based on semantic logic: Decompose long text sequences into overlapping local blocks, and through a hierarchical attention mechanism, achieve deep interaction with aspect information at the block level to filter out the most relevant text representations.

[0010] Step 5, Adaptive filtering of spatial dimensions: Through aspect-guided visual filtering, extract the fine-grained regions most relevant to the text aspects from global visual features;

[0011] Step 6, Aspect-based image-text fusion: Using a hierarchical multimodal fusion module, the fusion intensity of image and text information is adaptively adjusted to guide the model to focus on the target aspect and perform directional fusion of multimodal information.

[0012] Furthermore, Step 2 specifically includes the following:

[0013] Step 2.1: For a text sentence S, aspect words A, and visual image V, the text features and aspect features are extracted by the pre-trained language model BERT, and the visual features are generated by the ViT visual feature extraction network;

[0014] ;

[0015] ;

[0016] ;

[0017] where L represents the length of the text sequence, B represents the number of samples in the batch, N = H × W represents the number of image regions, H and W are the spatial height and width, respectively, and D_v represents the embedding dimension of the visual image. The remaining terms denote the dimension of the BERT embedding layer and the feature vector extracted by the BERT model.

[0018] Furthermore, Step 3 specifically includes the following:

[0019] Step 3.1, Attention from visual features to text features: With the text features as the query and the visual features as keys and values, multi-head attention computes the similarity between each text query and the visual keys to determine which visual regions each text word should attend to, generating text features related to the visual content. The calculation formula is as follows:

[0020] ;

[0021] where the operator denotes a multi-head attention module, L represents the length of the text sequence, and the feature dimension is that of the BERT embedding layer;

[0022] Step 3.2, Attention from text features to visual features: With the visual features as the query and the text features as keys and values, multi-head attention computes the similarity between each visual query and the text keys to determine which text fragments each image region should attend to, generating visual features highly aligned with the text semantics. The calculation formula is as follows:

[0023] ;

[0024] where N = H × W is the number of image regions, H and W are the spatial height and width, respectively, and D_v represents the embedding dimension of the visual image.
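The two attention directions of Step 3 can be illustrated with a minimal NumPy sketch. It is not the patent's formula, which did not survive in this text. It assumes a single attention head, no learned projections, no batch dimension, and that text and visual features already share one dimension D. The names cross_attention, text, and visual are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    """Single-head scaled dot-product attention (simplifying assumption).

    query:     (n_query, D) features that ask the question
    key_value: (n_kv, D) features used as both keys and values
    """
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)   # (n_query, n_kv) similarities
    weights = softmax(scores, axis=-1)          # each query row sums to 1
    return weights @ key_value                  # (n_query, D)

rng = np.random.default_rng(0)
L, N, D = 6, 9, 8   # hypothetical sizes: 6 tokens, 3 x 3 image regions, dim 8
text = rng.normal(size=(L, D))
visual = rng.normal(size=(N, D))

# Step 3.1: text is the query, visual regions are keys and values.
text_from_visual = cross_attention(text, visual)
# Step 3.2: visual regions are the query, text tokens are keys and values.
visual_from_text = cross_attention(visual, text)
```

Each output keeps the sequence length of its query, so the text branch still has L rows and the visual branch still has N rows.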

[0025] Furthermore, the specific steps of Step 4 include:

[0026] Step 4.1: The text features related to the visual content are divided into a series of overlapping local blocks using a sliding window with a set window size and a step size M. The number of blocks C is calculated by the following formula:

[0027] ;

[0028] Where L represents the length of the text sequence;
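As an illustration of the sliding-window split in Step 4.1, the sketch below uses one common convention for the block count, C = floor((L - window) / M) + 1, which drops a trailing partial block. The patent's own formula is missing from this text, so this convention is an assumption, as are the names split_into_blocks and block_count.

```python
def block_count(length, window, stride):
    # Assumed convention: only full windows are counted.
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if length <= window:
        return 1
    return (length - window) // stride + 1

def split_into_blocks(tokens, window, stride):
    """Split a token list into local blocks; blocks overlap when stride < window."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    blocks = []
    start = 0
    while start + window <= len(tokens):
        blocks.append(list(tokens[start:start + window]))
        start += stride
    return blocks

tokens = list(range(10))                      # stand-in for L = 10 token features
blocks = split_into_blocks(tokens, window=4, stride=2)
```

With a window of 4 and a step of 2, neighbouring blocks share two tokens, which is what lets later steps aggregate context across block boundaries.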

[0029] Step 4.2, Intra-block self-attention enhancement: Block-level masks are used to select the non-empty valid blocks. Each valid block is then passed through a multi-head self-attention layer, so that context is modeled independently within each valid block. The calculation is expressed by the following formula:

[0030] ;

[0031] where the input is the set of valid blocks selected from all blocks and the operator denotes processing by the multi-head self-attention layer;

[0032] Step 4.3, Neighbor Block Aggregation and Gated Modulation: The left and right adjacent blocks of each central block are aggregated to form a neighbor block feature. A gated modulation mechanism is introduced: the dot product of the aspect word features and the neighbor block features is computed and passed through the Sigmoid function to generate importance weights. The gating value quantifies the importance of each context token to the aspect. The calculation formula is:

[0033] ;

[0034] where the window term is the sliding window size, N = H × W is the number of image regions, H and W are the spatial height and width, respectively, the feature dimension is that of the BERT embedding layer, and σ is the Sigmoid function;

[0035] Subsequently, the neighbor block features are weighted by importance to obtain the gated text features. The calculation is as follows:

[0036] ;

[0037] where the operator denotes element-wise multiplication (elements at corresponding positions are multiplied);

[0038] Step 4.4, Sparse cross-attention based on Top-K filtering: For each query vector, only the K most relevant key-value pairs are retained, where K is a preset hyperparameter. With the aspect word features as the query and the gated text features as keys and values, cross-modal attention is computed to obtain the aggregated text features after the interaction between each text block and the aspect word features. The calculation is as follows:

[0039] ;

[0040] where the operator denotes the cross-modal attention computation;

[0041] Finally, a multi-head self-attention layer processes the text features after cross-modal interaction, fusing and enhancing them to produce the final filtered text features, that is, the text features output by the semantic-logic-dimension adaptive filtering module:

[0042] .
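The Top-K filtering described in Step 4.4 can be sketched as follows. This is a single-head NumPy illustration under stated assumptions, not the patent's module: there are no learned projections, the function name topk_sparse_attention is hypothetical, and discarded key-value pairs are masked with negative infinity before the softmax so that they receive exactly zero weight.

```python
import numpy as np

def topk_sparse_attention(query, keys, values, k):
    """Keep only the k highest-scoring keys per query, then apply softmax.

    query: (n_q, D), keys: (n_k, D), values: (n_k, D_v)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)                 # (n_q, n_k)
    k = min(k, scores.shape[-1])
    # Column indices of the k largest scores in every query row.
    top_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, top_idx, 0.0, axis=-1)       # 0 where kept, -inf elsewhere
    masked = scores + mask
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)                             # exp(-inf) is exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

query = np.array([[1.0, 0.0]])                           # one aspect query
keys = np.array([[3.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.0, 5.0]])
values = np.eye(4)                                       # makes the weights easy to read
out, weights = topk_sparse_attention(query, keys, values, k=2)
```

In this toy input only the first two keys survive the filter, so the last two value rows contribute nothing to the output.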

[0043] Furthermore, Step 5 specifically includes the following:

[0044] Step 5.1, Region Feature Extraction Guided by Dilated Convolution: Dilated convolution expands the receptive field without reducing spatial resolution. By adjusting the dilation rate, multi-scale local region features are extracted to capture richer contextual information:

[0045] ;

[0046] where the output is the set of overlapping local blocks, d is the dilation rate, the kernel term is a fixed convolution kernel size, each block contains the feature vectors of a fixed number of spatial locations, and the input is the visual features highly aligned with the text semantics;

[0047] An edge protection mask mechanism is designed; the mask matrix is calculated as follows:

[0048] ;

[0049] where P is the protective boundary calculated from the kernel size and the dilation rate. This mask suppresses the influence of invalid edge regions in the subsequent attention calculation.

[0050] Step 5.2, Top-K Sparse Attention Filtering: The similarity between the aspect word features and each regional sub-unit is calculated and, after masking, used as the semantic relevance score S:

[0051] ;

[0052] where the dimension term is the same feature dimension as that of the aspect word features;

[0053] Based on the relevance score S, a Top-K dynamic selection strategy retains the most relevant regions for each location, yielding the visual features filtered by semantic relevance score. The calculation is as follows:

[0054] ;

[0055] where N = H × W represents the number of image regions, and H and W are the spatial height and width, respectively.

[0056] Subsequently, the selected regions are semantically enhanced using a self-attention mechanism to obtain enhanced visual features:

[0057] ;

[0058] where the operator is a self-attention module;

[0059] Step 5.3, Sparse Cross-Attention Layer: A gating modulation mechanism is introduced to dynamically weigh the importance of visual information: the aspect word features and the enhanced visual features are passed through a sigmoid calculation to obtain importance weights, which are then applied to the enhanced visual features to obtain the gated visual features. The calculation process is as follows:

[0060]

[0061]

[0062] where σ is the Sigmoid function;

[0063] Deep interaction between the visual features and the aspect is achieved through sparse cross-attention, yielding the visual features filtered by the aspect words. The calculation process is as follows:

[0064]

[0065]

[0066] where the output represents the visual features after the spatial-dimension adaptive filtering module, and the two operators are a sparse cross-attention module and a self-attention module, respectively.

[0067] Furthermore, Step 6 specifically includes the following:

[0068] Step 6.1, Feature Space Alignment: First, the visual features from the spatial-dimension adaptive filtering module and the text features from the semantic-logic-dimension adaptive filtering module are mapped into a common space, yielding the aligned visual features and the aligned text features. The calculation process is as follows:

[0069]

[0070]

[0071] where B represents the number of samples in the batch and D1 represents the aligned feature dimension; the remaining dimension terms indicate the number of visual regions and the number of text units in each sample; the two projection matrices are learnable, each with a corresponding bias term; and GELU is the activation function, which provides a smooth nonlinear transformation;

[0072] Step 6.2, Residual Modal Interaction: A two-branch sparse attention mechanism lets the aspect word features interact separately with the aligned visual features and the aligned text features, producing aggregated visual features and aggregated text features. Layer normalization is then applied to obtain the final visual features and text features. The calculation process is as follows:

[0073] ;

[0074] ;

[0075] ;

[0076] ;

[0077] where the attention operator represents a sparse cross-modal attention mechanism and the normalization operator represents layer normalization;

[0078] Step 6.3, Dynamic Gating Fusion: To adaptively balance the contribution of visual and textual information to the final decision, a dynamic gating mechanism is designed. The gating weights are calculated as follows:

[0079]

[0080]

[0081] where the two weight matrices are the gating weight matrices, GELU is the activation function providing a smooth nonlinear transformation, both bias symbols represent bias terms, the sigmoid operator denotes the sigmoid calculation, and the output is the gated fusion result;
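A minimal sketch of one way the dynamic gating of Step 6.3 could be wired, using the ingredients the text names (two weight matrices, two bias terms, GELU, and a sigmoid). The exact composition is an assumption, because the formulas are missing here: the gate is computed from the concatenated visual and text features, and the fused result is a convex combination gate * visual + (1 - gate) * text. All names (gated_fusion, w1, b1, w2, b2) are hypothetical, and GELU uses its tanh approximation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(visual, text, w1, b1, w2, b2):
    """Assumed form: gate = sigmoid(GELU([visual; text] W1 + b1) W2 + b2)."""
    joint = np.concatenate([visual, text], axis=-1)   # (B, 2D)
    hidden = gelu(joint @ w1 + b1)                    # (B, D)
    gate = sigmoid(hidden @ w2 + b2)                  # (B, D), values in (0, 1)
    fused = gate * visual + (1.0 - gate) * text       # per-feature mixing ratio
    return fused, gate

rng = np.random.default_rng(0)
B, D = 2, 4                                           # hypothetical batch and feature sizes
v = rng.normal(size=(B, D))                           # pooled visual features
t = rng.normal(size=(B, D))                           # pooled text features
w1, b1 = rng.normal(size=(2 * D, D)), np.zeros(D)
w2, b2 = rng.normal(size=(D, D)), np.zeros(D)
fused, gate = gated_fusion(v, t, w1, b1, w2, b2)
```

Because the gate depends on the input features, the mixing ratio changes from sample to sample; with all-zero parameters it degenerates to a plain average of the two modalities.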

[0082] Step 6.4, Joint Inference Enhancement: A Transformer encoder layer is used to perform deep inference on the fused features. This layer further mines the semantic relationships within the fused features through a self-attention mechanism, obtaining the final text-visual fusion feature representation:

[0083] ;

[0084] where the operator is the Transformer encoder layer.

[0085] The beneficial effects of this invention are:

[0086] This invention proposes a fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion. The method lets the text and visual modalities interact effectively and, through adaptive filtering in both the spatial and semantic logic dimensions, fully mines the fine-grained visual features related to the text in the image modality, enhancing the expressive power of image modality information and bridging modal differences. Finally, it uses a cross-modal gated adaptive fusion method to fuse the text and visual modalities, achieving higher-performance image-text fusion, enhancing fine-grained alignment and fusion of images and text, and ultimately improving the accuracy and robustness of the model.

Attached Figure Description

[0087] Figure 1 is a schematic diagram of the framework of the fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion of this invention.

Detailed Implementation

[0088] Example 1: The method proposed in this example was implemented on two multimodal sentiment analysis datasets (Twitter2015 and Twitter2017);

[0089] Two benchmark datasets, Twitter2015 and Twitter2017, were used for sentiment analysis. The two Twitter datasets collected user posts from 2014-2015 and 2016-2017, respectively, both containing multimodal text and images. Each tweet in the datasets contains a sentence, an image, and at least one aspect word labeled with sentiment polarity. The sentiment polarity of aspect words in the Twitter datasets is divided into three categories: positive, neutral, and negative. Each dataset was divided into training, development, and test sets. Statistical information for each dataset is shown in Table 1.

[0090] Table 1 shows the statistics of the dataset.

[0091] As shown in Figure 1, a fine-grained image-text cross-modal fusion method based on spatial-semantic jump filtering and fusion is described, the method comprising:

[0092] Step 1: Obtain the text data and the corresponding image data; The text data and corresponding image data used in this invention were obtained from public datasets. These datasets have been verified and used by previous researchers and have a certain degree of credibility and representativeness. These datasets were carefully screened to ensure sufficient quality and quantity to meet research needs. To demonstrate the effectiveness of the proposed method, the model corresponding to the method of this invention was evaluated on two types of test sets, including (1) Twitter2015 and (2) Twitter2017. All datasets used are public datasets, and their sources are all publicly available and legal, so there is no need to worry about infringement or intellectual property issues. The usage agreements and terms of the data providers were strictly followed to ensure the legal acquisition and use of the data.

[0093] Step 2, Text and Visual Embedding Representation: Use a pre-trained language model to embed and represent the input text data; use a computer vision model based on the Transformer architecture to extract visual features from the input image data;

[0094] Step 3, Cross-modal feature interaction between text and image: Achieve bidirectional flow and fusion of cross-modal information through bidirectional attention computation, breaking down the modal barriers between text and image;

[0095] Step 4, Adaptive filtering based on semantic logic: Decompose long text sequences into overlapping local blocks, and through a hierarchical attention mechanism, achieve deep interaction with aspect information at the block level to filter out the most relevant text representations.

[0096] Step 5, Adaptive filtering of spatial dimensions: Through aspect-guided visual filtering, extract the fine-grained regions most relevant to the text aspects from global visual features;

[0097] Step 6, Aspect-based image-text fusion: Using a hierarchical multimodal fusion module, the fusion intensity of image and text information is adaptively adjusted to guide the model to focus on the target aspect and perform directional fusion of multimodal information.

[0098] Furthermore, Step 2 specifically includes the following:

[0099] Step 2.1: For a text sentence S, aspect words A, and visual image V, the text features and aspect features are extracted by the pre-trained language model BERT, and the visual features are generated by the ViT visual feature extraction network;

[0100] ;

[0101] ;

[0102] ;

[0103] where L represents the length of the text sequence, B represents the number of samples in the batch, N = H × W represents the number of image regions, H and W are the spatial height and width, respectively, and D_v represents the embedding dimension of the visual image. The remaining terms denote the dimension of the BERT embedding layer and the feature vector extracted by the BERT model.

[0104] Furthermore, Step 3 specifically includes the following:

[0105] Step 3.1, Attention from visual features to text features: With the text features as the query and the visual features as keys and values, multi-head attention computes the similarity between each text query and the visual keys to determine which visual regions each text word should attend to, generating text features related to the visual content. The calculation formula is as follows:

[0106] ;

[0107] where the operator denotes a multi-head attention module, L represents the length of the text sequence, and the feature dimension is that of the BERT embedding layer;

[0108] Step 3.2, Attention from text features to visual features: With the visual features as the query and the text features as keys and values, multi-head attention computes the similarity between each visual query and the text keys to determine which text fragments each image region should attend to, generating visual features highly aligned with the text semantics. The calculation formula is as follows:

[0109] .

[0110] where N = H × W is the number of image regions, H and W are the spatial height and width, respectively, and D_v represents the embedding dimension of the visual image.

[0111] Furthermore, the specific steps of Step 4 include:

[0112] Step 4.1: The text features related to the visual content are divided into a series of overlapping local blocks using a sliding window with a set window size and a step size M. The number of blocks C is calculated by the following formula:

[0113] ;

[0114] Where L represents the length of the text sequence;

[0115] Step 4.2, Intra-block self-attention enhancement: Block-level masks are used to select the non-empty valid blocks. Each valid block is then passed through a multi-head self-attention layer, so that context is modeled independently within each valid block. The calculation is expressed by the following formula:

[0116] ;

[0117] where the input is the set of valid blocks selected from all blocks and the operator denotes processing by the multi-head self-attention layer;

[0118] Step 4.3, Neighbor Block Aggregation and Gated Modulation: To overcome the field-of-view limitations of a single block, the blocks to the left and right of each central block are aggregated to form a neighbor block feature. To dynamically measure the importance of different neighboring blocks to the current aspect word, a gating modulation mechanism is introduced: the dot product of the aspect word features and the neighbor block features is computed and passed through the Sigmoid function to generate importance weights. The gating value quantifies the importance of each context token to the aspect. The calculation formula is:

[0119] ;

[0120] where the window term is the sliding window size, N = H × W is the number of image regions, H and W are the spatial height and width, respectively, the feature dimension is that of the BERT embedding layer, and σ is the Sigmoid function;

[0121] Subsequently, the neighbor block features are weighted by importance to obtain the gated text features. The calculation is as follows:

[0122] ;

[0123] where the operator denotes element-wise multiplication (elements at corresponding positions are multiplied). This step enables soft selection of contextual information, suppressing irrelevant information and amplifying relevant signals.
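The soft selection performed by the gate in Step 4.3 can be sketched in a few lines of NumPy. The sketch assumes the aspect is summarised as a single pooled vector and that the gate is one scalar per context token. Both are simplifications, and the names gate_neighbor_features, neighbor, and aspect are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_neighbor_features(neighbor, aspect):
    """Weight each context token by its relevance to the aspect.

    neighbor: (T, D) token features of an aggregated neighbor block
    aspect:   (D,) pooled aspect word feature (assumed pooling)
    """
    scores = neighbor @ aspect            # (T,) dot product per token
    gate = sigmoid(scores)                # importance weights in (0, 1)
    gated = gate[:, None] * neighbor      # element-wise scaling of each token
    return gated, gate

neighbor = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
aspect = np.array([2.0, 0.0])
gated, gate = gate_neighbor_features(neighbor, aspect)
```

Tokens aligned with the aspect vector get a gate close to 1, orthogonal tokens get 0.5, and opposed tokens are pushed toward 0, which is the soft selection the paragraph above describes.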

[0124] Step 4.4: To efficiently achieve cross-modal feature interaction and focus on the most relevant information fragments, a sparse cross-attention mechanism based on Top-K filtering is used: for each query vector, only the K most relevant key-value pairs are retained, where K is a preset hyperparameter. With the aspect word features as the query and the gated text features as keys and values, cross-modal attention is computed to obtain the aggregated text features after the interaction between each text block and the aspect word features. The calculation is as follows:

[0125] ;

[0126] where the operator denotes the cross-modal attention computation;

[0127] Finally, a multi-head self-attention layer processes the text features after cross-modal interaction, fusing and enhancing them to produce the final filtered text features, that is, the text features output by the semantic-logic-dimension adaptive filtering module:

[0128] .

[0129] Furthermore, Step 5 specifically includes the following:

[0130] Step 5.1, Region Feature Extraction Guided by Dilated Convolution: Dilated convolution expands the receptive field without reducing spatial resolution. By adjusting the dilation rate, multi-scale local region features are extracted to capture richer contextual information:

[0131] ;

[0132] where the output is the set of overlapping local blocks, d is the dilation rate, the kernel term is a fixed convolution kernel size, each block contains the feature vectors of a fixed number of spatial locations, and the input is the visual features highly aligned with the text semantics;

[0133] To prevent invalid padding at edge locations from interfering with model learning, an edge protection mask mechanism is designed; the mask matrix is calculated as follows:

[0134] ;

[0135] where P is the protective boundary calculated from the kernel size and the dilation rate. This mask suppresses the influence of invalid edge regions in the subsequent attention calculation.
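The edge protection mask can be illustrated as below. The patent's formula for P is missing from this text, so the sketch assumes P = d * (k - 1) // 2, the half-extent of a k x k kernel with dilation rate d, and marks a position as valid only if it lies at least P cells away from every border. The function name edge_protection_mask is hypothetical.

```python
import numpy as np

def edge_protection_mask(height, width, kernel_size, dilation):
    """Return a 0/1 mask that is 1 only where the dilated kernel fits fully.

    Assumption: protective boundary P = dilation * (kernel_size - 1) // 2.
    """
    p = dilation * (kernel_size - 1) // 2
    mask = np.zeros((height, width), dtype=np.float32)
    if height > 2 * p and width > 2 * p:
        mask[p:height - p, p:width - p] = 1.0   # interior positions stay valid
    return mask, p

mask, p = edge_protection_mask(height=7, width=7, kernel_size=3, dilation=2)

# Typical use: push masked positions to -inf so a later softmax ignores them.
scores = np.zeros(7 * 7)
masked_scores = np.where(mask.ravel() > 0, scores, -np.inf)
```

On a 7 x 7 grid with a 3 x 3 kernel and dilation rate 2, the boundary is two cells wide and only the central 3 x 3 positions remain valid.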

[0136] Step 5.2, Top-K Sparse Attention Filtering: To select the parts most relevant to the attribute from a large number of local regions, the similarity between the aspect word features and each regional sub-unit is calculated and, after masking, used as the semantic relevance score S:

[0137] ;

[0138] where the dimension term is the same feature dimension as that of the aspect word features;

[0139] Based on the relevance score S, a Top-K dynamic selection strategy retains the most relevant regions for each location, yielding the visual features filtered by semantic relevance score. The calculation is as follows:

[0140] ;

[0141] where N = H × W represents the number of image regions, and H and W are the spatial height and width, respectively.

[0142] Subsequently, the selected regions are semantically enhanced using a self-attention mechanism to obtain enhanced visual features:

[0143] ;

[0144] where the operator is a self-attention module;

[0145] Step 5.3, Sparse Cross-Attention Layer: To achieve a refined fusion of visual and attribute features, a gating modulation mechanism is introduced to dynamically weigh the importance of visual information: the aspect word features and the enhanced visual features are passed through a sigmoid calculation to obtain importance weights, which are then applied to the enhanced visual features to obtain the gated visual features. The calculation process is as follows:

[0146]

[0147]

[0148] where σ is the Sigmoid function;

[0149] Deep interaction between the visual features and the aspect is achieved through sparse cross-attention, yielding the visual features filtered by the aspect words. The calculation process is as follows:

[0150]

[0151]

[0152] where the output represents the visual features after the spatial-dimension adaptive filtering module, and the two operators are a sparse cross-attention module and a self-attention module, respectively.

[0153] Furthermore, Step 6 specifically includes the following:

[0154] Step 6.1, Feature Space Alignment: Since visual and text features usually come from different encoders and have different feature distributions, to facilitate cross-modal interaction, the visual features from the spatial-dimension adaptive filtering module and the text features from the semantic-logic-dimension adaptive filtering module are first mapped into a common space, yielding the aligned visual features and the aligned text features. The calculation process is as follows:

[0155]

[0156]

[0157] where B represents the number of samples in the batch and D1 represents the aligned feature dimension; the remaining dimension terms indicate the number of visual regions and the number of text units in each sample; the two projection matrices are learnable, each with a corresponding bias term; and GELU is the activation function, which provides a smooth nonlinear transformation and ensures a more consistent feature distribution after alignment.

[0158] Step 6.2, Residual Modal Interaction: A two-branch sparse attention mechanism lets the aspect word features interact separately with the aligned visual features and the aligned text features, producing aggregated visual features and aggregated text features. Layer normalization is then applied to obtain the final visual features and text features. The calculation process is as follows:

[0159] ;

[0160] ;

[0161] ;

[0162] ;

[0163] where the attention operator represents a sparse cross-modal attention mechanism and the normalization operator represents layer normalization. The sparse attention mechanism ensures that only the most relevant information is attended to, the residual connections protect the integrity of the original aspect information, and layer normalization stabilizes the training process;
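The residual-plus-normalization pattern mentioned above can be sketched as follows. The exact wiring is an assumption, since the formulas are missing from this text: each branch is taken to add the attended features back onto the aspect features and then apply layer normalization. The learnable scale and shift of layer normalization are omitted, and the names layer_norm and residual_branch are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and (almost) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_branch(aspect, attended):
    """Assumed form: LayerNorm(aspect + attended).

    aspect:   (n, D) aspect word features, carried through the skip path
    attended: (n, D) output of the sparse cross-modal attention for one branch
    """
    return layer_norm(aspect + attended)

aspect = np.array([[1.0, 2.0, 3.0, 4.0]])
attended = np.zeros((1, 4))        # stand-in attention output for one branch
out = residual_branch(aspect, attended)
```

Even when the attention output carries no signal, the skip path passes the aspect features through, which is the sense in which the residual connection protects the original aspect information.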

[0164] Step 6.3, Dynamic Gating Fusion: To adaptively balance the contribution of visual and textual information to the final decision, a dynamic gating mechanism is designed. The gating weights are calculated as follows:

[0165]

[0166]

[0167] where the two weight matrices are the gating weight matrices, GELU is the activation function providing a smooth nonlinear transformation, both bias symbols represent bias terms, the sigmoid operator denotes the sigmoid calculation, and the output is the gated fusion result. The gated fusion mechanism dynamically adjusts the fusion ratio of visual and textual information according to the characteristics of the input sample, achieving content-adaptive multimodal fusion.

[0168] Step 6.4, Joint Inference Enhancement: A Transformer encoder layer is used to perform deep inference on the fused features. This layer further mines the semantic relationships within the fused features through a self-attention mechanism, obtaining the final text-visual fusion feature representation:

[0169] ;

[0170] where the operator is the Transformer encoder layer, which further mines the semantic relationships within the fused features through a self-attention mechanism, enhancing the model's reasoning ability.

[0171] Evaluation metrics: Accuracy (Acc) and macro F1 score (F1) are used as evaluation metrics to compare the model corresponding to the method of this invention with other benchmark models. Higher Acc and F1 scores indicate better performance.
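The two metrics can be computed as in the following sketch. Macro F1 is the unweighted mean of the per-class F1 scores, so minority classes count as much as majority ones. The three sentiment labels and the sample predictions are made-up illustrative data, not results from the experiments.

```python
def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    # Per-class F1 = 2*TP / (2*TP + FP + FN), averaged with equal class weight.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for six samples.
y_true = ["pos", "neg", "neu", "pos", "neu", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```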

[0172] To verify the effectiveness of the proposed model, the model corresponding to the method of this invention is compared with other state-of-the-art sentiment analysis models on two image-text datasets. The compared models are as follows:

[0173] MIMN explores attention-based interactions between aspects, sentences, and related images through a multi-hop memory network.

[0174] TomBERT learns aspect-sensitive text features, matches aspect-image pairs to obtain visual features, and uses a self-attention strategy to capture multimodal interactions.

[0175] MMAP reveals hidden relationships between sentences, aspects, and images through multimodal interaction layers and adversarial training.

[0176] AMIFN improves the accuracy of fine-grained sentiment analysis by integrating attention mechanisms and graph convolutional networks to enable multi-perspective interaction and fusion for specific aspects.

[0177] ESAFN divides sentences into left and right contexts, uses attention mechanisms to explore aspect-text and aspect-image interactions, and finally fuses these features through a bilinear layer for sentiment prediction.

[0178] Res-BERT combines visual features extracted by ResNet with hidden representations from BERT.

[0179] HIMT performs aspect-text and aspect-image interactions and builds an auxiliary reconstruction module to eliminate semantic differences between different modalities.

[0180] VLP-MABSA is a task-specific vision-language pre-trained model that models aspects, opinions, and their alignment through five specialized pre-training tasks.

[0181] KEF-TomBERT is a knowledge augmentation framework that improves task performance by associating images with adjective-noun pairs.

[0182] The method of this invention was first compared with the above models on two image-text benchmark datasets, Twitter2015 and Twitter2017. The results are shown in Table 2, from which the following conclusions were drawn:

[0183] On the Acc and F1 metrics, this method outperforms existing state-of-the-art models, improving the scores by 2-3 points on most test sets. This indicates that the method proposed in this invention can effectively perform fine-grained image-text alignment and fusion. Taken together, the experimental results on the two datasets show that the proposed method is effective in multimodal tasks.

[0184] Table 2: Comparison of experimental results on the Twitter2015 and Twitter2017 datasets

[0185] To further verify the effectiveness of the module proposed in this invention, ablation experiments were conducted on two datasets. The experimental results are shown in Table 3.

[0186] Table 3: Ablation experimental results on the two datasets

[0187] Table 3 shows that when a module is removed from the model corresponding to the method of this invention, the Acc and F1 scores on both datasets decrease significantly and the model performance degrades considerably. This demonstrates the effectiveness and importance of the image-text interaction module, the adaptive filtering module of the semantic logic dimension, and the adaptive filtering module of the spatial dimension in the multimodal image-text fusion task.

[0188] The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.

Claims

1. A fine-grained image-text cross-modal fusion method for spatial-semantic jump filtering and fusion, characterized in that the method includes: Step 1: Obtain the text data and the corresponding image data; Step 2, Text and visual embedding representation: Use a pre-trained language model to produce embedding representations of the input text data; use a computer vision model based on the Transformer architecture to extract visual features from the input image data; Step 3, Cross-modal feature interaction between text and image: Achieve bidirectional flow and fusion of cross-modal information through bidirectional attention computation, breaking down the modal barriers between text and image; Step 4, Adaptive filtering of the semantic logic dimension: Decompose long text sequences into overlapping local blocks and, through a hierarchical attention mechanism, achieve deep interaction with aspect information at the block level to filter out the most relevant text representations; Step 5, Adaptive filtering of the spatial dimension: Through aspect-guided visual filtering, extract the fine-grained regions most relevant to the text aspects from the global visual features; Step 6, Aspect-based image-text fusion: Using a hierarchical multimodal fusion module, adaptively adjust the fusion intensity of image and text information to guide the model to focus on the target aspect and perform directional fusion of multimodal information.

2. The fine-grained image-text cross-modal fusion method for spatial-semantic jump filtering and fusion according to claim 1, characterized in that: Step 2 specifically includes the following: Step 2.1: For a text sentence S, an aspect word A, and a visual image V, the text feature and the aspect word feature are extracted by the pre-trained language model BERT, and the visual feature is generated by the visual feature extraction network ViT: ; ; ; where L represents the length of the text sequence, B represents the number of samples in the batch, the text feature dimension is the dimension of the BERT embedding layer, N = H * W represents the number of image regions, where H and W are the spatial height and width, respectively, D_v represents the embedding dimension of the visual image, and the text and aspect word features are the feature vectors extracted by the BERT model.

3. The fine-grained image-text cross-modal fusion method for spatial-semantic jump filtering and fusion according to claim 1, characterized in that: Step 3 specifically includes the following: Step 3.1, Attention from visual features to text features: With the text features as the query and the visual features as the keys and values, multi-head attention computes the similarity between the text query and the visual keys, determines which visual regions each text word should attend to, and generates text features related to the visual content. The calculation formula is as follows: ; where the operator represents a multi-head attention module, L represents the length of the text sequence, and the text feature dimension is the dimension of the BERT embedding layer; Step 3.2, Attention from text features to visual features: With the visual features as the query and the text features as the keys and values, multi-head attention computes the similarity between the visual query and the text keys, determines which text fragments each image region should attend to, and generates visual features highly aligned with the text semantics. The calculation formula is as follows: ; where N = H * W is the number of image regions, H and W are the spatial height and width, respectively, and D_v represents the embedding dimension of the visual image.

4. The fine-grained image-text cross-modal fusion method for spatial-semantic jump filtering and fusion according to claim 1, characterized in that: The specific steps of Step 4 include: Step 4.1: The text features related to visual content are divided into a series of local blocks using a sliding window, where the window size is set to... the step size is M, and the number of blocks is C, calculated by the following formula: ; where L represents the length of the text sequence; Step 4.2, Intra-block self-attention enhancement: A block-level mask is used to select the non-empty valid blocks; each valid block is then passed through a multi-head self-attention layer, so that context is modeled independently within each valid block. The calculation is expressed by the following formula: ; where the block set is the set of valid blocks selected from all blocks, and the operator denotes multi-head self-attention layer processing; Step 4.3, Neighbor block aggregation and gated modulation: The left and right adjacent blocks of each central block are aggregated to form the neighbor block features. A gated modulation mechanism is introduced: the dot product of the aspect word features and the neighbor block features is computed and passed through the Sigmoid function to generate importance weights. The gating value quantifies the importance of each context token to the aspect. The importance weight is calculated as: ; where the symbols denote the sliding window size and the dimension of the BERT embedding layer, N = H * W is the number of image regions, H and W are the spatial height and width, respectively, and σ is the Sigmoid function; Subsequently, the neighbor block features are weighted by importance to obtain the gated text features. The calculation is as follows: ; where the operator denotes multiplication of elements at corresponding positions; Step 4.4, Top-K filtered sparse cross-attention: For each query vector, only the K most relevant key-value pairs are retained, where K is a preset hyperparameter; with the aspect word features as the query and the gated text features as the keys and values, cross-modal attention is computed to obtain the aggregated text features after the interaction between each text block and the aspect word features. The calculation is as follows: ; where the operator denotes the cross-modal attention computation; Finally, a multi-head self-attention layer fuses and enhances the text features after this interaction to produce the final filtered text features, that is, the text features output by the semantic logic dimension adaptive filtering module: ; .

5. The fine-grained image-text cross-modal fusion method for spatial-semantic jump filtering and fusion according to claim 1, characterized in that: Step 5 specifically includes the following: Step 5.1, Region feature extraction guided by dilated convolution: Dilated convolution is used to expand the receptive field without reducing spatial resolution; by adjusting the dilation rate, multi-scale local region features are extracted to capture richer contextual information: ; where the output represents the set of overlapping local blocks, d represents the dilation rate, the kernel symbol indicates a fixed convolution kernel size, each block contains the feature vectors of its spatial locations, and the input is the visual features highly aligned with the text semantics; An edge protection masking mechanism is designed, and the mask matrix... is calculated as follows: ; where P is the protective boundary calculated from the kernel size and the dilation rate. This mask suppresses the influence of invalid edge regions in the subsequent attention calculation; Step 5.2, Top-K sparse attention filtering: The similarity between the aspect word features and each region sub-unit is computed and, after masking, used as the semantic relevance score S: ; where the feature dimension is the same as that of the aspect word features; Based on the relevance score S, a Top-K dynamic selection strategy retains, for each location, the most relevant regions, yielding the visual features filtered by the semantic relevance scores. The calculation is as follows: ; where N = H * W represents the number of image regions, and H and W are the spatial height and width, respectively. Subsequently, the selected regions are semantically enhanced by a self-attention mechanism to obtain the enhanced visual features: ; where the operator is a self-attention module; Step 5.3, Sparse cross-attention layer: A gating modulation mechanism is introduced to dynamically weigh the importance of visual information: a Sigmoid computation over the aspect word features and the enhanced visual features yields the importance weights, which are then combined with the enhanced visual features to obtain the gated visual features. The calculation process is as follows: ; ; where σ is the Sigmoid function; Deep interaction between the aspect words and the visual features is then achieved through sparse cross-attention, yielding the aspect-filtered visual features. The calculation process is as follows: ; ; where the output represents the visual features after passing through the spatial dimension adaptive filtering module, and the two operators represent a sparse cross-attention module and a self-attention module, respectively.

6. The fine-grained image-text cross-modal fusion method for spatial-semantic jump filtering and fusion according to claim 1, characterized in that: Step 6 specifically includes the following: Step 6.1, Feature Space Alignment: First, the visual features output by the spatial dimension adaptive filtering module and the text features output by the semantic logic dimension adaptive filtering module are mapped into a common space, giving the aligned visual features and the aligned text features. The calculation process is as follows: ; ; where B represents the number of samples in the batch and D1 represents the aligned feature dimension; the remaining dimension symbols denote the number of visual regions and the number of text units in each sample; the two projection matrices are learnable, each with its own bias term; GELU is the activation function, providing a smooth nonlinear transformation; Step 6.2, Residual Modal Interaction: A two-branch sparse attention mechanism is adopted: the aspect word features interact separately with the aligned visual features and the aligned text features, yielding the aggregated visual features and the aggregated text features; layer normalization is then applied to obtain the final visual features and text features. The calculation process is as follows: ; ; ; ; where the attention operator is a sparse cross-modal attention mechanism and the normalization operator is layer normalization; Step 6.3, Dynamic Gating Fusion: A dynamic gating mechanism is designed that computes a gating weight. The calculation formula is as follows: ; ; where the two weight matrices are the gating weight matrices and the two associated terms are bias terms; GELU is the activation function, providing a smooth nonlinear transformation; σ denotes the sigmoid function; and the output is the gated fusion feature; Step 6.4, Joint Inference Enhancement: A Transformer encoder layer performs deep inference on the fused features. Through its self-attention mechanism, this layer further mines the semantic relationships within the fused features, yielding the final text-visual fusion feature representation: ; where the operator denotes the Transformer encoder layer.