Visual localization method based on memory self-correction
By dynamically refining visual features through a semantic relevance filtering module and an adaptive memory fusion module, the method addresses inaccurate region proposals and redundant noise in visual localization, achieving higher localization accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF PETROLEUM (EAST CHINA)
- Filing Date
- 2023-12-14
- Publication Date
- 2026-06-26
AI Technical Summary
Existing visual localization methods suffer from suboptimal results due to inaccurate region proposals when dealing with images containing diverse objects and complex relationships. Furthermore, redundancy and noise are easily generated when matching image information with text information.
By employing a semantic relevance filtering module and an adaptive memory fusion module, visual features are dynamically refined, relevant image information is selected based on text queries and adaptively fused to generate finer-grained image features, thereby enhancing the semantic consistency of the model.
The method improves the accuracy of visual localization by enhancing the semantic relevance between images and text, reducing redundant information, and making image-text matching more precise.
Abstract
Description
Technical Field
[0001] This invention pertains to visual localization methods and relates to the technical fields of computer vision and natural language processing.
Background Technology
[0002] The goal of visual localization is to find the objects or regions in an image that are most relevant to a natural language query, which requires establishing a connection between visual and textual information. Because it is widely applied in tasks such as image captioning, visual question answering, and image search, visual localization is an important research topic in the multimodal research community.
[0003] In recent years, visual localization research has made significant progress in fusing information from different modalities. Early visual localization methods can be categorized into single-stage and two-stage methods. Single-stage methods fuse the entire image with the corresponding text information in a single step to predict the localization result. However, they may struggle with images containing diverse objects and complex relationships. Two-stage methods, by contrast, generate visual region features with a pre-trained model in the first stage and model the multimodal relationship between the regions and the text in the second stage. The quality of the region proposals is therefore crucial to performance, as inaccurate initial proposals lead to suboptimal localization results. To mitigate these issues, recent work has applied transformers to visual localization, allowing multimodal inference on pixel-level feature maps. For example, TransVG leverages the transformer framework to capture global contextual knowledge and predicts bounding boxes directly from text queries, achieving accurate target localization. To simplify the modeling process, SeqTR recasts visual localization as a point prediction problem, using a transformer to directly predict discrete coordinate tokens.
[0004] While transformer-based visual localization methods have achieved promising results, they also introduce new challenges. Existing methods simply concatenate the visual features with the textual features extracted for the query by a transformer. Because image information contains more noise than textual information, this can produce redundant visual features that mislead the model. Therefore, it is necessary to eliminate irrelevant visual information based on the text query.
[0005] To address the aforementioned challenges, this invention proposes a novel Memory Self-Correction Network (MSCN) for visual localization. The network dynamically refines visual features based on the features of the text query, resulting in more accurate matching between the text query and the relevant image regions. MSCN consists of two key components: a Semantic Relevance Filtering Module (SRFM) and an Adaptive Memory Fusion Module (AMFM). The SRFM selects and stores image information relevant to the text content, thereby capturing the semantic relevance between the image and the text; the AMFM then adaptively fuses the text-related information stored by the SRFM with the initial image features. Importantly, the AMFM achieves this fusion without directly connecting the image to the memory. This design helps establish a robust connection between the image and the text, thereby improving the performance of visual localization tasks.
Summary of the Invention
[0006] The purpose of this invention is to address the problem that conventional visual localization methods rely on fixed image and text representations to capture cross-modal semantic consistency, which limits the flexibility of adjusting image representations based on different text information.
[0007] The technical solution adopted by the present invention to solve the above-mentioned technical problems is as follows:
[0008] S1. Construct a semantic relevance filtering module to select and store image information related to the text content, thereby capturing the semantic relevance between images and text;
[0009] S2. Construct an adaptive memory fusion module to adaptively fuse the text-related information stored in the semantic relevance filtering module with the initial image features, generating image features with finer-grained details based on the correlation between the stored memory information and the original image;
[0010] S3. Combine the modules in S1 and S2 to construct the overall architecture of the visual localization method based on memory self-correction;
[0011] S4. Train the visual localization method based on memory self-correction.
[0012] 1. The visual localization method based on memory self-correction according to claim 1, characterized in that the specific process of S1 is as follows:
[0013] To improve the refinement of the initial image features in response to the query, a semantic relevance filtering module is designed. A novel memory gate mechanism is also devised, enabling the network to focus on selecting relevant words to refine the initial image features, rather than processing image and text information in isolation. First, the image features V and the query features E are combined, and the memory gate computes the importance of each word in the refinement process:
[0014]
[0015] where σ is the sigmoid function, $w_v$ is a 1×M matrix, and $w_e$ is a 1×N matrix. Then, combining image features and word features, the corrected image features R are written into memory:
[0016]
[0017] where $M_v(\cdot)$ and $M_e(\cdot)$ denote embedding the image features and the text query features into the same dimension through 1×1 convolutions;
[0018] 2. The visual localization method based on memory self-correction according to claim 1, wherein the specific process of S2 is as follows:
[0019] To explore effective multimodal features, this invention proposes an adaptive memory fusion module. Based on the correlation between the stored memory information and the original image, image features with finer-grained details are generated. The corrected image features are adaptively fused with the query to alleviate cross-modal semantic inconsistency. Specifically, we retrieve the relevant memory information and compute weights that represent the similarity probability between each memory unit and each image feature:
[0020]
[0021] where $\alpha_{ij}$ is the similarity probability between the i-th memory and the j-th image feature, and $\delta_k(\cdot)$ is implemented as a 1×1 convolution; then, the updated memory representation is output based on the similarity probabilities:
[0022]
[0023] After obtaining the corrected memory features $u_i$, we integrate them with the image features $v_i$ to generate new image features; an adaptive gate dynamically adjusts the information flow to update the corrected image features:
[0024] $G_m = \sigma(w_g [u_i, v_i] + b_g)$ (5)
[0025] $x_i = u_i * G_m + v_i * (1 - G_m)$
[0026] where $G_m$ is the response gate of feature integration, σ is the sigmoid function, and $w_g$ and $b_g$ are the parameter matrix and the bias, respectively:
[0027] $F = \sigma(E) \odot \sigma(X)$ (6)
[0028] Finally, the Hadamard product between E and X is used to generate the final multimodal features, which are then input into the transformer for feature update and localization.
[0029] 3. The visual localization method based on memory self-correction according to claim 1, characterized in that the specific process of S3 is as follows:
[0030] The aforementioned memory self-correction-based visual localization method includes a semantic relevance filtering module, an adaptive memory fusion module, and a memory self-correction-based visual localization network.
[0031] 4. The visual localization method based on memory self-correction according to claim 1, characterized in that the specific process of S4 is as follows:
[0032] The training method for the memory self-correction-based visual localization method is as follows:
[0033] In our training implementation, we used the Adam optimizer with an initial learning rate of 5e-4 and trained for 60 epochs with a batch size of 32. After 50 epochs, the learning rate was reduced by a factor of 10. We resized the images to 640×640 and used Darknet-53 to encode the multi-scale visual features, whose dimensions were set to 256, 512, and 1024. Query sentences were truncated to a length of 15 for RefCOCO+ and 20 for RefCOCOg. A centralized transformer encoder was responsible for updating the multimodal representation, and the decoder performed autoregressive prediction of the target sequence. The transformer had a hidden dimension of 256, the feedforward network had an expansion ratio of 4, and the decoder consisted of 3 layers. In our experimental evaluation, we used Precision@0.5 as the evaluation metric: a prediction was considered correct if the Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box was greater than 0.5.
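The learning-rate schedule and the Precision@0.5 metric described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, epochs are assumed to be counted from 0, the decay is assumed to divide the learning rate by 10 from epoch 50 onward, and boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates.

```python
def learning_rate(epoch, base_lr=5e-4, decay_epoch=50, factor=0.1):
    """Step schedule: base_lr before decay_epoch, base_lr * factor afterwards.

    Assumes 0-based epochs and a single decay step (an interpretation of the text).
    """
    return base_lr * factor if epoch >= decay_epoch else base_lr


def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_05(predictions, ground_truths):
    """Fraction of predictions whose IoU with the ground truth exceeds 0.5."""
    correct = sum(iou(p, g) > 0.5 for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)
```

For example, a prediction that overlaps its ground-truth box with an IoU of 1/3 counts as incorrect, while an exact match counts as correct.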
[0034] Compared with existing technologies, the beneficial effects of this invention are:
[0035] 1. This invention proposes a visual localization method based on memory self-correction, which dynamically refines visual features based on the feature information of text queries, making the matching of text queries with relevant image regions more accurate, thereby improving the semantic consistency between text and images.
[0036] 2. This invention proposes for the first time a semantic relevance filtering module and an adaptive memory fusion module. The semantic relevance filtering module focuses on filtering out image information that is irrelevant to the query, while the adaptive memory fusion module adaptively fuses text-related representations with the initial image features to enhance the understanding ability of the memory self-correction model.
Attached Figure Description
[0037] Figure 1 is a schematic diagram of the visual localization method based on memory self-correction.
[0038] Figure 2 shows a comparison of the results of the memory self-correction-based visual localization method with other network-based visual localization methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets.
[0039] Figure 3 shows a comparison of ablation experiment results for the memory self-correction-based visual localization method on the RefCOCO dataset.
[0040] Figure 4 is a visualization of the visual localization method based on memory self-correction.
Detailed Implementation
[0041] The accompanying drawings are for illustrative purposes only and should not be construed as limiting the scope of this patent.
[0042] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0043] Figure 1 is a schematic diagram of the visual localization method based on memory self-correction. As shown in Figure 1, our main objective is to predict the coordinates of a bounding box given an image and a query. For the visual backbone, we use Darknet-53 to extract multi-scale visual features at three different resolutions. These visual features are then flattened to create a set of fused visual features, which serve as input to the semantic relevance filtering module. For the text backbone, we employ a bidirectional GRU to generate text embeddings. We then design a memory self-correction network that dynamically corrects the visual features based on the query and adaptively fuses textual information to generate multimodal features, which are fed into an autoregressive transformer for intra-modal and inter-modal inference. A decoder autoregressively generates discrete coordinates representing the predicted bounding box, providing the required grounding information.
[0044] To improve the refinement of the initial image features in response to the query, a semantic relevance filtering module is designed. First, the image features V and the query features E are combined, and a memory gate computes the importance of each word in the refinement process:
[0045]
[0046] where σ is the sigmoid function, $w_v$ is a 1×M matrix, and $w_e$ is a 1×N matrix. Then, combining image features and word features, the corrected image features R are written into memory:
[0047]
[0048] where $M_v(\cdot)$ and $M_e(\cdot)$ denote embedding the image features and the text query features into the same dimension through 1×1 convolutions;
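Equations (1) and (2) are not reproduced in this text, so the exact form of the memory gate cannot be restated here. As a rough illustration only, the NumPy sketch below shows one reading that is consistent with the stated shapes ($w_v$ as 1×M, $w_e$ as 1×N). The function name `srfm_sketch`, the projection matrices `P_v` and `P_e` (per-position linear maps standing in for the 1×1 convolutions $M_v$ and $M_e$), and the way the gate is applied to the image features are all assumptions, not the patented formulas.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def srfm_sketch(V, E, P_v, P_e, w_v, w_e):
    """Hypothetical semantic relevance filtering step.

    V: (M, C_v) flattened image features, E: (N, C_e) word features.
    P_v: (C_v, d), P_e: (C_e, d) embed both modalities into dimension d
         (assumed stand-ins for the 1x1 convolutions M_v and M_e).
    w_v: (1, M), w_e: (1, N) gate weights with the shapes given in the text.
    """
    V_d = V @ P_v                            # (M, d) embedded image features
    E_d = E @ P_e                            # (N, d) embedded word features
    gate = sigmoid(w_v @ V_d + w_e @ E_d)    # (1, d) assumed memory gate
    R = gate * V_d                           # (M, d) gated features written to memory
    return gate, R
```

Under this reading the gate takes values between 0 and 1 per feature channel, so it can only attenuate the embedded image features before they are stored.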
[0049] To explore effective multimodal features, this invention proposes an adaptive memory fusion module. Based on the correlation between the stored memory information and the original image, image features with finer-grained details are generated. The corrected image features are adaptively fused with the query to alleviate cross-modal semantic inconsistency. Specifically, we retrieve the relevant memory information and compute weights that represent the similarity probability between each memory unit and each image feature:
[0050]
[0051] where $\alpha_{ij}$ is the similarity probability between the i-th memory and the j-th image feature, and $\delta_k(\cdot)$ is implemented as a 1×1 convolution; then, the updated memory representation is output based on the similarity probabilities:
[0052]
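Equations (3) and (4) are likewise not reproduced in this text. The sketch below illustrates a common way to realize a "similarity probability" between memory units and image features: dot-product scores normalized by a softmax, followed by a weighted read-out. The dot-product similarity, the softmax axis, the function name `memory_readout`, and the matrices `D1`, `D2`, `D3` (linear maps standing in for the 1×1 convolutions $\delta_k$) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_readout(R, V, D1, D2, D3):
    """Hypothetical memory read-out.

    R: (K, d) memory units, V: (M, d) image features.
    D1, D2, D3: (d, d) linear maps (assumed stand-ins for delta_k).
    Returns alpha of shape (K, M), where alpha[i, j] weights memory i for
    image feature j, and U of shape (M, d), one read-out per image feature.
    """
    scores = (R @ D1) @ (V @ D2).T     # (K, M) similarity scores
    alpha = softmax(scores, axis=0)    # normalize over memories for each image feature
    U = alpha.T @ (R @ D3)             # (M, d) updated memory representation
    return alpha, U
```

Normalizing over the memory axis makes each column of `alpha` a probability distribution, so every image feature receives a convex combination of the projected memory units.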
[0053] After obtaining the corrected memory features $u_i$, we integrate them with the image features $v_i$ to generate new image features; an adaptive gate dynamically adjusts the information flow to update the corrected image features:
[0054] $G_m = \sigma(w_g [u_i, v_i] + b_g)$ (5)
[0055] $x_i = u_i * G_m + v_i * (1 - G_m)$
[0056] where $G_m$ is the response gate of feature integration, σ is the sigmoid function, and $w_g$ and $b_g$ are the parameter matrix and the bias, respectively:
[0057] $F = \sigma(E) \odot \sigma(X)$ (6)
[0058] Finally, the Hadamard product between E and X is used to generate the final multimodal features, which are then input into the transformer for feature update and localization.
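The adaptive gate of equation (5) and the fusion of equation (6) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the gate is taken to be a vector of the same width as the features (the text only says $w_g$ is a parameter matrix), and the query features E are assumed to have a shape that broadcasts against X, for example a single pooled sentence vector. The function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_gate(U, V, W_g, b_g):
    """Equation (5): gate between memory read-out U and image features V.

    U, V: (M, d); W_g: (2d, d); b_g: (d,). The gate width d is an assumption.
    """
    G = sigmoid(np.concatenate([U, V], axis=-1) @ W_g + b_g)   # (M, d) response gate G_m
    X = U * G + V * (1.0 - G)                                  # (M, d) gated mixture x_i
    return G, X


def fuse(E, X):
    """Equation (6): Hadamard product of the sigmoid-activated E and X.

    E must broadcast against X (assumed here: a pooled query vector of shape (d,)).
    """
    return sigmoid(E) * sigmoid(X)
```

When the gate is near 1 the output follows the memory read-out, and when it is near 0 the output falls back to the original image features; with zero weights and bias the two are averaged.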
[0059] Figure 2 shows a comparison of the results of the memory self-correction-based visual localization method with other network-based visual localization methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Figure 3 shows a comparison of ablation experiment results for the memory self-correction-based visual localization method on the RefCOCO dataset. As shown in Figures 2 and 3, the visual localization method based on memory self-correction is more accurate than the other models.
[0060] Figure 4 is a visualization of the memory self-correction-based visual localization method. In each prediction step, a coordinate token is generated conditioned on the previously output tokens. Specifically, the prediction of x1 attends to the left side of the target, while the prediction of x2 attends to the right side. Similarly, the prediction of y1 attends to the top of the target, and the prediction of y2 attends to the bottom. This axial attention pattern tends to focus on the boundaries of the object, enabling more precise localization of the referred object.
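To make the notion of discrete coordinate tokens concrete, the sketch below quantizes a bounding box into integer tokens and maps tokens back to pixel coordinates. The number of bins (1000), the token order (x1, y1, x2, y2), and the function names are assumptions chosen for illustration; the text specifies only that the decoder predicts discrete coordinates for a 640×640 input.

```python
def box_to_tokens(box, image_size=640, num_bins=1000):
    """Quantize pixel coordinates into integer tokens in [0, num_bins - 1]."""
    tokens = []
    for coord in box:
        frac = min(max(coord / image_size, 0.0), 1.0)   # clamp to the image extent
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens, image_size=640, num_bins=1000):
    """Map each token back to the pixel coordinate at the center of its bin."""
    return [(t + 0.5) / num_bins * image_size for t in tokens]
```

With these settings each bin covers 0.64 pixels, so a round trip through the tokens changes a coordinate by at most half a bin (0.32 pixels).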
[0061] This invention proposes a novel memory self-correction-based visual localization method built around two key modules: a semantic relevance filtering module (SRFM) and an adaptive memory fusion module (AMFM). The former selects text-related image information and filters out redundant objects, while the latter adaptively fuses the stored text-related features with the original image features. Extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets validate the superiority of our model over several existing methods. In future work, we will improve the model's fine-grained interaction capabilities and attempt to remove the Transformer module to simplify the existing end-to-end visual localization framework.
[0062] Finally, the details of the above examples of the present invention are merely illustrative of the invention. Any modifications, improvements, and substitutions to the above embodiments by those skilled in the art should be included within the scope of protection of the claims of the present invention.
Claims
1. A visual localization method based on memory self-correction, characterized in that the method includes the following steps:
S1. Construct a semantic relevance filtering module to select and store image information related to the text content, thereby capturing the semantic relevance between images and text;
S2. Construct an adaptive memory fusion module to adaptively fuse the text-related information stored in the semantic relevance filtering module with the initial image features, generating image features with finer-grained details based on the correlation between the stored memory information and the original image;
S3. Combine the modules in S1 and S2 to construct the overall architecture of the visual localization method based on memory self-correction;
S4. Train the visual localization method based on memory self-correction;
The specific process of S1 is as follows: To improve the refinement of the initial image features in response to the query, a semantic relevance filtering module is designed. A novel memory gate mechanism is also devised, enabling the network to focus on selecting relevant words to refine the initial image features, rather than processing image and text information in isolation. First, the image features V and the query features E are combined, and the memory gate computes the importance of each word in the refinement process: (1) where σ is the sigmoid function, $w_v$ is a 1×M matrix, and $w_e$ is a 1×N matrix. Then, combining image features and word features, the corrected image features R are written into memory: (2) where $M_v(\cdot)$ and $M_e(\cdot)$ denote embedding the image features and the text query features into the same dimension through 1×1 convolutions;
The specific process of S2 is as follows: To explore effective multimodal features, an adaptive memory fusion module is proposed. Based on the correlation between the stored memory information and the original image, image features with finer-grained details are generated. The corrected image features are adaptively fused with the query to alleviate cross-modal semantic inconsistency. Specifically, the relevant memory information is retrieved and weights are computed that represent the similarity probability between each memory unit and each image feature: (3) where $\alpha_{ij}$ is the similarity probability between the i-th memory and the j-th image feature, and $\delta_k(\cdot)$ is implemented as a 1×1 convolution; then, the updated memory representation is output based on the similarity probabilities: (4) After the corrected memory features are obtained, the image and the output representation are integrated to generate new image features; then, an adaptive gate is used to dynamically adjust the information flow to update the corrected image features: $G_m = \sigma(w_g [u_i, v_i] + b_g)$, $x_i = u_i * G_m + v_i * (1 - G_m)$ (5) where $G_m$ is the response gate of feature integration, σ is the sigmoid function, and $w_g$ and $b_g$ are the parameter matrix and the bias, respectively: $F = \sigma(E) \odot \sigma(X)$ (6) Finally, the Hadamard product between E and X is used to generate the final multimodal features, which are then input into the transformer for feature update and localization;
The specific process of S3 is as follows: The aforementioned memory self-correction-based visual localization method includes a semantic relevance filtering module, an adaptive memory fusion module, and a memory self-correction-based visual localization network;
The specific process of S4 is as follows: The training method for the memory self-correction-based visual localization method is as follows: The Adam optimizer is used with an initial learning rate of 5e-4, and training runs for 60 epochs with a batch size of 32. After 50 epochs, the learning rate is reduced by a factor of 10. The images are resized to 640×640, and Darknet-53 is used to encode the multi-scale visual features, whose dimensions are set to 256, 512, and 1024. Query sentences are truncated to a length of 15 for RefCOCO+ and 20 for RefCOCOg. A centralized transformer encoder is responsible for updating the multimodal representation, and the decoder performs autoregressive prediction of the target sequence. The transformer has a hidden dimension of 256, the feedforward network has an expansion ratio of 4, and the decoder consists of 3 layers. Precision@0.5 is used as the evaluation metric: a prediction is considered correct if the Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box is greater than 0.5.