A method, apparatus, and storage medium for image description training based on feature correction.

By introducing multi-view visual features and a self-supervised alignment loss function into the image description generation model, the overfitting problem of the visual encoder is solved, and more efficient image description generation is achieved, which is suitable for resource-constrained or real-time-critical environments.

CN120747708B · Active · Publication Date: 2026-06-30 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date
2025-06-20
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

When existing image captioning models are applied to new domains or trained with limited data, the visual encoder is prone to overfitting, leading to information loss. Furthermore, existing retrieval-augmented generation strategies increase computational burden and system complexity, limiting the application of these models in resource-constrained or real-time-critical environments.

Method used

By constructing an image description generation framework, multi-view visual features are introduced and the features of the visual encoder and auxiliary encoder are aligned using a self-supervised alignment loss function. Feature fusion is performed by combining diverse patch features to form an end-to-end deep learning model, which reduces information loss and improves visual understanding.

Benefits of technology

Without relying on external retrieval mechanisms, this method enhances the representation capabilities of the visual encoder, reduces information loss, lowers system complexity, and improves the performance of image description generation and inference efficiency, making it suitable for resource-constrained or real-time-critical scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a feature-corrected image description training method, apparatus, and storage medium. The method includes: constructing a basic framework for image description generation, comprising a visual encoder, an auxiliary encoder, a Q-Former module, and a language model interface; acquiring input images for training and inputting them into the visual encoder and auxiliary encoder respectively; extracting main visual features through the visual encoder and extracting auxiliary visual features from different perspectives through the auxiliary encoder; inputting the main visual features into the Q-Former module; and concatenating the filtered auxiliary visual features with the output features of the Q-Former module to form the final visual representation, which is used as input to a deep learning model, thereby training the model end-to-end to generate accurate image descriptions. Compared with existing technologies, this invention not only improves model performance but also simplifies the system structure and increases inference efficiency, providing a more efficient, concise, and scalable technical path for image description generation tasks.

Description

Technical Field

[0001] This invention relates to the field of image description generation technology, and in particular to an image description training method, apparatus and storage medium based on feature correction.

Background Technology

[0002] Image captioning is a key task at the intersection of computer vision and natural language processing, aiming to generate semantically accurate and linguistically natural text descriptions for input images. This task not only places high demands on the model's visual understanding capabilities but also tests its coherence and expressiveness in language generation. In recent years, with the development of large-scale pre-trained language models (LMs) and visual models, image captioning methods based on encoder-decoder architectures have made significant progress.

[0003] Typical methods employ pre-trained visual encoders (such as CLIP) to extract image features and then pass these features to a language model via a projection module to generate natural and rich descriptions. However, the performance of these models often deteriorates when training data is limited or when applied to new domains. The root cause lies in the overfitting of the visual encoder on limited data, leading to an "intrinsic information loss" during image encoding. This means the model fails to learn rich and general visual representations, a deficiency particularly pronounced when facing complex or uncommon scenes.

[0004] To compensate for this deficiency, existing technologies generally employ Retrieval-Augmented Generation (RAG) strategies to address the information loss during visual encoding. These methods typically use pre-trained visual encoders (such as CLIP) to extract image features and combine them with external knowledge bases for contextual supplementation, thereby improving the model's performance in novel or uncommon scenarios. For example, representative works such as “Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world comprehension” (Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, and Hideki Nakayama. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2024.) and “Smallcap: Lightweight image captioning prompted with retrieval augmentation” (Rita Parada Ramos, Bruno Martins, Desmond Elliott, and Yova Kementchedjhieva. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2840–2849, 2022.) both enhance the input of the language model by introducing an external retrieval module and integrating relevant textual information during the inference stage. However, this approach does not fundamentally solve the problem of information loss in the visual encoder itself. Instead, it patches the problem through external compensation, which brings additional computational burden and system complexity, limiting the application of the model in resource-constrained or real-time environments.

[0005] Therefore, there is an urgent need for a new technical solution that can directly enhance the representational ability of the visual encoder and reduce information loss without relying on external retrieval mechanisms, thereby improving the performance of image description generation, while reducing system complexity and improving inference efficiency.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of the prior art by providing a feature-corrected image description training method, apparatus, and storage medium that can directly enhance the representation ability of the visual encoder and reduce information loss without relying on external retrieval mechanisms, thereby improving the performance of image description generation, while reducing system complexity and improving inference efficiency.

[0007] The objective of this invention can be achieved through the following technical solutions:

[0008] A feature-corrected image description training method includes the following steps:

[0009] A basic framework for image description generation is constructed, which includes a visual encoder, an auxiliary encoder, a Q-Former module, and a language model interface.

[0010] Obtain the input image for training and input it into the visual encoder and the auxiliary encoder respectively. Extract the main visual features through the visual encoder and, through the auxiliary encoder, extract auxiliary visual features from a perspective different from that of the visual encoder. Input the main visual features into the Q-Former module.

[0011] After filtering, the auxiliary visual features are concatenated with the output features of the Q-Former module to form the final visual representation, which is fed through the language model interface as the input to the deep learning model, thereby training the deep learning model end-to-end to generate image descriptions.

[0012] Furthermore, the Q-Former module incorporates a self-supervised alignment loss function during processing. This self-supervised alignment loss function is computed so as to maximize the cosine similarity between the output features of the Q-Former module and both the main visual features and the auxiliary visual features.

[0013] Furthermore, the calculation expression for the self-supervised alignment loss function is as follows:

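[0014] $$\mathcal{L}_{\text{align}}^{X} = -\,\mathbb{E}_{I \sim \mathcal{D}}\left[\frac{1}{|X_I|}\sum_{j=1}^{|X_I|}\max_{k}\,\cos\!\left(x_j^I, q_k^I\right)\right]$$

[0015] $$\mathcal{L}_{\text{align}} = \mathcal{L}_{\text{align}}^{\text{CLIP}} + \lambda\,\mathcal{L}_{\text{align}}^{\text{DINO}}$$

[0016] In the formula, $\mathcal{L}_{\text{align}}$ is the self-supervised alignment loss function; $\mathcal{L}_{\text{align}}^{\text{CLIP}}$ is the term based on the cosine similarity between the output features of the Q-Former module and the main visual features; $\mathcal{L}_{\text{align}}^{\text{DINO}}$ is the term based on the cosine similarity between the output features of the Q-Former module and the auxiliary visual features; $\lambda$ is a hyperparameter; $\mathcal{L}_{\text{align}}^{X}$ is the term based on the cosine similarity between the output features of the Q-Former module and features from another source $X$, where $X$ is either the main visual features or the auxiliary visual features, giving $\mathcal{L}_{\text{align}}^{\text{CLIP}}$ or $\mathcal{L}_{\text{align}}^{\text{DINO}}$ respectively; $\mathbb{E}_{I \sim \mathcal{D}}$ is the expectation over images $I$ taken from the set $\mathcal{D}$; $|X_I|$ is the number of other-source features of image $I$; $\cos(x_j^I, q_k^I)$ is the cosine similarity between the j-th other-source feature and the k-th feature output by Q-Former; $x_j^I$ is the j-th other-source feature of image $I$; and $q_k^I$ is the k-th output feature of the Q-Former module for image $I$.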

[0017] Furthermore, the loss function in the language model training process is an image description loss based on cross-entropy loss.

[0018] Furthermore, the image description loss and the self-supervised alignment loss function are jointly optimized by balancing weights.

[0019] Furthermore, the expression for filtering auxiliary visual features is:

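[0020] $$F_{\text{div}} = \underset{S \subseteq F_{\text{DINO}}^{I},\ |S| = M}{\arg\min}\ \sum_{\substack{f_i, f_j \in S \\ i \neq j}} \operatorname{sim}(f_i, f_j)$$

[0021] In the formula, $F_{\text{div}}$ is the set of features obtained by filtering the auxiliary visual features; $F_{\text{DINO}}^{I}$ is the set of DINOv2 features of all patches in image $I$; $M$ is the number of DINOv2 patch features finally selected; $\arg\min$ selects $M$ features from all patch features such that the sum of their pairwise similarities is minimized; and $\operatorname{sim}(f_i, f_j)$ is the similarity between the i-th and the j-th DINOv2 patch features.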

[0022] Furthermore, the expression for the final visual representation is:

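[0023] $$V = \operatorname{Concat}\left(Q, F_{\text{div}}\right) = \left[q_1, \dots, q_N, f_1, \dots, f_M\right]$$

[0024] In the formula, $V$ is the final visual representation; $Q$ is the set of output features of the Q-Former module; $F_{\text{div}}$ is the set of features obtained by filtering the auxiliary visual features; $N$ is the number of output features of the Q-Former module; and $M$ is the number of features remaining after filtering the auxiliary visual features.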

[0025] Furthermore, the language model interface is a projection layer, and the final visual representation is processed through the projection layer and then input into the language model for training.

[0026] The present invention also provides an image description training device based on feature correction, characterized in that it includes a memory and a processor, wherein the memory stores a computer program, and the processor calls the computer program to execute the steps of the method described above.

[0027] The present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method described above.

[0028] Compared with the prior art, the present invention has the following advantages:

[0029] (1) This invention introduces multi-view visual features to provide visual features from different perspectives than the visual encoder, which can capture more details and structural information, thereby supplementing the semantic understanding ability of the visual encoder; and combines multi-view visual features to perform feature alignment and fusion through self-supervised alignment loss, guiding Q-Former to learn visual representations that can simultaneously cover the feature spaces of the visual encoder and the auxiliary encoder, making it more comprehensive and more generalizable.

[0030] (2) Based on the output of Q-Former, this invention further integrates diverse patch features extracted from the auxiliary encoder to form the final visual representation V. These features are filtered to ensure diversity, thereby supplementing the information dimensions that the Q-Former tokens fail to capture completely.

[0031] (3) This invention completely eliminates the external database and retrieval module required by the Retrieval-Augmented Generation (RAG) strategy, significantly improving inference speed while maintaining or even improving performance, and enabling end-to-end image description generation. Experiments show that the visual processing time of this invention is only one quarter of that of EVCap, making it more suitable for deployment in resource-constrained or real-time-critical scenarios.

[0032] (4) In commonsense-violating or out-of-domain scenarios such as WHOOPS and NoCaps, this invention demonstrates superior description quality compared to the RAG method, indicating its stronger intrinsic visual understanding capability. More importantly, this invention achieves high performance in these tasks without any external retrieval, verifying the comprehensiveness and robustness of its visual features.

Description of the Drawings

[0033] Figure 1 is a flowchart of an image description training method based on feature correction provided in an embodiment of the present invention;

[0034] Figure 2 is a schematic diagram of the model structure and the corrected feature representation of the image description training method based on feature correction provided in an embodiment of the present invention;

[0035] Figure 3 is a schematic diagram comparing the time taken to compute visual features and to generate descriptions with the LLM for the present invention and EVCap, provided in an embodiment of the present invention;

[0036] Figure 4 is a schematic diagram showing a visual comparison of results on several samples from COCO, Flickr30k, and NoCaps, provided in an embodiment of the present invention.

Detailed Implementation

[0037] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.

[0038] Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.

[0039] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.

[0040] Example 1

[0041] As shown in Figure 1 and Figure 2, this embodiment provides an image description training method based on feature correction, including the following steps:

[0042] S1: Construct the basic framework for image description generation, which includes a visual encoder, an auxiliary encoder, a Q-Former module, and a language model interface;

[0043] S2: Obtain the input image for training and input it into the visual encoder and the auxiliary encoder respectively. Extract the main visual features through the visual encoder and, through the auxiliary encoder, extract auxiliary visual features from a perspective different from that of the visual encoder. Input the main visual features into the Q-Former module.

[0044] S3: After filtering, the auxiliary visual features are concatenated with the output features of the Q-Former module to form the final visual representation, which is fed through the language model interface as the input to the deep learning model, thereby training the deep learning model end-to-end to generate image descriptions.
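The data flow of steps S1 to S3 can be sketched as a small container of callables. This is only a structural illustration: the class name `CaptionFramework`, the field names, and the toy stand-in functions are hypothetical and do not appear in the patent, and real encoders would return tensors rather than nested lists.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[List[float]]  # one feature vector per token or patch


@dataclass
class CaptionFramework:
    # S1: the four components named in the text, plus the filter used in S3.
    visual_encoder: Callable[[object], Tokens]     # produces main visual features
    auxiliary_encoder: Callable[[object], Tokens]  # produces auxiliary visual features
    q_former: Callable[[Tokens], Tokens]           # compresses main features into query tokens
    select_patches: Callable[[Tokens], Tokens]     # filters the auxiliary features
    lm_interface: Callable[[Tokens], Tokens]       # maps visual tokens into the language model input

    def visual_representation(self, image: object) -> Tokens:
        main = self.visual_encoder(image)          # S2: main visual features
        aux = self.auxiliary_encoder(image)        # S2: auxiliary visual features
        queries = self.q_former(main)              # S2: Q-Former output features
        kept = self.select_patches(aux)            # S3: filtered auxiliary features
        return self.lm_interface(queries + kept)   # S3: concatenate, then pass to the interface


# Toy stand-ins, purely to show the wiring.
framework = CaptionFramework(
    visual_encoder=lambda img: [[1.0], [2.0]],
    auxiliary_encoder=lambda img: [[3.0], [4.0], [5.0]],
    q_former=lambda tokens: tokens[:1],
    select_patches=lambda tokens: tokens[:2],
    lm_interface=lambda tokens: tokens,
)
```

Keeping the filter and the interface as separate callables mirrors the text, where the auxiliary features bypass the Q-Former and join its output only at the concatenation step.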

[0045] In step S1, this scheme is based on the EVCap architecture, removing its RAG modules (such as external memory banks, retrieval modules, etc.), and retaining the pre-trained visual encoder, Q-Former module, and language model interface to form a more concise and efficient image description generation framework. The visual encoder can be selected as the CLIP-ViT encoder.

[0046] Furthermore, a frozen auxiliary encoder (such as DINOv2) was introduced, which provides visual features from a different perspective than the main visual features. DINOv2 is a self-supervised learning-based visual encoder that can capture more details and structural information, thereby supplementing the semantic understanding capabilities of the CLIP encoder.

[0047] It should be noted that this solution can be modified by replacing different types of auxiliary visual encoders, and the type of auxiliary encoder is not limited to a single one.

[0048] In step S2, preferably, a self-supervised alignment loss function is introduced in the processing of the Q-Former module. This self-supervised alignment loss function is computed so as to maximize the cosine similarity between the output features of the Q-Former module and both the main visual features and the auxiliary visual features.

[0049] Specifically,

[0050] To reduce information loss of visual features during Q-Former compression, this invention introduces a self-supervised alignment loss. This guides the Q-Former to learn visual representations that simultaneously cover both CLIP and DINOv2 feature spaces. Specifically, given an image $I$, its patch-level feature set from CLIP or DINOv2 is $X_I = \{x_j^I\}_{j=1}^{|X_I|}$, and the token set output by Q-Former is $Q_I = \{q_k^I\}$. This invention encourages Q-Former to learn more comprehensive and diverse visual feature representations by maximizing the cosine similarity between each source patch token and its most similar Q-Former token. The alignment loss is defined as follows:

[0051] $$\mathcal{L}_{\text{align}}^{X} = -\,\mathbb{E}_{I \sim \mathcal{D}}\left[\frac{1}{|X_I|}\sum_{j=1}^{|X_I|}\max_{k}\,\cos\!\left(x_j^I, q_k^I\right)\right]$$

[0052] This loss is applied to the features of CLIP and DINOv2 respectively, and then combined in a weighted manner to form the final alignment loss:

[0053] $$\mathcal{L}_{\text{align}} = \mathcal{L}_{\text{align}}^{\text{CLIP}} + \lambda\,\mathcal{L}_{\text{align}}^{\text{DINO}}$$

[0054] In the formula, $\mathcal{L}_{\text{align}}$ is the self-supervised alignment loss function; $\mathcal{L}_{\text{align}}^{\text{CLIP}}$ is the term based on the cosine similarity between the output features of the Q-Former module and the main visual features; $\mathcal{L}_{\text{align}}^{\text{DINO}}$ is the term based on the cosine similarity between the output features of the Q-Former module and the auxiliary visual features; $\lambda$ is a hyperparameter; $\mathcal{L}_{\text{align}}^{X}$ is the term based on the cosine similarity between the output features of the Q-Former module and features from another source $X$, where $X$ is either the main visual features or the auxiliary visual features, giving $\mathcal{L}_{\text{align}}^{\text{CLIP}}$ or $\mathcal{L}_{\text{align}}^{\text{DINO}}$ respectively; $\mathbb{E}_{I \sim \mathcal{D}}$ is the expectation over images $I$ taken from the set $\mathcal{D}$; $|X_I|$ is the number of other-source features of image $I$; $\cos(x_j^I, q_k^I)$ is the cosine similarity between the j-th other-source feature and the k-th feature output by Q-Former; $x_j^I$ is the j-th other-source feature of image $I$; and $q_k^I$ is the k-th output feature of the Q-Former module for image $I$.

[0055] This multimodal alignment mechanism fully leverages CLIP's semantic understanding and DINOv2's visual detail capture capabilities, enabling Q-Former's output tokens to simultaneously align with patch-level features from both CLIP and DINOv2. This process ensures that Q-Former retains as much information as possible when compressing high-dimensional visual features, minimizing information loss and achieving more comprehensive coverage of image information, thereby enhancing its generalization ability.
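As a rough illustration of the alignment described above, the following sketch scores each source patch token by its best cosine match among the Q-Former tokens and averages the result for a single image. Several details are assumptions rather than statements of the patent: the loss is taken as the negative mean similarity, the two source terms are combined as a sum with one weight `lam`, all vectors are assumed to share one dimension and to be nonzero, and the function names are hypothetical. In training, the value would additionally be averaged over the images of a batch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def source_alignment_loss(source_tokens, qformer_tokens):
    """Negative mean, over source patch tokens, of the best match among Q-Former tokens."""
    best = [max(cosine(x, q) for q in qformer_tokens) for x in source_tokens]
    return -sum(best) / len(best)


def alignment_loss(clip_tokens, dino_tokens, qformer_tokens, lam=1.0):
    """Weighted combination of the main-feature and auxiliary-feature terms (weighting assumed)."""
    main_term = source_alignment_loss(clip_tokens, qformer_tokens)
    aux_term = source_alignment_loss(dino_tokens, qformer_tokens)
    return main_term + lam * aux_term


# Toy example: two Q-Former tokens cover both main patches exactly,
# but only partly cover the single auxiliary patch.
clip_tokens = [[1.0, 0.0], [0.0, 1.0]]
dino_tokens = [[1.0, 1.0]]
qformer_tokens = [[1.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(clip_tokens, dino_tokens, qformer_tokens, lam=0.5)
```

Because the maximum is taken per source patch, a patch is penalized only when no Q-Former token points in its direction, which is what pushes the small token set to spread across the image content.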

[0056] Preferably, the loss function in the language model training process is an image description loss based on cross-entropy loss, and the image description loss and the self-supervised alignment loss function are jointly optimized by balancing weights.

[0057] In other words, the training objective of this scheme consists of two parts:

[0058] Image description loss $\mathcal{L}_{\text{cap}}$: the standard cross-entropy loss, used to optimize the language model's generation of the target description.

[0059] Self-supervised alignment loss $\mathcal{L}_{\text{align}}$: a regularization term that guides Q-Former to learn visual features with greater generalization ability. The overall training objective function is as follows:

[0060] $$\mathcal{L} = \mathcal{L}_{\text{cap}} + \lambda_1\,\mathcal{L}_{\text{align}}$$

[0061] Where $\lambda_1$ is the balancing weight. By jointly optimizing these two objectives, $\mathcal{L}_{\text{align}}$ acts as a regularization term that guides the visual encoder to learn more robust and generalizable features rather than simply fitting the limited descriptions in the training data; the present invention thus learns, during the training phase, feature representations that both meet the requirements of the image description task and retain rich visual information.
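The joint objective can be illustrated with a few lines of plain Python. The sketch assumes the caption loss is the mean negative log-likelihood of the ground-truth caption tokens and that the two terms are added with a single balancing weight; the function names and the default weight of 0.1 are hypothetical placeholders, not values given in the patent.

```python
import math


def caption_cross_entropy(target_token_probs):
    """Mean negative log-likelihood of the ground-truth caption tokens.

    target_token_probs holds, for each caption position, the probability the
    language model assigned to the correct token at that position.
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def total_loss(caption_loss, alignment_loss_value, balancing_weight=0.1):
    """Caption loss plus the alignment loss scaled by the balancing weight."""
    return caption_loss + balancing_weight * alignment_loss_value


# Toy example: a two-token caption, each correct token predicted with p = 0.5,
# and an alignment loss of -0.5 (a negative mean cosine similarity).
l_cap = caption_cross_entropy([0.5, 0.5])
l_total = total_loss(l_cap, -0.5, balancing_weight=0.2)
```

A larger balancing weight gives the alignment term more influence relative to the caption loss, so it would normally be tuned on held-out data.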

[0062] Based on the Q-Former output, this invention optionally introduces diverse patch features from DINOv2 to further enrich the visual representation. To avoid redundant information, this invention filters the patch features from DINOv2, selecting the M most representative and diverse features:

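[0063] $$F_{\text{div}} = \underset{S \subseteq F_{\text{DINO}}^{I},\ |S| = M}{\arg\min}\ \sum_{\substack{f_i, f_j \in S \\ i \neq j}} \operatorname{sim}(f_i, f_j)$$

[0064] In the formula, $F_{\text{div}}$ is the set of features obtained by filtering the auxiliary visual features; $F_{\text{DINO}}^{I}$ is the set of DINOv2 features of all patches in image $I$; $M$ is the number of DINOv2 patch features finally selected; $\arg\min$ selects $M$ features from all patch features such that the sum of their pairwise similarities is minimized; and $\operatorname{sim}(f_i, f_j)$ is the similarity between the i-th and the j-th DINOv2 patch features.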
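The selection of M mutually dissimilar patch features can be sketched in two ways. The exhaustive version below evaluates every size-M subset and is only practical for a handful of patches; the greedy version is a common approximation swapped in here for illustration, not something the patent describes. Cosine similarity is assumed as the similarity measure, each unordered pair is counted once (which leaves the minimizer unchanged), and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarity_sum(features, indices):
    """Sum of similarities over all unordered pairs of the chosen patches."""
    return sum(cosine(features[i], features[j]) for i, j in combinations(indices, 2))


def select_diverse_exact(features, m):
    """Exhaustive search over all size-m subsets (feasible only for few patches)."""
    candidates = combinations(range(len(features)), m)
    return min(candidates, key=lambda idx: pairwise_similarity_sum(features, idx))


def select_diverse_greedy(features, m):
    """Greedy approximation: start at patch 0, then repeatedly add the patch
    whose summed similarity to the patches already chosen is smallest."""
    chosen = [0]
    while len(chosen) < m:
        remaining = [i for i in range(len(features)) if i not in chosen]
        nxt = min(remaining, key=lambda i: sum(cosine(features[i], features[j]) for j in chosen))
        chosen.append(nxt)
    return tuple(sorted(chosen))


# Toy example: patches 0 and 1 are near-duplicates, patch 3 points the opposite way.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
exact = select_diverse_exact(patches, 2)
greedy = select_diverse_greedy(patches, 2)
```

With hundreds of patches per image the number of subsets grows combinatorially, which is why an approximation of this kind would normally stand in for the exact minimization.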

[0065] These features are then concatenated with the tokens output by the Q-Former to form the final visual representation V:

[0066] $$V = \operatorname{Concat}\left(Q, F_{\text{div}}\right) = \left[q_1, \dots, q_N, f_1, \dots, f_M\right]$$

[0067] In the formula, $V$ is the final visual representation; $Q$ is the set of output features of the Q-Former module; $F_{\text{div}}$ is the set of features obtained by filtering the auxiliary visual features; $N$ is the number of output features of the Q-Former module; and $M$ is the number of features remaining after filtering the auxiliary visual features.

[0068] This fusion strategy not only enhances the information dimension of visual representations but also improves the model's ability to understand complex or rare scenarios.

[0069] Optionally, the language model interface is a projection layer, and the final visual representation is processed by the projection layer and then input into the language model for training.
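The concatenation and projection steps can be sketched as follows. The example assumes the Q-Former tokens and the selected patch features already share one width, treats the projection layer as a single linear map, and uses hypothetical function names and toy weights; a real system would use learned tensors.

```python
def concat_tokens(qformer_tokens, selected_patch_tokens):
    """Final visual representation: N Q-Former tokens followed by M selected patch tokens."""
    return list(qformer_tokens) + list(selected_patch_tokens)


def project(tokens, weight, bias):
    """Linear projection of every token into the language model's input width.

    weight has one row per output dimension and one column per input dimension.
    """
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


# Toy example: N = 1 Q-Former token, M = 2 patch tokens, projection from width 2 to width 3.
q_tokens = [[1, 2]]
patch_tokens = [[3, 4], [5, 6]]
visual_tokens = concat_tokens(q_tokens, patch_tokens)
weight = [[1, 0], [0, 1], [1, 1]]
bias = [0, 0, 1]
lm_inputs = project(visual_tokens, weight, bias)
```

The projected sequence has N + M entries, so the language model sees the compressed Q-Former tokens and the extra patch detail side by side.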

[0070] Experimental verification:

[0071] This embodiment comprehensively evaluates the method of the present invention on widely used mainstream image captioning datasets such as COCO, Flickr30k, and NoCaps. The evaluation metrics are BLEU, METEOR (Metric for Evaluation of Translation with Explicit ORdering), and CIDEr (Consensus-based Image Description Evaluation). The larger the CIDEr value, the better the performance, as shown in Tables 1 and 2.

[0072] Table 1. Performance comparison on COCO, Flickr30k, and NoCaps. * indicates a RAG-based solution.

[0073]

[0074] Table 2 Performance Comparison on WHOOPS

[0075]

[0076] Therefore, in common-sense violation or out-of-domain scenarios such as WHOOPS and NoCaps, this invention demonstrates superior description quality compared to the RAG method, indicating its stronger intrinsic visual understanding capability. More importantly, this invention achieves high performance in these tasks without any external retrieval, validating the comprehensiveness and robustness of its visual features.

[0077] Figure 3 compares the time for computing visual features and the time for generating descriptions with the LLM between the present invention and EVCap, with the present invention shown both with and without DINOv2 features. Blue represents the time for computing visual features, and orange represents the time for generating LLM descriptions. Both times are lower for the present invention than for the EVCap solution, and the visual processing time of the present invention is only one quarter of that of EVCap, making it more suitable for deployment in resource-constrained or real-time-critical scenarios.

[0078] Figure 4 shows a visual comparison of the present invention with the EVCap solution on several samples from COCO, Flickr30k, and NoCaps. The descriptions generated by the present invention are more accurate.

[0079] This embodiment also provides an image description training device based on feature correction, including a memory and a processor. The memory stores a computer program, and the processor calls the computer program to execute the steps of the image description training method based on feature correction as described above.

[0080] It should be noted that the specific details and beneficial effects of the device in this application can be found in the above-described method embodiments, and will not be repeated here.

[0081] This embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the feature-correction-based image description training method described above.

[0082] The computer program code used to implement the methods of the present invention can be written in any combination of one or more programming languages. This computer program code can be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the computer program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The computer program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.

[0083] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium can be a machine-readable signal medium or a machine-readable storage medium. A computer-readable storage medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0084] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.

Claims

1. An image description training method based on feature correction, characterized in that it includes the following steps: constructing a basic framework for image description generation, which includes a visual encoder, an auxiliary encoder, a Q-Former module, and a language model interface; obtaining an input image for training and inputting it into the visual encoder and the auxiliary encoder respectively, extracting the main visual features through the visual encoder, extracting, through the auxiliary encoder, auxiliary visual features from a perspective different from that of the visual encoder, and inputting the main visual features into the Q-Former module; and, after filtering the auxiliary visual features, concatenating them with the output features of the Q-Former module to form the final visual representation, which is fed through the language model interface as the input to the deep learning model, thereby training the deep learning model end-to-end to generate image descriptions; wherein the Q-Former module incorporates a self-supervised alignment loss function during processing, the self-supervised alignment loss function being computed so as to maximize the cosine similarity between the output features of the Q-Former module and both the main visual features and the auxiliary visual features.

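2. The image description training method based on feature correction according to claim 1, characterized in that the calculation expression for the self-supervised alignment loss function is as follows:

$$\mathcal{L}_{\text{align}}^{X} = -\,\mathbb{E}_{I \sim \mathcal{D}}\left[\frac{1}{|X_I|}\sum_{j=1}^{|X_I|}\max_{k}\,\cos\!\left(x_j^I, q_k^I\right)\right]$$

$$\mathcal{L}_{\text{align}} = \mathcal{L}_{\text{align}}^{\text{CLIP}} + \lambda\,\mathcal{L}_{\text{align}}^{\text{DINO}}$$

In the formula, $\mathcal{L}_{\text{align}}$ is the self-supervised alignment loss function; $\mathcal{L}_{\text{align}}^{\text{CLIP}}$ is the term based on the cosine similarity between the output features of the Q-Former module and the main visual features; $\mathcal{L}_{\text{align}}^{\text{DINO}}$ is the term based on the cosine similarity between the output features of the Q-Former module and the auxiliary visual features; $\lambda$ is a hyperparameter; $\mathcal{L}_{\text{align}}^{X}$ is the term based on the cosine similarity between the output features of the Q-Former module and features from another source $X$, where $X$ is either the main visual features or the auxiliary visual features, giving $\mathcal{L}_{\text{align}}^{\text{CLIP}}$ or $\mathcal{L}_{\text{align}}^{\text{DINO}}$ respectively; $\mathbb{E}_{I \sim \mathcal{D}}$ is the expectation over images $I$ taken from the set $\mathcal{D}$; $|X_I|$ is the number of other-source features of image $I$; $\cos(x_j^I, q_k^I)$ is the cosine similarity between the j-th other-source feature and the k-th feature output by Q-Former; $x_j^I$ is the j-th other-source feature of image $I$; and $q_k^I$ is the k-th output feature of the Q-Former module for image $I$.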

3. The image description training method based on feature correction according to claim 1, characterized in that the loss function used in the language model training process is an image description loss based on cross-entropy loss.

4. The image description training method based on feature correction according to claim 3, characterized in that the image description loss and the self-supervised alignment loss function are jointly optimized using a balancing weight.

5. The image description training method based on feature correction according to claim 1, characterized in that the expression for filtering the auxiliary visual features is:

$$F_{\text{div}} = \underset{S \subseteq F_{\text{DINO}}^{I},\ |S| = M}{\arg\min}\ \sum_{\substack{f_i, f_j \in S \\ i \neq j}} \operatorname{sim}(f_i, f_j)$$

In the formula, $F_{\text{div}}$ is the set of features obtained by filtering the auxiliary visual features; $F_{\text{DINO}}^{I}$ is the set of DINOv2 features of all patches in image $I$; $M$ is the number of DINOv2 patch features finally selected; $\arg\min$ selects $M$ features from all patch features such that the sum of their pairwise similarities is minimized; and $\operatorname{sim}(f_i, f_j)$ is the similarity between the i-th and the j-th DINOv2 patch features.

6. The image description training method based on feature correction according to claim 1, characterized in that the final visual representation is expressed as:

$$V = \operatorname{Concat}\left(Q, F_{\text{div}}\right) = \left[q_1, \dots, q_N, f_1, \dots, f_M\right]$$

In the formula, $V$ is the final visual representation; $Q$ is the set of output features of the Q-Former module; $F_{\text{div}}$ is the set of features obtained by filtering the auxiliary visual features; $N$ is the number of output features of the Q-Former module; and $M$ is the number of features remaining after filtering the auxiliary visual features.

7. The image description training method based on feature correction according to claim 1, characterized in that the language model interface is a projection layer, and the final visual representation is processed through the projection layer and then input into the language model for training.

8. An image description training device based on feature correction, characterized in that it includes a memory and a processor, the memory storing a computer program, and the processor invoking the computer program to perform the steps of the method as described in any one of claims 1 to 7.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.