An image description generation method based on a common attention mechanism
By using an image description algorithm based on generative adversarial networks, combined with an improved prophetic attention mechanism and a common attention discriminator, the problem of mismatch between image and description semantics is solved, and more accurate and diverse image description generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF ELECTRONICS SCI & TECH OF CHINA
- Filing Date
- 2023-03-31
- Publication Date
- 2026-06-12
AI Technical Summary
In existing image description algorithms, the semantic alignment problem between the image and the generated description has not been effectively solved, machine-generated text remains easy to tell apart from human-written text, and the generated descriptions lack diversity.
An image description algorithm based on generative adversarial networks is adopted, which combines an improved prophetic attention mechanism and a common attention discriminator. By training the generator and discriminator, the semantic alignment between the image and the description is achieved.
It improves the accuracy of image descriptions and the diversity of generated descriptions, making the generated descriptions closer to human-generated text, and significantly enhancing semantic alignment.
Figure CN116452688B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image description generation in deep learning, and addresses the semantic misalignment between an image and its generated description.
Background Technology
[0002] Image description algorithms are an artificial intelligence technique that integrates computer vision and natural language processing methods to enable machines to generate natural language descriptions based on given images. Applications of these algorithms include image search, automatic image annotation, and intelligent robotics.
[0003] In practical applications, image captioning algorithms have been widely used. For example, in social media, these algorithms help platforms automatically generate image descriptions, allowing users to better understand photo content and enhancing the user experience. In search engines, they help search engines better understand image content, improving retrieval accuracy and providing users with higher-quality search results. In autonomous driving, self-driving cars need image recognition technology to perceive their environment, and image captioning algorithms help them better understand and predict road conditions. Image captioning algorithms can also be applied to medical imaging, drone monitoring, and many other fields, providing strong support for achieving intelligent and automated systems.
[0004] Image captioning algorithms primarily employ an attention-enhanced encoder-decoder framework. The attention mechanism guides the decoding process by using the hidden state to attend to image regions at each time step. This technique has achieved significant success in advancing image captioning technology. Current attention mechanisms attend to image regions based on the previous hidden state, which contains information about the previously generated words. The attention model must therefore predict attention weights without knowing which word they will be applied to. Consequently, the attended image region is grounded more accurately on the current input word than on the output word.
[0005] In tasks that generate image captions with convolutional neural networks, reinforcement learning techniques based on policy gradient methods have been introduced to directly optimize N-gram matching metrics such as CIDEr, BLEU4, or SPICE. For example, CIDEr is used as an optimization metric for training image captioning models. However, these metrics do not achieve semantic alignment between images and captions. Nor do they provide a way to promote the naturalness of the language so that machine-generated text becomes indistinguishable from human-written text.
[0006] With the continuous development of deep learning, its application in image description algorithms is becoming increasingly widespread. This invention focuses on the problem of semantic mismatch between images and descriptions in image description algorithms. It designs a network based on generative adversarial networks, employs an improved prophetic attention mechanism, and trains a common attention discriminator to detect misalignment signals between the image and the generated sentence. The generator can use this signal to improve its text generation mechanism, thereby better aligning the description with the given image.
Summary of the Invention
[0007] To overcome the shortcomings of existing technologies, this invention proposes an image captioning generation method based on a common attention mechanism, utilizing generative adversarial networks (GANs). This technique employs an improved prophetic attention mechanism and trains a common attention discriminator to detect misalignment signals between the image and the generated sentence. As shown in Figure 1, this addresses the problem of semantic mismatch between the image and the caption in image captioning algorithms and further improves the image description algorithm based on generative adversarial networks.
[0008] The technical solution adopted in this invention is:
[0009] Step 1: Image description algorithm based on generative adversarial networks. The network model consists of a generator and a discriminator. The generator generates a description of the corresponding image, while the discriminator evaluates the accuracy of the text description of the image. The overall framework is shown in Figure 2;
[0010] Step 2: The generator in Step 1 uses an encoder-decoder framework. The structure is as follows: the encoder uses a convolutional neural network with a prophetic attention mechanism, and the decoder uses a recurrent neural network. Given an image I, the generator G outputs an image description.
[0011] Step 3: The encoder in Step 2 uses Faster R-CNN, which takes image I and extracts image features $V = \{v_1, \ldots, v_k\} \in \mathbb{R}^{d \times N}$.
[0012] Step 4: The generator decoder in Step 2 consists of an initial layer and a prophetic attention layer. The initial layer is an LSTM structure, which, after certain modifications, can control the generation of image descriptions. The prophetic attention layer uses a bidirectional LSTM to calculate attention weights and improves upon Self-Attention. The attention weights are divided into present and future parts, where the future part of the attention weight is calculated by predicting the generation probability of the next word.
[0013] Step 5: The discriminator network in Step 1 is designed using a common attention mechanism to determine whether the generated image descriptions are human-generated or machine-generated. This discriminator consists of two parts: an image attention module and a text attention module. These two modules are used to extract features from the image and description, respectively, and generate corresponding attention matrices. Then, these two attention matrices are combined through a dot product operation to generate a matrix that represents the degree of semantic matching between the image and the description. Finally, this matrix is used as the output of the discriminator to force semantic alignment between the image and the description.
[0014] Step 6: This step trains the network model with the reinforcement learning method SCST (self-critical sequence training), uses the reward obtained under the decoding algorithm as the baseline, and uses the image description evaluation metric CIDEr for regularization so that the generated description stays close to the provided ground-truth captions at the N-gram level.
[0015] Step 7: The discriminator will be improved alternately with the generator during training. The two modules are trained together, and the network-generated descriptions reach a balance, ultimately resulting in a generator network that achieves semantic alignment between images and descriptions.
[0016] Compared with the prior art, the beneficial effects of the present invention are:
[0017] (1) In terms of aligning images with descriptive semantics, it can enable descriptions to achieve higher accuracy;
[0018] (2) It addresses the lack of diversity in image description algorithms by generating more linguistically diverse descriptions.
Attached Figure Description
[0019] Figure 1 is an example of the sequence of image regions generated for each word used to describe the image.
[0020] Figure 2 is a diagram of the overall framework for image description based on the common attention mechanism.
[0021] Figure 3 is a diagram of the Faster R-CNN framework.
[0022] Figure 4 is the visual attention architecture diagram.
[0023] Figure 5 is the architecture diagram of the prophetic attention mechanism.
[0024] Figure 6 is the architecture diagram of the common attention discriminator.
[0025] Figure 7 is a diagram illustrating the use of SCST to train the generator.
Detailed Implementation
[0026] The invention will be further described below with reference to the accompanying drawings.
[0027] First, the encoder network in the generator uses Faster R-CNN, whose structure is shown in Figure 3. Faster R-CNN is an object detection model designed to identify object instances belonging to certain categories and locate them using bounding boxes.
[0028] The Faster R-CNN model mainly consists of two modules: the Region Proposal Network (RPN) module and the Fast R-CNN detection module, as shown in Figure 3. It can be further subdivided into three parts: the convolutional layers, the RPN, and RoI Pooling. The convolutional layers comprise a series of convolution (Conv+ReLU) and pooling operations that extract image features. They use the classic VGG16 network, and their weights are shared by the RPN and Fast R-CNN, which is key to accelerating training and improving the model's real-time performance. The RPN generates region proposals: based on multi-scale anchor boxes, it uses Softmax to classify each anchor box as target or background, and uses bounding box regression to predict the precise location of the candidate boxes for subsequent recognition and detection. RoI Pooling integrates the convolutional features with the candidate bounding boxes by mapping the coordinates of each candidate box in the input image onto the last convolutional layer (conv5-3). The corresponding regions of the feature map are then pooled into a fixed-size (7×7) output, which is passed to the subsequent fully connected layers. The fully connected layers are followed by two sibling layers, a classification layer and a regression layer. The classification layer determines the category of the candidate bounding box, while the regression layer refines its location by regressing its boundaries. The output of Faster R-CNN is a set of k feature vectors $V = \{v_1, \ldots, v_k\} \in \mathbb{R}^{d \times N}$.
[0029] The attention-enhanced image description decoder is shown in Figure 4. At each decoding step t, the decoder takes the word embedding of the current input word $y_{t-1}$, concatenates it with the averaged visual features, and uses the result as input to the LSTM, as shown in formula (1):
[0030]
[0031] where [;] denotes concatenation and $W_e$ denotes the learnable word embedding parameters. Next, the LSTM output $h_t$ is used as a query to attend to the relevant image regions in the visual feature set V and to generate the attended visual features $c_t$, as in formulas (2) and (3):
[0032]
[0033]
[0034] where $w_\alpha$, $W_h$ and $W_V$ are learnable parameters, and the addition operator in formula (3) denotes matrix-vector addition, calculated by adding a vector to each column of the matrix. Finally, $h_t$ and $c_t$ are passed to a linear layer to predict the next word, as in formula (4):
[0035] $y_t \sim p_t = \mathrm{softmax}(W_p [h_t; c_t] + b_p)$ (4)
[0036] where $W_p$ and $b_p$ are learnable parameters. Finally, given a target ground-truth sequence and a description model with parameters θ, the training objective is to minimize the following cross-entropy loss, as shown in formula (5):
[0037]
[0038] As the formulas show, at each time step t the attention model relies on $h_t$, which contains information about the previously generated words $y_{1:t-1}$, to calculate the attention weights $\alpha_t$. Because of this reliance on past information, the attended visual features are poorly grounded on the word generated at the current time step, which compromises the accuracy of the description.
[0039] To ensure that the attention model can correctly associate image regions with the words to be generated, a prophetic attention model is employed, as shown in Figure 5. Information about future words is used to guide the conventional attention model, correcting its semantic misalignment so that it selects the correct image regions when generating the corresponding words.
[0040] Specifically, a traditional encoder-decoder framework is first used to generate the entire sentence $y_{1:T}$. Then, for each time step t, the prophetic attention takes the future information $y_{i:j}$ (j ≥ t) as input to calculate the attention weights, which are therefore naturally grounded on the generated words. In the implementation, as shown in Figure 5, a bidirectional LSTM (BiLSTM) encodes $y_{1:T}$, so the information $y_{i:j}$ is first converted into $h'_{i:j}$. The attention weights are then calculated using formula (6):
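The attention step that this paragraph discusses can be illustrated with a minimal sketch. Formulas (2) and (3) did not survive in this text, so the code assumes the common additive form: each region is scored by $w_\alpha \cdot \tanh(W_h h_t + W_V v_i)$, the scores are normalized with a softmax to obtain $\alpha_t$, and $c_t$ is the $\alpha_t$-weighted sum of the region features. Formula (4) is then applied to the concatenation $[h_t; c_t]$. The helper names and all toy dimensions are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(h_t, V, W_h, W_V, w_alpha):
    """Assumed additive attention. V is a list of k region vectors.
    Returns (alpha_t, c_t)."""
    query = matvec(W_h, h_t)
    scores = []
    for v in V:
        key = matvec(W_V, v)
        hidden = [math.tanh(q + k) for q, k in zip(query, key)]
        scores.append(sum(w * x for w, x in zip(w_alpha, hidden)))
    alpha_t = softmax(scores)
    dim = len(V[0])
    c_t = [sum(a * v[i] for a, v in zip(alpha_t, V)) for i in range(dim)]
    return alpha_t, c_t

def next_word_distribution(h_t, c_t, W_p, b_p):
    # Formula (4): p_t = softmax(W_p [h_t; c_t] + b_p).
    logits = [z + b for z, b in zip(matvec(W_p, h_t + c_t), b_p)]
    return softmax(logits)
```

Because `attend` only sees `h_t`, which summarizes the words up to $y_{t-1}$, the weights it returns are chosen before the output word is known. This is the limitation the paragraph above describes.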
[0041]
[0042] The attention models in formulas (2), (3), and (6) share the same set of parameters. During training, the L1 norm between $\alpha_t$ and the prophetic attention weights from formula (6) is used as a regularization loss, defined as formula (7):
[0043]
[0044] where $\|\cdot\|_1$ denotes the L1 norm. Minimizing the loss in formula (7) moves the "biased" attention weights $\alpha_t$, calculated on the previously generated words $y_{1:t-1}$, closer to the "ideal" attention weights calculated on the future generated words $y_{i:j}$ (j ≥ t).
[0045] Then, to train the prophetic attention, its output is incorporated into the traditional encoder-decoder framework to regenerate the ground-truth description, as defined by formulas (8), (9), and (10):
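As a rough illustration of the regularizer described here, the sketch below sums the L1 distance between the conventional weights $\alpha_t$ and the prophetic weights over all time steps. Formula (7) is missing from this text, so summing over t without further normalization is an assumption, and the function and argument names are hypothetical.

```python
def l1_attention_loss(alphas, prophet_alphas):
    """Assumed form of the attention regularizer: the sum over time steps
    of the L1 distance between the two attention distributions.
    Both arguments are lists (one entry per step) of weight lists."""
    total = 0.0
    for alpha_t, prophet_alpha_t in zip(alphas, prophet_alphas):
        total += sum(abs(a - b) for a, b in zip(alpha_t, prophet_alpha_t))
    return total
```

The loss is zero only when the two attention distributions agree at every step, which is the behaviour the paragraph above asks the conventional attention to learn.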
[0046]
[0047]
[0048]
[0049] Combining the loss $L_{CE}(\theta)$ in formula (5), the loss in formula (10), and the loss $L_{Att}(\theta)$ in formula (7), the complete training objective is defined by formula (11):
[0050]
[0051] where λ is a hyperparameter controlling the regularization. During training, the description model is first pre-trained for 25 epochs using formula (5), and the complete model is then trained using formula (11). In this way, suitable parameter weights are initialized for the prophetic attention. During testing, future words are not visible at the current time step in a language generation task, so the description decoder follows the same procedure as the traditional attention model.
[0052] Prophetic attention should also focus dynamically on image regions depending on the information from future time steps. Specifically, for a noun phrase such as "a black shirt," all the words should be treated as one complete phrase rather than as individual words. Therefore, in Dynamic Prophet Attention (DPA), if the current output word $y_t$ belongs to a noun phrase (NP), DPA uses all the words in that noun phrase to calculate the attention weights. When the word is a non-visual (NV) word, the prophetic attention model is masked, i.e., the loss in formula (10) and the loss in formula (7) are removed. For the remaining words, i = j = t is set directly. In image description, the remaining words are usually verbs, which act as relational words connecting different noun phrases. In short, dynamic prophetic attention is defined as formula (12):
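The two-stage schedule can be sketched as follows. Formula (11) is missing from this text, so the code assumes the complete objective is the unweighted sum of the two cross-entropy losses plus λ times the attention loss. The function name, argument names, and the loss values passed in are hypothetical placeholders for quantities a real training loop would compute.

```python
def training_loss(epoch, l_ce, l_ce_prophet, l_att, lam, pretrain_epochs=25):
    """Select the objective for the current epoch (0-indexed).

    During pre-training only the cross-entropy loss of formula (5) is used.
    Afterwards the assumed complete objective is
    l_ce + l_ce_prophet + lam * l_att.
    """
    if epoch < pretrain_epochs:
        return l_ce
    return l_ce + l_ce_prophet + lam * l_att
```

Pre-training first means the shared attention parameters already produce reasonable weights by the time the prophetic terms are switched on.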
[0053]
[0054] where $\{y^{NV}\}$ denotes the set of all NV words. In this way, the attention model can learn to ground each output word $y_t$ on image regions without requiring ground-truth annotations of those regions for training.
[0055] The discriminator's task is to score the similarity between images and descriptions. A joint attention model embeds the image and the description together at an early stage, and similarity is calculated on the resulting representations. The joint attention discriminator is shown in Figure 6; its structural details are given below.
[0056] Given a sentence w composed of a sequence of words $(w_1, \ldots, w_T)$, the discriminator uses an LSTM (state dimension m = 512) to embed each word, resulting in $H = [h_1, \ldots, h_T]^\top \in \mathbb{R}^{T \times m}$, where $h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, w_t)$. For image I, features $(I_1, \ldots, I_C)$ are extracted, where C = 14 × 14 = 196, and embedded as $I = [W I_1, \ldots, W I_C]^\top \in \mathbb{R}^{C \times m}$, where $d_I = 2048$ is the image feature size used in this work. A bilinear projection $Q \in \mathbb{R}^{m \times m}$ is used to calculate the correlation Y between the image and the text, $Y = \tanh(I Q H^\top) \in \mathbb{R}^{C \times T}$. The matrix Y is used to calculate the joint attention weights of one modality conditioned on the other, as shown in formulas (13) and (14):
[0057] $\alpha = \mathrm{Softmax}(\mathrm{Linear}(\tanh(I W_I + Y H W_{Ih}))) \in \mathbb{R}^{C}$ (13)
[0058] $\beta = \mathrm{Softmax}(\mathrm{Linear}(\tanh(H W_h + Y^\top I W_{hI}))) \in \mathbb{R}^{T}$ (14)
[0059] All the new matrices are in $\mathbb{R}^{m \times m}$. The above weights are then used to combine the word and image features, with $U_I, V_S \in \mathbb{R}^{m \times m}$. Finally, the image-description score is calculated, where $E_I$ is the average spatial pooling of the CNN features and $E_S$ is the final state of the LSTM.
[0060] During model training, the generator is optimized to solve $\max_\theta L_G(\theta)$, where $L_G(\theta) = \mathbb{E}_I \log D_\eta(I, G_\theta(I))$. The generator $G_\theta$ is trained using the SCST method. SCST is a variant of reinforcement learning that uses the reward obtained under the decoding algorithm as a baseline. In this work, the decoding algorithm is greedy: at each step it selects the most likely word for the given image, $\arg\max p_\theta(\cdot \mid h_t)$. For a given image, a single sample $w^s$ from the generator is used to estimate the total sequence reward, where $w^s \sim p_\theta(\cdot \mid I)$. Using SCST, the gradient is estimated as shown in formula (15):
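The case analysis behind dynamic prophetic attention can be sketched as a small span-selection function. Formula (12) is not present in this text, so the sketch follows only the prose rules: a word inside a noun phrase uses the whole phrase, a non-visual word is masked, and any other word uses i = j = t. The noun phrase spans and the non-visual index set are assumed to come from an external tagger, and the function name is hypothetical.

```python
def dpa_span(t, noun_phrases, non_visual):
    """Return the inclusive word span (i, j) used by the prophetic
    attention at step t, or None when the step is masked.

    noun_phrases: list of (start, end) inclusive index spans.
    non_visual:   set of indices of non-visual words.
    """
    for start, end in noun_phrases:
        if start <= t <= end:
            return (start, end)   # use every word of the noun phrase
    if t in non_visual:
        return None               # masked: prophetic losses are dropped
    return (t, t)                 # remaining words, typically verbs


# Hand-tagged example: "a woman wearing a black shirt is smiling"
tokens = ["a", "woman", "wearing", "a", "black", "shirt", "is", "smiling"]
spans = [dpa_span(t, [(0, 1), (3, 5)], {6}) for t in range(len(tokens))]
```

In this example the three words of "a black shirt" all share the span (3, 5), "is" is masked, and "wearing" attends on itself alone.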
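The discriminator's attention computation can be sketched in plain Python as follows. The correlation matrix and formulas (13) and (14) follow the text. The pooling and scoring formulas are missing here, so the sketch makes three labeled assumptions: `Linear` is a bias-free projection to one score per row, the projections $U_I$ and $V_S$ are omitted (treated as identity), and the final score is the sigmoid of the dot product of the two pooled vectors. All function names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def tanh(A):
    return [[math.tanh(x) for x in row] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(A, w):
    # Assumed bias-free projection of each row to a scalar score.
    return [sum(a * wi for a, wi in zip(row, w)) for row in A]

def pool(weights, rows):
    # Attention-weighted sum of the rows.
    return [sum(w * r[k] for w, r in zip(weights, rows)) for k in range(len(rows[0]))]

def co_attention_score(I, H, Q, W_I, W_Ih, W_h, W_hI, w_a, w_b):
    """I is C x m (image), H is T x m (words). Returns (alpha, beta, score)."""
    Y = tanh(matmul(matmul(I, Q), transpose(H)))                      # C x T
    a_in = add(matmul(I, W_I), matmul(matmul(Y, H), W_Ih))            # C x m
    b_in = add(matmul(H, W_h), matmul(matmul(transpose(Y), I), W_hI)) # T x m
    alpha = softmax(linear(tanh(a_in), w_a))                          # formula (13)
    beta = softmax(linear(tanh(b_in), w_b))                           # formula (14)
    e_img, e_txt = pool(alpha, I), pool(beta, H)
    logit = sum(x * y for x, y in zip(e_img, e_txt))
    return alpha, beta, 1.0 / (1.0 + math.exp(-logit))                # assumed sigmoid score
```

Each modality's weights depend on the other through Y, so a description that mentions objects absent from the image changes which regions are attended and lowers the agreement between the pooled vectors.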
[0061]
[0062] where the baseline description is obtained by greedy decoding (taking the maximum at each step), as shown in Figure 7. Note that the baseline does not change the expected value of the gradient, but it reduces the variance of the estimate.
[0063] In addition, GAN training can be regularized with an image description metric $r_{NLP}$ so that the generated description stays close to the provided ground-truth captions at the N-gram level. The gradient then takes the form shown in formula (16):
[0064]
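The self-critical update can be sketched with two small functions. Formulas (15) and (16) are missing from this text, so the sketch assumes the reward of a description is $\log D_\eta(I, w)$ plus an optional λ-weighted metric term, and that the gradient is the advantage (sampled reward minus greedy-baseline reward) times the gradient of the sample's log-probability. The function names, the weight `lam`, and the numeric inputs are hypothetical.

```python
import math

def scst_advantage(d_sample, d_greedy, r_sample=0.0, r_greedy=0.0, lam=0.0):
    """Advantage of a sampled description over the greedy baseline.

    d_sample, d_greedy: discriminator scores in (0, 1).
    r_sample, r_greedy: metric scores (for example CIDEr) used as regularizer.
    """
    reward_sample = math.log(d_sample) + lam * r_sample
    reward_greedy = math.log(d_greedy) + lam * r_greedy
    return reward_sample - reward_greedy

def scst_surrogate_loss(sample_log_probs, advantage):
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall short of it.
    return -advantage * sum(sample_log_probs)
```

A sample that the discriminator scores exactly like the greedy description has zero advantage and contributes no gradient, which is how the baseline reduces variance without biasing the estimate.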
[0065] The discriminator $D_\eta$ is trained not only to distinguish real descriptions from fake ones, but also to detect when an image is paired with a random, unrelated real sentence, which forces it to examine not only sentence structure but also the semantic relationship between the image and the description. To achieve this, the following optimization problem is solved: $\max_\eta L_D(\eta)$, where the loss $L_D(\eta)$ is given by formula (17):
[0066]
[0067] where w is the real sentence, $w^s$ is a fake description sampled from the generator $G_\theta$, and w′ is a real but randomly selected description.
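The discriminator objective over the three kinds of pairs can be sketched as below. Formula (17) is missing from this text, so the form used here is an assumption: the log-score of the real pair plus weighted log-complements of the scores for the generated pair and the mismatched real pair, averaged over the batch. The equal 0.5 weights and all names are hypothetical.

```python
import math

def discriminator_objective(d_real, d_fake, d_mismatched,
                            fake_weight=0.5, mismatch_weight=0.5):
    """Batch-averaged objective that the discriminator maximizes.

    d_real:       scores D(I, w) for real image-description pairs.
    d_fake:       scores D(I, w_s) for generated descriptions.
    d_mismatched: scores D(I, w') for real but unrelated descriptions.
    All scores are assumed to lie strictly between 0 and 1.
    """
    total = 0.0
    for real, fake, mismatched in zip(d_real, d_fake, d_mismatched):
        total += (math.log(real)
                  + fake_weight * math.log(1.0 - fake)
                  + mismatch_weight * math.log(1.0 - mismatched))
    return total / len(d_real)
```

The third term is what distinguishes this from a plain real-versus-fake objective: a fluent human sentence still has to be scored low when it does not describe the image.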
Claims
1. An image caption generation method based on a common attention mechanism, characterized in that it includes the following steps:
Step 1: Image description method based on generative adversarial networks. The network model is divided into a generator and a discriminator. The former generates a description of the corresponding image; the latter evaluates the accuracy of the text description of the image;
Step 2: The generator in Step 1 uses an encoder-decoder framework; the structure is as follows: the encoder uses a convolutional neural network with a prophetic attention mechanism, and the decoder uses a recurrent neural network. Given an image I, the generator G outputs an image description;
Step 3: The encoder in Step 2 uses Faster R-CNN, which takes image I and extracts image features $V = \{v_1, \ldots, v_k\} \in \mathbb{R}^{d \times N}$;
Step 4: The generator decoder in Step 2 consists of an initial layer and a prophetic attention layer; the initial layer is an LSTM structure, modified to control the generation of image descriptions; the prophetic attention layer uses a bidirectional LSTM to calculate attention weights and improves Self-Attention; the attention weights are divided into present and future parts, where the attention weights of the future part are calculated by predicting the generation probability of the next word;
Step 5: The discriminator network in Step 1 is designed using a common attention mechanism to determine whether the generated image description is human-generated or machine-generated. This discriminator consists of two parts: an image attention module and a text attention module. These two modules extract features from the image and the description respectively and generate corresponding attention matrices. The two attention matrices are then combined through a dot product operation to generate a matrix that represents the degree of semantic matching between the image and the description. Ultimately, this matrix is used as the output of the discriminator to enforce semantic alignment between the image and the description;
Step 6: This step trains the network model with the reinforcement learning method SCST, uses the reward obtained under the decoding algorithm as the baseline, and uses the image description evaluation metric CIDEr for regularization so that the generated description stays close to the provided ground-truth captions at the N-gram level;
Step 7: The discriminator is improved alternately with the generator during training; the two modules are trained together until the descriptions generated by the network reach a balance, finally resulting in a generator network whose descriptions achieve semantic alignment between images and descriptions.