A method for generating stylized image descriptions based on cross-media disentangled representation learning
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2022-07-20
- Publication Date
- 2026-06-12
AI Technical Summary
Existing models lack interpretability and controllability of hidden layer representations and model parameters in the stylized image description generation task, which limits the understanding and further improvement of this task.
The method employs cross-media disentangled representation learning, using style filters and fact filters to separate stylistic and factual information in the hidden layer space. Capsule networks then aggregate the cross-media style representations, and the aggregated representation is used to generate stylized image description text.
It improves the controllability and interpretability of the model, and its generation performance surpasses existing technologies, achieving better stylized image description generation results.
Figure CN115293959B_ABST
Description
Technical Field
[0001] This invention relates to the application of deep learning methods to the generation of stylized image descriptions, and more particularly to a technique using task-specific disentangled representation learning and capsule networks.
Background Technology
[0002] Stylized image captioning (SVC) aims to generate natural language descriptions that are semantically relevant to a given image and consistent with a given linguistic style. These two requirements make this task significantly more challenging than traditional image captioning tasks. With the advent of large-scale image-text cross-media corpora and the advancements in deep learning techniques in computer vision and natural language processing, stylized image captioning has made great progress in recent years. Among existing methods, widely adopted neural networks have demonstrated their powerful ability to address the complexity and challenges of stylized image captioning.
[0003] However, the intermediate states of existing models are processed through many layers of nonlinear transformations, such as the ReLU layers in Convolutional Neural Networks (CNNs). As a result, the hidden layer representations and model parameters lack interpretability and controllability, which may limit the understanding and further improvement of this task.
Summary of the Invention
[0004] The purpose of this invention is to solve the problems existing in the prior art and to provide a stylized image description generation method based on cross-media disentangled representation learning.
[0005] The specific technical solution adopted in this invention is as follows:
[0006] A stylized image description generation method based on cross-media disentangled representation learning, the steps of which are as follows:
[0007] S1: Obtain training data for the image and its stylized image description text;
[0008] S2: Iteratively train the image variational autoencoder network model with style filter and fact filter using the training data. During each model update, disentangled representation learning separates the factual information and the stylistic information in the hidden layer space, finally yielding the optimized image variational autoencoder network model parameters.
[0009] S3: Iteratively train the text variational autoencoder network model with style filter and fact filter using the training data. During each model update, disentangled representation learning separates the factual information and the stylistic information in the hidden layer space, finally yielding the optimized text variational autoencoder network model parameters.
[0010] S4: After updating the model parameters in S2 and S3, fix the model parameters of the image variational autoencoder and the text variational autoencoder. Then, use the reparameterization technique to sample and average the stylized description output by the text variational autoencoder to obtain a fixed general style representation. Construct a stylized image description generation model. In the model, the image variational autoencoder first generates the image's own style and factual representation from the input image. Then, the capsule network aggregation module aggregates the general style representation and the image's own style to obtain an aggregated cross-media style representation. Finally, it is connected with the factual representation to form the final feature representation and input into the descriptive text generator to generate stylized image description text with style.
[0011] S5. Based on the training data, iteratively train the descriptive text generator in the stylized image description generation model and update the network parameters of the descriptive text generator.
[0012] S6. After completing the training of the descriptive text generator in S5, input the target image into the stylized image description generation model, and finally the descriptive text generator outputs the stylized image description text.
[0013] Based on the above technical solution, the following specific methods can be preferred to implement each step.
[0014] Preferably, the specific implementation steps of S2 are as follows:
[0015] S21: Select several batches of data from the training data;
[0016] S22: Apply style filters and fact filters to the intermediate hidden layer space of the variational autoencoder network to form an image variational autoencoder network with style filters and fact filters; the style filters and fact filters are respectively responsible for filtering style information and fact information in the hidden layer space; each filter consists of a classifier and a discriminator, the classifier is an auxiliary classification loss function, and the discriminator is an auxiliary discrimination loss function.
[0017] S23: Feed the data sampled in S21 into the image variational autoencoder network with style filter and fact filter in batches, and train it iteratively.
[0018] Preferably, the specific implementation steps of S3 are as follows:
[0019] S31: Select several batches of data from the training data;
[0020] S32: Apply style filters and fact filters to the intermediate hidden layer space of the variational autoencoder network to form a text variational autoencoder network with style filters and fact filters; the style filters and fact filters are respectively responsible for filtering style information and fact information in the hidden layer space; each filter consists of a classifier and a discriminator, the classifier is an auxiliary classification loss function, and the discriminator is an auxiliary discriminant loss function.
[0021] S33: The data sampled in S31 is fed into a text variational autoencoder network with style filters and fact filters in batches for iterative training.
[0022] Preferably, the learnable parameters in both the classifier and the discriminator include the weights and biases of Softmax.
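The classifier and discriminator heads described above can be pictured as a single affine layer followed by Softmax. The following is a minimal sketch under that assumption; the function names, the toy dimensions, and the numeric values are illustrative and do not come from the patent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_softmax_head(z, weights, biases):
    """Affine layer (one weight row per output class) followed by softmax.

    z is a latent slice (style or fact space); weights and biases are the
    learnable Softmax parameters mentioned in the text.
    """
    logits = [sum(w_k * z_k for w_k, z_k in zip(row, z)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Toy usage: a 3-dimensional latent slice mapped onto two style labels.
probs = linear_softmax_head(
    [0.5, -1.0, 2.0],
    weights=[[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]],
    biases=[0.0, 0.1],
)
```

The same head shape would serve both roles: with two outputs it acts as a style classifier, and with one output per vocabulary word it produces a bag-of-words distribution.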
[0023] Preferably, the training process for each round of training for the image variational autoencoder network and the text variational autoencoder network is as follows:
[0024] 1) First, train the discriminators in the style filter and the fact filter, which detect whether the other space contains information about the current space. The discriminator in the style filter predicts the bag-of-words (BoW) distribution of the stylized description from the fact space $z_f$, while the discriminator in the fact filter predicts the BoW distribution of the factual description from the entire style space $z_s$, where $s \in \{s1, s2\}$ and s1 and s2 denote the two styles. The training losses of the discriminators in the style filter and the fact filter, $\mathcal{L}_{dis(s)}$ and $\mathcal{L}_{dis(f)}$, are respectively:
[0025] $$\mathcal{L}_{dis(s)}(\theta_{dis(s)}) = -\sum_{w \in \mathcal{V}_s} t_s(w) \log p(w \mid z_f; \theta_{dis(s)})$$
[0026] In the formula: $\mathcal{V}_s$ is the style vocabulary with factual words and stop words removed, and $w$ is a word in that vocabulary; $t_s(w)$ is the true BoW distribution of the stylized description; $p(w \mid z_f; \theta_{dis(s)})$ is the predicted BoW distribution of the stylized description output by the discriminator in the style filter under the parameters $\theta_{dis(s)}$; $\theta_{dis(s)}$ denotes the discriminator parameters in the style filter;
[0027] $$\mathcal{L}_{dis(f)}(\theta_{dis(f)}) = -\sum_{w \in \mathcal{V}_f} t_f(w) \log p(w \mid z_s; \theta_{dis(f)})$$
[0028] In the formula: $\mathcal{V}_f$ is the factual vocabulary with style words and stop words removed, and $w$ is a word in that vocabulary; $t_f(w)$ is the true BoW distribution of the factual description; $p(w \mid z_s; \theta_{dis(f)})$ is the predicted BoW distribution of the factual description output by the discriminator in the fact filter under the parameters $\theta_{dis(f)}$; $\theta_{dis(f)}$ denotes the discriminator parameters in the fact filter;
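[0029] 2) Fix the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ of the trained discriminators in the style filter and the fact filter. Then train the VAE network parameters $\theta_{vae}$, the classifier parameters $\theta_{cls(s)}$ in the style filter, and the classifier parameters $\theta_{cls(f)}$ in the fact filter by minimizing the total loss function $\mathcal{L}_{total}$, which has the form:
[0030] $$\mathcal{L}_{total} = \mathcal{L}_{vae} + \lambda_{cls(s)} \mathcal{L}_{cls(s)} - \lambda_{adv(s)} \mathcal{L}_{adv(s)} + \lambda_{cls(f)} \mathcal{L}_{cls(f)} - \lambda_{adv(f)} \mathcal{L}_{adv(f)}$$
[0031] In the formula: $\mathcal{L}_{vae}$ is the original loss of the variational autoencoder network; $\lambda_{cls(s)}$, $\lambda_{adv(s)}$, $\lambda_{cls(f)}$ and $\lambda_{adv(f)}$ are four loss weights; $\mathcal{L}_{cls(s)}$ and $\mathcal{L}_{cls(f)}$ are the classifier losses in the style filter and the fact filter, respectively:
[0032] $$\mathcal{L}_{cls(s)}(\theta_{cls(s)}) = -\sum_{s \in labels} t(s) \log l(s; \theta_{cls(s)})$$
[0033] $$\mathcal{L}_{cls(f)}(\theta_{cls(f)}) = -\sum_{w \in \mathcal{V}_f} t_f(w) \log p(w \mid z_f; \theta_{cls(f)})$$
[0034] In the formula: s represents a style, and labels represents the set of styles [s1, s2]; $t(\cdot)$ is the one-hot vector of the true style distribution of the sample, and $t(s)$ is its entry for style s; $l(\cdot)$ is the distribution vector of the predicted style on the sample, and $l(s; \theta_{cls(s)})$ is the predicted style distribution output by the classifier under the parameters $\theta_{cls(s)}$; $p(w \mid z_f; \theta_{cls(f)})$ is the predicted BoW distribution of the factual description output by the classifier under the parameters $\theta_{cls(f)}$;
[0035] $\mathcal{L}_{adv(s)}$ and $\mathcal{L}_{adv(f)}$ are the two adversarial losses of the style filter and the fact filter, respectively:
[0036] $$\mathcal{L}_{adv(s)} = -\sum_{w \in \mathcal{V}_s} p(w \mid z_f) \log p(w \mid z_f)$$
[0037] $$\mathcal{L}_{adv(f)} = -\sum_{w \in \mathcal{V}_f} p(w \mid z_s) \log p(w \mid z_s)$$
[0038] In the formula: $p(w \mid z_f)$ and $p(w \mid z_s)$ denote the distributions $p(w \mid z_f; \theta_{dis(s)})$ and $p(w \mid z_s; \theta_{dis(f)})$ output by the discriminators after the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ are fixed.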
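The discriminator objective in step 1) is a cross-entropy between a true and a predicted bag-of-words distribution. The sketch below illustrates that computation on toy distributions; the dictionary representation, the `eps` smoothing constant, and the example words are assumptions made for illustration only.

```python
import math

def bow_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy -sum_w t(w) * log p(w) between two BoW distributions.

    Both arguments map a vocabulary word to its probability; eps guards
    against log(0) for words the predictor assigns zero mass.
    """
    return -sum(t * math.log(predicted.get(w, 0.0) + eps)
                for w, t in target.items() if t > 0)

# True BoW distribution of a stylized caption over a tiny style vocabulary.
target = {"lovely": 0.5, "dream": 0.5}

# A discriminator that recovers the style words from the fact space exactly,
# and one that is badly miscalibrated.
loss_match = bow_cross_entropy(target, {"lovely": 0.5, "dream": 0.5})
loss_skewed = bow_cross_entropy(target, {"lovely": 0.9, "dream": 0.1})
```

A low discriminator loss signals that the other space still leaks information about the current one, which is what the adversarial term in step 2) is meant to suppress.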
[0039] Preferably, the specific implementation steps of S4 are as follows:
[0040] S41: After completing the model parameter updates in S2 and S3, fix the model parameters of the image variational autoencoder and the text variational autoencoder; for the text variational autoencoder module, use the reparameterization technique to sample from the posterior distribution of the stylized image description text and take the average to obtain a fixed general style representation $\bar{z}_s^{cap}$, where $s \in \{s1, s2\}$; for style s1 the general style representation is denoted $\bar{z}_{s1}^{cap}$, and for style s2 it is denoted $\bar{z}_{s2}^{cap}$.
[0041] S42: For the image variational autoencoder module, the same reparameterization technique as in S41 is used to sample a style representation $z_s^{img}$ and a factual representation $z_f^{img}$ specific to each image.
[0042] S43: The representations of the images and text obtained in S41 and S42 are input into the capsule network aggregation module for aggregation. The module contains input capsules and output capsules. The two input capsules, $\Omega_1$ and $\Omega_2$, are obtained by projecting the general style representation $\bar{z}_s^{cap}$ and the image-specific style representation $z_s^{img}$ with the learnable matrix parameters U and V, and the n output capsules represent the n parts of the aggregated cross-media representation. Each input capsule $\Omega_i$ has n vote vectors $\{A_{i1}, A_{i2}, \ldots, A_{in}\}$, which represent the contribution of the style information extracted from the image or the descriptive text to each output capsule; the j-th vote vector is:
[0043] $$A_{ij} = \Omega_i W_{ij}, \quad i \in \{1, 2\},\ j \in [1, n]$$
[0044] In the formula: $W_{ij}$ is a learnable weight matrix;
[0045] Each output capsule $O_j$ is defined as:
[0046] $$O_j = \sum_{i} C_{ij} A_{ij}$$
[0047] Where: the coupling coefficient $C_{ij}$ measures the amount of information transmitted between $A_{ij}$ and $O_j$, and is calculated as follows:
[0048] $$C_{ij} = \frac{\exp(B_{ij})}{\sum_{k=1}^{n} \exp(B_{ik})}$$
[0049] Among them: $B_{ij}$ measures the coupling between the input capsule $\Omega_i$ and the output capsule $O_j$; it is initialized to 0 and updated by the following formula:
[0050] $$B_{ij} \leftarrow B_{ij} + A_{ij} \cdot O_j$$
[0051] Subsequently, a nonlinear squashing function is applied to map the length of each output capsule into the range between 0 and 1, so that the length characterizes a probability:
[0052] $$O_j \leftarrow \frac{\|O_j\|^2}{1 + \|O_j\|^2} \cdot \frac{O_j}{\|O_j\|}$$
[0053] Where: $\|\cdot\|$ denotes the modulus and $\|\cdot\|^2$ denotes the square of the modulus;
[0054] Finally, the n output capsules corresponding to style $s \in \{s1, s2\}$ are concatenated to form the aggregated cross-media style representation $z_s$ corresponding to style s, which is then concatenated with the factual representation $z_f^{img}$ to form the final feature representation. The final feature representation corresponding to style s is then input into the descriptive text generator to generate stylized descriptive text with style s.
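The aggregation in S43 follows the general pattern of routing-by-agreement in capsule networks. The sketch below is a minimal pure-Python illustration, assuming the standard routing procedure (softmax coupling over output capsules, squashing before the agreement update) and a fixed number of routing iterations; the function names, the iteration count, and the toy vote vectors are hypothetical and not taken from the patent.

```python
import math

def squash(v):
    """Shrink a vector so its length lies in [0, 1) while keeping direction."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        return [0.0] * len(v)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in v]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def route(votes, iterations=3):
    """votes[i][j] is the vote vector A_ij from input capsule i to output j."""
    n_in, n_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]      # B_ij initialized to 0
    outputs = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]           # C_ij from B_ij
        outputs = []
        for j in range(n_out):
            weighted = [sum(c[i][j] * votes[i][j][d] for i in range(n_in))
                        for d in range(dim)]
            outputs.append(squash(weighted))
        for i in range(n_in):                     # agreement update of B_ij
            for j in range(n_out):
                b[i][j] += sum(a * o for a, o in zip(votes[i][j], outputs[j]))
    return outputs

# Two input capsules (text style, image style), n = 2 output capsules.
votes = [[[1.0, 0.0], [0.0, 2.0]],
         [[1.0, 0.5], [0.0, 1.0]]]
outputs = route(votes)
aggregated = [x for capsule in outputs for x in capsule]  # concatenation
```

The concatenated list plays the role of the aggregated cross-media style representation, which would then be joined with the factual representation before decoding.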
[0055] Preferably, the descriptive text generator is implemented using an LSTM network.
[0056] Preferably, in the descriptive text generator, the final feature representation of the input is processed by performing greedy decoding to generate image-stylized descriptive text with style s.
[0057] Preferably, in step S5, the loss function for training the stylized image description generation model is cross-entropy loss.
[0058] Preferably, in step S6, after the target image is input into the stylized image description generation model, the target image-specific style representation and factual representation are first obtained through the image variational autoencoder trained in step S2. Then, the target image-specific style representation is input into the capsule network aggregation module and aggregated with the general style representation obtained during training to form an aggregated cross-media style representation. After being connected with the target image-specific factual representation, it is input into the descriptive text generator trained in step S5 to perform beam search and generate image description text of the corresponding style.
[0059] Compared to existing technologies, this invention is the first to introduce disentangled representation learning technology into a stylized image description generation model, and proposes an aggregation-then-generation approach to fully utilize disentangled representation information from both images and text. The model proposed in this invention outperforms current state-of-the-art techniques, and demonstrates that disentangled representation learning technology offers better controllability, interpretability, and generative performance in stylized image description generation tasks.
Attached Figure Description
[0060] Figure 1 is a flowchart of the stylized image description generation method of the present invention;
[0061] Figure 2 is a framework diagram of the stylized image description generation model of the present invention.
Detailed Implementation
[0062] The present invention will be further described and illustrated below with reference to the accompanying drawings and specific embodiments.
[0063] This invention is primarily used to disentangle the latent space representations of images and descriptive text. It then aggregates the cross-media style vectors obtained from the image and the descriptive text, and combines them with factual information to generate stylized image descriptions using a descriptive text generator. The implementation process of this invention is described in detail below:
[0064] As shown in Figure 1, in a preferred embodiment of the present invention, a method for generating stylized image descriptions based on cross-media disentangled representation learning is provided, the steps of which are as follows:
[0065] S1: Obtain training data for the images and their stylized image description text, and preprocess them to the same input format.
[0066] In this embodiment, the training dataset is represented as $D = \{(x_i, y_i^{s1}, y_i^{s2}, y_i^{f})\}_{i=1}^{N}$, where $x_i$ represents the i-th image, $y_i^{s1}$ is the description text corresponding to style s1, $y_i^{s2}$ is the description text corresponding to style s2, $y_i^{f}$ is the corresponding factual description text, used to construct the bag-of-words distribution of the factual description, and N is the dataset size.
[0067] S2: Iteratively train the image variational autoencoder network model with style filter and fact filter using the training data. During each model update, disentangled representation learning separates the factual information and the stylistic information in the hidden layer space, finally yielding the optimized image variational autoencoder network model parameters.
[0068] In this embodiment, the specific implementation sub-steps of step S2 are as follows:
[0069] S21: Select several batches of data from the training data;
[0070] S22: Apply style filters and fact filters to the intermediate hidden layer space of the variational autoencoder network to form an image variational autoencoder network with style filters and fact filters; the style filters and fact filters are respectively responsible for filtering style information and fact information in the hidden layer space; each filter consists of a classifier and a discriminator, the classifier is an auxiliary classification loss function, and the discriminator is an auxiliary discrimination loss function.
[0071] S23: Feed the data sampled in S21 into the image variational autoencoder network with style filter and fact filter in batches, and train it iteratively.
[0072] S3: Iteratively train the text variational autoencoder network model with style filter and fact filter using the training data. During each model update, disentangled representation learning separates the factual information and the stylistic information in the hidden layer space, finally yielding the optimized text variational autoencoder network model parameters.
[0073] In this embodiment, the specific implementation sub-steps of step S3 are as follows:
[0074] S31: Select several batches of data from the training data;
[0075] S32: Apply style filters and fact filters to the intermediate hidden layer space of the variational autoencoder network to form a text variational autoencoder network with style filters and fact filters; the style filters and fact filters are respectively responsible for filtering style information and fact information in the hidden layer space; each filter consists of a classifier and a discriminator, the classifier is an auxiliary classification loss function, and the discriminator is an auxiliary discriminant loss function.
[0076] S33: The data sampled in S31 is fed into a text variational autoencoder network with style filters and fact filters in batches for iterative training.
[0077] It should be noted that the image variational autoencoder network and the text variational autoencoder network described above have the same network structure. Both are implemented by setting style filters and fact filters in the hidden layers of the variational autoencoder network, with each filter serving as a loss term to assist in network training. The learnable parameters in both the classifier and discriminator include the weights and biases of the Softmax function.
[0078] In S2 and S3 above, the training process for both the image variational autoencoder network and the text variational autoencoder network is the same for each round of training. The general process steps are described below:
[0079] 1) First, train the discriminators in the style filter and the fact filter, which detect whether the other space contains information about the current space. The discriminator in the style filter predicts the bag-of-words (BoW) distribution of the stylized description from the fact space $z_f$, while the discriminator in the fact filter predicts the BoW distribution of the factual description from the entire style space $z_s$, where $s \in \{s1, s2\}$ and s1 and s2 denote the two styles. The training losses of the discriminators in the style filter and the fact filter, $\mathcal{L}_{dis(s)}$ and $\mathcal{L}_{dis(f)}$, are respectively:
[0080] $$\mathcal{L}_{dis(s)}(\theta_{dis(s)}) = -\sum_{w \in \mathcal{V}_s} t_s(w) \log p(w \mid z_f; \theta_{dis(s)})$$
[0081] In the formula: $\mathcal{V}_s$ is the style vocabulary with factual words and stop words removed, and $w$ is a word in that vocabulary; $t_s(w)$ is the true BoW distribution of the stylized description; $p(w \mid z_f; \theta_{dis(s)})$ is the predicted BoW distribution of the stylized description output by the discriminator in the style filter under the parameters $\theta_{dis(s)}$; $\theta_{dis(s)}$ denotes the discriminator parameters in the style filter;
[0082] $$\mathcal{L}_{dis(f)}(\theta_{dis(f)}) = -\sum_{w \in \mathcal{V}_f} t_f(w) \log p(w \mid z_s; \theta_{dis(f)})$$
[0083] In the formula: $\mathcal{V}_f$ is the factual vocabulary with style words and stop words removed, and $w$ is a word in that vocabulary; $t_f(w)$ is the true BoW distribution of the factual description; $p(w \mid z_s; \theta_{dis(f)})$ is the predicted BoW distribution of the factual description output by the discriminator in the fact filter under the parameters $\theta_{dis(f)}$; $\theta_{dis(f)}$ denotes the discriminator parameters in the fact filter;
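[0084] 2) Fix the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ of the trained discriminators in the style filter and the fact filter. Then train the VAE network parameters $\theta_{vae}$, the classifier parameters $\theta_{cls(s)}$ in the style filter, and the classifier parameters $\theta_{cls(f)}$ in the fact filter by minimizing the total loss function $\mathcal{L}_{total}$, which has the form:
[0085] $$\mathcal{L}_{total} = \mathcal{L}_{vae} + \lambda_{cls(s)} \mathcal{L}_{cls(s)} - \lambda_{adv(s)} \mathcal{L}_{adv(s)} + \lambda_{cls(f)} \mathcal{L}_{cls(f)} - \lambda_{adv(f)} \mathcal{L}_{adv(f)}$$
[0086] In the formula: $\mathcal{L}_{vae}$ is the original loss of the variational autoencoder network; $\lambda_{cls(s)}$, $\lambda_{adv(s)}$, $\lambda_{cls(f)}$ and $\lambda_{adv(f)}$ are four loss weights; $\mathcal{L}_{cls(s)}$ and $\mathcal{L}_{cls(f)}$ are the classifier losses in the style filter and the fact filter, respectively:
[0087] $$\mathcal{L}_{cls(s)}(\theta_{cls(s)}) = -\sum_{s \in labels} t(s) \log l(s; \theta_{cls(s)})$$
[0088] $$\mathcal{L}_{cls(f)}(\theta_{cls(f)}) = -\sum_{w \in \mathcal{V}_f} t_f(w) \log p(w \mid z_f; \theta_{cls(f)})$$
[0089] In the formula: s represents a style, and labels represents the set of styles [s1, s2]; $t(\cdot)$ is the one-hot vector of the true style distribution of the sample, and $t(s)$ is its entry for style s; $l(\cdot)$ is the distribution vector of the predicted style on the sample, and $l(s; \theta_{cls(s)})$ is the predicted style distribution output by the classifier under the parameters $\theta_{cls(s)}$; $p(w \mid z_f; \theta_{cls(f)})$ is the predicted BoW distribution of the factual description output by the classifier under the parameters $\theta_{cls(f)}$;
[0090] $\mathcal{L}_{adv(s)}$ and $\mathcal{L}_{adv(f)}$ are the two adversarial losses of the style filter and the fact filter, respectively:
[0091] $$\mathcal{L}_{adv(s)} = -\sum_{w \in \mathcal{V}_s} p(w \mid z_f) \log p(w \mid z_f)$$
[0092] $$\mathcal{L}_{adv(f)} = -\sum_{w \in \mathcal{V}_f} p(w \mid z_s) \log p(w \mid z_s)$$
[0093] In the formula: $p(w \mid z_f)$ and $p(w \mid z_s)$ denote the distributions $p(w \mid z_f; \theta_{dis(s)})$ and $p(w \mid z_s; \theta_{dis(f)})$ output by the discriminators after the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ are fixed.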
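To show how the terms of step 2) might combine, the sketch below assumes that each adversarial term is the entropy of the frozen discriminator's predicted distribution and that it enters the total loss with a negative sign, so that minimizing the total loss pushes the discriminator output toward an uninformative distribution. This sign convention is a common choice in adversarial disentanglement and is an assumption here; the function names and default weights are illustrative.

```python
import math

def entropy(dist, eps=1e-12):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(p + eps) for p in dist if p > 0)

def total_loss(l_vae, l_cls_s, l_cls_f, p_style_from_fact, p_fact_from_style,
               lam_cls_s=1.0, lam_adv_s=1.0, lam_cls_f=1.0, lam_adv_f=1.0):
    """Combine VAE, classifier and adversarial (entropy) terms.

    p_style_from_fact: frozen style-filter discriminator output given z_f.
    p_fact_from_style: frozen fact-filter discriminator output given z_s.
    """
    adv_s = entropy(p_style_from_fact)
    adv_f = entropy(p_fact_from_style)
    return (l_vae
            + lam_cls_s * l_cls_s - lam_adv_s * adv_s
            + lam_cls_f * l_cls_f - lam_adv_f * adv_f)

# A latent space that leaks information lets the discriminator be confident,
# which yields low entropy and therefore a higher total loss.
leaky = total_loss(1.0, 0.5, 0.5, [0.99, 0.01], [0.99, 0.01])
clean = total_loss(1.0, 0.5, 0.5, [0.5, 0.5], [0.5, 0.5])
```

In an actual training loop the two phases would alternate: the discriminators are updated on their own cross-entropy losses, then frozen while the encoder and classifiers are updated on the combined loss.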
[0094] It should be noted that $\mathcal{L}_{vae}$, the original loss of the variational autoencoder network, takes a form that belongs to the existing technology of variational autoencoder networks. It consists of a reconstruction loss term and a KL regularization term, and can be expressed as:
[0095] $$\mathcal{L}_{vae}(\theta_{vae}) = -\mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] + \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right)$$
[0096] Where x represents the input text or image, z represents the hidden representation of the input data, $q(z \mid x)$ represents the probability distribution of the encoder, and $p(x \mid z)$ represents the probability distribution of the decoder. The VAE model parameters are denoted as $\theta_{vae}$. The prior $p(z)$ is assumed to be a standard normal distribution. The posterior distribution $q(z \mid x)$ of z is $\mathcal{N}(\mu, \mathrm{diag}\,\sigma^2)$, whose parameters $\mu$ and $\sigma^2$ are predicted by the encoder.
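When the posterior is a diagonal Gaussian and the prior is standard normal, the KL regularization term has a well-known closed form, $\frac{1}{2}\sum_d (\mu_d^2 + \sigma_d^2 - 1 - \log \sigma_d^2)$. The sketch below evaluates it; parameterizing the encoder output as a log-variance and passing the reconstruction term in as a precomputed number are assumptions made for illustration.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(reconstruction_nll, mu, log_var):
    """Reconstruction negative log-likelihood plus the KL regularizer."""
    return reconstruction_nll + gaussian_kl(mu, log_var)

kl_at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])   # posterior equals prior
kl_shifted = gaussian_kl([1.0], [0.0])              # unit shift in the mean
```

The KL term vanishes only when the posterior matches the prior, so it penalizes encoders that drift far from the standard normal.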
[0097] The forms of the loss terms described above are the same in the image variational autoencoder and the text variational autoencoder. To distinguish between them, the superscript "img" can be used to mark the losses and parameters of the image variational autoencoder, while the superscript "cap" can be used to mark those of the text variational autoencoder. For example, the total loss functions of the image variational autoencoder and the text variational autoencoder are denoted $\mathcal{L}_{total}^{img}$ and $\mathcal{L}_{total}^{cap}$, respectively. The same applies to the other losses and parameters.
[0098] S4: After updating the model parameters in S2 and S3, fix the model parameters of the image variational autoencoder and the text variational autoencoder. Then, use the reparameterization trick to sample and average the stylized description output by the text variational autoencoder to obtain a fixed general style representation. Construct a stylized image description generation model. In the model, the image variational autoencoder first generates the style representation and the factual representation of the input image (that is, representations specific to each image, which differ from image to image). Then, the capsule network aggregation module aggregates the general style representation and the style representation of the image itself to obtain the aggregated cross-media style representation. Finally, this is concatenated with the factual representation to form the final feature representation, which is input into the descriptive text generator to generate stylized image description text with the given style.
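The sampling-and-averaging step for the general style representation can be sketched as follows. The code assumes each caption's posterior is given as a mean vector and a log-variance vector, draws one reparameterized sample per caption, and averages the samples; the function names and the toy posteriors are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def general_style_representation(posteriors, rng):
    """Average one reparameterized sample per caption posterior."""
    samples = [reparameterize(mu, lv, rng) for mu, lv in posteriors]
    dim = len(samples[0])
    return [sum(s[d] for s in samples) / len(samples) for d in range(dim)]

# Toy posteriors for 100 captions of one style, all tightly concentrated
# around the same mean so the average is easy to check.
rng = random.Random(0)
posteriors = [([1.0, -2.0], [-20.0, -20.0]) for _ in range(100)]
z_bar = general_style_representation(posteriors, rng)
```

Because the autoencoder parameters are frozen at this point, the resulting vector can be computed once per style and reused at inference time.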
[0099] In this embodiment, the specific implementation sub-steps of step S4 are as follows:
[0100] S41: After completing the model parameter updates in S2 and S3, fix the model parameters of the image variational autoencoder and the text variational autoencoder; for the text variational autoencoder module, use the reparameterization technique to sample from the posterior distribution of the stylized image description text and take the average to obtain a fixed general style representation $\bar{z}_s^{cap}$, where $s \in \{s1, s2\}$; for style s1 the general style representation is denoted $\bar{z}_{s1}^{cap}$, and for style s2 it is denoted $\bar{z}_{s2}^{cap}$.
[0101] S42: For the image variational autoencoder module, the same reparameterization technique as in S41 is used to sample a style representation $z_s^{img}$ and a factual representation $z_f^{img}$ specific to each image.
[0102] S43: The representations of the images and text obtained in S41 and S42 are input into the capsule network aggregation module for aggregation. The module contains input capsules and output capsules. The two input capsules, $\Omega_1$ and $\Omega_2$, are obtained by projecting the general style representation $\bar{z}_s^{cap}$ and the image-specific style representation $z_s^{img}$ with the learnable matrix parameters U and V, and the n output capsules represent the n parts of the aggregated cross-media representation. Each input capsule $\Omega_i$ has n vote vectors $\{A_{i1}, A_{i2}, \ldots, A_{in}\}$, which represent the contribution of the style information extracted from the image or the descriptive text to each output capsule; the j-th vote vector is:
[0103] $$A_{ij} = \Omega_i W_{ij}, \quad i \in \{1, 2\},\ j \in [1, n]$$
[0104] In the formula: $W_{ij}$ is a learnable weight matrix;
[0105] Each output capsule $O_j$ is defined as:
[0106] $$O_j = \sum_{i} C_{ij} A_{ij}$$
[0107] Where: the coupling coefficient $C_{ij}$ measures the amount of information transmitted between $A_{ij}$ and $O_j$, and is calculated as follows:
[0108] $$C_{ij} = \frac{\exp(B_{ij})}{\sum_{k=1}^{n} \exp(B_{ik})}$$
[0109] Among them: $B_{ij}$ measures the coupling between the input capsule $\Omega_i$ and the output capsule $O_j$; it is initialized to 0 and updated by the following formula:
[0110] $$B_{ij} \leftarrow B_{ij} + A_{ij} \cdot O_j$$
[0111] Subsequently, a nonlinear squashing function is applied to map the length of each output capsule into the range between 0 and 1, so that the length characterizes a probability:
[0112] $$O_j \leftarrow \frac{\|O_j\|^2}{1 + \|O_j\|^2} \cdot \frac{O_j}{\|O_j\|}$$
[0113] Where: $\|\cdot\|$ denotes the modulus and $\|\cdot\|^2$ denotes the square of the modulus;
[0114] Finally, the n output capsules corresponding to style $s \in \{s1, s2\}$ are concatenated to form the aggregated cross-media style representation $z_s$ corresponding to style s, which is then concatenated with the factual representation $z_f^{img}$ to form the final feature representation. The final feature representation corresponding to style s is then input into the descriptive text generator to generate stylized descriptive text with style s.
[0115] In this embodiment, the descriptive text generator is constructed using an LSTM network framework. In this descriptive text generator, the final feature representation of the input is processed by a greedy decoding algorithm to generate image-stylized descriptive text with style s.
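Greedy decoding picks the single most probable word at every step. The sketch below shows the loop with a toy lookup table standing in for the LSTM generator; the `step_fn` interface, the special tokens, and the table contents are hypothetical and only serve to make the example runnable.

```python
def greedy_decode(step_fn, start_token, end_token, max_len=20):
    """Repeatedly pick the most probable next token until end_token."""
    tokens = [start_token]
    state = None
    for _ in range(max_len):
        probs, state = step_fn(tokens[-1], state)
        next_token = max(probs, key=probs.get)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token

# Toy next-word table used in place of a trained LSTM step.
TABLE = {
    "<s>": {"a": 0.6, "dog": 0.3, "</s>": 0.1},
    "a": {"dog": 0.7, "a": 0.2, "</s>": 0.1},
    "dog": {"</s>": 0.8, "a": 0.2},
}

def toy_step(prev_token, state):
    """Return the next-word distribution and an unchanged dummy state."""
    return TABLE[prev_token], state

caption = greedy_decode(toy_step, "<s>", "</s>")
```

In the described model the recurrent state would be initialized from the final feature representation, so the chosen style influences every decoding step.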
[0116] S5. Based on the training data, iteratively train the descriptive text generator in the stylized image description generation model, and update the network parameters of the descriptive text generator.
[0117] In this embodiment, the loss function for training the stylized image description generation model is cross-entropy loss.
[0118] S6. After completing the training of the descriptive text generator in S5, input the target image into the stylized image description generation model, and finally the descriptive text generator outputs the stylized image description text.
[0119] In this embodiment, in step S6 above, after the target image is input into the stylized image description generation model, the specific process of generating stylized image description text is the same as that of the aforementioned training samples. Specifically, the target image-specific style representation and factual representation are obtained first through the image variational autoencoder trained in S2. Then, the target image-specific style representation is input into the capsule network aggregation module and aggregated with the general style representation obtained during training to form an aggregated cross-media style representation. After being connected with the target image-specific factual representation, it is input into the description text generator trained in S5 to perform beam search and generate image description text of the corresponding style.
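Beam search keeps several partial captions alive instead of committing to the best word at each step. The following sketch assumes a `step_fn` that returns a next-word distribution for a given prefix, scores hypotheses by summed log-probability, and applies no length normalization; all names and the toy table are hypothetical.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=2, max_len=20):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [(0.0, [start_token])]   # (log-probability, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, p in step_fn(seq).items():
                if p <= 0.0:
                    continue
                cand = (score + math.log(p), seq + [tok])
                if tok == end_token:
                    finished.append(cand)
                else:
                    candidates.append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    best = max(finished or beams, key=lambda c: c[0])
    return [t for t in best[1] if t not in (start_token, end_token)]

# Toy model in which the locally best first word leads to a worse caption.
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}

def toy_step(seq):
    return TABLE[seq[-1]]

caption = beam_search(toy_step, "<s>", "</s>", beam_size=2)
```

With a beam of two, the search recovers "the dog" (probability 0.36), whereas picking the most probable word at each step would commit to "a" first and end with a caption of probability 0.30.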
[0120] The methods shown in S1 to S6 above will be applied to specific embodiments below, and the specific implementation details and technical effects of the above implementation steps in the embodiments will be described in detail.
[0121] Example
[0122] In this embodiment, the steps of the stylized image description generation method based on cross-media disentangled representation learning are as follows:
[0123] Step 1: Obtain training data for the images and their stylized image description text, and preprocess them to the same input format.
[0124] Step 2: Iteratively train the image variational autoencoder network model with style filters and fact filters using the training data. During each model update, disentangled representation learning is used to separate factual and stylistic information in the hidden layer space, ultimately obtaining the optimized parameters of the image variational autoencoder network model. The specific process of this step is as follows:
[0125] Step 21: Select a batch of data from the image training data;
[0126] Step 22: Apply two hidden space filters, a style filter and a fact filter, to the intermediate hidden space of the variational autoencoder network;
[0127] Step 23: Feed the data sampled in Step 21 into an image variational autoencoder network (D-Images) with style filters and fact filters.
[0128] By iteratively executing steps 21 to 23 above, the image variational autoencoder network can be trained.
[0129] The style and fact filters described above are responsible for filtering style and fact information in the hidden layer space. Each filter includes an auxiliary classification loss and an auxiliary discriminant loss function to separate the hidden layer space z into spatial slices containing only style and fact information, respectively. Each filter consists of a corresponding classifier and discriminator. These filters slice the hidden layer representation into different segments. Each filter preserves the selected information, such as the style and fact information latent in the image, by minimizing the auxiliary classifier loss function, and removes other irrelevant information by using another auxiliary discriminator loss function.
[0130] The auxiliary classification loss function ensures that each hidden-layer subspace captures its corresponding style or factual information. Specifically, in each style space, the classifier of the style filter is applied to predict the style label l(·), where l(·) denotes the distribution vector of the predicted style for the sample. Following the disentangled representation learning method, the classifier is trained by minimizing the cross-entropy between the predicted and true distributions, as shown in Equation (1):
[0131] $$\mathcal{L}_{cls(s)}(\theta_{cls(s)}) = -\sum_{s \in labels} t(s) \log l(s; \theta_{cls(s)}) \tag{1}$$
[0132] where s denotes a style and labels denotes the set of styles [s1, s2]. In this embodiment, two style labels are set, which can be a romantic style and a humorous style, or a positive style and a negative style. t(·) denotes the one-hot vector of the true style distribution of the sample, and t(s) is its component for style s. l(·) is the distribution vector of the predicted style output by the classifier, whose parameters are denoted $\theta_{cls(s)} = [W_{cls(s)}, b_{cls(s)}]$; therefore $l(s; \theta_{cls(s)})$ is the predicted style distribution output by the classifier under the parameters $\theta_{cls(s)}$.
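The cross-entropy objective for the style classifier described above can be illustrated with a minimal sketch. It assumes a single linear layer followed by Softmax (consistent with parameters made of a weight W and a bias b); the function names, the two-dimensional style slice, and the example values are hypothetical and not taken from the patent.

```python
import math

def softmax(logits):
    # Numerically stable Softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def style_classifier_loss(z_s, W, b, true_one_hot):
    # Assumed classifier: logits[k] = W[k] . z_s + b[k], then Softmax.
    logits = [sum(w * z for w, z in zip(row, z_s)) + b_k
              for row, b_k in zip(W, b)]
    predicted = softmax(logits)
    # Cross-entropy between the one-hot true style and the prediction.
    loss = -sum(t * math.log(p)
                for t, p in zip(true_one_hot, predicted) if t > 0)
    return loss, predicted

# Hypothetical style slice with label s1 (one-hot [1, 0]).
loss, predicted = style_classifier_loss(
    z_s=[1.0, 2.0], W=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0],
    true_one_hot=[1.0, 0.0])
```

With all-zero parameters the classifier is uninformative, so the prediction is uniform over the two styles and the loss equals log 2; training would lower it by moving probability mass toward the true style.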
[0133] Similarly, in the fact space, this invention applies a classifier to predict the bag-of-words (BoW) distribution of factual descriptions, with the training objective:
[0134] $$\mathcal{L}_{cls(f)}(\theta_{cls(f)}) = -\sum_{w \in V_f} t_f(w) \log p(w \mid z_f; \theta_{cls(f)}) \tag{2}$$
[0135] In Equation (2), $t_f(w)$ is the true BoW distribution of the factual description, and $p(w \mid z_f)$ is the predicted BoW distribution of the factual description output by the classifier. $V_f$ is the factual vocabulary with style words and stop words removed, and w is a word in that vocabulary. The classifier parameters are denoted $\theta_{cls(f)} = [W_{cls(f)}, b_{cls(f)}]$; therefore $p(w \mid z_f; \theta_{cls(f)})$ is the predicted BoW distribution of factual descriptions output by the classifier under the parameters $\theta_{cls(f)}$.
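As an illustration of the bag-of-words objective described above, the following sketch builds a true BoW distribution from a tokenized factual description restricted to a factual vocabulary, and scores a predicted distribution against it with cross-entropy. The tokenization, the tiny vocabulary, and the function names are assumptions made for the example only.

```python
import math
from collections import Counter

def bow_distribution(tokens, vocabulary):
    # Count only words kept in the vocabulary (style words and stop
    # words are assumed to have been removed from it already).
    vocab = set(vocabulary)
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}

def bow_cross_entropy(true_dist, predicted_dist, eps=1e-12):
    # Cross-entropy between the true and predicted BoW distributions.
    return -sum(true_dist[w] * math.log(predicted_dist[w] + eps)
                for w in true_dist)

vocabulary = ["dog", "grass", "runs"]          # hypothetical factual vocabulary
tokens = "a dog runs on the grass dog".split()
true_dist = bow_distribution(tokens, vocabulary)
uniform = {w: 1.0 / len(vocabulary) for w in vocabulary}
loss_uniform = bow_cross_entropy(true_dist, uniform)
loss_exact = bow_cross_entropy(true_dist, true_dist)
```

A prediction that matches the true distribution reaches the minimum of this loss, while the uniform prediction is penalized more heavily.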
[0136] The auxiliary discriminant loss function ensures that each hidden space is free of information from the other spaces. Specifically, an adversarial discriminator is first trained to detect whether another space contains information belonging to the current space. The discriminator in the style filter predicts the bag-of-words distribution of stylized descriptions from the fact space $z_f$, while the discriminator in the fact filter predicts the bag-of-words distribution of factual descriptions from the entire style space $z_s$. The training objectives of the discriminators are shown in Equations (3) and (4):
[0137] $$\mathcal{L}_{dis(s)}(\theta_{dis(s)}) = -\sum_{w \in V_s} t_s(w) \log p(w \mid z_f; \theta_{dis(s)}) \tag{3}$$
[0138] $$\mathcal{L}_{dis(f)}(\theta_{dis(f)}) = -\sum_{w \in V_f} t_f(w) \log p(w \mid z_s; \theta_{dis(f)}) \tag{4}$$
[0139] where $V_s$ is the style vocabulary with factual words and stop words removed, w is a word in the vocabulary, $t_s(w)$ is the true bag-of-words (BoW) distribution of the stylized description, $p(w \mid z_f; \theta_{dis(s)})$ is the BoW distribution predicted from the fact space by the discriminator in the style filter under the parameters $\theta_{dis(s)}$, and $\theta_{dis(s)}$ denotes the discriminator parameters in the style filter. $V_f$ is the factual vocabulary with style words and stop words removed, $t_f(w)$ is the true BoW distribution of the factual description, $p(w \mid z_s; \theta_{dis(f)})$ is the BoW distribution predicted from the style space by the discriminator in the fact filter under the parameters $\theta_{dis(f)}$, and $\theta_{dis(f)}$ denotes the discriminator parameters in the fact filter. $\theta_{dis(s)}$ and $\theta_{dis(f)}$ likewise consist of the weights and biases of a Softmax layer.
[0140] After the discriminators are trained, the VAE learns, in the manner of adversarial training, to "deceive" the discriminators in the two filters by maximizing the information entropy of their predictions, that is, by minimizing the loss functions shown in Equations (5) and (6):
[0141] $$\mathcal{L}_{adv(s)} = \sum_{w \in V_s} p(w \mid z_f) \log p(w \mid z_f) \tag{5}$$
[0142] $$\mathcal{L}_{adv(f)} = \sum_{w \in V_f} p(w \mid z_s) \log p(w \mid z_s) \tag{6}$$
[0143] where $p(w \mid z_f)$ and $p(w \mid z_s)$ denote the discriminator outputs $p(w \mid z_f; \theta_{dis(s)})$ and $p(w \mid z_s; \theta_{dis(f)})$ once the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ have been fixed.
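The adversarial step described above (deceiving a fixed discriminator by maximizing the entropy of its prediction) can be sketched as a negative-entropy loss. This is only an illustration under the assumption that the loss is the negated Shannon entropy of the discriminator output; the probability vectors below are made-up values.

```python
import math

def negative_entropy(probs, eps=1e-12):
    # Minimizing sum(p * log p) is the same as maximizing the entropy,
    # which pushes the discriminator's prediction toward uniform.
    return sum(p * math.log(p + eps) for p in probs)

# Hypothetical discriminator outputs over a four-word vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # discriminator still finds leaked information
confused = [0.25, 0.25, 0.25, 0.25]    # discriminator has been "deceived"

loss_confident = negative_entropy(confident)
loss_confused = negative_entropy(confused)
```

The uniform prediction gives the lowest value (about -log 4), which is the state the encoder is driven toward when the other space no longer leaks information.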
[0144] Note that, among the above loss terms, the discriminator losses of Equations (3) and (4) are minimized first to train the two discriminators; the parameters of the image variational autoencoder network (VAE) and of the two classifiers are trained afterwards. During the training of the VAE network, the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ of the two previously trained discriminators are fixed and not updated; only the VAE network parameters $\theta_{vae}$, the classifier parameters $\theta_{cls(s)}$ in the style filter, and the classifier parameters $\theta_{cls(f)}$ in the fact filter are updated, by minimizing the total loss function. For the image variational autoencoder network, the above losses are distinguished by the superscript "img", and the total loss function has the form:
[0145] $$\mathcal{L}^{img}_{total} = \mathcal{L}^{img}_{vae} + \lambda^{img}_{cls(s)} \mathcal{L}^{img}_{cls(s)} + \lambda^{img}_{cls(f)} \mathcal{L}^{img}_{cls(f)} + \lambda^{img}_{adv(s)} \mathcal{L}^{img}_{adv(s)} + \lambda^{img}_{adv(f)} \mathcal{L}^{img}_{adv(f)}$$
[0146] where $\lambda^{img}_{cls(s)}$, $\lambda^{img}_{cls(f)}$, $\lambda^{img}_{adv(s)}$, and $\lambda^{img}_{adv(f)}$ are four loss weights; $\mathcal{L}^{img}_{cls(s)}$ and $\mathcal{L}^{img}_{cls(f)}$ take the forms shown in Equations (1) and (2), respectively; $\mathcal{L}^{img}_{adv(s)}$ and $\mathcal{L}^{img}_{adv(f)}$ take the forms shown in Equations (5) and (6), respectively; and $\mathcal{L}^{img}_{vae}$ is the original loss of the image variational autoencoder network.
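The weighted combination of losses described above can be sketched as follows. The sketch assumes the total loss is the original VAE loss plus the two classifier losses and the two adversarial losses, each scaled by its own weight; the dictionary keys and numeric values are hypothetical.

```python
def total_loss(vae_loss, cls_s, cls_f, adv_s, adv_f, weights):
    # Weighted sum minimized with the discriminator parameters held fixed.
    return (vae_loss
            + weights["cls_s"] * cls_s
            + weights["cls_f"] * cls_f
            + weights["adv_s"] * adv_s
            + weights["adv_f"] * adv_f)

# Hypothetical loss values and weights for one mini-batch.
weights = {"cls_s": 2.0, "cls_f": 2.0, "adv_s": 2.0, "adv_f": 2.0}
value = total_loss(vae_loss=1.0, cls_s=0.5, cls_f=0.25,
                   adv_s=-0.5, adv_f=-0.25, weights=weights)
```

In a full training loop the discriminator losses would be minimized first on the same mini-batch, and only then would this total be used to update the VAE and classifier parameters.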
[0147] The original loss of the image variational autoencoder network is in the form of:
[0148]
[0149] In the above equation, the visual representation is encoded and reconstructed through two transformations, f(·) and g(·), that is, the visual representation is mapped to the hidden layer representation and then reconstructed back to the original visual representation. The reconstruction loss is defined as |ε(x)-g(f(ε(x)))|, where |·| represents the l2 norm.
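The reconstruction term described above can be illustrated with a small sketch that measures the l2 norm between a visual feature vector and its reconstruction. The encoder and decoder here are toy stand-ins (truncation and zero-padding), chosen only so the example runs; they are not the networks f(·) and g(·) of the patent.

```python
import math

def l2_reconstruction_loss(features, encode, decode):
    # |e(x) - g(f(e(x)))| with |.| taken as the l2 norm.
    reconstructed = decode(encode(features))
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(features, reconstructed)))

# Toy stand-ins: keep the first two dimensions, then pad with zeros.
def toy_encode(v):
    return v[:2]

def toy_decode(z):
    return z + [0.0]

features = [3.0, 4.0, 12.0]   # hypothetical visual representation
lossy = l2_reconstruction_loss(features, toy_encode, toy_decode)
exact = l2_reconstruction_loss(features, lambda v: v, lambda z: z)
```

Here the toy encoder discards the third dimension, so the loss equals the magnitude of the discarded component, while an identity mapping reconstructs the input with zero loss.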
[0150] Step 3: Iteratively train the text variational autoencoder network model with a style filter and a fact filter using the training data. During each model update, the disentanglement module separates the factual information and stylistic information in the hidden layer space through disentangled representation learning. Finally, the optimized parameters of the text variational autoencoder network model (D-Captions) are obtained.
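[0151] The training process for the text variational autoencoder network model is the same as that for the image variational autoencoder network model: a batch of data is selected from the text training data, and Steps 21 to 23 are iterated on the text variational autoencoder. For the text variational autoencoder network, the above losses and weights are distinguished by the superscript "cap". The discriminator training losses are then $\mathcal{L}^{cap}_{dis(s)}$ and $\mathcal{L}^{cap}_{dis(f)}$, and the total loss function $\mathcal{L}^{cap}_{total}$ has the form:
[0152] $$\mathcal{L}^{cap}_{total} = \mathcal{L}^{cap}_{vae} + \lambda^{cap}_{cls(s)} \mathcal{L}^{cap}_{cls(s)} + \lambda^{cap}_{cls(f)} \mathcal{L}^{cap}_{cls(f)} + \lambda^{cap}_{adv(s)} \mathcal{L}^{cap}_{adv(s)} + \lambda^{cap}_{adv(f)} \mathcal{L}^{cap}_{adv(f)}$$
[0153] where $\mathcal{L}^{cap}_{vae}$, the original loss of the text variational autoencoder network, has the form: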
[0154]
[0155] In the above formula, the reconstruction loss is defined in terms of the vocabulary probability of predicting the word $x_t$ at each time step t, where $x = (x_1, x_2, \ldots, x_n)$ is the stylized image description text to be reconstructed.
[0156] For the D-Captions module, each input stylized image description represents only one specific style; therefore, the hidden space is divided into two spatial slices $(z_s, z_f)$, where $z_s$ is the spatial slice for that specific style (s1 or s2). Since an input stylized description contains only one specific style, this slice constitutes the entire style space $z_s$.
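The slicing of the hidden space described above can be sketched as a simple partition of a latent vector into named segments. The slice sizes and the helper name are assumptions for illustration; the text only states that D-Captions uses two slices (style and fact), and that the image model later uses three (style s1, style s2, fact).

```python
def split_latent(z, slice_sizes):
    # Cut the hidden vector into consecutive named slices.
    slices, start = {}, 0
    for name, size in slice_sizes:
        slices[name] = z[start:start + size]
        start += size
    if start != len(z):
        raise ValueError("slice sizes must cover the whole latent vector")
    return slices

z = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # hypothetical 6-dimensional hidden vector

# D-Captions: one style slice and one fact slice.
caption_slices = split_latent(z, [("z_s", 2), ("z_f", 4)])
# D-Images: one slice per style plus a fact slice.
image_slices = split_latent(z, [("z_s1", 2), ("z_s2", 2), ("z_f", 2)])
```

Each filter's classifier and discriminator would then read only its designated slice, which is what makes the separation of style and fact information inspectable.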
[0157] Step 4: After updating the model parameters in Steps 2 and 3, fix the model parameters of the image variational autoencoder and the text variational autoencoder, and then perform the following steps:
[0158] Step 41: For the text variational autoencoder module, using the reparameterization technique, a fixed, general style representation is obtained for each style (one for style s1 and one for style s2) by sampling from the posterior distribution q(z|x) of the stylized descriptions and averaging, that is, by taking the mean of the hidden style representations of the target-style descriptive texts after they are encoded by D-Captions.
[0159] Step 42: For the image variational autoencoder module, the reparameterization technique of Step 41 is likewise used to sample the image-specific style representations and the image-specific fact representation.
[0160] Step 43: Construct the Disentangled Stylized Image Caption (DSIC) generation model, whose structure is shown in Figure 2.
[0161] In this DSIC model, a decoder designed for the aggregation method is used to generate a description of the target style. For an input image $x^*$, let $q(z_{s1} \mid x^*)$, $q(z_{s2} \mid x^*)$, and $q(z_f \mid x^*)$ denote the posterior distributions over the three spatial slices for style s1, style s2, and facts $z_f$, respectively. Using the reparameterization technique, the following vectors can be sampled from the image variational autoencoder:
[0162]
[0163]
[0164]
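The reparameterization sampling and averaging used in Steps 41 and 42 can be sketched as follows, assuming a diagonal Gaussian posterior parameterized by a mean and a log-variance (a common VAE convention that the text does not spell out). The posterior values and helper names are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * epsilon, with epsilon drawn from N(0, 1).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def general_style_representation(posteriors, rng):
    # Sample one style vector per description, then average the samples
    # to obtain a fixed, general style representation.
    samples = [reparameterize(mu, lv, rng) for mu, lv in posteriors]
    dim = len(samples[0])
    return [sum(s[d] for s in samples) / len(samples) for d in range(dim)]

rng = random.Random(0)
# Hypothetical posteriors (mean, log-variance) for two s1-style captions;
# the very small variance makes the samples sit at the means.
posteriors = [([1.0, 2.0], [-100.0, -100.0]),
              ([3.0, 4.0], [-100.0, -100.0])]
style_s1 = general_style_representation(posteriors, rng)
```

The same `reparameterize` helper would be applied to the image posteriors to draw the image-specific style and fact vectors, without the averaging step.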
[0165] Subsequently, the vectors sampled from the image and the fixed style vectors learned from the descriptive text are input into the aggregation module to obtain the aggregated cross-media style representations for style s1 and style s2.
[0166] During the training phase, each of the two aggregated style representations is concatenated with the image-specific fact representation and input into the descriptive text generator (here, an LSTM), where greedy decoding is performed to generate image descriptions with style s1 and style s2, respectively. During the testing phase, the aggregated representations obtained through the above steps are likewise concatenated with the image-specific fact representation, and a beam search of size 5 is performed in the trained descriptive text generator.
[0167] The specific implementation process of the DSIC model is described in detail below:
[0168] In this model, the style and factual representations of the images and texts obtained in steps 41 and 42 are input into the capsule network aggregation module for aggregation (denoted as AGGREGATE). Finally, the results are output through the descriptive text generator. The specific process is as follows:
[0169] Capsule network aggregation takes two input capsules, $\Omega_1$ and $\Omega_2$, in which U and V are learnable matrix parameters. There are n output capsules, each representing one of the n parts of the aggregated cross-media representation. Each input capsule $\Omega_i$ has n "vote vectors" $\{A_{i1}, A_{i2}, \ldots, A_{in}\}$, which represent the contribution of the style information extracted from the image or the descriptive text to the output capsules. Specifically, the j-th vote vector is:
[0170] $$A_{ij} = \Omega_i W_{ij}, \quad i \in \{1, 2\},\ j \in [1, n] \tag{14}$$
[0171] where $W_{ij}$ is a learnable weight matrix.
[0172] Each output capsule $O_j$ is defined as:
[0173] $$O_j = \sum_{i=1}^{2} C_{ij} A_{ij}$$
[0174] where the coupling coefficient $C_{ij}$ measures the amount of information transmitted between $A_{ij}$ and $O_j$, and is calculated as follows:
[0175] $$C_{ij} = \frac{\exp(B_{ij})}{\sum_{k=1}^{n} \exp(B_{ik})}$$
[0176] where $B_{ij}$ measures the degree of coupling between the input capsule $\Omega_i$ and the output capsule $O_j$; it is initialized to 0 and updated as follows:
[0177] $$B_{ij} \leftarrow B_{ij} + O_j \cdot A_{ij}$$
[0178] Subsequently, a nonlinear compression function is applied to map the length of the output capsule to a range between 0 and 1 to characterize the probability:
[0179] $$O_j \leftarrow \frac{\|O_j\|^2}{1 + \|O_j\|^2} \cdot \frac{O_j}{\|O_j\|}$$
[0180] where $\|\cdot\|$ denotes the modulus and $\|\cdot\|^2$ denotes the square of the modulus;
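The capsule aggregation described above can be sketched with a small routing loop. Because the original formulas are not reproduced here, the sketch assumes the standard routing-by-agreement scheme: coupling coefficients are a Softmax over the logits B_ij, each output capsule is the coupling-weighted sum of the vote vectors A_ij followed by the compression ("squash") function, and B_ij is increased by the dot product between a vote and its output capsule. The number of iterations and the vote values are hypothetical.

```python
import math

def squash(s):
    # Map the vector length into [0, 1) while keeping its direction.
    norm_sq = sum(v * v for v in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * v for v in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(votes, iterations=3):
    # votes[i][j] is the vote vector A_ij from input capsule i
    # to output capsule j.
    num_in, num_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    B = [[0.0] * num_out for _ in range(num_in)]   # coupling logits, start at 0
    outputs = []
    for _ in range(iterations):
        C = [softmax(row) for row in B]            # coupling coefficients
        outputs = []
        for j in range(num_out):
            s = [sum(C[i][j] * votes[i][j][d] for i in range(num_in))
                 for d in range(dim)]
            outputs.append(squash(s))
        for i in range(num_in):                    # agreement update
            for j in range(num_out):
                B[i][j] += sum(a * o for a, o in zip(votes[i][j], outputs[j]))
    return outputs

# Two input capsules (image style, text style), two output capsules.
votes = [[[1.0, 0.0], [0.0, 2.0]],
         [[1.0, 0.0], [0.0, -2.0]]]
outputs = aggregate(votes)
aggregated_style = [v for capsule in outputs for v in capsule]  # concatenation
```

In this toy input the two capsules agree on the first output capsule and disagree on the second, so the first output keeps a non-zero length below 1 while the second cancels to zero.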
[0181] Finally, the n output capsules corresponding to style s1 are concatenated to form the aggregated cross-media style representation for style s1, which is then concatenated with the fact representation and fed into the LSTM-based descriptive text generator to produce stylized descriptive text with style s1.
[0182] The above describes the process of generating a stylized description with style s1. Similarly, for style s2, the n output capsules corresponding to style s2 are concatenated to form the aggregated cross-media style representation for style s2, which is then concatenated with the fact representation and fed into the descriptive text generator to produce stylized descriptive text with style s2.
[0183] Step 5: During the training of stylized image description generation, the image and text training data of the target styles are first collected. The images in each sampled batch are input into the encoder of the image variational autoencoder obtained in Step 2, and the corresponding texts are input as training labels into the descriptive text generator. Greedy decoding is performed to generate image descriptions with style s1 and style s2, respectively. The descriptive text generator is trained by computing the cross-entropy loss. Sampling and training are repeated until the entire dataset has been used.
[0184] Step 6: In the testing phase, after inputting the target image into the trained stylized image description generation model, the target image-specific style representation and factual representation are first obtained through the image variational autoencoder trained in S2. Then, the target image-specific style representation is input into the capsule network aggregation module and aggregated with the general style representation obtained during training to form an aggregated cross-media style representation. After being connected with the target image-specific factual representation, it is input into the descriptive text generator trained in S5 to perform a beam search of size 5, thereby generating image description text of the corresponding style.
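The difference between the greedy decoding used in training and the size-5 beam search used at test time can be illustrated with a toy decoder. The `toy_next_log_probs` function below stands in for the LSTM generator and is entirely hypothetical; the search itself is a plain beam search without length normalization, which may differ from the actual implementation.

```python
import math

def beam_search(next_log_probs, beam_size=5, max_len=10, eos="<eos>"):
    # Keep the beam_size best partial captions at every step.
    beams, finished = [([], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

def toy_next_log_probs(prefix):
    # Hypothetical generator: the locally best first word ("a") leads to
    # a worse overall caption than the second-best first word ("the").
    if not prefix:
        return {"a": math.log(0.6), "the": math.log(0.4)}
    if len(prefix) == 1:
        if prefix[0] == "a":
            return {"dog": math.log(0.5), "cat": math.log(0.5)}
        return {"dog": math.log(0.9), "cat": math.log(0.1)}
    return {"<eos>": 0.0}

beam_caption = beam_search(toy_next_log_probs, beam_size=5)
greedy_caption = beam_search(toy_next_log_probs, beam_size=1)  # greedy decoding
```

With a beam of size 1 the search commits to the locally best first word, whereas the wider beam recovers the caption with the higher overall probability, which is the usual motivation for using beam search at inference.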
[0185] The above stylized image description generation method based on cross-media disentangled representation learning can be expressed in pseudocode as follows:
[0186] I. Download the stylized image description generation datasets FlickrStyle10K and SentiCap.
[0187] II. Training of the disentanglement module. The pseudocode for the training process is briefly described below:
[0188] Algorithm 1. Training of the disentanglement module.
[0189] Input: The training dataset of size N, in which $x_i$ denotes the i-th image and each image is paired with its description text for style s1, its description text for style s2, and its factual description text; the factual description text is used to construct the bag-of-words distribution of the factual description.
[0190] Output: The trained DSIC model.
[0191] BEGIN
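[0192] // Learn disentangled representations on images (D-Images)
[0193] 1. FOR each mini-batch of data DO
[0194] 2. Minimize the style-filter discriminator loss of Equation (3) to optimize the corresponding discriminator parameters of D-Images
[0195] 3. Minimize the fact-filter discriminator loss of Equation (4) to optimize the corresponding discriminator parameters of D-Images
[0196] 4. Minimize the total loss to optimize the VAE parameters of D-Images and the two classifier parameters
[0197] 5. DONE
[0198] // Learn disentangled representations on descriptive text (D-Captions)
[0199] 6. FOR each mini-batch of data DO
[0200] 7. Minimize the style-filter discriminator loss of Equation (3) to optimize the corresponding discriminator parameters of D-Captions
[0201] 8. Minimize the fact-filter discriminator loss of Equation (4) to optimize the corresponding discriminator parameters of D-Captions
[0202] 9. Minimize the total loss to optimize the VAE parameters of D-Captions and the two classifier parameters
[0203] 10. DONE
[0204] // Using the reparameterization technique, sample from the latent space of the stylized descriptions and average to obtain the fixed, general style representations for style s1 and style s2
[0205] 11. FOR each training sample
[0206] 12. DO
[0207] 13. Sample the style representation of the s1 description // Reparameterization sampling
[0208] 14. Sample the style representation of the s2 description // Reparameterization sampling
[0209] 15. DONE
[0210] 16. Compute the general style representation for s1 // Average over the entire training set
[0211] 17. Compute the general style representation for s2 // Average over the entire training set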
[0212] END
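[0213] III. Training the capsule network aggregation module. The pseudocode is briefly described below:
[0214] Algorithm 2. Capsule network aggregation module.
[0215] 1. FOR each mini-batch of data
[0216] 2. DO
[0217] 3. Sample the image-specific style and fact representations from D-Images // Reparameterization sampling
[0218] 4. Aggregate the image-specific and general style representations for s1 // Cross-media style information aggregation
[0219] 5. Aggregate the image-specific and general style representations for s2 // Cross-media style information aggregation
[0220] 6. Concatenate the aggregated s1 style representation with the fact representation
[0221] 7. Concatenate the aggregated s2 style representation with the fact representation
[0222] 8. Input the s1 feature representation into the generator and perform greedy decoding
[0223] 9. Input the s2 feature representation into the generator and perform greedy decoding
[0224] 10. Optimize the description generator using the cross-entropy loss function
[0225] 11. DONE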
[0226] END
[0227] IV. After the disentanglement module, the aggregation module, and the description generator have been trained, the DSIC model is formed and can be used for testing and inference.
[0228] The pseudocode for the inference process is briefly described below:
[0229] Algorithm 3. Model testing process.
[0230] Input: Any target image x.
[0231] Output: Descriptions of the image with style s1 and style s2.
[0232] BEGIN
[0233] 1. Sample the image-specific style and fact representations from D-Images // Reparameterization sampling
[0234] 2. Aggregate the image-specific and general style representations for s1 // Cross-media style information aggregation
[0235] 3. Aggregate the image-specific and general style representations for s2 // Cross-media style information aggregation
[0236] 4. Concatenate the aggregated s1 style representation with the fact representation
[0237] 5. Concatenate the aggregated s2 style representation with the fact representation
[0238] 6. Input the s1 feature representation into the generator and perform a beam search of size 5
[0239] 7. Input the s2 feature representation into the generator and perform a beam search of size 5
[0240] END
[0241] This embodiment compares the image description style accuracy measurement results of the stylized image description generation method of the present invention (denoted as DSIC) and other image description generation models, as shown in Table 1 below:
[0242] Table 1 shows the generation results of the Image Description Style Precision Measurement (ICSA) (the best results on the dataset are shown in bold).
[0243]
[0244] CNN+LSTM: An image description generation model based on the classic encoder-decoder architecture, with a simple structure and good performance and robustness. This model uses a CNN to encode the image, and then uses an LSTM and a fixed style vector as an indicator to generate image descriptions of the target style.
[0245] StyleNet: This model proposes a model component called Factored LSTM. Using this component, the model can automatically distill style factors from a monolingual corpus. This model can explicitly control style factors in the descriptive text generation process, thereby producing more attractive visual descriptive text with the target style. This paper reproduces this work using open-source code.
[0246] Style-Factual LSTM (SF-LSTM): This model proposes an adaptive learning method based on a reference fact model, which can provide factual knowledge to the model when it learns from stylized image descriptions and can adaptively calculate how much information to provide at each time step. This paper reproduces this work using the source code provided by the authors. Seq-SQuAD->QuAC: Where Seq represents a sequential learning approach, indicating that the model first learns on the SQuAD task, and after learning, it then learns on the QuAC task.
[0247] D-Images and D-Captions: These are the ablation structures of the model proposed in this invention. They respectively employ only D-Images combined with an LSTM decoder or D-Captions modules combined with a CNN encoder structure.
[0248] Tables 2 and 3 compare the present invention with the comparison techniques and models on the SentiCap and FlickrStyle10K datasets.
[0249] Table 2 shows the stylized image description generation results on the SentiCap dataset (the best results on the dataset are shown in bold).
[0250]
[0251] Table 3 shows the stylized image description generation results on the FlickrStyle10K dataset (the best results on the dataset are shown in bold).
[0252]
[0253] The experimental results above demonstrate that this invention can successfully separate style information and factual information, revealing that style information exists both in human experience and in the image itself. Experimental results on stylized image description generation show that the disentangled representation learning of this invention benefits the interpretability and controllability of the model and improves the performance of stylized image description generation tasks.
[0254] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the invention. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, all technical solutions obtained through equivalent substitution or transformation fall within the protection scope of the present invention.
Claims
1. A method for generating stylized image descriptions based on cross-media disentangled representation learning, characterized in that the steps are as follows:
S1: Obtain training data for the images and their stylized image description texts;
S2: Iteratively train the image variational autoencoder network model with a style filter and a fact filter using the training data, use disentangled representation learning to separate factual information and stylistic information in the hidden layer space during each model update, and finally obtain the optimized image variational autoencoder network model parameters;
S3: Iteratively train the text variational autoencoder network model with a style filter and a fact filter using the training data; during each model update, the disentanglement module separates the factual information and stylistic information in the hidden layer space through disentangled representation learning, and the optimized text variational autoencoder network model parameters are finally obtained;
S4: After updating the model parameters in S2 and S3, fix the model parameters of the image variational autoencoder and the text variational autoencoder; then use the reparameterization technique to sample and average the stylized descriptions output by the text variational autoencoder to obtain a fixed general style representation; construct a stylized image description generation model in which the image variational autoencoder first generates the image's own style representation and fact representation from the input image, the capsule network aggregation module then aggregates the general style representation and the image's own style representation to obtain an aggregated cross-media style representation, and this is finally concatenated with the fact representation to form the final feature representation, which is input into the descriptive text generator to generate stylized image description text;
S5: Based on the training data, iteratively train the descriptive text generator in the stylized image description generation model and update the network parameters of the descriptive text generator;
S6: After completing the training of the descriptive text generator in S5, input the target image into the stylized image description generation model, and the descriptive text generator finally outputs the stylized image description text;
wherein each round of training of the image variational autoencoder network and the text variational autoencoder network proceeds as follows:
1) First, train the discriminators in the style filter and the fact filter to detect whether other spaces contain information belonging to the current space; the discriminator in the style filter predicts the bag-of-words distribution of stylized descriptions from the fact space $z_f$, while the discriminator in the fact filter predicts the bag-of-words distribution of factual descriptions from the entire style space $z_s$, where s1 and s2 are the two styles; the training losses of the discriminators in the style filter and the fact filter, $\mathcal{L}_{dis(s)}$ and $\mathcal{L}_{dis(f)}$, are respectively:
$$\mathcal{L}_{dis(s)}(\theta_{dis(s)}) = -\sum_{w \in V_s} t_s(w) \log p(w \mid z_f; \theta_{dis(s)})$$
where $V_s$ is the style vocabulary with factual words and stop words removed, w is a word in the vocabulary, $t_s(w)$ is the true bag-of-words (BoW) distribution of the stylized description, $p(w \mid z_f; \theta_{dis(s)})$ is the BoW distribution predicted from the fact space by the discriminator in the style filter under the parameters $\theta_{dis(s)}$, and $\theta_{dis(s)}$ denotes the discriminator parameters in the style filter;
$$\mathcal{L}_{dis(f)}(\theta_{dis(f)}) = -\sum_{w \in V_f} t_f(w) \log p(w \mid z_s; \theta_{dis(f)})$$
where $V_f$ is the factual vocabulary with style words and stop words removed, w is a word in the vocabulary, $t_f(w)$ is the true BoW distribution of the factual description, $p(w \mid z_s; \theta_{dis(f)})$ is the BoW distribution predicted from the style space by the discriminator in the fact filter under the parameters $\theta_{dis(f)}$, and $\theta_{dis(f)}$ denotes the discriminator parameters in the fact filter;
2) Fix the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ of the trained discriminators in the style filter and the fact filter; then, by minimizing the total loss function, train the variational autoencoder network parameters $\theta_{vae}$, the classifier parameters $\theta_{cls(s)}$ in the style filter, and the classifier parameters $\theta_{cls(f)}$ in the fact filter, where the total loss function $\mathcal{L}_{total}$ has the form:
$$\mathcal{L}_{total} = \mathcal{L}_{vae} + \lambda_{cls(s)} \mathcal{L}_{cls(s)} + \lambda_{cls(f)} \mathcal{L}_{cls(f)} + \lambda_{adv(s)} \mathcal{L}_{adv(s)} + \lambda_{adv(f)} \mathcal{L}_{adv(f)}$$
where $\mathcal{L}_{vae}$ is the original loss of the variational autoencoder network, and $\lambda_{cls(s)}$, $\lambda_{cls(f)}$, $\lambda_{adv(s)}$, and $\lambda_{adv(f)}$ are four loss weights; $\mathcal{L}_{cls(s)}$ and $\mathcal{L}_{cls(f)}$ are the classifier losses in the style filter and the fact filter, respectively:
$$\mathcal{L}_{cls(s)}(\theta_{cls(s)}) = -\sum_{s \in labels} t(s) \log l(s; \theta_{cls(s)})$$
$$\mathcal{L}_{cls(f)}(\theta_{cls(f)}) = -\sum_{w \in V_f} t_f(w) \log p(w \mid z_f; \theta_{cls(f)})$$
where s denotes a style and labels denotes the set of styles [s1, s2]; t(·) is the one-hot vector of the true style distribution of the sample; l(·) is the distribution vector of the predicted style for the sample; $l(s; \theta_{cls(s)})$ is the predicted style distribution output by the classifier under the parameters $\theta_{cls(s)}$; $p(w \mid z_f; \theta_{cls(f)})$ is the predicted BoW distribution of factual descriptions output by the classifier under the parameters $\theta_{cls(f)}$; $\mathcal{L}_{adv(s)}$ and $\mathcal{L}_{adv(f)}$ are the two adversarial losses, for the style filter and the fact filter respectively:
$$\mathcal{L}_{adv(s)} = \sum_{w \in V_s} p(w \mid z_f) \log p(w \mid z_f)$$
$$\mathcal{L}_{adv(f)} = \sum_{w \in V_f} p(w \mid z_s) \log p(w \mid z_s)$$
where $p(w \mid z_f)$ and $p(w \mid z_s)$ denote the discriminator outputs $p(w \mid z_f; \theta_{dis(s)})$ and $p(w \mid z_s; \theta_{dis(f)})$ after the parameters $\theta_{dis(s)}$ and $\theta_{dis(f)}$ have been fixed.
2. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 1, characterized in that the specific implementation steps of S2 are as follows: S21: Select several batches of data from the training data; S22: Apply a style filter and a fact filter to the intermediate hidden layer space of the variational autoencoder network to form an image variational autoencoder network with a style filter and a fact filter; the style filter and the fact filter are respectively responsible for filtering style information and fact information in the hidden layer space; each filter consists of a classifier and a discriminator, the classifier being associated with an auxiliary classification loss function and the discriminator with an auxiliary discriminant loss function; S23: Feed the data sampled in S21, batch by batch, into the image variational autoencoder network with the style filter and the fact filter, and train it iteratively.
3. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 1, characterized in that the specific implementation steps of S3 are as follows: S31: Select several batches of data from the training data; S32: Apply a style filter and a fact filter to the intermediate hidden layer space of the variational autoencoder network to form a text variational autoencoder network with a style filter and a fact filter; the style filter and the fact filter are respectively responsible for filtering style information and fact information in the hidden layer space; each filter consists of a classifier and a discriminator, the classifier being associated with an auxiliary classification loss function and the discriminator with an auxiliary discriminant loss function; S33: Feed the data sampled in S31, batch by batch, into the text variational autoencoder network with the style filter and the fact filter, and train it iteratively.
4. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 2 or 3, characterized in that the learnable parameters in both the classifier and the discriminator include the Softmax weights and biases.
5. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 1, characterized in that the specific implementation steps of S4 are as follows:
S41: After completing the model parameter updates in S2 and S3, fix the model parameters of the image variational autoencoder and the text variational autoencoder; for the text variational autoencoder module, use the reparameterization technique to sample from the posterior distribution of the stylized image description text and take the average to obtain a fixed general style representation, one for style s1 and one for style s2;
S42: For the image variational autoencoder module, use the same reparameterization technique as in S41 to sample the style representations and the fact representation specific to each image;
S43: Input the style and fact representations of the images and texts obtained in S41 and S42 into the capsule network aggregation module for aggregation; the capsule network aggregation includes input capsules and output capsules; the two input capsules are $\Omega_1$ and $\Omega_2$, in which U and V are learnable matrix parameters; there are n output capsules, each representing one of the n parts of the aggregated cross-media representation; each input capsule $\Omega_i$ has n vote vectors $\{A_{i1}, A_{i2}, \ldots, A_{in}\}$ to represent the contribution of the style information extracted from the image or the descriptive text to the output capsules, the j-th vote vector being:
$$A_{ij} = \Omega_i W_{ij}, \quad i \in \{1, 2\},\ j \in [1, n]$$
where $W_{ij}$ is a learnable weight matrix; each output capsule $O_j$ is defined as:
$$O_j = \sum_{i=1}^{2} C_{ij} A_{ij}$$
where the coupling coefficient $C_{ij}$ measures the amount of information transmitted between $A_{ij}$ and $O_j$, and is calculated as follows:
$$C_{ij} = \frac{\exp(B_{ij})}{\sum_{k=1}^{n} \exp(B_{ik})}$$
where $B_{ij}$ measures the degree of coupling between the input capsule $\Omega_i$ and the output capsule $O_j$; it is initialized to 0 and updated by the following formula:
$$B_{ij} \leftarrow B_{ij} + O_j \cdot A_{ij}$$
subsequently, a nonlinear compression function is applied to map the length of the output capsule to a range between 0 and 1 to characterize the probability:
$$O_j \leftarrow \frac{\|O_j\|^2}{1 + \|O_j\|^2} \cdot \frac{O_j}{\|O_j\|}$$
where $\|\cdot\|$ denotes the modulus and $\|\cdot\|^2$ denotes the square of the modulus; finally, the n output capsules corresponding to a style s are concatenated to form the aggregated cross-media style representation corresponding to style s, which is then concatenated with the fact representation to form the final feature representation; the final feature representation corresponding to style s is input into the descriptive text generator to generate stylized descriptive text with style s.
6. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 5, characterized in that the descriptive text generator is implemented using an LSTM network.
7. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 5, characterized in that, in the descriptive text generator, greedy decoding is performed on the input final feature representation to generate stylized image description text with the corresponding style.
8. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 1, characterized in that, in step S5, the loss function for training the stylized image description generation model is the cross-entropy loss.
9. The stylized image description generation method based on cross-media disentangled representation learning as described in claim 1, characterized in that, in step S6, after the target image is input into the stylized image description generation model, the target image-specific style representation and fact representation are first obtained through the image variational autoencoder trained in step S2. Then, the target image-specific style representation is input into the capsule network aggregation module and aggregated with the general style representation obtained during training to form an aggregated cross-media style representation. After being concatenated with the target image-specific fact representation, it is input into the descriptive text generator trained in step S5 to perform beam search and generate image description text of the corresponding style.