Image generation method and apparatus based on adaptive text learning
By using an adaptive text learning method, and leveraging the similarity and domain regularization loss function during the training process of the mapper and generator, high-quality and diverse target domain images are generated, solving the problems of low image generation quality and mode collapse in existing technologies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2023-03-30
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer vision technology, and in particular to an image generation method and apparatus based on adaptive text learning.
Background Technology
[0002] In recent years, image generation methods using generative adversarial networks (GANs) have developed rapidly. However, this method relies on a large number of sample images and usually requires a long training time for tedious adversarial training.
[0003] To address the aforementioned issues, a zero-shot adaptive image generation method has been proposed. This method requires no sample images, only domain labels and a pre-trained CLIP model (Contrastive Language-Image Pre-training), eliminating the need for cumbersome adversarial training. However, images generated using this method are of low quality and prone to mode collapse.
Summary of the Invention
[0004] In view of the above problems, this disclosure provides an image generation method and apparatus based on adaptive text learning to overcome or at least partially solve the above problems.
[0005] A first aspect of this disclosure provides an image generation method based on adaptive text learning, comprising:
[0006] The latent vectors are input into the target domain generator to obtain the target domain image;
[0007] The target domain image is generated in the direction determined by the adaptive source domain text cue vector corresponding to the latent vector. The training process of the target domain generator uses the source domain cue text vector adaptively generated by the mapper based on the sample vector.
[0008] Optionally, the training process of the mapper includes the following steps:
[0009] Multiple sample vectors are input into the initial mapper to obtain the source domain cue text vector for each sample vector;
[0010] Encode the multiple source domain prompt text vectors to obtain multiple first vector codes;
[0011] The multiple sample vectors are input into the source domain generator to obtain the source domain image corresponding to each sample vector.
[0012] Encode the source domain images corresponding to the multiple sample vectors to obtain multiple second vector codes;
[0013] The initial mapper is trained based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper.
[0014] Optionally, training the initial mapper based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper includes:
[0015] Calculate the similarity between each of the first vector codes and each of the second vector codes;
[0016] The initial mapper is trained with the goal of maximizing the similarity between the first vector encoding and the second vector encoding corresponding to the same sample vector, and minimizing the similarity between the first vector encoding and the second vector encoding corresponding to different sample vectors, to obtain the trained mapper.
[0017] Optionally, the source domain cue text vector includes a source domain label; after obtaining the source domain cue text vector for each of the sample vectors, the method further includes:
[0018] Replace the source domain label of the source domain hint text vector of each sample vector with the target domain label to obtain the target domain hint text vector of each sample vector;
[0019] Encode multiple target domain prompt text vectors to obtain multiple third vector codes;
[0020] Obtain the encoding of the target domain label;
[0021] Calculate the similarity between each of the third vector codes and the codes of the target domain label;
[0022] The initial mapper is trained with the goal of minimizing the similarity between the encoding of each third vector and the encoding of the target domain label, resulting in a trained mapper.
[0023] Optionally, the training process of the target domain generator includes the following steps:
[0024] Input the sample vector into the initial target domain generator to obtain the target domain sample image corresponding to the sample vector;
[0025] The sample vector is input into the source domain generator to obtain the source domain sample image corresponding to the sample vector;
[0026] The first difference is determined based on the target domain sample image and the source domain sample image corresponding to the sample vector;
[0027] The sample vector is input into the mapper to obtain the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector;
[0028] The second difference is determined based on the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector;
[0029] Based on the first difference and the second difference, a directional loss function is established;
[0030] The initial target domain generator is trained according to the directional loss function to obtain the trained target domain generator.
[0031] Optionally, the step of inputting the latent vector into the target domain generator to obtain the target domain image includes:
[0032] Acquire images;
[0033] Obtain the image latent vector corresponding to the image;
[0034] The latent vector of the image is input into the target domain generator to obtain the target domain image.
[0035] A second aspect of this disclosure provides an image generation apparatus based on adaptive text learning, comprising:
[0036] The input module is used to input the latent vectors into the target domain generator to obtain the target domain image;
[0037] The target domain image is generated in the direction determined by the adaptive source domain text cue vector corresponding to the latent vector. The training process of the target domain generator uses the source domain cue text vector adaptively generated by the mapper based on the sample vector.
[0038] A third aspect of this disclosure provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the image generation method based on adaptive text learning as described in the first aspect.
[0039] A fourth aspect of this disclosure provides a computer-readable storage medium that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, enables the electronic device to perform the image generation method based on adaptive text learning as described in the first aspect.
[0040] The embodiments disclosed herein have the following advantages:
[0041] In this embodiment, the latent vector is input into the target domain generator to generate the target domain image. The training process of the target domain generator uses source domain cue text vectors adaptively generated by a mapper based on sample vectors, and the generation direction of the target domain image is determined by the adaptive cue text vector corresponding to the latent vector. Therefore, the target domain image generated from each latent vector can possess features related to that latent vector, solving the technical problem that images of the same target domain have homogeneous patterns, thereby improving the quality of the generated target domain images.
Brief Description of the Drawings
[0042] To illustrate the technical solutions of the embodiments of this disclosure more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The accompanying drawings described below illustrate only some embodiments of this disclosure. Those skilled in the art can obtain other drawings based on these drawings without creative effort.
[0043] Figure 1 is a schematic diagram illustrating the generation of target domain images using related technologies;
[0044] Figure 2 is a schematic diagram illustrating the generation of a target domain image according to an embodiment of this disclosure;
[0045] Figure 3 is a schematic flowchart of training the mapper according to an embodiment of the present disclosure;
[0046] Figure 4 is a schematic flowchart of training the target domain generator according to an embodiment of the present disclosure.
Detailed Implementation
[0047] To make the above-mentioned objectives, features and advantages of this disclosure more apparent and understandable, the disclosure will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0048] The target domain and the source domain can be any two different domains; for example, the source domain could be the anime / painting domain, and the target domain could be the mural domain. Images within the same domain have the same image style.
[0049] Figure 1 illustrates the generation of target domain images using related technologies. It shows three different source domain images, all using the same source domain cue text, "human." When the target domain images are generated, the target domain images corresponding to the three source domain images likewise share the same target domain cue text, "elf." Therefore, when the three target domain images are generated from the same cue text, the direction between the vector codes of each source domain image and its corresponding target domain image is the same as the single direction between the vector codes of the source and target domain cue texts. This places a strong constraint on the target domain generator and easily leads to the three generated target domain images having homogeneous characteristics.
[0050] Figure 2 is a schematic diagram illustrating the generation of target domain images according to an embodiment of this disclosure. Three different source domain images have different source domain prompt texts ("Asian girl", "curly-haired lady", and "man wearing glasses"). Therefore, when the target domain images are generated, the target domain images corresponding to the three source domain images also have different target domain prompt texts ("Asian girl elf", "curly-haired lady elf", and "man wearing glasses elf"). The direction between the vector encoding of each source domain image and the vector encoding of the corresponding target domain image is the same as the direction between the vector encoding of the corresponding source domain prompt text and the vector encoding of the target domain prompt text. However, the directions between the vector encoding of the source domain prompt text and the vector encoding of the target domain prompt text differ from image to image. Therefore, when three target domain images are generated from different target domain prompt texts, a more precise generation direction is determined for each target domain image, giving the target domain generator higher flexibility and avoiding the problem of the three generated target domain images having homogeneous features and being prone to mode collapse.
[0051] An image generation method based on adaptive text learning in this embodiment of the present disclosure may specifically include: inputting a latent vector into a target domain generator to obtain a target domain image. The generation direction of the target domain image is determined by the adaptive source domain text cue vector corresponding to the latent vector; the training process of the target domain generator uses a source domain cue text vector adaptively generated by a mapper based on sample vectors.
[0052] The latent vector can be a random vector, a vector obtained by randomly sampling a Gaussian distribution, or an image latent vector corresponding to an image. Specifically, if the latent vector is an image latent vector corresponding to an image, an arbitrary image can be obtained, and a pre-trained inversion model can be used to invert the arbitrary image into an image latent vector, which is then input into the target domain generator.
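The Gaussian-sampling option for obtaining latent vectors can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `sample_latents`, the dimension of 8, and the seed are hypothetical choices, and the inversion of an existing image into a latent vector is not shown because the inversion model is not specified here.

```python
import random

def sample_latents(n, dim, seed=None):
    """Draw n latent vectors whose entries are i.i.d. standard Gaussian samples."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Four latent vectors of (hypothetical) dimension 8; each would be fed to the
# target domain generator to obtain one target domain image.
latents = sample_latents(n=4, dim=8, seed=0)
```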
[0053] The latent vector is input into the target domain generator. The generator determines the generation direction of the target domain image based on the adaptive text cue vector corresponding to the latent vector, thus generating a target domain image related to the input latent vector. The adaptive text cue vector corresponding to the latent vector means that the source domain cue text vector is specific to that latent vector. Because each latent vector corresponds to a different adaptive text cue vector, the target domain generator can determine a more precise generation direction for each target domain image, thereby generating different target domain images.
[0054] How the target domain generator determines the generation direction of the target domain image from the adaptive text prompt vector corresponding to the latent vector follows from the training process of the target domain generator, which is described later. Because the target domain generator is trained using a trained mapper, the training process of the mapper is introduced first.
[0055] Figure 3 is a schematic flowchart of training the mapper according to an embodiment of the present disclosure. The training process of the mapper includes at least the following steps: inputting multiple sample vectors into an initial mapper to obtain a source domain cue text vector for each sample vector; encoding the multiple source domain cue text vectors to obtain multiple first vector codes; inputting the multiple sample vectors into a source domain generator to obtain a source domain image corresponding to each sample vector; encoding the source domain images corresponding to the multiple sample vectors to obtain multiple second vector codes; and training the initial mapper based on the multiple first vector codes and the multiple second vector codes to obtain the trained mapper.
[0056] Multiple sample vectors can be random vectors, vectors obtained by randomly sampling from a Gaussian distribution, or latent vectors corresponding to sample images. The method for obtaining latent vectors of sample images can refer to the method for obtaining latent vectors of images.
[0057] By inputting multiple sample vectors into the initial mapper, a source domain cue text vector can be obtained for each sample vector. The source domain cue text vector of a sample vector includes a set of cue text subvectors corresponding to that sample vector and a source domain label. The source domain label can be the name of the source domain; for example, the source domain label could be "photo domain". A text encoder can be used to encode the source domain cue text vectors of the multiple sample vectors to obtain the first vector encoding corresponding to each sample vector.
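The composition of a cue text vector (a set of subvectors specific to the sample vector, plus a domain label) can be sketched as below. The mapper architecture is not specified in this description, so a set of random linear projections stands in for it purely as an assumption. The names `make_linear_mapper` and `build_prompt`, all dimensions, the placement of the label after the subvectors, and the placeholder label embeddings are likewise hypothetical.

```python
import random

def make_linear_mapper(latent_dim, n_prompt, embed_dim, seed=0):
    """Hypothetical mapper: one random linear projection per cue text subvector."""
    rng = random.Random(seed)
    weights = [
        [[rng.gauss(0.0, 0.02) for _ in range(latent_dim)] for _ in range(embed_dim)]
        for _ in range(n_prompt)
    ]

    def mapper(sample_vector):
        # Each subvector is a linear projection of the sample vector.
        return [
            [sum(w * x for w, x in zip(row, sample_vector)) for row in matrix]
            for matrix in weights
        ]

    return mapper

def build_prompt(prompt_subvectors, label_embedding):
    """Concatenate the subvectors with the domain label embedding (assumed order)."""
    return [list(v) for v in prompt_subvectors] + [list(v) for v in label_embedding]

mapper = make_linear_mapper(latent_dim=8, n_prompt=4, embed_dim=6)
sample_vector = [0.1 * k for k in range(8)]
subvectors = mapper(sample_vector)

# Placeholder one-token label embeddings; real ones would come from a text encoder.
source_label = [[1.0] * 6]
target_label = [[-1.0] * 6]

source_prompt = build_prompt(subvectors, source_label)
# Swapping only the label yields the target domain version of the same prompt.
target_prompt = build_prompt(subvectors, target_label)
```

Under these assumptions, the source and target versions share every subvector and differ only in the trailing label embedding.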
[0058] Multiple sample vectors are input into the source domain generator, which generates a source domain image corresponding to each sample vector. The source domain generator can be any pre-trained image generator. An image encoder is then used to encode the source domain image corresponding to each sample vector, yielding a second vector encoding for each sample vector.
[0059] The first vector encoding is determined based on the source domain cue text vector generated by the initial mapper, and the second vector encoding is determined based on the source domain image generated by the trained source domain generator. The goal is for the encoding of the source domain cue text vector of a sample vector generated by the initial mapper to be close to the encoding of the corresponding source domain image, and for the encoding of the source domain cue text vector of a sample vector to be far removed from the encodings of the corresponding source domain images of other sample vectors. Therefore, a contrastive loss function can be established based on multiple first vector encodings and multiple second vector encodings. The initial mapper can then be trained using this contrastive loss function to obtain a trained mapper.
[0060] In some implementations, the similarity between each first vector code and each second vector code can be calculated; the initial mapper can be trained with the aim of maximizing the similarity between the first vector code and the second vector code corresponding to the same sample vector and minimizing the similarity between the first vector code and the second vector code corresponding to different sample vectors, to obtain a trained mapper.
[0061] The calculation of the similarity between each first vector code and each second vector code can be achieved by first performing L2 normalization on each first vector code and each second vector code, and then calculating the cosine similarity between each first vector code after L2 normalization and each second vector code after L2 normalization.
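The similarity computation and the training objective described above can be sketched as follows. The L2 normalization followed by cosine similarity follows the text; the specific form of the contrastive loss (a softmax cross-entropy over each row of the similarity matrix, with a temperature of 0.07) is an assumption, since the text only states what should be maximized and minimized. All names are hypothetical, and the encodings are assumed to be non-zero.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(first_codes, second_codes):
    """Cosine similarity between every first (text) and second (image) encoding."""
    t = [l2_normalize(v) for v in first_codes]
    im = [l2_normalize(v) for v in second_codes]
    return [[sum(a * b for a, b in zip(ti, ij)) for ij in im] for ti in t]

def contrastive_loss(sim, temperature=0.07):
    """Assumed form: row-wise cross-entropy with matching pairs on the diagonal."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum - logits[i]
    return total / len(sim)

first_codes = [[1.0, 0.0], [0.0, 1.0]]   # encodings of two cue text vectors
second_codes = [[2.0, 0.0], [0.0, 3.0]]  # encodings of the matching source images
sim = similarity_matrix(first_codes, second_codes)
aligned = contrastive_loss(sim)
# Pairing each text encoding with the other sample's image gives a larger loss.
swapped = contrastive_loss(similarity_matrix(first_codes, second_codes[::-1]))
```

Lowering this loss raises the diagonal entries (same sample vector) relative to the off-diagonal entries (different sample vectors), which matches the stated objective.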
[0062] In this way, the prompt text vector generated by the trained mapper can retain the image features of the source domain image.
[0063] Aside from the domain label, the target domain may lack other prior knowledge, allowing the sharing of source domain cue text vectors between the source and target domain images. However, because source domain cue text vectors represent features closely related to the source domain, conflicts may exist between them and the target domain. For example, the cue text for a "human" domain image might be "round ears," while "elf" domain images all use "pointed ears," excluding "round ears." Sharing the cue text "round ears" from the "human" domain image with the "elf" domain image could lead to problems in the generated target domain image. Therefore, domain regularization loss can be used to ensure that the generated source domain cue text vectors are applicable to the target domain.
[0064] Based on the above technical solution, after obtaining the source domain cue text vector for each sample vector, the training of the mapper may further include the following steps: replacing the source domain label of the source domain cue text vector of each sample vector with the target domain label to obtain the target domain cue text vector of each sample vector; encoding multiple target domain cue text vectors to obtain multiple third vector codes; obtaining the code of the target domain label; calculating the similarity between each third vector code and the code of the target domain label; establishing a domain regularization loss function, and training the initial mapper according to the domain regularization loss function to obtain the trained mapper. Specifically, the initial mapper is trained with the goal of minimizing the similarity between each third vector code and the code of the target domain label to obtain the trained mapper.
[0065] By directly replacing the source domain label of the source domain prompt text vector of a sample vector with the target domain label, the target domain prompt text vector for that sample vector is obtained. The target domain label can be the name of the target domain, for example, "elf domain". Encoding the target domain prompt text vector of each sample vector using a text encoder yields the third vector encoding. Encoding the target domain label using a text encoder yields the target domain label encoding.
[0066] The cosine similarity between the third vector encoding and the target domain label encoding can be calculated. This cosine similarity represents the distance between the corresponding target domain cue text vector and the target domain. While training the initial mapper with the objectives of maximizing the similarity between the first and second vector encodings corresponding to the same sample vector, and minimizing the similarity between the first and second vector encodings corresponding to different sample vectors, the initial mapper can also be trained with the objective of minimizing the cosine similarity between the third vector encoding and the target domain label encoding, resulting in a trained mapper.
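The cosine term between each third vector encoding and the target domain label encoding can be sketched as follows. The text describes this quantity as a distance between the cue text vector and the target domain, so the sketch also returns, as one assumed way of forming a single penalty, the mean cosine distance 1 − cos. How the term is scaled and combined with the other mapper objective is not specified here, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_similarities(third_codes, label_code):
    """Cosine similarity of each third vector encoding to the label encoding."""
    return [cosine(c, label_code) for c in third_codes]

def mean_cosine_distance(third_codes, label_code):
    """Assumed penalty: average of (1 - cosine similarity) over the batch."""
    sims = label_similarities(third_codes, label_code)
    return sum(1.0 - s for s in sims) / len(sims)

third_codes = [[1.0, 0.0], [0.0, 1.0]]  # encodings of two target domain cue texts
label_code = [1.0, 0.0]                 # encoding of the target domain label
sims = label_similarities(third_codes, label_code)
penalty = mean_cosine_distance(third_codes, label_code)
```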
[0067] Thus, the prompt text vectors generated by the trained mapper preserve the image features of the source domain image while restricting the preserved features to those applicable to the target domain. For example, if the target domain is the "elf" domain, then a prompt such as "round ears" should not be included in the ideal prompt text vectors.
[0068] By utilizing a trained mapper, the target domain generator can be trained to produce more accurate and diverse generation directions for cross-domain image pairs. During the training phase of the target domain generator, the adaptive cue text vector generated by the mapper replaces the manually designed fixed cue vectors in related techniques, and the final text supervision information includes the shared learned cue vector and the respective embeddings of the two domain labels.
[0069] Figure 4 is a schematic flowchart illustrating the training process of the target domain generator according to an embodiment of the present disclosure. The training process of the target domain generator includes the following steps: inputting a sample vector into an initial target domain generator to obtain a target domain sample image corresponding to the sample vector; inputting the sample vector into a source domain generator to obtain a source domain sample image corresponding to the sample vector; determining a first difference based on the target domain sample image and the source domain sample image corresponding to the sample vector; inputting the sample vector into the mapper to obtain a target domain prompt text vector and a source domain prompt text vector corresponding to the sample vector; determining a second difference based on the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector; establishing a directional loss function based on the first difference and the second difference; and training the initial target domain generator based on the directional loss function to obtain a trained target domain generator.
[0070] The sample vector can be a random vector, a vector obtained by randomly sampling a Gaussian distribution, or a latent vector corresponding to a sample image. It can be the same as or different from the sample vectors used to train the mapper.
[0071] The initial target domain generator generates target domain sample images corresponding to sample vectors. The source domain generator is a pre-trained image generator; inputting sample vectors into the source domain generator yields source domain sample images corresponding to the sample vectors. An image encoder encodes both the target domain sample image and the source domain sample image, and then performs L2 normalization on both encodings to obtain processed target domain sample image encodings and processed source domain sample image encodings. The difference between the processed target domain sample image encoding and the processed source domain sample image encoding is defined as the first difference.
[0072] Inputting the sample vector into the trained mapper yields the corresponding source domain cue text vector, which includes the source domain label. Replacing the source domain label with the target domain label in the source domain cue text vector yields the corresponding target domain cue text vector. Both the source and target domain cue text vectors are encoded using a text encoder, and then L2 normalization is applied to both encodings to obtain the processed target domain cue text vector encoding and the processed source domain cue text vector encoding. The difference between the processed target domain cue text vector encoding and the processed source domain cue text vector encoding is defined as the second difference.
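The first and second differences can be sketched as follows: each is the difference between two L2-normalized encodings, target minus source. The encodings below are made-up two-dimensional placeholders standing in for the outputs of an image encoder and a text encoder, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def difference(target_code, source_code):
    """Normalized target encoding minus normalized source encoding."""
    t = l2_normalize(target_code)
    s = l2_normalize(source_code)
    return [a - b for a, b in zip(t, s)]

# Placeholder image encodings of the source and target domain sample images.
image_source, image_target = [3.0, 0.0], [0.0, 2.0]
# Placeholder text encodings of the source and target domain cue text vectors.
text_source, text_target = [1.0, 1.0], [0.0, 5.0]

delta_I = difference(image_target, image_source)  # first difference
delta_T = difference(text_target, text_source)    # second difference
```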
[0073] Based on the first and second differences, the directional loss function $L_{adapt}$ can be established as follows:
[0074] $$L_{adapt} = \mathbb{E}_{w_i \in W}\left[\,1 - \frac{\Delta I_i \cdot \Delta T_i}{|\Delta I_i|\,|\Delta T_i|}\,\right]$$
[0075] where $L_{adapt}$ denotes the directional loss function; $\Delta I_i$ denotes the first difference corresponding to the i-th sample vector; $\Delta T_i$ denotes the second difference corresponding to the i-th sample vector; $i = 1, 2, \ldots, n$, where $n$ is the number of sample vectors; $w_i$ denotes the i-th sample vector; $W$ denotes the set of sample vectors; and $\mathbb{E}$ denotes mathematical expectation.
[0076] Training the initial target domain generator using the directional loss function gradually brings the direction between the vector encodings of the target domain sample images and the source domain sample images generated by the generators closer to the direction between the encodings of the target domain cue text vectors and the source domain cue text vectors. Since the cue text vectors of both domains contain domain labels, the supervision information for training the initial target domain generator includes the embeddings of both domain labels.
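A directional loss of this kind can be sketched as follows, assuming it takes the common form of one minus the cosine similarity between the first and second differences, averaged over the sample vectors. That form is an assumption chosen to match the stated goal of aligning the two directions; the function names are hypothetical and the differences are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_loss(first_differences, second_differences):
    """Mean of (1 - cosine similarity) over paired image and text differences."""
    terms = [1.0 - cosine(d_i, d_t)
             for d_i, d_t in zip(first_differences, second_differences)]
    return sum(terms) / len(terms)

# The loss is 0 when the image direction matches the text direction,
# 1 when the two are orthogonal, and 2 when they point in opposite directions.
same = directional_loss([[1.0, 0.0]], [[2.0, 0.0]])
opposite = directional_loss([[1.0, 0.0]], [[-1.0, 0.0]])
mixed = directional_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because each sample vector has its own second difference, the loss steers each generated image toward its own direction, not toward one direction shared by the whole batch.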
[0077] The source domain cue text vector is generated adaptively by the trained mapper for the specific input sample vector, and the target domain cue text vector is obtained by directly replacing the domain label of the source domain cue text vector. Therefore, during training, the generation direction that the target domain generator determines when generating a target domain image gradually approaches the direction between the encodings of the adaptive source domain cue text vector and the target domain cue text vector corresponding to that sample vector.
[0078] Two vector codes in the encoding space determine a direction. Therefore, a direction can be determined from the encodings of the target domain cue text vector and the source domain cue text vector. When the target domain sample image is generated, given the vector code of the source domain sample image and this direction (that is, the direction between the encodings of the target domain cue text vector and the source domain cue text vector), the target domain sample image can be generated such that the direction between the vector codes of the target domain sample image and the source domain sample image closely approximates the direction between the encodings of the target domain cue text vector and the source domain cue text vector.
[0079] After the target domain generator is trained, it can be used alone to generate target domain images. The generation direction of each generated target domain image is close to the direction between the adaptive source domain prompt text vector and the target domain prompt text vector corresponding to the latent vector.
[0080] The target domain generator in the technical solution of this disclosure embodiment can be an adversarial network-based generator or a diffusion model.
[0081] The target domain images generated using the technical solutions of this disclosure are clearer and more accurate. Because the mapper retains sufficient image features from the source domain image, and these features are compatible with the target domain, the generated prompt text vectors provide the target domain generator with more precise and diverse generation directions. Therefore, the generated target domain images consistently exhibit high diversity, effectively solving the mode collapse problem. Furthermore, 1210 users were shown source domain images, target domain labels, target domain images generated by the technical solutions of this disclosure, and target domain images generated by related technical solutions, and were asked to select the better target domain image, namely the one that is more consistent with the target domain label and better preserves useful source domain information. 80.5% of users preferred the target domain images generated by the technical solutions of this disclosure.
[0082] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this disclosure are not limited to the described order of actions, because according to the embodiments of this disclosure, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this disclosure.
[0083] An image generation apparatus based on adaptive text learning according to an embodiment of this disclosure includes an input module, wherein:
[0084] The input module is used to input the latent vectors into the target domain generator to obtain the target domain image;
[0085] The target domain image is generated in the direction determined by the adaptive source domain text cue vector corresponding to the latent vector. The training process of the target domain generator uses the source domain cue text vector adaptively generated by the mapper based on the sample vector.
[0086] Optionally, the training process of the mapper includes the following steps:
[0087] Multiple sample vectors are input into the initial mapper to obtain the source domain cue text vector for each sample vector;
[0088] Encode the multiple source domain prompt text vectors to obtain multiple first vector codes;
[0089] The multiple sample vectors are input into the source domain generator to obtain the source domain image corresponding to each sample vector.
[0090] Encode the source domain images corresponding to the multiple sample vectors to obtain multiple second vector codes;
[0091] The initial mapper is trained based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper.
[0092] Optionally, training the initial mapper based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper includes:
[0093] Calculate the similarity between each of the first vector codes and each of the second vector codes;
[0094] The initial mapper is trained with the goal of maximizing the similarity between the first vector encoding and the second vector encoding corresponding to the same sample vector, and minimizing the similarity between the first vector encoding and the second vector encoding corresponding to different sample vectors, to obtain the trained mapper.
[0095] Optionally, the source domain cue text vector includes a source domain label; after obtaining the source domain cue text vector for each of the sample vectors, the method further includes:
[0096] Replace the source domain label of the source domain hint text vector of each sample vector with the target domain label to obtain the target domain hint text vector of each sample vector;
[0097] Encode multiple target domain prompt text vectors to obtain multiple third vector codes;
[0098] Obtain the encoding of the target domain label;
[0099] Calculate the similarity between each of the third vector codes and the codes of the target domain label;
[0100] The initial mapper is trained with the goal of minimizing the similarity between the encoding of each third vector and the encoding of the target domain label, resulting in a trained mapper.
[0101] Optionally, the training process of the target domain generator includes the following steps:
[0102] Input the sample vector into the initial target domain generator to obtain the target domain sample image corresponding to the sample vector;
[0103] The sample vector is input into the source domain generator to obtain the source domain sample image corresponding to the sample vector;
[0104] The first difference is determined based on the target domain sample image and the source domain sample image corresponding to the sample vector;
[0105] The sample vector is input into the mapper to obtain the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector;
[0106] The second difference is determined based on the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector;
[0107] Based on the first difference and the second difference, a directional loss function is established;
[0108] The initial target domain generator is trained according to the directional loss function to obtain the trained target domain generator.
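One way to combine the two differences into a directional loss is sketched below. It assumes that both differences are taken between encodings in a shared image-text embedding space and that the loss is one minus their cosine similarity; the disclosure states only that the loss is established from the first and second differences, so the exact form and the function names here are illustrative.

```python
import math

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_loss(target_image_code, source_image_code,
                     target_text_code, source_text_code):
    """Assumed form: 1 - cos(first difference, second difference).

    The loss is 0 when the image-space shift points the same way as the
    text-space shift, and grows as the two directions diverge.
    """
    first_difference = subtract(target_image_code, source_image_code)
    second_difference = subtract(target_text_code, source_text_code)
    return 1.0 - cosine(first_difference, second_difference)
```

Because the second difference is computed per sample vector, each generated image receives its own adaptation direction rather than one direction shared across all samples.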
[0109] Optionally, the input module is specifically used to perform:
[0110] Acquire an image;
[0111] Obtain the image latent vector corresponding to the image;
[0112] Input the image latent vector into the target domain generator to obtain the target domain image.
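The three steps above amount to composing an inversion step with the trained generator. The sketch below assumes both are supplied as callables; `invert` and `target_generator` are hypothetical names, and the disclosure does not specify how the image latent vector is obtained.

```python
from typing import Any, Callable

def generate_from_image(image: Any,
                        invert: Callable[[Any], Any],
                        target_generator: Callable[[Any], Any]) -> Any:
    """Map an acquired image to a target domain image via its latent vector."""
    image_latent = invert(image)           # image -> image latent vector
    return target_generator(image_latent)  # latent vector -> target domain image
```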
[0113] It should be noted that, because the device embodiments are similar to the method embodiments, they are described only briefly. For relevant details, refer to the method embodiments.
[0114] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For parts that are the same or similar, the embodiments may be cross-referenced.
[0115] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this disclosure can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of this disclosure can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0116] This disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices, and computer program products according to embodiments of this disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0117] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0118] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0119] While preferred embodiments of the present disclosure have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the present disclosure.
[0120] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0121] The above provides a detailed description of the image generation method and apparatus based on adaptive text learning provided by this disclosure. Specific examples have been used to illustrate the principles and implementations of this disclosure; the description of the above embodiments is intended only to help in understanding the method and its core ideas. Those skilled in the art may, based on the ideas of this disclosure, vary the specific implementations and the scope of application. Therefore, the content of this specification should not be construed as limiting this disclosure.
Claims
1. An image generation method based on adaptive text learning, characterized in that it comprises: inputting a latent vector into a target domain generator to obtain a target domain image; wherein the target domain image is generated in the direction determined by the adaptive source domain prompt text vector corresponding to the latent vector; the target domain generator is trained using the source domain prompt text vector adaptively generated by a mapper based on a sample vector; and the training process of the mapper includes the following steps: inputting multiple sample vectors into an initial mapper to obtain the source domain prompt text vector for each sample vector; encoding the multiple source domain prompt text vectors to obtain multiple first vector codes; inputting the multiple sample vectors into a source domain generator to obtain the source domain image corresponding to each sample vector; encoding the source domain images corresponding to the multiple sample vectors to obtain multiple second vector codes; and training the initial mapper based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper.
2. The method of claim 1, wherein the step of training the initial mapper based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper includes: calculating the similarity between each of the first vector codes and each of the second vector codes; and training the initial mapper with the goal of maximizing the similarity between the first vector code and the second vector code corresponding to the same sample vector, and minimizing the similarity between the first vector code and the second vector code corresponding to different sample vectors, to obtain the trained mapper.
3. The method of claim 2, wherein the source domain prompt text vector includes a source domain label, and after obtaining the source domain prompt text vector for each of the sample vectors, the method further includes: replacing the source domain label in the source domain prompt text vector of each sample vector with the target domain label to obtain the target domain prompt text vector of each sample vector; encoding the multiple target domain prompt text vectors to obtain multiple third vector codes; obtaining the encoding of the target domain label; calculating the similarity between each of the third vector codes and the encoding of the target domain label; and training the initial mapper with the goal of minimizing the similarity between each third vector code and the encoding of the target domain label, to obtain the trained mapper.
4. The method of claim 1, wherein the training process of the target domain generator includes the following steps: inputting the sample vector into an initial target domain generator to obtain the target domain sample image corresponding to the sample vector; inputting the sample vector into the source domain generator to obtain the source domain sample image corresponding to the sample vector; determining a first difference based on the target domain sample image and the source domain sample image corresponding to the sample vector; inputting the sample vector into the mapper to obtain the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector; determining a second difference based on the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector; establishing a directional loss function based on the first difference and the second difference; and training the initial target domain generator according to the directional loss function to obtain the trained target domain generator.
5. The method of claim 1, wherein the step of inputting the latent vector into the target domain generator to obtain the target domain image includes: acquiring an image; obtaining the image latent vector corresponding to the image; and inputting the image latent vector into the target domain generator to obtain the target domain image.
6. An image generation apparatus based on adaptive text learning, characterized in that it comprises: an input module configured to input a latent vector into a target domain generator to obtain a target domain image; wherein the target domain image is generated in the direction determined by the adaptive source domain prompt text vector corresponding to the latent vector; the target domain generator is trained using the source domain prompt text vector adaptively generated by a mapper based on a sample vector; and the training process of the mapper includes the following steps: inputting multiple sample vectors into an initial mapper to obtain the source domain prompt text vector for each sample vector; encoding the multiple source domain prompt text vectors to obtain multiple first vector codes; inputting the multiple sample vectors into a source domain generator to obtain the source domain image corresponding to each sample vector; encoding the source domain images corresponding to the multiple sample vectors to obtain multiple second vector codes; and training the initial mapper based on the plurality of first vector codes and the plurality of second vector codes to obtain the trained mapper.
7. The apparatus of claim 6, wherein the source domain prompt text vector includes a source domain label, and after obtaining the source domain prompt text vector for each of the sample vectors, the method further includes: replacing the source domain label in the source domain prompt text vector of each sample vector with the target domain label to obtain the target domain prompt text vector of each sample vector; encoding the multiple target domain prompt text vectors to obtain multiple third vector codes; obtaining the encoding of the target domain label; calculating the similarity between each of the third vector codes and the encoding of the target domain label; and training the initial mapper with the goal of minimizing the similarity between each third vector code and the encoding of the target domain label, to obtain the trained mapper.
8. The apparatus of claim 6, wherein the training process of the target domain generator includes the following steps: inputting the sample vector into an initial target domain generator to obtain the target domain sample image corresponding to the sample vector; inputting the sample vector into the source domain generator to obtain the source domain sample image corresponding to the sample vector; determining a first difference based on the target domain sample image and the source domain sample image corresponding to the sample vector; inputting the sample vector into the mapper to obtain the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector; determining a second difference based on the target domain prompt text vector and the source domain prompt text vector corresponding to the sample vector; establishing a directional loss function based on the first difference and the second difference; and training the initial target domain generator according to the directional loss function to obtain the trained target domain generator.