Training method and device of picture-text mutual generation model
By combining a modal self-awareness unit with an image-text encoder and decoder, the invention addresses the resource waste and the insufficient one-way information representation of separate image-to-text and text-to-image models, and obtains an image-text mutual generation model that adapts to either generation direction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT
- Filing Date
- 2024-02-08
- Publication Date
- 2026-06-23
[Figure: abstract drawing of CN118014049B]
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a training method and apparatus for a text-image interaction model.
Background Technology
[0002] With the development of deep learning technology and large model-related technologies, the field of AI (Artificial Intelligence) has evolved from decision-making AI to generative AI. There are various existing content generation methods and applications, but the generation of various types of content (taking the most basic text and images as examples) requires the use of different types of algorithms and the training of corresponding models. The additional number of models will greatly increase the deployment and inference overhead of the models, which is not a reasonable direction.
[0003] Whether for the text-to-image task or the image-to-text task, the related technologies usually employ different model structures trained on different data. The two tasks cannot currently be completed within a single framework, which results in high model training costs. Furthermore, the unidirectional information representation capability of a single-task application is insufficient, leading to poor model training performance and failing to meet the growing demand for text-image interaction tasks.
Summary of the Invention
[0004] This invention provides a training method and apparatus for a text-image interaction model, which addresses the shortcomings of existing technologies that, when using different model structures to perform text-to-image or image-to-text tasks individually, result in significant resource waste and insufficient unidirectional information representation capabilities for individual tasks, leading to poor model training performance and an inability to meet the growing demand for text-image interaction tasks. This invention improves the performance and adaptability of corresponding text-image interaction models.
[0005] This invention provides a training method for a text-image interaction model, comprising:
[0006] Extracting self-awareness information from sample modal data based on a modal self-awareness unit, wherein the self-awareness information includes the modal type and the corresponding modal features, the modal self-awareness unit is obtained through multi-task supervised training based on a self-attention network, and the sample modal data includes one of a sample image and a sample text;
[0007] Encoding the self-awareness information based on an image-text encoder to obtain latent space features, and performing multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type; decoding the self-awareness information and the diffused latent space features of the target modality type based on an image-text decoder to obtain decoded information; wherein the image-text encoder and the image-text decoder are based on a conditional variational autoencoder (cVAE) architecture, and the target modality type is one of the sample image and the sample text;
[0008] Training the image-text encoder and the image-text decoder based on the decoded information and the multi-task loss function to obtain the image-text mutual generation model; wherein the target loss is determined from the reconstruction loss, the loss of the image understanding auxiliary task, and the loss of the text understanding auxiliary task.
[0009] According to the training method of the image-text interaction model provided by the present invention, after obtaining the latent space features, the method further includes:
[0010] CLIP, a pre-trained multimodal model, extracts conditional features from sample modal data for multimodal diffusion to determine new latent space features based on the conditional features and the latent space features.
[0011] According to a training method for a text-image interaction model provided by the present invention, after obtaining the text-image interaction model, the method further includes:
[0012] The input data of the image-text interaction model is generated in multiple ways according to the multi-path seed progressive selection algorithm to obtain multiple output results. The quality score corresponding to each output result is calculated, and the output result with the highest quality score is determined as the final output of the image-text interaction model.
[0013] According to a training method for a text-image interaction model provided by the present invention, the modal features are obtained by the following formula:
[0014] $f_s = \mathrm{SNet}(x, [c_{prompt}, c_{image}])$;
[0015] Among them, $f_s$ is the modal feature, SNet is the modal self-awareness unit, $x$ is the sample modal data, $c_{prompt}$ is the conditional prompt, and $c_{image}$ is the conditional image.
[0016] According to the training method of the image-text interaction model provided by the present invention, the multi-task loss function is expressed by the following formula:
[0017] $Loss_s = Loss_{m\text{-}cls}(x) + w_{img} \cdot Loss_{img\text{-}cls}(x) + w_{txt} \cdot Loss_{txt\text{-}cls}(x)$;
[0018] Among them, $w_{img}$ is the weight corresponding to the sample image and $w_{txt}$ is the weight corresponding to the sample text; $Loss_{m\text{-}cls}$ is the image-text matching loss, used to determine whether the semantics of the image and the text are consistent; $Loss_{img\text{-}cls}$ is the cross-entropy loss of the image classification task; $Loss_{txt\text{-}cls}$ is the cross-entropy loss of the text sentiment classification task; $w_{img}$ can be expressed by the following formula:
[0019]
[0020] $w_{txt}$ can be expressed by the following formula:
[0021]
[0022] The present invention also provides a training device for a text-image interaction model, comprising:
[0023] The feature extraction module is used to extract self-awareness information from sample modal data based on modal self-awareness units. The self-awareness information includes modal type and corresponding modal features. The modal self-awareness units are obtained through multi-task supervised training based on self-attention networks. The sample modal data includes one of sample images and sample text.
[0024] The encoding / decoding module is used to encode the self-awareness information based on the image-text encoder to obtain latent space features, and to perform multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type; and to decode the self-awareness information and the diffused latent space features of the target modality type based on the image-text decoder to obtain decoded information; wherein the image-text encoder and the image-text decoder are based on the conditional variational autoencoder (cVAE) architecture, and the target modality type is one of the sample image and the sample text;
[0025] The training module is used to train the image-text encoder and the image-text decoder based on the decoded information and the multi-task loss function to obtain the image-text mutual generation model; wherein the target loss is determined from the reconstruction loss, the loss of the image understanding auxiliary task, and the loss of the text understanding auxiliary task.
[0026] According to the training apparatus for a text-image interaction model provided by the present invention, the apparatus further includes:
[0027] A quality screening module is used to generate multiple output results from the input data of the image-text interaction model after the image-text interaction model is obtained, according to the multi-path seed progressive optimization algorithm, and calculate the quality score corresponding to each output result. The output result with the highest quality score is determined as the final output of the image-text interaction model.
[0028] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the training method for the image-text interaction model as described above.
[0029] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method for the image-text interaction model as described above.
[0030] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the training method for the image-text interaction model as described above.
[0031] The training method and apparatus for the image-text interaction model provided by this invention extract self-awareness information from sample modal data through a modal self-awareness unit, encode the self-awareness information through an image-text encoder to obtain latent space features, perform multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type, decode the diffused latent space features of the target modality type through an image-text decoder to obtain decoded information, and finally train the image-text encoder and image-text decoder based on the decoded information and a multi-task loss function to obtain the image-text interaction model. By extracting self-awareness information from the modal data and introducing task-related conditional encoders and decoders, this invention improves the performance and adaptability of the image-text interaction model.
Brief Description of the Drawings
[0032] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0033] Figure 1 is the first flowchart of the training method of the image-text interaction model provided by the present invention;
[0034] Figure 2 is the second flowchart of the training method of the image-text interaction model provided by the present invention;
[0035] Figure 3 is a schematic diagram of the structure of the training device for the image-text interaction model provided by the present invention;
[0036] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0037] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0038] The training method and apparatus for the text-image interaction model of the present invention are described below with reference to Figures 1-3.
[0039] Figure 1 is the first flowchart of the training method of the image-text interaction model provided by the present invention. As shown in Figure 1, the training method for the text-image interaction model includes the following steps:
[0040] Step 110: Extract self-sensing information from sample modal data based on modal self-sensing units. The self-sensing information includes modal type and corresponding modal features. The modal self-sensing units are obtained through multi-task supervised training based on self-attention networks. The sample modal data includes either sample images or sample text.
[0041] In this step, the sample modal data includes image data or text data. For example, in the image-to-text matching process, the input sample modal data is an image, and the output is the corresponding text description; in the text-to-image matching process, the input sample modal data is text, and the output is the corresponding image.
[0042] For example, the input to the image-to-text task is an image of a dog, and the output may be the text "a picture of a dog". The input to the text-to-image task is a piece of text, such as "a picture of a cat", and the output is an image that matches the description.
[0043] In this step, the modal self-intuition unit is constructed based on the Transformer structure. The modal self-intuition unit can extract features from images and text in a unified structure to obtain a feature sequence. The input of the modal self-intuition unit is an image or text, and the output is the corresponding self-intuition features.
[0044] In this embodiment, the Transformer architecture used specifically includes the ViT architecture or the BERT architecture.
[0045] In this embodiment, supervising feature training with multiple tasks yields self-awareness features with stronger representational capability. The tasks include, but are not limited to, image classification tasks, text sentiment classification tasks, and text classification tasks. Depending on the specific application, the task set can be further expanded, for example by adding image segmentation tasks and text continuation tasks.
[0046] In this embodiment, the multi-task loss function is expressed by the following formula:
[0047] $Loss_s = Loss_{m\text{-}cls}(x) + w_{img} \cdot Loss_{img\text{-}cls}(x) + w_{txt} \cdot Loss_{txt\text{-}cls}(x)$;
[0048] Among them, $w_{img}$ is the weight corresponding to the sample image and $w_{txt}$ is the weight corresponding to the sample text; $Loss_{m\text{-}cls}$ is the image-text matching loss, used to determine whether the semantics of the image and the text are consistent; $Loss_{img\text{-}cls}$ is the cross-entropy loss of the image classification task; $Loss_{txt\text{-}cls}$ is the cross-entropy loss of the text sentiment classification task; $w_{img}$ can be expressed by the following formula:
[0049]
[0050] $w_{txt}$ can be expressed by the following formula:
[0051]
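The weighted sum in the multi-task loss can be illustrated with a minimal Python sketch. The formulas for the weights w_img and w_txt are not reproduced in this text, so the sketch simply takes them as parameters. The function names, the use of binary cross-entropy for the image-text matching term, and the example logits are all assumptions made for illustration, not details confirmed by the description.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy for one example, computed from unnormalised logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-log_probs[label])

def matching_loss(match_logit, is_match):
    """Image-text matching loss; binary cross-entropy is an assumed form."""
    p = 1.0 / (1.0 + np.exp(-match_logit))
    return float(-(is_match * np.log(p) + (1 - is_match) * np.log(1.0 - p)))

def modal_unit_loss(match_logit, is_match, img_logits, img_label,
                    txt_logits, txt_label, w_img, w_txt):
    """Loss_s = Loss_m-cls + w_img * Loss_img-cls + w_txt * Loss_txt-cls."""
    loss_m_cls = matching_loss(match_logit, is_match)
    loss_img_cls = cross_entropy(img_logits, img_label)
    loss_txt_cls = cross_entropy(txt_logits, txt_label)
    return loss_m_cls + w_img * loss_img_cls + w_txt * loss_txt_cls

# Hypothetical example: uninformative logits and hand-picked weights.
loss = modal_unit_loss(0.0, 1, [0.0, 0.0, 0.0], 2, [0.0, 0.0], 0,
                       w_img=0.5, w_txt=0.5)
```

In an actual training setup the three terms would come from the heads of the modal self-awareness unit, and the weights would follow the formulas of the original specification.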
[0052] In this embodiment, the modal features are obtained using the following formula:
[0053] $f_s = \mathrm{SNet}(x, [c_{prompt}, c_{image}])$;
[0054] Among them, $f_s$ is the modal feature, SNet is the modal self-awareness unit, $x$ is the sample modal data, $c_{prompt}$ is the conditional prompt, and $c_{image}$ is the conditional image.
[0055] In this embodiment, $c_{prompt}$ and $c_{image}$ are used to provide customized input information such as style customization and reference images. It should be noted that these two inputs are empty by default; for example, they are an empty string and a completely black image, respectively. This allows the image-text interaction model not only to perform simple image-text interaction, but also to support more advanced customized tasks. For example, additional prompts can be used to specify the style of the generated image or text, and to specify the particular image or text task (e.g., image editing or image generation; caption prediction or VQA).
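The default "empty" conditional inputs described above can be sketched as follows. This is only an illustration: the helper names, the dictionary layout, and the 224 x 224 RGB image size are hypothetical choices, not values given in the description.

```python
import numpy as np

def default_conditions(height=224, width=224):
    """Return the empty conditions: an empty prompt and an all-black image."""
    c_prompt = ""                                            # empty string
    c_image = np.zeros((height, width, 3), dtype=np.uint8)   # black RGB image
    return c_prompt, c_image

def build_snet_inputs(x, c_prompt=None, c_image=None):
    """Assemble (x, [c_prompt, c_image]), falling back to the empty defaults."""
    d_prompt, d_image = default_conditions()
    return {
        "x": x,
        "c_prompt": d_prompt if c_prompt is None else c_prompt,
        "c_image": d_image if c_image is None else c_image,
    }

plain = build_snet_inputs("a picture of a cat")
styled = build_snet_inputs("a picture of a cat", c_prompt="watercolor style")
```

Passing a non-empty prompt or a reference image switches the same interface from plain image-text generation to a customized task.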
[0056] Step 120: Encode the self-awareness information based on the image-text encoder to obtain latent space features, and perform multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type; decode the self-awareness information and the diffused latent space features of the target modality type based on the image-text decoder to obtain decoded information; wherein the image-text encoder and image-text decoder are based on the conditional variational autoencoder (cVAE) architecture, and the target modality type is one of the sample image and the sample text.
[0057] In this step, the image-text encoder can be a conditional variational autoencoder (cVAE), a variational autoencoder (VAE), or other encoders used for image-to-text and text-to-image generation; the corresponding image-text decoder is the decoder corresponding to the aforementioned cVAE, VAE, or other image-text co-generation encoder.
[0058] In this embodiment, the image-text encoder encodes the input (text or image) into a unified latent space, in which the subsequent diffusion model can conveniently perform image-to-text or text-to-image diffusion generation.
[0059] Specifically, the input of the image-text encoder is text or an image, and the output is the reconstructed features corresponding to the input. The encoder also takes in the self-awareness features from the previous step. For example, if the input is text, the self-awareness features of the image corresponding to that text are used as the condition. Using multimodal information in this way makes the reconstruction process more natural and improves the result.
[0060] In this embodiment, the image-text encoder and image-text decoder can be represented by the following formulas:
[0061] $z_y = \mathrm{VAE}_{encoder}(f_s, y)$
[0062] $y' = \mathrm{VAE}_{decoder}(f_s, z_y)$
[0063] Among them, $\mathrm{VAE}_{encoder}$ and $\mathrm{VAE}_{decoder}$ represent the encoder and decoder parts of the VAE architecture, respectively; $y$ represents the other modality corresponding to $x$ (e.g., the text corresponding to an image, or the image corresponding to a text); $z_y$ is the latent space feature of that modality; $y'$ is the reconstructed feature of that modality; and $f_s$ is the self-awareness feature extracted by the self-awareness module.
[0064] In this embodiment, the interaction between the self-awareness features of the original modality and the input / latent space features of the corresponding modality is implemented with a Transformer-based cross-attention structure, to improve the interaction between the different kinds of self-awareness information. The specific calculation is as follows (the encoder's calculation is shown; replacing $y$ with $z_y$ gives the decoder's calculation):
[0065] $Q = W_Q \cdot f_s$
[0066] $K = W_K \cdot y$
[0067] $V = W_V \cdot y$
[0068] $o = \mathrm{MHA}(Q, K, V)$
[0069] Among them, $W_Q$, $W_K$, and $W_V$ are the network parameters of the attention mechanism in the Transformer network; $Q$, $K$, and $V$ are the correspondingly generated query, key, and value matrices; and MHA() is the multi-head attention mechanism.
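The cross-attention calculation above, with queries taken from the self-awareness features and keys and values taken from the other modality, can be sketched in NumPy. This is a simplified illustration under stated assumptions: tokens are stored as rows (so the projection is written `f_s @ W_Q`), scaled dot-product attention is used inside each head, the usual output projection is omitted, and all shapes and random values are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_s, y, W_Q, W_K, W_V, num_heads):
    """f_s: (n_q, d_model) queries source; y: (n_kv, d_model) keys/values source.

    Returns an array of shape (n_q, d_model).
    """
    Q, K, V = f_s @ W_Q, y @ W_K, y @ W_V
    n_q, d_model = Q.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (tokens, d_model) -> (heads, tokens, d_head)
        return m.reshape(m.shape[0], num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n_q, n_kv)
    weights = softmax(scores, axis=-1)                     # attend over y tokens
    out = weights @ Vh                                     # (heads, n_q, d_head)
    return out.transpose(1, 0, 2).reshape(n_q, d_model)    # concatenate heads

rng = np.random.default_rng(0)
d_model = 8
f_s = rng.standard_normal((4, d_model))   # 4 self-awareness tokens
y = rng.standard_normal((6, d_model))     # 6 tokens of the other modality
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) for _ in range(3))
o = cross_attention(f_s, y, W_Q, W_K, W_V, num_heads=2)
```

For the decoder-side calculation, the same function would be called with the latent features in place of `y`.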
[0070] In this embodiment, in order to train the conditional VAE codec, it is necessary to collect paired image and text samples, extract the self-perceived features of the original modality, and perform reconstruction training according to the above process; it should be noted that the main loss of the entire network is the reconstruction loss of the target modality.
[0071] Step 130: Train the image-text encoder and image-text decoder based on the decoded information and the multi-task loss function to obtain the image-text mutual generation model; wherein the target loss is determined from the reconstruction loss, the loss of the image understanding auxiliary task, and the loss of the text understanding auxiliary task.
[0072] In this step, the target loss can be determined by the reconstruction loss and assignable weights, the loss corresponding to the image class understanding aid task and assignable weights, and the loss corresponding to the text class understanding aid task and assignable weights.
[0073] For example, if the reconstruction loss is C1 with weight W1, the loss of the image understanding auxiliary task is C2 with weight W2, and the loss of the text understanding auxiliary task is C3 with weight W3, then the target loss $Loss_{total}$ is expressed as:
[0074] $Loss_{total} = W1 \cdot C1 + W2 \cdot C2 + W3 \cdot C3$;
[0075] Among them, W1, W2 and W3 can be customized according to user needs.
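As a small worked example of the weighted target loss, the following Python sketch evaluates W1*C1 + W2*C2 + W3*C3. The function name, the default weights of 1.0, and the numeric loss values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def target_loss(c1, c2, c3, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of reconstruction (c1), image-auxiliary (c2), text-auxiliary (c3) losses."""
    return w1 * c1 + w2 * c2 + w3 * c3

# Reconstruction loss kept at full weight, auxiliary tasks down-weighted.
total = target_loss(2.0, 1.0, 0.5, w1=1.0, w2=0.5, w3=0.2)
# 1.0 * 2.0 + 0.5 * 1.0 + 0.2 * 0.5 = 2.6 (up to floating-point rounding)
```

Lowering W2 and W3 relative to W1 keeps reconstruction as the main objective while still letting the auxiliary tasks shape the encoder and decoder.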
[0076] It should be noted that, due to the high difficulty of simultaneously encoding and decoding images and text, training solely based on reconstruction loss leads to a significant performance degradation in the encoder and decoder. To address this issue, an encoder-decoder training method based on auxiliary tasks is proposed. Specifically, in addition to the main reconstruction loss, auxiliary tasks for image understanding (e.g., image segmentation models to determine whether the segmentation maps of the reconstructed image and the original image are consistent) and auxiliary tasks for text understanding (e.g., text summarization models to determine whether the summary of the original text is consistent with the generated text) are introduced. By training the encoder and decoder through multi-task loss, the performance of the encoder and decoder can be improved compared to the original single-task loss.
[0077] This invention provides a training method for a text-image interaction model. The method extracts self-awareness information from sample modal data using a modal self-awareness unit, encodes the self-awareness information using an image-text encoder to obtain latent space features, performs multimodal diffusion processing on these features to obtain the diffused latent space features of the target modality type, decodes these features using an image-text decoder to obtain decoded information, and finally trains the image-text encoder and decoder based on the decoded information and a multi-task loss function to obtain the text-image interaction model. By extracting self-awareness information from the modal data and introducing task-related conditional encoders and decoders, the invention enables the whole text-image interaction model to adaptively perform text-to-image or image-to-text tasks, improving the performance and adaptability of the model.
[0078] In some embodiments, after obtaining the latent space features, the method further includes: extracting conditional features from the sample modality data based on the pre-trained multimodal model CLIP, so that multimodal diffusion can determine new latent space features based on the conditional features and the latent space features.
[0079] In this embodiment, the multimodal diffusion model receives the latent space features and conditional features output by the VAE encoder as input during the training phase, and outputs the latent space features of the diffused target modality.
[0080] Specifically, the latent space features, the conditional features, and the diffusion step are used as inputs to predict the noise, and the target modality (image or text) is then decoded by the VAE decoder. The calculation process can be represented by the following formula:
[0081] $n_{t-1} = \mathrm{dm}(z_t, t, c)$
[0082] Where $n_{t-1}$ represents the predicted noise at step $t-1$, $z_t$ represents the latent space features at step $t$, $t$ represents the diffusion step number, and $c$ represents the conditional features of the original modality (in image-to-text generation, the conditional features of the image, which can be obtained using the CLIP image encoder; in text-to-image generation, the conditional features of the text, which can be obtained using the CLIP text encoder).
[0083] In this embodiment, based on $n_{t-1}$ and $z_t$, a sampler can be used to obtain the new latent space features $z_{t-1}$. After multiple diffusion steps, the output latent space feature $z_0$ is obtained; inputting $z_0$ into the VAE decoder yields the output of the corresponding modality.
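The iterative structure described above (predict noise with dm, apply a sampler to obtain the previous latent, repeat until z_0) can be sketched as a simple loop. Both `toy_dm` and `toy_sampler` are hypothetical stand-ins invented for this illustration: they are not a trained noise predictor or a real sampler with a noise schedule, and only the control flow is meant to mirror the description.

```python
import numpy as np

def toy_dm(z_t, t, c):
    """Hypothetical noise predictor: treats the offset from the condition as noise."""
    return z_t - c

def toy_sampler(z_t, n_pred, t):
    """Hypothetical sampler: removes a 1/t share of the predicted noise."""
    return z_t - n_pred / t

def reverse_diffusion(z_T, c, num_steps, dm=toy_dm, sampler=toy_sampler):
    """Run steps t = num_steps, ..., 1 and return the final latent z_0."""
    z_t = np.asarray(z_T, dtype=float)
    for t in range(num_steps, 0, -1):
        n_pred = dm(z_t, t, c)            # n_{t-1} = dm(z_t, t, c)
        z_t = sampler(z_t, n_pred, t)     # z_{t-1} from z_t and n_{t-1}
    return z_t

rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 0.5])            # stand-in conditional features
z_0 = reverse_diffusion(rng.standard_normal(3), c, num_steps=10)
```

With these toy components the loop converges to the condition vector, which makes the data flow easy to check; in the described system, z_0 would instead be passed to the VAE decoder together with the self-awareness features.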
[0084] In this embodiment, during the testing phase the multimodal diffusion model receives Gaussian noise, the self-awareness features, the conditional features, and the diffusion step count as inputs, and outputs the corresponding decoded output.
[0085] Figure 2 is the second flowchart of the training method of the image-text interaction model provided by the present invention. In the embodiment shown in Figure 2, text-to-image generation is taken as an example. During the training phase, the text (TXT) is input into the modal self-awareness unit (the self-awareness module in the figure) to extract the self-awareness information (the self-awareness features $f_s$). The self-awareness features $f_s$ and the image $y$ are input into the encoder to obtain the corresponding latent space features $z_t$. CLIP extracts the conditional features $c$ from the text, and the diffusion step $t$ is obtained; $z_t$, $c$, and $t$ are then input into the diffusion model, which outputs the diffused latent space feature $z_0$ of the target modality type. The decoder decodes $z_0$ and $f_s$ to obtain the decoded information $y'$, and the loss value is calculated from the reconstruction loss to train the encoder and decoder, yielding the image-text interaction model, which is then used in the testing process to generate the image corresponding to the input TXT. During the testing phase, the multimodal diffusion model receives Gaussian noise $x \sim N(0, I)$, $f_s$, $c$, and $t$, and outputs the corresponding latent space feature $z_0$; finally, the trained decoder decodes $z_0$ and $f_s$ and outputs the image corresponding to the input TXT.
[0086] The present invention provides a training method for a text-image interaction model. By using the pre-trained multimodal model CLIP to extract conditional features from the sample modal data, the multimodal diffusion model can determine new latent space features based on the conditional features and the latent space features. The resulting text-image interaction model supports both text-to-image and image-to-text generation, thus realizing text-image interaction.
[0087] In some embodiments, after obtaining the image-text interaction model, the method further includes: generating multiple output results from the input data of the image-text interaction model using a multi-path seed progressive optimization algorithm, calculating the quality score corresponding to each output result, and determining the output result with the highest quality score as the final output of the image-text interaction model.
[0088] In this embodiment, during the training phase, auxiliary task supervision (an auxiliary task other than diffusion loss) is used to provide better consistency between the generated content and the target content, so as to ensure the high quality of the generated content.
[0089] Specifically, after training is completed, the image-text interaction model uses a multi-seed progressive optimization method (N random seeds are run in parallel each time) to process the same input along multiple paths and generate the corresponding decoded output. In this process, after each path has run a certain number of diffusion steps (STEP), a generation quality model (trained in advance on labeled data) selects the best intermediate state, and generation then continues from this intermediate state, thereby improving the quality.
[0090] For example, with N=10 seeds and STEP=10 steps, in the first 10 steps, 10 seeds are diffused simultaneously. After the 10th step, the intermediate state with the highest quality score is selected for subsequent diffusion. This operation can be repeated multiple times to further improve the quality.
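The multi-seed progressive selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `step_fn` and `quality_fn` callables, and the toy dynamics are hypothetical stand-ins for the diffusion step and the pre-trained generation quality model, and the branches run sequentially here rather than in parallel.

```python
import numpy as np

def progressive_seed_search(shape, step_fn, quality_fn,
                            n_seeds=10, step=10, rounds=3, seed=0):
    """Every `step` diffusion steps, keep the best-scoring of `n_seeds` branches."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(rounds):
        candidates = []
        for _ in range(n_seeds):
            branch_rng = np.random.default_rng(int(rng.integers(2**32)))
            # First round: each seed starts from its own Gaussian noise.
            # Later rounds: every branch resumes from the selected state.
            z = branch_rng.standard_normal(shape) if best is None else best
            for _ in range(step):
                z = step_fn(z, branch_rng)
            candidates.append(z)
        scores = [quality_fn(z) for z in candidates]
        best = candidates[int(np.argmax(scores))]   # highest quality score wins
    return best

# Toy stand-ins: a noisy contraction as the "diffusion step" and the
# negative L1 norm as the "quality score".
def toy_step(z, branch_rng):
    return 0.9 * z + 0.05 * branch_rng.standard_normal(z.shape)

def toy_quality(z):
    return -float(np.abs(z).sum())

result = progressive_seed_search((4,), toy_step, toy_quality,
                                 n_seeds=10, step=10, rounds=3)
```

With N=10 and STEP=10 this mirrors the example in the text: ten branches are advanced for ten steps, the best intermediate state is kept, and the procedure repeats from that state.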
[0091] The present invention provides a training method for a text-image interaction model. By using a multi-path seed progressive optimization algorithm to generate multiple output results from the input data of the text-image interaction model, a quality score is calculated for each output result, and the output result with the highest quality score is determined as the final output of the text-image interaction model, which effectively improves the quality of the text-image interaction output results.
[0092] The training apparatus for the text-image interaction model provided by the present invention will be described below. The training apparatus for the text-image interaction model described below can be referred to in correspondence with the training method for the text-image interaction model described above.
[0093] Figure 3 is a schematic diagram of the structure of the training device for the image-text interaction model provided by the present invention. As shown in Figure 3, the training device for the image-text interaction model includes a feature extraction module 310, an encoding / decoding module 320, and a training module 330.
[0094] The feature extraction module 310 is used to extract self-sensing information from sample modal data based on modal self-sensing units. The self-sensing information includes modal type and corresponding modal features. The modal self-sensing units are obtained through multi-task supervised training based on self-attention networks. The sample modal data includes one of sample images and sample text.
[0095] The encoding / decoding module 320 is used to encode the self-awareness information based on the image-text encoder to obtain latent space features, and to perform multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type; and to decode the self-awareness information and the diffused latent space features of the target modality type based on the image-text decoder to obtain decoded information; wherein the image-text encoder and image-text decoder are based on the conditional variational autoencoder (cVAE) architecture, and the target modality type is one of the sample image and the sample text;
[0096] The training module 330 is used to train the image-text encoder and image-text decoder based on the decoded information and the multi-task loss function to obtain the image-text mutual generation model; wherein the target loss is determined from the reconstruction loss, the loss of the image understanding auxiliary task, and the loss of the text understanding auxiliary task.
[0097] This invention provides a training device for a text-image interaction model. It extracts self-awareness information from sample modal data using a modal self-awareness unit, encodes the self-awareness information using an image-text encoder to obtain latent space features, performs multimodal diffusion processing on these features to obtain the diffused latent space features of the target modality type, decodes these features using an image-text decoder to obtain decoded information, and finally trains the image-text encoder and decoder based on the decoded information and a multi-task loss function to obtain the text-image interaction model. By extracting self-awareness information from the modal data and introducing task-related conditional encoders and decoders, the invention enables the whole text-image interaction model to adaptively perform text-to-image or image-to-text tasks, improving the performance and adaptability of the model.
[0098] In some embodiments, the training apparatus for the text-image interaction model further includes: a quality screening module, which, after obtaining the text-image interaction model, performs multi-path generation on the input data of the text-image interaction model according to a multi-path seed progressive optimization algorithm to obtain multiple output results, calculates the quality score corresponding to each output result, and determines the output result with the highest quality score as the final output of the text-image interaction model.
[0099] The present invention provides a training device for a text-image interaction model. The device generates multiple output results from the input data of the text-image interaction model through a multi-path seed progressive optimization algorithm, calculates the quality score corresponding to each output result, and determines the output result with the highest quality score as the final output of the text-image interaction model, thereby effectively improving the quality of the text-image interaction output results.
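The selection step described above can be sketched as a best-of-N choice over several generation paths. The function `select_best_output`, the use of integer seeds to distinguish paths, and the toy generator and scorer are assumptions for illustration; the progressive aspect of the algorithm and the actual quality metric are not detailed in this passage and are not modeled here.

```python
from typing import Callable, Sequence, Tuple, TypeVar

I = TypeVar("I")
O = TypeVar("O")


def select_best_output(
    input_data: I,
    generate: Callable[[I, int], O],
    score: Callable[[I, O], float],
    seeds: Sequence[int],
) -> Tuple[O, float]:
    """Generate one candidate per seed, score each, and return the best one."""
    if not seeds:
        raise ValueError("at least one seed is required")
    candidates = [generate(input_data, seed) for seed in seeds]   # multi-path generation
    scores = [score(input_data, c) for c in candidates]           # quality score per output
    best = max(range(len(candidates)), key=scores.__getitem__)    # highest score wins
    return candidates[best], scores[best]


# Toy generator and scorer (hypothetical), chosen to be deterministic.
def toy_generate(x: int, seed: int) -> int:
    return x * seed


def toy_score(x: int, candidate: int) -> float:
    return -abs(candidate - 6)  # closer to 6 is "better"


best_output, best_score = select_best_output(2, toy_generate, toy_score, [1, 2, 3, 4])
print(best_output, best_score)
```

In a real system, `generate` would call the trained text-image interaction model with a given random seed, and `score` would be a learned or rule-based quality measure.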
[0100] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 4, the electronic device may include: a processor 410, a communication interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 can call logical instructions in the memory 430 to execute a training method for the text-image interaction model. This method includes: extracting self-awareness information from sample modal data based on a modal self-awareness unit, where the self-awareness information includes a modal type and corresponding modal features, the modal self-awareness unit is obtained through multi-task supervised training based on a self-attention network, and the sample modal data includes either a sample image or a sample text; encoding the self-awareness information using an image-text encoder to obtain latent space features, and performing multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modal type; decoding the self-awareness information and the diffused latent space features of the target modal type using an image-text decoder to obtain decoded information, wherein the image-text encoder and the image-text decoder are based on a conditional variational autoencoder (cVAE) architecture, and the target modal type corresponds to either the sample image or the sample text; and training the image-text encoder and the image-text decoder based on the decoded information and a multi-task loss function to obtain the text-image interaction model, wherein the target loss includes a reconstruction loss, a loss corresponding to the image-class understanding auxiliary task, and a loss corresponding to the text-class understanding auxiliary task.
[0101] Furthermore, the logical instructions in the aforementioned memory 430 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0102] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the training method for the text-image interaction model provided by the above methods. The method includes: extracting self-awareness information from sample modal data based on a modal self-awareness unit, wherein the self-awareness information includes a modal type and corresponding modal features, the modal self-awareness unit is obtained through multi-task supervised training based on a self-attention network, and the sample modal data includes one of a sample image and a sample text; encoding the self-awareness information based on an image-text encoder to obtain latent space features, and performing multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type; decoding the diffused latent space features of the target modality type based on an image-text decoder to obtain decoded information, wherein the image-text encoder and the image-text decoder are based on a conditional variational autoencoder (cVAE) architecture, and the target modality type corresponds to one of the sample image and the sample text; and training the image-text encoder and the image-text decoder according to the decoded information and the multi-task loss function to obtain the text-image interaction model, wherein the target loss includes a reconstruction loss, a loss corresponding to the image-class understanding auxiliary task, and a loss corresponding to the text-class understanding auxiliary task.
[0103] Furthermore, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the training method for the text-image interaction model provided by the methods described above. This method includes: extracting self-awareness information from sample modal data based on a modal self-awareness unit, wherein the self-awareness information includes a modal type and corresponding modal features, the modal self-awareness unit is obtained through multi-task supervised training based on a self-attention network, and the sample modal data includes either a sample image or a sample text; encoding the self-awareness information based on an image-text encoder to obtain latent space features, and performing multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type; decoding the diffused latent space features of the target modality type based on an image-text decoder to obtain decoded information, wherein the image-text encoder and the image-text decoder are based on a conditional variational autoencoder (cVAE) architecture, and the target modality type corresponds to either the sample image or the sample text; and training the image-text encoder and the image-text decoder based on the decoded information and a multi-task loss function to obtain the text-image interaction model, wherein the target loss includes a reconstruction loss, a loss corresponding to the image-class understanding auxiliary task, and a loss corresponding to the text-class understanding auxiliary task.
[0104] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0105] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0106] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A training method for a text-image interaction model, characterized in that the method comprises:
extracting self-awareness information from sample modal data based on a modal self-awareness unit, wherein the self-awareness information includes a modal type and corresponding modal features, the modal self-awareness unit is obtained through multi-task supervised training based on a self-attention network, and the sample modal data includes one of a sample image and a sample text;
encoding the self-awareness information based on an image-text encoder to obtain latent space features, and performing multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type;
decoding the self-awareness information and the diffused latent space features of the target modality type based on an image-text decoder to obtain decoded information, wherein the image-text encoder and the image-text decoder are obtained based on a conditional variational autoencoder (cVAE) architecture, and the target modality type corresponds to one of the sample image and the sample text;
training the image-text encoder and the image-text decoder based on the decoded information and a multi-task loss function to obtain the text-image interaction model, wherein the target loss includes a reconstruction loss, a loss corresponding to the image-class understanding auxiliary task, and a loss corresponding to the text-class understanding auxiliary task;
wherein the multi-task loss function is expressed by the following formula: [formula not reproduced in this text]; where the terms of the formula denote: the weight corresponding to the sample images; the weight corresponding to the sample text; the image-text matching loss, used to determine whether the semantics of the image and the text are consistent; the cross-entropy loss used in the image classification task; and the cross-entropy loss used in the text sentiment classification task; and two of these terms are each further expressed by a formula [formulas not reproduced in this text];
after obtaining the latent space features, the method further comprises: extracting conditional features from the sample modal data based on the pre-trained multimodal model CLIP for use in the multimodal diffusion, and determining new latent space features based on the conditional features and the latent space features;
after obtaining the text-image interaction model, the method further comprises: performing multi-path generation on the input data of the text-image interaction model according to a multi-path seed progressive optimization algorithm to obtain multiple output results, calculating the quality score corresponding to each output result, and determining the output result with the highest quality score as the final output of the text-image interaction model.
2. The training method for the text-image interaction model according to claim 1, characterized in that the modal features are obtained using the following formula: [formula not reproduced in this text]; where the symbols of the formula denote the modal features, the modal self-awareness unit, the sample modal data, the conditional prompt, and the conditional image.
3. A training apparatus for a text-image interaction model, employing the training method for the text-image interaction model as described in claim 1, characterized in that the apparatus comprises:
a feature extraction module, used to extract self-awareness information from sample modal data based on a modal self-awareness unit, wherein the self-awareness information includes a modal type and corresponding modal features, the modal self-awareness unit is obtained through multi-task supervised training based on a self-attention network, and the sample modal data includes one of a sample image and a sample text;
an encoding/decoding module, used to encode the self-awareness information based on an image-text encoder to obtain latent space features, to perform multimodal diffusion processing on the latent space features to obtain the diffused latent space features of the target modality type, and to decode the self-awareness information and the diffused latent space features of the target modality type based on an image-text decoder to obtain decoded information, wherein the image-text encoder and the image-text decoder are obtained based on a conditional variational autoencoder (cVAE) architecture, and the target modality type corresponds to one of the sample image and the sample text;
a training module, used to train the image-text encoder and the image-text decoder based on the decoded information and the multi-task loss function to obtain the text-image interaction model, wherein the target loss includes a reconstruction loss, a loss corresponding to the image-class understanding auxiliary task, and a loss corresponding to the text-class understanding auxiliary task.
4. The training apparatus for the text-image interaction model according to claim 3, characterized in that the apparatus further comprises: a quality screening module, used to, after the text-image interaction model is obtained, perform multi-path generation on the input data of the text-image interaction model according to the multi-path seed progressive optimization algorithm to obtain multiple output results, calculate the quality score corresponding to each output result, and determine the output result with the highest quality score as the final output of the text-image interaction model.
5. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the training method for the text-image interaction model as described in any one of claims 1 to 2.
6. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the training method for the text-image interaction model as described in any one of claims 1 to 2.
7. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the training method for the text-image interaction model as described in any one of claims 1 to 2.