Method for generating multi-modal model, multi-modal processing method and device

By generating self-generated training samples from descriptive text of acquired modal information, and by combining them with basic training samples and data augmentation, the multimodal model is iteratively fine-tuned. This addresses the insufficient training efficiency and accuracy of multimodal models in existing technologies and enables efficient multimodal information processing.

CN117216202B · Active · Publication Date: 2026-06-19 · HANGZHOU ALICLOUD FEITIAN INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HANGZHOU ALICLOUD FEITIAN INFORMATION TECH CO LTD
Filing Date: 2023-08-08
Publication Date: 2026-06-19

Application Information

Patent Timeline

08 Aug 2023: Application
19 Jun 2026: Publication (CN117216202B)

IPC: G06F16/3329; G06F16/334; G06F16/335; G06F18/214

AI Tagging

Application Domain

Digital data information retrieval; Special data processing applications

Technology Topics

Manual annotation; Engineering

AI Technical Summary

Technical Problem

In existing technologies, pre-trained language models only support text information processing and have difficulty directly processing other modal information such as images, audio, and video, resulting in insufficient training efficiency and accuracy of multimodal models.

Method used

By acquiring descriptive text containing modal information, pre-trained language models are used to generate self-generated training samples. Combined with basic training samples and data augmentation techniques, the multimodal model is iteratively fine-tuned to improve its multimodal processing capabilities.

Benefits of technology

While reducing manual annotation time, it improves the training efficiency and accuracy of multimodal models, enabling better extraction of detailed features from modal information and adapting to the multimodal needs of various scenarios.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a method for generating a multimodal model, a method for processing multimodal data, and an apparatus, comprising: acquiring modal information; determining the descriptive text corresponding to the modal information; generating input and output information based on the descriptive text using a pre-trained base model to obtain self-generated training samples; constructing a multimodal model based on the base model; and fine-tuning the multimodal model based on the self-generated training samples; and further fine-tuning the multimodal model by generating input and output information based on the modal information using the fine-tuned multimodal model. The solution provided in this application enables the model to process multimodal information, reduces the time spent on manual annotation, improves overall training efficiency, and can uncover rich detailed features in the modal information, thereby improving the accuracy of the model.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method for generating a multimodal model, a multimodal processing method, and an apparatus.

Background Technology

[0002] With the widespread application of large models in the field of artificial intelligence, they have demonstrated strong processing capabilities in language understanding, generation, interaction and reasoning, and a variety of application scenarios have emerged.

[0003] For example, a dialogue model is a large model with dialogue capabilities. It can generate the following dialogue text based on the preceding dialogue text, meeting intelligent dialogue needs in different scenarios. In some technologies, a usable dialogue model is obtained by fine-tuning a pre-trained language model on downstream tasks.

[0004] However, pre-trained language models can typically process only text and cannot directly process information from other modalities, such as images, audio, and video. This restricts the model to a single input modality, which makes it difficult to meet user needs. Some technologies have attempted to use multimodal information when training models, but the overall efficiency and accuracy of training still need to be improved.

Summary of the Invention

[0005] The main objective of this application is to provide a method for generating multimodal models, a method for processing multimodal data, and an apparatus to improve the training efficiency and accuracy of multimodal models.

[0006] In a first aspect, embodiments of this application provide a method for generating a multimodal model, comprising:

[0007] Obtaining modal information, wherein the modal information includes content of any one or more modalities other than text;

[0008] Determining the description text corresponding to the modal information, and generating input information and output information based on the description text through a pre-trained base model to obtain self-generated training samples, wherein the self-generated training samples include: the modal information, the input information, and the output information;

[0009] Constructing a multimodal model based on the base model, and fine-tuning the multimodal model based on the self-generated training samples;

[0010] Generating input information and output information based on the modal information through the fine-tuned multimodal model, and further fine-tuning the multimodal model; wherein the fine-tuned multimodal model is used to determine output information based on modal information and input information.

[0011] In a second aspect, embodiments of this application provide a method for generating a multimodal dialogue model, comprising:

[0012] Obtaining modal information, wherein the modal information includes content of any one or more modalities other than text;

[0013] Determining the description text corresponding to the modal information, and generating preceding dialogue text and following dialogue text based on the description text through a pre-trained language model to obtain self-generated training samples, wherein the self-generated training samples include: the modal information, the preceding dialogue text, and the following dialogue text;

[0014] Constructing a multimodal dialogue model based on the language model, and fine-tuning the multimodal dialogue model based on the self-generated training samples;

[0015] Generating preceding dialogue text and following dialogue text based on the modal information through the fine-tuned multimodal dialogue model, and further fine-tuning the multimodal dialogue model; wherein the fine-tuned multimodal dialogue model is used to determine following dialogue text based on modal information and preceding dialogue text.

[0016] In a third aspect, embodiments of this application provide a multimodal processing method, comprising:

[0017] Obtaining information to be processed, which includes modal information and input information, wherein the modal information contains content of any one or more modalities other than text;

[0018] Determining, based on a multimodal model, the output information corresponding to the information to be processed;

[0019] Wherein the multimodal model is generated by the method described in any of the preceding aspects.

[0020] In a fourth aspect, embodiments of this application provide an electronic device, comprising:

[0021] At least one processor; and

[0022] A memory that is communicatively connected to the at least one processor;

[0023] The memory stores instructions that can be executed by the at least one processor to cause the electronic device to perform the method described in any of the above aspects.

[0024] The multimodal model generation method, multimodal processing method, and device provided in this application can acquire modal information, which includes content of any one or more modalities other than text. The method determines the descriptive text corresponding to the modal information, and generates input and output information based on the descriptive text using a pre-trained base model to obtain self-generated training samples. These self-generated training samples include the modal information, the input information, and the output information. A multimodal model is constructed based on the base model, and the multimodal model is fine-tuned based on the self-generated training samples. Using the fine-tuned multimodal model, input and output information are generated based on the modal information, and the multimodal model is further fine-tuned. This enables the model to process multimodal information and allows for iterative fine-tuning using the model's multimodal understanding capabilities. Only a small amount of initial annotation is required to complete model training, reducing the time spent on manual annotation and improving overall training efficiency. Furthermore, generating training samples first using the descriptive text and then using the modal information itself allows for the extraction of rich detailed features from the modal information, improving the model's accuracy.

Attached Figure Description

[0025] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0026] Figure 1 is a schematic diagram of an application scenario provided by an embodiment of this application;

[0027] Figure 2 is a flowchart of a method for generating a multimodal dialogue model provided in an embodiment of this application;

[0028] Figure 3 is an interaction diagram of data annotation provided in an embodiment of this application;

[0029] Figure 4 is a schematic diagram of the principle of a multimodal dialogue model provided in an embodiment of this application;

[0030] Figure 5 is a flowchart of model optimization provided in an embodiment of this application;

[0031] Figure 6 is a schematic diagram of the principle of contrastive learning provided in an embodiment of this application;

[0032] Figure 7 is a schematic diagram of the principle of a method for generating a multimodal dialogue model provided in an embodiment of this application;

[0033] Figure 8 is a flowchart of another method for generating a multimodal dialogue model provided in an embodiment of this application;

[0034] Figure 9 is a flowchart of a method for generating a multimodal model provided in an embodiment of this application;

[0035] Figure 10 is a flowchart of a multimodal dialogue method provided in an embodiment of this application;

[0036] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

[0037] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0038] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.

[0039] It should be noted that the user information (including but not limited to user device information, user attribute information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with relevant laws, regulations and standards, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0040] First, the terms used in this application are explained:

[0041] Large Language Model (LLM): Also known as a large-scale language model, it usually refers to a deep learning model with billions to trillions or even more parameters, which typically requires large-scale data and computing resources for training.

[0042] Self-instruction (self-instruct): refers to having the model generate its own instruction data, which is then used to fine-tune the model so that it learns certain features of the data or learns to perform a certain task.

[0043] Multimodal: refers to data that contains multiple modalities (such as text, images, voice, video, etc.).

[0044] Instruction fine-tuning: refers to fine-tuning the model using manually provided instructions or tasks.

[0045] Supervised Fine-tuning (SFT): Fine-tuning the model using supervised data.

[0046] Large models refer to deep learning models with a massive number of parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even more than ten trillion parameters. Large models can also be called foundation models. They are pre-trained on large-scale unlabeled corpora, producing pre-trained models with hundreds of millions of parameters. These models can adapt to a wide range of downstream tasks and have good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.

[0047] In practical applications, large models only require a small number of samples to fine-tune the pre-trained model before they can be applied to different tasks. Large models can be widely used in fields such as Natural Language Processing (NLP) and Computer Vision. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. The main application scenarios for large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.

[0048] In the field of natural language processing, large language models can be applied to scenarios such as dialogue, translation, and code generation, possessing a certain level of language understanding and reasoning capabilities. Currently, these models typically only support text input and cannot directly utilize information from other modalities, such as images, videos, and audio. Compared to plain text data, bimodal, trimodal, or even multimodal interactive data can effectively enhance the processing capabilities of large models, meeting the needs of different scenarios. To achieve multimodal processing capabilities, models can be trained using multimodal data; however, manual annotation of data from other modalities is required before training, resulting in low efficiency and high costs.

[0049] In some technologies, model training can be achieved by generating data. However, due to the lack of detail and richness in the generated data, the model cannot accurately extract more granular content, resulting in certain deficiencies in alignment and data accuracy during model training.

[0050] In view of this, embodiments of this application provide a method for fine-tuning a model using multimodal instructions. This method can fully utilize the text capabilities of existing, pre-trained plain text language models, and combine basic data, self-instructions, and human feedback to iteratively fine-tune the model, enabling it to possess multimodal processing capabilities.

[0051] Taking a multimodal dialogue model as an example, the method in the embodiments of this application enables the model to obtain the following dialogue text based on the preceding dialogue text and the modal information.

[0052] The modal information may include content in one or more modalities other than text. Optionally, modalities other than text may include images, audio, video, etc. The modal information may include content in one or more modalities, for example, it may be a single image, a single audio, a single video, etc., or it may be any combination of two or more of images, audio, and video. For example, a set of modal information may include a combination of images, audio, and video.

[0053] The preceding and following dialogue texts can be text content belonging to different roles that appear during the dialogue. The preceding dialogue text can be content entered by the user, and the following dialogue text can be a response from other roles (customer service, chatbot, etc.) to the preceding dialogue text. The preceding and following dialogue texts can typically be presented as questions and answers, but are not limited to these. For example, the preceding dialogue text can also be a statement, and the following dialogue text can be a response to that statement.

[0054] Specifically, modal information such as a target image can be obtained, and the descriptive text corresponding to the target image can be determined through manual annotation or other methods. Since the pre-trained language model (such as an LLM) already has text understanding capabilities, it can process the descriptive text directly to generate preceding dialogue text and following dialogue text. The preceding and following dialogue texts, together with the target image, are used as self-generated training samples to fine-tune the model, enabling the fine-tuned model to process images. The fine-tuned model then processes the target image directly to regenerate preceding and following dialogue texts, and the model is fine-tuned again, giving it multimodal processing capabilities. During fine-tuning, the model's existing capabilities are used for iterative processing, which reduces the time spent on manual annotation and improves overall training efficiency. Furthermore, generating samples first from the descriptive text and then from the target image itself can uncover rich detailed features in the image. The generated training samples cover a wide range of content in the image, allowing the model to extract image content more accurately and in finer detail, thus improving the model's accuracy.

[0055] In addition to the automatically generated training samples, the model can be fine-tuned with basic training samples and data-augmented training samples. Basic training samples can include preceding dialogue text and following dialogue text obtained by manually annotating target images. Data augmentation mainly involves randomly concatenating the images and dialogues from multiple existing training samples. In random concatenation, the images and the dialogues can each be ordered separately, in different orders, to simulate users asking questions out of order in real-world scenarios.
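
The random concatenation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of this application: the `Sample` class, the `concatenate_samples` function, and the choice to keep an image identifier attached to each dialogue turn are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Sample:
    """Hypothetical training sample: one image plus its dialogue turns."""
    image: str                      # identifier of the modal information
    turns: List[Tuple[str, str]]    # (preceding text, following text) pairs


def concatenate_samples(samples: List[Sample], rng: random.Random) -> Dict[str, list]:
    """Merge several samples, shuffling images and dialogue turns independently."""
    if len(samples) < 2:
        raise ValueError("concatenation needs at least two samples")
    images = [s.image for s in samples]
    # Assumption: keep the image identifier with each turn so that every
    # answer can still be traced to the image it describes.
    turns = [(s.image, q, a) for s in samples for q, a in s.turns]
    rng.shuffle(images)   # image order
    rng.shuffle(turns)    # dialogue order, shuffled separately from the images
    return {"images": images, "turns": turns}


if __name__ == "__main__":
    a = Sample("img_a", [("What color are the shoes?", "Red.")])
    b = Sample("img_b", [("What breed is the dog?", "A Border Collie.")])
    print(concatenate_samples([a, b], random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the shuffling reproducible, which may help when comparing fine-tuning runs.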

[0056] By integrating the training samples from the above-mentioned multiple forms, the model can be fine-tuned to adapt to various situations, thereby improving the overall training effect of the model.

[0057] Figure 1 is a schematic diagram illustrating an application scenario provided by an embodiment of this application. As shown in Figure 1, various modal information, such as images, audio, and video, can be collected from the network or through other means. Basic data is manually labeled, including descriptive text, preceding dialogue text, and following dialogue text. The modal information, preceding dialogue text, and following dialogue text are used as basic training samples. The basic training samples are randomly concatenated to obtain concatenated training samples. In addition, based on the descriptive text, preceding and following dialogue texts are generated using a pre-trained language model. These are combined with the modal information to obtain self-generated training samples.

[0058] Based on the training samples obtained above, the model is fine-tuned to enable it to process modal information. In the next stage, the fine-tuned model can directly process modal information to generate preceding and following dialogue texts, thus obtaining new multimodal training samples. After manual selection and rewriting, these multimodal training samples are used to further fine-tune the model. The further fine-tuned model performs better and can in turn be used to generate new multimodal training samples for continued fine-tuning until the model meets the requirements. The resulting model can then be used to derive the corresponding following dialogue text from modal information and preceding dialogue text, enabling its application in various human-computer interaction scenarios such as intelligent customer service.
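
The staged procedure in this paragraph can be summarized as a loop. The sketch below is an assumption-laden outline rather than the training code of this application: `fine_tune`, `generate`, `review`, and `meets_requirements` are hypothetical callables supplied by the caller, and the `max_rounds` cap is added here only to guarantee termination.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def iterative_fine_tune(
    model: Model,
    initial_samples: List[Sample],
    modal_items: List[str],
    fine_tune: Callable[[Model, List[Sample]], Model],
    generate: Callable[[Model, str], Sample],
    review: Callable[[List[Sample]], List[Sample]],
    meets_requirements: Callable[[Model], bool],
    max_rounds: int = 5,
) -> Tuple[Model, int]:
    """Fine-tune once, then repeatedly generate, review, and fine-tune again."""
    # Round 1: basic, concatenated, and description-derived samples.
    model = fine_tune(model, initial_samples)
    rounds = 1
    while rounds < max_rounds and not meets_requirements(model):
        # The fine-tuned model now reads the modal information directly.
        candidates = [generate(model, item) for item in modal_items]
        # Manual selection and rewriting is represented by `review`.
        kept = review(candidates)
        model = fine_tune(model, kept)
        rounds += 1
    return model, rounds
```

Keeping the four steps as injected callables leaves the choice of training framework and review tooling open.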

[0059] The following detailed description of some embodiments of this application is provided in conjunction with the accompanying drawings. Where there is no conflict between the embodiments, the following embodiments and features can be combined with each other. Furthermore, the timing of the steps in the following method embodiments is merely an example and not a strict limitation.

[0060] Figure 2 is a flowchart illustrating a method for generating a multimodal dialogue model provided in an embodiment of this application. The executing entity of this method can be any device with data processing capabilities, such as a terminal device or server. In practical applications, the method in this embodiment can be implemented in the cloud, deployed locally, implemented on a client side, or implemented using IoT (Internet of Things) devices, etc. As shown in Figure 2, the method may include:

[0061] Step 201: Obtain modal information, wherein the modal information includes content of any one or more modalities other than text.

[0062] Optionally, modalities other than text can be images, audio, video, etc. The modal information can contain content of one or more modalities, for example, it can be a single image, a single audio, a single video, etc., or it can be any two or more combinations of images, audio, and video. For example, a set of modal information can include a combination of images, audio, and video.

[0063] The modal information can be obtained from the network, or from relevant servers or databases. For example, when it is necessary to train an intelligent customer service dialogue model for an e-commerce platform, the corresponding modal information can be obtained from the e-commerce platform.

[0064] Step 202: Determine the description text corresponding to the modal information. Based on the description text, generate the preceding dialogue text and following dialogue text through a pre-trained language model to obtain self-generated training samples. The self-generated training samples include: the modal information, the preceding dialogue text, and the following dialogue text.

[0065] The preceding and following dialogue texts can serve as the input and output information of the language model, respectively. The language model can be used as a base model for further construction of multimodal models, such as multimodal dialogue models.

[0066] Optionally, the descriptive text corresponding to the modal information can be text used to describe the modal information in a fine-grained manner. The descriptive text may include: one or more items contained in the modal information, and the attributes of each item, etc. For example, for an image, the fine-grained descriptive text may include all objects appearing in the image, as well as the color, action, position, and interrelationships of each object, etc.

[0067] In this step, a pre-trained language model can be used to generate training samples based on the descriptive text. These training samples are generated by the model itself without manual annotation and are therefore referred to as self-generated training samples.

[0068] The pre-trained language model can be an LLM, which has a certain language processing capability. It can generate corresponding questions based on the descriptive text, and can also provide answers to those questions.

[0069] Optionally, the descriptive text can be directly input into the language model, which generates one or more questions as the preceding dialogue text. Then, the questions and the descriptive text are input into the language model to obtain the answers, which serve as the following dialogue text. In this embodiment, the preceding and following dialogue texts are not limited to questions and answers. For example, the preceding dialogue text can also be a statement, and the following dialogue text can be a response to that statement.

[0070] For example, in a customer service scenario, the preceding text of the dialogue is the content entered by the user, and the following text is the response from the customer service representative. The dialogue can be single-turn or multi-turn. In a multi-turn dialogue scenario, the preceding text can include previous rounds of dialogue as well as the question in the current round, thus allowing for a more accurate answer by combining contextual information.
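
For the multi-turn case, one way to assemble the preceding text from earlier rounds plus the current input is sketched below. The function name and the `User:` / `Agent:` labels are hypothetical choices for illustration; this application does not specify a serialization format.

```python
from typing import List, Tuple


def build_preceding_text(history: List[Tuple[str, str]], current_input: str) -> str:
    """Join earlier rounds and the current user input into one preceding text."""
    lines = []
    for user_text, reply in history:
        lines.append(f"User: {user_text}")
        lines.append(f"Agent: {reply}")
    # The current round contributes only the user's side; the reply is what
    # the model is expected to produce as the following text.
    lines.append(f"User: {current_input}")
    return "\n".join(lines)
```

With an empty history the function reduces to the single-turn case, so the same sample format can cover both scenarios.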

[0071] In addition to generating the preceding and following dialogue texts directly from the description text, prompt words can also be used when generating the preceding dialogue text or the following dialogue text.

[0072] Optionally, self-instruction can be used to obtain self-generated training samples. Self-instruction can utilize contextual prompts to automatically generate new instructions. In this embodiment, the new instructions can refer to the generated preceding dialogue text or following dialogue text.

[0073] Optionally, the step of generating the preceding and following dialogue texts based on the description text using a pre-trained language model may include: obtaining prompt words; inputting the description text and prompt words into the pre-trained language model to obtain the preceding dialogue text; and inputting the description text and the obtained preceding dialogue text into the language model to obtain the following dialogue text.

[0074] The prompt words can be used to constrain the generated preceding dialogue text, such as the number, style, and question direction of the preceding dialogue texts to be generated by the model.

[0075] For example, the modal information is an image, and the descriptive text obtained for the image is: "The image shows a man and a dog playing frisbee on a lawn. The man is lying down, and the dog leaps up and hovers in the air, trying to catch the frisbee with its mouth. The man is wearing a red crew neck sweatshirt, blue shorts, and red sneakers. The dog is a black and white Border Collie. The frisbee is orange and located in the upper left corner of the image. The background is green grass and trees."

[0076] The prompt reads, "Please help me generate 10 questions and answers based on the content of this image. The questions should be diverse and natural, and the answers should be detailed and accurate. Do not ask or answer questions that are not included in the description of the image above."

[0077] The pre-trained language model can generate questions that meet the requirements based on the aforementioned descriptive text and prompt words.

[0078] Furthermore, to generate diverse training samples, the prompts can be adjusted to generate a variety of questions, such as adding or replacing question styles (e.g., brief, humorous, rigorous, etc.) and / or question directions (color, relationship between things, time, creation, etc.). Optionally, the prompts can be manually or automatically adjusted. Automatic adjustment can be achieved in several ways; for example, prompt templates can be set, and new prompts can be generated by replacing the question styles or directions in the template, or prompts can be automatically generated using a language model.
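
The template-based automatic adjustment mentioned above might look like the following sketch. The template wording, the `expand_prompts` name, and the `count` parameter are assumptions made for illustration; only the idea of substituting question styles and directions comes from the text.

```python
from itertools import product
from typing import List

# Hypothetical template; the wording loosely follows the example prompt above.
TEMPLATE = (
    "Please generate {count} {style} questions about this image, "
    "focusing on {direction}. Do not ask about anything that is not "
    "included in the description of the image."
)


def expand_prompts(styles: List[str], directions: List[str], count: int = 10) -> List[str]:
    """Fill the template with every combination of style and direction."""
    return [
        TEMPLATE.format(count=count, style=style, direction=direction)
        for style, direction in product(styles, directions)
    ]
```

For instance, two styles and two directions yield four distinct prompts, each of which can be paired with the same descriptive text.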

[0079] Optionally, when generating the following dialogue text, it can also be combined with prompt words. For example, the prompt words could be "The generated answer should be concise and accurate, with a word count of no more than 100 words", so that the model can generate the following dialogue text that meets the requirements based on the description text, the preceding dialogue text, and the prompt words.

[0080] By employing the above method, the text understanding capabilities of a pre-trained language model can be leveraged to automatically generate new preceding and following dialogue texts. These texts are then combined with the modal information to obtain rich, self-generated multimodal training samples. In this way, a large amount of multimodal dialogue data can be obtained from a small amount of basic data. This data can then be used for subsequent fine-tuning, enabling the model to process multimodal inputs.
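
The two-stage generation described in the preceding paragraphs (description text plus prompt words to obtain the preceding dialogue text, then description text plus preceding dialogue text to obtain the following dialogue text) can be sketched as below. The `LanguageModel` callable, the prompt layout, and the one-question-per-line parsing are assumptions for illustration and do not correspond to any particular model API.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any callable that maps a prompt string to generated text.
LanguageModel = Callable[[str], str]


def build_self_generated_samples(
    modal_ref: str,
    description: str,
    question_prompt: str,
    llm: LanguageModel,
    answer_prompt: str = "",
) -> List[Dict[str, str]]:
    """Generate (modal, preceding, following) samples from a description text."""
    # Stage 1: description text + prompt words -> preceding dialogue texts.
    raw = llm(f"{description}\n\n{question_prompt}")
    # Assumption: the model returns one question per line.
    questions = [line.strip() for line in raw.splitlines() if line.strip()]

    samples = []
    for question in questions:
        # Stage 2: description text + preceding text (+ optional prompt words)
        # -> following dialogue text.
        answer = llm(f"{description}\n\nQuestion: {question}\n{answer_prompt}").strip()
        samples.append({"modal": modal_ref, "preceding": question, "following": answer})
    return samples
```

The language model sees only the description text in both stages; the modal information itself is attached to each sample by reference, so that the sample can later be used to fine-tune the multimodal model.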

[0081] In addition to the aforementioned self-generated training samples, other types of training samples can be combined to fine-tune the model and further improve its performance.

[0082] Optionally, the method further includes: determining the preceding dialogue text and following dialogue text by manual annotation based on the modal information to obtain basic training samples, wherein there are multiple basic training samples and multiple self-generated training samples; and concatenating at least two original training samples to obtain concatenated training samples, wherein the original training samples are training samples arbitrarily selected from the multiple basic training samples and the multiple self-generated training samples.

[0083] The basic training samples will be explained below.

[0084] Optionally, any base training sample may include: modal information, dialogue context, and dialogue context. For example, the modal information may be an image, the dialogue context may be a question about the image, such as "What color shoes is the person in the picture wearing?", and the dialogue context may be the corresponding answer, such as "Red shoes".

[0085] Both the context and following text of the dialogue can be manually annotated. Optionally, the context and following text can be annotated by different people to simulate multimodal dialogue in real-world scenarios. Optionally, the descriptive text of the modal information can also be annotated by different people.

[0086] Figure 3 is an interaction diagram illustrating data annotation provided in an embodiment of this application. For example, as shown in Figure 3, the server can acquire a small amount of modal information via the network and perform annotation in conjunction with the terminal. The annotation can include two parts: dialogue annotation and fine-grained annotation. Dialogue annotation mainly annotates the dialogue context and following text based on the modal information, such as questions and answers. Fine-grained annotation uses language to describe in detail all content and attributes appearing in the modal information, resulting in descriptive text. Questions, answers, and descriptive text can all be annotated by different people.

[0087] Optionally, determining the dialogue context and following text through manual annotation based on the modal information to obtain basic training samples may include:

[0088] Sending a first display instruction to a first terminal, the first display instruction being used to display the modal information to a first annotator; obtaining the dialogue context annotated by the first annotator based on the modal information, as returned by the first terminal; sending a second display instruction to a second terminal, the second display instruction being used to display the modal information and the annotated dialogue context to a second annotator; and obtaining the dialogue following text annotated by the second annotator based on the modal information and the dialogue context, as returned by the second terminal.

[0089] Accordingly, determining the description text corresponding to the modal information may include: sending a third display instruction to a third terminal, the third display instruction being used to display the modal information to a third annotator; and obtaining the description text annotated by the third annotator based on the modal information returned by the third terminal.

[0090] Specifically, the first terminal, second terminal, and third terminal can be different terminals, and the first annotator, second annotator, and third annotator can be different annotators. Each terminal can display an annotation page to the corresponding annotator. The page can display the modal information and a window for inputting annotation content. After obtaining the annotation content from the annotator, the terminal can send the annotation content to the server, which then aggregates it to obtain the corresponding training samples.
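
The annotation flow above can be sketched as follows. This is only an illustration under stated assumptions: each terminal is modeled as a plain Python callable, and the names `collect_basic_sample`, `context`, `following_text`, and `description` are hypothetical and do not come from the application. The annotator outputs are toy data.

```python
def collect_basic_sample(modal_info, first_terminal, second_terminal, third_terminal):
    """Aggregate annotations from three terminals into one record.

    Each terminal is a hypothetical callable standing in for
    "send a display instruction, then receive the annotator's input".
    """
    # The first annotator sees only the modal information and writes the context.
    context = first_terminal(modal_info)
    # The second annotator sees the modal information plus the annotated context.
    following_text = second_terminal(modal_info, context)
    # The third annotator sees only the modal information and writes the description.
    description = third_terminal(modal_info)
    return {
        "modal_info": modal_info,
        "context": context,
        "following_text": following_text,
        "description": description,
    }


# Toy stand-ins for three different annotators.
record = collect_basic_sample(
    "image_001.jpg",
    lambda m: "What color shoes is the person in the picture wearing?",
    lambda m, q: "Red shoes",
    lambda m: "A person wearing red shoes stands on a street.",
)
```

Keeping the three callables separate mirrors the point that the question, the answer, and the descriptive text may come from different people.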

[0091] By sending modal information to different terminals and having different annotators annotate the dialogue context, the dialogue following text, and the descriptive text, multimodal dialogue in real-world scenarios can be effectively simulated. This improves the accuracy of the training samples and brings the fine-tuned model closer to actual needs, thus enhancing the model's dialogue performance.

[0092] In another alternative implementation, basic training samples for the description class can be directly constructed based on the modal information and the corresponding descriptive information. In the basic training samples of the description class, the preceding dialogue text is an instruction describing the modal information, and the following dialogue text is the descriptive information.

[0093] For example, a description-class basic training sample may include: modal information, such as an image; an instruction for describing the image, such as “Please describe this image in detail,” as the dialogue context; and descriptive text, such as “The image shows a man and a dog playing frisbee on a lawn…”, as the dialogue following text.

[0094] In this way, there is no need for manual annotation of the dialogue context and dialogue afterword. Only some fixed instructions that guide the model to describe the image are used directly as the dialogue context, and the aforementioned descriptive text is used directly as the dialogue afterword. Since the descriptive text is manually annotated, it has high accuracy. The training samples generated directly using the descriptive text can also be used as a type of basic training sample to fine-tune the model, improve the diversity of samples and the overall accuracy.
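
A description-class sample of this kind could be assembled as in the sketch below. The field names and the `build_description_sample` helper are hypothetical; the only instruction text used is the example given above.

```python
# Fixed instruction used as the dialogue context (taken from the example above).
DESCRIBE_INSTRUCTION = "Please describe this image in detail"


def build_description_sample(modal_info, descriptive_text, instruction=DESCRIBE_INSTRUCTION):
    """Build a description-class basic training sample without dialogue annotation."""
    return {
        "modal_info": [modal_info],
        "context": instruction,              # fixed instruction, not annotated
        "following_text": descriptive_text,  # manually annotated description
    }


sample = build_description_sample(
    "image_001.jpg",
    "The image shows a man and a dog playing frisbee on a lawn",
)
```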

[0095] The following explains the spliced training samples.

[0096] Optionally, there can be multiple base training samples and self-generated training samples. In practical applications, multiple modal information can be collected to construct multiple training samples. The same modal information can generate one base training sample / self-generated training sample, or multiple base training samples / self-generated training samples. For example, for the same image, different questions can be asked, resulting in different answers.

[0097] After obtaining multiple training samples, different training samples can be spliced together. For ease of description, in this embodiment, the training sample to be spliced is referred to as the original training sample. The original training sample can be a basic training sample or a self-generated training sample.

[0098] Optionally, multiple basic training samples and multiple self-generated training samples can be put into a sample pool, at least two original training samples can be randomly sampled from the sample pool, and the dialogue context to be used can be randomly determined. Indication information referring to the nth modal information can then be inserted before that dialogue context as part of the splicing operation.

[0099] Alternatively, the sample pool may contain only one type of training sample, for example, only basic training samples, with self-generated training samples not participating in the splicing operation.

[0100] After obtaining at least two original training samples, the at least two original training samples are concatenated to obtain concatenated training samples, which may include:

[0101] Modal information of each original training sample is extracted from the at least two original training samples to obtain at least two modal information, and the dialogue context is extracted from the first original training sample; wherein, the first original training sample is any one of the at least two original training samples;

[0102] Generate indication information, which is used to indicate the sequence number of the modal information corresponding to the extracted dialogue context in the at least two modal information;

[0103] Based on the extracted at least two modal information, the dialogue context, and the instruction information, a spliced training sample is generated.

[0104] The at least two original training samples may contain content of multiple modalities, such as both images and audio, or the at least two original training samples may contain content of the same modality, such as both images but not video and audio. This embodiment does not impose any restrictions on this.

[0105] In at least two original training samples, there are at least two modal information and at least two dialogue contexts. The at least two modal information can be randomly sorted and used as modal information in the concatenated training samples, and one dialogue context can be selected as the question in the concatenated training samples.

[0106] To accommodate questions about specific modal information, instruction information can be added to the concatenated training samples. This instruction information indicates which modal information the dialogue context refers to, for example by giving its position n counted from top to bottom.

[0107] In one optional implementation, generating a concatenated training sample based on the extracted at least two modal information, the dialogue context, and the instruction information may include: using the extracted at least two modal information as the modal information in the concatenated training sample; using the instruction information and the extracted dialogue context as the dialogue context in the concatenated training sample; and using the following text corresponding to the extracted dialogue context as the following text in the concatenated training sample.

[0108] For example, a splicing operation is performed on three original training samples, wherein original training sample 1 contains: image 1, question 1, and answer 1; original training sample 2 contains: image 2, question 2, and answer 2; and original training sample 3 contains: image 3, question 3, and answer 3.

[0109] Three images are extracted from three original training samples. These three images can be randomly ordered. At the same time, one question can be extracted from the three original training samples. This question can be any question from any of the original training samples. The indication information is used to indicate which image the question refers to.

[0110] For example, the stitched training sample may include: Image 1, Image 2, Image 3, instruction information, Question 3 (e.g., "What age is this product suitable for?"), and Answer 3 (e.g., "This product is suitable for preschool children"). The instruction information specifically refers to "the third image". In the stitched training sample, the modal information includes Image 1, Image 2, and Image 3. The dialogue context is the combination of the instruction information and the dialogue context extracted from the original training sample (Question 3), i.e., "What age is this product in the third image suitable for?". The following text is the following text corresponding to the extracted dialogue context, i.e., Answer 3.
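
The single-question splicing described above can be sketched as follows. This is an illustration under assumptions: samples are plain dictionaries with hypothetical keys `modal`, `context`, and `following`, and the instruction information is simplified to a bracketed prefix rather than being merged into the wording of the question as in the example.

```python
import random


def splice_single_turn(samples, rng=random):
    """Splice several original samples into one sample with a single question.

    samples: list of dicts with keys 'modal', 'context', 'following'.
    """
    if len(samples) < 2:
        raise ValueError("need at least two original training samples")
    # Randomly order the modal information of all original samples.
    order = list(range(len(samples)))
    rng.shuffle(order)
    # Pick the "first original training sample" whose question is kept.
    chosen = rng.choice(order)
    # 1-based sequence number of the chosen sample's modal information.
    position = order.index(chosen) + 1
    indication = f"[About image {position}] "  # simplified instruction information
    return {
        "modal": [samples[i]["modal"] for i in order],
        "context": indication + samples[chosen]["context"],
        "following": samples[chosen]["following"],
    }


originals = [
    {"modal": f"image_{i}", "context": f"question {i}", "following": f"answer {i}"}
    for i in (1, 2, 3)
]
spliced = splice_single_turn(originals, random.Random(0))
```

Because the sequence number is computed after shuffling, it always points at the image that the retained question was originally asked about.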

[0111] In another alternative implementation, the concatenated training samples are generated based on at least two extracted modal information, the dialogue context, and the instruction information. This can include:

[0112] Extract dialogue content from the remaining original training samples; wherein, the remaining original training samples are any one or more original training samples other than the first original training sample among the at least two original training samples, and the dialogue content includes the dialogue context and the dialogue following text.

[0113] The extracted modal information, dialogue content from the remaining original training samples, dialogue context from the first original training sample, and instruction information are concatenated to obtain the concatenated training sample.

[0114] In the spliced training samples, the dialogue context from different original training samples is randomly ordered, and / or the modal information is randomly ordered.

[0115] Specifically, the extracted at least two modal information can be used as the modal information of the concatenated training sample; the at least one set of dialogue content, the instruction information, and the dialogue context from the first original training sample can together be used as the dialogue context of the concatenated training sample; and the following text from the first original training sample can be used as the following text of the concatenated training sample.

[0116] Optionally, if dialogue context is extracted from at least two original training samples, the dialogue context from different original training samples can be randomly ordered in the concatenated training samples, or the modal information from different original training samples can be randomly ordered. In this way, for the same original training sample, the sequence number of its dialogue context among the at least two dialogue contexts and the sequence number of its modal information among the at least two modal information can differ.

[0117] For example, three images are extracted from three original training samples. The three images can be randomly ordered. At the same time, the last original training sample can be selected as the first original training sample. The questions and answers extracted from the other two original training samples are concatenated with the questions and indication information in the first original training sample. The indication information is used to indicate the sequence number of the image corresponding to the question in the first original training sample.

[0118] In the stitched training samples, both images and dialogues can be randomly ordered. For example, in the stitched training samples, the order of the three images is: image 1, image 2, image 3, and the order of the questions is: question 2, question 3, question 1.

[0119] Accordingly, the stitched training sample can sequentially include: Image 1, Image 2, Image 3, Question 2, Answer 2, Question 3, Answer 3, instruction information, Question 1 (e.g., "How is this product?"), and Answer 1 (e.g., "This product is very good"). The instruction information specifically refers to "Image 1". The instruction information and Question 1 together form "How is this product in Image 1?". Here, Image 1, Image 2, and Image 3 serve as the modal information of the stitched training sample; Question 2, Answer 2, Question 3, Answer 3, the instruction information, and Question 1 form the dialogue context of the stitched training sample; and Answer 1 forms the dialogue following text.
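
A multi-turn variant of the splicing can be sketched in the same way. As before, the dictionary keys and the bracketed form of the instruction information are assumptions made for illustration; the application does not prescribe a data layout.

```python
import random


def splice_multi_turn(samples, first_index, rng=random):
    """Splice samples into one multi-turn sample.

    samples: list of dicts with keys 'modal', 'context', 'following'.
    first_index: index of the sample whose question comes last and
    carries the instruction information.
    """
    if len(samples) < 2:
        raise ValueError("need at least two original training samples")
    # Images and dialogue turns are shuffled independently of each other.
    image_order = list(range(len(samples)))
    rng.shuffle(image_order)
    other_turns = [i for i in range(len(samples)) if i != first_index]
    rng.shuffle(other_turns)

    # Earlier turns keep both question and answer as part of the context.
    context = []
    for i in other_turns:
        context += [samples[i]["context"], samples[i]["following"]]
    position = image_order.index(first_index) + 1  # 1-based sequence number
    context.append(f"[About image {position}] " + samples[first_index]["context"])

    return {
        "modal": [samples[i]["modal"] for i in image_order],
        "context": context,
        "following": samples[first_index]["following"],
    }


originals = [
    {"modal": f"image_{i}", "context": f"question {i}", "following": f"answer {i}"}
    for i in (1, 2, 3)
]
spliced = splice_multi_turn(originals, first_index=0, rng=random.Random(1))
```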

[0120] In practical applications, users may ask questions out of order. For example, after sending three images, they might ask a question about any one of them, or they might ask about the second and third images first, and then ask about the first image. This embodiment uses a splicing method to generate spliced training samples after adding instruction information, simulating the situation of users asking questions out of order in real-world scenarios, thus improving the model's dialogue performance and adaptability.

[0121] Furthermore, during the stitching process, if there are multiple sets of dialogues, the order of the dialogues and the order of the images can be different. This allows for random sorting and combination of images and dialogues, further simulating the situation where users ask questions out of order in complex scenarios and improving the dialogue performance of the model.

[0122] The above describes the generation of training samples using images as an example. When the modal information is audio or video, the method of generating training samples is similar to that of images. For example, questions and answers can be asked and answered for videos and audio to obtain the corresponding dialogue context and dialogue afterword. Fine-grained descriptions can also be performed on videos and audio to obtain descriptive text for videos and audio, and dialogue context and dialogue afterword can be automatically generated based on the descriptive text. In addition, splicing operations can also be performed.

[0123] Step 203: Construct a multimodal dialogue model based on the language model, and fine-tune the multimodal dialogue model based on the self-generated training samples.

[0124] Optionally, there are several ways to construct a multimodal dialogue model based on the language model.

[0125] For example, different channels can be set up in a multimodal dialogue model. One channel is used to receive modal information, and the other channel is initialized based on a pre-trained language model to receive the dialogue context. The two channels process the modal information and the dialogue context respectively, and then fuse them in the final output layer.

[0126] For example, a multimodal encoder can be added before the language model to construct a multimodal dialogue model. The output of the multimodal encoder serves as part of the input to the language model. Optionally, different encoders can be set for images, audio, and video.

[0127] Figure 4 is a schematic diagram illustrating the principle of a multimodal dialogue model provided in an embodiment of this application. As shown in Figure 4, a multimodal dialogue model built using a pre-trained language model can include:

[0128] A multimodal encoder is used to acquire modal information and encode the modal information to obtain the element information (token) corresponding to the modal information;

[0129] A pre-trained language model is used to obtain the element information corresponding to the modal information and the element information corresponding to the dialogue context, and output the predicted dialogue following text.

[0130] Specifically, since the dialogue context is in text format, the corresponding token can be directly obtained and input into the language model for processing. However, modal information such as images cannot be directly input into the language model. Therefore, a multimodal encoder can be added to encode the images, obtaining the corresponding encoded information, which is equivalent to the image's token. This token is then concatenated with the token corresponding to the dialogue context before being input into the language model for processing. Separators can be inserted during concatenation for differentiation. This embodiment does not restrict the specific form of the input.
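
The concatenation of modal tokens and text tokens can be illustrated with the toy sketch below. The encoder here is a stand-in that merely averages pixel chunks, and the `<sep>` separator is a hypothetical token; the application leaves both the encoder and the input format open.

```python
SEP = "<sep>"  # hypothetical separator token; the source does not name one


def toy_encode_image(pixels, n_tokens=2):
    """Stand-in for a multimodal encoder: average equal-sized pixel chunks."""
    size = max(1, len(pixels) // n_tokens)
    chunks = [pixels[i:i + size] for i in range(0, len(pixels), size)][:n_tokens]
    return [sum(c) / len(c) for c in chunks]


def build_model_input(images, context_tokens):
    """Concatenate modal tokens and text tokens into one tagged sequence."""
    sequence = []
    for pixels in images:
        sequence += [("modal", t) for t in toy_encode_image(pixels)]
        sequence.append(("sep", SEP))  # separator after each image's tokens
    sequence += [("text", t) for t in context_tokens]
    return sequence


model_input = build_model_input([[0, 2, 4, 6]], ["What", "color", "?"])
```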

[0131] Optionally, fine-tuning the multimodal dialogue model based on the self-generated training samples may include: fine-tuning the multimodal dialogue model based on the basic training samples, the concatenated training samples, and the self-generated training samples.

[0132] Through the aforementioned steps, basic training samples, spliced training samples, and self-generated training samples can be constructed, and these training samples can be used together to fine-tune the model.

[0133] Optionally, the model fine-tuning process may include: inputting the modal information and dialogue context from the training samples into the model to obtain the predicted dialogue following text; calculating the loss value based on the predicted following text and the following text in the training samples; and adjusting the model parameters based on the loss value. The goal of fine-tuning is to make the following text predicted by the model as close as possible to the real following text in the training samples.
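
The loss calculation can be illustrated as below. The application does not name a loss function; token-level cross-entropy (average negative log-likelihood of the reference following text) is assumed here because it is a common choice for language model fine-tuning. The function name and the toy probabilities are hypothetical.

```python
import math


def following_text_loss(token_probs, target_ids):
    """Average negative log-likelihood of the reference following-text tokens.

    token_probs[t] is the model's probability distribution over the
    vocabulary at position t; target_ids[t] is the reference token id.
    """
    if len(token_probs) != len(target_ids):
        raise ValueError("one distribution is needed per target token")
    nll = [-math.log(token_probs[t][tid]) for t, tid in enumerate(target_ids)]
    return sum(nll) / len(nll)


# Toy example with a two-token vocabulary and a two-token answer.
loss = following_text_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

A lower value means the predicted following text is closer to the reference; parameter updates would be driven by the gradient of this value in a real training framework.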

[0134] In this embodiment, the basic training samples are manually labeled, highly accurate training samples. Supervised fine-tuning of the model can improve its accuracy. The concatenated training samples can expand the training sample pool and handle various complex, out-of-order dialogue scenarios. Self-generated training samples utilize the language capabilities of a pre-trained language model, combined with fine-grained descriptions to generate more training samples, enabling automatic expansion of the training sample pool and obtaining a richer pool of training samples with a limited number of labels. The combined use of various types of training samples can effectively improve the model's fine-tuning effect, enhancing its efficiency and applicability while maintaining a certain level of accuracy.

[0135] Optionally, in practical applications, to improve fine-tuning efficiency, one or two types of training samples can be selected for fine-tuning. For example, the model can be fine-tuned based solely on the base training samples and self-generated training samples. Of course, in addition to the training samples mentioned above, other training samples can be added to fine-tune the model; this embodiment does not impose any restrictions on this.

[0136] Step 204: Using the fine-tuned multimodal dialogue model, generate the dialogue context and dialogue afterword based on the modal information, and continue to fine-tune the multimodal dialogue model; wherein, the fine-tuned multimodal dialogue model is used to determine the dialogue afterword based on the modal information and the dialogue context.

[0137] Optionally, generating the dialogue context and dialogue follow-up based on modal information using a fine-tuned multimodal dialogue model can include: inputting the modal information and prompts into the fine-tuned multimodal dialogue model to obtain the dialogue context; and inputting the modal information and the obtained dialogue context into the fine-tuned multimodal dialogue model to obtain the dialogue follow-up. The prompts used here can be the same as in the previous steps, except that the descriptive text input to the model is replaced with the modal information.

[0138] Specifically, in step 202, a dialogue is generated based on fine-grained descriptive text and prompts. After the fine-tuning process in step 203, the model has the ability to process modal information and can input modal information into the model to generate dialogue.

[0139] The modal information used in step 204 can be the same as or different from the modal information in the previous steps. Even if unlabeled modal information is input into the model, the fine-tuned model still has the ability to understand and process it.

[0140] In this way, by first generating dialogue content based on fine-grained descriptive text, and then combining the dialogue content with the original modal information, the model is fine-tuned. This fine-tuned model then has the ability to process modal information, and can be directly used to generate more dialogue content based on modal information. This allows the model to leverage its understanding of modal information to generate richer and more diverse training samples, providing sample support for subsequent iterative fine-tuning.

[0141] After obtaining the dialogue content generated based on modal information, the fine-tuned multimodal dialogue model can be further fine-tuned based on this content. Following fine-tuning, new dialogue context and following text can be generated from the modal information using the fine-tuned multimodal dialogue model, and the model can then be fine-tuned again based on this new content. The process of generating dialogue content and fine-tuning can be executed iteratively multiple times until a stopping condition is met.

[0142] Optionally, after obtaining the dialogue context and following text generated based on modal information, the generated dialogue context and following text can be filtered or optimized before fine-tuning the model.

[0143] There are several ways to filter and optimize. For example, the generated dialogue content can be input into a scoring model, and high-quality dialogue content can be selected based on the score. Alternatively, the descriptive text of the modal information can be used: the dialogue generated based on the descriptive text serves as a reference for determining whether the generated dialogue context and following text match, and the matching dialogue content is retained.
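
Score-based selection can be sketched as follows. The scoring function here is a deliberately simple stand-in (word overlap between the answer and the descriptive text), not the scoring model of the application; the names `select_high_quality` and `overlap_score` and the threshold are hypothetical.

```python
def select_high_quality(samples, score_fn, threshold):
    """Keep only samples whose score reaches the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


def overlap_score(sample):
    """Toy score: share of answer words that also occur in the descriptive text."""
    answer = sample["following_text"].lower().split()
    described = set(sample["description"].lower().split())
    if not answer:
        return 0.0
    return sum(w in described for w in answer) / len(answer)


description = "a man and a dog playing frisbee on a lawn"
candidates = [
    {"context": "What is the dog doing?", "following_text": "playing frisbee",
     "description": description},
    {"context": "What is the dog doing?", "following_text": "sleeping indoors",
     "description": description},
]
kept = select_high_quality(candidates, overlap_score, threshold=0.5)
```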

[0144] Alternatively, human feedback can be used to filter and rewrite the dialogue content generated based on modal information, and further fine-tune the model.

[0145] Optionally, generating the dialogue context and following text based on modal information using the fine-tuned multimodal dialogue model, and continuing to fine-tune the multimodal dialogue model, may include:

[0146] Using the fine-tuned multimodal dialogue model, generate the dialogue context and following text based on modal information to obtain training samples generated based on modal information;

[0147] Send a screening instruction to the screening terminal, the screening instruction being used to display at least one training sample generated based on modality information to the screening personnel;

[0148] Obtain the training samples returned by the screening terminal, the returned training samples being obtained after the screening personnel have screened and / or rewritten the at least one training sample generated based on modality information;

[0149] The multimodal dialogue model is fine-tuned based on the returned training samples;

[0150] Repeat the above process of generating training samples based on modal information, filtering, rewriting, and fine-tuning until the performance of the multimodal dialogue model meets the preset requirements.
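
The generate, screen, and fine-tune cycle can be sketched as a loop. All five callables are hypothetical placeholders for the real model, screening terminal, and evaluation; the `max_rounds` cap is an added safety assumption and is not part of the described method.

```python
def iterative_finetune(model, modal_pool, generate, screen, finetune,
                       meets_requirement, max_rounds=10):
    """Repeat generation, screening, and fine-tuning until the model is good enough."""
    rounds = 0
    while rounds < max_rounds and not meets_requirement(model):
        # Generate one training sample per piece of modal information.
        generated = [generate(model, modal) for modal in modal_pool]
        # Screening personnel select and / or rewrite the generated samples.
        returned = screen(generated)
        # Fine-tune on the returned samples to obtain the next model.
        model = finetune(model, returned)
        rounds += 1
    return model, rounds


# Toy run: the "model" is just an integer skill level.
final_model, rounds_used = iterative_finetune(
    model=0,
    modal_pool=["image_1", "image_2"],
    generate=lambda model, modal: {"modal": modal, "context": "q", "following": "a"},
    screen=lambda samples: samples[:1],
    finetune=lambda model, samples: model + len(samples),
    meets_requirement=lambda model: model >= 3,
)
```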

[0151] Specifically, the fine-tuned model can generate dialogue context and following text based on modal information. The modal information, dialogue context, and following text can be used as training samples generated based on modal information. These training samples can be manually selected and rewritten. For example, high-quality training samples can be selected from the at least one training sample, or at least some training samples can be rewritten so that the rewritten dialogue context and following text are more consistent with actual dialogue.

[0152] After the screening terminal returns the manually selected and rewritten training samples, the model can be fine-tuned based on these samples. This fine-tuning process can be repeated multiple times until the preset requirements are met. These preset requirements can be set according to actual needs.

[0153] In one alternative implementation, a training sample set and a test sample set can be constructed. The model is fine-tuned based on the training sample set. After fine-tuning, the fine-tuned model can be tested based on the test sample set. If the performance of the model is determined to meet the requirements based on the test sample set, the fine-tuning operation ends; otherwise, the fine-tuning continues.

[0154] In another alternative implementation, after sending the training samples generated based on modal information to the screening personnel, in addition to obtaining the training samples screened and rewritten by the screening personnel, it is also possible to obtain the evaluation information of the training samples generated based on modal information by the screening personnel. This evaluation information is used to evaluate the quality of the training samples generated by the model. If the quality of the newly generated training samples basically meets the requirements, fine-tuning can be considered to end; otherwise, fine-tuning can continue.

[0155] Alternatively, the quality of the generated training samples can be measured by the difference between the training samples returned by the screening personnel and the training samples generated by the model. If the difference before and after the screening is not significant, it means that the model can generate high-quality answers, and at this point, fine-tuning can be considered to end.
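
One way to quantify the difference before and after screening is a string similarity ratio, sketched below with the standard library's `difflib.SequenceMatcher`. The choice of this metric, the pairing of generated and returned answers by position, and the 0.9 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def mean_similarity(generated, returned):
    """Average similarity between generated answers and their rewritten versions."""
    if not generated or len(generated) != len(returned):
        raise ValueError("expected equally sized, non-empty lists of paired answers")
    ratios = [SequenceMatcher(None, g, r).ratio() for g, r in zip(generated, returned)]
    return sum(ratios) / len(ratios)


def can_stop_finetuning(generated, returned, threshold=0.9):
    """Stop when screening personnel barely changed the generated answers."""
    return mean_similarity(generated, returned) >= threshold


unchanged = can_stop_finetuning(["Red shoes"], ["Red shoes"])
rewritten = can_stop_finetuning(["abc"], ["xyz"])
```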

[0156] The above method allows for manual filtering of the dialogue content generated by the fine-tuned model, selecting high-quality questions and answers, and making necessary modifications to some answers to better meet expectations. The model is then fine-tuned based on the filtered and rewritten dialogues, resulting in a new model with better performance. This model can then be used for further sample generation, thus achieving a feedback and model update cycle to continuously optimize the model's performance.

[0157] Optionally, a fourth display instruction can also be output, which is used to display a modality list to the user, the modality list including: images, audio, and video; and to obtain any one or more modalities selected by the user from the modality list. Correspondingly, constructing a multimodal dialogue model based on the language model may include: adding encoders corresponding to the one or more modalities selected by the user to the language model to obtain the multimodal dialogue model.

[0158] In this embodiment, the user is allowed to select the modality to be trained. For example, the user can choose to train a model that can process images, a model that can process audio, or a model that can process both audio and images simultaneously. Accordingly, when constructing a multimodal dialogue model, an encoder for the corresponding modality can be added.
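
Selecting encoders according to the user's choice can be sketched as a small configuration builder. The configuration layout and the encoder names such as `image_encoder` are hypothetical placeholders.

```python
SUPPORTED_MODALITIES = ("image", "audio", "video")  # the modality list shown to the user


def build_model_config(selected):
    """Return a model configuration with one encoder per selected modality."""
    unknown = set(selected) - set(SUPPORTED_MODALITIES)
    if unknown:
        raise ValueError(f"unsupported modalities: {sorted(unknown)}")
    if not selected:
        raise ValueError("at least one modality must be selected")
    return {
        "language_model": "pretrained-lm",  # placeholder name
        "encoders": {m: f"{m}_encoder" for m in SUPPORTED_MODALITIES if m in selected},
    }


config = build_model_config(["audio", "image"])
```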

[0159] Specifically, a list of selectable modalities can be displayed to the user through the front-end interactive interface. The user selects one or more modalities in the interactive interface. For example, if images are selected, the modal information obtained during the training process can include the content of the image modality.

[0160] The above approach allows users to choose the processing capabilities of the multimodal dialogue model, enabling the trained model to be applicable to various practical needs, meet users' personalized requirements, and improve user experience.

[0161] In this embodiment of the application, various basic functions related to model training can be made available to users through the front-end interactive interface, which can be used to adjust the model training strategy, including but not limited to: model input, model output, model structure, model training process, etc.

[0162] For example, regarding model input, users can adjust the input format and modality, and add, modify, or filter input information in the training samples. Regarding model output, users can adjust the output format and modality, and modify or filter labels. Regarding model structure, users can adjust the specific structure and number of parameters of the model, and add, remove, or modify one or more modules in the model. Regarding the model training process, users can plan the training phases, select the dataset and input / output strategies used in each phase, the loss function, and the stopping conditions for training.

[0163] Optionally, the interactive interface can also display relevant information to assist users in making choices, such as the available options or detailed information about each option. Furthermore, the interface can show users various dynamic information during model training, such as intermediate training results, to facilitate timely strategy updates.

[0164] In summary, this embodiment can acquire modal information, determine the corresponding descriptive text, generate dialogue context and following text based on the descriptive text using a pre-trained language model to obtain self-generated training samples, construct a multimodal dialogue model based on the language model, fine-tune the multimodal dialogue model based on the self-generated training samples, and generate dialogue context and following text based on the modal information using the fine-tuned multimodal dialogue model to continue fine-tuning the multimodal dialogue model. This enables the model to process multimodal information and allows for iterative fine-tuning using the model's multimodal understanding capabilities. Only a small amount of initial annotation is required to complete the model training, reducing the time spent on manual annotation and improving overall training efficiency. Furthermore, generating training samples first using the descriptive text and then using the modal information itself can uncover rich detailed features in the modal information, improving the model's accuracy.

[0165] Based on the technical solutions provided in the above embodiments, after fine-tuning, the multimodal dialogue model can be further optimized by combining contrastive learning and reinforcement learning. Figure 5 is a flowchart illustrating model optimization provided in an embodiment of this application. As shown in Figure 5, after fine-tuning the model, the method may further include:

[0166] Step 501: Based on the fine-tuned multimodal dialogue model, construct a matching model and train the matching model using multiple training samples.

[0167] The matching model includes two branches. The input to the first branch is the modal information and the dialogue context, and the input to the second branch is the dialogue following text. The matching model is used to determine the degree of matching between the inputs of the two branches.

[0168] In this embodiment, the matching model can be implemented based on contrastive learning, with the aim of identifying the dialogue following text that best matches expectations for the given modal information and dialogue context.

[0169] Optionally, the two branches of the matching model can be initialized using the parameters of a fine-tuned multimodal dialogue model to improve the training performance of the matching model. Alternatively, the matching model can be initialized using other parameters.

[0170] As described above, a multimodal dialogue model can include a multimodal encoder and a language model. Specifically, the multimodal encoder can be used to acquire modal information and encode it to obtain the element information corresponding to the modal information; the language model can be used to acquire the element information corresponding to the modal information and the element information corresponding to the preceding dialogue text, and to output the predicted following dialogue text.

[0171] In this embodiment, optionally, the language model may include: a text processing module, used to obtain the element information corresponding to the modal information and the element information corresponding to the preceding dialogue text, and to output the corresponding embedding features; and an output layer, used to output the predicted following dialogue text based on the embedding features.

[0172] Accordingly, this step, based on the fine-tuned multimodal dialogue model, constructs a matching model and trains the matching model using multiple training samples, which may include:

[0173] Based on the parameters of the multimodal encoder and text processing module in the fine-tuned multimodal dialogue model, the first branch in the matching model is initialized, and based on the parameters of the text processing module in the fine-tuned multimodal dialogue model, the second branch in the matching model is initialized.

[0174] During training, the modal information and preceding dialogue text from any training sample are input into the first branch, and the following dialogue text from any training sample is input into the second branch. Based on the embedding features output by the two branches, the parameters of the matching model are adjusted under contrastive learning supervision.
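
As an illustration only, the branch initialization described above can be sketched as follows. The dictionary layout, the key names `multimodal_encoder`, `text_processing` and `output_layer`, and the function `init_matching_model` are hypothetical. Copying the parameters into each branch (rather than sharing them) is also an assumption; this application does not prescribe either choice.

```python
import copy


def init_matching_model(fine_tuned):
    """Build the two branches of a matching model from a fine-tuned dialogue model.

    `fine_tuned` is a hypothetical dict of parameter containers with the keys
    "multimodal_encoder", "text_processing" and "output_layer".
    """
    first_branch = {
        # First branch: multimodal encoder plus text processing module.
        "multimodal_encoder": copy.deepcopy(fine_tuned["multimodal_encoder"]),
        "text_processing": copy.deepcopy(fine_tuned["text_processing"]),
    }
    second_branch = {
        # Second branch: text processing module only.
        "text_processing": copy.deepcopy(fine_tuned["text_processing"]),
    }
    # The output layer is not copied: the branches emit embedding features, not text.
    return {"first": first_branch, "second": second_branch}


# Toy parameters standing in for real model weights.
fine_tuned = {
    "multimodal_encoder": {"w": [0.1, 0.2]},
    "text_processing": {"w": [0.3, 0.4]},
    "output_layer": {"w": [0.5]},
}
matching_model = init_matching_model(fine_tuned)
```

Because the branches hold deep copies in this sketch, later updates to the matching model leave the fine-tuned dialogue model's parameters untouched.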

[0175] Figure 6 is a schematic diagram illustrating the principle of contrastive learning provided in an embodiment of this application. As shown in Figure 6, the matching model includes two branches. The first branch includes a multimodal encoder plus a text processing module and receives the modal information and the preceding dialogue text. The second branch includes a text processing module and receives the following dialogue text. The two branches are supervised through contrastive learning.

[0176] Optionally, in the multimodal dialogue model, embedding features can be obtained autoregressively from the tokens corresponding to the modal information and the preceding dialogue text. In the matching model, a special token can be appended after the tokens corresponding to the modal information and the preceding dialogue text. The feature output at the position of the special token is used as the embedding feature of the modal information and the preceding dialogue text. This feature is then compared with the embedding feature output by the other branch of the matching model, and the matching model is trained under contrastive learning supervision.

[0177] The training objective is to bring the embedding feature of the modal information and preceding dialogue text as close as possible to the embedding feature of the following dialogue text from the same training sample, and to push it as far as possible from the embedding features of following dialogue texts from different training samples.

[0178] For example, a training sample may include an image, a question, and an answer. The image and question are input into the first branch of the matching model, and the answer is input into the second branch. If the content input into the two branches matches, that is, it belongs to the same training sample, the embedding features output by the two branches should be closer together. If the image and question do not match the answer, that is, they do not belong to the same training sample, the embedding features output by the two branches should be farther apart.

[0179] Optionally, the loss function for contrastive learning can be the InfoNCE loss, and the loss value can be calculated using the following formula:

[0180] $$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=1}^{K} \exp(q \cdot k_i / \tau)}$$

[0181] When the InfoNCE loss is applied to the embodiments of this application, the relevant parameters are as follows: K is the number of training samples; during training, q is the embedding feature corresponding to the input of the first branch (the modal information and the preceding dialogue text); k_+ is the embedding feature corresponding to the following dialogue text that matches that input; k_i is the embedding feature corresponding to the following dialogue text in the i-th training sample; and τ is the temperature coefficient.
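
Using the parameters listed above (q, k+, k_i, τ and K), the commonly used form of the InfoNCE loss can be sketched in plain Python. This is a minimal sketch of the standard formulation, not necessarily the exact variant used in this application. The function names, the use of a dot product as the similarity, and the toy vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(q, keys, positive_index, temperature=0.07):
    """InfoNCE loss for one first-branch embedding q against K second-branch embeddings.

    keys[positive_index] plays the role of k+; every entry of keys is one k_i.
    """
    logits = [dot(q, k) / temperature for k in keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[positive_index] - log_denominator)


# Toy embeddings (assumed values): keys[0] matches q, keys[2] points the opposite way.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_match = info_nce_loss(q, keys, positive_index=0, temperature=0.5)
loss_mismatch = info_nce_loss(q, keys, positive_index=2, temperature=0.5)
```

The loss is small when q is closest to its matching key and large when the designated positive is the farthest key, which is the behavior the training objective relies on.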

[0182] The trained matching model can be used to determine the degree of matching between the inputs of the two branches. Specifically, the closer the embedding features output by the two branches are, the higher the degree of matching; conversely, the farther apart the embedding features are, the lower the degree of matching.

[0183] By comparing with existing labeled data, the matching model can learn and adjust based on the difference between the predicted results and the actual labels, thereby improving the accuracy of the matching model and enhancing the effectiveness of subsequent optimization of the multimodal dialogue model based on the matching model.

[0184] The training samples used when training the matching model can be the same as those used when fine-tuning the multimodal dialogue model, or they can be different.

[0185] Alternatively, in addition to the aforementioned training samples, a massive amount of image and text data can also be used to train the matching model.

[0186] Specifically, multiple image-text pairs can be crawled from the web. Each image-text pair can include an image and its corresponding descriptive information, such as an image obtained from the web and its corresponding title. The image and the descriptive information are then input into the two branches of the matching model for contrastive learning.

[0187] Step 502: Use the fine-tuned multimodal dialogue model as the policy model for reinforcement learning, and determine the reward information in reinforcement learning based on the matching model. Optimize the multimodal dialogue model and the matching model through proximal policy optimization.

[0188] In this embodiment, the reinforcement learning used can be policy model-based reinforcement learning. In reinforcement learning, state information is input into the policy model to obtain action information, and the corresponding reward information is determined. The policy model is then iteratively optimized based on the reward information.

[0189] Optionally, this step may specifically include:

[0190] The modal information and the preceding dialogue text are input into the multimodal dialogue model to obtain at least one predicted following dialogue text;

[0191] For any predicted following dialogue text, the modal information and the preceding dialogue text are input into the first branch of the matching model, and the predicted following dialogue text is input into the second branch of the matching model to obtain the matching degree, which is used as the reward information corresponding to that predicted following dialogue text.

[0192] Based on the reward information corresponding to each predicted following dialogue text, the multimodal dialogue model and the matching model are optimized through proximal policy optimization.

[0193] In this embodiment, the policy model for reinforcement learning can be the multimodal dialogue model. The modal information and the preceding dialogue text can be used as the state information. After they are input into the policy model, the resulting predicted following dialogue text can be used as the action information. There can be one or more predicted following dialogue texts. When there are several, each predicted following dialogue text can have its own reward information. This reward information can be a score determined by the matching model, which represents the degree of matching between the action information (the predicted following dialogue text) and the state information (the modal information and the preceding dialogue text).
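
The mapping from state, action and reward described above can be illustrated with a small sketch. It assumes the two branches of the matching model have already produced embedding features (the vectors below are made up), and it assumes cosine similarity as the matching degree. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def rewards_for_candidates(state_embedding, candidate_embeddings):
    """One reward per predicted following dialogue text.

    state_embedding: first-branch output (modal information + preceding dialogue text).
    candidate_embeddings: second-branch outputs, one per predicted following text.
    """
    return [cosine_similarity(state_embedding, c) for c in candidate_embeddings]


# Assumed toy embeddings: the first candidate matches the state, the last contradicts it.
state = [0.6, 0.8]
candidates = [[0.6, 0.8], [0.8, -0.6], [-0.6, -0.8]]
rewards = rewards_for_candidates(state, candidates)
best = max(range(len(rewards)), key=rewards.__getitem__)
```

Each candidate receives a single scalar score for the whole response, which is then available to the policy update as its reward.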

[0194] After obtaining the reward information, the policy model (i.e., the multimodal dialogue model) and the matching model can be optimized using Proximal Policy Optimization (PPO). PPO is a reinforcement learning algorithm that aims to maximize the expected cumulative reward by optimizing the model's policy.
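
For reference, the clipped surrogate objective commonly associated with PPO can be sketched in a few lines. This is the standard textbook form, not necessarily the exact variant used in this application. The function name, the clipping range of 0.2, and the use of "reward minus mean reward" as the advantage are assumptions made for illustration. Only the policy-side objective is shown; the update of the matching model is omitted.

```python
import math


def ppo_clipped_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over sampled responses."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Assumed rewards for three predicted responses, e.g. matching-model scores.
rewards = [1.0, 0.0, -1.0]
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]

# Assumed log-probabilities of each response under the old and updated policy.
old_logprobs = [math.log(0.2)] * 3
new_logprobs = [math.log(0.3), math.log(0.2), math.log(0.1)]
objective = ppo_clipped_objective(new_logprobs, old_logprobs, advantages)
```

Clipping limits how much credit a single update can claim for raising the probability of a well-scored response or lowering that of a poorly scored one, which keeps each policy step close to the previous policy.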

[0195] Specifically, PPO training can proceed over multiple iterations. In each iteration, multiple predicted following dialogue texts can be obtained from the same state information, with higher-quality predictions receiving higher scores. The parameters of the multimodal dialogue model and the matching model can then be updated based on the scores, so that the two models predict and score more accurately.

[0196] Through repeated iterative optimization, the multimodal dialogue model and the matching model gradually learn more accurate and realistic prediction and scoring strategies, improving the accuracy and robustness of the models. Furthermore, contrastive learning and reinforcement learning supervise the global features of the input information. Compared with the token-by-token prediction used in the fine-tuning stage of the multimodal dialogue model, this better achieves globally consistent rewards or penalties, further improving the model's performance and reducing hallucinations.

[0197] This application also provides a method for generating a multimodal dialogue model, which integrates the fine-tuning and optimization processes of the above-mentioned stages. Figure 7 is a schematic diagram illustrating the principle of a method for generating a multimodal dialogue model provided in an embodiment of this application. As shown in Figure 7, in the first stage, basic data construction and data augmentation can be carried out, and the model can be fine-tuned.

[0198] Specifically, multimodal information is acquired from the network and basic training samples are obtained through manual annotation. The training samples are then expanded using a pre-trained language model, such as a large language model (LLM), together with a concatenation strategy, and the resulting training samples are used for supervised fine-tuning of the multimodal dialogue model. The multimodal dialogue model can include a multimodal encoder and a language model.

[0199] In the second stage, the model can be fine-tuned iteratively through human feedback and self-instruction.

[0200] Specifically, a fine-tuned multimodal dialogue model can be used to input a large amount of modal data to generate corresponding dialogue content as training samples. Poor dialogue content can be manually filtered out, and some dialogue content can be rewritten. The filtered and rewritten training samples can be used to fine-tune the multimodal dialogue model, forming a flywheel of feedback and model updates to continuously optimize the model.

[0201] In the third stage, in order to overcome the hallucination phenomenon that may occur in the model, the multimodal dialogue model can be optimized based on the matching model and reinforcement learning to help the multimodal dialogue model predict answers that are more in line with expectations and more in line with the input information.

[0202] Specifically, the two branches of the matching model can be constructed from a multimodal encoder and a language model (specifically, the text processing module within the language model that is used to obtain embedding features). In contrastive learning, the degree of matching between the inputs of the two branches of the matching model can be determined. The input to one branch is the modal information and the preceding dialogue text, while the input to the other branch is the following dialogue text. In reinforcement learning, the modal information and the preceding dialogue text can be input into the multimodal dialogue model to obtain multiple predicted following dialogue texts. These are then scored by the matching model, and the parameters of both the multimodal dialogue model and the matching model are updated according to the scores.

[0203] This embodiment enables fine-tuning of multimodal dialogue models while fully leveraging the textual capabilities of language models. Combined with basic data and iterative human feedback, the fine-tuned model possesses multimodal dialogue and comprehension abilities. The human-feedback-based self-instruction iteration identifies model shortcomings for improvement, eliminating the need for extensive manual data labeling and thus lowering the overall cost of producing fine-tuning data. Building a flywheel of model and data iteration allows for more targeted model updates. Furthermore, this embodiment does not require additional models to assist in understanding multimodal information, resulting in a simpler system that reduces computational resources while better adapting to multimodal tasks. At the same time, the contrastive-learning matching model combined with reinforcement learning encourages accurate model outputs, reduces model speculation, and generates more relevant outputs.

[0204] The multimodal dialogue model provided in this application can be applied to various scenarios, such as e-commerce, meetings, and customer service. Figure 8 is a flowchart of another method for generating a multimodal dialogue model provided in an embodiment of this application, showing a specific training method for a multimodal dialogue model in an e-commerce scenario. As shown in Figure 8, the method includes:

[0205] Step 801: Obtain the modal information of the product, wherein the modal information includes at least one of the following: images, videos, and audio of the product.

[0206] The product can be any product on the e-commerce platform, and the product's modal information can be images, videos, audio, etc. of the product displayed on the e-commerce platform.

[0207] Step 802: Determine the description text corresponding to the product by extracting text from the product details page and / or by manual annotation. Based on the description text, generate the preceding and following dialogue text using a pre-trained language model to obtain self-generated training samples. The self-generated training samples include: the modal information, the preceding dialogue text, and the following dialogue text.

[0208] In this embodiment, the descriptive text corresponding to the modal information can be generated through manual annotation or extracted directly from the product details page. For example, the product details page may include images, videos, audio, etc. of the product, as well as text used to introduce the product. Using this text as the descriptive text corresponding to the modal information, or using this text after manual optimization as the corresponding descriptive text, can effectively improve the efficiency of determining the descriptive text and help generate richer and more comprehensive descriptive text.

[0209] Step 803: Construct a multimodal dialogue model based on the language model, and fine-tune the multimodal dialogue model based on the self-generated training samples.

[0210] Step 804: Using the fine-tuned multimodal dialogue model, generate the preceding and following dialogue text based on the modal information, and continue to fine-tune the multimodal dialogue model; wherein the fine-tuned multimodal dialogue model is used to determine the following dialogue text based on the modal information of the product and the preceding dialogue text.

[0211] The specific implementation principle and process of the method in this embodiment can be found in the foregoing embodiments, and will not be repeated here.

[0212] In summary, the multimodal dialogue model generation method provided in this embodiment can fine-tune the model based on the multimodal information of the product, enabling the model to process multimodal information. Furthermore, iterative fine-tuning can be performed using the model's multimodal understanding capabilities. Only a small amount of initial annotation is required to complete model training, reducing the time spent on manual annotation and improving overall training efficiency. Moreover, by first generating training samples using descriptive text and then using the modal information itself, rich detailed features within the product modal information can be extracted, improving the model's accuracy, meeting the needs of intelligent customer service dialogue in e-commerce scenarios, and enhancing user experience.

[0213] Referring to the above method, this application also provides a method for generating a multimodal model, which can be applied to dialogue or any other domain, for example downstream tasks such as translation, summarization, automatic question answering, and chatbots. Figure 9 is a flowchart of a method for generating a multimodal model according to an embodiment of this application. As shown in Figure 9, the method includes:

[0214] Step 901: Obtain modal information, wherein the modal information includes content of any one or more modalities other than text.

[0215] Step 902: Determine the description text corresponding to the modality information. Based on the description text, generate input information and output information through a pre-trained base model to obtain self-generated training samples. The self-generated training samples include: the modality information, the input information, and the output information.

[0216] Step 903: Construct a multimodal model based on the base model, and fine-tune the multimodal model based on the self-generated training samples.

[0217] Step 904: Using the fine-tuned multimodal model, generate input and output information based on modal information, and continue to fine-tune the multimodal model; wherein, the fine-tuned multimodal model is used to determine the output information based on the modal information and input information.

[0218] The input information can be any information to be input into the base model, such as the preceding dialogue, the text to be translated, or requirement information. The output information can be any information output by the base model, such as the following dialogue, the translated text, or code information. The base model can be any pre-trained model, such as a language model used to obtain text output from text input, or other base models, such as a model to obtain an image from text.

[0219] For example, in the field of translation, the translated text can be obtained through a multimodal model based on modal information and the text to be translated. For instance, given an image or video, the text to be translated is the content contained in the image or video described in a first language. The text is translated into a second language to obtain the translated text. Combining images and videos during translation allows for reference to richer content in the images and videos, thereby improving the accuracy of the translation.

[0220] In the field of code generation, code information can be obtained through models based on modal information and requirement information. For example, given a chart, the requirement information is to perform corresponding analysis based on the chart, and the code information is the code generated based on the chart and requirement information. Thus, multiple modal information can be referenced during code generation to meet the needs of different scenarios.

[0221] Besides translation and code generation, it can also be applied to other fields. For example, the modal information can be video, the input information can be an instruction to generate corresponding text based on the video, and the text output information is the text obtained through the model. As another example, the modal information can be a chart, the input information can be a piece of code used to analyze the chart, the model can debug the code, and the resulting text output information can be the debugged code or the problems existing in the code. In practical applications, the input and output information of the model can be determined according to the specific application field of the model; this embodiment does not impose any limitations.

[0222] The base model can also be a model that generates other modal outputs from text, such as a model that generates an image from text, in which case the output information can be an image. Based on the base model, the capabilities of other modalities, such as audio, can be further superimposed, thus constructing a multimodal model that generates an image from audio and text.

[0223] Optionally, the method further includes:

[0224] Based on the modal information, the input and output information are determined through manual annotation to obtain basic training samples; wherein, there are multiple basic training samples and multiple self-generated training samples.

[0225] At least two original training samples are concatenated to obtain concatenated training samples; wherein the original training samples are training samples arbitrarily selected from multiple basic training samples and multiple self-generated training samples.

[0226] Accordingly, fine-tuning the multimodal model based on the self-generated training samples includes: fine-tuning the multimodal model based on the basic training samples, the concatenated training samples, and the self-generated training samples.

[0227] Optionally, at least two original training samples are concatenated to obtain concatenated training samples, including:

[0228] Modal information of each original training sample is extracted from the at least two original training samples to obtain at least two modal information, and input information is extracted from the first original training sample; wherein, the first original training sample is any one of the at least two original training samples;

[0229] Generate indication information, which is used to indicate the sequence number of the modal information corresponding to the extracted input information in the at least two modal information;

[0230] Based on the extracted modal information, the extracted input information, and the indication information, a concatenated training sample is generated.
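
The concatenation steps above can be sketched as follows. The dict layout, the function name `concatenate_samples`, and the wording of the indication text are hypothetical. The sketch also assumes that each original training sample carries a single piece of modal information and that the output information is taken from the same original sample as the input information, which the text above does not state explicitly.

```python
def concatenate_samples(originals, chosen):
    """Concatenate at least two original training samples into one.

    originals: list of dicts with keys "modal", "input" and "output".
    chosen: index of the original sample whose input information is kept.
    """
    if len(originals) < 2:
        raise ValueError("at least two original training samples are required")
    if not 0 <= chosen < len(originals):
        raise IndexError("chosen must index one of the original samples")
    # Extract the modal information of every original sample, preserving order.
    modal = [sample["modal"] for sample in originals]
    source = originals[chosen]
    return {
        "modal": modal,
        # Indication information: the sequence number (1-based) of the modal
        # information that the kept input information refers to.
        "indication": f"The input refers to item {chosen + 1} of {len(modal)}.",
        "input": source["input"],
        "output": source["output"],  # assumption: output follows the chosen sample
    }


a = {"modal": "cat.jpg", "input": "What animal is this?", "output": "A cat."}
b = {"modal": "car.jpg", "input": "What color is the car?", "output": "Red."}
concatenated = concatenate_samples([a, b], chosen=1)
```

A sample built this way requires the model to locate the relevant modal information among several before answering, which is the purpose of the indication information.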

[0231] Optionally, the step of determining the input and output information based on the modal information through manual annotation to obtain basic training samples includes:

[0232] Send a first display instruction to the first terminal, the first display instruction being used to display the modal information to the first annotator;

[0233] Obtain the input information returned by the first terminal, which was annotated by the first annotator based on the modal information;

[0234] Send a second display instruction to the second terminal, the second display instruction being used to display the modal information and the annotated input information to the second annotator;

[0235] Obtain the output information annotated by the second annotator based on the modal information and input information, returned by the second terminal;

[0236] The step of determining the description text corresponding to the modality information includes:

[0237] Send a third display instruction to a third terminal, the third display instruction being used to display the modal information to a third annotator;

[0238] Obtain the descriptive text annotated by the third annotator based on the modal information, returned by the third terminal.

[0239] Optionally, using the fine-tuned multimodal model, input and output information are generated based on modal information, and the multimodal model is further fine-tuned, including:

[0240] By using the fine-tuned multimodal model, input and output information are generated based on modal information to obtain training samples generated based on modal information;

[0241] Send a screening instruction to the screening terminal, the screening instruction being used to display at least one training sample generated based on modality information to the screening personnel;

[0242] The training sample returned by the screening terminal is obtained after the screening personnel have screened and / or rewritten the at least one training sample generated based on modality information;

[0243] The multimodal model is fine-tuned based on the returned training samples;

[0244] Repeat the above process of generating training samples based on modal information, filtering, rewriting, and fine-tuning until the performance of the multimodal model meets the preset requirements.
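
The generate, screen, and fine-tune loop above can be sketched as a small driver function. All names here are hypothetical, and the model, the screening step, and the evaluation are replaced by toy stand-ins so that the control flow can be run on its own. The `max_rounds` guard is an added assumption to keep the loop bounded.

```python
def iterative_fine_tune(model, modal_items, generate, review, fine_tune,
                        evaluate, target, max_rounds=10):
    """Repeat generation, screening/rewriting and fine-tuning until the target is met."""
    rounds = 0
    while evaluate(model) < target and rounds < max_rounds:
        # 1. Generate training samples from the modal information itself.
        candidates = [generate(model, item) for item in modal_items]
        # 2. Screening personnel filter and/or rewrite the candidates.
        kept = review(candidates)
        # 3. Fine-tune on the returned samples.
        model = fine_tune(model, kept)
        rounds += 1
    return model, rounds


# Toy stand-ins: the "model" is just an integer quality score.
def generate(model, item):
    return {"modal": item, "input": f"question about {item}",
            "output": f"answer (v{model})"}


def review(candidates):
    # Pretend the reviewer rejects samples built from an unusable image.
    return [c for c in candidates if c["modal"] != "blurry.jpg"]


def fine_tune(model, samples):
    return model + len(samples)


def evaluate(model):
    return model


final_model, rounds = iterative_fine_tune(
    0, ["a.jpg", "b.jpg", "blurry.jpg"], generate, review, fine_tune,
    evaluate, target=5)
```

In practice the preset requirement could be any performance measure; here it is reduced to a numeric threshold for illustration.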

[0245] Optionally, the step of generating input and output information based on the description text using a pre-trained base model includes:

[0246] Obtain prompt words, input the description text and prompt words into the pre-trained base model to obtain input information, and input the description text and the obtained input information into the base model to obtain output information;

[0247] The process of generating input and output information based on modal information using a fine-tuned multimodal model includes:

[0248] Modal information and prompt words are input into the fine-tuned multimodal model to obtain input information. The modal information and the obtained input information are then input into the fine-tuned multimodal model to obtain output information.
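
The two-step generation described above (first obtain the input information from a prompt, then obtain the output information from that input) has the same shape in both stages. The sketch below is illustrative only: `generate_pair` and `toy_model` are hypothetical, and `toy_model` merely stands in for a pre-trained base model (given description text) or a fine-tuned multimodal model (given modal information).

```python
def generate_pair(model, context, prompt):
    """Two calls to the same model: context + prompt -> input, context + input -> output.

    context is the description text (first stage) or the modal information
    (second stage); model is any callable taking (context, text).
    """
    input_info = model(context, prompt)
    output_info = model(context, input_info)
    return input_info, output_info


def toy_model(context, text):
    """Stand-in for a real model: asks a question when prompted, else answers."""
    if text.startswith("Ask"):
        return f"What does the following show: {context}?"
    return f"It shows {context}."


question, answer = generate_pair(
    toy_model, "a red ceramic mug", "Ask one question about the content.")
# The self-generated training sample pairs the result with the modal information.
sample = {"modal": "mug.jpg", "input": question, "output": answer}
```

Because the second call conditions on the generated input information, the output information stays tied to both the context and the question that was actually asked.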

[0249] Optionally, the method further includes:

[0250] Output a fourth display instruction, which is used to display a modal list to the user, the modal list including: images, audio, and video;

[0251] Obtain any one or more modalities selected by the user from the modal list;

[0252] Accordingly, constructing a multimodal model based on the base model includes:

[0253] Based on one or more modalities selected by the user, encoders corresponding to the one or more modalities are added to the base model to obtain the multimodal model.
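
A minimal sketch of attaching encoders for the user-selected modalities follows. The names `MODAL_LIST`, `build_multimodal_model` and `encoder_factories`, and the string placeholders used in place of real encoders, are assumptions made for illustration.

```python
MODAL_LIST = ("image", "audio", "video")  # modal list shown to the user


def build_multimodal_model(base_model, selected, encoder_factories):
    """Attach one encoder per user-selected modality to the base model."""
    unknown = [m for m in selected if m not in MODAL_LIST]
    if unknown:
        raise ValueError(f"unsupported modalities: {unknown}")
    # Only the selected modalities get an encoder.
    encoders = {m: encoder_factories[m]() for m in selected}
    return {"base_model": base_model, "encoders": encoders}


# Placeholder factories; real ones would construct actual encoder networks.
factories = {
    "image": lambda: "image-encoder",
    "audio": lambda: "audio-encoder",
    "video": lambda: "video-encoder",
}
model = build_multimodal_model("base-llm", ["image", "audio"], factories)
```

Encoders for modalities the user did not select are never constructed, so the resulting model carries only the components it needs.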

[0254] Optionally, the method further includes:

[0255] Based on the fine-tuned multimodal model, a matching model is constructed and trained using multiple training samples. The matching model includes two branches: the input of the first branch is modal information and input information, and the input of the second branch is output information. The matching model is used to determine the degree of matching between the inputs of the two branches.

[0256] The fine-tuned multimodal model is used as the policy model for reinforcement learning, and the reward information in reinforcement learning is determined based on the matching model. The multimodal model and the matching model are optimized through proximal policy optimization.

[0257] Optionally, multimodal models built from a pre-trained base model include:

[0258] A multimodal encoder is used to acquire modal information and encode the modal information to obtain the element information corresponding to the modal information;

[0259] The pre-trained base model includes: a text processing module, used to acquire element information corresponding to the modality information and element information corresponding to the input information, and output corresponding embedding features; and an output layer, used to output predicted output information based on the embedding features.

[0260] Accordingly, based on the fine-tuned multimodal model, a matching model is constructed, and the matching model is trained using multiple training samples, including:

[0261] Based on the parameters of the multimodal encoder and text processing module in the fine-tuned multimodal model, the first branch in the matching model is initialized, and based on the parameters of the text processing module in the fine-tuned multimodal model, the second branch in the matching model is initialized.

[0262] During training, the modal information and input information from any training sample are input into the first branch, and the output information from any training sample is input into the second branch. Based on the embedding features output by the two branches, the parameters of the matching model are adjusted under contrastive learning supervision.

[0263] Optionally, the fine-tuned multimodal model is used as the policy model for reinforcement learning, and the reward information in reinforcement learning is determined based on the matching model. The multimodal model and the matching model are optimized through proximal policy optimization, including:

[0264] Modal information and input information are input into a multimodal model to obtain at least one predicted output.

[0265] For any predicted output information, the modal information and input information are input into the first branch of the matching model, and the predicted output information is input into the second branch of the matching model to obtain the matching degree, which is used as the reward information corresponding to the predicted output information.

[0266] Based on the reward information corresponding to each predicted output, the multimodal model and the matching model are optimized through proximal policy optimization.

[0267] This application also provides a method for generating a multimodal translation model, applied in the field of translation, including:

[0268] Obtain modal information, which includes content of any one or more modalities other than text; determine the descriptive text corresponding to the modal information; and generate the text to be translated and the translated text based on the descriptive text using a pre-trained language model to obtain self-generated training samples, which include: the modal information, the text to be translated, and the translated text.

[0269] A multimodal translation model is constructed based on the language model, and the multimodal translation model is fine-tuned based on the self-generated training samples;

[0270] The fine-tuned multimodal translation model generates the text to be translated and the translated text based on modal information, and then continues to fine-tune the multimodal translation model; wherein, the fine-tuned multimodal translation model is used to determine the translated text based on the modal information and the text to be translated.

[0271] This application also provides a method for generating a multimodal code model, applied in the field of code generation, including:

[0272] Obtain modal information, which includes content of any one or more modalities other than text; determine the description text corresponding to the modal information; and generate requirement information and code information based on the description text using a pre-trained language model to obtain self-generated training samples, which include the modal information, the requirement information, and the code information.

[0273] A multimodal code model is constructed based on the language model, and the multimodal code model is fine-tuned based on the self-generated training samples;

[0274] The fine-tuned multimodal code model generates requirement information and code information based on modal information, and the multimodal code model is further fine-tuned; wherein, the fine-tuned multimodal code model is used to determine the code information based on the modal information and requirement information.

[0275] In each of these fields, the model can be trained according to the solutions provided in the foregoing embodiments, with the following substitutions: the multimodal dialogue model is replaced with the corresponding model, such as a multimodal model, a multimodal translation model, or a multimodal code model; the language model is replaced with the base model; the preceding dialogue text is replaced with the actual input of the model, such as input information, text to be translated, or requirement information; and the following dialogue text is replaced with the actual output of the model, such as output information, translated text, or code information.

[0276] This embodiment can acquire modal information, determine the descriptive text corresponding to the modal information, generate input and output information based on the descriptive text using a pre-trained base model to obtain self-generated training samples, construct a multimodal model based on the base model, fine-tune the multimodal model based on the self-generated training samples, and continue to fine-tune the multimodal model by generating input and output information based on the modal information using the fine-tuned multimodal model. This enables the model to process multimodal information and allows for iterative fine-tuning using the model's multimodal understanding capabilities. Only a small amount of initial annotation is required to complete the model training, reducing the time spent on manual annotation and improving overall training efficiency. Furthermore, by first generating training samples using the descriptive text and then using the modal information itself, rich detailed features in the modal information can be extracted, improving the model's accuracy.
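The two training stages summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the names `Sample`, `Generator`, `self_generate`, `train`, `describe` and `fine_tune` are hypothetical, modal information is represented by a plain string, and the actual fine-tuning procedure is supplied by the caller. The order of generation (input information from prompt words first, then output information from that input information) follows claim 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical text generator: (source, prompt_or_input) -> generated text.
Generator = Callable[[str, str], str]


@dataclass
class Sample:
    modal_info: str   # stand-in for an image, audio clip, or video
    input_info: str
    output_info: str


def self_generate(source: str, modal_info: str, model: Generator, prompt: str) -> Sample:
    """Generate input information first, then output information conditioned on it."""
    input_info = model(source, prompt)
    output_info = model(source, input_info)
    return Sample(modal_info, input_info, output_info)


def train(
    modal_items: Sequence[str],
    describe: Callable[[str], str],
    base_model: Generator,
    fine_tune: Callable[[Generator, List[Sample]], Generator],
    rounds: int,
    prompt: str = "Ask a question.",
) -> Generator:
    # Stage 1: the base model sees only the description text, not the modal information.
    samples = [self_generate(describe(m), m, base_model, prompt) for m in modal_items]
    model = fine_tune(base_model, samples)
    # Stage 2: the fine-tuned multimodal model generates from the modal information itself.
    for _ in range(rounds):
        samples = [self_generate(m, m, model, prompt) for m in modal_items]
        model = fine_tune(model, samples)
    return model
```

In this sketch only `describe` depends on annotation effort; every later round reuses the model's own multimodal understanding to produce new training samples.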

[0277] Figure 10 is a flowchart of a multimodal processing method provided in an embodiment of this application. As shown in Figure 10, the method may include:

[0278] Step 1001: Obtain information to be processed, which includes modal information and input information. The modal information includes content in any one or more modalities other than text.

[0279] Step 1002: Based on the multimodal model, determine the output information corresponding to the information to be processed.

[0280] The multimodal model is generated by the method described in any of the above embodiments. For example, it can be a multimodal dialogue model or a multimodal translation model.

[0281] Optionally, the method can be applied to a server, a terminal device, or can be executed jointly by a server and a terminal device.

[0282] The multimodal processing method provided in this embodiment can determine the output information during the interaction process based on the multimodal model. The multimodal model has the ability to process multimodal information and can be iteratively fine-tuned using the model's multimodal understanding capabilities. Only a small amount of initial annotation is required to complete the model training, reducing the time spent on manual annotation and improving the overall training efficiency. Furthermore, during the fine-tuning process, training samples are first generated using descriptive text and then using the modal information itself, which can uncover rich detailed features in the modal information and improve the accuracy of the model.

[0283] Optionally, in the dialogue domain, the input information and output information can be the dialogue context (the preceding text) and the following dialogue text, respectively. The multimodal model is specifically a multimodal dialogue model, including: a multimodal encoder for processing modal information to obtain element information corresponding to the modal information; and a language model for processing the element information corresponding to the modal information and the element information corresponding to the dialogue context. After the following dialogue text corresponding to the input information is determined based on the multimodal dialogue model, it can be displayed to the user, and the content of the user's next input can be obtained. If the user's next input is detected to still pertain to the aforementioned modal information, the following dialogue text for the next round is determined using the language model and the element information already obtained for the modal information, and is displayed to the user.

[0284] In this way, in multi-turn dialogue interaction, if a user's question in a certain turn is directed at an image from a previous turn, the element information corresponding to the image obtained in the previous turn and the element information of the question in the current turn can be directly concatenated and input into the language model for processing. This eliminates the step of inputting the image into a multimodal encoder for processing, effectively improving the processing efficiency in practical applications, increasing the response speed, and enhancing the user experience.
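The reuse of element information across turns can be sketched as follows. This is an illustrative assumption rather than the implementation of this application: `DialogueSession`, `modal_id`, `text_to_elements` and the integer-list representation of element information are hypothetical, and the encoder and language model are passed in as plain callables.

```python
from typing import Callable, Dict, List

Elements = List[int]  # stand-in for a sequence of element information


class DialogueSession:
    """Keeps the element information of modal inputs seen earlier in the dialogue."""

    def __init__(
        self,
        encoder: Callable[[bytes], Elements],
        text_to_elements: Callable[[str], Elements],
        language_model: Callable[[Elements], str],
    ) -> None:
        self.encoder = encoder
        self.text_to_elements = text_to_elements
        self.language_model = language_model
        self.encoder_calls = 0
        self._cache: Dict[str, Elements] = {}

    def reply(self, modal_id: str, modal_info: bytes, question: str) -> str:
        if modal_id not in self._cache:
            # First turn about this modal input: run the multimodal encoder once.
            self._cache[modal_id] = self.encoder(modal_info)
            self.encoder_calls += 1
        # Every turn: concatenate the stored elements with the question's elements.
        elements = self._cache[modal_id] + self.text_to_elements(question)
        return self.language_model(elements)
```

A second question about the same image therefore reaches the language model without another encoder pass, which is the source of the efficiency gain described above.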

[0285] This application also provides an application system for a multimodal dialogue model, the multimodal dialogue model comprising: a multimodal encoder for processing modal information to obtain element information corresponding to the modal information; and a language model for processing the element information corresponding to the modal information and the element information corresponding to the dialogue context.

[0286] The system includes: an edge device and a cloud device; the edge device is equipped with a language model, and the cloud device is equipped with a multimodal encoder and a language model.

[0287] The edge device is used to acquire information to be processed and send it to the cloud-side device. The information to be processed includes modal information and dialogue context. At least part of the content of the information to be processed is provided by the user.

[0288] The cloud-side device is used to obtain, according to the multimodal dialogue model, the element information corresponding to the modal information and the following dialogue text corresponding to the information to be processed, and send them to the edge device.

[0289] The edge device is also used to: display the following dialogue text to the user and obtain the content of the user's next round of input; if the content of the user's next round of input is still related to the aforementioned modal information, use the locally deployed language model and the element information corresponding to the modal information to determine the following dialogue text for the next round and display it to the user. Specifically, the element information corresponding to the modal information and the element information corresponding to the dialogue context (including the content of the next round of input) can be input into the language model to obtain the following dialogue text for the next round.

[0290] The multimodal dialogue model is generated by the method described in any of the foregoing embodiments.

[0291] In this embodiment, the system can adopt a traditional server architecture, where the cloud-side device and the terminal-side device can be a server and a user terminal, respectively. Alternatively, the system can adopt a distributed cloud architecture, where the cloud-side device and the terminal-side device can be a central cloud device and an edge cloud device, respectively. All of the above devices can execute the multimodal dialogue model generation method provided in any embodiment of this application.

[0292] In practical applications, the multimodal dialogue model can be deployed through edge-cloud collaboration to assist dialogue interaction. Specifically, the cloud-side device deploys a multimodal encoder and a language model. The multimodal encoder processes modal information, such as an image, to obtain the corresponding element information. The language model then processes the element information corresponding to the image together with the element information corresponding to the preceding dialogue text, such as a question, to obtain the following dialogue text, such as an answer.

[0293] The edge device acquires the element information corresponding to the image together with the answer, and displays the answer to the user. The edge device is equipped with a language model. After the user enters the next question, if that question still concerns the previously provided image, the edge device uses the locally deployed language model, the stored element information corresponding to the image, and the new question to determine the next answer, and displays it to the user.

[0294] It can be understood that the edge device can display the answer to the user either directly or by sending the answer to another terminal, which then displays it to the user.

[0295] The system provided in this embodiment deploys models on both the edge device and the cloud device. The edge device only deploys the language model. During the dialogue with the user, the step of inputting the image into the multimodal encoder for processing can be appropriately omitted according to the dialogue situation, reducing the number of interactions with the cloud device, effectively improving the processing efficiency in actual applications, increasing the response speed, and enhancing the user experience.
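The division of work between the two devices can be sketched as follows. This is a minimal sketch under stated assumptions: the class names `CloudDevice` and `EndDevice`, the `requests` counter, and the convention that passing `modal_info=None` means "the new input still concerns the earlier modal information" are all hypothetical; how that relevance is actually detected is not specified here.

```python
from typing import Callable, List, Optional, Tuple

Elements = List[int]  # stand-in for a sequence of element information


class CloudDevice:
    """Hosts the multimodal encoder and a language model."""

    def __init__(self, encoder: Callable[[str], Elements],
                 language_model: Callable[[Elements], int]) -> None:
        self.encoder = encoder
        self.language_model = language_model
        self.requests = 0  # number of round trips served

    def process(self, modal_info: str, context: Elements) -> Tuple[Elements, int]:
        self.requests += 1
        modal_elements = self.encoder(modal_info)
        answer = self.language_model(modal_elements + context)
        # Return the element information as well, so the caller can reuse it.
        return modal_elements, answer


class EndDevice:
    """Hosts only a language model and stores element information from the cloud."""

    def __init__(self, cloud: CloudDevice,
                 language_model: Callable[[Elements], int]) -> None:
        self.cloud = cloud
        self.language_model = language_model
        self.modal_elements: Optional[Elements] = None

    def ask(self, context: Elements, modal_info: Optional[str] = None) -> int:
        if modal_info is not None:
            # New modal information needs the encoder, which exists only in the cloud.
            self.modal_elements, answer = self.cloud.process(modal_info, context)
            return answer
        if self.modal_elements is None:
            raise ValueError("no modal information has been processed yet")
        # Follow-up about the same modal information: answer locally.
        return self.language_model(self.modal_elements + context)
```

Only the first question about an image costs a round trip to the cloud in this sketch; later questions about the same image are answered by the local language model.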

[0296] Corresponding to the above system, this application embodiment also provides a multimodal dialogue method applied to an edge device, including:

[0297] Acquire information to be processed and send it to the cloud-side device, where the information to be processed includes modal information and dialogue context, at least part of which is provided by the user, so that the cloud-side device obtains, based on the multimodal dialogue model, the element information corresponding to the modal information and the following dialogue text corresponding to the information to be processed, and sends them to the edge device; display the following dialogue text to the user and acquire the content of the user's next input; and, if the user's next input is still related to the aforementioned modal information, use the locally deployed language model and the element information corresponding to the modal information to determine the following dialogue text for the next round and display it to the user.

[0298] This application also provides a multimodal dialogue method applied to a cloud-side device, including:

[0299] Acquire information to be processed sent by the edge device, where the information to be processed includes modal information and dialogue context, at least part of which is provided by the user; based on the multimodal dialogue model, obtain the element information corresponding to the modal information and the following dialogue text corresponding to the information to be processed, and send them to the edge device, so that the edge device displays the following dialogue text to the user, acquires the content of the user's next input, and, if that input is still related to the aforementioned modal information, uses its locally deployed language model and the element information corresponding to the modal information to determine the following dialogue text for the next round and display it to the user.

[0300] In other alternative implementations, the language model can be deployed only on the cloud-side device and not on the edge device. In each round of dialogue, the following dialogue text is determined by the cloud-side device, which reduces the load on the edge device and makes the solution widely applicable.

[0301] In the technical solutions provided in the above embodiments, a model with text processing capabilities is used as the basic model, and based on the text processing capabilities, processing capabilities for other modalities are further extended. Referring to the above principles, multimodal models can be constructed based on basic models of other modalities.

[0302] Specifically, embodiments of this application also provide a method for generating a multimodal model, which constructs a multimodal model through a pre-trained base model, wherein the pre-trained base model has the ability to obtain information of a second modality based on information of a first modality; the method for generating the multimodal model may include:

[0303] The modality information is acquired, which includes content from one or more modalities other than the first modality; corresponding descriptive information is obtained based on the modality information, which is information from the first modality; based on the descriptive information, input information and output information are generated through a pre-trained base model to obtain self-generated training samples, which include: the modality information, the input information, and the output information; the input information includes information from the first modality, and the output information includes information from the second modality;

[0304] A multimodal model is constructed based on the base model, and the multimodal model is fine-tuned based on the self-generated training samples;

[0305] The fine-tuned multimodal model generates input and output information based on modal information, and is then further fine-tuned. The fine-tuned multimodal model is used to determine the output information based on the modal information and the input information.

[0306] The first modality and the second modality can be the same or different.

[0307] For example, the first modality is audio and the second modality is image. The pre-trained base model can obtain the corresponding image based on the input audio, that is, it has the ability to generate images based on audio. Based on the base model, the capabilities of other modalities, such as text, can be further superimposed, thus constructing a multimodal model that generates images based on audio and text.

[0308] Specifically, modal information can be obtained, which includes text; corresponding descriptive information is obtained from the text, which is the information of the first modality that the basic model can originally process, that is, the descriptive audio corresponding to the text; and then, based on the descriptive audio, input information and output information (corresponding to audio and image respectively) are generated through the pre-trained basic model to obtain self-generated training samples, that is, the self-generated training samples include: text, input audio and output image.

[0309] A multimodal model is constructed based on the base model, and the multimodal model is fine-tuned based on the self-generated training samples, so that the multimodal model has the ability to obtain images based on text and audio; the fine-tuned multimodal model is iterated multiple times, with each iteration generating input audio and output image based on text, and the multimodal model is fine-tuned again.

[0310] This embodiment differs from the previous embodiments in that the native input of the base model is the first modality rather than text. The specific implementation principle, process, and effect can be found in the previous embodiments and will not be repeated here.

[0311] Corresponding to the methods described above, this application also provides corresponding apparatus for performing the methods described in any of the foregoing embodiments.

[0312] In one example, this application embodiment provides a multimodal model generation apparatus, comprising:

[0313] The first acquisition module is used to acquire modal information, wherein the modal information includes content of any one or more modalities other than text;

[0314] The first determining module is used to determine the description text corresponding to the modality information, and generate input information and output information based on the description text through a pre-trained base model to obtain self-generated training samples. The self-generated training samples include: the modality information, the input information and the output information.

[0315] The first fine-tuning module is used to construct a multimodal model based on the base model and to fine-tune the multimodal model based on the self-generated training samples.

[0316] The second fine-tuning module is used to generate input and output information based on modal information using the fine-tuned multimodal model, and to further fine-tune the multimodal model; wherein, the fine-tuned multimodal model is used to determine the output information based on the modal information and the input information.

[0317] In another example, embodiments of this application provide an apparatus for generating a multimodal dialogue model, comprising:

[0318] The second acquisition module is used to acquire modal information, which includes content of any one or more modalities other than text.

[0319] The second determining module is used to determine the description text corresponding to the modal information, and to generate the dialogue context (the preceding text) and the following dialogue text based on the description text through a pre-trained language model to obtain self-generated training samples. The self-generated training samples include: the modal information, the dialogue context, and the following dialogue text.

[0320] The third fine-tuning module is used to construct a multimodal dialogue model based on the language model and to fine-tune the multimodal dialogue model based on the self-generated training samples.

[0321] The fourth fine-tuning module is used to generate the dialogue context (the preceding text) and the following dialogue text based on modal information using the fine-tuned multimodal dialogue model, and to further fine-tune the multimodal dialogue model; wherein the fine-tuned multimodal dialogue model is used to determine the following dialogue text based on the modal information and the dialogue context.

[0322] In yet another example, embodiments of this application provide a multimodal processing apparatus, comprising:

[0323] A module used to obtain information to be processed, which includes modal information and input information, wherein the modal information contains content of any one or more modalities other than text;

[0324] A module used to determine, based on the multimodal model, the output information corresponding to the information to be processed;

[0325] The multimodal model is generated by the method described in any of the foregoing embodiments.

[0326] The devices provided in the embodiments of this application can be used to execute the technical solutions of the aforementioned corresponding embodiments. Their implementation principles and technical effects are similar, and will not be repeated here.

[0327] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 11, the electronic device in this embodiment may include:

[0328] At least one processor 1101; and

[0329] Memory 1102 communicatively connected to the at least one processor;

[0330] The memory 1102 stores instructions that can be executed by the at least one processor 1101, which, when executed by the at least one processor 1101, cause the electronic device to perform the method described in any of the above embodiments.

[0331] Optionally, the memory 1102 can be either standalone or integrated with the processor 1101.

[0332] The implementation principle and technical effects of the electronic device provided in this embodiment can be found in the foregoing embodiments, and will not be repeated here.

[0333] This application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the method described in any of the foregoing embodiments.

[0334] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the methods described in any of the foregoing embodiments.

[0335] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.

[0336] The integrated modules implemented as software functional modules described above can be stored in a computer-readable storage medium. These software functional modules, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of this application.

[0337] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor. The memory may include high-speed RAM, and may also include non-volatile memory (NVM), such as at least one disk storage device, and may also be a USB flash drive, external hard drive, read-only memory, disk, or optical disc, etc.

[0338] The aforementioned storage medium can be implemented from any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The storage medium can be any available medium accessible to general-purpose or special-purpose computers.

[0339] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor. Both the processor and the storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and storage medium can exist as discrete components in an electronic device or host device.

[0340] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0341] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0342] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0343] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A method for generating a multimodal model, characterized in that the method includes: obtaining modal information, wherein the modal information includes content of any one or more modalities other than text; determining the description text corresponding to the modal information, obtaining prompt words, inputting the description text and the prompt words into a pre-trained base model to obtain input information, and inputting the description text and the obtained input information into the base model to obtain output information, thereby obtaining self-generated training samples, which include the modal information, the input information, and the output information; constructing a multimodal model based on the base model, and fine-tuning the multimodal model based on the self-generated training samples; and inputting modal information and prompt words into the fine-tuned multimodal model to obtain input information, inputting the modal information and the obtained input information into the fine-tuned multimodal model to obtain output information, and continuing to fine-tune the multimodal model; wherein the fine-tuned multimodal model is used to determine the output information based on the modal information and the input information.

2. The method according to claim 1, characterized in that the method further includes: determining, based on the modal information, the input information and output information through manual annotation to obtain basic training samples, wherein there are multiple basic training samples and multiple self-generated training samples; and concatenating at least two original training samples to obtain concatenated training samples, wherein the original training samples are training samples arbitrarily selected from the multiple basic training samples and the multiple self-generated training samples; accordingly, fine-tuning the multimodal model based on the self-generated training samples includes: fine-tuning the multimodal model based on the basic training samples, the concatenated training samples, and the self-generated training samples.

3. The method according to claim 2, characterized in that concatenating at least two original training samples to obtain concatenated training samples includes: extracting the modal information of each original training sample from the at least two original training samples to obtain at least two pieces of modal information, and extracting input information from a first original training sample, wherein the first original training sample is any one of the at least two original training samples; generating indication information, which indicates the sequence number, among the at least two pieces of modal information, of the modal information corresponding to the extracted input information; and generating a concatenated training sample based on the extracted modal information, the input information, and the indication information.

4. The method according to claim 2, characterized in that determining the input information and output information based on the modal information through manual annotation to obtain basic training samples includes: sending a first display instruction to a first terminal, the first display instruction being used to display the modal information to a first annotator; obtaining the input information returned by the first terminal, which was annotated by the first annotator based on the modal information; sending a second display instruction to a second terminal, the second display instruction being used to display the modal information and the annotated input information to a second annotator; and obtaining the output information returned by the second terminal, which was annotated by the second annotator based on the modal information and the input information; and determining the description text corresponding to the modal information includes: sending a third display instruction to a third terminal, the third display instruction being used to display the modal information to a third annotator; and obtaining the description text returned by the third terminal, which was annotated by the third annotator based on the modal information.

5. The method according to claim 1, characterized in that continuing to fine-tune the multimodal model includes: generating input information and output information based on modal information using the fine-tuned multimodal model, to obtain training samples generated based on modal information; sending a screening instruction to a screening terminal, the screening instruction being used to display at least one training sample generated based on modal information to screening personnel; obtaining the training samples returned by the screening terminal after the screening personnel have screened and/or rewritten the at least one training sample generated based on modal information; fine-tuning the multimodal model based on the returned training samples; and repeating the above process of generating training samples based on modal information, screening, rewriting, and fine-tuning until the performance of the multimodal model meets preset requirements.

6. The method according to claim 1, characterized in that the method further includes: outputting a fourth display instruction, which is used to display a modality list to the user, the modality list including images, audio, and video; and obtaining any one or more modalities selected by the user from the modality list; accordingly, constructing a multimodal model based on the base model includes: adding, based on the one or more modalities selected by the user, encoders corresponding to the one or more modalities to the base model to obtain the multimodal model.

7. The method according to any one of claims 1-6, characterized in that the method further includes: constructing a matching model based on the fine-tuned multimodal model, and training the matching model using multiple training samples, wherein the matching model includes two branches, the input of the first branch is modal information and input information, the input of the second branch is output information, and the matching model is used to determine the degree of matching between the inputs of the two branches; and using the fine-tuned multimodal model as the policy model for reinforcement learning, determining the reward information in reinforcement learning based on the matching model, and optimizing the multimodal model and the matching model through proximal policy optimization.

8. The method according to claim 7, characterized in that the multimodal model constructed based on the pre-trained base model includes: a multimodal encoder, used to acquire modal information and encode it to obtain the element information corresponding to the modal information; the pre-trained base model includes: a text processing module, used to acquire the element information corresponding to the modal information and the element information corresponding to the input information, and output corresponding embedding features; and an output layer, used to output predicted output information based on the embedding features; accordingly, constructing a matching model based on the fine-tuned multimodal model and training the matching model using multiple training samples includes: initializing the first branch of the matching model based on the parameters of the multimodal encoder and the text processing module in the fine-tuned multimodal model, and initializing the second branch of the matching model based on the parameters of the text processing module in the fine-tuned multimodal model; and, during training, inputting the modal information and input information of any training sample into the first branch, inputting the output information of any training sample into the second branch, and adjusting the parameters of the matching model through contrastive learning supervision based on the embedding features output by the two branches.

9. The method according to claim 7, characterized in that using the fine-tuned multimodal model as the policy model for reinforcement learning, determining the reward information in reinforcement learning based on the matching model, and optimizing the multimodal model and the matching model through proximal policy optimization includes: inputting modal information and input information into the multimodal model to obtain at least one piece of predicted output information; for any piece of predicted output information, inputting the modal information and the input information into the first branch of the matching model and inputting the predicted output information into the second branch of the matching model to obtain the matching degree, which is used as the reward information corresponding to that predicted output information; and optimizing the multimodal model and the matching model through proximal policy optimization based on the reward information corresponding to each piece of predicted output information.
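
The reward step above can be sketched as follows, under stated assumptions. `toy_matcher` is a hypothetical stand-in for the trained matching model (simple word overlap), the mean-reward baseline for advantages is an assumption, and `ppo_clipped_objective` is the standard clipped surrogate from proximal policy optimization rather than anything the claim spells out.

```python
from typing import Callable, List, Sequence

Matcher = Callable[[str, str, str], float]


def toy_matcher(modal_info: str, input_info: str, output_info: str) -> float:
    """Hypothetical matching model: Jaccard word overlap between the two branch inputs."""
    left = set((modal_info + " " + input_info).lower().split())
    right = set(output_info.lower().split())
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def rewards_from_matcher(modal_info: str, input_info: str,
                         candidates: Sequence[str], matcher: Matcher) -> List[float]:
    """Score every predicted output; the matching degree is used as its reward."""
    return [matcher(modal_info, input_info, c) for c in candidates]


def advantages_from_rewards(rewards: Sequence[float]) -> List[float]:
    """Assumed baseline: subtract the mean reward of the candidate group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def ppo_clipped_objective(ratios: Sequence[float], advantages: Sequence[float],
                          clip_eps: float = 0.2) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), to be maximized."""
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("need non-empty ratios and advantages of equal length")
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)
```

Here each ratio is the probability of a predicted output under the updated policy divided by its probability under the policy that sampled it. Clipping keeps a single update from moving the policy too far on the strength of one reward.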

10. A method for generating a multimodal dialogue model, characterized in that the method includes: obtaining modal information, wherein the modal information includes content of any one or more modalities other than text; determining the description text corresponding to the modal information, obtaining prompt words, inputting the description text and the prompt words into a pre-trained language model to obtain input information, and inputting the description text and the obtained input information into the language model to obtain output information, thereby obtaining self-generated training samples, wherein the self-generated training samples include: the modal information, the preceding dialogue context, and the following dialogue; constructing a multimodal dialogue model based on the language model, and fine-tuning the multimodal dialogue model based on the self-generated training samples; and inputting the modal information and the prompt words into the fine-tuned multimodal dialogue model to obtain input information, inputting the modal information and the obtained input information into the fine-tuned multimodal dialogue model to obtain output information, and further fine-tuning the multimodal dialogue model; wherein the fine-tuned multimodal dialogue model is used to determine the following dialogue based on the modal information and the preceding dialogue context.
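
The first self-generation pass in this claim can be sketched as a two-call pipeline. This is a minimal illustration, assuming the language model is any callable from text to text. The `Sample` type, the `self_generate` function, and the way the prompt words and description text are joined are hypothetical choices, not details from the claim.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed interface: a language model is any function from prompt text to generated text.
LanguageModel = Callable[[str], str]


@dataclass
class Sample:
    modal_info: str   # placeholder reference to non-text content, e.g. an image path
    input_info: str   # generated preceding dialogue context
    output_info: str  # generated following dialogue


def self_generate(modal_info: str, description: str, prompt_words: str,
                  language_model: LanguageModel) -> Sample:
    """Build one self-generated training sample from the description text.

    Step 1: description text + prompt words -> input information.
    Step 2: description text + input information -> output information.
    The description text is only a stand-in for the modality during generation;
    the stored sample pairs the modal information itself with the generated text.
    """
    input_info = language_model(f"{prompt_words}\n{description}")
    output_info = language_model(f"{description}\n{input_info}")
    return Sample(modal_info=modal_info, input_info=input_info, output_info=output_info)
```

The second pass described in the claim would follow the same two-call shape, with the fine-tuned multimodal dialogue model receiving the modal information directly in place of the description text.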

11. The method according to claim 10, characterized in that obtaining the modal information includes: obtaining modal information of a product, wherein the modal information includes at least one of the following: an image, a video, or audio of the product; determining the description text corresponding to the modal information includes: determining the description text corresponding to the product by extracting text from the product details page and/or by manual annotation; wherein the fine-tuned multimodal dialogue model is used to determine the following dialogue based on the modal information of the product and the preceding dialogue context.

12. A multimodal processing method, characterized in that the method includes: obtaining information to be processed, which includes modal information and input information, wherein the modal information contains content of any one or more modalities other than text; and determining, based on a multimodal model, the output information corresponding to the information to be processed; wherein the multimodal model is generated by the method according to any one of claims 1-9.

13. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the electronic device to perform the method according to any one of claims 1-12.