Multimodal model training method, multimodal data processing method, device, equipment and medium

By encoding, compressing, and fusing multimodal data, and combining the target loss function value, a low-cost, high-precision multimodal model is trained, solving the problem of high multimodal data processing costs in existing technologies.

CN122287784A · Pending · Publication Date: 2026-06-26 · SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
Filing Date: 2024-12-24
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies require the deployment of multiple single-modal models when processing multimodal data, resulting in high costs. How to obtain multimodal models for processing multimodal data at a lower cost is an urgent problem to be solved.

Method used

The training samples are encoded using initial encoders corresponding to multiple modalities. The encoded outputs are compressed using an initial compression layer to make them the same size. Then, the initial language model is used for fusion processing, and the multimodal model is determined by combining the target loss function value.

Benefits of technology

This approach enables the training of multimodal models capable of accurately processing both textual and non-textual modal data at a lower cost, reducing training costs and time while improving the model's processing accuracy.

✦ Generated by Eureka AI based on patent content.

Figure CN122287784A_ABST

Abstract

This invention discloses a multimodal model training method, a multimodal data processing method, apparatus, device, and medium. The method includes: acquiring multiple training samples; encoding the training samples using initial encoders corresponding to multiple modalities to determine multiple initial encoded outputs; compressing the initial encoded outputs using initial compression layers corresponding to multiple modalities to determine initial compressed outputs of the same size for each training sample; fusing each initial compressed output using an initial language model to determine initial text data; calculating a loss based on the text compressed output, non-text compressed output, and initial text data to determine a target loss function value; and determining a multimodal model based on the initial encoder, initial compression layer, and initial language model when the target loss function value satisfies a preset convergence condition. This method can obtain a multimodal model for processing textual and non-textual modal data with relatively low training cost.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a multimodal model training method, a multimodal data processing method, apparatus, equipment, and medium.

Background Technology

[0002] With the development of artificial intelligence technology, applications often need to process multimodal data. For example, an intelligent interaction system needs to understand text data sent by the client together with at least one item of non-text data, including but not limited to images, video, and audio, and then respond to the client and interact with the user based on the combined data. Applications that process multimodal data therefore require a model that can handle multiple modalities at the same time. However, current technologies typically use APIs corresponding to the individual modalities to call multiple single-modal models for data processing. This requires deploying multiple models, each able to recognize only single-modal data, in the application scenario, which results in high costs. How to obtain a multimodal model for processing multimodal data at a lower cost is therefore a pressing technical problem.

Summary of the Invention

[0003] This invention provides a multimodal model training method, a multimodal data processing method, an apparatus, a device, and a medium to address the problem of how to obtain a multimodal model for processing multimodal data at a lower cost.

[0004] A multimodal model training method, comprising:
obtaining multiple training samples, each training sample including first training data of a text modality and second training data of a non-text modality;
encoding each training sample using initial encoders corresponding to multiple modalities, and determining multiple initial encoded outputs corresponding to each training sample, the initial encoded outputs including a text encoded output corresponding to the first training data and a non-text encoded output corresponding to the second training data;
compressing the initial encoded outputs corresponding to each training sample using initial compression layers corresponding to multiple modalities, and determining initial compressed outputs of the same size corresponding to each training sample, the initial compressed outputs including a text compressed output corresponding to the first training data and a non-text compressed output corresponding to the second training data;
fusing the initial compressed outputs corresponding to each training sample using an initial language model to determine initial text data corresponding to each training sample;
performing loss calculation based on the text compressed output, the non-text compressed output, and the initial text data corresponding to each training sample to determine a target loss function value; and
when the target loss function value satisfies a preset convergence condition, determining a multimodal model based on the initial encoders corresponding to multiple modalities, the initial compression layers corresponding to multiple modalities, and the initial language model.

[0005] A multimodal data processing method, comprising: obtaining at least one data group to be processed, each data group to be processed including first data to be processed corresponding to a text modality and second data to be processed corresponding to a non-text modality; and processing, with the multimodal model trained using the above-described multimodal model training method, the first data to be processed corresponding to the text modality and the second data to be processed corresponding to the non-text modality in each data group to be processed, thereby obtaining the target text data corresponding to each data group to be processed.

[0006] A multimodal model training device, comprising: A training sample acquisition module is used to acquire multiple training samples; each training sample includes first training data of text modality and second training data of non-text modality. The encoding processing module is used to encode each training sample using an initial encoder corresponding to multiple modalities, and to determine multiple initial encoding outputs corresponding to each training sample. The initial encoding outputs include text encoding outputs corresponding to the first training data and non-text encoding outputs corresponding to the second training data. The compression processing module is used to compress the initial encoding output corresponding to each training sample using an initial compression layer corresponding to multiple modalities, and to determine an initial compressed output of the same size corresponding to each training sample. The initial compressed output includes the text compressed output corresponding to the first training data and the non-text compressed output corresponding to the second training data. The initial text data determination module is used to perform fusion processing on the initial compressed output corresponding to each training sample using an initial language model to determine the initial text data corresponding to each training sample. The target loss function value determination module performs loss calculation based on the text compression output, the non-text compression output, and the initial text data corresponding to each training sample to determine the target loss function value. The multimodal model determination module is used to determine the multimodal model based on the initial encoder corresponding to multiple modalities, the initial compression layer corresponding to multiple modalities, and the initial language model when the target loss function value satisfies the preset convergence condition.

[0007] A multimodal data processing device, comprising: The pending data group acquisition module is used to acquire at least one pending data group, each pending data group including a first pending data corresponding to a text modality and a second pending data corresponding to a non-text modality. The target text data acquisition module is used to process the first data to be processed corresponding to the text modality and the second data to be processed corresponding to the non-text modality in each data group to be processed using a trained multimodal model, so as to obtain the target text data corresponding to each data group to be processed.

[0008] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the multimodal model training method described above, or, when the processor executes the computer program, it implements the multimodal data processing method described above.

[0009] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described multimodal model training method, or, when executed by a processor, implements the above-described multimodal data processing method.

[0010] In the aforementioned multimodal model training method, multimodal data processing method, apparatus, equipment, and medium, multiple initial compression layers corresponding to different modalities compress the initial encoded outputs corresponding to each training sample, so that each training sample has initial compressed outputs of the same size. This not only compresses the text encoded output corresponding to the text modality and the non-text encoded output corresponding to the non-text modality within each training sample, but also aligns the text compressed output corresponding to the text modality with the non-text compressed output corresponding to the non-text modality. By using an initial language model to fuse the aligned text compressed output and non-text compressed output corresponding to each training sample, the initial text data corresponding to each training sample, which contains data of different modalities, can be obtained relatively accurately. Furthermore, because the text compressed output and the non-text compressed output have been reduced in size, the data input to the language model is shorter, which effectively reduces training cost and saves training time. The target loss function value, determined from the text compressed output, non-text compressed output, and initial text data corresponding to each training sample, effectively reflects the training loss of the initial compression layers and the initial language model. When the target loss function value converges, a multimodal model including the initial encoders, the updated initial compression layers, and the updated initial language model is obtained. This multimodal model can achieve relatively accurate processing results for data that includes the text modality and at least one non-text modality. Through encoding, compression alignment, and fusion processing of multiple training samples covering text and non-text modalities, the method trains, at relatively low training cost, a multimodal model capable of accurately processing data of both text and non-text modalities, and therefore has high application value.

Attached Figure Description

[0011] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0012] Figure 1 is a schematic diagram of an application environment for a multimodal model training method according to an embodiment of the present invention;
Figure 2 is a flowchart of a multimodal model training method according to an embodiment of the present invention;
Figure 3 is another flowchart of a multimodal model training method in one embodiment of the present invention;
Figure 4 is another flowchart of a multimodal model training method in one embodiment of the present invention;
Figure 5 is another flowchart of a multimodal model training method in one embodiment of the present invention;
Figure 6 is another flowchart of a multimodal model training method in one embodiment of the present invention;
Figure 7 is a flowchart of a multimodal data processing method according to an embodiment of the present invention;
Figure 8 is a schematic diagram of a multimodal model training device according to an embodiment of the present invention;
Figure 9 is a schematic diagram of a computer device according to an embodiment of the present invention.

Detailed Implementation

[0013] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0014] An embodiment of the present invention provides a multimodal model training method, which can be applied in the application environment shown in Figure 1. Specifically, the method is applied in a multimodal model training system that includes a client and a server, as shown in Figure 1. The client and the server communicate over a network in order to obtain, at a lower cost, a multimodal model for processing multimodal data. The client, also known as the user terminal, is the program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster consisting of multiple servers.

[0015] In one embodiment, as shown in Figure 2, a multimodal model training method is provided. Taking the method as applied to the server in Figure 1 as an example, it includes the following steps:
S201: Obtain multiple training samples; each training sample includes first training data of the text modality and second training data of the non-text modality.
S202: Use initial encoders corresponding to multiple modalities to encode each training sample, and determine multiple initial encoded outputs corresponding to each training sample. The initial encoded outputs include the text encoded output corresponding to the first training data and the non-text encoded output corresponding to the second training data.
S203: Use initial compression layers corresponding to multiple modalities to compress the initial encoded outputs corresponding to each training sample, and determine initial compressed outputs of the same size corresponding to each training sample. The initial compressed outputs include the text compressed output corresponding to the first training data and the non-text compressed output corresponding to the second training data.
S204: Use the initial language model to fuse the initial compressed outputs corresponding to each training sample to determine the initial text data corresponding to each training sample.
S205: Calculate the loss based on the text compressed output, the non-text compressed output, and the initial text data corresponding to each training sample, and determine the target loss function value.
S206: When the target loss function value meets the preset convergence condition, determine the multimodal model based on the initial encoders corresponding to multiple modalities, the initial compression layers corresponding to multiple modalities, and the initial language model.
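Steps S202 to S205 can be sketched for a single training sample as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `training_step`, the dictionary-of-callables layout, and building the fusion input by concatenating the rows of the compressed outputs are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
Sample = Dict[str, object]  # modality name -> raw data; includes "text"


def training_step(
    sample: Sample,
    encoders: Dict[str, Callable[[object], Matrix]],
    compressors: Dict[str, Callable[[Matrix], Matrix]],
    language_model: Callable[[Matrix], str],
    loss_fn: Callable[[Dict[str, Matrix], str], float],
) -> float:
    """Run S202-S205 for one training sample and return its loss value."""
    # S202: encode each modality with the encoder for that modality.
    encoded = {m: encoders[m](data) for m, data in sample.items()}
    # S203: compress every encoded output to the same preset size.
    compressed = {m: compressors[m](e) for m, e in encoded.items()}
    # S204: fuse the aligned outputs in the language model.
    # Assumption: the fusion input is the row-wise concatenation.
    fused_input = [row for m in sorted(compressed) for row in compressed[m]]
    text_out = language_model(fused_input)
    # S205: compute the loss from the compressed outputs and the text.
    return loss_fn(compressed, text_out)
```

In a full training loop, this value would be checked against the preset convergence condition (S206), and the compression layers and language model would be updated when it is not yet met.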

[0016] Training samples refer to the data used for model training. Training samples include data in multiple modalities. Text modality refers to modalities displayed in text form. Non-text modality refers to modalities displayed in forms including, but not limited to, images, audio, and video. The first training data refers to the text modality data in the training samples. The second training data refers to the non-text modality data in the training samples.

[0017] As an example, in step S201, the server obtains multiple training samples, covering multiple modalities, that correspond to the application scenario. In this example, each training sample includes one item of first training data of the text modality and at least one item of second training data of a non-text modality. For example, the server obtains one piece of text data and one piece of image data from the application scenario as one set of training samples. As another example, the server obtains one piece of text data, one piece of image data, one audio segment, and one video segment from the application scenario as one set of training samples. As an application-scenario example, in an intelligent interaction system the server obtains question data entered by the client in the text modality together with image data to be identified as one set of training samples. For instance, in one set of training samples the first training data of the text modality is "Please identify the author of this painting," the non-text modality is the image modality, and the second training data of the image modality is the painting to be identified; this set of training samples thus corresponds to multiple modalities. Understandably, because text-modality data is less redundant than non-text-modality data and is used in a wider range of application scenarios, the multimodal model to be trained in this embodiment is used to process one or more sets of data, each consisting of the text modality and at least one non-text modality. Therefore, during model training, each training sample needs to include first training data corresponding to the text modality and second training data corresponding to at least one non-text modality. In this example, the multimodal model can be applied to scenarios including but not limited to intelligent interaction systems and cross-modal retrieval systems. For example, in an intelligent interaction system, the multimodal model understands and recognizes the text-modality data and non-text-modality data sent by the client and outputs a text response. In a cross-modal retrieval system, the multimodal model understands and recognizes the text-modality data and non-text-modality data sent by the client and searches the system database to output text data matching both.

[0018] Here, the initial encoder refers to the encoder used for encoding. Understandably, since both text and non-text modal data need to be encoded, multiple initial encoders capable of encoding different modalities are required. For example, initial encoders for encoding data corresponding to the text modality, initial encoders for encoding data corresponding to the image modality, initial encoders for encoding data corresponding to the audio modality, and initial encoders for encoding data corresponding to the video modality. The initial encoded output refers to the encoded output corresponding to the training samples. The text encoded output is the initial encoded output corresponding to the first training data for the text modality. The non-text encoded output refers to the initial encoded output corresponding to the second training data for the non-text modality, such as the initial encoded output corresponding to images, audio, and / or video.

[0019] As an example, in step S202, after determining multiple training samples including various modalities, the server uses initial encoders corresponding to different modalities to encode the data of each modality in each training sample, obtaining the initial encoding output corresponding to the data of each modality in each training sample. The initial encoding output includes each text encoding output corresponding to each first training data point after encoding the first training data of the text modality, and each non-text encoding output corresponding to each second training data point after encoding the second training data of the non-text modality. The first training data of the text modality includes text. The second training data of the non-text modality includes, but is not limited to, images, audio, and video. The non-text encoding output includes, but is not limited to, the initial encoding output corresponding to images, the initial encoding output corresponding to audio, and the initial encoding output corresponding to video.

[0020] In this example, the server uses the initial encoder corresponding to the text modality to encode the first training data of the text modality in each training sample, obtaining the initial encoded output corresponding to the text modality in each training sample. For example, the server uses the BERT (Bidirectional Encoder Representations from Transformers) model to encode the first training data of the text modality in a training sample and outputs the corresponding initial encoded output, that is, the text encoded output, which is a matrix with t rows and d columns. When the server determines that the training sample includes second training data of the image, video, and audio modalities, it uses the initial encoder corresponding to the image modality to encode the second training data of the image modality, obtaining the corresponding initial encoded output. Similarly, it uses the initial encoder corresponding to the video modality to encode the second training data of the video modality, obtaining the corresponding initial encoded output. For example, the server uses the CLIP (Contrastive Language-Image Pre-Training) model to encode the second training data of the image modality and the second training data of the video modality respectively, and outputs the corresponding initial encoded outputs: the initial encoded output corresponding to the image modality is a matrix with i rows and d columns, and the initial encoded output corresponding to the video modality is a matrix with v rows and d columns. The server uses the initial encoder corresponding to the audio modality to encode the second training data of the audio modality, obtaining the corresponding initial encoded output. For example, the server uses the MERT (Music Understanding Model with Large-Scale Self-supervised Training) model to encode the second training data of the audio modality and outputs the corresponding initial encoded output; the initial encoded output corresponding to the audio modality is a matrix with p rows and d columns.

[0021] As can be seen from the initial encoding outputs above, the text encoding outputs corresponding to the first training data of the text modality and the non-text encoding outputs corresponding to the second training data of the non-text modality correspond to matrices of different sizes. This results in a misalignment between the text encoding and non-text encoding outputs. If the initial language model is directly used to fuse the text encoding and non-text encoding outputs, the output results of the initial language model will be inaccurate. Therefore, it is necessary to perform alignment processing on the text encoding and non-text encoding outputs to improve the accuracy of the model.

[0022] Here, the initial compression layer refers to the compression layer used for compression processing. The initial compressed output refers to the output after compressing the initial encoded output. Compression processing refers to the method of compressing the initial encoded output according to a preset size. The preset size refers to the preset size of the initial compressed output.

[0023] As an example, in step S203, the server inputs the initial encoded output corresponding to each training sample into the initial compression layer of the corresponding modality. Multiple initial compression layers corresponding to different modalities are used, according to a preset size, to sequentially compress the initial encoded output of each training sample that corresponds to the modality of the initial compression layer, outputting an initial compressed output with a preset size corresponding to each initial encoded output of each training sample. Understandably, since the initial compressed output corresponding to each training sample has a preset size, the size of the initial compressed output corresponding to all training samples is the same.

[0024] In this example, the server uses the initial compression layer corresponding to the text modality to compress, according to the preset size, the text encoded output of a given group of training samples, obtaining a text compressed output with the preset size, that is, the text compressed output corresponding to the first training data of the text modality in that group of training samples. When the server determines that the second training data of the non-text modalities in the group of training samples includes images, audio, and video, it uses the initial compression layer corresponding to images to compress, according to the preset size, the initial encoded output corresponding to the images, obtaining a non-text compressed output with the preset size, that is, the non-text compressed output corresponding to the images in that group of training samples. It uses the initial compression layer corresponding to video to compress, according to the preset size, the initial encoded output corresponding to the video, obtaining the non-text compressed output with the preset size corresponding to the video in that group of training samples. It uses the initial compression layer corresponding to audio to compress, according to the preset size, the initial encoded output corresponding to the audio, obtaining the non-text compressed output with the preset size corresponding to the audio in that group of training samples. Understandably, the text compressed output corresponding to the text modality, the non-text compressed output corresponding to the images, the non-text compressed output corresponding to the video, and the non-text compressed output corresponding to the audio all have the preset size. Therefore, this compression method not only compresses the text encoded output corresponding to the text modality and the non-text encoded output corresponding to the non-text modality to obtain the text compressed output and the non-text compressed output, but also aligns the text compressed output with the non-text compressed output, which improves training efficiency, saves training cost, and yields a more accurate multimodal model.

[0025] In this example, multiple initial compression layers corresponding to different modalities are used to compress the initial encoded outputs corresponding to each training sample, and initial compressed outputs of the same size are determined for each training sample. This compresses and aligns the initial encoded output corresponding to the text modality and the initial encoded output corresponding to the non-text modality. It not only shortens the input to the subsequent initial language model, thereby reducing training time and cost, but also improves the accuracy of the model.
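To make the idea of compressing encoded outputs of different lengths to one preset size concrete, the sketch below reduces a matrix with any number of rows to a fixed number of rows by averaging contiguous groups of rows. The text does not state how the compression layer is implemented, so average pooling is an assumption chosen only for illustration, and the name `compress_rows` is hypothetical; a trained compression layer would have learnable parameters.

```python
from typing import List

Matrix = List[List[float]]


def compress_rows(encoded: Matrix, preset_rows: int) -> Matrix:
    """Average contiguous row groups so the result has preset_rows rows."""
    n = len(encoded)
    if preset_rows <= 0 or n < preset_rows:
        raise ValueError("need at least preset_rows input rows")
    d = len(encoded[0])
    out: Matrix = []
    for k in range(preset_rows):
        # Bin k covers rows [start, end); the bins split 0..n evenly.
        start = (k * n) // preset_rows
        end = ((k + 1) * n) // preset_rows
        group = encoded[start:end]
        out.append([sum(row[j] for row in group) / len(group) for j in range(d)])
    return out


# A 4-row text encoding and a 6-row image encoding both become 2 rows.
text_c = compress_rows([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]], 2)
image_c = compress_rows([[0.0, 0.0]] * 6, 2)
```

After this step both outputs have the same number of rows and columns, which is the alignment property the paragraph above relies on.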

[0026] The initial language model refers to the language model used to process multimodal data. The initial language model includes, but is not limited to, a large language model, used to recognize, understand, and output text from data of multiple modalities, thus achieving the processing of multimodal data. Initial text data refers to the text data output by the initial language model after processing multiple initial compressed outputs corresponding to each training sample.

[0027] As an example, in step S204, the server inputs the same-sized text compressed output and non-text compressed output corresponding to each training sample into the initial language model. The initial language model then fuses the text compressed output and the at least one non-text compressed output of the same size corresponding to each training sample, and outputs the initial text data corresponding to each training sample. Understandably, the initial language model processes the initial compressed outputs differently in different application scenarios. In an intelligent interaction system, the initial language model identifies and responds to the multiple initial compressed outputs corresponding to each training sample, outputs response text data corresponding to the multimodal training sample, and the response text data is determined as the initial text data corresponding to that training sample. In a cross-modal retrieval system, the initial language model identifies the multiple initial compressed outputs corresponding to each training sample, retrieves the text data matching each training sample, and the matching text data is determined as the initial text data corresponding to that training sample. In this example, the initial language model fuses the text compressed output and non-text compressed outputs of the same size corresponding to each training sample. Because the text compressed output and the non-text compressed outputs have the same size, that is, they are aligned data, the initial text data corresponding to each training sample can be obtained more accurately. Furthermore, because the text compressed output and the non-text compressed outputs have been reduced in size, the data input to the language model is shorter, which effectively reduces training cost and saves training time.
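The claim that compression shortens the language model input can be checked with simple arithmetic. In the sketch below, all row counts and the preset size of 64 are hypothetical values chosen for illustration, and the fused input is assumed to be the concatenation of the per-modality outputs.

```python
from typing import Dict


def fused_length(rows_per_modality: Dict[str, int]) -> int:
    """Total number of rows when per-modality outputs are concatenated."""
    return sum(rows_per_modality.values())


# Hypothetical encoder output lengths (the t, i, v, p of the text above).
raw_rows = {"text": 128, "image": 576, "video": 2048, "audio": 750}
preset_rows = 64  # hypothetical preset size shared by all modalities

raw_len = fused_length(raw_rows)
compressed_len = fused_length({m: preset_rows for m in raw_rows})
print(raw_len, compressed_len)
```

For a language model built on standard self-attention, whose cost grows roughly quadratically with input length, a shorter fused input translates into a larger saving in compute than the length ratio alone suggests.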

[0028] The target loss function value refers to the loss value generated during model training.

[0029] As an example, in step S205, the server analyzes the text compressed output and the at least one non-text compressed output corresponding to each training sample to determine the loss function value between the text compressed output and each non-text compressed output of that training sample. The server also calculates the loss function value of the initial text data corresponding to each training sample. The server then processes the loss function values between the text compressed output and each non-text compressed output, together with the loss function values of the initial text data, to obtain the target loss function value. For example, if a group of training samples includes text, an image, a video, and audio, the initial compressed outputs corresponding to the image, the video, and the audio and the text compressed output corresponding to the text are obtained respectively, and the loss function value between the text compressed output and each non-text compressed output of the training sample is then obtained. In this example, the loss function value between the text compressed output and the non-text compressed outputs of each training sample reflects the training loss of the initial compression layer, and the loss function value of the initial text data of each training sample reflects the training loss of the initial language model. In this way, the target loss function value of the multimodal model can be determined during training, so that the multimodal model is updated according to the target loss function value and the training of the multimodal model is realized.

[0030] The preset convergence condition is used to indicate when to stop model training, including but not limited to the target loss function value being less than a preset threshold, or the target loss function value stabilizing. Understandably, if the target loss function value is less than the preset threshold, it is determined that the target loss function value meets the preset convergence condition; if the target loss function value is not less than the preset threshold, it is determined that the target loss function value does not meet the preset convergence condition. Alternatively, if the target loss function value stabilizes, it is determined that the target loss function value meets the preset convergence condition; if the target loss function value does not stabilize, it is determined that the target loss function value does not meet the preset convergence condition.
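The two convergence conditions named above can be sketched as a small check. This is a minimal illustration only: the function name `has_converged`, the default threshold, and the definition of "stabilizing" as a bounded spread over the last few loss values are assumptions, not details given in the source.

```python
def has_converged(loss_history, threshold=0.05, window=3, tolerance=1e-3):
    """Return True when the preset convergence condition is met.

    Condition 1: the latest target loss value is below `threshold`.
    Condition 2 (assumed definition of "stabilizing"): the last `window`
    loss values differ from each other by at most `tolerance`.
    """
    if not loss_history:
        return False
    if loss_history[-1] < threshold:
        return True
    if len(loss_history) >= window:
        recent = loss_history[-window:]
        return max(recent) - min(recent) <= tolerance
    return False


print(has_converged([0.9, 0.04]))             # below threshold
print(has_converged([0.5, 0.5002, 0.5001]))   # stabilized above threshold
print(has_converged([0.9, 0.5]))              # neither condition holds
```

Either condition alone is enough to stop training in this sketch, which matches the "or" in the description of the preset convergence condition.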

[0031] As an example, in step S206, the server determines whether the target loss function value meets the preset convergence condition. If the target loss function value does not meet the preset convergence condition, the initial compression layer and initial language model corresponding to each modality are updated. If the target loss function value meets the preset convergence condition, the initial encoder corresponding to each modality, the updated initial compression layer corresponding to each modality, and the updated initial language model are determined as a multimodal model. Understandably, the initial encoder of each modality is connected to the initial compression layer of the same modality, and the initial compression layer of each modality is connected to the initial language model, forming a multimodal model. When the target loss function value meets the preset convergence condition, the trained multimodal model is obtained. This multimodal model can be deployed in various application scenarios to process text-modal data and non-text-modal data, outputting the corresponding processed text, thus realizing the function of processing multimodal data in different application scenarios.

[0032] In this embodiment, multiple modality-corresponding initial compression layers are used to compress the initial encoded output corresponding to each training sample. This ensures that each training sample has an initial compressed output of the same size. This not only compresses the text encoded output corresponding to the text modality and the non-text encoded output corresponding to the non-text modality within each training sample, but also aligns the text compressed output corresponding to the text modality and the non-text compressed output corresponding to the non-text modality. Using the initial language model, the aligned text compressed output and non-text compressed output corresponding to each training sample are fused. This allows for a relatively accurate acquisition of the initial text data corresponding to each training sample, including data from different modalities. Furthermore, since the text compressed output and non-text compressed output are size-compressed data, the length of the data input to the language model is smaller, effectively reducing training costs and saving training time. Based on the target loss function value determined by the text compression output, non-text compression output, and initial text data corresponding to each training sample, this method effectively reflects the training loss of the initial compression layer and the initial language model. When the target loss function value converges, a multimodal model including the initial encoder, the updated initial compression layer, and the updated initial language model can be obtained. This multimodal model can achieve relatively accurate processing results for data including text modalities and at least one non-text modality. 
This method, through encoding, compression alignment, and fusion processing of multiple training samples corresponding to text and non-text modalities, can achieve the training of a multimodal model capable of accurately processing data corresponding to both text and non-text modalities with relatively low training cost, demonstrating high application value.

[0033] In one embodiment, the initial compression layer includes multiple sequentially connected initial decoders.

[0034] The initial decoder refers to a decoder consisting of a multi-head cross-attention layer and a fully connected feedforward network layer connected in sequence. The initial compression layer consists of c sequentially connected initial decoders, where c ≥ 2.

[0035] In one embodiment, as shown in Figure 3, step S203, that is, using the initial compression layers corresponding to multiple modalities to compress the initial encoded outputs corresponding to each training sample and determining initial compressed outputs of the same size for each training sample, includes:
S301: Using the current initial decoders corresponding to multiple modalities, compress the preset guide word and the multiple target inputs corresponding to each training sample to determine multiple initial decoding outputs of the same size corresponding to each training sample, where each target input is either the initial encoded output or the initial decoding output of the previous initial decoder.
S302: When the current initial decoder corresponding to multiple modalities is not the last initial decoder of the initial compression layer, update the multiple initial decoding outputs of the same size corresponding to each training sample to be the multiple target inputs corresponding to each training sample, and repeat the compression of the preset guide word and the multiple target inputs corresponding to each training sample using the current initial decoders corresponding to multiple modalities.
S303: When the current initial decoder corresponding to multiple modalities is the last initial decoder of the initial compression layer, determine the multiple initial decoding outputs of the same size corresponding to each training sample as the initial compressed outputs of the same size corresponding to each training sample.

[0036] Here, "current initial decoder" refers to the initial decoder currently performing compression processing. "Target input" refers to the input to the current initial decoder. "Initial decoding output" refers to the output of an initial decoder.

[0037] As an example, in step S301, the server determines whether the target input is the initial encoded output or the initial decoding output of the previous initial decoder. When the server determines that the target input is the initial encoded output, it determines the first initial decoder of the initial compression layer as the current initial decoder, and inputs the initial encoded outputs of the different modalities corresponding to each training sample into the current initial decoder of the corresponding modality. Using the current initial decoders corresponding to the different modalities, the initial encoded outputs of the same modality are compressed according to a preset size, so that the current initial decoders of the different modalities produce multiple initial decoding outputs of the same size for each training sample. For example, using the current initial decoder corresponding to the text modality, the initial encoded output of the text modality of the j-th training sample is compressed to obtain the initial decoding output corresponding to the text in the j-th training sample. Using the current initial decoder corresponding to the image modality, the initial encoded output of the image modality of the j-th training sample is compressed to obtain the initial decoding output corresponding to the image in the j-th training sample. Using the current initial decoder corresponding to the video modality, the initial encoded output of the video modality of the j-th training sample is compressed to obtain the initial decoding output corresponding to the video in the j-th training sample. Using the current initial decoder corresponding to the audio modality, the initial encoded output of the audio modality of the j-th training sample is compressed to obtain the initial decoding output corresponding to the audio in the j-th training sample. These four initial decoding outputs have the same size.

[0038] When the server determines that the target input is the initial decoding output of the previous initial decoder, it identifies the next initial decoder adjacent to the previous initial decoder in the initial compression layer as the current initial decoder. It then inputs the multiple initial decoding outputs corresponding to each training sample from the previous initial decoder into the current initial decoder of the corresponding modality. Using the current initial decoder, it compresses these initial decoding outputs according to a preset size, obtaining initial decoding outputs of the same size for each training sample from the current initial decoder. For example, if the current initial decoder is the i-th initial decoder (where i ≥ 2) and the previous initial decoder is the (i-1)-th initial decoder, the initial decoding outputs of the (i-1)-th initial decoder corresponding to the text, the image, the video, and the audio of the j-th training sample are input into the current initial decoders of the corresponding modalities to obtain the initial decoding outputs of the i-th initial decoder corresponding to the text, the image, the video, and the audio of the j-th training sample. These four initial decoding outputs have the same size.

[0039] In this example, multiple modal current initial decoders are used to compress the target input, resulting in initial decoded outputs of the same size for each current initial decoder. This allows for the subsequent acquisition of initial compressed outputs of the same size for different modalities based on the initial decoded outputs of the same size for different modalities.

[0040] As an example, in step S302, when the server determines that the current initial decoder corresponding to multiple modalities is not the last initial decoder of the initial compression layer, it takes the multiple initial decoding outputs of the same size corresponding to each training sample as the new target inputs corresponding to each training sample, records the current initial decoder as the previous initial decoder, takes the next initial decoder adjacent to it as the new current initial decoder, and repeats step S301 to continue the compression processing.

[0041] As an example, in step S303, when the server determines that the current initial decoder corresponding to the different modalities is the last initial decoder of the initial compression layer, it determines the multiple initial decoding outputs of the same size corresponding to each training sample as the multiple initial compressed outputs of the same size corresponding to each training sample. In this example, if the initial compression layer of each modality has c initial decoders, where c ≥ 2, then when the server determines that the current initial decoder is the c-th initial decoder of the initial compression layer, it determines the initial decoding output of the text modality of the j-th training sample output by the c-th initial decoder as the text compressed output of the j-th training sample, and determines the initial decoding outputs corresponding to the image, the video, and the audio as the initial compressed outputs of the image, the video, and the audio of the j-th training sample, respectively. These four compressed outputs have the same size.

[0042] In this embodiment, multiple initial decoders in the initial compression layer sequentially compress the initial encoded output and the initial decoded output of the previous initial decoder. This allows for the generation of initial compressed outputs corresponding to different modalities of the same size in the training samples. This enables the compression and alignment of the initial encoded outputs corresponding to different modalities in the training samples, thereby improving model training efficiency and reducing model training costs by using the initial compressed outputs corresponding to different modalities.
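The control flow of steps S301 to S303 can be sketched as a loop over the c decoders of each modality. This is a simplified illustration: the names `compress` and `toy_decoder` are hypothetical, the decoder is a stand-in that only reproduces the "one output row per guide-word row" property, and the forwarding of attention outputs between decoders is left out.

```python
def toy_decoder(guide_word, target_input):
    # Stand-in for a real initial decoder (assumption): it emits exactly one
    # value per guide-word row, so the output length never depends on the
    # length of the target input.
    mean = sum(target_input) / len(target_input)
    return [mean + g for g in guide_word]


def compress(encoded_outputs, decoders_by_modality, guide_word):
    """Run each modality's encoded output through its chain of decoders."""
    compressed = {}
    for modality, target_input in encoded_outputs.items():
        decoders = decoders_by_modality[modality]
        if len(decoders) < 2:
            raise ValueError("the compression layer needs c >= 2 decoders")
        for decoder in decoders:
            # S301: compress the guide word together with the target input.
            # S302: the decoding output becomes the next target input.
            target_input = decoder(guide_word, target_input)
        # S303: the output of the last decoder is the compressed output.
        compressed[modality] = target_input
    return compressed


guide_word = [0.0, 0.0, 0.0, 0.0]                       # s = 4 rows
encoded = {"text": [1.0, 2.0, 3.0], "image": [float(i) for i in range(10)]}
decoders = {"text": [toy_decoder, toy_decoder], "image": [toy_decoder, toy_decoder]}
outputs = compress(encoded, decoders, guide_word)
print({name: len(value) for name, value in outputs.items()})
```

Both modalities end with four rows even though their encoded outputs started with three and ten entries, which is the alignment property the embodiment relies on.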

[0043] In one embodiment, the initial decoder includes a multi-head cross-attention layer.

[0044] As an example, each initial decoder includes a multi-head cross-attention layer and a fully connected feedforward network layer connected in sequence. The initial compression layer includes c (c ≥ 2) initial decoders. The output of the multi-head cross-attention layer in the previous initial decoder is connected both to the input of the fully connected feedforward network layer in that decoder and to the input of the next adjacent initial decoder. The first initial decoder compresses the preset guide word and the initial encoded output of the initial encoder. Each of the remaining c-1 initial decoders compresses the initial decoding output of the previous initial decoder, the preset guide word, and the output of the multi-head cross-attention layer in the previous initial decoder.

[0045] In one embodiment, as shown in Figure 4, step S301, that is, using the current initial decoders corresponding to multiple modalities to compress the preset guide word and the multiple target inputs corresponding to each training sample and determining multiple initial decoding outputs of the same size corresponding to each training sample, includes:
S401: When the current initial decoder corresponding to multiple modalities is the first initial decoder of the initial compression layer, use the current initial decoders corresponding to multiple modalities to compress the preset guide word and the multiple initial encoded outputs corresponding to each training sample, and determine the multiple initial decoding outputs of the same size corresponding to each training sample.
S402: When the current initial decoder corresponding to multiple modalities is not the first initial decoder of the initial compression layer, use the current initial decoders corresponding to multiple modalities to compress the preset guide word, the multiple initial decoding outputs of the previous initial decoder corresponding to each training sample, and the multiple attention outputs of the multi-head cross-attention layer in the previous initial decoder corresponding to each training sample, so as to determine the multiple initial decoding outputs of the same size corresponding to each training sample.

[0046] As an example, in step S401, when the server determines that the current initial decoder corresponding to multiple modalities is the first initial decoder of the initial compression layer, it uses the current initial decoders of the multiple modalities to compress the initial encoded outputs and the preset guide word corresponding to each training sample, thereby obtaining multiple initial decoding outputs of the same size corresponding to each training sample. In this example, for the m-th attention head in the multi-head cross-attention layer, the server multiplies the preset guide word f, of size s rows and d columns, by the first projection matrix of that head to obtain the first retrieval matrix of the head; multiplies the initial encoded output corresponding to one modality in the j-th group of training samples by the second projection matrix of that head to obtain the second retrieval matrix of the head; and multiplies the third projection matrix of that head by the same initial encoded output to obtain the third retrieval matrix of the head. The m-th attention head then compresses the three retrieval matrices and f: a softmax attention term computed from the three retrieval matrices is added to the transpose of f (T denotes matrix transpose). The result is the output of the m-th attention head of the multi-head cross-attention layer for the initial encoded output of that modality in the j-th group of training samples when the current initial decoder is the first initial decoder of the initial compression layer.
Following the same processing, the server obtains the outputs of all M attention heads of the multi-head cross-attention layer, where M is the total number of attention heads. The server averages the M head outputs to obtain the attention output of the multi-head cross-attention layer of the current initial decoder, that is, the compressed result for the training data of one modality in the j-th group of training samples; the attention output has d rows and s columns. The server inputs the attention output into the fully connected feedforward network layer (MLP) of the current initial decoder to obtain an output of d rows and s columns, and transposes that output to obtain a matrix of s rows and d columns. This matrix is the initial decoding output corresponding to that modality in the j-th group of training samples when the current initial decoder is the first initial decoder of the initial compression layer. In this example, when the j-th group of training samples includes the first training data corresponding to the text modality and the second training data corresponding to the image, the audio, and the video, the server performs the above steps to obtain, for the first initial decoder of the initial compression layer, the initial decoding outputs corresponding to the text, the image, the video, and the audio in the j-th group of training samples, each of size s rows and d columns.
In this example, the first initial decoder of the initial compression layer processes the preset guide word of size s rows and d columns together with the initial encoded outputs, which differ in size across the modalities of each training sample. This not only compresses the initial encoded output, reducing its length, but also yields initial decoding outputs of the same size for multiple modalities, achieving alignment of the training data of different modalities in the training samples. The attention output refers to the output obtained after the multi-head cross-attention layer processes the target input.
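The reason the output size depends only on the guide word can be seen in a small cross-attention sketch. Several things here are assumptions rather than details from the source: the three per-head matrices are treated as the usual query, key, and value projections; the scores are scaled by the square root of d; the feedforward layer is a two-layer ReLU network; and the computation is kept in s rows by d columns throughout instead of the d by s orientation followed by a transpose that the text describes. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two row-major nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def head_output(f, encoded, w_q, w_k, w_v):
    """One attention head: the guide word queries the encoded output."""
    q = matmul(f, w_q)            # s x d
    k = matmul(encoded, w_k)      # n x d
    v = matmul(encoded, w_v)      # n x d
    scale = math.sqrt(len(q[0]))  # assumed scaling, not stated in the source
    scores = [[x / scale for x in row] for row in matmul(q, transpose(k))]
    attended = matmul(softmax_rows(scores), v)   # s x d, independent of n
    # Residual connection back to the guide word.
    return [[a + g for a, g in zip(ra, rf)] for ra, rf in zip(attended, f)]


def first_decoder(f, encoded, params):
    """Average the head outputs, then apply a small feedforward network."""
    heads = [head_output(f, encoded, *w) for w in params["heads"]]
    s, d = len(f), len(f[0])
    attention = [[sum(h[r][c] for h in heads) / len(heads) for c in range(d)]
                 for r in range(s)]
    hidden = [[max(0.0, v) for v in row] for row in matmul(attention, params["w1"])]
    return attention, matmul(hidden, params["w2"])


def make_params(d_in, d, n_heads, rng):
    heads = [(rand_matrix(d, d, rng), rand_matrix(d_in, d, rng), rand_matrix(d_in, d, rng))
             for _ in range(n_heads)]
    return {"heads": heads, "w1": rand_matrix(d, 2 * d, rng), "w2": rand_matrix(2 * d, d, rng)}


rng = random.Random(0)
s, d = 4, 8
f = rand_matrix(s, d, rng)                   # shared preset guide word
text_encoded = rand_matrix(5, 6, rng)        # 5 rows of width 6
image_encoded = rand_matrix(50, 10, rng)     # 50 rows of width 10
_, text_decoded = first_decoder(f, text_encoded, make_params(6, d, 2, rng))
_, image_decoded = first_decoder(f, image_encoded, make_params(10, d, 2, rng))
print(len(text_decoded), len(text_decoded[0]))     # 4 8
print(len(image_decoded), len(image_decoded[0]))   # 4 8
```

Because the guide word supplies the queries, the attention result always has s rows, so a 5-row text encoding and a 50-row image encoding both come out as 4 by 8 matrices.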

[0047] As an example, in step S402, when the server determines that the current initial decoder corresponding to multiple modalities is not the first initial decoder of the initial compression layer, it uses the current initial decoder corresponding to multiple modalities to compress the preset guide word, the multiple initial decoding outputs of the previous initial decoder corresponding to each training sample, and the multiple attention outputs of the multi-head cross-attention layer in the previous initial decoder corresponding to each training sample, to obtain multiple initial decoding outputs of the same size.

[0048] In this example, the server determines the current initial decoder to be the i-th initial decoder in the initial compression layer, where i ≥ 2. For the m-th attention head in the multi-head cross-attention layer of the i-th initial decoder, the server multiplies the preset guide word f, of size s rows and d columns, by the first projection matrix of that head to obtain the first retrieval matrix of the head; multiplies the initial decoding output of the (i-1)-th initial decoder for one modality in the j-th group of training samples by the second projection matrix of that head to obtain the second retrieval matrix of the head; and multiplies the third projection matrix of that head by the same initial decoding output to obtain the third retrieval matrix of the head. The m-th attention head then compresses the three retrieval matrices and the attention output of the multi-head cross-attention layer of the (i-1)-th initial decoder: a softmax attention term computed from the three retrieval matrices is added to the attention output of the (i-1)-th initial decoder (T denotes matrix transpose). The result is the output of the m-th attention head in the multi-head cross-attention layer of the i-th initial decoder for that modality in the j-th group of training samples.
Following the same processing, the server obtains the outputs of all M attention heads of the multi-head cross-attention layer of the i-th initial decoder, where M is the total number of attention heads. The server averages the M head outputs to obtain the attention output of the multi-head cross-attention layer of the current initial decoder corresponding to one modality in the j-th group of training samples; the attention output has d rows and s columns. The server inputs the attention output into the fully connected feedforward network layer (MLP) of the current initial decoder, that is, the i-th initial decoder, to obtain an output of d rows and s columns, and transposes that output to obtain a matrix of s rows and d columns. This matrix is the initial decoding output corresponding to that modality in the j-th group of training samples when the current initial decoder is the i-th initial decoder of the initial compression layer. In this example, the server performs the above steps to obtain, for the i-th initial decoder of the initial compression layer, the initial decoding outputs corresponding to the text, the image, the video, and the audio in the j-th group of training samples, each of size s rows and d columns.

[0049] In this example, when the current initial decoder corresponding to multiple modalities is not the first initial decoder of the initial compression layer, the current initial decoders corresponding to multiple modalities are used to compress the preset guide word, the multiple initial decoding outputs of the previous initial decoder corresponding to each training sample, and the multiple attention outputs of the multi-head cross-attention layer in the previous initial decoder corresponding to each training sample. This yields initial decoding outputs of the same size for each training sample, thereby achieving the compression and alignment of training data of different modalities in the training samples. Here, the training data refers to the first training data and the second training data in the training samples.

[0050] In this embodiment, depending on whether the current initial decoder is the first initial decoder of the initial compression layer, different methods are used to compress different target inputs. This achieves the purpose of using multiple initial decoders to compress the size of the initial encoded output, so that the training data of different modalities in the training samples correspond to the same size initial decoding output of the same current initial decoder. This facilitates the subsequent determination of the initial compressed output corresponding to the training data of different modalities in each training sample based on the initial decoding output of the same size, thereby achieving the alignment of training samples of different modalities.
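One way to read the two cases S401 and S402 is that they share the same layer computation and differ only in what the guide word attends to and which term is added back. The sketch below takes that reading as an assumption. It also uses a single attention head, a ReLU in place of the feedforward network, and an s by d orientation, all for brevity; the names `decoder_layer`, `compression_layer`, and `make_layers` are hypothetical.

```python
import math
import random


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(a):
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def decoder_layer(f, source, residual, weights):
    """Single-head stand-in for one initial decoder.

    source:   what the guide word attends to (keys and values)
    residual: the s x d term added to the attention result
    """
    w_q, w_k, w_v = weights
    q, k, v = matmul(f, w_q), matmul(source, w_k), matmul(source, w_v)
    k_t = [list(col) for col in zip(*k)]
    attended = matmul(softmax_rows(matmul(q, k_t)), v)
    attention_out = [[a + r for a, r in zip(ra, rr)]
                     for ra, rr in zip(attended, residual)]
    decoded = [[max(0.0, x) for x in row] for row in attention_out]  # toy feedforward
    return attention_out, decoded


def compression_layer(f, encoded, layers):
    # S401: the first decoder reads the encoder output; the residual is f.
    attention_out, decoded = decoder_layer(f, encoded, f, layers[0])
    # S402: later decoders read the previous decoding output, and the
    # previous attention output takes the place of f as the residual.
    for weights in layers[1:]:
        attention_out, decoded = decoder_layer(f, decoded, attention_out, weights)
    return decoded


def make_layers(d_in, d, c, rng):
    widths = [d_in] + [d] * (c - 1)   # only the first decoder sees width d_in
    return [(rand_matrix(d, d, rng), rand_matrix(w, d, rng), rand_matrix(w, d, rng))
            for w in widths]


rng = random.Random(1)
s, d, c = 3, 4, 3
f = rand_matrix(s, d, rng)
short = compression_layer(f, rand_matrix(2, 5, rng), make_layers(5, d, c, rng))
long_ = compression_layer(f, rand_matrix(30, 7, rng), make_layers(7, d, c, rng))
print(len(short), len(short[0]), len(long_), len(long_[0]))   # 3 4 3 4
```

Passing the attention output forward as the residual gives each later decoder a direct path to the previous layer's pre-feedforward result, which is the connection the embodiment describes between adjacent decoders.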

[0051] In one embodiment, as shown in Figure 5, step S205, that is, calculating the loss based on the text compressed output, the non-text compressed output, and the initial text data corresponding to each training sample and determining the target loss function value, includes:
S501: Obtain the text global representation corresponding to the first training data of the text modality and the non-text global representation corresponding to the second training data of each non-text modality in each training sample.
S502: Based on the text global representation and the non-text global representation corresponding to each training sample, determine the alignment loss function value corresponding to all training samples.
S503: Based on the initial text data corresponding to each training sample and the preset label text data corresponding to each training sample, determine the label loss function value corresponding to all training samples.
S504: Determine the target loss function value based on the alignment loss function value and the label loss function value corresponding to all training samples.

[0052] Here, text global representation refers to the representation corresponding to the compressed text output learned during model training. Non-text global representation refers to the representation corresponding to the non-text compressed output learned during model training.

[0053] As an example, in step S501, the server takes the first word (token) in the text compressed output of each training sample as the global word of that text compressed output, and takes the first word in each non-text compressed output of each training sample as the global word of that non-text compressed output. The coordinates corresponding to the global word in the text compressed output are determined as the text global representation corresponding to the first training data of the text modality in each training sample, and the coordinates corresponding to the global word in each non-text compressed output are determined as the non-text global representation corresponding to the second training data of that non-text modality. For example, the text compressed output corresponding to the j-th training sample has s rows and d columns, and the 1×d vector in each row is one word. The word in the first row is the first word; it is determined as the global word of the text compressed output, and its coordinates are determined as the text global representation of the text compressed output corresponding to the j-th training sample. If the non-text modalities of the j-th training sample include video, the first word of the initial compressed output of the video, which has s rows and d columns, is likewise determined as the global word, and its coordinates are determined as the non-text global representation of that initial compressed output.

[0054] The alignment loss function value refers to the difference loss between the at least one non-text compressed output and the text compressed output corresponding to all training samples. Understandably, the initial language model needs to be able to process the text compressed output corresponding to each training sample and the non-text compressed output corresponding to at least one non-text modality. To optimize the processing performance of the initial language model, the text compressed output must be aligned with the non-text compressed output of each non-text modality. This means that the text global representation and the non-text global representation need to be trained to be as close as possible, so that the initial language model can adapt to data from different modalities. Therefore, the alignment loss function value is determined based on the text global representation corresponding to the text compressed output and the non-text global representation corresponding to the non-text compressed output. This makes it possible to determine whether the text compressed output and the non-text compressed output of each non-text modality are aligned, thereby improving model training performance.

[0055] As an example, in step S502, the server processes the text global representation and the at least one non-text global representation corresponding to each training sample, determines the difference between each non-text global representation and the text global representation of that training sample, and processes these differences to obtain the alignment loss function value corresponding to all groups of training samples. For example, suppose N groups of image-text pairs (N > 1), each consisting of one image and one text, are input into the multimodal model. After passing through the initial encoders and the initial compression layers, N non-text compressed outputs corresponding to the images and N text compressed outputs corresponding to the texts are obtained. For each image-text pair, the server obtains the non-text global representation corresponding to the non-text compressed output and the text global representation corresponding to the text compressed output, and determines the difference between the two, thereby obtaining N differences. The server then processes the N differences to obtain the alignment loss function value.
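A minimal Python sketch of the N-pair example follows. The text does not state how a single difference is measured or how the N differences are processed; the sketch assumes a squared Euclidean distance per pair and a mean over the N pairs. Both choices, and the function names, are illustrative assumptions rather than the method described here.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed per-pair difference: squared Euclidean distance."""
    if len(a) != len(b):
        raise ValueError("global representations must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(text_globals: List[List[float]],
                   image_globals: List[List[float]]) -> float:
    """Assumed processing of the N differences: their mean."""
    if not text_globals or len(text_globals) != len(image_globals):
        raise ValueError("expected N >= 1 matching image-text pairs")
    differences = [squared_distance(t, v)
                   for t, v in zip(text_globals, image_globals)]
    return sum(differences) / len(differences)


# Two hypothetical image-text pairs with d = 2.
text_globals = [[1.0, 0.0], [0.0, 1.0]]
image_globals = [[1.0, 0.0], [0.0, 0.0]]
loss = alignment_loss(text_globals, image_globals)  # differences are 0.0 and 1.0
```

The loss reaches 0 only when every image global representation coincides with its paired text global representation, which matches the stated goal of training the two to be close.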

[0056] Here, the preset label text data refers to the text data expected to be output by the initial language model. The label loss function value refers to the difference between the initial text data and the preset label text data corresponding to all training samples. Understandably, the smaller the label loss function value between the initial text data and the preset label text data, the closer the initial text data is to the preset label text data; that is, the smaller the label loss function value, the higher the accuracy of the initial text data.

[0057] As an example, in step S503, the server obtains the label difference between the initial text data and the corresponding preset label text data in each group of training samples, thus obtaining the label difference for each training sample. The server then processes these label differences to obtain the label loss function value for all groups of training samples. In this example, for N groups of training samples (N > 1), the multimodal model outputs N initial text data, and there are N corresponding preset label text data. The server obtains the first probability distribution $p_j$ of each initial text data over the preset vocabulary and the second probability distribution $q_j$ corresponding to each preset label text data, where j denotes the j-th training sample. The server calculates the cross-entropy loss between the first probability distribution $p_j$ and the second probability distribution $q_j$ as the label difference corresponding to the j-th training sample. The label differences of the N training samples are summed to obtain the cross-entropy loss function value, which is determined as the label loss function value $L_{label}$: $L_{label} = \sum_{j=1}^{N} \mathrm{CE}(p_j, q_j)$. As an example, in step S504, the server uses a preset first correction coefficient $\alpha$ to correct the alignment loss function value $L_{align}$ for all training samples, obtaining $\alpha L_{align}$, and uses a preset second correction coefficient $\beta$ to correct the label loss function value $L_{label}$ for all training samples, obtaining $\beta L_{label}$. The sum of $\alpha L_{align}$ and $\beta L_{label}$ is determined as the target loss function value $L$, that is, $L = \alpha L_{align} + \beta L_{label}$.
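Steps S503 and S504 can be sketched in Python as below. As a simplification, each training sample is represented by a single probability distribution over a tiny vocabulary rather than one distribution per output position, and the correction coefficients are assumed to act as multiplicative weights. The function names and the numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def cross_entropy(predicted: Sequence[float], label: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy of the label distribution against the predicted one."""
    return -sum(q * math.log(max(p, eps)) for p, q in zip(predicted, label))


def label_loss(predicted_batch: List[List[float]],
               label_batch: List[List[float]]) -> float:
    """Sum of the per-sample label differences over the N training samples."""
    return sum(cross_entropy(p, q) for p, q in zip(predicted_batch, label_batch))


def target_loss(align: float, label: float,
                first_coeff: float, second_coeff: float) -> float:
    """Weighted sum of the alignment loss and the label loss (assumed weighting)."""
    return first_coeff * align + second_coeff * label


# Two hypothetical samples over a vocabulary of two entries.
first_distributions = [[0.5, 0.5], [0.25, 0.75]]   # from the initial text data
second_distributions = [[1.0, 0.0], [0.0, 1.0]]    # from the preset label text data
l_label = label_loss(first_distributions, second_distributions)
l_target = target_loss(align=0.5, label=l_label, first_coeff=0.1, second_coeff=1.0)
```

The two coefficients control how strongly alignment of the compressed outputs is traded off against accuracy of the generated text.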

[0058] In this embodiment, the alignment loss function value is used to reflect the compression alignment effect of the initial compression layer, and the label loss function value is used to reflect the fusion effect of the initial language model on multiple initial compression outputs corresponding to the text modality and at least one non-text modality. Based on the alignment loss function value and the label loss function value, the target loss function value is determined, which can accurately reflect the training effect of the multimodal model. This allows the initial compression layer and initial language model corresponding to multiple modalities to be updated according to the target loss function value, thereby training a multimodal model that can have a better data processing effect.

[0059] In one embodiment, the non-text global representation includes image global representation, audio global representation, and / or video global representation.

[0060] Understandably, depending on the application scenario and user needs, each group of data input to the multimodal model includes one piece of text modal data and at least one piece of non-text modal data. Therefore, during the training of the multimodal model, each training sample includes first training data of the text modality and second training data of at least one non-text modality. The non-text compressed output corresponding to each training sample includes a non-text global representation corresponding to at least one modality. That is, the non-text global representation corresponding to each training sample includes an image global representation, an audio global representation, and / or a video global representation.

[0061] In one embodiment, as shown in Figure 6, step S502, which involves determining the alignment loss function value for all training samples based on the text global representation and non-text global representation corresponding to each training sample, includes: S601: Obtain the first alignment loss value between the text global representation and the image global representation corresponding to each training sample, the second alignment loss value between the text global representation and the audio global representation corresponding to each training sample, and / or the third alignment loss value between the text global representation and the video global representation corresponding to each training sample; S602: Based on the first alignment loss value, the second alignment loss value, and / or the third alignment loss value corresponding to each training sample, determine the alignment loss function value corresponding to all training samples.

[0062] The first alignment loss value refers to the alignment loss function value between the global text representation and the global image representation corresponding to multiple training samples. The second alignment loss value refers to the alignment loss function value between the global text representation and the global audio representation corresponding to multiple training samples. The third alignment loss value refers to the alignment loss function value between the global text representation and the global video representation corresponding to multiple training samples.

[0063] As an example, in step S601, the server obtains the types of non-text global representation corresponding to all groups of training samples and determines whether each group of training samples contains an image global representation, an audio global representation, and a video global representation. The image global representation and text global representation in each training sample containing an image global representation are processed to determine the first alignment loss value. The audio global representation and text global representation in each training sample containing an audio global representation are processed to determine the second alignment loss value. The video global representation and text global representation in each training sample containing a video global representation are processed to determine the third alignment loss value.

[0064] In this example, the first alignment loss value is calculated from the following quantities: the number of training samples whose non-text global representations include an image global representation, the text global representation corresponding to the j-th training sample, and the image global representation corresponding to the j-th training sample.

[0065] The second alignment loss value is calculated from the following quantities: the number of training samples whose non-text global representations include an audio global representation, the text global representation corresponding to the j-th training sample, and the audio global representation corresponding to the j-th training sample.

[0066] The third alignment loss value is calculated from the following quantities: the number of training samples whose non-text global representations include a video global representation, the text global representation corresponding to the j-th training sample that contains a video global representation, and the video global representation corresponding to the j-th training sample.

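[0067] Here, the number of training samples containing an image global representation, the number containing an audio global representation, and the number containing a video global representation are each less than or equal to N, where N is the number of training sample groups.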

[0068] As an example, in step S602, the server uses a third preset correction coefficient $\lambda_1$ to correct the first alignment loss value $L_1$, obtaining $\lambda_1 L_1$; uses a fourth preset correction coefficient $\lambda_2$ to correct the second alignment loss value $L_2$, obtaining $\lambda_2 L_2$; and uses a fifth preset correction coefficient $\lambda_3$ to correct the third alignment loss value $L_3$, obtaining $\lambda_3 L_3$. The sum of $\lambda_1 L_1$, $\lambda_2 L_2$, and / or $\lambda_3 L_3$ is determined as the alignment loss function value $L_{align}$ corresponding to all training samples. If none of $L_1$, $L_2$, and $L_3$ is 0, then $L_{align} = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3$. If exactly two of the three alignment loss values are not 0 and the remaining one is 0, then $L_{align}$ is the sum of the two corrected non-zero values, for example $L_{align} = \lambda_1 L_1 + \lambda_2 L_2$ when $L_3$ is 0. If exactly two of the three alignment loss values are 0 and the remaining one is not 0, then $L_{align}$ equals the single corrected non-zero value, for example $L_{align} = \lambda_1 L_1$ when $L_2$ and $L_3$ are both 0.
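The per-modality computation of steps S601 and S602 can be sketched in Python as follows. The sketch assumes a mean squared distance as the per-modality alignment loss (the exact formula is not reproduced above), treats a modality that no training sample contains as contributing 0, and applies the correction coefficients as multiplicative weights. The dictionary layout, function names, and values are hypothetical.

```python
from typing import Dict, List, Optional

# Each sample maps a modality name to its global representation, or None if absent.
Sample = Dict[str, Optional[List[float]]]


def modality_alignment_loss(samples: List[Sample], modality: str) -> float:
    """Assumed loss: mean squared distance over samples containing `modality`."""
    pairs = [(s["text"], s[modality]) for s in samples
             if s.get(modality) is not None]
    if not pairs:
        return 0.0  # no training sample contains this modality
    total = sum(sum((t - m) ** 2 for t, m in zip(text, other))
                for text, other in pairs)
    return total / len(pairs)


def combined_alignment_loss(samples: List[Sample],
                            coefficients: Dict[str, float]) -> float:
    """Weighted sum of the image, audio and video alignment loss values."""
    return sum(coeff * modality_alignment_loss(samples, modality)
               for modality, coeff in coefficients.items())


samples: List[Sample] = [
    {"text": [1.0, 0.0], "image": [0.0, 0.0], "audio": None, "video": [1.0, 1.0]},
    {"text": [0.0, 2.0], "image": [0.0, 0.0]},  # audio and video absent
]
coefficients = {"image": 1.0, "audio": 1.0, "video": 0.5}
l_align = combined_alignment_loss(samples, coefficients)
```

Because an absent modality yields 0, the single weighted sum covers every case in the enumeration: a zero term simply drops out of the total.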

[0069] In this embodiment, based on each non-text global representation and text global representation in the same set of training samples, the alignment loss value between the non-text compressed output corresponding to each non-text modality and the text compressed output of the text modality is determined. Based on the alignment loss value between the non-text compressed output corresponding to each non-text modality and the text compressed output of the text modality, the alignment loss function value corresponding to all sets of training samples is determined. This method can determine the alignment loss function value used to reflect whether the non-text compressed output corresponding to at least one non-text modality is aligned with the text compressed output corresponding to the text modality, so as to update the initial compression layer corresponding to different modalities according to the alignment loss function value and obtain an initial compression layer with higher alignment accuracy.

[0070] Another embodiment of the present invention provides a multimodal data processing method for processing data including text modalities and at least one non-text modality.

[0071] In another embodiment, as shown in Figure 7, a multimodal data processing method is provided. Taking the application of the method to the server in Figure 1 as an example, the method includes the following steps: S701: Obtain at least one data group to be processed, each data group to be processed including first data to be processed corresponding to the text modality and second data to be processed corresponding to the non-text modality; S702: Using the trained multimodal model, process the first data to be processed corresponding to the text modality and the second data to be processed corresponding to the non-text modality in each data group to be processed, to obtain the target text data corresponding to each data group to be processed.

[0072] The data group to be processed refers to a group of data comprising multiple modalities that the multimodal model needs to process. The first data to be processed is the text modal data within the data group. The second data to be processed is the non-text modal data within the data group.

[0073] As an example, in step S701, the server acquires at least one data group to be processed, each comprising first data to be processed corresponding to the text modality and at least one second data to be processed corresponding to a non-text modality. For example, in an intelligent interaction system, at least one data group to be processed is acquired, each data group including one piece of text data and at least one of a piece of image data, a piece of audio data, and a piece of video data. For example, the text data "Please identify the animals in the picture and video", one piece of image data, and one video data segment can be regarded as one data group to be processed.

[0074] The target text data refers to the text data output by the trained multimodal model after processing each group of data to be processed.

[0075] As an example, in step S702, the server inputs the first data to be processed corresponding to the text modality and the second data to be processed corresponding to at least one non-text modality in each data group to be processed into the multimodal model trained by the multimodal model training method in any of the above embodiments, and outputs the target text data corresponding to each data group to be processed. For example, a data group to be processed consisting of the text "Please identify the animals in the picture and video", an image, and a video is input into the trained multimodal model. The initial encoder for the text modality encodes the text data to obtain a text encoded output. The initial encoder for the image modality encodes the image data to obtain a non-text encoded output of the image. The initial encoder for the video modality encodes the video data to obtain a non-text encoded output of the video. The initial compression layer for the text modality compresses the text encoded output to obtain a text compressed output of a preset size. The initial compression layer for the image modality compresses the non-text encoded output of the image to obtain a non-text compressed output of the image of the preset size. The initial compression layer for the video modality compresses the non-text encoded output of the video to obtain a non-text compressed output of the video of the preset size. The text compressed output, the non-text compressed output of the image, and the non-text compressed output of the video, all of the same preset size, are then input into the trained initial language model, which outputs the target text data corresponding to the data group. In this example, the output target text data contains the names of the animals in the image and video. The server processes every data group to be processed using the above steps to obtain the target text data for each group.
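The data flow of step S702 (per-modality encoding, per-modality compression to one preset size, then fusion by the language model) can be sketched in Python with stand-in components. The toy encoder, compressor, and language model below are hypothetical placeholders that only reproduce the shapes involved; they are not the trained components described in this embodiment.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def run_multimodal_model(
    data_group: Dict[str, List[str]],
    encoders: Dict[str, Callable[[List[str]], Matrix]],
    compressors: Dict[str, Callable[[Matrix], Matrix]],
    language_model: Callable[[List[Matrix]], str],
) -> str:
    """Encode and compress each modality, check sizes, then fuse."""
    compressed: List[Matrix] = []
    for modality, raw in data_group.items():
        encoded = encoders[modality](raw)
        compressed.append(compressors[modality](encoded))
    sizes = {(len(m), len(m[0])) for m in compressed}
    if len(sizes) != 1:
        raise ValueError("compressed outputs must share one preset size")
    return language_model(compressed)


def toy_encoder(raw: List[str]) -> Matrix:
    # Stand-in: one 2-dimensional row per input item, so lengths differ by modality.
    return [[float(len(item)), 1.0] for item in raw]


def toy_compressor(encoded: Matrix, rows: int = 2) -> Matrix:
    # Stand-in: truncate or zero-pad to `rows` rows (not the actual compression layer).
    width = len(encoded[0])
    kept = [list(row) for row in encoded[:rows]]
    kept += [[0.0] * width for _ in range(rows - len(kept))]
    return kept


def toy_language_model(compressed: List[Matrix]) -> str:
    # Stand-in: report what was fused instead of generating text.
    return (f"fused {len(compressed)} compressed outputs "
            f"of size {len(compressed[0])}x{len(compressed[0][0])}")


data_group = {
    "text": ["Please", "identify", "the", "animals"],
    "image": ["patch0"],
    "video": ["frame0", "frame1", "frame2"],
}
modalities = list(data_group)
result = run_multimodal_model(
    data_group,
    encoders={m: toy_encoder for m in modalities},
    compressors={m: toy_compressor for m in modalities},
    language_model=toy_language_model,
)
```

The size check mirrors the requirement that all compressed outputs entering the language model have the same preset size, regardless of how long each encoded output was.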

[0076] In this embodiment, the trained multimodal model processes the first data to be processed in the text modality and at least one second data to be processed in the at least one non-text modality in each data group to be processed, and outputs the target text data corresponding to each data group to be processed, thereby achieving the purpose of analyzing and processing the multimodal data groups to be processed and completing the text generation task corresponding to each multimodal data group to be processed.

[0077] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0078] In one embodiment, a multimodal model training apparatus is provided, which corresponds one-to-one with the multimodal model training method described in the above embodiments. As shown in Figure 8, the multimodal model training device includes a training sample acquisition module 801, an encoding processing module 802, a compression processing module 803, an initial text data determination module 804, a target loss function value determination module 805, and a multimodal model determination module 806. Detailed descriptions of each functional module are as follows: The training sample acquisition module 801 is used to acquire multiple training samples; each training sample includes first training data of the text modality and second training data of the non-text modality. The encoding processing module 802 is used to encode each training sample using initial encoders corresponding to multiple modalities, and to determine multiple initial encoding outputs corresponding to each training sample. The initial encoding outputs include text encoding outputs corresponding to the first training data and non-text encoding outputs corresponding to the second training data. The compression processing module 803 is used to compress the initial encoding output corresponding to each training sample using initial compression layers corresponding to multiple modalities, and to determine an initial compressed output of the same size corresponding to each training sample. The initial compressed output includes the text compressed output corresponding to the first training data and the non-text compressed output corresponding to the second training data. The initial text data determination module 804 is used to perform fusion processing on the initial compressed output corresponding to each training sample using an initial language model to determine the initial text data corresponding to each training sample.
The target loss function value determination module 805 performs loss calculation based on the text compression output, the non-text compression output and the initial text data corresponding to each training sample to determine the target loss function value. The multimodal model determination module 806 is used to determine a multimodal model based on the initial encoder corresponding to multiple modalities, the initial compression layer corresponding to multiple modalities, and the initial language model when the target loss function value satisfies the preset convergence condition.

[0079] In one embodiment, the compression processing module 803 includes: The compression processing submodule is used to compress the preset guide words and multiple target inputs corresponding to each training sample using the current initial decoder corresponding to multiple modalities, and to determine multiple initial decoding outputs of the same size corresponding to each training sample. The target input is the initial encoding output or the initial decoding output corresponding to the previous initial decoder. The first judgment submodule is used to update the multiple initial decoding outputs of the same size corresponding to each training sample to the multiple target inputs corresponding to each training sample when the current initial decoder corresponding to multiple modalities is not the last initial decoder of the initial compression layer. The module then repeatedly executes the compression processing of the preset guide words and the multiple target inputs corresponding to each training sample using the current initial decoder corresponding to multiple modalities. The second judgment submodule is used to determine the multiple initial decoding outputs of the same size corresponding to each training sample as the initial compressed outputs of the same size corresponding to each training sample when the current initial decoder corresponding to multiple modalities is the last initial decoder of the initial compression layer.
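The iteration carried out by these submodules can be sketched in Python. The sketch assumes that each initial decoder is reduced to a single-head cross-attention step in which the preset guide words act as queries over the current target input, with no learned projections; this is a simplification for illustration, and the names `cross_attention` and `compress` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(guide_words: Matrix, target_input: Matrix) -> Matrix:
    """Each guide word attends over all rows of the target input."""
    d = len(guide_words[0])
    output: Matrix = []
    for query in guide_words:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in target_input]
        weights = softmax(scores)
        output.append([sum(w * row[i] for w, row in zip(weights, target_input))
                       for i in range(d)])
    return output


def compress(guide_words: Matrix, encoded: Matrix, num_decoders: int) -> Matrix:
    """Pass the target input through the decoders in sequence."""
    target_input = encoded  # the first decoder receives the initial encoding output
    for _ in range(num_decoders):
        # Not the last decoder yet: its output becomes the next target input.
        target_input = cross_attention(guide_words, target_input)
    return target_input  # output of the last decoder = initial compressed output


guide_words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                 # s = 2, d = 3
text_encoded = [[0.1 * i, 0.2, 0.3] for i in range(5)]            # 5 x 3
image_encoded = [[0.5, 0.1 * i, 0.0] for i in range(7)]           # 7 x 3
text_compressed = compress(guide_words, text_encoded, num_decoders=2)
image_compressed = compress(guide_words, image_encoded, num_decoders=2)
```

Under this assumption the output always has as many rows as there are guide words, so encoded outputs of different lengths end up with the same size.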

[0080] In one embodiment, the compression processing submodule includes: The first compression processing unit is used to compress the preset guide word and the multiple initial encoding outputs corresponding to each training sample by using the current initial decoder corresponding to multiple modalities when the current initial decoder corresponding to multiple modalities is the first layer initial decoder of the initial compression layer, and to determine the multiple initial decoding outputs of the same size corresponding to each training sample. The second compression processing unit is used to compress the preset guide word, the multiple initial decoding outputs of the previous initial decoder corresponding to each training sample, and the multiple attention outputs of the multi-head cross-attention layer in the previous initial decoder corresponding to each training sample when the current initial decoder corresponding to multiple modalities is not the first initial decoder of the initial compression layer, and to determine the multiple initial decoding outputs of the same size corresponding to each training sample.

[0081] In one embodiment, the target loss function value determination module 805 includes: The global representation acquisition submodule is used to acquire the text global representation corresponding to the first training data of the text modality and the non-text global representation corresponding to the second training data of each non-text modality in each training sample. The alignment loss function value determination submodule determines the alignment loss function value for all training samples based on the text global representation and non-text global representation corresponding to each training sample. The label loss function value determination submodule determines the label loss function value for all training samples based on the initial text data corresponding to each training sample and the preset label text data corresponding to each training sample. The target loss function value determination submodule determines the target loss function value based on the alignment loss function value and the label loss function value corresponding to each training sample.

[0082] In one embodiment, the alignment loss function value determination submodule includes: The alignment loss value determination unit is used to obtain the first alignment loss value between the text global representation and the image global representation corresponding to each training sample, the second alignment loss value between the text global representation and the audio global representation corresponding to each training sample, and / or the third alignment loss value between the text global representation and the video global representation corresponding to each training sample. The alignment loss function value determination unit determines the alignment loss function value for all training samples based on the first alignment loss value, the second alignment loss value, and / or the third alignment loss value corresponding to each training sample.

[0083] In another embodiment, a multimodal data processing apparatus is provided, which corresponds one-to-one with the multimodal data processing methods described in the above embodiments. Detailed descriptions of each functional module are as follows: The pending data group acquisition module is used to acquire at least one pending data group, each pending data group including a first pending data corresponding to a text modality and a second pending data corresponding to a non-text modality. The target text data acquisition module is used to process the first data to be processed corresponding to the text modality and the second data to be processed corresponding to the non-text modality in each data group to be processed using a trained multimodal model, so as to obtain the target text data corresponding to each data group to be processed.

[0084] Specific limitations regarding the multimodal model training device and the multimodal data processing device can be found in the limitations regarding the multimodal model training method and the multimodal data processing method described above, and will not be repeated here. Each module in the aforementioned multimodal model training device and multimodal data processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0085] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 9. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database stores data used or generated during the execution of a multimodal model training method or a multimodal data processing method. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a multimodal model training method or a multimodal data processing method.

[0086] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the multimodal model training method described in the above embodiments, for example steps S201-S206 shown in Figure 2, or the steps shown in Figures 3 to 6; to avoid repetition, these are not described again here. Alternatively, when the processor executes the computer program, it implements the functions of each module / unit in the above embodiment of the multimodal model training device, for example the functions of the training sample acquisition module 801, encoding processing module 802, compression processing module 803, initial text data determination module 804, target loss function value determination module 805, and multimodal model determination module 806 shown in Figure 8; to avoid repetition, these are not described again here. Alternatively, when the processor executes the computer program, it implements the multimodal data processing method described in the above embodiments, for example steps S701-S702 shown in Figure 7; to avoid repetition, these are not described again here. Alternatively, when the processor executes the computer program, it implements the functions of each module / unit in the above embodiment of the multimodal data processing device; to avoid repetition, these are not described again here.

[0087] In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the multimodal model training method described in the above embodiments, for example steps S201-S206 shown in Figure 2, or the steps shown in Figures 3 to 6; to avoid repetition, these are not described again here. Alternatively, when the computer program is executed by the processor, it implements the functions of each module / unit in the above embodiment of the multimodal model training device, for example the functions of the training sample acquisition module 801, encoding processing module 802, compression processing module 803, initial text data determination module 804, target loss function value determination module 805, and multimodal model determination module 806 shown in Figure 8; to avoid repetition, these are not described again here. Alternatively, when the computer program is executed by the processor, it implements the multimodal data processing method described in the above embodiments, for example steps S701-S702 shown in Figure 7; to avoid repetition, these are not described again here. Alternatively, when the computer program is executed by the processor, it implements the functions of each module / unit in the above embodiment of the multimodal data processing device; to avoid repetition, these are not described again here.

[0088] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0089] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0090] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A multi-modal model training method, comprising: obtaining multiple training samples, each training sample including first training data of the text modality and second training data of the non-text modality; encoding each training sample using initial encoders corresponding to multiple modalities, and determining multiple initial encoding outputs corresponding to each training sample, the initial encoding outputs including text encoding outputs corresponding to the first training data and non-text encoding outputs corresponding to the second training data; compressing the initial encoding output corresponding to each training sample using initial compression layers corresponding to multiple modalities, and determining an initial compressed output of the same size corresponding to each training sample, the initial compressed output including the text compressed output corresponding to the first training data and the non-text compressed output corresponding to the second training data; fusing the initial compressed output corresponding to each training sample using an initial language model to determine the initial text data corresponding to each training sample; performing loss calculation based on the text compressed output, the non-text compressed output, and the initial text data corresponding to each training sample to determine the target loss function value; and when the target loss function value satisfies the preset convergence condition, determining a multimodal model based on the initial encoders corresponding to multiple modalities, the initial compression layers corresponding to multiple modalities, and the initial language model.
2.The multi-modal model training method of claim 1, wherein, The initial compression layer includes multiple sequentially connected initial decoders; The step of using multiple modality-corresponding initial compression layers to compress the initial encoded output corresponding to each training sample, and determining the same initial compressed output for each training sample, includes: The current initial decoder corresponding to multiple modalities is used to compress the preset guide word and multiple target inputs corresponding to each training sample, and to determine multiple initial decoding outputs of the same size corresponding to each training sample. The target input is the initial encoding output or the initial decoding output corresponding to the previous initial decoder. When the current initial decoder corresponding to the multiple modalities is not the last initial decoder of the initial compression layer, the multiple initial decoder outputs of the same size corresponding to each training sample are updated to the multiple target inputs corresponding to each training sample. The process of using the current initial decoder corresponding to the multiple modalities to compress the preset guide word and the multiple target inputs corresponding to each training sample is repeated. When the current initial decoder corresponding to the multiple modalities is the last initial decoder of the initial compression layer, the multiple initial decoding outputs of the same size corresponding to each training sample are determined as the initial compression outputs of the same size corresponding to each training sample. 
3.The multi-modal model training method of claim 2, wherein, The initial decoder includes a multi-head cross-attention layer; The step of employing a current initial decoder corresponding to multiple modalities to compress a preset guide word and multiple target inputs corresponding to each training sample, and determining multiple initial decoding outputs of the same size corresponding to each training sample, includes: When the current initial decoder corresponding to the multiple modalities is the first layer initial decoder of the initial compression layer, the current initial decoder corresponding to the multiple modalities is used to compress the preset guide word and the multiple initial encoding outputs corresponding to each training sample, and to determine the multiple initial decoding outputs of the same size corresponding to each training sample; When the current initial decoder corresponding to the multiple modalities is not the first initial decoder of the initial compression layer, the current initial decoder corresponding to the multiple modalities is used to compress the preset guide word, the multiple initial decoding outputs of the previous initial decoder corresponding to each training sample, and the multiple attention outputs of the multi-head cross-attention layer in the previous initial decoder corresponding to each training sample, so as to determine the multiple initial decoding outputs of the same size corresponding to each training sample. 
4.The multi-modal model training method of claim 1, wherein, The step of calculating the loss based on the text compression output, the non-text compression output, and the initial text data corresponding to each training sample, and determining the target loss function value, includes: In each training sample, obtain the text global representation corresponding to the first training data of the text modality and the non-text global representation corresponding to the second training data of each non-text modality. Based on the text global representation and non-text global representation corresponding to each training sample, determine the alignment loss function value corresponding to all training samples; Based on the initial text data corresponding to each training sample and the preset label text data corresponding to each training sample, determine the label loss function value corresponding to all training samples; The target loss function value is determined based on the alignment loss function value and the label loss function value corresponding to each training sample.

5. The multimodal model training method of claim 4, wherein the non-text global representation includes an image global representation, an audio global representation, and/or a video global representation; and the step of determining the alignment loss function value corresponding to all training samples based on the text global representation and the non-text global representation corresponding to each training sample comprises: obtaining a first alignment loss value between the text global representation and the image global representation corresponding to each training sample, a second alignment loss value between the text global representation and the audio global representation corresponding to each training sample, and/or a third alignment loss value between the text global representation and the video global representation corresponding to each training sample; and determining the alignment loss function value corresponding to all training samples based on the first alignment loss value, the second alignment loss value, and/or the third alignment loss value corresponding to each training sample.

6. A multimodal data processing method, characterized in that the method comprises: acquiring at least one data group to be processed, each data group to be processed including first data to be processed corresponding to a text modality and second data to be processed corresponding to a non-text modality; and processing, using a multimodal model trained by the multimodal model training method according to any one of claims 1 to 5, the first data to be processed corresponding to the text modality and the second data to be processed corresponding to the non-text modality in each data group to be processed, to obtain target text data corresponding to each data group to be processed.

7. A multimodal model training apparatus, characterized in that the apparatus comprises: a training sample acquisition module, used to acquire multiple training samples, each training sample including first training data of a text modality and second training data of a non-text modality; an encoding processing module, used to encode each training sample using initial encoders corresponding to multiple modalities and to determine multiple initial encoded outputs corresponding to each training sample, the initial encoded outputs including a text encoded output corresponding to the first training data and a non-text encoded output corresponding to the second training data; a compression processing module, used to compress the initial encoded outputs corresponding to each training sample using initial compression layers corresponding to the multiple modalities and to determine initial compressed outputs of the same size corresponding to each training sample, the initial compressed outputs including a text compressed output corresponding to the first training data and a non-text compressed output corresponding to the second training data; an initial text data determination module, used to fuse the initial compressed outputs corresponding to each training sample using an initial language model to determine initial text data corresponding to each training sample; a target loss function value determination module, used to perform loss calculation based on the text compressed output, the non-text compressed output, and the initial text data corresponding to each training sample to determine a target loss function value; and a multimodal model determination module, used to determine a multimodal model based on the initial encoders corresponding to the multiple modalities, the initial compression layers corresponding to the multiple modalities, and the initial language model when the target loss function value satisfies a preset convergence condition.

8. A multimodal data processing apparatus, characterized in that the apparatus comprises: a to-be-processed data group acquisition module, used to acquire at least one data group to be processed, each data group to be processed including first data to be processed corresponding to a text modality and second data to be processed corresponding to a non-text modality; and a target text data acquisition module, used to process, using a trained multimodal model, the first data to be processed corresponding to the text modality and the second data to be processed corresponding to the non-text modality in each data group to be processed, to obtain target text data corresponding to each data group to be processed.

9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the multimodal model training method according to any one of claims 1 to 5, or it implements the multimodal data processing method according to claim 6.

10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, it implements the multimodal model training method according to any one of claims 1 to 5, or it implements the multimodal data processing method according to claim 6.
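
As a non-limiting illustration of the compression recited in claims 2 and 3 and the loss recited in claims 4 and 5, the following minimal Python sketch may help. It is not the claimed implementation, and every function name in it is hypothetical. It assumes that the preset guide word can be modeled as a fixed set of K query vectors, that a single-head dot-product attention without learned projections stands in for the multi-head cross-attention layer, that a global representation is the mean of the compressed vectors, that the alignment loss is one minus cosine similarity, that the label loss is a token-level cross-entropy, and that the two losses are combined by a weighted sum. The additional attention-output input of claim 3 is omitted for brevity.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, tokens):
    """One simplified decoder: every query vector attends over all tokens.
    Returns len(queries) vectors regardless of len(tokens)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens))
                    for i in range(len(q))])
    return out

def compress(queries, encoded, num_decoders=2):
    """Sequentially connected decoders: the first reads the encoder output,
    each later one reads the previous decoder's output."""
    target = encoded
    for _ in range(num_decoders):
        target = cross_attend(queries, target)
    return target

def global_repr(compressed):
    """Mean-pool the compressed vectors (assumed pooling, not from the source)."""
    n = len(compressed)
    return [sum(v[i] for v in compressed) / n for i in range(len(compressed[0]))]

def alignment_loss(text_vec, other_vec):
    """1 - cosine similarity (assumed form of the alignment loss)."""
    norm = math.sqrt(dot(text_vec, text_vec)) * math.sqrt(dot(other_vec, other_vec))
    return 1.0 - dot(text_vec, other_vec) / norm

def label_loss(pred_probs, label_ids):
    """Mean cross-entropy between predicted token distributions and label tokens."""
    return -sum(math.log(p[y]) for p, y in zip(pred_probs, label_ids)) / len(label_ids)

def target_loss(text_c, nontext_cs, pred_probs, label_ids, weight=1.0):
    """Weighted sum of alignment and label losses (assumed combination)."""
    text_g = global_repr(text_c)
    align = sum(alignment_loss(text_g, global_repr(c)) for c in nontext_cs)
    return weight * align + label_loss(pred_probs, label_ids)

# Toy data: K = 2 guide vectors of dimension 2, 3 text tokens, 5 image patches.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_enc = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]
image_enc = [[0.4, 0.2], [0.1, 0.8], [0.6, 0.6], [0.2, 0.1], [0.9, 0.3]]

text_c = compress(queries, text_enc)    # 2 vectors
image_c = compress(queries, image_enc)  # also 2 vectors, despite 5 inputs
loss = target_loss(text_c, [image_c], [[0.7, 0.3], [0.2, 0.8]], [0, 1])
```

In this sketch the number of query vectors alone fixes the size of the compressed output, which is why encoder outputs of different lengths from different modalities end up with the same size before fusion by the language model.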