Model training method and system, data processing method and system, and electronic device

By acquiring training data from multimodal datasets and performing concatenation, mixing, and model tuning, a target multimodal representation model is generated. This solves the problem of heavy computational and storage burden on CLIP models in complex retrieval scenarios, and achieves efficient and accurate multimodal data processing.

WO2026123749A1 · PCT designated stage · Publication Date: 2026-06-18 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: ALIBABA (CHINA) CO LTD
Filing Date: 2025-08-12
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Existing multimodal representation learning methods, such as the CLIP model, impose heavy computational and storage burdens in complex retrieval scenarios. They cannot handle multimodal data effectively, which limits information retrieval and leaves the needs of complex retrieval scenarios unmet.

Method used

Multimodal training data is acquired from the target multimodal dataset, and a multimodal pre-trained model is used to concatenate and mix that data to generate multimodal representations. The multimodal pre-trained model is then adjusted using those representations to obtain the target multimodal representation model, which processes multimodal query data in a unified manner to perform knowledge retrieval and response content generation.

Benefits of technology

It improves the efficiency and accuracy of information retrieval, reduces system overhead, enhances the ability to handle complex retrieval scenarios, and improves the user experience.


Abstract

The present disclosure relates to the technical fields of data processing and large models. Disclosed are a model training method and system, a data processing method and system, and an electronic device. The model training method comprises: acquiring multi-modal training data from a target multi-modal data set; using a multi-modal pre-trained model to perform concatenation and mixing on the multi-modal training data, so as to obtain multi-modal representations; and using the multi-modal representations to adjust the multi-modal pre-trained model, so as to obtain a target multi-modal representation model, wherein the target multi-modal representation model is used for performing knowledge retrieval and response content generation on multi-modal query data input by a user, so as to obtain a target answer. The present disclosure solves the technical problems in the related art caused by the use of a CLIP model for multi-modal representation: heavy computational and storage burdens, limitations in information retrieval, and inapplicability to composite retrieval scenarios.

Description

Model training methods, data processing methods, systems and electronic equipment

Technical Field

[0001] This disclosure relates to data processing technology and large model technology, and more specifically, to a model training method, a data processing method, a system, and an electronic device.

Background Technology

[0002] Multimodal representation plays a central role in information retrieval and representation learning. It can process and integrate various forms of data such as text, images, audio, and video, thereby providing more accurate and comprehensive information understanding capabilities in complex environments. It is a major research hotspot in the field of artificial intelligence.

[0003] Currently, multimodal representation learning mainly employs Contrastive Language-Image Pre-training (CLIP) models. However, CLIP models are generally applicable only to cross-modal retrieval scenarios (such as searching for images from text). For complex retrieval scenarios (such as searching for both images and text from text), CLIP models can only adopt a divide-and-conquer strategy. This strategy adds storage and computational overhead, which limits information retrieval and degrades the model's performance in practical applications.

[0004] There is currently no effective solution to the above problems.

Summary of the Invention

[0005] This disclosure provides a model training method, a data processing method, a system, and an electronic device to at least solve the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limited information retrieval capabilities, and inapplicability to complex retrieval scenarios.

[0006] According to one aspect of the present disclosure, a model training method is provided, comprising: acquiring multimodal training data from a target multimodal dataset; concatenating and mixing the multimodal training data using a multimodal pre-trained model to obtain a multimodal representation; and adjusting the multimodal pre-trained model using the multimodal representation to obtain a target multimodal representation model, wherein the target multimodal representation model is used to perform knowledge retrieval and response content generation on multimodal query data input by a user to obtain a target answer.

[0007] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, including: acquiring multimodal query data input by a user; using a target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data to obtain a target answer; wherein the target multimodal representation model is generated according to any of the above-described model training methods.

[0008] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, comprising: acquiring multimodal product query data, wherein the multimodal product query data is used to provide descriptive information of products to be selected; employing a target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal product query data to obtain target product information; wherein the target multimodal representation model is generated according to any of the above-described model training methods.

[0009] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, comprising: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multimodal query data; and returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes: a target answer, wherein the target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model, and the target multimodal representation model is generated according to any one of the above-described model training methods.

[0010] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, comprising: acquiring a currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multimodal query data; responding to the data processing dialogue request, returning a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: a target answer, wherein the target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model, the target multimodal representation model being generated according to any one of the above-described model training methods; and displaying the target answer in a graphical user interface.

[0011] According to another aspect of the embodiments of this disclosure, a data processing system is also provided, including: a client for sending multimodal query data; a server connected to the client for using a target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data to obtain a target answer, wherein the target multimodal representation model is generated according to any of the above-described model training methods; and the client is also used to output the target answer.

[0012] According to another aspect of the present disclosure, an electronic device is also provided, including: a memory storing an executable program; and a processor connected to the memory via a bus for running the program, wherein the program executes the methods in various embodiments of the present disclosure during runtime.

[0013] According to another aspect of the embodiments of this disclosure, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored executable program, wherein, when the executable program is running, it controls the device where the computer-readable storage medium is located to perform any of the above-described model training methods or data processing methods.

[0014] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, including a computer program that, when executed by a processor, implements any of the above-described model training methods or data processing methods.
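The API-based aspect in [0009] can be pictured as a request object carrying multimodal query data and a response object carrying the target answer. The following is only a minimal sketch under stated assumptions: the class names, field names, and `answer_fn` callable are hypothetical and do not come from the disclosure, and `stub_answer` stands in for the target multimodal representation model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataProcessingRequest:
    """Request data: multimodal query data (field names are hypothetical)."""
    text: Optional[str] = None
    image: Optional[bytes] = None


@dataclass
class DataProcessingResponse:
    """Response data: the target answer."""
    target_answer: str


def handle_request(
    request: DataProcessingRequest,
    answer_fn: Callable[[DataProcessingRequest], str],
) -> DataProcessingResponse:
    """Receive a request, delegate to the model stand-in, and wrap the answer."""
    if request.text is None and request.image is None:
        raise ValueError("query must contain at least one modality")
    return DataProcessingResponse(target_answer=answer_fn(request))


def stub_answer(request: DataProcessingRequest) -> str:
    """Placeholder for the model: reports which modalities were present."""
    present = [name for name in ("text", "image") if getattr(request, name) is not None]
    return "answer for " + "+".join(present) + " query"


response = handle_request(
    DataProcessingRequest(text="red shoes", image=b"\x89PNG"), stub_answer
)
```

Passing the model as a callable keeps the interface layer independent of how retrieval and generation are actually performed.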

[0015] In this embodiment, multimodal training data is obtained from a target multimodal dataset. Then, a multimodal pre-trained model is used to concatenate and mix the multimodal training data to obtain a multimodal representation. Finally, the multimodal representation is used to adjust the multimodal pre-trained model to obtain a target multimodal representation model. This target multimodal representation model is used to perform knowledge retrieval and response content generation on user-input multimodal query data to obtain the target answer. This achieves the goal of processing multimodal data in a unified manner, overcoming limitations in complex retrieval scenarios, enhancing the ability to handle complex retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limits information retrieval, and is unsuitable for complex retrieval scenarios.

[0016] It is worth noting that the above general description and the following detailed description are merely illustrative and explanatory and do not limit this disclosure.

Attached Figure Description

[0017] The accompanying drawings, which are included to provide a further understanding of this disclosure and form part of this disclosure, illustrate exemplary embodiments of the present disclosure and are used to explain the disclosure, but do not constitute an undue limitation of the disclosure. In the drawings:

[0018] Figure 1 is a schematic diagram of an application scenario of a model training method according to an embodiment of the present disclosure;

[0019] Figure 2 is a flowchart of a model training method according to an embodiment of the present disclosure;

[0020] Figure 3 is a schematic diagram of the unified representation based on a multimodal pre-trained large model according to an embodiment of the present disclosure;

[0021] Figure 4 is a schematic diagram of unified multimodal representation training based on a contrastive learning training method according to an embodiment of the present disclosure;

[0022] Figure 5 is a schematic diagram illustrating a unified multimodal retrieval scenario according to an embodiment of the present disclosure;

[0023] Figure 6 is a flowchart of a data processing method according to an embodiment of the present disclosure;

[0024] Figure 7 is a flowchart of a data processing method according to an embodiment of the present disclosure;

[0025] Figure 8 is a flowchart of a data processing method according to an embodiment of the present disclosure;

[0026] Figure 9 is a flowchart of a data processing method according to an embodiment of the present disclosure;

[0027] Figure 10 is a schematic diagram of the structure of a data processing system according to an embodiment of the present disclosure;

[0028] Figure 11 is a schematic diagram of a model training device according to an embodiment of the present disclosure;

[0029] Figure 12 is a schematic diagram of the structure of a data processing apparatus according to an embodiment of the present disclosure;

[0030] Figure 13 is a schematic diagram of another data processing apparatus according to an embodiment of the present disclosure;

[0031] Figure 14 is a schematic diagram of another data processing apparatus according to an embodiment of the present disclosure;

[0032] Figure 15 is a schematic diagram of the structure of another data processing apparatus according to an embodiment of the present disclosure;

[0033] Figure 16 is a structural block diagram of a computing device according to an embodiment of the present disclosure;

[0034] Figure 17 is a structural block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Implementation

[0035] To enable those skilled in the art to better understand the present disclosure, the technical solutions of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present disclosure, and not all embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present disclosure.

[0036] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0037] The technical solution disclosed herein is primarily implemented using large-scale model technology. Here, "large-scale model" refers to a deep learning model with a massive number of parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of parameters. Large-scale models, also known as foundation models, are pre-trained using large-scale unlabeled corpora to produce pre-trained models with hundreds of millions of parameters. These models are adaptable to a wide range of downstream tasks and exhibit good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.

[0038] It should be noted that, in practical applications, large models can be fine-tuned using a small number of samples to adapt them to different tasks. For example, large models can be widely applied in Natural Language Processing (NLP), computer vision, and speech processing. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. Therefore, the main application scenarios for large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design. In this embodiment, the explanation is based on a data processing scenario using a target multimodal representation model trained by the model training method provided in this embodiment.

[0039] First, some nouns or terms that appear in the description of the embodiments of this disclosure shall be interpreted as follows:

[0040] Large-scale pre-trained language models refer to deep learning models trained on large-scale corpora for natural language processing tasks. These models typically contain multi-layered neural networks, taking a text sequence as input and outputting a linguistic representation of that sequence. Pre-training means that the model is trained on large amounts of language data before it is applied to a specific task. Through pre-training, the model can capture more complex linguistic and semantic rules, thus performing well on various natural language processing tasks and reducing the need for large-scale task-specific data.

[0041] Multimodal processing generally refers to combining multiple different types of data or information patterns. Modalities can include text, images, audio, video, and so on. In the field of computing, multimodal processing means utilizing these different types of data to improve task performance or achieve more complex functions.

[0042] Multimodal large models: These generally refer to large-scale machine learning models capable of processing multiple data modalities simultaneously. Multimodal large models typically utilize complementary information from different types of data (such as text, images, and audio) to improve the overall performance of the model and are often used for complex tasks, such as text-to-image generation and image understanding.

[0043] Multimodal representation: Multimodal representation refers to unifying data from different modalities into a common feature space for modeling and analysis. This representation method helps models better understand and integrate information from multiple modalities, improving the accuracy and efficiency of information processing.

[0044] Multimodal retrieval: Multimodal retrieval is an information retrieval technique that utilizes data from multiple modalities for searching. For example, a user might provide text descriptions and images for searching, and the system will return relevant results based on the combined information from these modalities. This method can improve the accuracy of retrieval and the user experience.

[0045] Transformer: The Transformer architecture is a neural network model based on attention mechanisms, originally designed for natural language processing tasks. The original Transformer architecture consists of multiple encoder layers and multiple decoder layers. The encoder transforms the input sequence into a series of hidden states, using self-attention to focus on different parts of the input sequence. The decoder uses a similar mechanism: it applies self-attention over the output generated so far and also attends to the encoder's hidden states, which helps keep the output sequence fluent and consistent.

[0046] Contrastive learning is a machine learning method used to learn similarities and differences. This method learns representations by pulling samples of the same class closer together and pushing samples of different classes apart.
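To make the idea of contrastive learning concrete, the sketch below computes an InfoNCE-style loss over a small batch. The disclosure does not name a specific loss; InfoNCE, the `temperature` value, and the function names here are assumptions chosen for illustration. Each query is treated as matching the candidate at the same position, and the other candidates in the batch act as negatives.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries: List[Vector], candidates: List[Vector],
                  temperature: float = 0.1) -> float:
    """Mean InfoNCE loss; queries[i] is assumed to match candidates[i]."""
    q = [normalize(v) for v in queries]
    c = [normalize(v) for v in candidates]
    total = 0.0
    for i, query in enumerate(q):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [dot(query, cand) / temperature for cand in c]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching candidate.
        total += log_denominator - logits[i]
    return total / len(q)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when matching pairs lie close together and large when they do not, so `aligned_loss` comes out far below `shuffled_loss` in this toy batch.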

[0047] Multimodal representation, as a fundamental task in information retrieval and representation learning, is widely applied in downstream tasks such as multimodal retrieval and clustering. Currently, multimodal representation learning mainly employs the CLIP model. The CLIP model constructs two dedicated encoders to process the text and image modalities respectively, converting them into continuous high-dimensional vector representations, thereby aligning and matching text and images in a shared embedding space. Although the CLIP model significantly improves the performance of multimodal representation, it is generally only suitable for cross-modal retrieval, such as searching for images from text. For complex retrieval scenarios, such as searching for both images and text simultaneously, the CLIP model can only employ a divide-and-conquer strategy. This strategy adds storage and computational overhead, leading to low efficiency in information fusion and degrading the model's performance in practical applications.

[0048] The related technologies that use the CLIP model for multimodal representation learning have the following drawbacks.

[0049] Defect 1: Information retrieval has limitations. The CLIP model represents data from different modalities independently. For more complex retrieval scenarios, such as text-image search, candidate text and images must be indexed independently, which adds storage and computational overhead. Moreover, the relevance scores of retrieval results from different modalities cannot be directly compared, reducing the efficiency of information fusion and affecting the overall performance of the model in practical applications.

[0050] Defect 2: Insufficient information interaction. While the CLIP model can generate high-quality modal representations for text and images respectively, it lacks consideration for effective information interaction between single and cross-modal expressions. This separate processing approach significantly reduces the model's effectiveness when handling complex retrieval tasks, such as text-to-image or image-to-text searches, because these scenarios require the model to simultaneously consider and integrate information from multiple modalities.

[0051] Defect 3: Low utilization of pre-training capabilities. Currently, large-scale pre-trained models have achieved significant results in natural language processing, typically capable of understanding and efficiently processing information from multiple modalities. However, the CLIP model architecture fails to fully utilize the potential of these large multimodal pre-trained models, especially in multimodal understanding and generation. In other words, the CLIP model cannot effectively leverage the multimodal understanding capabilities of existing large multimodal pre-trained models.
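The contrast drawn in Defect 1 can be sketched as follows: a divide-and-conquer search keeps one index per modality and runs one search per index, while a unified search ranks every candidate in a single index with one score scale. This is an illustrative toy, not the disclosed system; the function names, the in-memory dictionary index, and the use of cosine similarity are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Index = Dict[str, Vector]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(index: Index, query: Vector, top_k: int) -> List[Tuple[float, str]]:
    """Brute-force nearest-neighbour search, highest similarity first."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


def divide_and_conquer_search(text_index: Index, image_index: Index,
                              query: Vector, top_k: int):
    # Two indexes to store and two searches to run. The result lists are
    # kept apart because their scores are not treated as comparable.
    return {
        "text": search(text_index, query, top_k),
        "image": search(image_index, query, top_k),
    }


def unified_search(index: Index, query: Vector, top_k: int):
    # One index and one search; all candidates share one score scale.
    return search(index, query, top_k)


unified_index = {
    "doc_text": [1.0, 0.0],
    "doc_image": [0.6, 0.8],
    "doc_mixed": [0.0, 1.0],
}
hits = unified_search(unified_index, [1.0, 0.0], top_k=2)
split_hits = divide_and_conquer_search(
    {"doc_text": [1.0, 0.0]}, {"doc_image": [0.6, 0.8]}, [1.0, 0.0], top_k=1
)
```

In the unified case the caller receives one ranked list; in the divide-and-conquer case the caller is left to reconcile two separate lists.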

[0052] No effective solution to the aforementioned deficiencies had been proposed prior to this disclosure.

[0053] According to embodiments of this disclosure, a model training method is provided. It should be noted that the steps shown in the flowcharts in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0054] Considering the large number of model parameters in large models and the limited computing resources of mobile terminals, the method provided in this disclosure can be applied to the application scenario shown in Figure 1, but is not limited thereto. In the application scenario shown in Figure 1, the large model is deployed on server 10. Server 10 can connect to one or more client devices 20 via a local area network (LAN), wide area network (WAN), Internet, or other types of data networks. These client devices 20 may include, but are not limited to, smartphones, tablets, laptops, PDAs, personal computers, smart home devices, and in-vehicle devices. Client devices 20 can interact with users through a graphical user interface to invoke the large model, thereby implementing the method provided in this disclosure.

[0055] In this embodiment of the disclosure, the system consisting of a client device and a server can perform the following steps: the client device performs steps such as obtaining multimodal training data from the target multimodal dataset and sending the multimodal training data to the server; the server performs steps such as concatenating and mixing the multimodal training data using a multimodal pre-trained model to obtain a multimodal representation, adjusting the multimodal pre-trained model using the multimodal representation to obtain a target multimodal representation model, and returning the target multimodal representation model to the client device. It should be noted that, provided that the operating resources of the client device can meet the deployment and operation conditions of a large model, this embodiment of the disclosure can be performed on the client device.

[0056] It should be noted that with the rapid development of high-performance computing units, the methods provided in this disclosure can also be applied to integrated model machines in other application scenarios. In one optional embodiment, the integrated model machine has multiple built-in models. Users can select one model and adjust it as needed to obtain their own model. The high-performance computing unit built into the integrated model machine can then directly call the adjusted model to execute the methods provided in this disclosure. In another optional embodiment, the integrated model machine has a built-in pre-trained model. Therefore, the high-performance computing unit built into the integrated model machine can directly call this model to execute the methods provided in this disclosure.

[0057] Furthermore, when users need to train their own models, they can upload their own datasets via the client. These datasets are then sent to the server, allowing the server to adjust the pre-trained model using the dataset to obtain the user's customized model, which can then be deployed to the production environment. To facilitate users' model adjustment needs, the server provides complete adjustment tools, development frameworks, and processes, supporting multiple adjustment strategies. This allows the adjusted model to better adapt to different application domains and achieve a high degree of customization.

[0058] Under the above operating environment, this disclosure provides a model training method as shown in Figure 2. Figure 2 is a flowchart of a model training method according to an embodiment of this disclosure. As shown in Figure 2, the method may include the following steps:

[0059] Step S21: Obtain multimodal training data from the target multimodal dataset;

[0060] Step S22: Use a multimodal pre-trained model to concatenate and mix the multimodal training data to obtain a multimodal representation;

[0061] Step S23: Adjust the multimodal pre-trained model using multimodal representation to obtain the target multimodal representation model. The target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal query data input by the user to obtain the target answer.
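The control flow of steps S21 to S23 can be outlined as a short loop. This is a skeleton under stated assumptions rather than the disclosed implementation: `encode` stands in for the multimodal pre-trained model producing a representation, `update` stands in for whatever parameter adjustment is applied and is assumed to return a loss value, and batching by `batch_size` is an illustrative choice.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Vector = List[float]
Sample = Dict[str, str]


def train_representation_model(
    dataset: Iterable[Sample],
    encode: Callable[[Sample], Vector],
    update: Callable[[Sequence[Vector]], float],
    batch_size: int = 2,
) -> List[float]:
    """Run S21-S23 batch by batch and return the value reported by each update."""
    losses: List[float] = []
    batch: List[Sample] = []

    def flush() -> None:
        # S22: obtain a multimodal representation for every sample in the batch.
        representations = [encode(sample) for sample in batch]
        # S23: adjust the model using those representations.
        losses.append(update(representations))
        batch.clear()

    # S21: draw multimodal training data from the target dataset.
    for sample in dataset:
        batch.append(sample)
        if len(batch) == batch_size:
            flush()
    if batch:  # final partial batch
        flush()
    return losses


samples = [{"text": "a"}, {"image": "b"}, {"text": "c", "image": "d"}]
history = train_representation_model(
    samples,
    encode=lambda sample: [float(len(sample))],        # toy stand-in
    update=lambda reps: sum(rep[0] for rep in reps),   # toy stand-in
    batch_size=2,
)
```

With the toy stand-ins above, three samples and a batch size of two yield two update calls, one for the full batch and one for the remaining sample.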

[0062] In this embodiment of the disclosure, multimodal training data is first obtained from a target multimodal dataset. The target multimodal dataset can be understood as a pre-constructed large-scale, diverse, and high-quality multimodal dataset. For example, large-scale multimodal data, such as images, text, and videos, can be obtained from multiple sources, including the Internet, public datasets, and business data, through web scraping or other data acquisition methods (not limited here), and then used to construct the target multimodal dataset. The target multimodal dataset may include image-text pairing data, video-description text pairing data, and so on, which are not limited here.

[0063] Multimodal training data obtained from the target multimodal dataset can be understood as multimodal relevance training data, that is, training data used to train multimodal pre-trained models. For example, multimodal training data can include information from multiple modalities, such as text, images, audio, and video, where the different modalities are semantically related or complementary. The multimodal training data may include: image-text pairing data, where each pair includes a text description and a related image; text-audio pairing data, where each pair includes a text and a related audio clip; text-video pairing data, where each pair includes a text description and a video clip whose content matches the description; and cross-modal pairing data, that is, combinations of multiple modalities such as composite data of text, images, audio, and video, which is used to train the model's fusion ability when processing complex multimodal information. It may additionally include unimodal data to help the model better understand the features and representations of a specific modality. None of this is limiting.

[0064] After obtaining multimodal training data from the target multimodal dataset, a multimodal pre-trained model is used to concatenate and mix the multimodal training data to obtain multimodal representations. The multimodal pre-trained model can be understood as a pre-trained large-scale multimodal model (which can be referred to as the foundation of the large-scale multimodal pre-trained model), meaning it is pre-trained on large-scale multimodal data to obtain a multimodal pre-trained model with rich cross-modal representation capabilities. This multimodal pre-trained model can process and understand multiple types of data modalities, such as text, images, audio, or video, and can establish associations and understanding between these different modalities.

[0065] For example, the multimodal pre-trained model can be a large model or other deep learning model, without limitation here.

[0066] Using a multimodal pre-trained model to concatenate and mix multimodal training data can be understood as using the multimodal pre-trained model to concatenate inputs from any modality, such as unimodal and mixed modal inputs, in a certain order or structure to obtain a unified input vector, which serves as the input to the multimodal pre-trained model. For example, for a combination of text and images, the feature vector of the image and the encoded sequence of the text can be concatenated in some way to form a unified input vector that contains the multimodal information of the original data. In other words, data from different modalities (such as text, images, audio, etc.) are integrated at the input end of the multimodal pre-trained model into a unified input format so that the multimodal pre-trained model can process and understand these inputs simultaneously.
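One simple way to picture this concatenation step is to join per-modality embedding sequences end to end while recording which modality each position came from. The sketch below assumes every modality has already been projected to the same embedding width `dim`; the `Segment` class, the modality tags, and the toy vectors are hypothetical and only illustrate the idea of a unified input sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class Segment:
    """Embeddings for one modality: one vector per text token or image patch."""
    modality: str
    embeddings: List[Vector]


def concatenate_segments(segments: List[Segment],
                         dim: int) -> Tuple[List[Vector], List[str]]:
    """Join segments in the given order into one input sequence.

    Returns the sequence plus a parallel list of modality tags, so later
    stages can still tell which positions came from which modality.
    """
    sequence: List[Vector] = []
    modality_tags: List[str] = []
    for segment in segments:
        for vector in segment.embeddings:
            if len(vector) != dim:
                raise ValueError(
                    f"{segment.modality} embedding has width {len(vector)}, expected {dim}"
                )
            sequence.append(vector)
            modality_tags.append(segment.modality)
    return sequence, modality_tags


text = Segment("text", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
image = Segment("image", [[0.7, 0.8], [0.9, 1.0]])
sequence, tags = concatenate_segments([image, text], dim=2)
```

A unimodal input is the special case of a single segment, and a mixed input passes several segments in whatever order the model expects.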

[0067] Multimodal representation can be understood as a unified, high-dimensional vector representation generated by a multimodal pre-trained model after processing input data from different modalities. This representation comprehensively reflects the features and semantics of these data. Multimodal representation integrates different modal information into a single continuous feature space, enabling the comparison, retrieval, and fusion of data from different modalities, such as text and images, within a common framework. This helps the multimodal pre-trained model understand the relationships between different modalities, providing a foundation for subsequent multimodal retrieval, understanding, and generation tasks.

[0068] Therefore, feeding the concatenated and mixed input to the multimodal pre-trained model to obtain a unified multimodal representation enables the model to operate in a unified feature space. This can improve the accuracy and efficiency of information retrieval and avoids the limitation of traditional methods, which must process the data of each modality independently.

[0069] After obtaining the multimodal representations, the multimodal pre-trained model is adjusted using these representations to obtain the target multimodal representation model. This can be understood as adjusting the parameters of the multimodal pre-trained model using the multimodal representations, so that the representations generated by the multimodal pre-trained model in the final output layer can more accurately reflect the semantics of the input data and establish effective correspondences between the representations of different modalities, thereby improving the accuracy of retrieval and understanding.

[0070] For example, the parameters of the multimodal pre-trained model can be further fine-tuned using supervised or unsupervised learning methods to optimize the performance of the multimodal representation. The adjusted multimodal pre-trained model, i.e., the target multimodal representation model, is used to perform knowledge retrieval and response content generation on the multimodal query data input by the user to obtain the target answer. It can be understood that the target multimodal representation model can process the multimodal query data input by the user more efficiently, realize knowledge retrieval and response content generation, and thus obtain a more accurate target answer.

[0071] In this embodiment, multimodal training data is obtained from a target multimodal dataset. Then, a multimodal pre-trained model is used to concatenate and mix the multimodal training data to obtain a multimodal representation. Finally, the multimodal representation is used to adjust the multimodal pre-trained model to obtain a target multimodal representation model. This target multimodal representation model is used to perform knowledge retrieval and response content generation on user-input multimodal query data to obtain the target answer. Therefore, by obtaining training data from a specially constructed target multimodal dataset, this disclosure allows the model to access a wide range of diverse multimodal samples, which helps the model learn the complex relationships between different modalities and improves its generalization ability. Moreover, because the multimodal training data contains combinations of multiple modalities, the model is more robust in real-world composite modality retrieval scenarios and can cope with different types of input data.

[0072] Meanwhile, this disclosure integrates data from different modalities into a unified representation framework through concatenation and mixing operations, generating a comprehensive multimodal representation. This helps to achieve a more unified and efficient retrieval strategy in cross-modal retrieval tasks, and also enables the model to learn the deep semantic relationships among data of various modalities, thereby enhancing the model's ability to understand complex retrieval scenarios.

[0073] Furthermore, this disclosure optimizes the model by employing multimodal representations, enabling more accurate and efficient matching in retrieval tasks. It also allows for more efficient information extraction in multimodal fusion, demonstrating superior adaptability and efficiency across different application scenarios. Since it eliminates the need to build separate retrieval models for each modality, it reduces computational and storage resource consumption. Simultaneously, the target multimodal representation model trained by this disclosure overcomes the limitations of traditional CLIP schemes in complex retrieval scenarios. This disclosure can handle queries and candidates of different modalities in a unified manner, avoiding the additional burden of divide-and-conquer strategies, improving the overall efficiency and accuracy of the system when handling complex retrieval tasks, and enhancing the user experience.

[0074] The model training method provided in this disclosure can be applied to, but is not limited to, application scenarios involving the training of multimodal representation models in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services. For example, it can be used to train multimodal representation models in e-commerce services, education services, and legal services. No limitation is imposed here.

[0075] This embodiment of the invention obtains multimodal training data from a target multimodal dataset. Then, a multimodal pre-trained model is used to concatenate and mix the multimodal training data to obtain a multimodal representation. Finally, the multimodal representation is used to adjust the multimodal pre-trained model to obtain a target multimodal representation model. This target multimodal representation model is used to perform knowledge retrieval and response content generation on user-input multimodal query data to obtain the target answer. This achieves the goal of processing multimodal data in a unified manner, overcoming limitations in complex retrieval scenarios, enhancing capability in such scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. Furthermore, it solves the technical problem in the related art where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limits information retrieval, and cannot be applied to complex retrieval scenarios.

[0076] In one optional embodiment, the model training method further includes the following steps:

[0077] Step S24: Obtain the initial multimodal dataset;

[0078] Step S25: Perform data cleaning and preprocessing on the initial multimodal dataset to obtain an intermediate multimodal dataset;

[0079] Step S26: Perform data alignment and data annotation on the intermediate multimodal dataset to obtain the target multimodal dataset.

[0080] In this embodiment of the disclosure, when acquiring the target multimodal dataset, an initial multimodal dataset can be acquired first. This can be understood as initial data collection, which can be done through web crawling technology or other data acquisition methods to collect multimodal data from a wide range of resources. For example, text descriptions, images, videos, and other data can be crawled from social media, news websites, video sharing platforms, etc., without limitation. Then, multimodal data from different sources are integrated to construct a preliminary initial multimodal dataset.

[0081] After obtaining the initial multimodal dataset, data cleaning and preprocessing can be performed to obtain an intermediate multimodal dataset. This can be understood as performing noise reduction, deduplication, and format conversion on the data in the initial multimodal dataset to ensure data quality and consistency.

[0082] For example, duplicate, irrelevant, or low-quality data can be removed from the dataset. This includes identifying and deleting duplicate images or text descriptions, and filtering out data with weak multimodal relevance, such as irrelevant images or noisy text, without limitation here. Data can also be formatted and standardized to ensure that it can be correctly recognized and processed by the multimodal pre-trained model. This includes image preprocessing (such as resizing and color standardization) and text preprocessing (such as word segmentation, stop word removal, and standardized encoding).
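The cleaning step can be sketched as follows, purely as an illustration. The record layout (`text`, `image_bytes`), the tiny stop-word list, and the function names are assumptions; the disclosure does not prescribe a concrete procedure, and relevance filtering is reduced here to dropping records that lack one of the two modalities.

```python
import hashlib
import re
from typing import Any, Dict, List

STOP_WORDS = {"a", "an", "the", "of", "is"}  # illustrative only


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def clean_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove incomplete and duplicate records, keeping first occurrences."""
    seen = set()
    cleaned = []
    for rec in records:
        text = normalize_text(str(rec.get("text", "")))
        image = rec.get("image_bytes", b"")
        if not text or not image:
            continue  # drop records missing either modality
        key = (text, hashlib.sha256(image).hexdigest())
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"text": text, "image_bytes": image})
    return cleaned


raw = [
    {"text": "A cat is sleeping.", "image_bytes": b"\x01\x02"},
    {"text": "a cat  is SLEEPING", "image_bytes": b"\x01\x02"},  # duplicate
    {"text": "", "image_bytes": b"\x03"},                        # no text
]
cleaned = clean_records(raw)
print(len(cleaned))  # only the first record survives
```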

[0083] After obtaining the intermediate multimodal dataset, data alignment and annotation can be performed on it to obtain the target multimodal dataset. This can be understood as establishing connections between different modalities, such as text alignment, image-text alignment, and audio-video synchronization, to ensure the semantic relevance of the multimodal data.

[0084] For example, data from different modalities such as text, images, and videos are matched to ensure that the data from different modalities are semantically related. For instance, matching text describing a scene with an image depicting that scene, or matching text descriptions of video content with video clips. The aligned data is then labeled to explicitly identify which data pairs are positive samples (semantically related) and which are negative samples (semantically unrelated), thus guiding the model to optimize the similarity of positive samples during training while minimizing the similarity of negative samples. After data alignment and labeling, the data undergoes further standardization and formatting to ensure it conforms to the structure and format requirements of the target multimodal dataset. For example, the length of text descriptions is standardized, and images are adjusted to a consistent resolution and color space to facilitate subsequent processing by the multimodal pre-trained model.

[0085] The goal of data alignment is to establish clear associations between different modalities, ensuring that models can understand and learn the same scene or concept from multiple perspectives. For example, when aligning text and image data, it is crucial to ensure that the text description matches the image content, that is, content alignment. For instance, if the text description mentions "blue bicycle," then the image should also contain elements matching that description. Furthermore, location information alignment is also necessary. For a specific object or scene in an image, if the text description mentions its location, such as "blue bicycle under a tree," then the locations of "under a tree" and "blue bicycle" need to be labeled in the image. This can be achieved through bounding boxes or image segmentation. Additionally, when processing multimodal data that pairs video or audio with text, temporal information alignment is also required, ensuring that the text description corresponds to video or audio clips at specific points in time. This can be achieved through timestamp annotation.
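One way to picture the content, location, and temporal alignment described above is a record that carries the text, its labeled regions, and an optional time span. The class names, the pixel-coordinate box convention, and the `example://` identifier below are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RegionLabel:
    phrase: str                     # text span, e.g. "blue bicycle"
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class AlignedSample:
    text: str
    media_uri: str
    regions: List[RegionLabel] = field(default_factory=list)
    time_span: Optional[Tuple[float, float]] = None  # (start_s, end_s)


def validate(sample: AlignedSample, width: int, height: int) -> List[str]:
    """Return a list of alignment problems; an empty list means none found."""
    problems = []
    for region in sample.regions:
        x0, y0, x1, y1 = region.box
        if region.phrase.lower() not in sample.text.lower():
            problems.append(f"phrase not in text: {region.phrase}")
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            problems.append(f"box out of bounds: {region.phrase}")
    if sample.time_span is not None:
        start, end = sample.time_span
        if not (0 <= start < end):
            problems.append("invalid time span")
    return problems


sample = AlignedSample(
    text="Blue bicycle under a tree",
    media_uri="example://image-001",  # hypothetical identifier
    regions=[
        RegionLabel("blue bicycle", (40, 120, 200, 260)),
        RegionLabel("tree", (0, 0, 300, 200)),
    ],
)
print(validate(sample, width=640, height=480))  # no problems expected
```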

[0086] Understandably, when processing cross-modal information alignment, it is necessary to ensure semantic consistency between different modal data, such as the correspondence between text subtitles in a video and the speech content in an audio file.

[0087] Data annotation is used to add additional semantic information to multimodal datasets, helping models better understand the data. For example, object recognition and classification are required, such as identifying and labeling all objects in images or videos, like pedestrians, vehicles, and animals, and classifying them as "person," "car," or "cat." Then, sentiment and emotion annotation is performed, such as classifying and labeling sentiment-related data containing emotional information, like social media posts or emotionally relevant audio, as "happy," "sad," or "angry." Next, scene descriptions are added, such as detailed descriptions of image or video scenes, like "a football match" or "a family gathering," ensuring the descriptions match the actual scene. Furthermore, this also includes relationship annotation, time information annotation, keyword or entity annotation, etc., without limitation.

[0088] Therefore, this disclosure constructs a high-quality target multimodal dataset that not only contains diverse multimodal data but has also been optimized through cleaning, preprocessing, alignment, and annotation. This provides a solid foundation for the subsequent training of the multimodal pre-trained model, thereby helping to improve the model's performance in multimodal retrieval, clustering, and other tasks.

[0089] In an optional embodiment, the multimodal training data includes: multiple modality data pairs, each modality data pair including first modality data and second modality data. In step S22, the multimodal training data is concatenated and mixed using a multimodal pre-trained model to obtain a multimodal representation, including the following method steps:

[0090] Step S221: Use a multimodal pre-trained model to extract features from the first modality data to obtain a first feature vector, extract features from the second modality data to obtain a second feature vector, and concatenate and mix the first feature vector and the second feature vector to obtain a multimodal representation.

[0091] In this embodiment of the disclosure, the multimodal training data includes multiple pairs of different modal data composed of first modal data and second modal data. This can be understood as the first modal data and the second modal data forming a modal pair, that is, the first modal data and the second modal data forming a pair of correlated data. For example, in a text-image modal data pair, the first modal data is the text description, and the second modal data is an image related to the text. Similarly, in a text-audio modal data pair, the first modal data is the text description, and the second modal data is audio related to the text.

[0092] For example, the first modal data can be denoted as s1 and the second modal data as s2, so that the modal data pair can be denoted as pair(s1, s2).

[0093] When using a multimodal pre-trained model to concatenate and mix multimodal training data to obtain a multimodal representation, the multimodal pre-trained model can be used to extract features from the first modality data to obtain a first feature vector, and to extract features from the second modality data to obtain a second feature vector. This can be understood as encoding the data of the two modalities and converting them into continuous high-dimensional vector representations v1 and v2.

[0094] For example, for first-modal data (such as text), the multimodal pre-trained model performs deep processing through its encoder (based on a Transformer architecture) to transform the text into a series of hidden states, ultimately obtaining a first feature vector v1 that represents the semantic information of the text. Similarly, for second-modal data (such as images), the multimodal pre-trained model also performs deep feature extraction through its encoder (possibly using a dedicated image processing network), transforming the image into a series of feature representations, ultimately obtaining a second feature vector v2 representing the content of the image.

[0095] After obtaining the first and second feature vectors, they are concatenated and mixed to obtain a multimodal representation. This can be understood as generating a unified multimodal representation by concatenating or fusing these two feature vectors. For example, the first and second feature vectors can be connected in a certain order or manner to form a high-dimensional vector containing information from both modalities. Therefore, unlike the modality-independent encoding method in the traditional CLIP model, this disclosure adopts a shared encoder strategy, where the multimodal pre-trained model simultaneously receives and processes the first and second feature vectors.

[0096] Therefore, the powerful capabilities of multimodal pre-trained models can be effectively utilized to transform data from different modalities into a unified multimodal representation, providing high-quality input for subsequent pre-set training strategies. This enables the model to learn the semantic relationships between different modalities in the feature space, thereby improving the model's multimodal understanding and retrieval capabilities.

[0097] In an optional embodiment, in step S221, a multimodal pre-trained model is used to concatenate and mix the first feature vector and the second feature vector to obtain a multimodal representation, including the following method steps:

[0098] Step S2211: The first feature vector and the second feature vector are concatenated and mixed using a multimodal pre-trained model to obtain the concatenated sequence output by the last transformation layer of the multimodal pre-trained model, and a preset marker is found in the concatenated sequence to obtain the multimodal representation, wherein the preset marker is the end marker position of the concatenated sequence.

[0099] In this embodiment of the disclosure, when using a multimodal pre-trained model to concatenate and mix the first feature vector and the second feature vector to obtain a multimodal representation, the multimodal pre-trained model can be used to concatenate and mix the first feature vector and the second feature vector to obtain the concatenated sequence output by the last transformation layer of the multimodal pre-trained model, and a preset marker can be searched in the concatenated sequence to obtain the multimodal representation.

[0100] This can be understood as using a multimodal pre-trained model to concatenate the first and second feature vectors together, forming a sequence that is then input into the multimodal pre-trained model. The last transformation layer of the multimodal pre-trained model is the last Transformer structure within the model. Therefore, the multimodal pre-trained model treats the first and second feature vectors as a unified input sequence and performs deep transformation and fusion processing through the Transformer structure within the model.

[0101] The last transformation layer of the multimodal pre-trained model outputs a deeply processed concatenated sequence containing the fused information of the first and second feature vectors. The preset marker is then located within this concatenated sequence to obtain the multimodal representation.

[0102] The preset marker is the end marker of the concatenated sequence, that is, the last [EOS] flag in the sequence direction. In other words, the output of the last Transformer layer at the position of this trailing [EOS] flag is taken as the multimodal representation of the input, which can be denoted as $s_i$.

[0103] As can be seen, this disclosure adds a preset token, namely the [EOS] token, to the end of the input sequence. This token corresponds to a specific position in the model's output layer, and the output vector at this position will be used as the final multimodal representation of the input data. The concatenated sequence output by the multimodal pre-trained model contains multiple vectors, among which the output vector corresponding to the [EOS] token is a multimodal representation that has undergone depth transformation and fusion processing, and can integrate information from both the first and second modalities.
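The selection of the [EOS] position can be sketched as below. This is a minimal illustration that assumes the last layer's outputs are already available as one vector per input position; the token strings, the toy values, and the function name `eos_representation` are hypothetical.

```python
from typing import List, Sequence

EOS = "[EOS]"


def eos_representation(
    tokens: Sequence[str], last_layer: Sequence[Sequence[float]]
) -> List[float]:
    """Return the last-layer output vector at the final [EOS] position."""
    if len(tokens) != len(last_layer):
        raise ValueError("tokens and last-layer outputs must have equal length")
    # Search from the end so that the trailing marker is the one selected.
    for pos in range(len(tokens) - 1, -1, -1):
        if tokens[pos] == EOS:
            return list(last_layer[pos])
    raise ValueError("input sequence has no [EOS] marker")


# Two image positions, two text positions, then the appended marker.
tokens = ["<img_0>", "<img_1>", "a", "cat", EOS]
last_layer = [
    [0.1, 0.0],
    [0.2, 0.1],
    [0.0, 0.3],
    [0.4, 0.4],
    [0.7, 0.6],  # output at the [EOS] position
]
print(eos_representation(tokens, last_layer))
```

Because the marker sits at the end of the sequence regardless of which modalities precede it, the same lookup applies to text-only, image-only, and mixed inputs.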

[0104] Therefore, by adding the [EOS] marker to the end of the input sequence, this disclosure ensures that the multimodal pre-trained model generates a unified multimodal representation, rather than independent modal representations, when processing composite modal inputs. Consequently, regardless of which modality or modalities the input data contains, the model can generate the final multimodal representation in a consistent manner. This not only simplifies model design but also improves the efficiency and accuracy of downstream tasks such as multimodal retrieval and clustering. It avoids the additional storage and computational burden caused by the divide-and-conquer strategy in previous solutions, while ensuring that the relevance scores calculated from different modalities are comparable in a unified feature space.

[0105] In an optional embodiment, in step S23, the multimodal pre-trained model is adjusted using multimodal representation to obtain the target multimodal representation model, including the following method steps:

[0106] Step S231: Construct positive sample pairs and negative sample pairs corresponding to multimodal representations using a preset training strategy, wherein positive sample pairs are related sample pairs of multimodal representations and negative sample pairs are unrelated sample pairs of multimodal representations.

[0107] Step S232: Based on the multimodal representation, positive sample pairs, and negative sample pairs, adjust the multimodal pre-trained model to obtain the target multimodal representation model.

[0108] In this embodiment of the disclosure, the preset training strategy can be a contrastive learning training strategy, a two-stream network training strategy, a joint learning training strategy, a self-supervised learning training strategy, etc., and is not limited here.

[0109] In this embodiment of the disclosure, a contrastive learning training strategy is used as an example. It can be understood that the goal of the contrastive learning training strategy is to maximize the similarity of positive sample pairs while minimizing the similarity of negative sample pairs. Typically, during training, the contrastive learning training strategy is used to optimize the model's representational ability by constructing multiple pairs of negative sample data for each pair of positive sample data and calculating the similarity difference between the positive and negative sample pairs.

[0110] When adjusting the multimodal pre-trained model using a preset training strategy and multimodal representations to obtain a target multimodal representation model, a contrastive learning training strategy can be used to construct positive and negative sample pairs corresponding to the multimodal representations. Here, positive sample pairs are related sample pairs of multimodal representations, i.e., combinations of semantically related data from different modalities. For an input $s_i$, its positive sample (relevant sample) can be denoted as $s_i^{+}$.

[0111] Negative sample pairs are unrelated sample pairs of multimodal representations, i.e., combinations of semantically unrelated data from different modalities. It can be understood that, to enhance the model's contrastive learning ability, multiple negative sample pairs can be constructed for each positive sample pair. Therefore, for an input $s_i$, its negative samples (irrelevant samples) can be denoted as $s_{i,j}^{-}$.

[0112] After constructing positive and negative sample pairs, the multimodal pre-trained model can be adjusted based on the multimodal representation, positive and negative sample pairs to obtain the target multimodal representation model.

[0113] Therefore, by constructing positive and negative sample pairs, and based on multimodal representations, positive and negative sample pairs, the multimodal pre-trained model is specifically adjusted. This ultimately yields a target multimodal representation model capable of efficiently representing and retrieving multimodal data in a unified feature space, improving the model's performance and efficiency in multimodal retrieval and understanding scenarios. In other words, by utilizing contrastive learning training strategies and multimodal representations to adjust the multimodal pre-trained model, it is possible to more accurately capture the semantic relationships between multimodal data and achieve consistent representations in the feature space.

[0114] In an optional embodiment, in step S231, constructing positive and negative sample pairs corresponding to multimodal representations using a preset training strategy includes the following method steps:

[0115] Step S2311: Using a preset training strategy, obtain positive sample pairs corresponding to multimodal representations from the same batch of sample pairs, and determine the remaining sample pairs other than positive sample pairs in the same batch of sample pairs as negative sample pairs.

[0116] In this embodiment of the disclosure, when constructing positive and negative sample pairs corresponding to multimodal representations using a preset training strategy, a contrastive learning training strategy can be used to obtain positive sample pairs corresponding to multimodal representations from the same batch of sample pairs, and to determine the remaining sample pairs other than positive sample pairs in the same batch of sample pairs as negative sample pairs.

[0117] Understandably, during the training of multimodal pre-trained models, data is typically processed in batches. A batch contains multiple sample pairs consisting of data from the first modality and data from the second modality, such as text and image pairings. These sample pairs constitute part of the training data and are processed simultaneously in one iteration.

[0118] Within a batch, a sample pair is considered a positive sample pair if the two modalities it contains are semantically related, meaning they describe or depict the same concept or entity. For example, if the text describes "a cat is sleeping" and the image shows "a sleeping cat," then this text-image pair constitutes a positive sample pair.

[0119] In this disclosure, all other sample pairs in a batch, except for positive sample pairs, are considered negative sample pairs. Negative sample pairs are semantically unrelated, meaning they do not describe or represent the same concept or entity. For example, if a sample pair consists of the text "a cat is sleeping" and the image "a car is driving," then this text-image pair constitutes a negative sample pair.

[0120] For example, during training, data can be selected from a batch of multimodal datasets, where each pair of data (i.e., a modality data pair) is considered a positive sample pair. Then, an in-batch negative approach is employed, where all data combinations within the same batch, except for the positive sample pairs, are considered negative sample pairs. Thus, each positive sample pair has multiple negative sample pairs for comparative learning, thereby helping the model distinguish between semantically similar and dissimilar samples in the feature space.
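The in-batch negative approach can be illustrated with index pairs, where `(i, j)` means "first modality of sample pair i combined with second modality of sample pair j". The function name and the example file names are hypothetical; the sketch assumes, as described above, that every pair in the batch is itself a positive pair.

```python
from typing import List, Tuple

IndexPair = Tuple[int, int]


def in_batch_pairs(batch_size: int) -> Tuple[List[IndexPair], List[IndexPair]]:
    """Split all cross combinations in a batch into positives and negatives."""
    positives = [(i, i) for i in range(batch_size)]
    negatives = [
        (i, j)
        for i in range(batch_size)
        for j in range(batch_size)
        if i != j
    ]
    return positives, negatives


batch = [
    ("a cat is sleeping", "cat.jpg"),  # hypothetical file names
    ("a car is driving", "car.jpg"),
    ("a football match", "match.jpg"),
]
positives, negatives = in_batch_pairs(len(batch))
for i, j in negatives[:2]:
    # Text of pair i combined with the image of a different pair j.
    print(batch[i][0], "<->", batch[j][1])
```

With a batch of B pairs, each text therefore has one positive and B - 1 negatives without any additional sampling.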

[0121] In an optional embodiment, in step S232, the multimodal pre-trained model is adjusted based on the multimodal representation, positive sample pairs, and negative sample pairs to obtain the target multimodal representation model, including the following method steps:

[0122] Step S2321: Calculate the target loss based on multimodal representation, positive sample pairs, and negative sample pairs;

[0123] Step S2322: Update the model parameters of the multimodal pre-trained model based on the target loss to obtain the target multimodal representation model.

[0124] In this embodiment of the disclosure, when adjusting the multimodal pre-trained model based on multimodal representations, positive sample pairs, and negative sample pairs to obtain the target multimodal representation model, the target loss can be calculated based on the multimodal representations, positive sample pairs, and negative sample pairs. That is, in order for the multimodal pre-trained model to distinguish between positive and negative sample pairs, a contrastive learning loss function can be used to calculate the target loss, maximizing the similarity between positive sample pairs while minimizing the similarity between negative sample pairs. For example, for similar image-text pairs, the representations of images and text are optimized in a common embedding space, so that image-text pairs with the same semantics are close together, while those with different semantics are far apart.

[0125] For example, the target loss can be denoted as $\mathcal{L}$, where $s_i$ is the input sample, $s_i^{+}$ is a positive sample with the same semantics as, or a close relationship to, $s_i$, and $s_{i,j}^{-}$ denotes a series of negative samples, where $j$ is the index of the negative sample, ranging from 1 to $n$.

[0126] After calculating the target loss, the model parameters of the multimodal pre-trained model are updated based on the target loss to obtain the target multimodal representation model. This can be understood as updating the parameters of the multimodal pre-trained model based on the target loss to optimize its multimodal representation capability.

[0127] For example, this can be achieved through the backpropagation algorithm, where the target loss guides the model to adjust its internal structure and weights, enabling the model to more accurately distinguish between positive and negative sample pairs in the next iteration. This means increasing the similarity of relevant samples and decreasing the similarity of irrelevant samples. Through continuous iteration and updating of model parameters, the multimodal pre-trained model can be gradually optimized, allowing it to represent and distinguish data from different modalities more efficiently and accurately within a unified feature space.
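A single parameter update of this kind can be sketched with a toy model. The sketch assumes a linear projection `W` in place of the multimodal pre-trained model, dot-product similarity, a softmax-style contrastive loss, and plain gradient descent; none of these specifics are stated in the disclosure, and all names are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def project(W: Matrix, x: Vector) -> Vector:
    """Apply the toy linear 'encoder' W to the input x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def loss_and_grad(W: Matrix, x: Vector, candidates: List[Vector]) -> Tuple[float, Matrix]:
    """Contrastive loss and its gradient w.r.t. W; candidates[0] is the positive."""
    q = project(W, x)
    logits = [sum(a * b for a, b in zip(q, c)) for c in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[0])
    # dL/dq = sum_k probs[k] * c_k - c_0, then chain rule through q = W x.
    dq = [
        sum(p * c[d] for p, c in zip(probs, candidates)) - candidates[0][d]
        for d in range(len(q))
    ]
    grad = [[dq_r * xi for xi in x] for dq_r in dq]
    return loss, grad


def sgd_step(W: Matrix, grad: Matrix, lr: float) -> Matrix:
    """Move the parameters against the gradient."""
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad)]


W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 0.5]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # positive first
before, grad = loss_and_grad(W, x, candidates)
W = sgd_step(W, grad, lr=0.1)
after, _ = loss_and_grad(W, x, candidates)
print(before, after)  # the loss should decrease for a small enough step
```

In practice the gradient would be obtained by automatic differentiation through the full model rather than written out by hand.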

[0128] In an optional embodiment, in step S2321, calculating the target loss based on the multimodal representation, positive sample pairs, and negative sample pairs includes the following method steps:

[0129] Step S23111: Calculate the first similarity between the multimodal representation and the positive sample pair, and perform an exponential transformation on the first similarity to obtain the first calculation result;

[0130] Step S23112: Calculate the second similarity between the multimodal representation and the negative sample pair, and perform an exponential transformation on the second similarity to obtain the second calculation result;

[0131] Step S23113: Calculate the target loss based on the first calculation result and the second calculation result.
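The three steps above can be sketched as follows. Cosine similarity is used because it is the similarity named for Figure 4; combining the two calculation results as a negative log ratio is an assumption (the usual contrastive form), and the function names and vectors are illustrative only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def target_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
) -> float:
    # Step S23111: first similarity, then exponential transformation.
    first = math.exp(cosine(anchor, positive))
    # Step S23112: second similarity for each negative, then exponential.
    second = [math.exp(cosine(anchor, neg)) for neg in negatives]
    # Step S23113: combine both results (assumed negative log ratio).
    return -math.log(first / (first + sum(second)))


anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
well_aligned = target_loss(anchor, [0.9, 0.1], negatives)
poorly_aligned = target_loss(anchor, [-0.9, 0.1], negatives)
print(well_aligned < poorly_aligned)  # a closer positive gives a lower loss
```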

[0132] In this embodiment of the disclosure, when calculating the target loss based on the multimodal representation, positive sample pairs, and negative sample pairs, a first similarity between the multimodal representation and the positive sample pair can be calculated, and an exponential transformation is performed on the first similarity to obtain a first calculation result. With $s_i$ denoting the multimodal representation and $s_i^{+}$ its positive sample, the first similarity can be denoted as $\operatorname{sim}(s_i, s_i^{+})$, and the first calculation result can be denoted as $\exp(\operatorname{sim}(s_i, s_i^{+}))$.

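[0133] Then, a second similarity is calculated between the multimodal representation and the negative sample pair, and an exponential transformation is performed on the second similarity to obtain a second calculation result. With $s_i$ denoting the multimodal representation and $s_{i,j}^{-}$ its $j$-th negative sample, the second similarity can be denoted as $\operatorname{sim}(s_i, s_{i,j}^{-})$, and the second calculation result can be denoted as $\exp(\operatorname{sim}(s_i, s_{i,j}^{-}))$.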

[0134] Finally, the target loss is calculated based on the first calculation result $\exp(\operatorname{sim}(s_i, s_i^{+}))$ and the second calculation result $\exp(\operatorname{sim}(s_i, s_{i,j}^{-}))$, for example, as shown in formula (1).
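Formula (1) is not reproduced in this text. A form consistent with steps S23111 to S23113 is the standard contrastive (InfoNCE-style) objective sketched below. This is an assumed reconstruction: the symbols $s_i^{+}$ (positive sample), $s_{i,j}^{-}$ (the $j$-th of $n$ negative samples), and $\operatorname{sim}$ (for example, cosine similarity) are notation chosen here, and any temperature scaling the original formula may apply is omitted because the text does not mention one.

```latex
\mathcal{L}_i
  = -\log
    \frac{\exp\bigl(\operatorname{sim}(s_i, s_i^{+})\bigr)}
         {\exp\bigl(\operatorname{sim}(s_i, s_i^{+})\bigr)
          + \sum_{j=1}^{n} \exp\bigl(\operatorname{sim}(s_i, s_{i,j}^{-})\bigr)}
```

Under this form, the numerator is the first calculation result and the sum in the denominator collects the second calculation results, so the loss falls as the positive similarity rises relative to the negative similarities.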

[0135] Figure 3 is a schematic diagram of a unified representation based on a multimodal pre-trained large model according to an embodiment of the present disclosure. As shown in Figure 3, the multimodal pre-trained large model receives and processes different types of input data and generates corresponding outputs. It can be seen that the input data of the multimodal pre-trained large model can include data from multiple modalities such as text, images, videos, and mixed modalities. The multimodal pre-trained large model can process the input data to output vectors such as text vectors, image vectors, video vectors, and mixed modal vectors. That is, the multimodal pre-trained large model of the present disclosure can receive inputs from different modalities and generate a unified multimodal representation through its internal processing mechanism.

[0136] Figure 4 is a schematic diagram of unified multimodal representation training based on contrastive learning according to an embodiment of the present disclosure. Its core objective is to optimize the model so that it learns to map semantically related data from different modalities to nearby positions in the representation space. As shown in Figure 4, when training unified multimodal representations based on contrastive learning, queries and candidates can be input into a multimodal large model (MMLM). The MMLM processes the query and the candidate separately and converts each output into a numerical vector representation through embedding, and the cosine similarity between the two representations is then calculated. Because both are embedded by the same model, the query and the candidate are expressed in a unified feature representation in which their similarity can be compared directly.

[0137] Figure 5 is a schematic diagram illustrating a unified multimodal retrieval scenario according to an embodiment of the present disclosure. As shown in Figure 5, three different retrieval scenarios are depicted, including unimodal retrieval, cross-modal retrieval, and fusion modal retrieval. Each retrieval scenario has its specific query and candidate objects, and the unified multimodal representation model can flexibly handle these different retrieval requirements. Specifically, in the unimodal retrieval scenario, candidate text can be determined based on the query text, or visual candidates can be determined based on the visual query; this is not limited here. In the cross-modal retrieval scenario, visual candidates can be determined based on the query text, or candidate text can be determined based on the visual query; this is not limited here. In the fusion modal retrieval scenario, candidates can be determined based on the fusion modal query, or fusion modal candidates can be determined based on the query; this is not limited here.
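One way to picture how a single model can serve all three scenarios is that every candidate, whatever its modality, lives in the same vector space, so retrieval reduces to one ranking step. The sketch below assumes this reading; the `Candidate` type, the modality labels, and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    item_id: str
    modality: str        # "text", "image", or "fused" (illustrative labels)
    vector: List[float]  # embedding in the shared representation space


def _cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_vector: List[float], index: List[Candidate], top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank all candidates by similarity to the query, ignoring modality."""
    scored = [(c.item_id, _cosine(query_vector, c.vector)) for c in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# A mixed-modality index: the ranking code never branches on modality.
index = [
    Candidate("doc-text", "text", [1.0, 0.0]),
    Candidate("doc-image", "image", [0.8, 0.6]),
    Candidate("doc-fused", "fused", [0.0, 1.0]),
]
results = retrieve([1.0, 0.1], index, top_k=2)
```

In this sketch, unimodal, cross-modal, and fusion modal retrieval differ only in which modalities the query and the candidates happen to come from, not in the ranking logic.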

[0138] As can be seen, this disclosure proposes a unified multimodal representation model based on a native multimodal large model. It aims to improve the overall efficiency and accuracy of downstream retrieval tasks by integrating and optimizing unified multimodal representation capabilities through a novel architecture and training strategy. This model not only achieves more efficient feature extraction in multimodal fusion but also exhibits superior adaptability and robustness in different application scenarios.

[0139] This disclosure successfully implements a unified multimodal representation model by constructing multimodal correlation training data, using a multimodal pre-trained large model foundation, and combining representation acquisition and contrastive learning training strategies. This model possesses powerful cross-modal understanding and representation capabilities, can play an important role in various application scenarios, and has significant technical advantages and broad application prospects.

[0140] It is easy to understand that the beneficial effects of the model training method provided in this disclosure include the following points.

[0141] Beneficial effect (1), strong unified representation ability: Through large-scale multimodal pre-training, the model can learn the deep semantic associations between different modalities and achieve a unified multimodal representation.

[0142] Beneficial effect (2), excellent generalization ability: Diverse training data and contrastive learning strategies enable the model to generalize well across different tasks and different data distributions.

[0143] Beneficial effect (3), wide applicability: The multimodal representation model disclosed herein can be applied in multiple fields such as image retrieval, cross-modal retrieval, video retrieval, and intelligent recommendation, and has broad application prospects.

[0144] Beneficial effect (4), high training efficiency: Advanced pre-training technology and optimization strategies are adopted to improve training efficiency and shorten model training time.

[0145] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.

[0146] Furthermore, it should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this disclosure. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this disclosure.

[0147] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, they can also be implemented by hardware. Based on this understanding, the technical solutions of this disclosure, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this disclosure.

[0148] According to an embodiment of this disclosure, a data processing method as shown in Figure 6 is also provided. Figure 6 is a flowchart of a data processing method according to an embodiment of this disclosure. As shown in Figure 6, the method includes:

[0149] Step S61: Obtain the multimodal query data input by the user;

[0150] Step S62: Use the target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data to obtain the target answer; wherein, the target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0151] In this embodiment of the disclosure, multimodal query data can be understood as multimodal data to be queried input by the user. Exemplarily, multimodal query data can be single-modal, such as a query of pure text or pure image; it can also be composite-modal, such as a query combining text description and image, and is not limited here.

[0152] After obtaining the multimodal query data input by the user, a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal query data to obtain the target answer. The target answer can be understood as a solution closely related to the multimodal query data. The target multimodal representation model is generated according to any of the model training methods described above, which will not be elaborated here.

[0153] In this embodiment, multimodal query data input by the user is acquired, and then a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal query data to obtain the target answer. This target multimodal representation model is generated according to any of the model training methods described above. Therefore, by using the target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data input by the user, richer and more accurate search results and response content can be provided to the user, thereby obtaining the target answer.
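Steps S61 and S62 can be read as a three-stage flow: represent the query, retrieve related knowledge, then generate the response. The sketch below shows only that control flow. The function `answer_query`, the `Query` dictionary shape, and the toy `embed`, `search`, and `generate` components are all hypothetical placeholders, not the disclosed model.

```python
from typing import Callable, Dict, List, Optional

# A query may carry text, an image reference, or both (assumed shape).
Query = Dict[str, Optional[str]]


def answer_query(
    query: Query,
    embed: Callable[[Query], List[float]],
    search: Callable[[List[float], int], List[str]],
    generate: Callable[[Query, List[str]], str],
    top_k: int = 3,
) -> str:
    """Represent the query, retrieve knowledge, then generate the answer."""
    query_vector = embed(query)              # unified multimodal representation
    retrieved = search(query_vector, top_k)  # knowledge retrieval
    return generate(query, retrieved)        # response content generation


# Toy stand-ins so the flow can be exercised end to end.
KNOWLEDGE = ["passage A", "passage B", "passage C"]


def toy_embed(query: Query) -> List[float]:
    return [float(len(query.get("text") or "")), 1.0 if query.get("image") else 0.0]


def toy_search(vector: List[float], top_k: int) -> List[str]:
    return KNOWLEDGE[:top_k]


def toy_generate(query: Query, passages: List[str]) -> str:
    return f"{query['text']} -> " + "; ".join(passages)


answer = answer_query(
    {"text": "red shoes", "image": None}, toy_embed, toy_search, toy_generate, top_k=2
)
```

Passing the three stages in as callables keeps the sketch independent of any particular model or vector index.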

[0154] The data processing methods provided in this disclosure can be applied to, but are not limited to, data retrieval applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example data retrieval for e-commerce services, data retrieval for education services, and data retrieval for legal services; this is not limited here.

[0155] By employing the embodiments of this disclosure, multimodal query data input by the user is obtained, and then a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal query data to obtain the target answer. The target multimodal representation model is generated according to any of the above-mentioned model training methods, thereby achieving the goal of processing multimodal data in a unified manner. This overcomes the limitations in complex retrieval scenarios, enhances the capabilities of complex retrieval scenarios, improves the efficiency and accuracy of information retrieval, reduces system overhead, and enhances the user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limits information retrieval, and makes it unsuitable for complex retrieval scenarios.

[0156] It should be noted that the preferred implementation of this embodiment can be found in the relevant descriptions in the embodiments, and will not be repeated here.

[0157] According to an embodiment of this disclosure, a data processing method as shown in Figure 7 is also provided. Figure 7 is a flowchart of a data processing method according to an embodiment of this disclosure. As shown in Figure 7, the method includes:

[0158] Step S71: Obtain multimodal product query data, wherein the multimodal product query data is used to provide description information of the products to be purchased;

[0159] Step S72: Use the target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal product query data to obtain target product information; wherein, the target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0160] In this embodiment of the disclosure, multimodal product query data can be understood as multimodal product data input by the user to be queried. This multimodal product query data is used to provide descriptive information about the products to be purchased, enabling the user to quickly find products that meet their needs. Exemplarily, multimodal product query data can be single-modal, such as a query consisting solely of text or images; it can also be multimodal, such as a query combining text descriptions and images. Multimodal product query data may include, for example, multimodal data such as product name, description, images, and user reviews, and is not limited here.

[0161] After obtaining the multimodal product query data input by the user, a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal product query data to obtain target product information. The target product information can be understood as product information closely related to the multimodal product query data. The target multimodal representation model is generated according to any of the model training methods described above, which will not be elaborated here.

[0162] In this embodiment of the disclosure, multimodal product query data is acquired, which provides descriptive information about the products to be purchased. Then, a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal product query data to obtain target product information. The target multimodal representation model is generated according to any of the model training methods described above. Therefore, by using the target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal product query data, richer and more accurate search results and response content can be provided to users, thereby obtaining target product information.

[0163] The data processing methods provided in this disclosure can be applied to, but are not limited to, data retrieval applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example data retrieval for e-commerce services, data retrieval for education services, and data retrieval for legal services; this is not limited here.

[0164] By employing embodiments of this disclosure, multimodal product query data is acquired, which provides descriptive information about the products to be purchased. Then, a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal product query data to obtain target product information. The target multimodal representation model is generated according to any of the aforementioned model training methods. This achieves the goal of processing multimodal data in a unified manner, thereby overcoming the limitations in complex retrieval scenarios, enhancing the capabilities of complex retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for complex retrieval scenarios.

[0165] It should be noted that the preferred implementation of this embodiment can be found in the relevant descriptions in the embodiments, and will not be repeated here.

[0166] According to an embodiment of this disclosure, a data processing method as shown in Figure 8 is also provided. Figure 8 is a flowchart of a data processing method according to an embodiment of this disclosure. As shown in Figure 8, the method includes:

[0167] Step S81: Obtain a data processing request through the first application programming interface, wherein the request data carried in the data processing request includes: multimodal query data;

[0168] Step S82: Return a data processing response through the second application programming interface. The response data carried in the data processing response includes: the target answer, which is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. The target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0169] The first and second application programming interfaces (APIs) mentioned above can be the same API or different APIs. In one optional embodiment, the interface parameters of the first and second APIs may include, but are not limited to: a global interface identifier, an interface signing key, an interface timestamp, an interface request identifier, and a system call credential identifier. The first API can use GET or POST as the interface request method to obtain the data processing request. The second API can use JSON format to return the data processing response.
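To make the interface parameters more concrete, the sketch below builds a request body carrying an interface identifier, timestamp, request identifier, and credential identifier, signs it with the signing key, and returns a JSON response. Everything specific here is an assumption made for illustration: the field names, the use of HMAC-SHA256 over a canonical JSON body, and the function names are hypothetical and are not specified by the disclosure.

```python
import hashlib
import hmac
import json
import time
import uuid


def _canonical(body: dict) -> bytes:
    # Stable serialization so signer and verifier hash identical bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_signed_request(query, interface_id, signing_key, credential_id,
                         timestamp=None, request_id=None) -> dict:
    """Assemble a request body (hypothetical field names) and sign it."""
    body = {
        "interface_id": interface_id,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "request_id": request_id or uuid.uuid4().hex,
        "credential_id": credential_id,
        "query": query,  # the multimodal query data
    }
    signature = hmac.new(signing_key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def verify_request(request: dict, signing_key: bytes) -> bool:
    """Recompute the signature over everything except the signature field."""
    body = {k: v for k, v in request.items() if k != "signature"}
    expected = hmac.new(signing_key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, request.get("signature", ""))


def build_response(request_id: str, target_answer: str) -> str:
    """Return the data processing response as a JSON string."""
    return json.dumps({"request_id": request_id, "target_answer": target_answer})


request = build_signed_request(
    {"text": "waterproof hiking boots", "image": None},
    interface_id="iface-demo", signing_key=b"demo-key", credential_id="cred-demo",
    timestamp=1, request_id="req-1",
)
```

Signing the timestamp and request identifier together with the query is one common way to let the receiver reject altered or replayed requests; the disclosure does not state how its signing key is used.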

[0170] In this embodiment of the disclosure, the data processing request is a request to process the multimodal query data carried, and the data processing response is a response to the data processing request, carrying the target answer.

[0171] The system obtains a data processing request through a first application programming interface (API), which carries request data including multimodal query data. Then, it returns a data processing response through a second API, which carries response data including the target answer. The target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. The target multimodal representation model is generated according to any of the model training methods described above, which will not be elaborated here. Therefore, by using the target multimodal representation model to perform knowledge retrieval and response content generation on the user-input multimodal query data, the system can provide users with richer and more accurate search results and response content, thereby obtaining the target answer.

[0172] The data processing methods provided in this disclosure can be applied to, but are not limited to, data retrieval applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example data retrieval for e-commerce services, data retrieval for education services, and data retrieval for legal services; this is not limited here.

[0173] By employing the embodiments of this disclosure, a data processing request is obtained through a first application programming interface (API), wherein the request data carried in the data processing request includes multimodal query data. Then, a data processing response is returned through a second API, wherein the response data carried in the data processing response includes a target answer. The target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. This achieves the goal of processing multimodal data in a unified manner, thereby overcoming the limitations in composite retrieval scenarios, enhancing the capabilities of composite retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for composite retrieval scenarios.

[0174] According to an embodiment of this disclosure, a data processing method as shown in Figure 9 is also provided. Figure 9 is a flowchart of a data processing method according to an embodiment of this disclosure. As shown in Figure 9, the method includes:

[0175] Step S91: Obtain the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multimodal query data;

[0176] Step S92: In response to the data processing dialogue request, return a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: the target answer, which is obtained by using a target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data, and the target multimodal representation model is generated according to any of the above-mentioned model training methods;

[0177] Step S93: Display the target answer within the graphical user interface.

[0178] In this embodiment of the disclosure, a data processing dialogue request can be understood as a dialogue request initiated by a user when engaging in dialogue with artificial intelligence. The request data carried in the data processing dialogue request includes multimodal query data. A data processing dialogue response can be understood as a response to the data processing dialogue request. The information carried in the data processing dialogue response includes the target answer.

[0179] In this embodiment of the disclosure, a data processing dialogue request is obtained from the current input. The request data includes multimodal query data. Then, in response to the data processing dialogue request, a data processing dialogue response is returned. This response includes a target answer, which is obtained by using a target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data. The target multimodal representation model is generated according to any of the aforementioned model training methods. Finally, the target answer is displayed in the graphical user interface to provide feedback to the user. Therefore, by using a target multimodal representation model to perform knowledge retrieval and response content generation on the user-input multimodal query data, richer and more accurate search results and response content can be provided to the user, thereby obtaining the target answer.

[0180] The data processing methods provided in this disclosure can be applied to, but are not limited to, data retrieval applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example data retrieval for e-commerce services, data retrieval for education services, and data retrieval for legal services; this is not limited here.

[0181] By employing embodiments of this disclosure, a data processing dialogue request is obtained from the current input, wherein the request data carried in the data processing dialogue request includes multimodal query data. Then, in response to the data processing dialogue request, a data processing dialogue response is returned, wherein the information carried in the data processing dialogue response includes the target answer. The target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. The target multimodal representation model is generated according to any of the aforementioned model training methods. Finally, the target answer is displayed in the graphical user interface to provide feedback to the user. This achieves the goal of processing multimodal data in a unified manner, thereby overcoming the limitations in complex retrieval scenarios, enhancing the capabilities of complex retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving the user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for complex retrieval scenarios.

[0182] It should be noted that the preferred implementation of this embodiment can be found in the relevant descriptions in the embodiments, and will not be repeated here.

[0183] According to an embodiment of this disclosure, a data processing system as shown in Figure 10 is also provided. Figure 10 is a schematic diagram of the structure of a data processing system according to an embodiment of this disclosure. As shown in Figure 10, the system includes:

[0184] The client is used to send multimodal query data;

[0185] The server connects to the client and is used to perform knowledge retrieval and response content generation on multimodal query data using a target multimodal representation model to obtain the target answer. The target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0186] The client is also used to output the target answer.

[0187] In this embodiment of the disclosure, the data processing system includes a client and a server. The client is used to interact with the user, obtain multimodal query data input by the user, and send the multimodal query data to the server.

[0188] The server connects with the client to perform knowledge retrieval and response content generation on multimodal query data using a target multimodal representation model to obtain the target answer. The target multimodal representation model is generated according to any of the model training methods described above, which will not be elaborated upon here.

[0189] The server is also used to feed the target answer back to the client, and the client outputs the target answer back to the user. Therefore, by using a target multimodal representation model to perform knowledge retrieval and response content generation on the user-input multimodal query data, richer and more accurate search results and response content can be provided to the user, thus obtaining the target answer.

[0190] The data processing methods provided in this disclosure can be applied to, but are not limited to, data retrieval applications in fields such as e-commerce services, education services, legal services, medical services, conference services, social networking services, financial product services, logistics services, and navigation services, for example data retrieval for e-commerce services, data retrieval for education services, and data retrieval for legal services; this is not limited here.

[0191] By employing the embodiments of this disclosure and the aforementioned data processing system, the goal of processing multimodal data in a unified manner is achieved, thereby overcoming the limitations in composite retrieval scenarios, enhancing the capabilities of composite retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and enhancing user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for composite retrieval scenarios.

[0192] According to an embodiment of this disclosure, an apparatus embodiment for implementing the above-described model training method is also provided. Figure 11 is a schematic structural diagram of a model training apparatus according to an embodiment of this disclosure. As shown in Figure 11, the apparatus includes:

[0193] The first acquisition module 1101 is configured to acquire multimodal training data from the target multimodal dataset;

[0194] The mixing module 1102 is configured to use a multimodal pre-trained model to concatenate and mix the multimodal training data to obtain a multimodal representation;

[0195] The adjustment module 1103 is configured to adjust the multimodal pre-trained model using the multimodal representation to obtain a target multimodal representation model, wherein the target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal query data input by the user to obtain the target answer.

[0196] Optionally, the device further includes: a processing module configured to acquire an initial multimodal dataset; perform data cleaning and preprocessing on the initial multimodal dataset to obtain an intermediate multimodal dataset; and perform data alignment and annotation on the intermediate multimodal dataset to obtain a target multimodal dataset.
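The three dataset stages named above (initial, intermediate, target) can be sketched as two small functions. The record fields, the specific cleaning rules (dropping records that lack a modality, normalizing whitespace and case), and the label lookup are assumptions chosen for illustration; the disclosure does not specify them.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]  # hypothetical record: id, text, image


def clean_and_preprocess(records: List[Record]) -> List[Record]:
    """Initial -> intermediate: drop incomplete records, normalize text."""
    cleaned = []
    for record in records:
        text, image = record.get("text"), record.get("image")
        if not text or not text.strip() or not image:
            continue  # a usable pair needs both modalities
        cleaned.append({
            "id": record["id"],
            "text": " ".join(text.split()).lower(),
            "image": image,
        })
    return cleaned


def align_and_annotate(records: List[Record], labels: Dict[str, str]) -> List[Record]:
    """Intermediate -> target: keep aligned pairs that have a label, attach it."""
    return [dict(r, label=labels[r["id"]]) for r in records if r["id"] in labels]


initial = [
    {"id": "a", "text": "  Red   Shoes ", "image": "a.jpg"},
    {"id": "b", "text": "", "image": "b.jpg"},        # empty text: dropped
    {"id": "c", "text": "Blue hat", "image": None},   # missing image: dropped
    {"id": "d", "text": "Green Bag", "image": "d.jpg"},
]
intermediate = clean_and_preprocess(initial)
target = align_and_annotate(intermediate, {"a": "footwear"})
```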

[0197] Optionally, the multimodal training data includes: multiple modal data pairs, each including first modal data and second modal data. The above-mentioned mixing module 1102 is further configured to: use a multimodal pre-trained model to extract features from the first modal data to obtain a first feature vector, extract features from the second modal data to obtain a second feature vector, and concatenate and mix the first feature vector and the second feature vector to obtain a multimodal representation.

[0198] Optionally, the above-mentioned mixing module 1102 is further configured to: use a multimodal pre-trained model to concatenate and mix the first feature vector and the second feature vector to obtain the concatenated sequence output by the last transformation layer of the multimodal pre-trained model, and search for a preset marker in the concatenated sequence to obtain a multimodal representation, wherein the preset marker is the end marker bit of the concatenated sequence.
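The idea of reading the representation at the end marker can be sketched as follows: the two inputs are concatenated into one sequence that ends with a marker, and the last-layer state at the marker's position is taken as the multimodal representation. The marker string `<eos>`, the token names, and the toy hidden states are hypothetical; a real model would produce the states itself.

```python
from typing import List, Sequence

EOS = "<eos>"  # hypothetical end marker for the concatenated sequence


def build_sequence(first_tokens: Sequence[str], second_tokens: Sequence[str]) -> List[str]:
    """Concatenate the two modalities and close the sequence with the end marker."""
    return list(first_tokens) + list(second_tokens) + [EOS]


def pool_at_end_marker(tokens: Sequence[str],
                       last_layer_states: Sequence[List[float]]) -> List[float]:
    """Return the last-layer state found at the end marker position."""
    if len(tokens) != len(last_layer_states):
        raise ValueError("one hidden state is expected per token")
    # Search backwards, since the marker closes the sequence.
    for position in range(len(tokens) - 1, -1, -1):
        if tokens[position] == EOS:
            return last_layer_states[position]
    raise ValueError("end marker not found in sequence")


tokens = build_sequence(["img0", "img1"], ["red", "shoes"])
# Toy stand-in for the last layer's output: one 2-d state per position.
states = [[float(i), float(i) * 2] for i in range(len(tokens))]
representation = pool_at_end_marker(tokens, states)
```

In a model whose layers let each position attend to all earlier positions, the state at the final marker can summarize the whole concatenated input, which is one plausible motivation for pooling there.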

[0199] Optionally, the adjustment module 1103 is further configured to: construct positive sample pairs and negative sample pairs corresponding to multimodal representations using a preset training strategy, wherein the positive sample pairs are related sample pairs of multimodal representations and the negative sample pairs are unrelated sample pairs of multimodal representations; and adjust the multimodal pre-trained model based on the multimodal representations, positive sample pairs, and negative sample pairs to obtain the target multimodal representation model.

[0200] Optionally, the adjustment module 1103 is further configured to: use a preset training strategy to obtain positive sample pairs corresponding to multimodal representations from the same batch of sample pairs, and to determine the remaining sample pairs other than positive sample pairs in the same batch of sample pairs as negative sample pairs.
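The in-batch strategy described above can be sketched with indices alone: for the i-th query in a batch, the i-th candidate is its positive and every other candidate in the same batch is treated as a negative. The function name and the index-based representation are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def in_batch_pairs(batch_size: int) -> Dict[int, Tuple[int, List[int]]]:
    """Map each query index to (positive index, list of negative indices)."""
    pairs = {}
    for i in range(batch_size):
        negatives = [j for j in range(batch_size) if j != i]
        pairs[i] = (i, negatives)
    return pairs


pairs = in_batch_pairs(3)
```

With a batch of size B, each query obtains B - 1 negatives without any additional sampling.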

[0201] Optionally, the adjustment module 1103 is further configured to: calculate the target loss based on the multimodal representation, positive sample pairs, and negative sample pairs; update the model parameters of the multimodal pre-trained model according to the target loss to obtain the target multimodal representation model.

[0202] Optionally, the adjustment module 1103 is further configured to: calculate a first similarity between the multimodal representation and the positive sample pair, and perform an exponential transformation on the first similarity to obtain a first calculation result; calculate a second similarity between the multimodal representation and the negative sample pair, and perform an exponential transformation on the second similarity to obtain a second calculation result; and calculate the target loss based on the first calculation result and the second calculation result.
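The loss computation described above can be sketched numerically. The code below assumes an InfoNCE-style combination of the two calculation results and a temperature parameter; both are assumptions, since the exact loss formula is not reproduced in this text.

```python
import math
from typing import Sequence


def contrastive_loss(pos_similarity: float,
                     neg_similarities: Sequence[float],
                     temperature: float = 0.05) -> float:
    """InfoNCE-style loss (assumed form) from one positive and several negatives."""
    # First calculation result: exponential of the positive similarity.
    first_result = math.exp(pos_similarity / temperature)
    # Second calculation result: summed exponentials of the negative similarities.
    second_result = sum(math.exp(s / temperature) for s in neg_similarities)
    return -math.log(first_result / (first_result + second_result))


# A well-separated example versus a poorly separated one.
loss_good = contrastive_loss(0.9, [0.1, 0.0])
loss_bad = contrastive_loss(0.1, [0.9])
```

The loss shrinks as the positive similarity grows relative to the negatives, which is the behaviour the parameter update then reinforces.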

[0203] This embodiment of the invention obtains multimodal training data from a target multimodal dataset. Then, a multimodal pre-trained model is used to concatenate and mix the multimodal training data to obtain a multimodal representation. Finally, the multimodal representation is used to adjust the multimodal pre-trained model to obtain a target multimodal representation model. This target multimodal representation model is used to perform knowledge retrieval and response content generation on user-input multimodal query data to obtain the target answer. This achieves the goal of processing multimodal data in a unified manner, overcoming limitations in complex retrieval scenarios, enhancing the capabilities of complex retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. Furthermore, it solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for complex retrieval scenarios.

[0204] It should be noted that the first acquisition module 1101, mixing module 1102, and adjustment module 1103 mentioned above correspond to steps S21 to S23 in the embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should also be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.

[0205] According to an embodiment of this disclosure, an apparatus embodiment for implementing the above-described data processing method is also provided. Figure 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of this disclosure. As shown in Figure 12, the apparatus includes:

[0206] The second acquisition module 1201 is configured to acquire multimodal query data input by the user;

[0207] The first processing module 1202 is configured to use a target multimodal representation model to perform knowledge retrieval and response content generation on multimodal query data to obtain the target answer; wherein the target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0208] By employing the embodiments of this disclosure, multimodal query data input by the user is obtained, and then a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal query data to obtain the target answer. The target multimodal representation model is generated according to any of the above-mentioned model training methods, thereby achieving the goal of processing multimodal data in a unified manner. This overcomes the limitations in complex retrieval scenarios, enhances the capabilities of complex retrieval scenarios, improves the efficiency and accuracy of information retrieval, reduces system overhead, and enhances the user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limits information retrieval, and makes it unsuitable for complex retrieval scenarios.
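The query flow described above can be sketched as a retrieve-then-generate routine. This is a minimal sketch under stated assumptions: the disclosure does not specify how retrieval is scored, so cosine similarity over stored representations is assumed here, and `retrieve`, `answer`, the `knowledge` dictionary, and the `generate` callback are hypothetical names introduced only for illustration.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Vector, knowledge: Dict[str, Vector], top_k: int = 2) -> List[str]:
    """Return the keys of the top_k knowledge entries most similar to the query."""
    ranked = sorted(knowledge, key=lambda key: cosine(query, knowledge[key]), reverse=True)
    return ranked[:top_k]


def answer(query: Vector, knowledge: Dict[str, Vector],
           generate: Callable[[List[str]], str], top_k: int = 2) -> str:
    """Knowledge retrieval followed by response content generation."""
    context = retrieve(query, knowledge, top_k)
    return generate(context)
```

In this sketch the query vector would be the representation produced by the target multimodal representation model for the user's multimodal query data, and `generate` stands in for whatever component produces the target answer from the retrieved entries.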

[0209] It should be noted that the second acquisition module 1201 and the processing module 1202 mentioned above correspond to steps S61 and S62 in the embodiments. The two modules and the corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. The above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.

[0210] According to embodiments of this disclosure, another embodiment of an apparatus for implementing the above-described data processing method is also provided. FIG13 is a schematic structural diagram of another data processing apparatus according to an embodiment of this disclosure. As shown in FIG13, the apparatus includes:

[0211] The third acquisition module 1301 is configured to acquire multimodal product query data, wherein the multimodal product query data is used to provide description information of the products to be purchased;

[0212] The generation module 1302 is configured to use a target multimodal representation model to perform knowledge retrieval and response content generation on multimodal product query data to obtain target product information; wherein, the target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0213] By employing embodiments of this disclosure, multimodal product query data is acquired, which provides descriptive information about the products to be purchased. Then, a target multimodal representation model is used to perform knowledge retrieval and response content generation on the multimodal product query data to obtain target product information. The target multimodal representation model is generated according to any of the aforementioned model training methods. This achieves the goal of processing multimodal data in a unified manner, thereby overcoming the limitations in complex retrieval scenarios, enhancing the capabilities of complex retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for complex retrieval scenarios.

[0214] It should be noted that the third acquisition module 1301 and generation module 1302 mentioned above correspond to steps S71 and S72 in the embodiments. The two modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.

[0215] According to embodiments of this disclosure, another embodiment of an apparatus for implementing the above-described data processing method is also provided. FIG14 is a schematic structural diagram of another data processing apparatus according to an embodiment of this disclosure. As shown in FIG14, the apparatus includes:

[0216] The fourth acquisition module 1401 is configured to acquire a data processing request through the first application programming interface, wherein the request data carried in the data processing request includes: multimodal query data;

[0217] The first return module 1402 is configured to return a data processing response through a second application programming interface. The response data carried in the data processing response includes: the target answer, which is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. The target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0218] By employing the embodiments of this disclosure, a data processing request is obtained through a first application programming interface (API), wherein the request data carried in the data processing request includes multimodal query data. Then, a data processing response is returned through a second API, wherein the response data carried in the data processing response includes a target answer. The target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. This achieves the goal of processing multimodal data in a unified manner, thereby overcoming the limitations in composite retrieval scenarios, enhancing the capabilities of composite retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for composite retrieval scenarios.
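The request and response exchange described above can be sketched with two small functions, one per interface. This is only an illustrative sketch: the field names (`text`, `image`, `target_answer`), the dictionary payload format, and the `answer_fn` callback are assumptions, since the disclosure does not define the wire format of either application programming interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataProcessingRequest:
    """Request data carrying multimodal query data (hypothetical fields)."""
    text: Optional[str] = None
    image: Optional[bytes] = None


@dataclass
class DataProcessingResponse:
    """Response data carrying the target answer."""
    target_answer: str


def first_api_receive(payload: dict) -> DataProcessingRequest:
    """First interface: validate the payload and build the request object."""
    text = payload.get("text")
    image = payload.get("image")
    if text is None and image is None:
        raise ValueError("request must carry at least one modality")
    return DataProcessingRequest(text=text, image=image)


def second_api_return(request: DataProcessingRequest,
                      answer_fn: Callable[[DataProcessingRequest], str]) -> dict:
    """Second interface: run the model-backed callback and wrap the target answer."""
    response = DataProcessingResponse(target_answer=answer_fn(request))
    return {"target_answer": response.target_answer}
```

Here `answer_fn` stands in for the knowledge retrieval and response content generation performed with the target multimodal representation model.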

[0219] It should be noted that the fourth acquisition module 1401 and the first return module 1402 mentioned above correspond to steps S81 and S82 in the embodiments. The two modules and the corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.

[0220] According to embodiments of this disclosure, another embodiment of an apparatus for implementing the above-described data processing method is also provided. FIG15 is a schematic structural diagram of another data processing apparatus according to an embodiment of this disclosure. As shown in FIG15, the apparatus includes:

[0221] The fifth acquisition module 1501 is configured to acquire the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multimodal query data;

[0222] The second return module 1502 is configured to respond to a data processing dialogue request and return a data processing dialogue response. The information carried in the data processing dialogue response includes: the target answer, which is obtained by using a target multimodal representation model to perform knowledge retrieval and response content generation on the multimodal query data. The target multimodal representation model is generated according to any of the above-mentioned model training methods.

[0223] Display module 1503 is configured to display the target answer within the graphical user interface.

[0224] By employing embodiments of this disclosure, a data processing dialogue request is obtained from the current input, wherein the request data carried in the data processing dialogue request includes multimodal query data. Then, in response to the data processing dialogue request, a data processing dialogue response is returned, wherein the information carried in the data processing dialogue response includes the target answer. The target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model. The target multimodal representation model is generated according to any of the aforementioned model training methods. Finally, the target answer is displayed in the graphical user interface to provide feedback to the user. This achieves the goal of processing multimodal data in a unified manner, thereby overcoming the limitations in complex retrieval scenarios, enhancing the capabilities of complex retrieval scenarios, improving the efficiency and accuracy of information retrieval, reducing system overhead, and improving the user experience. This solves the technical problem in related technologies where the use of the CLIP model for multimodal representation leads to heavy computational and storage burdens, limiting information retrieval and making it unsuitable for complex retrieval scenarios.

[0225] It should be noted that the fifth acquisition module 1501, the second return module 1502, and the display module 1503 mentioned above correspond to steps S91 to S93 in the embodiments. The three modules and the corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. The above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also run in the server 10 provided in the embodiments.

[0226] It should be noted that the preferred implementation schemes involved in the above embodiments of this disclosure are the same as the schemes, application scenarios and implementation processes provided in the embodiments, but are not limited to the schemes provided in the embodiments.

[0227] Embodiments of this disclosure can provide a computing device. FIG16 is a structural block diagram of a computing device according to an embodiment of this disclosure. As shown in FIG16, the computing device A may include: one or more (only one is shown in FIG16) processors 1602, memory 1604, memory controller, and peripheral interfaces, wherein the peripheral interfaces may connect to radio frequency modules, audio modules, and displays, etc., without limitation.

[0228] The aforementioned computing device A can be understood as an integrated smart terminal, including but not limited to servers, desktop computers, PCs (Personal Computers), all-in-one model machines, etc., and the computing device may have the model described in the above embodiments of this disclosure pre-installed.

[0229] Specifically, computing device A can pre-install various types of models, including but not limited to models in natural language processing, visual processing, speech processing, code processing, and multimodal task processing, thus providing diverse model selection. In different product forms, computing device A can support one or more model usage methods, including but not limited to model training, model invocation, model fine-tuning, model deployment, model inference, and application. In some product forms, computing device A also supports model management, including but not limited to multi-type model management (supporting the management of discriminative, generative, and other model types), model version control (supporting the control of different model versions), and model evaluation (evaluating model performance and effectiveness based on model evaluation tools). In other product forms, computing device A can also create applications based on models, providing API calling capabilities, allowing models to be called into created applications through API interfaces, and providing application management tools for application management and monitoring.

[0230] Furthermore, the computing device A can also include data management (supporting the creation and management of model tuning datasets), a training center (providing abundant training resources to help users learn and master AI technology), and basic control capabilities (providing enterprise-level basic control capabilities to ensure the security and efficient operation of the system). Through the above functions, it provides a comprehensive and integrated device for AI development, training, deployment, and application.

[0231] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0232] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.

[0233] It will be understood by those skilled in the art that the structure shown in FIG16 is merely illustrative, and computing device A may also be a smartphone, tablet computer, handheld computer, mobile internet device (MID), PAD, or other terminal device. FIG16 does not limit the structure of the aforementioned computing device. For example, computing device A may include more or fewer components (such as network interface, display device, etc.) than shown in FIG16, or may have a different configuration than shown in FIG16.

[0234] Embodiments of this disclosure can provide an electronic device. FIG17 is a structural block diagram of an electronic device according to an embodiment of this disclosure. As shown in FIG17, the electronic device may include: an input / output device 172; a memory 174; and a processor 176, wherein the processor 176 is connected to the input / output device 172 and the memory 174 via a bus 178.

[0235] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0236] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.

[0237] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0238] Embodiments of this disclosure also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store program code for executing the model training method or data processing method provided in the above embodiments.

[0239] Optionally, in this embodiment, the computer-readable storage medium may be located in any computer terminal in a group of computer terminals in a computer network, or in any mobile terminal in a group of mobile terminals.

[0240] Embodiments of this disclosure also provide a computer program product, which includes a computer program that, when executed by a processor, implements any of the above-described model training methods or data processing methods.

[0241] In the above embodiments of this disclosure, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0242] In the several embodiments provided in this disclosure, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.

[0243] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0244] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0245] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.

[0246] The above description is only a preferred embodiment of this disclosure. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principles of this disclosure, and these improvements and modifications should also be considered within the scope of protection of this disclosure.

Claims

1. A model training method, comprising: obtaining multimodal training data from a target multimodal dataset; concatenating and mixing the multimodal training data using a multimodal pre-trained model to obtain a multimodal representation; and adjusting the multimodal pre-trained model using the multimodal representation to obtain a target multimodal representation model, wherein the target multimodal representation model is used to perform knowledge retrieval and response content generation on multimodal query data input by a user to obtain a target answer.

2. The model training method according to claim 1, further comprising: obtaining an initial multimodal dataset; cleaning and preprocessing the initial multimodal dataset to obtain an intermediate multimodal dataset; and performing data alignment and data annotation on the intermediate multimodal dataset to obtain the target multimodal dataset.

3. The model training method according to claim 1, wherein the multimodal training data includes multiple modality data pairs, each modality data pair including first modality data and second modality data, and wherein concatenating and mixing the multimodal training data using the multimodal pre-trained model to obtain the multimodal representation comprises: using the multimodal pre-trained model to extract features from the first modality data to obtain a first feature vector, to extract features from the second modality data to obtain a second feature vector, and to concatenate and mix the first feature vector and the second feature vector to obtain the multimodal representation.

4. The model training method according to claim 3, wherein concatenating and mixing the first feature vector and the second feature vector using the multimodal pre-trained model to obtain the multimodal representation comprises: concatenating and mixing the first feature vector and the second feature vector using the multimodal pre-trained model to obtain a concatenated sequence output by the last transformation layer of the multimodal pre-trained model; and searching for a preset marker in the concatenated sequence to obtain the multimodal representation, wherein the preset marker is the end marker bit of the concatenated sequence.

5. The model training method according to any one of claims 1-4, wherein adjusting the multimodal pre-trained model using the multimodal representation to obtain the target multimodal representation model comprises: constructing, using a preset training strategy, positive sample pairs and negative sample pairs corresponding to the multimodal representation, wherein the positive sample pairs are sample pairs related to the multimodal representation, and the negative sample pairs are sample pairs unrelated to the multimodal representation; and adjusting the multimodal pre-trained model based on the multimodal representation, the positive sample pairs, and the negative sample pairs to obtain the target multimodal representation model.

6. The model training method according to claim 5, wherein constructing the positive sample pairs and the negative sample pairs corresponding to the multimodal representation using the preset training strategy comprises: using the preset training strategy, obtaining the positive sample pairs corresponding to the multimodal representation from a same batch of sample pairs, and determining the remaining sample pairs in the same batch, other than the positive sample pairs, as the negative sample pairs.

7. The model training method according to claim 5, wherein adjusting the multimodal pre-trained model based on the multimodal representation, the positive sample pairs, and the negative sample pairs to obtain the target multimodal representation model comprises: calculating a target loss based on the multimodal representation, the positive sample pairs, and the negative sample pairs; and updating model parameters of the multimodal pre-trained model based on the target loss to obtain the target multimodal representation model.

8. The model training method according to claim 7, wherein calculating the target loss based on the multimodal representation, the positive sample pairs, and the negative sample pairs comprises: calculating a first similarity between the multimodal representation and the positive sample pair, and performing an exponential transformation on the first similarity to obtain a first calculation result; calculating a second similarity between the multimodal representation and the negative sample pair, and performing an exponential transformation on the second similarity to obtain a second calculation result; and calculating the target loss based on the first calculation result and the second calculation result.

9. A data processing method, comprising: obtaining multimodal query data input by a user; and performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model to obtain a target answer, wherein the target multimodal representation model is generated according to the model training method described in any one of claims 1 to 8.

10. A data processing method, comprising: acquiring multimodal product query data, wherein the multimodal product query data is used to provide descriptive information of the products to be selected; and performing knowledge retrieval and response content generation on the multimodal product query data using a target multimodal representation model to obtain target product information, wherein the target multimodal representation model is generated according to the model training method described in any one of claims 1 to 8.

11. A data processing method, comprising: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes multimodal query data; and returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes a target answer, the target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model, and the target multimodal representation model is generated according to the model training method described in any one of claims 1 to 8.

12. A data processing method, comprising: obtaining a currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes multimodal query data; in response to the data processing dialogue request, returning a data processing dialogue response, wherein the information carried in the data processing dialogue response includes a target answer, the target answer is obtained by performing knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model, and the target multimodal representation model is generated according to the model training method described in any one of claims 1 to 8; and displaying the target answer within a graphical user interface.

13. A data processing system, comprising: a client configured to send multimodal query data; and a server, connected to the client, configured to perform knowledge retrieval and response content generation on the multimodal query data using a target multimodal representation model to obtain a target answer, wherein the target multimodal representation model is generated according to the model training method described in any one of claims 1 to 8; wherein the client is further configured to output the target answer.

14. An electronic device, comprising: a memory storing an executable program; and a processor configured to run the program, wherein the program, when run, performs the model training method according to any one of claims 1 to 8 or the data processing method according to any one of claims 9 to 12.

15. A computer-readable storage medium comprising a stored executable program, wherein, when the executable program is executed, it controls the device containing the computer-readable storage medium to perform the model training method according to any one of claims 1 to 8 or the data processing method according to any one of claims 9 to 12.

16. A computer program product comprising a computer program that, when executed by a processor, implements the model training method of any one of claims 1 to 8 or the data processing method of any one of claims 9 to 12.
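For illustration only, one possible reading of the loss computation recited in claims 6 to 8 is sketched below in Python. The claims state only that the similarities to the positive and negative sample pairs are exponentially transformed and that the target loss is calculated from both results; the dot-product similarity, the `temperature` scaling, the negative-log-ratio form (in the style of an InfoNCE contrastive loss), and the function names are assumptions and do not limit the claims.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity (assumed; the claims do not fix the similarity measure)."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(queries: List[Vector], targets: List[Vector],
                              temperature: float = 0.1) -> float:
    """Average loss over a batch where targets[i] is the positive for queries[i].

    All other targets in the same batch are treated as negatives.
    """
    if len(queries) != len(targets) or not queries:
        raise ValueError("queries and targets must be non-empty and equal in length")
    total = 0.0
    for i, query in enumerate(queries):
        # First calculation result: exponential of the similarity to the positive.
        positive = math.exp(dot(query, targets[i]) / temperature)
        # Second calculation result: exponentials of the similarities to the negatives.
        negative = sum(
            math.exp(dot(query, targets[j]) / temperature)
            for j in range(len(targets)) if j != i
        )
        total += -math.log(positive / (positive + negative))
    return total / len(queries)
```

Under these assumptions the loss is small when each representation is most similar to its own positive and grows when a negative from the same batch is more similar, which is the behavior a parameter update based on the target loss would exploit.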