Data processing method, model training method, electronic device and storage medium
By introducing a cross-modal attention mechanism into the target retrieval generation model, the problems of low efficiency and poor accuracy in multimodal query data retrieval and generation are solved, achieving efficient and accurate multimodal query data processing and improving the efficiency and accuracy of information retrieval and generation.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2025-12-05
- Publication Date
- 2026-06-18
AI Technical Summary
Existing technologies suffer from low retrieval efficiency and poor accuracy when retrieving multimodal query data. In particular, in multimodal understanding scenarios, traditional methods are unable to fully express user needs, which limits the accuracy and relevance of retrieval results.
A target retrieval and generation model is used to perform knowledge retrieval and information generation on multimodal query data. By introducing a cross-modal attention mechanism, target retrieval results that meet preset screening conditions are filtered, and the candidate retrieval results are sorted based on their relevance to generate the target answer.
It significantly improves the efficiency, accuracy, and user experience of information retrieval and generation, enhances the knowledge coverage and adaptability of the model, and can efficiently and accurately process multimodal query data.
Description
Data processing method, model training method, electronic device and storage medium
Technical Field
[0001] This disclosure relates to large model technology and computer technology, and more specifically, to a data processing method, a model training method, an electronic device, and a storage medium.
Background Technology
[0002] With the rapid development of multimodal pre-trained language model technology and its applications, user data is becoming increasingly diverse and multimodal. Traditional large-model-based retrieval-augmented generation (RAG) mainly relies on single-modal data; text retrieval, for example, relies on text descriptions. However, single-modal data often fails to fully express user needs, which limits the accuracy and relevance of retrieval results. To adapt to multimodal understanding scenarios, related technologies can convert all multimodal content into text and then rely on text retrieval and large text language models to achieve multimodal retrieval-augmented generation. However, this approach falls short in information fusion, which affects the comprehensiveness and accuracy of retrieval results. Another approach uses image-text retrieval schemes to decompose multimodal input and process each part separately. This not only increases system complexity but also limits the model's ability to deeply integrate multimodal information, resulting in low retrieval efficiency and difficulty in handling complex multimodal scenarios. Therefore, multimodal retrieval systems in related technologies still suffer from low retrieval and generation efficiency and insufficient accuracy when processing large-scale, diverse data.
[0003] There is currently no effective solution to the above problems.
Summary of the Invention
[0004] This disclosure provides a data processing method, a model training method, an electronic device, and a storage medium to at least solve the technical problems of low retrieval generation efficiency and poor accuracy in the related art when retrieving multimodal query data.
[0005] According to one aspect of the present disclosure, a data processing method is provided, comprising: acquiring multimodal query data; performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model to obtain a target answer; wherein the target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter the candidate retrieval results according to the ranking results of the relevance of the candidate retrieval results.
[0006] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, comprising: acquiring multimodal product query data, wherein the multimodal product query data is used to provide descriptive information of products to be purchased; employing a target retrieval generation model to perform product knowledge retrieval and product information generation on the multimodal product query data to obtain a target product answer; wherein the target retrieval generation model is used to obtain target product retrieval results that meet preset filtering conditions from candidate product retrieval results associated with the multimodal product query data, and to generate response content contained in the target product answer based on the target product retrieval results, and the preset filtering conditions are used to filter candidate product retrieval results according to the ranking results of the relevance of the candidate product retrieval results.
[0007] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, comprising: obtaining a data processing request through a first application programming interface (API), wherein the request data carried in the data processing request includes: multimodal query data; and returning a data processing response through a second API, wherein the response data carried in the data processing response includes: a target answer, wherein the target answer is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model, the target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results, wherein the preset filtering conditions are used to filter the candidate retrieval results according to the ranking results of the relevance of the candidate retrieval results.
[0008] According to another aspect of the embodiments of this disclosure, a data processing method is also provided, comprising: acquiring a currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multimodal query data; responding to the data processing dialogue request and returning a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: a target answer, wherein the target answer is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model, the target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results, the preset filtering conditions are used to filter candidate retrieval results according to the degree of relevance of the candidate retrieval results; and displaying the target answer in a graphical user interface.
[0009] According to another aspect of the embodiments of this disclosure, a model training method is also provided, comprising: acquiring multimodal training data; adjusting a pre-trained model using a target training strategy and the multimodal training data to obtain a target retrieval generation model, wherein a cross-modal attention mechanism is introduced in the target training strategy; and deploying the target retrieval generation model to an execution device, wherein the target retrieval generation model is invoked by the execution device to execute the methods in various embodiments of this disclosure.
[0010] According to another aspect of the embodiments of this disclosure, a data processing system is also provided, including: a client for sending multimodal query data; a server connected to the client for using a target retrieval generation model to perform knowledge retrieval and information generation on the multimodal query data to obtain a target answer, wherein the target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance of the candidate retrieval results; the client is also used to output the target answer.
[0011] According to another aspect of the present disclosure, a computing device is also provided, including: a memory storing an executable program; and a processor for running the program, wherein the program executes the methods in various embodiments of the present disclosure when it runs.
[0012] According to another aspect of the present disclosure, an electronic device is also provided, including: a memory storing an executable program; and a processor connected to the memory via a bus for running the program, wherein the program executes the methods in various embodiments of the present disclosure during runtime.
[0013] According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored executable program, wherein, when the executable program is executed, it controls the device where the computer-readable storage medium is located to perform the methods of the various embodiments of the present disclosure.
[0014] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, including a computer program that, when executed by a processor, implements the methods of various embodiments of this disclosure.
[0015] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, including a non-volatile computer-readable storage medium storing a computer program that, when executed by a processor, implements the methods of various embodiments of this disclosure.
[0016] According to another aspect of the embodiments of this disclosure, a computer program is also provided, which, when executed by a processor, implements the methods of the various embodiments of this disclosure.
[0017] In this embodiment, multimodal query data is acquired, and then a target retrieval generation model is used to perform knowledge retrieval and information generation on the multimodal query data to obtain the target answer. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance. Thus, using the target retrieval generation model to process multimodal query data can significantly improve the efficiency, accuracy, and user experience of information retrieval and generation, while enhancing the model's knowledge coverage and adaptability. The target retrieval generation model can effectively filter out target retrieval results that meet preset filtering conditions from a large number of candidate retrieval results. The filtering process is based on a comprehensive analysis of multimodal query data, which is more comprehensive and accurate than relying on query understanding based on only a single modality. This achieves the goal of efficiently and accurately processing multimodal query data, thereby realizing the technical effect of improving the retrieval generation efficiency and accuracy of multimodal query data, and solving the technical problems of low retrieval generation efficiency and poor accuracy in related technologies when retrieving multimodal query data.
[0018] It is worth noting that the above general description and the following detailed description are merely illustrative and explanatory and do not limit this disclosure.
Description of the Drawings
[0019] The accompanying drawings, which are included to provide a further understanding of this disclosure and form part of this disclosure, illustrate exemplary embodiments of the present disclosure and are used to explain the disclosure, but do not constitute an undue limitation of the disclosure. In the drawings:
[0020] Figure 1 is a schematic diagram of an application scenario of a data processing method according to an embodiment of the present disclosure;
[0021] Figure 2 is a flowchart of a data processing method according to an embodiment of the present disclosure;
[0022] Figure 3 is a schematic diagram of a multimodal recall sub-model according to an embodiment of the present disclosure;
[0023] Figure 4 is a schematic diagram of a target retrieval generation model according to an embodiment of the present disclosure;
[0024] Figure 5 is a flowchart of another data processing method according to an embodiment of the present disclosure;
[0025] Figure 6 is a flowchart of another data processing method according to an embodiment of the present disclosure;
[0026] Figure 7 is a flowchart of another data processing method according to an embodiment of the present disclosure;
[0027] Figure 8 is a flowchart of a model training method according to an embodiment of the present disclosure;
[0028] Figure 9 is a structural block diagram of a data processing apparatus according to an embodiment of the present disclosure;
[0029] Figure 10 is a structural block diagram of another data processing apparatus according to an embodiment of the present disclosure;
[0030] Figure 11 is a structural block diagram of another data processing apparatus according to an embodiment of the present disclosure;
[0031] Figure 12 is a structural block diagram of another data processing apparatus according to an embodiment of the present disclosure;
[0032] Figure 13 is a structural block diagram of a model training device according to an embodiment of the present disclosure;
[0033] Figure 14 is a structural block diagram of a computing device according to an embodiment of the present disclosure;
[0034] Figure 15 is a structural block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0035] To enable those skilled in the art to better understand the present disclosure, the technical solutions of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present disclosure, and not all embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present disclosure.
[0036] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0037] The technical solution disclosed herein is primarily implemented using large-scale model technology. Here, "large-scale model" refers to a deep learning model with a massive number of parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of parameters. Large-scale models, also known as foundation models, are pre-trained using large-scale unlabeled corpora to produce pre-trained models with hundreds of millions of parameters. These models are adaptable to a wide range of downstream tasks and exhibit good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.
[0038] It should be noted that, in practical applications, large models can be fine-tuned using a small number of samples to adapt them to different tasks. For example, large models can be widely applied in Natural Language Processing (NLP), computer vision, and speech processing. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. Therefore, the main application scenarios for large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design. In this embodiment, the data processing using a target retrieval generation model in a multimodal query scenario is used as an example for explanation.
[0039] First, some nouns or terms that appear in the description of the embodiments of this disclosure shall be interpreted as follows:
[0040] Multimodal: Generally refers to combining multiple different types of data or information patterns. Multimodal processing can include text, images, audio, video, etc. In the field of computing, multimodal processing means utilizing these different types of data to improve task performance or achieve more complex functions.
[0041] Multimodal pre-trained large models: generally refers to large-scale machine learning models that can process multiple data modalities simultaneously. They can utilize complementary information from different types of data (such as text, images, audio, etc.) to improve the overall performance of the model and are often used for complex tasks, such as text and image generation and image understanding.
[0042] Multimodal retrieval: Multimodal retrieval is an information retrieval technology that utilizes data from multiple modalities for searching. For example, a user might provide text descriptions and images for searching, and the system will return relevant results based on the combined information from these modalities, thereby improving the accuracy of the search and the user experience.
[0043] Large model retrieval-augmented generation (RAG): A method that combines large language models with external information retrieval systems to improve the model's knowledge coverage and answer accuracy. By accessing and using relevant information in external databases, document repositories, or search engines in real time when generating answers, it compensates for the large model's blind spots regarding knowledge that appeared after its training data cutoff and for its lack of domain-specific knowledge.
[0044] Multimodal large model retrieval augmentation: An extension of large model retrieval augmentation to the retrieval and integration of information from multiple data modalities. This method uses not only text information but also other types of media data to provide richer and more comprehensive answers or generated content.
[0045] To adapt to multimodal understanding scenarios, related technologies can convert all multimodal content into text and then rely on text retrieval and large text language models to achieve multimodal retrieval-augmented generation. Related technologies can also adopt image-text retrieval schemes that decompose the multimodal input and process each part separately. The retrieval generation methods in the related technologies have the following disadvantages: (1) Insufficient information fusion: these schemes lack deep fusion of data from different modalities, making it difficult to fully understand and use the correlations between multimodal data. (2) Low computational efficiency: complex model structures and multimodal data processing consume large amounts of computational resources, which limits the real-time performance and large-scale application of the retrieval system. (3) Lack of a unified pre-training mechanism: most schemes rely on single-modal pre-trained models and lack a unified pre-training mechanism for multimodal data, resulting in poor multimodal feature extraction and fusion.
[0046] According to embodiments of this disclosure, a data processing method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one shown here.
[0047] Considering the large number of model parameters in large models and the limited computing resources of mobile terminals, the method provided in this disclosure can be applied to the application scenario shown in Figure 1, but is not limited thereto. In the application scenario shown in Figure 1, the large model is deployed on server 10. Server 10 can connect to one or more client devices 20 via a local area network (LAN), wide area network (WAN), Internet, or other types of data networks. These client devices 20 may include, but are not limited to, smartphones, tablets, laptops, PDAs, personal computers, smart home devices, and in-vehicle devices. Client devices 20 can interact with users through a graphical user interface to invoke the large model, thereby implementing the method provided in this disclosure.
[0048] In this embodiment, the system comprising a client device and a server can perform the following steps: The server obtains a currently input data processing dialogue request from the client device. The data processing dialogue request carries request data including multimodal query data. In response to the data processing dialogue request, the server returns a data processing dialogue response to the client device. The information carried in the data processing dialogue response includes a target answer. The target answer is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content included in the target answer based on the target retrieval results. The preset filtering conditions are used to filter candidate retrieval results according to their degree of relevance. After obtaining the data processing dialogue response, the client device can display the target answer within a graphical user interface.
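The request and response exchange described above can be sketched roughly as follows. This is only an illustrative outline: the names `DialogueRequest`, `DialogueResponse`, and `handle_request`, the dictionary layout of the query, and the rule that a query must carry at least two modalities are assumptions made for the example, and the model is passed in as an arbitrary callable rather than any specific implementation from this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DialogueRequest:
    """Hypothetical request envelope sent by the client device."""
    session_id: str
    query: Dict[str, object]  # modality name -> payload, e.g. {"text": "...", "image": b"..."}


@dataclass
class DialogueResponse:
    """Hypothetical response envelope returned by the server."""
    session_id: str
    target_answer: str


def handle_request(request: DialogueRequest,
                   model: Callable[[Dict[str, object]], str]) -> DialogueResponse:
    """Server side: run the retrieval generation model and wrap its answer.

    Assumption: a query counts as multimodal when it carries two or more modalities.
    """
    if len(request.query) < 2:
        raise ValueError("expected query data with at least two modalities")
    return DialogueResponse(request.session_id, model(request.query))
```

In such a sketch the client would display `target_answer` in its graphical user interface once the response arrives.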
[0049] It should be noted that, with the rapid development of high-performance computing units, the methods provided in this disclosure can also be applied to integrated model machines in other application scenarios. In one optional embodiment, the integrated model machine has multiple built-in models. Users can select one model and adjust it as needed to obtain their own model. The high-performance computing unit built into the integrated model machine can then directly call the adjusted model to execute the methods provided in this disclosure. In another optional embodiment, the integrated model machine has a pre-trained model built in, so its high-performance computing unit can directly call this model to execute the methods provided in this disclosure.
[0050] Furthermore, when users need to train their own models, they can upload their own datasets via the client. These datasets are then sent to the server, allowing the server to adjust the pre-trained model using the dataset to obtain the user's customized model, which can then be deployed to the production environment. To facilitate users' model adjustment needs, the server provides complete adjustment tools, development frameworks, and processes, supporting multiple adjustment strategies. This allows the adjusted model to better adapt to different application domains and achieve a high degree of customization.
[0051] In the above operating environment, this disclosure provides a data processing method as shown in Figure 2. Figure 2 is a flowchart of a data processing method according to an embodiment of this disclosure. As shown in Figure 2, the method may include the following steps:
[0052] Step S21: Obtain multimodal query data;
[0053] Step S22: Use the target retrieval generation model to perform knowledge retrieval and information generation on the multimodal query data to obtain the target answer; wherein, the target retrieval generation model is used to obtain the target retrieval results that meet the preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate the response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter the candidate retrieval results according to the ranking results of the relevance of the candidate retrieval results.
[0054] The aforementioned multimodal query data can be query data containing two or more different modalities of information, including but not limited to text, images, audio, and video. In practical applications, multimodal query data can be input provided by users during searches, queries, or interactions, containing multiple types of information. For example, a user might upload an image along with descriptive text and request detailed information about a product in the image.
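As a minimal sketch of how such query data might be represented in code, the following Python dataclass holds optional text, image, audio, and video payloads and checks that at least two modalities are present. The class name `MultimodalQuery` and its methods are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MultimodalQuery:
    """Hypothetical container for a query carrying several modalities."""
    text: Optional[str] = None
    image: Optional[bytes] = None
    audio: Optional[bytes] = None
    video: Optional[bytes] = None

    def modalities(self) -> List[str]:
        """Return the names of the modalities that are actually populated."""
        fields = {"text": self.text, "image": self.image,
                  "audio": self.audio, "video": self.video}
        return [name for name, value in fields.items() if value]

    def is_multimodal(self) -> bool:
        """True when two or more modalities are present."""
        return len(self.modalities()) >= 2
```

For the example in the paragraph above, an uploaded image plus descriptive text would populate the `image` and `text` fields and pass the `is_multimodal` check.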
[0055] For example, the methods for acquiring the aforementioned multimodal query data include, but are not limited to, user input collection, social media collection, mobile application logs, and database integration. Specifically, multimodal query data can be collected directly from user-provided queries. For instance, search engines or intelligent customer service systems can receive queries containing text and images, or combined voice and text queries. Images, videos, and text comments can be collected from social media platforms as sources of multimodal query data, which can be used to analyze user interests or trends. User query data can be extracted from mobile application logs, including voice input, photos, and text descriptions entered by users within the application. Records containing different modal information can be integrated from existing databases, such as product images and descriptive text in a product database. Furthermore, multimodal query data can be obtained from user interaction records of virtual assistants and smart speakers. When a smart device processes a user's voice command, if the user also provides images or videos, the device interaction records can constitute multimodal query data.
[0056] It is important to note that when acquiring multimodal query data, it is necessary to ensure the quality and diversity of the data, as well as to comply with data usage regulations to ensure compliant data use. Furthermore, preprocessing the collected user query data, such as noise reduction, format standardization, and quality checks, can further guarantee the quality of subsequent retrieval generation.
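The format standardization and quality check mentioned above could look roughly like the following for the text part of a query. This is a sketch under stated assumptions: the function names, the use of Unicode NFKC normalization, and the minimum-length rule are illustrative choices, not requirements of this disclosure, and other modalities would need their own checks.

```python
import re
import unicodedata
from typing import Iterable, List


def normalize_text(raw: str) -> str:
    """Format standardization: apply Unicode NFKC, collapse whitespace, trim."""
    text = unicodedata.normalize("NFKC", raw)
    return re.sub(r"\s+", " ", text).strip()


def passes_quality_check(text: str, min_chars: int = 3) -> bool:
    """Assumed quality rule: long enough and contains an alphanumeric character."""
    return len(text) >= min_chars and any(ch.isalnum() for ch in text)


def preprocess_queries(raw_queries: Iterable[str]) -> List[str]:
    """Normalize each text query and drop those failing the quality check."""
    cleaned = (normalize_text(q) for q in raw_queries)
    return [q for q in cleaned if passes_quality_check(q)]
```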
[0057] After the multimodal query data is acquired, it is input into the target retrieval generation model. The model performs knowledge retrieval and information generation on the multimodal query data to obtain the target answer. During knowledge retrieval, the model can retrieve candidate retrieval results related to the multimodal query data from a multimodal knowledge base, which can be a large collection of documents, a database, or internet resources containing multiple modalities. During information generation, the candidate retrieval results are ranked by their relevance to the multimodal query data, yielding a relevance ranking. The candidate retrieval results are then filtered according to this ranking, and the target retrieval results that meet the preset filtering conditions are selected from the ranked candidates. For example, the top k candidate retrieval results in the relevance ranking are selected as the target retrieval results. Using the information in the target retrieval results, combined with the modal characteristics of the multimodal query data, a comprehensive, multimodal target answer is generated. The response content of the target answer may include, but is not limited to, text descriptions, image examples, audio explanations, or video demonstrations to meet the user's comprehensive information needs.
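The rank-then-filter step described above can be illustrated with a short sketch. It assumes, purely for the example, that the query and each candidate are already represented by a fused embedding vector and that relevance is measured by cosine similarity; the disclosure does not fix a particular relevance measure. The names `Candidate`, `select_top_k`, and `build_prompt` are hypothetical, and the generation step is reduced to assembling a prompt for a downstream model.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    doc_id: str
    embedding: Sequence[float]  # assumed: a fused multimodal embedding
    content: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top_k(query_emb: Sequence[float],
                 candidates: List[Candidate], k: int) -> List[Candidate]:
    """Rank candidates by relevance (descending) and keep the first k."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_emb, c.embedding),
                    reverse=True)
    return ranked[:k]


def build_prompt(question: str, targets: List[Candidate]) -> str:
    """Assemble the selected results into context for a generation model."""
    context = "\n".join(f"[{c.doc_id}] {c.content}" for c in targets)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Here the "preset filtering condition" is the top-k cut; a score threshold could be used instead without changing the overall shape of the pipeline.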
[0058] For example, the target retrieval generation model can be obtained by adjusting a pre-trained model using a target training strategy and multimodal training data. A cross-modal attention mechanism can be introduced into the target training strategy. The target retrieval generation model is deployed to an execution device, and the execution device invokes the model to execute the data processing method of any one of the embodiments of this disclosure.
[0059] The aforementioned multimodal training data can be obtained from various channels, including but not limited to social media, online forums, Q&A communities, news websites, educational materials, and industry-specific databases (such as medical, legal, and technological databases). Multimodal training data typically contains information from multiple modalities, such as text, images, audio, or video, and there are correlations or relationships between these modalities. Since the raw data may contain noise and irrelevant information, preprocessing is necessary, such as data cleaning, format conversion, size standardization, and text segmentation, to ensure the pre-trained model can learn effectively. For supervised learning, data can be labeled manually or automatically to indicate the correlations between different modalities, thereby helping the model learn the correct correspondences between modalities.
[0060] Preprocessed multimodal training data is input into a pre-trained model, allowing the model to be fine-tuned or further trained on the dataset. During training, target training strategies, such as cross-modal attention mechanisms, are used to guide the update of model parameters to improve the ability to process multimodal information. Introducing cross-modal attention mechanisms into the model architecture enables the model to pay attention to and integrate information from different modalities when processing input, thereby enhancing mutual understanding and connection between modalities.
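To make the idea of cross-modal attention concrete, the following sketch shows scaled dot-product attention in which text token vectors act as queries and image patch vectors act as keys and values. It is a generic illustration, not the architecture of this disclosure: the learned query, key, and value projections and multi-head structure of a real model are omitted, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_tokens: List[Vector],
                          image_patches: List[Vector]) -> List[Vector]:
    """Each text token attends over all image patches.

    Assumption: projections are omitted, so text and image vectors must
    already share the same dimensionality.
    """
    dim = len(image_patches[0])
    scale = math.sqrt(dim)
    fused = []
    for q in text_tokens:
        # Similarity of this text token to every image patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in image_patches]
        weights = softmax(scores)
        # Weighted sum of image patch vectors: image information seen by this token.
        fused.append([sum(w * v[d] for w, v in zip(weights, image_patches))
                      for d in range(dim)])
    return fused
```

Each output vector is a mixture of image patches weighted by how strongly they match the corresponding text token, which is the sense in which the model "pays attention to" another modality.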
[0061] The data processing methods in this disclosure can be applied to scenarios including, but not limited to, intelligent customer service, e-commerce, news, healthcare, education, social media, and smart homes; this disclosure does not impose any limitation in this regard. These application scenarios require the system to process and understand complex multimodal information and to provide accurate, timely, and relevant responses or content. By integrating query data from multiple modalities such as text, images, and audio, the accuracy and efficiency of information retrieval and content generation can be effectively improved. For example, in intelligent customer service and question-and-answer systems, when users inquire about product information or seek advice, they can simultaneously provide text descriptions and image, audio, or video examples, requiring the system to provide accurate and comprehensive answers. In e-commerce search and recommendation systems, when users search for products on e-commerce platforms, they can upload product images and provide text descriptions to express their needs more precisely. Social media platforms, when analyzing user behavior or content trends, need to process composite information including text, images, and videos to understand user needs and interests more accurately, providing a basis for content recommendation and advertising targeting.
[0062] Based on steps S21 and S22 above, multimodal query data is acquired, and then a target retrieval generation model is used to perform knowledge retrieval and information generation on the multimodal query data to obtain the target answer. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate the response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to their degree of relevance. Therefore, using the target retrieval generation model to process multimodal query data can significantly improve the efficiency, accuracy, and user experience of information retrieval and generation, while enhancing the model's knowledge coverage and adaptability. The target retrieval generation model can effectively select, from a large number of candidate retrieval results, the target retrieval results that meet the preset filtering conditions. The filtering process is based on a comprehensive analysis of the multimodal query data, which is more comprehensive and accurate than query understanding based on only a single modality. This achieves efficient and accurate processing of multimodal query data, improves the efficiency and accuracy of retrieval and generation for multimodal query data, and solves the technical problems of low retrieval generation efficiency and poor accuracy in related technologies when retrieving multimodal query data.
[0063] The data processing methods in the embodiments of this disclosure will be further described below.
[0064] In an optional embodiment, the target retrieval generation model includes: a multimodal recall sub-model, a multimodal ranking sub-model, and a multimodal generation sub-model. In step S22, the target retrieval generation model is used to perform knowledge retrieval and information generation on the multimodal query data to obtain the target answer, including:
[0065] Step S221: Use a multimodal recall sub-model to perform recall processing on the multimodal query data to obtain candidate retrieval results;
[0066] Step S222: The candidate retrieval results are sorted using a multimodal ranking sub-model to obtain the ranking results;
[0067] Step S223: Use a multimodal generation sub-model to process the ranking results into response content to obtain the target answer.
[0068] In the process of recalling multimodal query data, a multimodal recall sub-model can be used to perform deep feature extraction on the multimodal query data, converting data from different modalities such as text, images, audio, or video into a unified feature representation. This is typically done in the embedding layer of the multimodal recall sub-model. The multimodal recall sub-model has the ability to process multimodal information and can generate high-quality multimodal feature vectors. Furthermore, based on the extracted multimodal features, vector recall is performed to retrieve candidate documents or fragments that match the multimodal feature vectors from a large-scale multimodal knowledge base, thus obtaining candidate retrieval results.
[0069] In the process of ranking candidate search results, a multimodal ranking sub-model can be used to evaluate the recalled candidate search results and assign a relevance score to each candidate search result. The relevance score is calculated by analyzing the correlation between the modal information in the candidate search results and the multimodal query data, as well as the mutual influence between modalities. Further, based on the relevance scores, the multimodal ranking sub-model is used to comprehensively rank the candidate search results, obtaining a ranking result that includes multiple candidate search results. It should be noted that the ranking algorithms used in the ranking process include, but are not limited to, ranking based on vector similarity, ranking based on multimodal attention mechanisms, or personalized ranking algorithms combining user behavior data and contextual information; this disclosure does not limit the scope of the ranking algorithm.
[0070] During the response content generation process of the ranking results, the multimodal generation sub-model processes the ranking results to integrate multimodal information. Specifically, this may include integrating relevant information from modalities such as text, images, audio, or video to comprehensively consider all available information when generating the target answer. Based on the integrated multimodal information and the original multimodal query data, the multimodal generation sub-model can generate comprehensive response content for the target answer. The target answer can be a detailed and accurate answer containing multimodal information, such as a solution including textual explanations and relevant image examples, or a guidance plan combining textual descriptions and audio explanations. After generating the response content, the multimodal generation sub-model can also adjust the response content to ensure the logical coherence and information completeness of the output content. This includes, but is not limited to, proofreading the generated content, supplementing missing information, or adjusting the format of the generated content according to the context.
[0071] Based on the above optional embodiments, a multimodal recall sub-model is used to recall multimodal query data to obtain candidate retrieval results. Then, a multimodal ranking sub-model is used to rank the candidate retrieval results to obtain ranked results. Finally, a multimodal generation sub-model is used to generate response content from the ranked results to obtain the target answer. Thus, the target retrieval generation model can effectively handle complex multimodal queries, accurately retrieve relevant data from the knowledge base, and generate a high-quality response content containing multiple modal information, which can significantly improve user experience and information acquisition efficiency.
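The three-stage flow of steps S221 to S223 can be sketched as a simple composition of recall, ranking, and generation. In the sketch below, the three sub-models are replaced by toy word-overlap functions, and all names (`retrieve_and_generate`, `toy_recall`, `toy_rank`, `toy_generate`, the record fields) are hypothetical stand-ins rather than the models described in this disclosure.

```python
def _overlap(query, doc):
    """Number of words shared by the query text and a document text."""
    return len(set(query["text"].lower().split())
               & set(doc["text"].lower().split()))

def toy_recall(query, knowledge_base):
    # Stand-in for the multimodal recall sub-model (step S221).
    return [doc for doc in knowledge_base if _overlap(query, doc) > 0]

def toy_rank(query, candidates):
    # Stand-in for the multimodal ranking sub-model (step S222).
    return sorted(candidates, key=lambda doc: _overlap(query, doc), reverse=True)

def toy_generate(query, ranked):
    # Stand-in for the multimodal generation sub-model (step S223).
    return "Based on: " + "; ".join(doc["text"] for doc in ranked)

def retrieve_and_generate(query, knowledge_base, recall, rank, generate, top_k=1):
    candidates = recall(query, knowledge_base)   # S221: candidate retrieval results
    ranked = rank(query, candidates)             # S222: ranking results
    return generate(query, ranked[:top_k])       # S223: target answer

KNOWLEDGE_BASE = [
    {"text": "red running shoes size chart", "image": "shoes.jpg"},
    {"text": "blue running jacket care guide", "image": "jacket.jpg"},
    {"text": "kitchen blender manual", "image": None},
]
QUERY = {"text": "red running shoes", "image": "query_photo.jpg"}

answer = retrieve_and_generate(QUERY, KNOWLEDGE_BASE,
                               toy_recall, toy_rank, toy_generate)
```

Passing the stages as callables keeps the orchestration independent of how each sub-model is implemented, which mirrors the separation between the recall, ranking, and generation sub-models.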
[0072] In an optional embodiment, the multimodal recall sub-model includes: a feature encoding module, a feature mapping module, and a feature matching module. In step S221, the multimodal recall sub-model is used to perform recall processing on the multimodal query data to obtain candidate retrieval results, including:
[0073] Step S2211: The feature encoding module is used to perform feature encoding processing on the multimodal query data to obtain multimodal feature vectors;
[0074] Step S2212: The feature mapping module is used to perform feature mapping processing on the multimodal feature vectors to obtain the multimodal mapping result. The multimodal mapping result is used to represent the multimodal representation obtained by mapping the feature vectors of different modalities to the shared multimodal space.
[0075] Step S2213: The feature matching module is used to recall candidate retrieval results that match the multimodal mapping results from the target multimodal dataset.
[0076] During feature encoding, multimodal query data is input into the feature encoding module, which then performs deep feature extraction on the multimodal query data. For example, for text data, word segmentation and word embedding can be performed; for image data, image feature extraction can be performed to obtain feature maps; and for audio and video data, corresponding time-frequency features or scene features can be extracted. Furthermore, the feature encoding module can integrate the extracted feature data into a unified multimodal feature vector. This multimodal feature vector is a high-dimensional feature vector that can contain the core information and cross-modal correlations from the multimodal query data.
[0077] During the feature mapping process, the feature mapping module maps the generated multimodal feature vectors to a shared multimodal space. This multimodal space is a unified representation space containing feature vectors from different modalities. These feature vectors can be effectively compared and matched using a cross-modal attention mechanism. The result of the feature mapping process is the multimodal mapping result, which represents the multimodal representation obtained by mapping feature vectors from different modalities to the shared multimodal space. This transforms information from different modalities into a unified form suitable for similarity calculation.
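One common way to realize such a mapping is a per-modality linear projection followed by L2 normalization, so that vectors of different original dimensions become comparable in one space. The sketch below assumes this approach; the projection matrices `TEXT_PROJ` and `IMAGE_PROJ` are hypothetical fixed weights chosen for readability, whereas in practice they would be learned.

```python
import math

def project(vector, matrix):
    """Multiply a vector by a projection matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

# Hypothetical projection weights: 3-dim text and 4-dim image features
# are both mapped into a shared 2-dim space.
TEXT_PROJ = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 1.0]]
IMAGE_PROJ = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]

def map_to_shared_space(text_vec, image_vec):
    text_shared = l2_normalize(project(text_vec, TEXT_PROJ))
    image_shared = l2_normalize(project(image_vec, IMAGE_PROJ))
    return text_shared, image_shared

text_shared, image_shared = map_to_shared_space([3.0, 0.0, 4.0],
                                                [2.0, 2.0, 1.0, 1.0])
```

After normalization, the dot product of two mapped vectors equals their cosine similarity, which is what makes the shared space convenient for the similarity calculation that follows.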
[0078] Furthermore, the feature matching module calculates, within the shared multimodal space, a similarity score between the multimodal mapping result and the feature vector of each document or fragment in the target multimodal dataset. Based on the similarity scores, the feature matching module recalls from the target multimodal dataset the candidate retrieval results that closely match the multimodal mapping result. The recall process can employ an approximate nearest neighbor algorithm, enabling rapid identification of the most similar items in large-scale datasets. The recalled candidate retrieval results are documents or fragments that match the feature vector of the query data in the multimodal space.
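The similarity scoring performed by the feature matching module can be sketched with an exhaustive cosine-similarity scan. This is an exact, brute-force stand-in for the approximate nearest neighbor algorithm mentioned above, used here only because it is easy to verify; the `min_score` threshold and the document identifiers are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_by_similarity(query_vec, dataset, min_score=0.5):
    """Return (doc_id, score) pairs whose similarity reaches min_score."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in dataset.items()]
    return [(doc_id, score) for doc_id, score in scored if score >= min_score]

# Hypothetical vectors already mapped into the shared multimodal space.
dataset = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
recalled = recall_by_similarity([1.0, 0.0], dataset)
```

An exhaustive scan costs time proportional to the dataset size for every query, which is why large-scale deployments typically replace it with an approximate index.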
[0079] Figure 3 is a schematic diagram of a multimodal recall sub-model according to an embodiment of the present disclosure. As shown in Figure 3, a feature encoding module is used to perform feature encoding processing on the multimodal query data to obtain a multimodal feature vector. A feature mapping module is used to perform feature mapping processing on the multimodal feature vector to obtain a multimodal mapping result. A feature matching module is used to recall candidate retrieval results that match the multimodal mapping result from the target multimodal dataset.
[0080] Based on the above optional embodiments, a feature encoding module is used to perform feature encoding processing on the multimodal query data to obtain multimodal feature vectors. Then, a feature mapping module is used to perform feature mapping processing on the multimodal feature vectors to obtain multimodal mapping results. Finally, a feature matching module is used to recall candidate retrieval results that match the multimodal mapping results from the target multimodal dataset. This can effectively process complex multimodal query data. Through deep feature encoding and mapping, multimodal query data is converted into a unified representation and highly relevant candidate results are quickly recalled from large-scale multimodal datasets, thereby providing high-quality input data for subsequent sorting and generation stages.
[0081] In an optional embodiment, step S2213, using the feature matching module to recall candidate retrieval results that match the multimodal mapping results from the target multimodal dataset includes: in the feature matching module, recalling candidate retrieval results that match the multimodal mapping results from the target multimodal dataset based on a preset index structure.
[0082] The aforementioned preset index structures include, but are not limited to, inverted indexes, vector indexes, or other data structures. Preset index structures can improve retrieval efficiency and accuracy, and can be flexibly selected according to the characteristics of the dataset and task requirements.
[0083] Specifically, inverted indexes are a common indexing structure when processing text data. Built upon a vocabulary or feature dictionary, an inverted index records which documents each word or feature appears in and its position within those documents. Inverted indexes can quickly locate documents containing specific words or features, making them particularly effective for text-intensive multimodal datasets. For feature vector-based retrieval, vector indexes are a more efficient choice. Vector indexes store the feature vectors of each document or segment in the dataset and quickly find the vector most similar to the query vector using an approximate nearest neighbor algorithm. Vector indexes can handle large-scale multimodal datasets and maintain good retrieval performance even in high-dimensional feature spaces.
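The inverted index described above, which records for each word the documents it appears in and its positions within them, can be sketched as a nested dictionary. The function names and sample documents below are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def lookup(index, term):
    """Return the postings for a term, or an empty dict if it is unknown."""
    return index.get(term.lower(), {})

documents = {
    "d1": "red running shoes",
    "d2": "blue running jacket",
}
index = build_inverted_index(documents)
running_postings = lookup(index, "running")
red_postings = lookup(index, "Red")
```

A lookup touches only the postings of the queried term, so its cost does not grow with the number of documents that lack the term.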
[0084] Based on the above optional embodiments, the feature matching module recalls candidate retrieval results that match the multimodal mapping results from the target multimodal dataset based on a preset index structure. This allows candidate retrieval results matching the multimodal query data to be recalled quickly, providing high-quality input for the subsequent multimodal ranking and generation steps and thereby supporting the performance and user experience of the entire RAG process.
[0085] In one optional embodiment, the data processing method in this disclosure further includes:
[0086] Obtaining an initial multimodal dataset; performing data cleaning and preprocessing on the initial multimodal dataset to obtain an intermediate multimodal dataset; performing data alignment and annotation on the intermediate multimodal dataset to obtain the target multimodal dataset; and performing feature encoding on the target multimodal dataset to construct the preset index structure.
[0087] The aforementioned initial multimodal dataset can include data containing multimodal information such as text, images, audio, or video collected from channels such as the Internet, social media, professional databases, literature, image libraries, audio or video resources. By ensuring that the initial multimodal dataset covers a wide range of topics and domains, the generalization ability of the model can be improved.
[0088] Furthermore, data cleaning can remove duplicate, irrelevant, or low-quality data, for example, text containing spam, blurry images, and blank or meaningless audio clips. The aforementioned data preprocessing includes, but is not limited to, format unification and data augmentation. Format unification converts data into a uniform format for easier model processing, for example, by scaling all images to the same resolution and performing word segmentation and standardization on text. Data augmentation techniques, such as image flipping, text rewriting, and audio reverberation, can increase the diversity and richness of the dataset.
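A minimal sketch of the cleaning and format-unification steps for image-text records might look as follows. The record fields (`text`, `image`), the `min_text_chars` threshold, and the specific normalization rules (whitespace collapsing and lowercasing) are assumptions for illustration; image and audio quality checks are omitted.

```python
def clean_and_normalize(records, min_text_chars=5):
    """Drop empty or duplicate records and unify the text format."""
    seen, cleaned = set(), []
    for record in records:
        # Format unification: collapse whitespace and lowercase the text.
        text = " ".join(record.get("text", "").split()).lower()
        if len(text) < min_text_chars:
            continue  # treat very short or blank text as low quality
        key = (text, record.get("image"))
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"text": text, "image": record.get("image")})
    return cleaned

raw_records = [
    {"text": "  Red   Shoes ", "image": "a.jpg"},
    {"text": "red shoes", "image": "a.jpg"},
    {"text": "", "image": "b.jpg"},
    {"text": "Blue jacket", "image": "c.jpg"},
]
intermediate_dataset = clean_and_normalize(raw_records)
```

Normalizing before deduplication matters here: the first two records differ only in spacing and capitalization, so they collapse into one entry.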
[0089] After obtaining the intermediate multimodal dataset through data cleaning and preprocessing, data alignment and annotation are performed on the intermediate multimodal dataset to obtain the target multimodal dataset. Data alignment ensures that the multimodal information in the target multimodal dataset is related and consistent. For example, for image-text pairing, it ensures that the text description is relevant to the image content; for video and audio, it ensures that their timestamps are aligned. Data annotation adds labels or keywords to each multimodal entry in the intermediate multimodal dataset to indicate the relationship between queries and responses, as well as the correlation between modalities, thereby helping the model learn the correct multimodal correspondences and retrieval logic.
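For the video and audio case, timestamp alignment can be checked by comparing segment boundaries within a tolerance. The sketch below assumes each stream is a list of `(start_s, end_s)` tuples in seconds and that a 50 ms tolerance is acceptable; both the representation and the tolerance are hypothetical choices.

```python
def is_time_aligned(video_segments, audio_segments, tolerance_s=0.05):
    """Check that paired video/audio segments start and end together."""
    if len(video_segments) != len(audio_segments):
        return False
    return all(
        abs(v_start - a_start) <= tolerance_s and abs(v_end - a_end) <= tolerance_s
        for (v_start, v_end), (a_start, a_end) in zip(video_segments, audio_segments)
    )

video = [(0.0, 2.0), (2.0, 5.0)]
audio_ok = [(0.02, 2.01), (2.0, 5.03)]
audio_shifted = [(0.5, 2.5), (2.5, 5.5)]

aligned = is_time_aligned(video, audio_ok)
misaligned = is_time_aligned(video, audio_shifted)
```

Entries that fail such a check could be re-synchronized or excluded before annotation, depending on the dataset requirements.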
[0090] Feature extraction is performed on each data item in the target multimodal dataset to generate feature vectors containing multimodal information. Feature vectors from different modalities are then mapped into a shared multimodal space to ensure consistency and comparability between modalities. Based on the feature-encoded data, a preset index structure, such as an inverted index or a vector index, is constructed to support efficient retrieval operations. The design of the index structure can be determined based on the size of the dataset, the dimension of the feature vectors, and the requirements of the retrieval task.
[0091] Based on the above optional embodiments, an initial multimodal dataset is obtained, and then data cleaning and preprocessing are performed on the initial multimodal dataset to obtain an intermediate multimodal dataset. Subsequently, data alignment and data annotation are performed on the intermediate multimodal dataset to obtain a target multimodal dataset. Finally, feature encoding is performed on the target multimodal dataset to construct a preset index structure, which is used to quickly recall candidate retrieval results that match the multimodal query data, providing high-quality input for subsequent multimodal sorting and generation steps, and further improving retrieval generation efficiency.
[0092] In one optional embodiment, recalling candidate search results that match the multimodal mapping results from the target multimodal dataset based on a preset index structure includes:
[0093] The initial search results are retrieved from the target multimodal dataset based on the preset index structure; the similarity between the multimodal mapping results and the initial search results is calculated to obtain the calculation results; based on the calculation results, candidate search results that match the multimodal mapping results are selected from the initial search results.
[0094] Specifically, the preset index structure is used to query the target multimodal dataset. This index structure can quickly locate and recall a series of potentially related documents or fragments as initial search results. For each recalled initial search result, a similarity score between it and the multimodal mapping result is calculated, for example, the cosine similarity or Euclidean distance between the vectors. When calculating similarity, a cross-modal attention mechanism can also be used to capture the correlation between different modalities more deeply, thereby providing more accurate matching results.
[0095] Based on the similarity calculation results, the initial search results are sorted from highest to lowest relevance score. The top K initial search results are selected from the sorted list as candidate search results. The value of K is adjusted based on actual needs and system performance to ensure the accuracy and diversity of the candidate search results.
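The second-stage selection can be sketched as re-scoring the initial search results and keeping the K best. The example below uses Euclidean distance, one of the two measures named above, so a smaller distance means higher relevance and `heapq.nsmallest` returns the results in order from most to least relevant. The identifiers and vectors are hypothetical.

```python
import heapq
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_top_k(query_vec, initial_results, k):
    """Keep the k initial results closest to the query vector."""
    # Smaller distance means higher relevance, so take the k smallest.
    return heapq.nsmallest(
        k, initial_results,
        key=lambda item: euclidean_distance(query_vec, item[1]),
    )

# Hypothetical (doc_id, vector) pairs returned by the index lookup.
initial_results = [
    ("doc_a", [0.9, 0.1]),
    ("doc_b", [0.0, 1.0]),
    ("doc_c", [0.6, 0.4]),
]
candidates = select_top_k([1.0, 0.0], initial_results, k=2)
candidate_ids = [doc_id for doc_id, _ in candidates]
```

Using a heap-based selection avoids fully sorting the initial results when K is much smaller than their number.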
[0096] Based on the above optional embodiments, the initial search results are retrieved from the target multimodal dataset based on a preset index structure. Then, the similarity between the multimodal mapping results and the initial search results is calculated to obtain the calculation results. Finally, based on the calculation results, candidate search results that match the multimodal mapping results are selected from the initial search results, providing high-quality input for subsequent sorting and generation steps, thereby ensuring the fast response and high relevance of the search process.
[0097] In one optional embodiment, the data processing method in this disclosure further includes:
[0098] Based on a preset prompt text format, some candidate search results in the sorting results are concatenated to obtain the target prompt text.
[0099] The aforementioned preset prompt text format is a template that defines how the sorting results are integrated into a unified text structure. The preset prompt text format can include specific text tags, instructions, or placeholders to indicate the type and location of content that the generative model focuses on.
[0100] Specifically, the top-ranked results in the sorting results are concatenated according to the preset prompt text format to form a comprehensive input, i.e., the target prompt text. For example, the top candidate search results in the sorting results are reorganized according to the preset prompt text format, ensuring that the information of each modality is inserted at the appropriate position in the target prompt text. The target prompt text thus contains multimodal concatenated data. During concatenation, the format of the information can be fine-tuned to ensure compatibility with the preset prompt text format while maintaining the integrity and accuracy of the information.
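A prompt template with placeholders, filled with the top-ranked results, could be sketched as follows. The template wording, the `<image:...>` tag used to mark image content, and the `top_n` parameter are all assumptions for illustration; the disclosure does not fix a concrete prompt text format.

```python
# Hypothetical prompt text format with two placeholders.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Reference material:\n"
    "{references}\n"
    "Answer using only the reference material above."
)

def build_prompt(question, ranked_results, top_n=2):
    """Concatenate the top_n ranked results into the target prompt text."""
    lines = []
    for number, result in enumerate(ranked_results[:top_n], start=1):
        line = f"[{number}] {result['text']}"
        if result.get("image"):
            # Mark where image content belongs in the prompt.
            line += f" <image:{result['image']}>"
        lines.append(line)
    return PROMPT_TEMPLATE.format(question=question,
                                  references="\n".join(lines))

ranked_results = [
    {"text": "Red running shoes, sizes 38 to 45", "image": "shoes.jpg"},
    {"text": "Return policy: 30 days", "image": None},
    {"text": "Blue jacket care guide", "image": "jacket.jpg"},
]
target_prompt = build_prompt("Which sizes are available?", ranked_results)
```

Only the first `top_n` results are included, which keeps the prompt within a bounded length regardless of how many results were ranked.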
[0101] Based on the above optional embodiments, by concatenating some of the candidate search results in the sorting results according to a preset prompt text format, the target prompt text can be obtained. This ensures that the multimodal generation sub-model can make full use of the multimodal information in the search results when generating response content, providing richer and more accurate answers, and further improving the accuracy of the search generation results.
[0102] In an optional embodiment, step S223, using a multimodal generation sub-model to perform response content generation processing on the ranking results to obtain the target answer, includes:
[0103] The target prompt text is input into the multimodal generation sub-model, which then generates the response content based on the multimodal concatenated data recorded in the target prompt text to obtain the target answer.
[0104] Specifically, the target prompt text is input into the multimodal generation sub-model. The multimodal generation sub-model can combine information from multiple modalities such as text and images to understand the context and generate the final target answer. The generated result not only depends on the model's generation capability but also integrates highly relevant retrieved information to ensure the accuracy and richness of the output content.
[0105] Figure 4 is a schematic diagram of a target retrieval generation model according to an embodiment of the present disclosure. As shown in Figure 4, the target retrieval generation model includes a multimodal recall sub-model, a multimodal ranking sub-model, and a multimodal generation sub-model. In the process of obtaining the target answer by performing knowledge retrieval and information generation on multimodal query data using the target retrieval generation model, the multimodal recall sub-model performs recall processing on the multimodal query data to obtain candidate retrieval results, the multimodal ranking sub-model performs ranking processing on the candidate retrieval results to obtain ranking results, and the multimodal generation sub-model performs response content generation processing on the ranking results to obtain the target answer.
[0106] This disclosure addresses multimodal retrieval-augmented generation scenarios. Based on the target retrieval generation model, a unified multimodal retrieval generation architecture can be constructed that is capable of simultaneously processing multiple data modalities such as text and images, with deep information fusion achieved through a cross-modal attention mechanism. This unified architecture avoids the fragmented processing of different modalities found in traditional multimodal models, improving the model's overall understanding and representation capabilities. Specifically, a recall-plus-ranking pipeline is constructed in the multimodal retrieval chain based on the target retrieval generation model, achieving efficient fusion of multimodal features. Cross-modal attention and a shared representation space are used to achieve deep fusion and collaborative expression of features from different modalities. This fusion mechanism not only enhances the expressive power of the features but also significantly improves the matching accuracy between data of different modalities. By using a retrieval enhancement strategy based on large model features, including deep query understanding, intelligent index construction, and efficient similarity calculation, the strong representation capabilities of the target retrieval generation model are combined with efficient retrieval algorithms, significantly improving the accuracy and response speed of the retrieval system.
[0107] Figure 5 is a flowchart of another data processing method according to an embodiment of the present disclosure. As shown in Figure 5, the method may include the following steps:
[0108] Step S51: Obtain multimodal product query data, wherein the multimodal product query data is used to provide description information of the products to be selected;
[0109] Step S52: The target retrieval generation model is used to perform product knowledge retrieval and product information generation on the multimodal product query data to obtain the target product answer. The target retrieval generation model is used to obtain the target product retrieval results that meet the preset filtering conditions from the candidate product retrieval results associated with the multimodal product query data, and to generate the response content contained in the target product answer based on the target product retrieval results. The preset filtering conditions are used to filter the candidate product retrieval results according to the ranking results of the relevance of the candidate product retrieval results.
[0110] Based on steps S51 to S52 above, multimodal product query data is acquired, and then a target retrieval generation model is used to perform product knowledge retrieval and product information generation on the multimodal product query data to obtain the target product answer. The target retrieval generation model is used to obtain target product retrieval results that meet preset filtering conditions from the candidate product retrieval results associated with the multimodal product query data, and to generate the response content contained in the target product answer based on the target product retrieval results. The preset filtering conditions are used to filter the candidate product retrieval results according to the ranking of their relevance. Therefore, using the target retrieval generation model to process multimodal product query data can significantly improve the efficiency, accuracy, and user experience of information retrieval and generation, while enhancing the knowledge coverage and adaptability of the model. The target retrieval generation model can effectively select, from a large number of candidate product retrieval results, the target product retrieval results that meet the preset filtering conditions. The filtering process is based on a comprehensive analysis of the multimodal product query data, which is more comprehensive and accurate than relying on a single-modality query. This achieves efficient and accurate processing of multimodal product query data, improves the efficiency and accuracy of retrieval and generation for multimodal product query data, and solves the technical problems of low retrieval generation efficiency and poor accuracy in related technologies when retrieving multimodal query data.
[0111] Figure 6 is a flowchart of another data processing method according to an embodiment of the present disclosure. As shown in Figure 6, the method may include the following steps:
[0112] Step S61: Obtain a data processing request through the first application programming interface, wherein the request data carried in the data processing request includes: multimodal query data;
[0113] Step S62: Return a data processing response through the second application programming interface. The response data carried in the data processing response includes: the target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance of the candidate retrieval results.
[0114] The first and second application programming interfaces (APIs) mentioned above can be the same API or different APIs. In one optional embodiment, the interface parameters of the first and second APIs may include, but are not limited to: a global interface identifier, an interface signing key, an interface timestamp, an interface request identifier, and a system call credential identifier. The first API can use GET or POST as the interface request method to obtain the data processing request. The second API can use JSON format to return the data processing response.
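The request and response exchange can be sketched with JSON payloads. The field names below (`request_id`, `timestamp`, `query`, `target_answer`) and the `handle_request` function are hypothetical illustrations of the interface parameters listed above; signing and credential checks are omitted, and the answer function stands in for the target retrieval generation model.

```python
import json

def handle_request(raw_request, answer_fn):
    """Parse a JSON data processing request and return a JSON response."""
    request = json.loads(raw_request)
    # answer_fn stands in for the target retrieval generation model.
    target_answer = answer_fn(request["query"])
    response = {
        "request_id": request["request_id"],  # echo the request identifier
        "target_answer": target_answer,
    }
    return json.dumps(response)

# Hypothetical POST body carrying multimodal query data.
raw_request = json.dumps({
    "request_id": "req-001",
    "timestamp": 1700000000,
    "query": {"text": "Which sizes are available?", "image": "shoes.jpg"},
})

raw_response = handle_request(
    raw_request,
    answer_fn=lambda query: f"Answer for: {query['text']}",
)
parsed_response = json.loads(raw_response)
```

Echoing the request identifier in the response lets a caller match replies to requests when several are in flight.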
[0115] Based on steps S61 to S62 above, a data processing request is obtained through the first application programming interface (API), and the request data carried in the data processing request includes multimodal query data. Then, a data processing response is returned through the second API, and the response data carried in the data processing response includes the target answer. The target answer is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate the response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to their degree of relevance. Therefore, using the target retrieval generation model to process multimodal query data can significantly improve the efficiency, accuracy, and user experience of information retrieval and generation, while enhancing the knowledge coverage and adaptability of the model. The target retrieval generation model can effectively select, from a large number of candidate retrieval results, the target retrieval results that meet the preset filtering conditions. The filtering process is based on a comprehensive analysis of the multimodal query data, which is more comprehensive and accurate than relying on a single-modality query. This achieves efficient and accurate processing of multimodal query data, improves the efficiency and accuracy of retrieval and generation for multimodal query data, and solves the technical problems of low retrieval generation efficiency and poor accuracy in related technologies when retrieving multimodal query data.
[0116] Figure 7 is a flowchart of another data processing method according to an embodiment of the present disclosure. As shown in Figure 7, the method may include the following steps:
[0117] Step S71: Obtain the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multimodal query data;
[0118] Step S72: In response to the data processing dialogue request, a data processing dialogue response is returned. The information carried in the data processing dialogue response includes: the target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance of the candidate retrieval results.
[0119] Step S73: Display the target answer within the graphical user interface.
[0120] Based on steps S71 to S73 above, the currently input data processing dialogue request is obtained, where the request data carried in the data processing dialogue request includes multimodal query data. In response to the data processing dialogue request, a data processing dialogue response is returned, where the information carried in the data processing dialogue response includes the target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate the response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to their degree of relevance. Finally, the target answer is displayed within the graphical user interface. Using the target retrieval generation model to process multimodal query data can therefore significantly improve the efficiency, accuracy, and user experience of information retrieval and generation, while enhancing the knowledge coverage and adaptability of the model. The target retrieval generation model can filter, from a large number of candidate retrieval results, the target retrieval results that meet the preset filtering conditions. Because the filtering is based on a comprehensive analysis of the multimodal query data, it is more comprehensive and accurate than filtering that relies on a single-modality query. This achieves efficient and accurate processing of multimodal query data, yields the technical effect of improved retrieval generation efficiency and accuracy for multimodal query data, and solves the technical problem in the related art of low retrieval generation efficiency and poor accuracy when retrieving multimodal query data.
[0121] Figure 8 is a flowchart of a model training method according to an embodiment of the present disclosure. As shown in Figure 8, the method may include the following steps:
[0122] Step S81: Obtain multimodal training data;
[0123] Step S82: Adjust the pre-trained model using the target training strategy and multimodal training data to obtain the target retrieval generation model, wherein a cross-modal attention mechanism is introduced into the target training strategy;
[0124] Step S83: Deploy the target retrieval generation model to the execution device, wherein the target retrieval generation model is invoked by the execution device to execute any of the data processing methods in the embodiments of this disclosure.
[0125] Based on steps S81 to S83 above, multimodal training data is acquired, the pre-trained model is then adjusted using the target training strategy and the multimodal training data to obtain the target retrieval generation model, where a cross-modal attention mechanism is introduced into the target training strategy, and finally the target retrieval generation model is deployed to the execution device. The introduced cross-modal attention mechanism enables the model to capture and integrate the correlation information between data of different modalities more effectively, so that the complementary characteristics of the multimodal training data are fully utilized when understanding and generating content. The model's ability to process multimodal queries is thereby significantly enhanced, improving the efficiency and accuracy of retrieval and generation. At the same time, the use of computing resources is optimized, the user experience is improved, and the information retrieval and generation tasks on the execution device become more efficient and intelligent. The target retrieval generation model can filter, from a large number of candidate retrieval results, the target retrieval results that meet the preset filtering conditions. Because the filtering is based on a comprehensive analysis of the multimodal query data, it is more comprehensive and accurate than filtering that relies on a single-modality query. This achieves efficient and accurate processing of multimodal query data, yields the technical effect of improved retrieval generation efficiency and accuracy for multimodal query data, and solves the technical problem in the related art of low retrieval generation efficiency and poor accuracy when retrieving multimodal query data.
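This disclosure does not specify the form of the cross-modal attention mechanism. As one common possibility, the following Python sketch shows standard scaled dot-product attention in which feature vectors of one modality (for example, text tokens) act as queries over feature vectors of another modality (for example, image patches). The function names are hypothetical, and the learned projection matrices used in practice are omitted for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(
    queries: List[Vector],  # features of one modality, e.g. text tokens
    keys: List[Vector],     # features of the other modality, e.g. image patches
    values: List[Vector],   # content vectors paired with the keys
) -> List[Vector]:
    """Each query attends over the other modality and returns a fused vector."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between the query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(fused)
    return outputs
```

In this sketch a text feature that aligns strongly with one image patch receives an output dominated by that patch's value vector, which is the sense in which correlation information between modalities is captured.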
[0126] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.
[0127] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, because according to this disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this disclosure.
[0128] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, they can also be implemented by hardware. Based on this understanding, the technical solutions of this disclosure, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this disclosure.
[0129] According to embodiments of this disclosure, a data processing apparatus for implementing the above-described data processing method is also provided. Figure 9 is a structural block diagram of a data processing apparatus according to an embodiment of this disclosure. As shown in Figure 9, the apparatus includes:
[0130] The acquisition module 901 is configured to acquire multimodal query data;
[0131] The processing module 902 is configured to use a target retrieval generation model to perform knowledge retrieval and information generation on multimodal query data to obtain the target answer. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance of the candidate retrieval results.
[0132] Optionally, the processing module 902 is further configured to: use a multimodal recall sub-model to perform recall processing on the multimodal query data to obtain candidate retrieval results; use a multimodal ranking sub-model to perform ranking processing on the candidate retrieval results to obtain ranking results; and use a multimodal generation sub-model to perform response content generation processing on the ranking results to obtain the target answer.
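The three-stage flow of recall, ranking, and generation can be sketched as follows. This is a minimal illustration under stated assumptions: the three sub-models are passed in as plain callables, the query is reduced to a text string, and the toy word-overlap functions and CORPUS contents are hypothetical stand-ins rather than the sub-models of this disclosure.

```python
from typing import Callable, List

Recall = Callable[[str], List[str]]
Score = Callable[[str, str], float]
Generate = Callable[[str, List[str]], str]


def retrieve_and_generate(
    query: str, recall: Recall, score: Score, generate: Generate, top_k: int = 1
) -> str:
    """Recall candidates, rank them by relevance, then generate the answer."""
    candidates = recall(query)
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    target_results = ranked[:top_k]  # filtering condition: keep the top-k ranked
    return generate(query, target_results)


# Toy stand-ins for the three sub-models (assumptions for illustration only).
CORPUS = ["red running shoes", "blue running jacket", "red leather bag", "green tea"]


def toy_recall(query: str) -> List[str]:
    words = set(query.split())
    return [doc for doc in CORPUS if words & set(doc.split())]


def toy_score(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))


def toy_generate(query: str, results: List[str]) -> str:
    return f"Q: {query} | evidence: {'; '.join(results)}"
```

Calling retrieve_and_generate("red shoes", toy_recall, toy_score, toy_generate) recalls two candidates, keeps the higher-ranked one, and passes it to the generator.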
[0133] Optionally, the processing module 902 is further configured to: use a feature encoding module to perform feature encoding processing on the multimodal query data to obtain multimodal feature vectors; use a feature mapping module to perform feature mapping processing on the multimodal feature vectors to obtain multimodal mapping results, wherein the multimodal mapping results are used to represent the multimodal representation obtained by mapping feature vectors of different modalities to a shared multimodal space; and use a feature matching module to recall candidate retrieval results that match the multimodal mapping results from the target multimodal dataset.
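One way to picture the encoding, mapping, and matching steps is the sketch below. It assumes that each modality has its own linear projection into the shared space, that the projected vectors are fused by summation, and that matching uses cosine similarity; none of these choices is specified by this disclosure, and all function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def project(matrix: Matrix, vector: Vector) -> Vector:
    """Linear map: one output component per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def normalize(vector: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def map_to_shared_space(
    features: Dict[str, Vector], projections: Dict[str, Matrix]
) -> Vector:
    """Project each modality into the shared space and fuse by summation."""
    shared_dim = len(next(iter(projections.values())))
    fused = [0.0] * shared_dim
    for modality, vector in features.items():
        mapped = project(projections[modality], vector)
        fused = [f + m for f, m in zip(fused, mapped)]
    return normalize(fused)


def match(query_vec: Vector, dataset: Dict[str, Vector], top_k: int) -> List[str]:
    """Return the ids of the top_k dataset entries by cosine similarity."""
    def similarity(item_vec: Vector) -> float:
        return sum(a * b for a, b in zip(query_vec, normalize(item_vec)))

    ranked = sorted(dataset.items(), key=lambda item: similarity(item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]
```

Because text and image features of different dimensions are projected to the same dimension before fusion, entries of the dataset can be compared against a single query vector regardless of which modalities the query contained.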
[0134] Optionally, the processing module 902 is further configured to: in the feature matching module, recall candidate retrieval results that match the multimodal mapping results from the target multimodal dataset based on a preset index structure.
[0135] Optionally, the acquisition module 901 is further configured to acquire an initial multimodal dataset; the data processing device further includes: a preprocessing module 903, configured to perform data cleaning and data preprocessing on the initial multimodal dataset to obtain an intermediate multimodal dataset; an annotation module 904, configured to perform data alignment and data annotation processing on the intermediate multimodal dataset to obtain a target multimodal dataset; and an encoding module 905, configured to perform feature encoding on the target multimodal dataset to construct a preset index structure.
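The dataset preparation flow (cleaning, alignment and annotation, encoding, and index construction) could look roughly like the following sketch. The record fields ("image_id", "caption", "label"), the token-set encoding, and the inverted index are all assumptions chosen to keep the example small; the disclosure does not prescribe the encoder or the index structure.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

Record = Dict[str, Optional[str]]


def clean(records: List[Record]) -> List[Dict[str, str]]:
    """Data cleaning: drop records with missing fields and exact duplicates."""
    seen = set()
    kept = []
    for record in records:
        image_id = record.get("image_id") or ""
        caption = (record.get("caption") or "").strip().lower()
        key = (image_id, caption)
        if not image_id or not caption or key in seen:
            continue
        seen.add(key)
        kept.append({"image_id": image_id, "caption": caption})
    return kept


def align_and_annotate(
    records: List[Dict[str, str]], labels: Dict[str, str]
) -> List[Dict[str, str]]:
    """Alignment and annotation: keep image-caption pairs that have a label."""
    return [
        dict(record, label=labels[record["image_id"]])
        for record in records
        if record["image_id"] in labels
    ]


def build_index(records: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Encode each record as a token set and build an inverted index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        tokens = set(record["caption"].split()) | {record["label"]}
        for token in tokens:
            index[token].add(record["image_id"])
    return dict(index)
```

In a realistic system the token-set step would be replaced by a learned feature encoder and the inverted index by a vector index, but the ordering of the stages would stay the same.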
[0136] Optionally, the processing module 902 is further configured to: retrieve the initial retrieval results from the target multimodal dataset based on a preset index structure; calculate the similarity between the multimodal mapping results and the initial retrieval results to obtain the calculation results; and select candidate retrieval results that match the multimodal mapping results from the initial retrieval results based on the calculation results.
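The two-stage recall described here, index lookup followed by similarity-based selection, can be sketched as below. The bucket-keyed index, the cosine similarity measure, and the fixed threshold are assumptions for illustration; the disclosure leaves the index structure and the similarity calculation open.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_candidates(
    query_vec: Vector,
    query_bucket: str,                # hypothetical coarse key into the index
    index: Dict[str, List[str]],      # preset index structure: bucket -> item ids
    vectors: Dict[str, Vector],       # encoded items of the target dataset
    threshold: float,
) -> List[Tuple[str, float]]:
    """Stage 1: index lookup. Stage 2: similarity scoring and selection."""
    initial = index.get(query_bucket, [])
    scored = [(item_id, cosine(query_vec, vectors[item_id])) for item_id in initial]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(item_id, score) for item_id, score in scored if score >= threshold]
```

The index lookup keeps the similarity calculation restricted to a small initial set, so the more expensive comparison is not run against the whole dataset.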
[0137] Optionally, the processing module 902 is further configured to: concatenate some of the candidate retrieval results in the ranking results based on a preset prompt text format to obtain the target prompt text.
[0138] Optionally, the processing module 902 is further configured to: input the target prompt text into the multimodal generation sub-model, so that the multimodal generation sub-model performs response content generation processing based on the multimodal concatenated data recorded in the target prompt text to obtain the target answer.
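Constructing the target prompt text from the top-ranked results can be sketched as follows. The template wording, the "modality" and "content" fields, and the top_n cutoff are hypothetical; the disclosure states only that a preset prompt text format is used.

```python
from typing import Dict, List

# Hypothetical preset prompt text format (assumption, not from the disclosure).
PROMPT_TEMPLATE = "Reference material:\n{context}\n\nQuestion: {question}\nAnswer:"


def build_target_prompt(
    question: str, ranked_results: List[Dict[str, str]], top_n: int = 2
) -> str:
    """Concatenate the top_n ranked results into the target prompt text."""
    selected = ranked_results[:top_n]
    context = "\n".join(
        f"[{position}] ({result['modality']}) {result['content']}"
        for position, result in enumerate(selected, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

The resulting string would then be passed to the multimodal generation sub-model; tagging each entry with its modality is one simple way to keep mixed text and image references distinguishable inside a single prompt.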
[0139] It should be noted that the acquisition module 901 and processing module 902 mentioned above correspond to steps S21 to S22 in the above embodiments. The two modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also be part of the device and run in the server 10 provided in the above embodiments.
[0140] Figure 10 is a structural block diagram of another data processing apparatus according to an embodiment of the present disclosure. As shown in Figure 10, the apparatus includes:
[0141] The acquisition module 1001 is configured to acquire multimodal product query data, which is used to provide descriptive information about the products to be selected.
[0142] The processing module 1002 is configured to use a target retrieval generation model to perform product knowledge retrieval and product information generation on multimodal product query data to obtain the target product answer. The target retrieval generation model is used to obtain the target product retrieval results that meet the preset filtering conditions from the candidate product retrieval results associated with the multimodal product query data, and to generate the response content contained in the target product answer based on the target product retrieval results. The preset filtering conditions are used to filter the candidate product retrieval results according to the ranking results of the relevance of the candidate product retrieval results.
[0143] It should be noted that the acquisition module 1001 and processing module 1002 mentioned above correspond to steps S51 to S52 in the above embodiments. The two modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also be part of the device and run in the server 10 provided in the above embodiments.
[0144] Figure 11 is a structural block diagram of another data processing apparatus according to an embodiment of the present disclosure. As shown in Figure 11, the apparatus includes:
[0145] The acquisition module 1101 is configured to acquire a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multimodal query data;
[0146] The return module 1102 is configured to return a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes: the target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance of the candidate retrieval results.
[0147] It should be noted that the acquisition module 1101 and return module 1102 correspond to steps S61 to S62 in the above embodiments. The two modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also be part of the device and run in the server 10 provided in the above embodiments.
[0148] Figure 12 is a structural block diagram of another data processing apparatus according to an embodiment of the present disclosure. As shown in Figure 12, the apparatus includes:
[0149] The acquisition module 1201 is configured to acquire the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multimodal query data;
[0150] The return module 1202 is configured to respond to the data processing dialogue request and return a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: the target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model. The target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from the candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results. The preset filtering conditions are used to filter the candidate retrieval results according to the degree of relevance of the candidate retrieval results.
[0151] The display module 1203 is configured to display the target answer within the graphical user interface.
[0152] It should be noted that the acquisition module 1201, return module 1202, and display module 1203 correspond to steps S71 to S73 in the above embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also run as part of the device in the server 10 provided in the above embodiments.
[0153] Figure 13 is a structural block diagram of a model training device according to an embodiment of the present disclosure. As shown in Figure 13, the device includes:
[0154] The acquisition module 1301 is configured to acquire multimodal training data;
[0155] The adjustment module 1302 is configured to adjust the pre-trained model using the target training strategy and multimodal training data to obtain the target retrieval generation model, wherein a cross-modal attention mechanism is introduced into the target training strategy;
[0156] The deployment module 1303 is configured to deploy the target retrieval generation model to the execution device, wherein the target retrieval generation model is invoked by the execution device to execute the data processing method in the embodiments of this disclosure.
[0157] It should be noted that the acquisition module 1301, adjustment module 1302, and deployment module 1303 correspond to steps S81 to S83 in the above embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also run as part of the device in the server 10 provided in the above embodiments.
[0158] It should be noted that the preferred embodiments involved in the above embodiments of this disclosure are the same as the solutions, application scenarios and implementation processes provided in the above embodiments, but are not limited to the solutions provided in the above embodiments.
[0159] Embodiments of this disclosure can provide a data processing system, including: a client for sending multimodal query data; a server connected to the client for using a target retrieval generation model to perform knowledge retrieval and information generation on the multimodal query data to obtain a target answer, wherein the target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter candidate retrieval results according to the degree of relevance; the client is also used to output the target answer.
[0160] Embodiments of this disclosure may provide a computing device. Figure 14 is a structural block diagram of a computing device according to an embodiment of the present disclosure. As shown in Figure 14, the computing device may include: one or more (only one is shown in the figure) processors 142, memory 144, memory controller, and peripheral interfaces.
[0161] The aforementioned computing device can be understood as an integrated smart terminal, including but not limited to servers, desktop computers, PCs (Personal Computers), all-in-one model machines, etc., and the computing device may have the model described in the above embodiments of this disclosure pre-installed.
[0162] Specifically, this computing device can pre-install various types of models, including but not limited to models in natural language processing, visual processing, speech processing, code processing, and multimodal task processing, thus providing diverse model selection. In different product forms, this computing device can support one or more model usage methods, including but not limited to model training, model invocation, model fine-tuning, model deployment, model inference, and application. In some product forms, this computing device also supports model management, including but not limited to multi-type model management (supporting the management of discriminative, generative, and other model types), model version control (supporting the control of different model versions), and model evaluation (evaluating model performance and effectiveness based on model evaluation tools). In other product forms, this computing device can also create applications based on models, providing API calling capabilities, allowing models to be called into created applications through API interfaces, and providing application management tools for application management and monitoring.
[0163] Furthermore, the computing device may also include data management (supporting the creation and management of model tuning datasets), a training center (providing abundant training resources to help users learn and master AI technology), and basic control capabilities (providing enterprise-level basic control capabilities to ensure the security and efficient operation of the system). Through the above functions, it provides a comprehensive and integrated device for AI development, training, deployment, and application.
[0164] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0165] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.
[0166] Embodiments of this disclosure can provide an electronic device. Figure 15 is a structural block diagram of an electronic device according to an embodiment of this disclosure. As shown in Figure 15, the electronic device may include: an input / output device 152; a memory 154; and a processor 156, wherein the processor 156 is connected to the input / output device 152 and the memory 154 via a bus 158.
[0167] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0168] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.
[0169] It will be understood by those skilled in the art that the structure shown in the figure is merely illustrative, and the computing device may also be a smartphone, tablet computer, PDA, mobile internet device (MID), PAD, or other terminal device. This figure does not limit the structure of the aforementioned computing device. For example, the computing device may include more or fewer components (such as network interfaces, display devices, etc.) than shown in the figure, or may have a different configuration than that shown in the figure.
[0170] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.
[0171] Embodiments of this disclosure also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store program code executed by the method provided in the above embodiments.
[0172] Optionally, in this embodiment, the storage medium may be located in a computing device.
[0173] Optionally, in this embodiment, the computer-readable storage medium is configured to store an executable program, which, when the executable program is running, controls the device where the computer-readable storage medium is located to execute the method described in any of the above embodiments.
[0174] Embodiments of this disclosure also provide a computer program product. Optionally, in this embodiment, the computer program product may include a computer program that, when executed by a processor, implements the methods provided in the embodiments described above.
[0175] Embodiments of this disclosure also provide a computer program product. Optionally, the computer program product may include a non-volatile computer-readable storage medium, which can be used to store a computer program that, when executed by a processor, implements the methods provided in the embodiments described above.
[0176] Embodiments of this disclosure also provide a computer program. Optionally, in this embodiment, when the computer program is executed by a processor, it implements the method provided in the above embodiments.
[0177] In the above embodiments of this disclosure, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0178] In the several embodiments provided in this disclosure, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some interfaces; indirect couplings or communication connections between units or modules may be electrical or other forms.
[0179] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0180] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0181] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.
[0182] The above description is only a preferred embodiment of this disclosure. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principles of this disclosure, and these improvements and modifications should also be considered within the scope of protection of this disclosure.
Claims
1. A data processing method, comprising: acquiring multimodal query data; and performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model to obtain a target answer; wherein the target retrieval generation model is used to obtain target retrieval results that meet preset filtering conditions from candidate retrieval results associated with the multimodal query data, and to generate response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter the candidate retrieval results according to ranking results of the relevance of the candidate retrieval results.
2. The data processing method according to claim 1, wherein the target retrieval generation model includes a multimodal recall sub-model, a multimodal ranking sub-model, and a multimodal generation sub-model, and performing knowledge retrieval and information generation on the multimodal query data using the target retrieval generation model to obtain the target answer comprises: performing recall processing on the multimodal query data using the multimodal recall sub-model to obtain the candidate retrieval results; performing ranking processing on the candidate retrieval results using the multimodal ranking sub-model to obtain the ranking results; and performing response content generation processing on the ranking results using the multimodal generation sub-model to obtain the target answer.
3. The data processing method according to claim 2, wherein the multimodal recall sub-model includes a feature encoding module, a feature mapping module, and a feature matching module, and performing recall processing on the multimodal query data using the multimodal recall sub-model to obtain the candidate retrieval results includes: performing feature encoding processing on the multimodal query data using the feature encoding module to obtain multimodal feature vectors; performing feature mapping processing on the multimodal feature vectors using the feature mapping module to obtain a multimodal mapping result, wherein the multimodal mapping result represents the multimodal representation obtained by mapping feature vectors of different modalities to a shared multimodal space; and recalling, using the feature matching module, the candidate retrieval results that match the multimodal mapping result from a target multimodal dataset.
4. The data processing method according to claim 3, wherein recalling, using the feature matching module, the candidate retrieval results that match the multimodal mapping result from the target multimodal dataset includes: retrieving, in the feature matching module, the candidate retrieval results that match the multimodal mapping result from the target multimodal dataset based on a preset index structure.
5. The data processing method according to claim 4, wherein the data processing method further includes: obtaining an initial multimodal dataset; cleaning and preprocessing the initial multimodal dataset to obtain an intermediate multimodal dataset; performing data alignment and data annotation on the intermediate multimodal dataset to obtain the target multimodal dataset; and performing feature encoding on the target multimodal dataset to construct the preset index structure.
6. The data processing method according to claim 4, wherein retrieving the candidate retrieval results that match the multimodal mapping result from the target multimodal dataset based on the preset index structure includes: retrieving initial retrieval results from the target multimodal dataset based on the preset index structure; calculating the similarity between the multimodal mapping result and the initial retrieval results to obtain a calculation result; and selecting, based on the calculation result, the candidate retrieval results that match the multimodal mapping result from the initial retrieval results.
7. The data processing method according to claim 4, wherein the data processing method further includes: concatenating, based on a preset prompt text format, some of the candidate retrieval results in the ranking result to obtain a target prompt text.
8. The data processing method according to claim 7, wherein processing the ranking result using the multimodal generation sub-model to generate response content, thereby obtaining the target answer, includes: inputting the target prompt text into the multimodal generation sub-model, so that the multimodal generation sub-model generates response content based on the multimodal concatenated data recorded in the target prompt text to obtain the target answer.
9. The data processing method according to claim 1, wherein, The target retrieval generation model is obtained by adjusting the pre-trained model using a target training strategy and multimodal training data. The target training strategy incorporates a cross-modal attention mechanism.
10. The data processing method according to claim 2, wherein sorting the candidate retrieval results using the multimodal ranking sub-model to obtain the ranking result includes: calculating relevance scores corresponding to the candidate retrieval results; and sorting the candidate retrieval results based on the relevance scores using the multimodal ranking sub-model to obtain the ranking result.
11. The data processing method according to claim 2, wherein sorting the candidate retrieval results using the multimodal ranking sub-model to obtain the ranking result includes: sorting the candidate retrieval results using the multimodal ranking sub-model according to a preset ranking algorithm to obtain the ranking result, wherein the preset ranking algorithm includes one of the following: a ranking algorithm based on vector similarity, a ranking algorithm based on a multimodal attention mechanism, or a personalized ranking algorithm that combines user behavior data and contextual information.
12. The data processing method according to claim 2, wherein processing the ranking result using the multimodal generation sub-model to generate response content, thereby obtaining the target answer, includes: processing the ranking result using the multimodal generation sub-model to generate initial response content; and adjusting the initial response content to obtain the target answer.
13. The data processing method according to claim 4, wherein, The preset index structure may include at least one of the following: inverted index structure, vector index structure.
14. A data processing method, comprising: acquiring multimodal product query data, wherein the multimodal product query data is used to provide descriptive information of the products to be selected; and performing product knowledge retrieval and product information generation on the multimodal product query data using a target retrieval generation model to obtain a target product answer; wherein the target retrieval generation model is used to obtain, from candidate product retrieval results associated with the multimodal product query data, target product retrieval results that meet preset filtering conditions, and to generate response content contained in the target product answer based on the target product retrieval results, and the preset filtering conditions are used to filter the candidate product retrieval results according to a ranking result of the relevance of the candidate product retrieval results.
15. A data processing method, comprising: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes multimodal query data; and returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes a target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model; wherein the target retrieval generation model is used to obtain, from candidate retrieval results associated with the multimodal query data, target retrieval results that meet preset filtering conditions, and to generate response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter the candidate retrieval results according to a ranking result of the relevance of the candidate retrieval results.
16. A data processing method, comprising: obtaining a currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes multimodal query data; returning, in response to the data processing dialogue request, a data processing dialogue reply, wherein the information carried in the data processing dialogue reply includes a target answer, which is obtained by performing knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model, the target retrieval generation model is used to obtain, from candidate retrieval results associated with the multimodal query data, target retrieval results that meet preset filtering conditions, and to generate response content contained in the target answer based on the target retrieval results, and the preset filtering conditions are used to filter the candidate retrieval results according to a ranking result of the relevance of the candidate retrieval results; and displaying the target answer within a graphical user interface.
17. A model training method, comprising: acquiring multimodal training data; adjusting a pre-trained model using a target training strategy and the multimodal training data to obtain a target retrieval generation model, wherein a cross-modal attention mechanism is introduced into the target training strategy; and deploying the target retrieval generation model to an execution device, wherein the target retrieval generation model is invoked by the execution device to execute the data processing method according to any one of claims 1 to 16.
18. A data processing system, comprising: a client, used to send multimodal query data; and a server, connected to the client and used to perform knowledge retrieval and information generation on the multimodal query data using a target retrieval generation model to obtain a target answer; wherein the target retrieval generation model is used to obtain, from candidate retrieval results associated with the multimodal query data, target retrieval results that meet preset filtering conditions, and to generate response content contained in the target answer based on the target retrieval results, the preset filtering conditions are used to filter the candidate retrieval results according to a ranking result of the relevance of the candidate retrieval results, and the client is also used to output the target answer.
19. An electronic device, comprising: a memory, which stores an executable program; and a processor for running the program, wherein the program, when running, performs the data processing method according to any one of claims 1 to 16 or the model training method according to claim 17.
20. A computer-readable storage medium, wherein, The computer-readable storage medium includes a stored executable program, wherein, when the executable program is executed, it controls the device on which the computer-readable storage medium is located to perform the data processing method of any one of claims 1 to 16 or the model training method of claim 17.
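By way of non-limiting illustration, the recall, ranking, filtering, prompt concatenation, and generation flow recited in claims 1, 2, 6, 7, and 10 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the disclosed implementation: cosine similarity as the relevance score, a linear scan in place of the preset index structure, a top-k cutoff as the preset filtering condition, the prompt layout, and every name used (`Candidate`, `cosine`, `recall`, `rank`, `select_top_k`, `build_prompt`, `answer`, and the `generate` callable standing in for the multimodal generation sub-model) are hypothetical and introduced only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    doc_id: str
    text: str
    # Hypothetical embedding of the item in the shared multimodal space.
    vector: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_vec: List[float], dataset: List[Candidate],
           min_similarity: float = 0.0) -> List[Candidate]:
    """Recall stage: keep entries whose similarity to the query exceeds a threshold.

    A linear scan is used here purely as a stand-in for a preset index structure.
    """
    return [c for c in dataset if cosine(query_vec, c.vector) > min_similarity]


def rank(query_vec: List[float],
         candidates: List[Candidate]) -> List[Tuple[float, Candidate]]:
    """Ranking stage: score each candidate and sort by descending relevance."""
    scored = [(cosine(query_vec, c.vector), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def select_top_k(ranked: List[Tuple[float, Candidate]], k: int) -> List[Candidate]:
    """Assumed filtering condition: keep the k highest-ranked candidates."""
    return [c for _, c in ranked[:k]]


def build_prompt(question: str, selected: List[Candidate]) -> str:
    """Concatenate the selected candidates into an assumed prompt text format."""
    context = "\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(selected))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str, query_vec: List[float], dataset: List[Candidate],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Run recall, ranking, filtering, and prompt construction, then generate."""
    candidates = recall(query_vec, dataset)
    ranked = rank(query_vec, candidates)
    selected = select_top_k(ranked, k)
    return generate(build_prompt(question, selected))
```

In this sketch, `generate` may be any callable that maps a prompt string to a response string, so the pipeline can be exercised with a stub before a real generation model is attached. The top-k cutoff is only one possible reading of "preset filtering conditions"; a relevance-score threshold would fit the same interface.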