Information retrieval method and apparatus, dialog method and apparatus, and device, storage medium and program product

By using a multimodal data rewriting model, input data from multiple modalities is rewritten into target text of a single text modality. This solves the accuracy problem caused by the reliance on knowledge bases in existing multimodal information retrieval technologies, and achieves highly accurate information retrieval and deep retrieval.

WO2026124171A1 · PCT designated stage · Publication Date: 2026-06-18 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
ALIBABA (CHINA) CO LTD
Filing Date
2025-11-20
Publication Date
2026-06-18

Abstract

Provided are an information retrieval method and apparatus, a dialog method and apparatus, and a device, a storage medium and a program product. The information retrieval method comprises: receiving user input information, wherein the user input information comprises multimodal data; performing data modality rewriting on the multimodal data to obtain target text, wherein the target text is used for describing content represented by the multimodal data; and performing retrieval on the basis of the target text to obtain a target retrieval result. The present disclosure can improve the accuracy of the target retrieval result.

Description

Information retrieval, dialogue methods, devices, equipment, storage media and program products

[0001] This disclosure claims priority to Chinese Patent Application No. 202411846206.6, filed with the China Patent Office on December 13, 2024, entitled "Information Retrieval, Dialogue Method, Apparatus, Device, Storage Medium and Program Product", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This disclosure relates to the field of computer technology, and in particular to information retrieval and dialogue methods, apparatuses, a device, a storage medium, and a program product.

Background Technology

[0003] With the continuous development of artificial intelligence (AI) technology, its applications are becoming increasingly widespread. For example, AI technology can currently provide users with various functions such as chatbots and information retrieval. Information retrieval functions can include both single-modal input data retrieval and multi-modal input data retrieval. Specifically, in multi-modal input data retrieval, information retrieval is performed based on user-inputted text, images, and data of other modalities.

[0004] Existing methods for information retrieval based on multi-modal input data primarily involve converting the multi-modal input data into vectors. Then, the content with the highest similarity to the vector is retrieved from a pre-stored knowledge base as the retrieval result.

[0005] However, obtaining search results through vector similarity requires extensive pre-configuration of the knowledge base. If the knowledge base is not rich enough, the accuracy of the search results will be poor.

Summary of the Invention

[0006] This disclosure provides information retrieval and dialogue methods, apparatuses, a device, a storage medium, and a program product that can improve the accuracy of retrieval results.

[0007] In a first aspect, this disclosure provides an information retrieval method, the method comprising:

[0008] Receive user input information, which includes data in multiple modalities;

[0009] Data modality rewriting is performed on the data of the multiple modalities to obtain target text; the target text is used to describe the content represented by the data of the multiple modalities.

[0010] The search is performed based on the target text to obtain the target search results.

[0011] Optionally, the step of performing data modality rewriting on the data of the multiple modalities to obtain the target text includes:

[0012] The data from the various modalities are input into a multimodal data rewriting model to obtain the target text. The multimodal data rewriting model is obtained by training a preset model using a sample dataset. The sample dataset includes samples and sample labels. The samples include sample data from various modalities, and the sample labels include sample text. The sample text is used to describe the content represented by the sample data from the various modalities.

[0013] Optionally, the multimodal data rewriting model includes: a feature encoding module and a text generation module; the step of inputting the multimodal data into the multimodal data rewriting model to obtain the target text includes:

[0014] The feature encoding module performs feature encoding on the data of the multiple modalities to obtain the encoded features of the data of each modality.

[0015] The encoding features of the data of each modality are input into the text generation module to obtain the target text.

[0016] Optionally, the step of performing a search based on the target text to obtain the target search results includes:

[0017] The target search results are obtained by performing a search based on the target text using the target search engine.

[0018] Optionally, the method is applied to a dialogue system, and the method further includes:

[0019] Based on the target retrieval results and the user input information, target prompt information is obtained;

[0020] The target prompt information is input into a preset answer generation model to obtain the answer content corresponding to the user input information;

[0021] Output the answer.
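The dialogue steps above (build target prompt information from the retrieval results and the user input, then pass it to an answer generation model) can be sketched as follows. This is a minimal illustration only: the prompt template, the function names `build_target_prompt` and `generate_reply`, and the choice to use only the text part of the user input are assumptions, not details given in the disclosure.

```python
from typing import Callable, List


def build_target_prompt(user_text: str, retrieval_results: List[str]) -> str:
    """Combine retrieval results and the user's question into one prompt.

    The template below is hypothetical; the disclosure does not fix a format.
    """
    lines = ["Reference material:"]
    for index, result in enumerate(retrieval_results, start=1):
        lines.append(f"[{index}] {result}")
    lines.append("")
    lines.append(f"Question: {user_text}")
    return "\n".join(lines)


def generate_reply(
    user_text: str,
    retrieval_results: List[str],
    answer_model: Callable[[str], str],
) -> str:
    """Feed the target prompt to a preset answer generation model (a callable here)."""
    prompt = build_target_prompt(user_text, retrieval_results)
    return answer_model(prompt)
```

Passing the answer generation model in as a callable keeps the sketch independent of any particular model or serving interface.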

[0022] Optionally, before performing data modality rewriting on the data of the multiple modalities to obtain the target text, the method further includes:

[0023] Obtain the user's retrieval mode requirement;

[0024] When the retrieval mode requirement indicates the deep retrieval mode, perform data modality rewriting on the data of the multiple modalities to obtain the target text.

[0025] Optionally, obtaining the user's retrieval mode requirement includes:

[0026] Receive a user-triggered target instruction that indicates the retrieval mode;

[0027] Based on the target instruction, the retrieval mode requirement is determined;

[0028] or,

[0029] The user input information is input into the retrieval mode determination model to obtain the retrieval mode requirement.

[0030] Optionally, the method further includes:

[0031] In response to the retrieval mode requirement indicating a non-deep retrieval mode, the user input information is input into the answer generation model to obtain the answer content corresponding to the user input information;

[0032] Output the answer.
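The retrieval mode handling described above can be sketched as a small routing function. This is an illustrative sketch under stated assumptions: the names `RetrievalMode`, `determine_mode`, and `handle_input` are hypothetical, the instruction is assumed to arrive as the string "deep" or "non_deep", and giving an explicit instruction precedence over the mode determination model is a choice made here (the disclosure presents the two as alternatives).

```python
from enum import Enum
from typing import Callable, Optional


class RetrievalMode(Enum):
    DEEP = "deep"
    NON_DEEP = "non_deep"


def determine_mode(
    user_input: object,
    target_instruction: Optional[str] = None,
    mode_model: Optional[Callable[[object], RetrievalMode]] = None,
) -> RetrievalMode:
    """Obtain the retrieval mode requirement from an instruction or a model."""
    if target_instruction is not None:
        # A user-triggered instruction names the mode directly.
        return RetrievalMode(target_instruction)
    if mode_model is not None:
        # Otherwise a retrieval mode determination model decides.
        return mode_model(user_input)
    # Assumption: fall back to the non-deep mode when neither is available.
    return RetrievalMode.NON_DEEP


def handle_input(
    user_input: object,
    mode: RetrievalMode,
    deep_pipeline: Callable[[object], str],
    answer_model: Callable[[object], str],
) -> str:
    """Deep mode: rewrite, retrieve, then answer. Non-deep mode: answer directly."""
    if mode is RetrievalMode.DEEP:
        return deep_pipeline(user_input)
    return answer_model(user_input)
```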

[0033] Optionally, the method further includes:

[0034] Obtain a sample dataset, which includes samples and sample labels. The samples include sample data of multiple modalities, and the sample labels include sample text. The sample text is used to describe the content represented by the sample data of multiple modalities.

[0035] The preset model is trained using the sample dataset to obtain the trained multimodal data rewriting model; the multimodal data rewriting model is used to convert data of at least one modality into text modality data.

[0036] Optionally, the preset model includes: a feature encoding module and a text generation sub-model to be trained. The step of training the preset model using the sample dataset to obtain the trained multimodal data rewriting model includes:

[0037] The feature encoding module performs feature encoding on the sample to obtain the encoded features of the sample data for each modality included in the sample.

[0038] The text generation sub-model to be trained is trained using the encoding features of the sample data of each modality and the sample labels to obtain a trained text generation module; the text generation module is used to obtain the data of the text modality based on the encoding features of the data of multiple modalities.

[0039] In a second aspect, this disclosure provides a dialogue method, the dialogue method comprising:

[0040] Receive user input information, which includes data in at least two of the following modalities: image, text, and audio.

[0041] The data from at least two modalities are input into a multimodal data rewriting model to obtain target text; the target text is used to describe the content represented by the data from at least two modalities.

[0042] The target search results are obtained by searching the target text using the target search engine.

[0043] Based on the target search results and the user input information, the answer content is determined;

[0044] Output the answer.

[0045] In a third aspect, this disclosure provides an information retrieval device, the device comprising:

[0046] The receiving module is used to receive user input information, which includes data in multiple modalities.

[0047] The rewriting module is used to perform data modality rewriting on the data of the multiple modalities to obtain target text; the target text is used to describe the content represented by the data of the multiple modalities.

[0048] The processing module is used to perform retrieval based on the target text and obtain the target retrieval results.

[0049] In a fourth aspect, this disclosure provides a dialogue device, the dialogue device comprising:

[0050] The receiving module is used to receive user input information, which includes data in at least two of the following modalities: image, text, and audio.

[0051] The processing module is used to input the data from the at least two modalities into a multimodal data rewriting model to obtain target text; to perform a search based on the target text using a target search engine to obtain target search results; and to determine the answer content based on the target search results and the user input information; wherein the target text is used to describe the content represented by the data from the at least two modalities.

[0052] The output module is used to output the answer content.

[0053] In a fifth aspect, this disclosure provides an electronic device, including: a processor and a memory; the processor is communicatively connected to the memory;

[0054] The memory stores computer instructions;

[0055] The processor executes computer instructions stored in the memory to implement the method as described in any one of the first and / or second aspects.

[0056] In a sixth aspect, this disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, are used to implement the method as described in any one of the first and / or second aspects.

[0057] In a seventh aspect, this disclosure provides a computer program product comprising a computer program that, when executed by a processor, implements the method as described in any one of the first and / or second aspects.

[0058] The information retrieval, dialogue method, apparatus, device, storage medium, and program products disclosed herein can rewrite multimodal data to obtain target text when user input includes multimodal data, thus converting multimodal data into single-text modality data. Then, the target text in this single-text modality can be used for retrieval to obtain the target search results. By using the target text for retrieval, the multimodal retrieval process is transformed into a single-text modality retrieval, thus avoiding the need to convert all multimodal input data into vectors for retrieval. This improves the flexibility of retrieval based on multimodal data and reduces reliance on a pre-set knowledge base during information retrieval based on multimodal data. Consequently, it reduces the impact of insufficient knowledge base content on the accuracy of search results, thereby ensuring the realization of information retrieval based on multimodal data while improving the accuracy of information retrieval.

Attached Figure Description

[0059] To more clearly illustrate the technical solutions in this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0060] Figure 1 is a schematic diagram of data for one image modality;

[0061] Figure 2 is a flowchart illustrating an information retrieval method provided in this disclosure;

[0062] Figure 3 is a flowchart illustrating a training method for a multimodal data rewriting model provided in this disclosure;

[0063] Figure 4 is a flowchart illustrating a dialogue method provided in this disclosure;

[0064] Figure 5 is a flowchart illustrating another dialogue method provided in this disclosure;

[0065] Figure 6 is a schematic diagram of the training process of a multimodal data rewriting model provided in this disclosure;

[0066] Figure 7 is a schematic diagram of the structure of an information retrieval device provided in this disclosure;

[0067] Figure 8 is a schematic diagram of the structure of a training device for a multimodal data rewriting model provided in this disclosure;

[0068] Figure 9 is a schematic diagram of the structure of a dialogue device provided in this disclosure;

[0069] Figure 10 is a schematic diagram of the hardware structure of an electronic device provided in this disclosure.

[0070] The accompanying drawings have illustrated specific embodiments of this disclosure, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concepts of this disclosure to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0071] To make the objectives, technical solutions, and advantages of this disclosure clearer, the technical solutions of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0072] Information retrieval technology can be applied to scenarios such as intelligent assistants or search engines. Existing information retrieval technologies can be divided into unimodal information retrieval and multimodal information retrieval based on the number of data modalities included in the user's input information.

[0073] Unimodal (single-modal) information retrieval requires that the user's input information include data of only one modality, such as only text modality data or only image modality data. This unimodal information retrieval method can achieve, for example, text-to-text search (where both the user's input information and the corresponding search results are text modality data), text-to-image search (where the user's input information is text modality data and the corresponding search results are all image modality data), and image-to-image search (where both the user's input information and the corresponding search results are image modality data).

[0074] However, unimodal information retrieval methods often fail to fully understand the user's intent. Figure 1 illustrates data in an image modality. Taking Figure 1 as an example, suppose a user only has the image of Figure 1 and wants to know the characteristics of the tree's growing environment and the areas where the tree is distributed. When using a unimodal information retrieval method, the search can only be conducted through the image or through text description. For example, when searching only through the image, the unimodal information retrieval method cannot know the user's question about the image, leading to poor accuracy in the search results. Similarly, when searching only through text, the unimodal information retrieval method cannot know what kind of tree the user is describing, also resulting in poor accuracy in the search results. In other words, the information entered by the user does not contain complete information in any single modality; using only a single modality as the query information for retrieval cannot accurately retrieve the knowledge the user wants to know.

[0075] Multimodal information retrieval technology can receive data in different modalities input by the user. For example, when using multimodal information retrieval, a user can input an image as shown in Figure 1, along with text descriptions such as "the characteristics of this tree's growing environment and the regions where this tree is distributed." Multimodal information retrieval technology performs information retrieval based on user-inputted text, images, and other modal data.

[0076] Currently, existing multimodal information retrieval methods mainly involve converting user-input multimodal data into vectors (embeddings). For example, images and text are converted into vectors, and then the content with the highest similarity to the vector is retrieved from a pre-configured knowledge base as the retrieval result. Specifically, each modality's data can be converted into its corresponding modality vector (after conversion, an additional multimodal fusion step is needed to align and fuse the vectors from different modalities for subsequent similarity calculations), or the data from multiple modalities can be converted into a unified vector representation (unified representation requires complex strategies to capture and fuse information from different modalities, typically resulting in poor accuracy and high complexity).
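The existing vector-based approach described above can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and a knowledge base held as a list of (content, vector) pairs; the disclosure only says "highest similarity", so both choices, the function names, and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_most_similar(
    query_vector: Sequence[float],
    knowledge_base: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return the (content, score) entry whose vector is closest to the query."""
    if not knowledge_base:
        raise ValueError("knowledge base is empty")
    scored = [
        (content, cosine_similarity(query_vector, vector))
        for content, vector in knowledge_base
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy, hand-written vectors (illustrative only, not real embeddings).
toy_knowledge_base = [
    ("coconut tree habitat", [0.9, 0.1, 0.0]),
    ("pine tree habitat", [0.1, 0.9, 0.0]),
]
best_content, best_score = retrieve_most_similar([0.8, 0.2, 0.0], toy_knowledge_base)
```

The sketch makes the limitation concrete: the function can only ever return an entry that was placed in `toy_knowledge_base` beforehand.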

[0077] However, the above methods can only perform offline mining and require a pre-configured knowledge base (i.e., content base). Retrieval is only possible against an existing knowledge base, so if the knowledge base is not rich enough, the accuracy of the retrieval results will be poor. Furthermore, building a sufficiently rich knowledge base is costly: in open-domain question-answering scenarios, for example, the number of knowledge items (web pages) can reach trillions, making knowledge mining, storage, and maintenance extremely complex.

[0078] Considering the aforementioned problems with existing information retrieval methods, this disclosure proposes a method that can rewrite multimodal data from user input into single-modal data for retrieval. This method converts multimodal information retrieval into single-modal information retrieval, reducing the reliance on a pre-set knowledge base and minimizing the impact of insufficient knowledge base content on the accuracy of retrieval results. Therefore, while ensuring the implementation of multimodal retrieval, it also improves the accuracy of multimodal retrieval.

[0079] Optionally, the executing entity of the information retrieval method provided in this disclosure can be any electronic device with processing capabilities, such as a server or terminal device. Alternatively, the executing entity of the information retrieval method can also be a cloud platform or cloud server, etc., and this disclosure does not limit it in this regard.

[0080] In some embodiments, the entity executing the information retrieval method is related to the application scenario of the information retrieval method. It should be understood that this disclosure does not limit the application scenario of the information retrieval method.

[0081] For example, the following is a detailed description of the technical solution of this disclosure, using the application of this information retrieval method to a dialogue system as an example, in conjunction with specific embodiments. In some embodiments, this information retrieval method can also be applied to search engines, intelligent assistants, or other open-domain question-and-answer scenarios.

[0082] The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0083] Figure 2 is a flowchart illustrating an information retrieval method provided in this disclosure. As shown in Figure 2, the method may include the following steps:

[0084] S101. Receive user input information. This user input information includes data in multiple modalities.

[0085] Optionally, the multimodal data can be data in at least two modalities. That is, the user can, for example, input data in two modalities, or input data in three or more modalities.

[0086] It should be understood that this disclosure does not limit which modalities are referred to above. For example, the data in the multiple modalities may include data in at least two of the following modalities: text, image, and audio.

[0087] This dialogue system can be implemented via a user terminal (such as a mobile phone, computer, or tablet), responding to the user's search commands and receiving the aforementioned user input information. For example, the dialogue system can receive user input information through an Application Programming Interface (API) or a Graphical User Interface (GUI).

[0088] In some embodiments, the user input information may include only one modality of data, such as only text modality data, or only image modality data. Optionally, when the user input information includes only one modality of data, the method for information retrieval based on the user input information can refer to any existing information retrieval method, which will not be elaborated here.

[0089] S102. Perform data modality rewriting on data of multiple modalities to obtain the target text.

[0090] The target text can be used to describe the content represented by the data in these multiple modalities.

[0091] For example, suppose the multi-modal data includes the image-modality data shown in Figure 1 and the text-modality data "What are the characteristics of this tree's growing environment, and in which areas is this tree distributed?" The target text could then be, for example: "What are the characteristics of the growing environment of coconut trees, and in which areas are coconut trees distributed?" This target text describes both the coconut tree shown in the image-modality data of Figure 1 and the question posed in the text-modality data.

[0092] Optionally, the dialogue system can, for example, input the data from the aforementioned multiple modalities into a multimodal data rewriting model to obtain the target text through the multimodal data rewriting model.

[0093] Alternatively, the dialogue system could first determine, for example, whether text-based data exists within the aforementioned multi-modal data. Then, if text-based data is present, the dialogue system could input data from other modalities (e.g., image modalities) into a data modality rewriting model to obtain the first text corresponding to those other modalities. The dialogue system could then use this first text along with the text-based data from the multi-modal data as the target text.
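The alternative in [0093] can be sketched as follows: only the non-text modalities are passed to a rewriting model, and the resulting first text is used together with the user's own text. The function name `build_target_text`, the (modality, content) tuple representation, joining the parts with a space, and the behavior when no text is present are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Item = Tuple[str, str]  # (modality, content); content is a placeholder string


def build_target_text(
    user_input: List[Item],
    rewrite_non_text: Callable[[List[Item]], str],
) -> str:
    """Rewrite non-text modalities to a first text, then add the user's text."""
    text_parts = [content for modality, content in user_input if modality == "text"]
    other_items = [(m, c) for m, c in user_input if m != "text"]
    # The first text describes the content of the non-text modalities.
    first_text = rewrite_non_text(other_items) if other_items else ""
    # Assumption: simple concatenation; with no user text, the first text
    # alone serves as the target text.
    parts = [part for part in [first_text] + text_parts if part]
    return " ".join(parts)
```

For instance, with a stub rewriter that returns "The image shows a coconut tree." and the user text "Where is this tree distributed?", the function returns the two sentences joined by a space.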

[0094] S103. Perform a search based on the target text to obtain the target search results.

[0095] For example, the dialogue system can invoke a search engine to perform a retrieval based on the target text and obtain search results. Optionally, the dialogue system can directly use the search results as the target retrieval result. Alternatively, the dialogue system can obtain the target retrieval result based on the search results and a preset knowledge base. As a further alternative, the dialogue system can generate the target retrieval result from the search results using a preset language model.

[0096] Alternatively, taking the application of this information retrieval method to a search engine as an example, the search engine can use the target text to perform a retrieval, and the resulting search results can be used as the target retrieval results.

[0097] In some embodiments, after obtaining the target search result, the dialogue system may output it through a user terminal, for example by displaying the target search result or by broadcasting it via voice.

[0098] Alternatively, in some embodiments, the dialogue system may, for example, generate and output a response based on the target search result after obtaining the target search result.

[0099] In this embodiment, when user input includes data in multiple modalities, the data can be rewritten to obtain the target text, thus converting the multi-modal data into a single text modality. Then, the target text in this single text modality can be used for retrieval to obtain the target search results. By using the target text for retrieval, the multi-modal retrieval process is transformed into a single text modality retrieval, thus avoiding the need to convert all multi-modal input data into vectors for retrieval. This improves the flexibility of retrieval based on multi-modal data and reduces reliance on a pre-defined knowledge base during information retrieval based on multi-modal data. Consequently, it reduces the impact of insufficient knowledge base content on the accuracy of search results. Therefore, while ensuring the implementation of information retrieval based on multi-modal data, it also improves the accuracy of information retrieval.
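Steps S101 to S103 can be summarized in a short sketch. The rewriting model and the search engine are represented by plain callables; `ModalData`, `retrieve_with_rewriting`, `fake_rewrite`, and `fake_search` are hypothetical names, and the stand-in functions return fixed strings rather than calling any real model or search service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModalData:
    modality: str  # e.g. "text", "image", "audio"
    content: str   # raw payload; a file name stands in for image bytes here


def retrieve_with_rewriting(
    user_input: List[ModalData],
    rewrite: Callable[[List[ModalData]], str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """S101-S103: take multimodal input, rewrite it to text, then search."""
    # S101: the user input must contain data in at least two modalities.
    if len({item.modality for item in user_input}) < 2:
        raise ValueError("expected data in at least two modalities")
    # S102: data modality rewriting yields a single text query.
    target_text = rewrite(user_input)
    # S103: retrieval is now an ordinary text search.
    return search(target_text)


# Stand-ins for the rewriting model and the search engine (hypothetical).
def fake_rewrite(items: List[ModalData]) -> str:
    return "What are the characteristics of the growing environment of coconut trees?"


def fake_search(query: str) -> List[str]:
    return [f"result for: {query}"]


results = retrieve_with_rewriting(
    [
        ModalData("image", "figure1.jpg"),
        ModalData("text", "What are the characteristics of this tree's growing environment?"),
    ],
    fake_rewrite,
    fake_search,
)
```

Because the search step receives only text, any text search backend can be substituted without touching the rewriting step.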

[0100] The following section provides a detailed explanation of how this dialogue system performs data modality rewriting on data from multiple modalities to obtain the target text:

[0101] As one possible implementation, the dialogue system could, for example, input the multimodal data into a multimodal data rewriting model to obtain the target text mentioned above.

[0102] For example, the multimodal data rewriting model can be pre-stored in the dialogue system.

[0103] In some embodiments, the multimodal data rewriting model can be obtained by training a preset model using a sample dataset.

[0104] The following is an example illustrating the training method for multimodal data rewriting models:

[0105] Figure 3 is a flowchart illustrating a training method for a multimodal data rewriting model provided in this disclosure. As shown in Figure 3, this disclosure also provides a training method for a multimodal data rewriting model. It should be understood that the execution entity of this multimodal data rewriting model training method can be any electronic device with processing capabilities, such as a server or a terminal device. Optionally, the electronic device used to execute the multimodal data rewriting model training method and the electronic device used to execute the information retrieval method can be the same electronic device or different electronic devices; this disclosure does not limit this.

[0106] As shown in Figure 3, the training method for this multimodal data rewriting model may include the following steps:

[0107] S201. Obtain the sample dataset. This sample dataset may include: samples and sample labels. The samples may include: sample data from multiple modalities. The sample labels may include: sample text.

[0108] The above sample text can be used to describe the content represented by the sample data of the multiple modalities included in the sample corresponding to the sample label.

[0109] In some embodiments, a sample can also consist of sample data of a single modality, so that a multimodal data rewriting model trained on the sample dataset can also rewrite data of a single modality into target text.

[0110] For example, where the sample dataset includes multiple samples, it also includes a sample label for each sample. Optionally, the number of modalities included in different samples can be the same or different.

[0111] For example, among the multiple samples included in this sample dataset, some samples may contain sample data of only one modality, while others may contain sample data of multiple modalities. For instance, sample 1 may contain sample data of one modality; sample 2 may contain sample data of multiple modalities.

[0112] For samples containing multiple modalities, the modalities of the sample data included in different samples may be the same or different, and this disclosure does not limit this. For example, sample 3 may include sample data of image modality and sample data of text modality; sample 4 may include sample data of text modality and sample data of audio modality.

[0113] Furthermore, for sample data of any given modality, the amount of data of that modality may be the same or different across samples, and this disclosure does not impose any limitation on this. For example, assuming that samples 5 and 6 both include image modality sample data, the number of images included in sample 5 may be the same as or different from the number of images included in sample 6. Similarly, assuming that samples 7 and 8 both include text modality sample data, the length of the text included in sample 7 may be the same as or different from the length of the text included in sample 8.

[0114] For example, the data format of the sample and its corresponding sample label can be as follows:

[0115] {"input":[image1, image2, text1, ...],"output":rewritten result}.

[0116] The field following input represents the sample, and the field following output is the sample label corresponding to that sample.
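As an illustration of the input/output format above, the following sketch builds one sample record and serializes it as JSON. How images are referenced inside the `input` list is not specified in the disclosure; the file path used here, the helper name `make_sample_record`, and the example strings are assumptions.

```python
import json
from typing import Dict, List


def make_sample_record(inputs: List[str], rewritten_text: str) -> Dict[str, object]:
    """Pair a sample (list of modality items) with its sample label (text)."""
    if not inputs:
        raise ValueError("a sample needs at least one input item")
    return {"input": list(inputs), "output": rewritten_text}


# Hypothetical sample: one image reference plus the user's question.
record = make_sample_record(
    ["images/coconut_tree.jpg", "Where is this tree distributed?"],
    "Where are coconut trees distributed?",
)

# One JSON object per sample keeps the dataset easy to stream line by line.
line = json.dumps(record, ensure_ascii=False)
```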

[0117] Optionally, the electronic device may obtain the aforementioned sample dataset input by the user, for example, through an API or a GUI. Alternatively, the sample dataset may be pre-stored in the electronic device. That is, the electronic device can obtain the sample dataset from its own stored data.

[0118] S202. Using the sample dataset, train the preset model to obtain the trained multimodal data rewriting model.

[0119] The trained multimodal data rewriting model can be used to convert data of at least one modality into text modality data. That is, the data input to the multimodal data rewriting model can be unimodal or multimodal, and this disclosure does not limit it.

[0120] For example, the aforementioned preset model can be a pre-trained network model, such as a pre-trained model that has been pre-trained on a large amount of mixed text and image data and has a certain ability to understand language and images.

[0121] For example, during any training round, the preset model can, in the forward propagation process, transform the sample data of the different modalities into feature representations (tokens), i.e., perform feature extraction. The preset model can then fuse the tokens from the sample data of the different modalities, for example through a cross-attention mechanism or another fusion method, to achieve multimodal information fusion and obtain fused features. Based on the fused features, the preset model can then perform text generation and output predicted text, thereby producing a prediction result according to the task requirements.

[0122] The loss can then be calculated based on the predicted text and the sample label. For example, the loss function used for training the preset model could be a Causal Language Modeling (CLM) loss function or similar. During backpropagation, the electronic device can calculate the gradient based on the loss value.
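A causal language modeling loss is commonly computed as the mean negative log-likelihood of each label token given the model's scores at that position. The sketch below shows this in plain Python; it assumes `step_logits[t]` is the score vector used to predict `label_token_ids[t]`, and both names are hypothetical rather than taken from the disclosure.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one score vector."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def causal_lm_loss(
    step_logits: Sequence[Sequence[float]],
    label_token_ids: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the label tokens.

    step_logits[t] holds the scores over the vocabulary at position t,
    produced from the inputs and the label tokens before position t.
    """
    if len(step_logits) != len(label_token_ids) or not label_token_ids:
        raise ValueError("need one score vector per label token")
    total = 0.0
    for logits, token_id in zip(step_logits, label_token_ids):
        total -= log_softmax(logits)[token_id]
    return total / len(label_token_ids)
```

With uniform scores over a vocabulary of size V the loss equals log V, which gives a convenient sanity check for an untrained model.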

[0123] The loss value can be calculated using the loss function described above. Based on this loss value, the gradient of the parameters of the preset model can be calculated. Furthermore, gradient clipping can be performed to prevent gradient explosion. During weight updates, optimization algorithms such as AdamW (Adam with Weight Decay) or Stochastic Gradient Descent (SGD) can be used to update the model parameters. In some embodiments, learning rate adjustment can also be performed. By dynamically adjusting the learning rate, the training process can be optimized to improve the accuracy of the preset model training, thereby improving the accuracy of the multimodal data rewriting model.
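The update step described above can be sketched as follows. This is a simplified illustration on plain Python lists: it uses clipping by global norm, a plain SGD update (one of the two optimizers named above; AdamW is not shown), and a linear decay schedule as one possible form of learning rate adjustment. All function names and the choice of schedule are assumptions made for illustration.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


def linear_decay_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Learning rate that decays linearly from base_lr to zero (assumed schedule)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# One illustrative update: clip, pick the learning rate, then apply the step.
params = [1.0, -2.0]
grads = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
params = sgd_step(params, grads, linear_decay_lr(0.1, step=5, total_steps=10))
```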

[0124] In this embodiment, a multimodal data rewriting model is obtained by training a preset model using sample data including multiple modalities and sample labels including sample text describing the sample data of these multiple modalities. Through training with the aforementioned sample dataset, the multimodal data rewriting model acquires the ability to rewrite data of multiple modalities into text modal data. Therefore, the information retrieval method provided in this disclosure can obtain the aforementioned target text by inputting data of multiple modalities into the multimodal data rewriting model, laying the foundation for subsequent target retrieval results based on the target text. By rewriting data of multiple modalities into target text through the multimodal data rewriting model, the accuracy of the rewritten target text is ensured, thereby improving the accuracy of subsequent retrieval based on the target text.

[0125] In some embodiments, the aforementioned preset model may include, for example, a feature encoding module and a text generation sub-model to be trained. Training the preset model using the aforementioned sample dataset can be performed by training the text generation sub-model to be trained. Accordingly, by training the aforementioned text generation sub-model to be trained, a trained text generation module can be obtained, which is used to generate target text during information retrieval.

[0126] Optionally, during the execution of step S202 above, the electronic device may, for example, first use the feature encoding module to encode the sample to obtain the encoded features of the sample data of each modality included in the sample.

[0127] For example, the feature encoding module may include multiple encoders. Different encoders are used to encode data of different modalities to obtain the encoded features corresponding to the data of each modality.

[0128] For example, taking the feature encoding module as including an image feature encoder and a text feature encoder, the image feature encoder can perform feature encoding on the sample data of the image modality included in the sample to obtain the encoded features of the sample data of the image modality (i.e., image encoded features). The text feature encoder can perform feature encoding on the sample data of the text modality included in the sample to obtain the encoded features of the sample data of the text modality (i.e., text encoded features).

[0129] Then, the electronic device can train the text generation sub-model to be trained using the encoding features and sample labels of the sample data of each of the above modalities, and obtain the trained text generation module.

[0130] Optionally, during any training round of the text generation sub-model to be trained, the encoded features of the sample data for each modality can be used as input to the text generation sub-model. The text generation sub-model can then predict the output based on the encoded features of the sample data for each modality, i.e., output predicted text. The electronic device can then train the text generation sub-model based on the predicted text and sample labels.

[0131] Optionally, for the manner in which the text generation sub-model to be trained is trained using the encoded features of the sample data of each modality and the sample labels, reference can be made to the method of the above embodiments, which will not be repeated here.

[0132] The above method can be used to obtain a trained text generation module, which can then generate text modal data based on the encoding features of data from multiple modalities.

[0133] Accordingly, in some embodiments, the multimodal data rewriting model may include the aforementioned feature encoding module and text generation module. When the dialogue system executes the information retrieval method, it can, for example, use the aforementioned feature encoding module to encode the features of the multimodal data to obtain the encoded features of each modality. Then, the dialogue system can input the encoded features of each modality into the text generation module to obtain the aforementioned target text.

[0134] Optionally, still taking the example of the feature encoding module including multiple encoders, the dialogue system can first determine which modalities the data in the user input information belongs to. Then, based on the modality of each piece of data, the dialogue system can input the data into the encoder of the corresponding modality, so as to obtain the encoded features of the data of each modality.
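The per-modality dispatch described above can be sketched as a lookup from modality name to encoder. The sketch assumes that each piece of input data already carries a modality tag and that every encoder is a callable returning a feature vector. The names `encode_by_modality` and `Encoder`, and the toy encoders in the usage lines, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# An encoder maps one piece of data to a feature vector (assumed interface).
Encoder = Callable[[object], List[float]]


def encode_by_modality(
    items: List[Tuple[str, object]],
    encoders: Dict[str, Encoder],
) -> Dict[str, List[List[float]]]:
    """Send each (modality, payload) pair to the encoder registered for its modality."""
    encoded: Dict[str, List[List[float]]] = {}
    for modality, payload in items:
        if modality not in encoders:
            raise KeyError(f"no encoder registered for modality {modality!r}")
        encoded.setdefault(modality, []).append(encoders[modality](payload))
    return encoded


# Toy stand-ins for real encoders: they only report the size of the payload.
toy_encoders: Dict[str, Encoder] = {
    "text": lambda s: [float(len(s))],
    "image": lambda b: [float(len(b)), 1.0],
}
features = encode_by_modality([("text", "hi"), ("image", b"\x00\x01\x02")], toy_encoders)
```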

[0135] In this embodiment, the feature encoding module in the multimodal data rewriting model can acquire the encoded features of data of multiple modalities. The text generation module can then generate target text based on these encoded features, thus rewriting the multimodal data into text modality data, i.e., the target text. The feature encoding module extracts features from the data of each modality, ensuring that the target text is generated based on the features of the data of every modality. This improves the accuracy with which the target text describes the content represented by the data of multiple modalities, thereby further improving the accuracy of the target retrieval result obtained based on the target text.

[0136] The following section provides a detailed explanation of how this dialogue system performs retrieval based on the target text and obtains the target retrieval results:

[0137] As one possible implementation, the dialogue system could, for example, use a target search engine to perform a search based on the target text and obtain the target search results.

[0138] Optionally, the target search engine mentioned above can be, for example, the default search engine of the dialogue system. It should be understood that this disclosure does not limit the selection of the target search engine, and the target search engine can be any existing search engine.

[0139] Optionally, the dialogue system can invoke the target search engine by calling the command of the search engine, and use the target text as the search term (or sentence) to search using the target search engine to obtain the target search results.

[0140] Alternatively, the dialogue system can extract keywords from the target text to obtain its keywords. Then, the dialogue system can use these keywords as search terms to perform a search using the target search engine, thus obtaining the desired search results.

[0141] Alternatively, before extracting keywords from the target text, the dialogue system could first determine the byte length of the target text. If the byte length of the target text is greater than or equal to a preset length, the dialogue system can extract keywords from the target text, obtain the keywords, and use these keywords as search terms to search using the target search engine, thus obtaining the target search results. If the byte length of the target text is less than the preset length, the dialogue system can, for example, use the target text as search terms to search using the target search engine, thus obtaining the target search results.
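The length check described above can be sketched as follows. The sketch assumes that the byte length is measured on the UTF-8 encoding of the target text and uses a toy stop-word filter as a stand-in for keyword extraction. The disclosure does not specify the encoding, the extractor, or the preset length, so `toy_keywords`, `STOPWORDS`, and the value 20 used below are illustrative assumptions.

```python
from typing import Callable, List

# Toy stop-word list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "what"}


def toy_keywords(text: str) -> List[str]:
    """Hypothetical extractor: keep non-stop-word tokens, first occurrence only."""
    seen, keywords = set(), []
    for word in text.lower().split():
        word = word.strip(".,?!")
        if word and word not in STOPWORDS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


def build_search_terms(
    target_text: str,
    preset_length: int,
    extract: Callable[[str], List[str]] = toy_keywords,
) -> str:
    """Use keywords for long target text and the text itself for short target text."""
    byte_length = len(target_text.encode("utf-8"))  # assumed encoding
    if byte_length >= preset_length:
        return " ".join(extract(target_text))
    return target_text


long_query = build_search_terms("What is the name of the red bird in the photo?", 20)
short_query = build_search_terms("red bird", 20)
```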

[0142] In some embodiments, the content obtained by searching through the target search engine, i.e., the target retrieval result, can also be referred to as external knowledge relative to the preset knowledge base, or as background knowledge (i.e., the background knowledge required to obtain the target retrieval result).

[0143] Taking the above information retrieval method as an example in a dialogue scenario with a user, in some embodiments, the dialogue system may first obtain target prompt information based on the target retrieval result and the user input information, and then input the target prompt information into a preset answer generation model to obtain the answer content corresponding to the user input information. Then, the dialogue system can output the answer content.

[0144] Optionally, the dialogue system can, for example, concatenate the target retrieval results with the user input information to obtain the target prompt information. In some embodiments, compared to this newly generated target prompt information, the user input information can also be referred to as the user's initial question.
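The concatenation described above can be sketched as simple string assembly. The section labels, the numbering of the results, and the function name `build_target_prompt` are assumptions made for illustration. The sketch covers only the text part of the user input information; how non-text modalities are passed to the answer generation model is not shown.

```python
from typing import List


def build_target_prompt(retrieval_results: List[str], user_question: str) -> str:
    """Concatenate the retrieval results with the user's initial question."""
    lines = ["Background knowledge:"]
    # Number each retrieved passage so the answer model can tell them apart.
    lines += [f"[{i}] {result}" for i, result in enumerate(retrieval_results, start=1)]
    lines += ["", "User question:", user_question]
    return "\n".join(lines)


prompt = build_target_prompt(
    ["Northern cardinals are red songbirds."],
    "What bird is in this photo?",
)
```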

[0145] Optionally, the above-described answer generation model can be used to generate an answer corresponding to the input prompt information. Therefore, through this answer generation model, the answer content can be obtained based on the target prompt information. Optionally, this answer generation model can refer to any existing model capable of answering user questions, which will not be elaborated upon here. In some embodiments, the answer generation model can be a large model.

[0146] Using the above method, target prompt information is obtained based on the target retrieval results and user input information. This target prompt information includes the target retrieval results obtained from the target search engine, thus enriching the content of the target prompt information. This allows the answer generation model to generate answers based on the target prompt information that includes the target retrieval results. Since the target retrieval results obtained through the search engine can be open-domain knowledge, obtaining answer content through the target retrieval results increases the richness of the background knowledge referenced in generating the answer content, reduces the possibility of inaccurate answer content due to insufficient background knowledge, and thus improves the accuracy of the answer content and enhances the user experience.

[0147] In some embodiments, the dialogue system may add the target retrieval results to a pre-defined knowledge base to enrich its content and improve its overall richness. Then, the dialogue system may convert the target text into a vector and match this vector with the data in the enriched pre-defined knowledge base to obtain the content with the highest matching degree, which serves as the response. Alternatively, the dialogue system may convert all modalities of the user input information into vectors and match these vectors with the data in the enriched pre-defined knowledge base to obtain the content with the highest matching degree, which serves as the response.
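The matching step described above can be sketched with cosine similarity as the matching degree. The disclosure does not name a specific similarity measure or storage format, so the use of cosine similarity, the dictionary-based knowledge base, and the function names are assumptions.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec: List[float], knowledge_base: Dict[str, List[float]]) -> str:
    """Return the knowledge-base entry whose vector best matches the query vector."""
    if not knowledge_base:
        raise ValueError("knowledge base is empty")
    return max(
        knowledge_base,
        key=lambda entry: cosine_similarity(query_vec, knowledge_base[entry]),
    )


# Toy knowledge base: entry text mapped to a two-dimensional vector.
kb = {"entry about birds": [1.0, 0.0], "entry about cars": [0.0, 1.0]}
response = best_match([0.9, 0.1], kb)
```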

[0148] In some embodiments, the dialogue system may also use search results obtained by searching the target text through a target search engine as the answer content. This method improves the efficiency of answer content determination, that is, it improves the response efficiency of outputting answer content to user dialogues, thereby enhancing the user experience.

[0149] In some embodiments, before performing step S102 above, which rewrites data from multiple modalities to obtain the target text, the dialogue system may, for example, first determine whether it is necessary to perform data modal rewriting to obtain the target retrieval result, and when it is determined that data modal rewriting is necessary to obtain the target retrieval result, perform step S102 above.

[0150] For example, as a possible implementation, the dialogue system can obtain the user's retrieval mode requirement before performing data modality rewriting on the data of multiple modalities to obtain the target text.

[0151] When the retrieval mode requirement is used to represent the deep retrieval mode, the dialogue system can perform data modality rewriting of data from multiple modalities to obtain the target text and subsequent steps.

[0152] It should be understood that the aforementioned deep retrieval mode may refer to a retrieval mode for users who need retrieval results that are highly accurate, or rich and comprehensive.

[0153] With the above method, when the user's retrieval mode requirement characterizes the deep retrieval mode, the data of multiple modalities is rewritten into target text, and the target retrieval result is then obtained based on this target text. Because the target retrieval result obtained from the rewritten target text is highly accurate, the user's deep retrieval needs can be met, thus improving the user experience.

[0154] In some embodiments, the dialogue system may receive a target instruction triggered by a user to indicate a retrieval mode, and determine the retrieval mode requirement based on the target instruction.

[0155] For example, the dialogue system may display a retrieval mode selection control on the search interface. The user can use this control to select a retrieval mode, thereby triggering the aforementioned target instruction. Optionally, the target instruction may include an identifier of the retrieval mode selected by the user.

[0156] Optionally, the dialogue system can parse the target instruction to obtain the user's retrieval mode requirement.

[0157] In some embodiments, taking the application of this information retrieval method to an intelligent voice assistant as an example, the intelligent voice assistant can, for instance, use the voice acquisition device of a dialogue system to acquire the user's target voice indicating the retrieval mode. The intelligent voice assistant can then parse the target voice to obtain the target instruction for indicating the retrieval mode.

[0158] In this embodiment, the user can trigger a target instruction indicating a retrieval mode, enabling the dialogue system to obtain the user's retrieval mode requirement. In this way, information retrieval can be performed according to the user's retrieval mode requirement, which improves the flexibility of information retrieval and enhances the user experience.

[0159] Alternatively, in some embodiments, the dialogue system may input the aforementioned user input information into a retrieval mode determination model to obtain the retrieval mode requirement.

[0160] For example, the retrieval mode determination model can be pre-stored in the dialogue system.

[0161] Optionally, the retrieval mode determination model can be used, for example, to determine, based on the data input to it, the retrieval mode required when performing retrieval on that data. In some embodiments, the retrieval mode determination model can be obtained by training a neural network model based on sample search text and the retrieval mode requirements corresponding to the sample search text. It should be understood that this disclosure does not limit how the neural network model is trained to obtain the retrieval mode determination model.

[0162] With the above method, the user's retrieval mode requirement is obtained from the user input information, so the requirement is tied to the question the user raises. This allows the information retrieval method to use different retrieval modes for different questions, thus improving the flexibility of the information retrieval method.

[0163] In some embodiments, before obtaining the user's retrieval mode requirement, the dialogue system may, for example, obtain the user's historical retrieval mode usage habits. The dialogue system may then determine the user's retrieval mode requirement based on these habits.

[0164] For example, after each information retrieval by a user, the dialogue system can add the identifier of the retrieval mode used in that retrieval to a mapping relationship between the username and the retrieval mode, and store this mapping relationship. The dialogue system can then, for example, obtain the user's historical retrieval mode usage habits based on the user's username and this mapping relationship.

[0165] For example, the dialogue system can take the retrieval mode used most frequently in the user's history as the user's retrieval mode requirement.
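The history-based selection described above can be sketched with an in-memory mapping from username to the identifiers of the modes that user has used. The class name, the mode identifiers `"deep"` and `"non_deep"`, and the fallback for users without history are illustrative assumptions. A real system would persist the mapping rather than keep it in memory.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List


class RetrievalModeHistory:
    """Hypothetical store of the modes each user has used, keyed by username."""

    def __init__(self) -> None:
        self._modes_by_user: DefaultDict[str, List[str]] = defaultdict(list)

    def record(self, username: str, mode_id: str) -> None:
        """Append the identifier of the mode used in one retrieval."""
        self._modes_by_user[username].append(mode_id)

    def preferred_mode(self, username: str, default: str = "non_deep") -> str:
        """Return the user's most frequently used mode, or the default if none is known."""
        modes = self._modes_by_user.get(username)
        if not modes:
            return default
        return Counter(modes).most_common(1)[0][0]


history = RetrievalModeHistory()
for mode in ("deep", "non_deep", "deep"):
    history.record("alice", mode)
```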

[0166] In some embodiments, if the retrieval mode requirement characterizes a non-deep retrieval mode, indicating that the user does not need deep retrieval for the user input information, the dialogue system may, in response, input the aforementioned user input information into the aforementioned answer generation model to obtain the answer content corresponding to the user input information. The dialogue system can then output the answer content.

[0167] Using the above method, the dialogue system does not need to convert user input information into target text before retrieval, thus improving the efficiency and flexibility of the information retrieval method in outputting response content.
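The branch between the two modes can be sketched as a small dispatcher. The mode identifiers and the two injected callables are placeholders: `deep_pipeline` stands for the rewrite, search, and generate path, and `answer_model` stands for the answer generation model called directly with the user input as the prompt.

```python
from typing import Callable

# Hypothetical mode identifiers.
DEEP = "deep"
NON_DEEP = "non_deep"


def route_request(
    user_input: str,
    mode: str,
    deep_pipeline: Callable[[str], str],
    answer_model: Callable[[str], str],
) -> str:
    """Choose the processing path from the retrieval mode requirement."""
    if mode == DEEP:
        # Deep retrieval: rewriting and search happen inside the injected pipeline.
        return deep_pipeline(user_input)
    # Non-deep retrieval: skip rewriting and search, and prompt the model directly.
    return answer_model(user_input)


direct = route_request("hello", NON_DEEP, lambda s: "deep:" + s, lambda s: "direct:" + s)
```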

[0168] Optionally, the dialogue system can, for example, input the user input information as a prompt into the answer generation model to obtain the aforementioned answer content. Alternatively, in some embodiments, the implementation of the dialogue system obtaining the answer content based on the aforementioned user input information can refer to any existing method for dialogue based on user input information, which will not be elaborated here.

[0169] Taking the user input information as including at least two modalities of images, text, and audio as an example, Figure 4 is a flowchart illustrating a dialogue method provided in this disclosure. The executing entity of this method can be a dialogue system as described in any of the foregoing embodiments. As shown in Figure 4, the method may include the following steps:

[0170] S301. Receive user input information. The user input information may include data in at least two modalities: image, text, and audio.

[0171] For example, the dialogue system can receive the aforementioned user input information via an API or GUI.

[0172] It should be understood that the data of the above-described image modality may, for example, include at least one image. In some embodiments, the data of the image modality may also be video data, which may include multiple frames of images.

[0173] The data for the aforementioned audio modality may include, for example, at least one audio segment. This disclosure does not limit the content of the audio. For example, the audio content may be user voice or other sounds, such as bird calls.

[0174] S302. Input data from at least two modalities into a multimodal data rewriting model to obtain target text. This target text can be used to describe the content represented by the data from the at least two modalities.

[0175] Optionally, for the implementation in which the dialogue system obtains the target text through the multimodal data rewriting model, reference can be made to the method in the foregoing embodiments, which will not be repeated here.

[0176] S303. Using the target search engine, perform a search based on the target text to obtain the target search results.

[0177] For example, the dialogue system can call the target search engine, use the target text as a search term (or sentence) to search using the target search engine, and obtain the target search results.

[0178] S304. Determine the answer content based on the target search results and user input information.

[0179] For example, the dialogue system can concatenate the target retrieval result with the user input information to obtain target prompt information. Then, the dialogue system can input the target prompt information into the answer generation model as described above to obtain the answer content.

[0180] S305. Output the answer.

[0181] Optionally, for the manner in which the dialogue system outputs the answer content, reference can be made to the method for outputting the target retrieval result in the foregoing embodiments, which will not be described in detail here.
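Steps S301 to S305 can be sketched end to end with the three models injected as callables. The class name `DialoguePipeline`, the dictionary representation of the user input information, and the stub components in the usage lines are assumptions made for illustration. For brevity, the sketch concatenates only the text part of the user input information with the search results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Modality name mapped to payload, e.g. {"text": "...", "image": b"..."} (assumed shape).
UserInput = Dict[str, object]


@dataclass
class DialoguePipeline:
    rewrite: Callable[[UserInput], str]   # multimodal data rewriting model (S302)
    search: Callable[[str], List[str]]    # target search engine (S303)
    generate: Callable[[str], str]        # answer generation model (S304)

    def run(self, user_input: UserInput) -> str:
        # S301: the received input must contain data in at least two modalities.
        if len(user_input) < 2:
            raise ValueError("expected data in at least two modalities")
        target_text = self.rewrite(user_input)      # S302
        results = self.search(target_text)          # S303
        question = str(user_input.get("text", ""))
        prompt = "\n".join(results + [question])    # S304: concatenate into a prompt
        return self.generate(prompt)                # the caller outputs this answer (S305)


# Stub components standing in for the real models and search engine.
pipeline = DialoguePipeline(
    rewrite=lambda u: "red bird species",
    search=lambda q: [f"result for {q}"],
    generate=lambda p: p.upper(),
)
answer = pipeline.run({"text": "what is this?", "image": b"\x89PNG"})
```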

[0182] In this embodiment, the dialogue system can receive data in at least two modalities, including images, text, and audio, and perform data modality rewriting on these multi-modal data to obtain target text. By rewriting the multi-modal data into target text, this target text can be used to perform retrieval through a target search engine to obtain background knowledge for obtaining the answer content. Then, based on the target retrieval results and user input information, the answer content is determined and output. By using the target text for retrieval, multi-modal retrieval does not rely on matching vectors with a preset knowledge base, and background knowledge can be obtained through a target search engine. Therefore, it reduces the dependence on a preset knowledge base during information retrieval based on multi-modal data and improves the richness of the knowledge used to determine the answer content. This reduces the impact of insufficient knowledge base content on the accuracy of the answer content, ensuring the realization of information retrieval based on multi-modal data while improving the accuracy of the output answer content.

[0183] Figure 5 is a flowchart illustrating another dialogue method provided in this disclosure. Taking the application of this dialogue method to a dialogue system as an example, as shown in Figure 5, the dialogue system can receive user input information, such as data in both text and image modalities. The dialogue system can then perform a search determination, that is, determine the user's retrieval pattern requirements. The way the dialogue system determines the user's retrieval pattern requirements can, for example, refer to the method in the aforementioned embodiments, and will not be repeated here.

[0184] If the determination is that a search is needed, i.e., the retrieval mode is deep retrieval, which applies when the user input contains multimodal information (text + image) and requires external knowledge to answer (usually determined by the search decision model), the dialogue system can input the entire user input into the trained multimodal data rewriting model. This model rewrites the input into text modality data suitable for a query, i.e., the aforementioned target text. Then, by invoking a search engine, the dialogue system can retrieve the background knowledge (or external knowledge) needed to answer the question. The dialogue system can then combine this external knowledge with the original user question to create a new prompt, which is input into a large multimodal model (such as the aforementioned answer generation model) to generate an answer and output the target retrieval result.

[0185] Optionally, Figure 6 is a schematic diagram of a multimodal data rewriting model training process provided in this disclosure. As shown in Figure 6, the preset model can consist of multiple sub-models, where the Vision Encoder (image encoder) is used to encode images into image feature representations (i.e., tokens). It should be understood that the text encoder is not shown in Figure 6. The LM Decoder can be the aforementioned text generation sub-model to be trained. The LM Decoder can iteratively predict the next token based on the input tokens (including image tokens and text tokens) until the complete rewritten text is generated. For example, this process can refer to existing model training methods, such as iterative prediction, i.e., using all current tokens to predict the probability distribution of the next token, selecting the token with the highest probability to add to the current token sequence, and continuing to generate the next token until a special token, for example <eos>, is generated, which indicates the end of the generation process. By training this preset model, a trained model can be obtained, which can then be used as a multimodal data rewriting model.
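The iterative prediction described above corresponds to greedy decoding, which can be sketched as follows. The decoder is represented by a hypothetical callable that maps the current token sequence to a probability for each candidate next token. The cap `max_new_tokens` is an added safeguard that is not mentioned in the description, and the lookup table in the usage lines is a toy stand-in for a trained LM Decoder.

```python
from typing import Callable, Dict, List, Sequence


def greedy_decode(
    next_token_probs: Callable[[Sequence[str]], Dict[str, float]],
    input_tokens: Sequence[str],
    eos_token: str = "<eos>",
    max_new_tokens: int = 32,
) -> List[str]:
    """Repeatedly append the most probable next token until <eos> is predicted."""
    tokens = list(input_tokens)
    generated: List[str] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)     # distribution given all current tokens
        best = max(probs, key=probs.get)     # token with the highest probability
        if best == eos_token:                # special token marks the end of generation
            break
        tokens.append(best)
        generated.append(best)
    return generated


# Toy decoder: the distribution depends only on the last token in the sequence.
table = {
    "<img>": {"red": 0.7, "blue": 0.3},
    "red": {"bird": 0.9, "<eos>": 0.1},
    "bird": {"<eos>": 0.8, "red": 0.2},
}
rewritten = greedy_decode(lambda toks: table[toks[-1]], ["<img>"])
```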

[0186] In this embodiment, the user-input data of multiple modalities is uniformly converted into a single text modality for the search query, so relevant knowledge is obtained more accurately. Compared with existing methods that use only a single-modality input for retrieval, the target text rewritten from multimodal information in this disclosure carries more complete semantic information. Compared with existing methods that convert multimodal data into vector embeddings, converting the data uniformly into text modality data makes it possible to call a mature search engine to obtain open-domain knowledge, which enriches the background knowledge and thereby improves the accuracy of the search results. In this way, the dialogue system responds better to user needs and the background knowledge available to downstream large models (such as the aforementioned answer generation model) is enriched, so more comprehensive and accurate answers can be generated, improving the user experience and meeting the user's deep retrieval needs.

[0187] Figure 7 is a schematic diagram of the structure of an information retrieval device provided in this disclosure. As shown in Figure 7, the information retrieval device may include: a receiving module 71, a rewriting module 72, and a processing module 73.

[0188] The receiving module 71 is used to receive user input information. The user input information includes data in multiple modalities.

[0189] The rewriting module 72 is used to rewrite data from multiple modalities to obtain target text. The target text describes the content represented by the data from multiple modalities.

[0190] Processing module 73 is used to perform retrieval based on the target text and obtain the target retrieval results.

[0191] Optionally, the rewriting module 72 is specifically used to input multimodal data into a multimodal data rewriting model to obtain target text. The multimodal data rewriting model is obtained by training a preset model using a sample dataset. The sample dataset includes samples and sample labels. The samples include sample data from multiple modalities, and the sample labels include sample text. The sample text describes the content represented by the sample data from multiple modalities.

[0192] Taking a multimodal data rewriting model including a feature encoding module and a text generation module as an example, optionally, the rewriting module 72 is specifically used to encode the features of multiple modalities of data through the feature encoding module to obtain the encoded features of each modality of data; and input the encoded features of each modality of data into the text generation module to obtain the target text.

[0193] Optionally, the processing module 73 is specifically used to perform a search based on the target text through the target search engine to obtain the target search results.

[0194] Taking the method applied to a dialogue system as an example, optionally, the processing module 73 can also be used to obtain target prompt information based on the target retrieval results and user input information; input the target prompt information into a preset answer generation model to obtain the answer content corresponding to the user input information. Optionally, the information retrieval device may also include an output module 74 for outputting the answer content.

[0195] Optionally, the rewriting module 72 can also be used to obtain the user's retrieval mode requirements before performing data modality rewriting on data of multiple modalities to obtain the target text; when the retrieval mode requirements are used to characterize the deep retrieval mode, the data of multiple modalities is rewritten to obtain the target text.

[0196] Optionally, the receiving module 71 can also be used to receive a target instruction triggered by the user to indicate the retrieval mode. Optionally, the rewriting module 72 can also be used to determine the retrieval mode requirement based on the target instruction. Alternatively, the rewriting module 72 can also be used to input the user input information into the retrieval mode determination model to obtain the retrieval mode requirement.

[0197] Optionally, the processing module 73 can also be used to, in response to the retrieval mode requirement characterizing a non-deep retrieval mode, input the user input information into the answer generation model to obtain the answer content corresponding to the user input information. Optionally, the output module 74 is used to output the answer content.

[0198] The information retrieval device provided in this disclosure is used to execute the aforementioned information retrieval method embodiments. Its implementation principle and technical effect are similar, and will not be described in detail here.

[0199] Figure 8 is a schematic diagram of the structure of a training device for a multimodal data rewriting model provided in this disclosure. As shown in Figure 8, the training device for the multimodal data rewriting model may include: an acquisition module 81 and a training module 82.

[0200] The acquisition module 81 is used to acquire the sample dataset. The sample dataset includes samples and sample labels. The samples include sample data of multiple modalities, and the sample labels include sample text. The sample text is used to describe the content represented by the sample data of multiple modalities.

[0201] The training module 82 is used to train a preset model using the sample dataset to obtain a trained multimodal data rewriting model. This multimodal data rewriting model is used to convert data of at least one modality into text modality data.

[0202] Taking a preset model including a feature encoding module and a text generation sub-model to be trained as an example, the training module 82 is specifically used to encode the features of the samples through the feature encoding module to obtain the encoded features of the sample data of each modality, and to train the text generation sub-model to be trained using the encoded features of the sample data of each modality and the sample labels to obtain the trained text generation module. The text generation module is used to obtain text modality data based on the encoded features of data of multiple modalities.

[0203] The training apparatus for the multimodal data rewriting model provided in this disclosure is used to execute the aforementioned training method embodiment for the multimodal data rewriting model. Its implementation principle and technical effect are similar, and will not be described in detail here.

[0204] Figure 9 is a schematic diagram of the structure of a dialogue device provided in this disclosure. As shown in Figure 9, the dialogue device may include: a receiving module 91, a processing module 92, and an output module 93.

[0205] The receiving module 91 is used to receive user input information. The user input information includes data in at least two modalities: image, text, and audio.

[0206] Processing module 92 is used to input data from at least two modalities into a multimodal data rewriting model to obtain target text; to perform a search based on the target text using a target search engine to obtain target search results; and to determine the answer content based on the target search results and user input information. The target text describes the content represented by the data from at least two modalities.

[0207] Output module 93 is used to output the answer content.

[0208] Figure 10 is a schematic diagram of the hardware structure of an electronic device provided in this disclosure. The electronic device 50 shown in Figure 10 includes a memory 51, a processor 52, and a communication interface 53, which are communicatively connected to one another, for example via a network. Alternatively, the electronic device 50 may further include a bus 54, through which the memory 51, the processor 52, and the communication interface 53 are communicatively connected to one another. Figure 10 shows the variant in which these components are connected via the bus 54.

[0209] The memory 51 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 51 may store programs, and when the programs stored in the memory 51 are executed by the processor 52, the processor 52 and the communication interface 53 are used to perform at least one of the information retrieval, multimodal data rewriting model training, and dialogue methods described in any of the foregoing embodiments. The memory may also store data required for at least one of the information retrieval, multimodal data rewriting model training, and dialogue methods.

[0210] The processor 52 can be a general-purpose CPU, microprocessor, application-specific integrated circuit (ASIC), graphics processing unit (GPU), or one or more integrated circuits.

[0211] Processor 52 can also be an integrated circuit chip with signal processing capabilities. In implementation, at least one of the information retrieval, multimodal data rewriting model training, and dialogue methods of this disclosure can be accomplished through integrated logic circuits in hardware or instructions in software form within processor 52. The aforementioned processor 52 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the embodiments of this disclosure. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this disclosure can be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well established in the art. The storage medium is located in memory 51. Processor 52 reads the information in memory 51 and, in conjunction with its hardware, completes at least one of the information retrieval, multimodal data rewriting model training, and dialogue methods disclosed herein.

[0212] Communication interface 53 uses a transceiver apparatus, such as, but not limited to, a transceiver, to enable communication between electronic device 50 and other devices or communication networks. For example, datasets can be acquired through communication interface 53.

[0213] When the aforementioned electronic device 50 includes a bus 54, the bus 54 may include a path for transmitting information between various components of the electronic device 50 (e.g., memory 51, processor 52, communication interface 53).

[0214] This disclosure also provides a computer-readable storage medium, which may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. Specifically, the computer-readable storage medium stores program instructions, and the program instructions are used to perform the methods described in the above embodiments.

[0215] This disclosure also provides a program product including executable instructions stored in a readable storage medium. At least one processor of an electronic device can read the executable instructions from the readable storage medium, and the at least one processor executes the executable instructions to cause the electronic device to perform at least one of the information retrieval, multimodal data rewriting model training, and dialogue methods provided in the various embodiments described above.

[0216] The term "multiple" in this document refers to two or more. The term "and/or" in this document merely describes the relationship between related objects, indicating that three relationships can exist. For example, A and/or B can represent: A alone, both A and B, or B alone. Furthermore, the character "/" in this document generally indicates an "or" relationship between the preceding and following related objects; in formulas, the character "/" indicates a "division" relationship between the preceding and following related objects. Additionally, it should be understood that in the descriptions of this disclosure, words such as "first" and "second" are used only for descriptive purposes and should not be construed as indicating or implying relative importance or order.

[0217] It is understood that the various numerical designations used in the embodiments of this disclosure are merely for descriptive convenience and are not intended to limit the scope of the embodiments of this disclosure.

[0218] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit them. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this disclosure.

Claims

1. An information retrieval method, wherein the method includes: receiving user input information, wherein the user input information includes data of multiple modalities; performing data modality rewriting on the data of the multiple modalities to obtain target text, wherein the target text is used to describe the content represented by the data of the multiple modalities; and performing retrieval based on the target text to obtain target retrieval results.

2. The method according to claim 1, wherein performing data modality rewriting on the data of the multiple modalities to obtain the target text includes: inputting the data of the multiple modalities into a multimodal data rewriting model to obtain the target text; wherein the multimodal data rewriting model is obtained by training a preset model using a sample dataset, the sample dataset includes samples and sample labels, the samples include sample data of multiple modalities, the sample labels include sample text, and the sample text is used to describe the content represented by the sample data of the multiple modalities.

3. The method according to claim 2, wherein the multimodal data rewriting model includes a feature encoding module and a text generation module; and inputting the data of the multiple modalities into the multimodal data rewriting model to obtain the target text includes: performing feature encoding on the data of the multiple modalities through the feature encoding module to obtain encoded features of the data of each modality; and inputting the encoded features of the data of each modality into the text generation module to obtain the target text.

4. The method according to any one of claims 1-3, wherein performing retrieval based on the target text to obtain the target retrieval results includes: performing retrieval based on the target text using a target search engine to obtain the target retrieval results.

5. The method according to claim 4, wherein the method is applied to a dialogue system, and the method further includes: obtaining target prompt information based on the target retrieval results and the user input information; inputting the target prompt information into a preset answer generation model to obtain answer content corresponding to the user input information; and outputting the answer content.

6. The method according to any one of claims 1-3, wherein, before performing data modality rewriting on the data of the multiple modalities to obtain the target text, the method further includes: obtaining a retrieval mode requirement of the user; and performing data modality rewriting on the data of the multiple modalities to obtain the target text includes: when the retrieval mode requirement characterizes a deep retrieval mode, performing data modality rewriting on the data of the multiple modalities to obtain the target text.

7. The method according to claim 6, wherein obtaining the retrieval mode requirement of the user includes: receiving a target instruction triggered by the user, wherein the target instruction indicates a retrieval mode, and determining the retrieval mode requirement based on the target instruction; or inputting the user input information into a retrieval mode determination model to obtain the retrieval mode requirement.

8. The method according to claim 6, wherein the method further includes: in response to the retrieval mode requirement characterizing a non-deep retrieval mode, inputting the user input information into an answer generation model to obtain answer content corresponding to the user input information; and outputting the answer content.

9. The method according to claim 2 or 3, wherein the method further includes: obtaining the sample dataset, wherein the sample dataset includes samples and sample labels, the samples include sample data of multiple modalities, the sample labels include sample text, and the sample text is used to describe the content represented by the sample data of the multiple modalities; and training the preset model using the sample dataset to obtain the trained multimodal data rewriting model, wherein the multimodal data rewriting model is used to convert data of multiple modalities into data of the text modality.

10. The method according to claim 9, wherein the preset model includes a feature encoding module and a text generation sub-model to be trained; and training the preset model using the sample dataset to obtain the trained multimodal data rewriting model includes: performing feature encoding on the samples through the feature encoding module to obtain encoded features of the sample data of each modality included in the samples; and training the text generation sub-model to be trained using the encoded features of the sample data of each modality and the sample labels to obtain a trained text generation module, wherein the text generation module is used to obtain data of the text modality based on the encoded features of data of multiple modalities.

11. A dialogue method, wherein the dialogue method includes: receiving user input information, wherein the user input information includes data in at least two of the following modalities: image, text, and audio; inputting the data of the at least two modalities into a multimodal data rewriting model to obtain target text, wherein the target text is used to describe the content represented by the data of the at least two modalities; performing a search based on the target text using a target search engine to obtain target search results; determining answer content based on the target search results and the user input information; and outputting the answer content.

12. An information retrieval apparatus, wherein the apparatus includes: a receiving module, used to receive user input information, wherein the user input information includes data of multiple modalities; a rewriting module, used to perform data modality rewriting on the data of the multiple modalities to obtain target text, wherein the target text is used to describe the content represented by the data of the multiple modalities; and a processing module, used to perform retrieval based on the target text to obtain target retrieval results.

13. The apparatus according to claim 12, wherein the rewriting module is specifically used for: inputting the data of the multiple modalities into a multimodal data rewriting model to obtain the target text; wherein the multimodal data rewriting model is obtained by training a preset model using a sample dataset, the sample dataset includes samples and sample labels, the samples include sample data of multiple modalities, the sample labels include sample text, and the sample text is used to describe the content represented by the sample data of the multiple modalities.

14. The apparatus according to claim 13, wherein the multimodal data rewriting model includes a feature encoding module and a text generation module; and the rewriting module is specifically used for: performing feature encoding on the data of the multiple modalities through the feature encoding module to obtain encoded features of the data of each modality; and inputting the encoded features of the data of each modality into the text generation module to obtain the target text.

15. The apparatus according to any one of claims 12-14, wherein the processing module is specifically used for: performing retrieval based on the target text using a target search engine to obtain the target retrieval results.

16. The apparatus according to claim 15, wherein the apparatus is applied to a dialogue system, and the processing module is further used for: obtaining target prompt information based on the target retrieval results and the user input information; and inputting the target prompt information into a preset answer generation model to obtain answer content corresponding to the user input information; and the information retrieval apparatus further includes an output module, used to output the answer content.

17. A dialogue device, wherein the dialogue device includes: a receiving module, used to receive user input information, wherein the user input information includes data in at least two of the following modalities: image, text, and audio; a processing module, used to input the data of the at least two modalities into a multimodal data rewriting model to obtain target text, to perform a search based on the target text using a target search engine to obtain target search results, and to determine answer content based on the target search results and the user input information, wherein the target text is used to describe the content represented by the data of the at least two modalities; and an output module, used to output the answer content.

18. An electronic device, including: a processor and a memory; wherein the processor is communicatively connected to the memory; the memory stores computer instructions; and the processor executes the computer instructions stored in the memory to implement the method according to any one of claims 1-11.

19. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, are used to implement the method according to any one of claims 1-11.

20. A computer program, wherein, when the computer program is executed in a computer, the computer program causes the computer to perform the steps of the method according to any one of claims 1-11.