Question and answer processing method, device, storage medium, and program product

By retrieving external knowledge from a knowledge base and filtering and denoising it with a multimodal knowledge and visual denoising model, the method addresses the problem of inaccurate answers from customer service AI robots, producing more accurate and reliable answers and improving service efficiency and user satisfaction.

CN122198152A · Pending · Publication Date: 2026-06-12 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG UNIV
Filing Date
2026-05-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing multimodal models used by customer service robots are prone to outdated information, inaccurate content, logical contradictions, and even hallucinated content when answering user questions, making it difficult to meet the accuracy, professionalism, and reliability requirements of practical applications.

Method used

By retrieving external knowledge from the knowledge base and using a multimodal knowledge and visual denoising model to filter and denoise the initial knowledge fragments and the initial image, key information strongly related to the current question is retained and input into the question-answering generation model to generate answers, thereby enhancing the relevance and accuracy of the answers.

Benefits of technology

Improves the accuracy and reliability of customer service chatbot responses, reduces the cost of human intervention, and enhances service efficiency and user experience.

Abstract

The embodiments of this application provide a question-and-answer processing method, device, storage medium, and program product, and relate to the technical field of artificial intelligence. The method performs knowledge base retrieval based on a question text and an initial image, then finely filters and compresses the initial image and the retrieved initial knowledge fragments, retaining the target knowledge fragments and target image that are strongly related to the current question. The question text, the initial image, the initial knowledge fragments, the target knowledge fragments, and the target image are then input into a question-answering generation model to generate an answer. This constructs a closed-loop visual question-answering process of multimodal retrieval, multimodal denoising, and knowledge enhancement, so that intelligent customer service can accurately invoke knowledge on the basis of understanding user intent, providing users with a more accurate, reliable, and consistent automated service experience.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a question-and-answer processing method, device, storage medium, and program product.

Background Technology

[0002] Against the backdrop of accelerated digital transformation and increasingly sophisticated consumer service demands, intelligent customer service robots have been widely applied in various industries such as e-commerce, finance, telecommunications, and home appliances, becoming an important tool for improving service efficiency and reducing labor costs. In the workflow of an intelligent customer service robot, after a user raises a question, the system uses a multimodal model to understand and generate natural language, and the intelligent customer service robot provides an immediate response.

[0003] However, in practical applications, user questions are often closely coupled with specific application scenarios and rules. While current multimodal models excel in general semantic understanding and generation, relying solely on the model's general capabilities for answers can easily lead to problems such as outdated information, inaccurate content, logical contradictions, and even hallucinated content, making it difficult to meet the requirements of accuracy, professionalism, and reliability in practical applications.

Summary of the Invention

[0004] This application provides a question-and-answer processing method, device, storage medium, and program product that can improve the relevance and accuracy of customer service robot answers.

[0005] This application provides a question-answering processing method, comprising: acquiring a question text and an initial image corresponding to the question text, and retrieving an initial knowledge fragment from a target knowledge base based on the question text and the initial image, wherein the initial image includes the target object described by the question text; inputting the question text, the initial image, and the initial knowledge fragment into a multimodal knowledge and visual denoising model, using the question text as a first guiding condition to filter the initial knowledge fragment to obtain a target knowledge fragment, and using the initial knowledge fragment and the question text as a second guiding condition to perform cross-modal semantic alignment and conditional denoising on the initial image to obtain a target image; and inputting the question text, the initial image, the target image, the initial knowledge fragment, and the target knowledge fragment into a question-answering generation model, using the target image and the target knowledge fragment as the factual basis and the initial image and the initial knowledge fragment as contextual auxiliary information to generate an answer, thereby obtaining the answer information corresponding to the question text.

[0006] This application also provides a computing device, including: a memory and a processor; wherein, the memory stores executable code, and when the executable code is executed by the processor, the processor performs the steps in the question-and-answer processing method.

[0007] This application also provides a computer-readable storage medium storing executable code, which, when executed by a processor of a computing device, causes the processor to perform the steps in the question-and-answer processing method.

[0008] This application also provides a computer program product, including: a computer program / instructions, which, when executed by a processor, enable the processor to implement the steps in the question-and-answer processing method.

[0009] The method provided in this application can acquire question text and its corresponding initial image, and retrieve initial knowledge fragments from a target knowledge base based on the question text and the initial image. This allows for the acquisition of the latest personalized knowledge from the target knowledge base, enhancing the professional understanding and response capabilities of the question-answering generation model. Furthermore, the question text, initial image, and initial knowledge fragments can be input into a multimodal knowledge and visual denoising model. Using the question text as a first guiding condition, the initial knowledge fragments are filtered to obtain target knowledge fragments. Then, using the initial knowledge fragments and question text as a second guiding condition, cross-modal semantic alignment and conditional denoising are performed on the initial image to obtain the target image. This allows for refined suppression of redundant and low-relevance information in the initial knowledge fragments and initial image, retaining key information strongly relevant to the current question, and avoiding the direct input of lengthy and noisy information into the question-answering generation model. Then, the question text, initial image, target image, initial knowledge fragments, and target knowledge fragments are input into the question-answering generation model. The target image and target knowledge fragments are used as the factual basis, and the initial image and initial knowledge fragments are used as contextual auxiliary information to generate the answer, so as to obtain the answer information corresponding to the question text, achieving visual question answering with high accuracy, low hallucination, and strong interpretability.

Description of the Drawings

[0010] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

Figure 1 is a flowchart of a question-and-answer processing method provided by an exemplary embodiment of this application;
Figure 2 is a flowchart of another question-and-answer processing method provided by an exemplary embodiment of this application;
Figure 3 is a schematic diagram of the structure of a multimodal knowledge and visual denoising model provided by an exemplary embodiment of this application;
Figure 4 is a schematic diagram of the structure of a relational reasoning model provided by an exemplary embodiment of this application;
Figure 5 is a schematic diagram of the structure of a question-answering generation model provided by an exemplary embodiment of this application;
Figure 6 is a schematic diagram of the structure of a question-and-answer processing apparatus provided by an exemplary embodiment of this application;
Figure 7 is a schematic diagram of the structure of a computing device provided by an exemplary embodiment of this application.

Detailed Description

[0011] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0012] It should be noted that, in the cases involving user information in the embodiments of this application, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in the embodiments of this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse. In addition, the various models involved in this application (including but not limited to language models or large models) comply with relevant laws and standards.

[0013] Additionally, it should be noted that when user interaction operations or triggering operations are involved in the embodiments of this application, these operations include, but are not limited to, various interaction methods such as touch operations, gesture operations, voice operations, head movement operations, and eye movement operations. Touch operations include, but are not limited to, click operations, double-click operations, long-press operations, swipe operations, pinch operations, or mouse hover operations. Swipe operations include, but are not limited to, straight-line swipes and curved-line swipes.

[0014] Against the backdrop of accelerated digital transformation and increasingly sophisticated consumer service demands, intelligent customer service robots have been widely applied in various industries such as e-commerce, finance, telecommunications, and home appliances, becoming an important tool for improving service efficiency and reducing labor costs. In the workflow of an intelligent customer service robot, after a user raises a question, the system uses a multimodal model to understand and generate natural language, and the intelligent customer service robot provides an immediate response.

[0015] However, in practical applications, user questions are often closely coupled with specific application scenarios and rules, involving everything from product specifications and functionalities to after-sales policies. While current multimodal models excel in general semantic understanding and generation, relying solely on the model's general capabilities for answers can easily lead to problems such as outdated information, inaccurate content, logical contradictions, and even hallucinated content, making it difficult to meet the requirements of accuracy, professionalism, and reliability in practical applications.

[0016] In some embodiments of this application, external knowledge can be retrieved from a knowledge base and injected into the multimodal model to enhance its professional understanding and response capabilities. However, directly concatenating these lengthy and noisy search results into the multimodal model could easily lead to information interference, thus weakening the model's reasoning and response quality.

[0017] In other embodiments of this application, a knowledge-enhanced KBVQA (Knowledge-based Visual Question Answering) process is constructed: after retrieving external knowledge, multimodal knowledge denoising and visual denoising are performed on the image and retrieved text knowledge to retain key information strongly related to the current question. Then, the denoised knowledge information and visual information are injected into the question-answering generation model for KBVQA inference and answer generation. This enables the customer service robot to accurately invoke knowledge based on understanding the user's intent, improve the relevance, reliability, and accuracy of the answer, reduce the manpower and time costs required for manual intervention, improve overall service efficiency, provide users with a more accurate, reliable, and consistent automated service experience, and provide a scalable knowledge enhancement solution for the intelligent customer service system.

[0018] The technical solutions provided by the various embodiments of this application are described in detail below with reference to the accompanying drawings.

[0019] Figure 1 is a flowchart of a question-and-answer processing method provided by an exemplary embodiment of this application. As shown in Figure 1, the method includes:

Step 11: Obtain the question text and the corresponding initial image. Based on the question text and the initial image, retrieve the initial knowledge fragments from the target knowledge base. The initial image includes the target object described in the question text.

[0020] Step 12: Input the question text, initial image, and initial knowledge fragment into the multimodal knowledge and visual denoising model. Using the question text as the first guiding condition, filter the initial knowledge fragment to obtain the target knowledge fragment. Using the initial knowledge fragment and question text as the second guiding condition, perform cross-modal semantic alignment and conditional denoising on the initial image to obtain the target image.

[0021] Step 13: Input the question text, initial image, target image, initial knowledge fragment, and target knowledge fragment into the question-answering generation model. Use the target image and target knowledge fragment as the factual basis and the initial image and initial knowledge fragment as contextual auxiliary information to generate the answer, so as to obtain the answer information corresponding to the question text.
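The three steps above can be sketched as a small orchestration function. This is only an illustrative outline, not the application's implementation: the names `GeneratorInputs`, `answer_question`, `retrieve`, `denoise`, and `generate` are hypothetical, and the three models are passed in as plain callables so that any retrieval, denoising, or generation backend could be substituted.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class GeneratorInputs:
    """The five inputs that Step 13 passes to the generation model."""
    question: str
    initial_image: Any
    target_image: Any
    initial_fragments: List[str]
    target_fragments: List[str]


def answer_question(
    question: str,
    initial_image: Any,
    retrieve: Callable[[str, Any], List[str]],
    denoise: Callable[[str, Any, List[str]], Tuple[List[str], Any]],
    generate: Callable[[GeneratorInputs], str],
) -> str:
    # Step 11: joint retrieval over the question text and the initial image.
    initial_fragments = retrieve(question, initial_image)
    # Step 12: filter the knowledge fragments and denoise the image.
    target_fragments, target_image = denoise(
        question, initial_image, initial_fragments
    )
    # Step 13: generate the answer from both the filtered and the raw inputs.
    inputs = GeneratorInputs(
        question=question,
        initial_image=initial_image,
        target_image=target_image,
        initial_fragments=initial_fragments,
        target_fragments=target_fragments,
    )
    return generate(inputs)
```

Keeping the initial and target inputs as separate fields lets the generation step treat the target fields as evidence and the initial fields as background, as Step 13 describes.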

[0022] The above-described question-and-answer processing method can be applied to various scenarios requiring interaction with customer service robots, such as product consultation and after-sales service in the e-commerce industry, wealth management customer service in the financial industry, or health consultant scenarios in the healthcare industry. This embodiment does not impose any limitations. The target object can be any object, including but not limited to: product objects, cultural and tourism resource objects, or digital artworks, etc. For example, it can be animals, plants, buildings, people, or various services, etc., or various goods that can be sold or displayed on e-commerce platforms, such as vehicles, mobile phones, tables, sofas, clothes, or sporting goods, etc. This embodiment does not impose any limitations.

[0023] The question text is a query statement entered by the user in natural language, expressing their information needs about a target object. It can include the current question and historical dialogue context. In an e-commerce scenario, the question text is a specific question about a product. The initial image is the raw visual input semantically or contextually related to the question text, and includes the target object described in the question text. For example, it could be a product image retrieved from a product details page, an image uploaded by the user, or an image from the system's default image library. Initial knowledge fragments are collections of textual information retrieved from the target knowledge base based on the question text and initial image. They are candidate knowledge that has not undergone relevance filtering or noise reduction and may contain redundant, weakly related, or indirectly related information. For example, in an e-commerce scenario, initial knowledge fragments include, but are not limited to: basic product information and image / text details, specifications, instructions for use, size / material specifications, FAQ (Frequently Asked Questions) documents, after-sales policies, and application rules. The target image is a high-quality visual representation obtained by performing content-aware enhancement and denoising on the initial image based on the question text and initial knowledge fragments. This processing focuses on visual regions relevant to the question, improving their clarity and recognizability. The target knowledge fragments are a subset of the initial knowledge fragments, selected after relevance scoring and semantic matching against the question text. Their content is semantically strongly related to the question text, such as explicit product parameters, functional descriptions, or after-sales terms.

[0024] The electronic device responsible for question-and-answer processing can acquire the question text and initial image. Optionally, a question-and-answer page can be provided to the user, allowing the user to ask questions about the target object, thereby acquiring the question text and initial image for the target object based on the question-and-answer page.

[0025] Referring to Figure 2, after obtaining the question text and initial image, a joint search can be performed on the question text and initial image in the target knowledge base to obtain initial knowledge fragments. The search process can be implemented in various ways, including but not limited to: vector similarity matching, keyword inverted index, entity relationship reasoning based on a knowledge graph, multimodal hash retrieval, or a combination of the above methods. There are no restrictions on the search method.

[0026] For example, cross-modal retrieval can be performed using vector similarity matching. The aforementioned "retrieval of initial knowledge fragments in the target knowledge base based on question text and initial image" can be implemented as follows: the initial image is encoded using a fourth visual encoder to obtain a fourth visual vector; the question text is encoded using a fourth text encoder to obtain a fifth text vector; candidate knowledge fragments are retrieved in the target knowledge base based on the fourth visual vector and the fifth text vector; and the candidate knowledge fragments are filtered using keywords from the question text to obtain the initial knowledge fragments.

[0027] In this example, on one hand, a fourth visual encoder is used to extract features from the initial image, outputting a corresponding fourth visual vector. The visual encoder converts the input image into a representation that can be understood and processed by a machine, i.e., a feature map or feature vector. The fourth visual encoder can be used to segment the initial image into multiple image patches, and then these patches are encoded to ultimately generate a series of feature representations, i.e., the fourth visual vector. On the other hand, a fourth text encoder is used to perform deep semantic parsing on the question text, outputting a corresponding fifth text vector. The fourth text encoder can be used to segment the question text into multiple tokens, assigning each token a corresponding word embedding, position embedding, and possible paragraph embedding, thereby generating a high-dimensional text vector sequence semantically aligned with the question text, i.e., the fifth text vector.

[0028] This application does not limit the specific implementation of the fourth visual encoder and the fourth text encoder. For example, the fourth visual encoder can be a CNN (Convolutional Neural Networks) or a ViT (Vision Transformer), etc., and the fourth text encoder can be a language model based on a Transformer architecture, etc. The fourth visual encoder and the fourth text encoder can belong to the same multimodal model, or they can be independently trained different models; this is not limited. A multimodal model is a model capable of processing and fusing data from multiple different information sources (i.e., modalities).

[0029] Then, the fourth visual vector and the fifth text vector are fused to form a fused multimodal representation. Fusion methods include, but are not limited to, concatenation, weighted fusion, or cross-attention mechanisms. Vector similarity matching is performed on the multimodal representation in the target knowledge base. Through vector similarity calculation (e.g., cosine similarity or Euclidean distance), one or more candidate knowledge fragments related to the multimodal representation are retrieved from the target knowledge base. To further improve retrieval accuracy, keywords from the question text can be combined to perform secondary keyword matching on the candidate knowledge fragments, filtering out content more relevant to the user's question as initial knowledge fragments for subsequent processing or answer generation.
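The retrieval flow described above (fusion, vector similarity matching, then secondary keyword matching) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the `{"text", "vector"}` entry layout, the equal fusion weight, and the `top_k` cutoff are all hypothetical choices, weighted fusion is only one of the fusion methods mentioned, and both encoders are assumed to output vectors of the same dimensionality.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(visual_vec, text_vec, w_visual: float = 0.5) -> List[float]:
    """Weighted fusion of the visual and text vectors (assumed same size)."""
    return [w_visual * v + (1.0 - w_visual) * t
            for v, t in zip(visual_vec, text_vec)]


def retrieve_initial_fragments(visual_vec, text_vec,
                               knowledge_base: List[Dict],
                               keywords: List[str],
                               top_k: int = 3) -> List[str]:
    query = fuse(visual_vec, text_vec)
    # First pass: rank knowledge entries by similarity to the fused query.
    ranked = sorted(knowledge_base,
                    key=lambda entry: cosine(query, entry["vector"]),
                    reverse=True)
    candidates = ranked[:top_k]
    # Second pass: keep candidates that mention at least one question keyword.
    lowered = [k.lower() for k in keywords]
    return [entry["text"] for entry in candidates
            if any(k in entry["text"].lower() for k in lowered)]
```

In a production system the first pass would typically use an approximate nearest-neighbor index rather than a full sort; the linear scan here only keeps the sketch self-contained.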

[0030] Considering that initial knowledge fragments are often broad in scope and numerous, and that the relevance of different fragments to the current initial image and question text varies significantly, after obtaining the initial knowledge fragments, the question text, initial image, and initial knowledge fragments can be input into a multimodal knowledge and visual denoising model. This model filters out irrelevant image regions and weakly related initial knowledge fragments, reducing noise and redundancy, and generating compact, high-signal information. The multimodal knowledge and visual denoising model has the following capabilities: while preserving key semantic information, it performs cross-modal alignment and collaborative learning on raw data from different modalities, executes visual denoising and knowledge filtering in parallel, and effectively identifies and suppresses noise and redundancy in each modality. On the text side, the multimodal knowledge and visual denoising model uses the question text as the first guiding condition, identifies and extracts key information highly relevant to the question text from lengthy or noisy initial knowledge fragments, masks useless or weakly related content, and condenses the retrieved initial knowledge fragments into highly salient target knowledge fragments. On the visual side, the multimodal knowledge and visual denoising model uses the question text and initial knowledge fragments as the second guiding condition. By performing cross-modal semantic alignment and conditional denoising on the initial image, it dynamically identifies salient regions in the image that are related to the multimodal semantics and suppresses irrelevant image regions, thereby highlighting the image regions supported by the initial knowledge fragments and question text, and generating the target image.

[0031] Based on the filtered target knowledge fragments and the target image, a question-answering generation model can be used to generate an answer to the question text. In some implementations, the question text, target image, and target knowledge fragments can be input into the question-answering generation model together. The model is driven by both the textual facts provided by the target knowledge fragments and the clear visual evidence provided by the target image, resulting in an accurate and verifiable answer. After completing the knowledge-enhanced visual question answering, the generated answer can be returned to the user.

[0032] In other implementations, the question text, initial image, target image, initial knowledge fragment, and target knowledge fragment can be jointly input into the question-answering generation model. The model uses the target image and target knowledge fragment as primary guiding conditions, anchoring the core factual basis and key visual evidence of the answer to ensure that the generated content accurately aligns with the user's intent and the product's true attributes. Simultaneously, the initial image and initial knowledge fragment serve as contextual auxiliary information, providing visual background and knowledge boundaries in a real-world scenario, helping the model understand interfering factors, perform disambiguation judgments, or enhance the interpretability and robustness of the answer. Through this multimodal knowledge-enhanced visual question-answering mechanism, which leverages both the precision of the target knowledge and the comprehensiveness of the initial knowledge, the final answer generated by the model not only strictly conforms to the clear visual features presented by the target image and the highly relevant textual information contained in the target knowledge fragment, but also possesses a reasonable perception of the actual user's perspective and the complete product context, thus outputting a more robust, accurate, credible, and realistic intelligent customer service response. The initial knowledge fragment and target knowledge fragment are explicit knowledge.
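One way to make the two roles explicit for a text-prompted generation model is to place the inputs in labeled sections. The sketch below is an assumption-laden illustration only: the application does not specify a prompt format, so the section labels, the function name `build_generator_prompt`, the use of image references as strings, and the choice to list under context only those initial fragments not already selected as targets are all hypothetical.

```python
from typing import List


def build_generator_prompt(question: str,
                           target_fragments: List[str],
                           initial_fragments: List[str],
                           target_image_ref: str,
                           initial_image_ref: str) -> str:
    # Initial fragments that were not selected serve only as background.
    background = [f for f in initial_fragments if f not in target_fragments]
    lines = ["[FACTUAL BASIS]"]
    lines += [f"- {fragment}" for fragment in target_fragments]
    lines.append(f"- image: {target_image_ref}")
    lines.append("[CONTEXT ONLY]")
    lines += [f"- {fragment}" for fragment in background]
    lines.append(f"- image: {initial_image_ref}")
    lines.append("[QUESTION]")
    lines.append(question)
    return "\n".join(lines)
```

Separating the sections gives the model a clear signal about which material may be cited as evidence and which is only there for disambiguation.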

[0033] In other implementations, implicit knowledge can be generated first based on the question text, target image, and target knowledge fragments. This implicit knowledge can then be used as supplementary information to input into the question-answering generation model, thereby enhancing the consistency of multimodal semantics and the coherence of the answers.

[0034] By performing knowledge base retrieval, filtering the initial knowledge fragments obtained from the retrieval, and denoising the initial images, simplified target knowledge fragments and target images are obtained. Then, the target knowledge fragments and target images are input into the question-answering generation model, which can reduce the impact of redundant noise in the retrieval results and significantly improve the accuracy and robustness of intelligent customer service question answering.

[0035] This application does not limit the parameter size of the question-answering generation model or the multimodal knowledge and visual denoising model. For example, they can be relatively large deep learning models. Here, "large model" is just one example; this application does not limit the number of parameters of the deep learning models used, as long as actual needs are met. For example, the question-answering generation model can be an LLM (Large Language Model), and the multimodal knowledge and visual denoising model can be an MLLM (Multimodal Large Language Model).

[0036] In some exemplary embodiments, step 12 in the foregoing embodiments can be implemented based on the following steps S1-S6.

[0037] S1. Encode the question text using a first text encoder to obtain a first text vector sequence, which includes the vectors of each question word in the question text.

[0038] In this embodiment, the first text encoder of the multimodal knowledge and visual denoising model is used to capture the semantic information of natural language and map it into a high-dimensional vector representation, namely the semantic vector of the question text, which is then used as a text conditional embedding. This application does not limit the specific implementation of the first text encoder. For example, after the question text is input into the first text encoder, the first text encoder can segment the question text at the sub-word or token level to obtain a sequence of question tokens. Subsequently, each question token is assigned a corresponding word embedding, position embedding, and possible paragraph embedding, and input into a deep neural network for context modeling, thereby generating a first text vector sequence semantically aligned with the question text.

[0039] S2. Encode the initial knowledge fragment using a second text encoder to obtain a second text vector sequence, which includes the vectors of each knowledge word in the initial knowledge fragment.

[0040] The second text encoder of the multimodal knowledge and visual denoising model can segment the initial knowledge fragment at the sub-word or token level to obtain a sequence of knowledge tokens. Subsequently, each knowledge token is assigned a corresponding word embedding, position embedding, and possible paragraph embedding, which are then input into a deep neural network for context modeling, thereby generating a second text vector sequence semantically aligned with the initial knowledge fragment. The second text encoder can use the same structure as the first text encoder or a different structure; there is no limitation on this.

[0041] S3. Encode the initial image using a first visual encoder to obtain a first visual vector sequence, which includes the vectors of each image block in the initial image.

[0042] In this embodiment, the first visual encoder of the multimodal knowledge and visual denoising model is used to convert the input initial image into a representation that can be understood and processed by the machine, namely, a feature map or feature vector. This application does not limit the specific implementation of the first visual encoder. For example, after the initial image is input into the first visual encoder, the first visual encoder can divide the initial image into several image blocks and, through multi-layer attention encoding, output feature vectors representing the semantic content of the initial image, called the first visual vector sequence.

[0043] S4. Concatenate the first text vector sequence and the second text vector sequence to obtain the third text vector sequence.

[0044] The encoded first and second text vector sequences are concatenated according to a specific format to prepare for semantic fusion between the question text and external knowledge. This concatenation operation preserves the semantic boundaries of both the question and the knowledge, enabling the subsequent attention mechanism to clearly distinguish between the two semantic sources, while also providing a structured input basis for jointly modeling the interaction between them.
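The concatenation in S4 can be illustrated with a short sketch. The "specific format" is not defined in the text, so the use of per-position segment ids (0 for question vectors, 1 for knowledge vectors) to preserve the semantic boundary is an assumption made here for illustration, as is the function name.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def concat_with_segments(
    question_vecs: List[Vector],
    knowledge_vecs: List[Vector],
) -> Tuple[List[Vector], List[int]]:
    """Build the third text vector sequence plus a boundary marker per position."""
    sequence = list(question_vecs) + list(knowledge_vecs)
    # Segment id 0 marks question positions, 1 marks knowledge positions.
    segment_ids = [0] * len(question_vecs) + [1] * len(knowledge_vecs)
    return sequence, segment_ids
```

With the segment ids available, a later attention layer can read off the question-to-knowledge block of its attention weights without re-deriving where the question ends.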

[0045] S5. Input the third text vector sequence into the text attention layer, use the multi-head attention mechanism to generate the question-knowledge relevance matrix, and select the target knowledge fragment from the initial knowledge fragments based on the question-knowledge relevance matrix.

[0046] The text attention layer of the multimodal knowledge and visual denoising model is used to achieve fine-grained and interpretable knowledge filtering of initial knowledge fragments, thereby providing refined and high signal-to-noise ratio knowledge support for question-answering generation. This application does not limit the specific implementation of the text attention layer. Exemplarily, the text attention layer models the dependencies between question words and knowledge words within a third text vector sequence based on a multi-head attention mechanism, and extracts the attention intensity distribution of question words to knowledge words, forming a question-knowledge relevance matrix. This question-knowledge relevance matrix reflects the semantic association between each knowledge unit and the question intent in the current question-answering context. Based on this relevance matrix, words or sub-fragments in the initial knowledge fragment can be quantitatively scored, and irrelevant content can be removed according to a preset strategy (such as threshold filtering or ranking selection), outputting the target knowledge fragment focused on the question itself.

[0047] S6. Input the third text vector sequence and the first visual vector sequence into the visual attention layer. Guided by the third text vector sequence, perform semantic alignment and denoising on the first visual vector sequence based on cross-modal attention mechanism to obtain the target image.

[0048] The visual attention layer of the multimodal knowledge and visual denoising model is used to achieve visual content denoising guided by text semantics. This application does not limit the specific implementation of the visual attention layer. Exemplarily, the visual attention layer uses a cross-modal attention mechanism to dynamically guide the weighted fusion of the first visual vector sequence, using a third text vector sequence as a query signal. During this process, image regions highly relevant to the text semantics are enhanced, while irrelevant background regions are suppressed, thereby completing image denoising based on multimodal semantic consistency. The output target image retains the visual evidence required to support question-answering reasoning.

[0049] The multimodal knowledge and visual denoising model performs visual denoising and knowledge filtering in parallel. By modeling relevance through attention mechanisms, it achieves semantic alignment both between text and text and between text and visual information. This ensures that the image denoising results and the knowledge filtering results are guided by the same question-answering objective, thereby improving overall consistency and robustness.

[0050] The architecture and processing flow of the above multimodal knowledge and visual denoising model are illustrated below by example with reference to Figure 3.

[0051] As shown in Figure 3, the multimodal knowledge and visual denoising model includes a first text encoder, a second text encoder, a first visual encoder, a visual attention layer, and a text attention layer.

[0052] The first visual encoder is responsible for converting the input initial image into a first visual vector sequence rich in semantic information. The first visual encoder divides the initial image into multiple fixed-size non-overlapping image patches. Each patch is linearly projected and mapped into a multi-dimensional vector, forming an initial visual vector sequence of length N. Each visual vector contains local semantic information of the initial image. Subsequently, the positional encoding module of the first visual encoder adds learnable two-dimensional positional encodings to the multiple image patch vectors to preserve the original spatial structure information. The resulting positional visual vector sequence is input into a backbone network composed of multiple stacked Transformer encoders. Each encoder layer includes a multi-head self-attention mechanism and a feedforward neural network (FNN). Through the multi-head self-attention mechanism and the feedforward neural network, local details and global semantics of the image can be gradually extracted, resulting in a first visual vector sequence that represents the local and global visual semantic information of the initial image.
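The patch-splitting step described above can be sketched as follows. This is a minimal illustration only: it assumes a single-channel image stored as a nested list, and the function name `split_into_patches` is hypothetical. The linear projection, positional encoding, and Transformer backbone are omitted.

```python
from typing import List

Image = List[List[float]]  # H x W single-channel image (simplifying assumption)


def split_into_patches(image: Image, patch: int) -> List[List[float]]:
    """Split an image into fixed-size, non-overlapping patch x patch blocks.

    Each block is flattened into one vector; patches are emitted in
    row-major order, so the result is a sequence of length N.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            patches.append(block)
    return patches


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)
print(len(patches))  # N = (4 / 2) * (4 / 2) = 4
print(patches[0])    # [0.0, 1.0, 4.0, 5.0]
```

In a full encoder each flattened patch would then be linearly projected and combined with a positional encoding before entering the Transformer layers.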

[0053] The first text encoder segments the question text into multiple question tokens and encodes each token to obtain a first text vector sequence, thereby transforming the question text into a semantic representation that the model can understand and process. This first text vector sequence represents the core semantic intent and keywords of the question text, serving as the basis for subsequent relevance assessments and attention-guided queries.

[0054] The second text encoder segments the initial knowledge fragment into multiple knowledge units and encodes each knowledge unit to obtain a second text vector sequence. This second text vector sequence represents the semantic state of each knowledge unit in the initial knowledge fragment after incorporating its internal context. The second text encoder returns not only the identifier (ID) of each knowledge unit but also the start and end positions of each knowledge unit within the initial knowledge fragment, forming a knowledge-side offset mapping. This allows the model output to be mapped back to character positions in the original text, thereby accurately locating the target knowledge fragment.

[0055] The first text vector sequence (denoted as T) and the second text vector sequence (denoted as Q) are concatenated to obtain the third text vector sequence (denoted as X), represented as X=[CLS]T[SEP]Q[SEP]. Here, the classification label [CLS] indicates the starting boundary, and the separator [SEP] divides the index range of knowledge terms and question terms. Due to the maximum input length limit of the visual attention layer, if the concatenated third text vector sequence exceeds the input length limit, it can be truncated, prioritizing the retention of question text and truncating the end of the knowledge segment. The valid sets of knowledge term indices and question term indices are recorded for subsequent weight calculation and attention analysis. The offset mapping records corresponding to the sub-words removed due to truncation are discarded; thus, the retained offset mappings are always aligned with the subsequently used knowledge terms to accurately map back to the character range of the initial knowledge segment.
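The concatenation and truncation rule above can be sketched on token strings rather than vectors. This is a hedged illustration: `build_joint_sequence`, the sample tokens, and the end-exclusive character offsets are assumptions made for the example, not details from the application.

```python
from typing import List, Tuple

Offset = Tuple[int, int]  # (start, end) character span in the knowledge text


def build_joint_sequence(question: List[str], knowledge: List[str],
                         offsets: List[Offset], max_len: int):
    """Build X = [CLS] question [SEP] knowledge [SEP] within max_len tokens.

    The question is always kept whole; only the tail of the knowledge is cut.
    """
    budget = max_len - len(question) - 3  # room left after 3 special tokens
    if budget < 0:
        raise ValueError("question alone exceeds the input length limit")
    kept = knowledge[:budget]
    kept_offsets = offsets[:budget]  # offsets of truncated tokens are dropped
    sequence = ["[CLS]", *question, "[SEP]", *kept, "[SEP]"]
    question_idx = list(range(1, 1 + len(question)))
    first_knowledge = len(question) + 2
    knowledge_idx = list(range(first_knowledge, first_knowledge + len(kept)))
    return sequence, question_idx, knowledge_idx, kept_offsets


question = ["which", "brake", "type"]
knowledge = ["disc", "brakes", "use", "calipers", "and", "rotors"]
offsets = [(0, 4), (5, 11), (12, 15), (16, 24), (25, 28), (29, 35)]
seq, q_idx, k_idx, kept = build_joint_sequence(question, knowledge, offsets, 9)
print(seq)
print(q_idx, k_idx, kept)
```

Because the offsets are sliced together with the knowledge tokens, the retained offset mapping stays aligned with the knowledge token indices that survive truncation.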

[0056] The multimodal knowledge and visual denoising model comprises two parallel processing flows. On the image side, a cross-attention mechanism is used to measure the semantic matching degree between each image region and the text terms, generating a knowledge- and question-guided image region mask to suppress irrelevant regions. On the text side, a question-knowledge relevance matrix is generated based on a multi-head attention mechanism. Phrases relevant to the question are selected from the retrieved initial knowledge fragments, so that a subset of knowledge highly relevant to the current question is automatically identified and extracted from lengthy or noisy initial knowledge fragments.

[0057] On the image side, guided by the third text vector sequence, semantic alignment and denoising based on a cross-modal attention mechanism are performed on the first visual vector sequence to obtain the target image. This includes: performing cross-modal correlation analysis on the third text vector sequence and the first visual vector sequence based on the cross-modal attention mechanism, and generating an initial attention map based on the cross-modal correlation analysis results; performing conditional denoising on the initial attention map at the text word dimension to obtain an image block-level mask; and performing image filtering on each image block in the initial image based on the image block-level mask to obtain the target image.

[0058] The third text vector sequence and the first visual vector sequence are input together into the visual attention layer. The visual attention layer consists of several stacked Transformer layers and a masking module for post-processing. Each Transformer layer contains text self-attention, visual self-attention, and cross-attention. Text self-attention models the semantic interaction between question terms and knowledge terms, forming a joint language understanding. This provides more accurate context-enhanced query vectors for subsequent cross-attention, enabling image denoising to be based on the fused question and knowledge semantics, rather than the original isolated terms. Visual self-attention models the spatial and semantic relationships between regions within the image, improving the semantic richness of each image patch. This allows cross-attention to more accurately determine whether an image patch is related to the text, avoiding misjudgments due to local feature ambiguity. Cross-attention maps the fused text semantics to the visual space, generating an attention map for image denoising.

[0059] In the cross-attention submodule, text terms and image patches interact through a cross-modal attention mechanism, enabling text terms to focus on visual patches via this mechanism. In each layer of cross-attention, text terms are used as queries, and image patches as keys and values. The relevance scores between text terms and each image patch in the initial image are calculated to dynamically measure the semantic relevance of each image region to the question and knowledge. Question terms and knowledge terms are collectively referred to as text terms.

[0060] For example, in the cross-attention mechanism, for any text word, a learnable visual attention projection matrix is used to transform the text word into a visual query vector, and each image patch generates a corresponding key vector. The dot product of the query vector of the text word and the key vector of each image patch is computed and, after processing by a softmax (normalized exponential) function, the relevance score of each image patch is output. Assuming there are P image patches, the P relevance scores corresponding to the text word form an attention distribution of size 1×P. By treating the P relevance scores of each text word as one row and stacking the attention distributions of the L text words in row order, an initial attention map of size L×P is formed. The initial attention map represents the relevance between each image region and each text word; it is equivalent to an L×P initial relevance score matrix or attention distribution matrix. The relevance score reflects the intensity of attention that a text word pays to a particular image patch given the question and the knowledge. Overall, image patches highly relevant to the question receive higher attention, indicating that they are more likely to contain the key visual content needed to answer the question. In other words, image regions that are highly relevant to the semantics of the question (such as target objects and key attributes) receive higher attention weights, while visual noise such as backgrounds, borders, and decorative textures is assigned lower weights.
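The construction of the L×P initial attention map can be sketched as below. This is a minimal sketch under stated assumptions: the query and key vectors are taken as already projected, and the division by the square root of the vector dimension follows common Transformer practice rather than anything stated in the text. The names `softmax` and `attention_map` are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """Return an L x P map: row i is text word i's distribution over patches."""
    scale = math.sqrt(len(keys[0]))  # assumed scaling, standard in Transformers
    rows = []
    for query in queries:
        logits = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in keys]
        rows.append(softmax(logits))
    return rows


queries = [[1.0, 0.0], [0.0, 1.0]]          # L = 2 text words
keys = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # P = 3 image patches
attn = attention_map(queries, keys)
for row in attn:
    print([round(v, 3) for v in row])
```

Each row sums to one, and the largest entry in a row marks the patch whose key is most aligned with that text word's query.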

[0061] After obtaining the initial attention map, conditional denoising can be performed. For example, weighted processing can be applied to the rows corresponding to each text word in the initial attention map based on their relevance scores, or different retention strategies can be set based on word type (such as distinguishing between question keywords and stop phrases in knowledge), thereby weakening the noise response caused by low-relevance or redundant words. The denoised attention rows corresponding to each text word are then fused column by column in the image patch dimension (e.g., by taking the maximum value, weighted summation, or logical combination) to generate an image patch-level mask. This image patch-level mask is used to identify key regions in the image that are consistent with multimodal semantics. Based on the image patch-level mask, operations such as retention, suppression, or weighted reconstruction are performed on the corresponding image patches in the initial image to obtain the target image.

[0062] Optionally, the contribution degrees of different terms in the question text and the knowledge text to visual localization may vary. An adaptive term weight adjustment mechanism can be used to scale the correlation scores between each text token in the initial attention map and the corresponding image patch to enhance the contribution of the keywords. Exemplarily, in the dimension of text tokens, a conditional denoising operation is performed on the initial attention map to obtain an image patch-level mask, including: for any text token, generating a normalized weight of the text token based on the weight of the text token within the token group it belongs to and the adaptive factor in the token population; the token group is the question token group or the knowledge token group, and the token population includes the question token group and the knowledge token group; based on the normalized weights of each text token, scaling the correlation scores between each text token in the initial attention map and the corresponding image patch to obtain a target attention map; in the dimension of text tokens, performing a conditional denoising operation on the target attention map to obtain an image patch-level mask.

[0063] In practical applications, for any text token, the average value of the correlation scores of the text token for each image patch can be calculated, that is, the average value of a certain row in the initial attention map. The higher the average value, the more the text token as a whole focuses on the image. In the knowledge token index set, the average values of each knowledge token are normalized to obtain the within-group weights of each knowledge token, so as to compare which knowledge token is more important within the knowledge token group; in the question token index set, the average values of each question token are normalized to obtain the within-group weights of each question token, so as to compare which question token is more important within the question token group. For example, the weights of stop words such as "of" and "according to" will be suppressed through within-group comparison.

[0064] Then, the adaptive factors for question words and knowledge words are calculated at the group level. Specifically, the average values of the knowledge words are summed to obtain the total intra-group relevance strength of the knowledge word group, and the average values of the question words are summed to obtain the total intra-group relevance strength of the question word group. Based on these two totals, the group-level adaptive factors are dynamically calculated through normalization, yielding one adaptive factor for the knowledge word group and one for the question word group, the sum of which is 1. The normalization process takes the total intra-group relevance strengths of the two word groups as inputs and, through exponential mapping and proportional allocation, ensures that the word group with the higher relevance strength receives the larger adaptive factor. In this way, the overall influence of the two groups is dynamically allocated according to their overall relevance strengths, achieving group balance and preventing long knowledge paragraphs from overshadowing the question. For each text word, the normalized weight is obtained by multiplying the adaptive factor of its group by its intra-group weight. By scaling the relevance scores of each row of the initial attention map with the normalized weights, a calibrated target attention map is obtained, thereby enhancing the influence of important words, weakening the interference of minor words, and making the attention map more focused on the visual regions that truly support question-answering reasoning.
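The two-level weighting described in the preceding paragraphs can be sketched as follows. Several choices here are assumptions: within-group weights are normalized by dividing by the group sum, and the "exponential mapping and proportional allocation" is read as a softmax over the two group totals. The sketch also uses unnormalized relevance scores, because if every row were softmax-normalized over patches, all row means would equal 1/P and the within-group comparison would be uninformative.

```python
import math
from typing import List


def adaptive_weights(attn: List[List[float]], question_idx: List[int],
                     knowledge_idx: List[int]) -> List[float]:
    """Normalized weight per text word = group factor * within-group weight."""
    means = [sum(row) / len(row) for row in attn]  # row-wise mean relevance
    groups = (question_idx, knowledge_idx)
    totals = [sum(means[i] for i in group) for group in groups]
    exps = [math.exp(t) for t in totals]           # exponential mapping
    factors = [e / sum(exps) for e in exps]        # group factors sum to 1
    weights = [0.0] * len(attn)
    for group, total, factor in zip(groups, totals, factors):
        for i in group:
            within = means[i] / total if total else 0.0
            weights[i] = factor * within
    return weights


def rescale(attn: List[List[float]], weights: List[float]) -> List[List[float]]:
    """Scale each row of the initial map to obtain the target attention map."""
    return [[w * v for v in row] for row, w in zip(attn, weights)]


attn = [[0.9, 0.1], [0.2, 0.2], [0.4, 0.0], [0.1, 0.1]]  # rows 0-1: question
weights = adaptive_weights(attn, question_idx=[0, 1], knowledge_idx=[2, 3])
target = rescale(attn, weights)
print([round(w, 3) for w in weights])
```

In this toy input the question group has the larger total relevance, so it receives the larger group factor, and the weights of all words together sum to one.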

[0065] When external knowledge fragments are long, the accumulated attention of a large number of knowledge terms may dominate the entire visual attention distribution, causing the keywords of the question itself to be submerged, thus incorrectly suppressing key image regions or retaining irrelevant background. By introducing an adaptive term weight adjustment mechanism based on group perception, the relative importance of each term is first measured within the two semantic groups of knowledge and question to avoid noise interference within the group; then, through contribution balancing at the group level, it is ensured that the question intent is not suppressed by redundant knowledge. In this way, avoiding long text suppressing short text can significantly improve the semantic focusing ability of the attention map, enabling subsequent percentile thresholding and mask synthesis to more accurately identify visual evidence regions that truly support question-answering reasoning.

[0066] Then, the initial attention map or the target attention map is input into the mask processing module. In the mask processing module, thresholding, logical operations, and upsampling are performed to identify semantically salient regions in the image based on the attention map and suppress non-salient regions, thereby achieving visual denoising of the image. For example, thresholds can be set individually for each text word to filter out significantly relevant image patches, and then the masks corresponding to multiple text words can be merged into a single visual attention region mask, representing which regions of the image the entire text is most interested in.

[0067] For example, percentile-based thresholding can be applied word by word. Taking the case where the target attention map is input into the mask processing module as an example, performing the conditional denoising operation on the target attention map in the text word dimension to obtain an image block-level mask includes: for any row of the target attention map, taking the relevance score at a preset percentile of that row's relevance score distribution as the dynamic threshold for that row; binarizing the relevance scores of that row based on the dynamic threshold to obtain the binary word mask corresponding to that row, the length of the binary word mask being equal to the number of image blocks; and performing a logical OR operation on the binary word masks column by column in the image block dimension to obtain the image block-level mask.

[0068] Each row of the target attention map corresponds to one text word. For any text word, the multiple relevance scores corresponding to the row containing that text word in the target attention map form a 1×P attention distribution. The percentile of this distribution is calculated according to the percentile ratio and used as the dynamic threshold for that text word. Image patches in the row containing that text word that are greater than or equal to the dynamic threshold are marked as 1 to indicate high relevance; image patches that are less than the dynamic threshold are marked as 0 to indicate low relevance. This results in a binary vector of length P. For example, if the attention distribution of a text word is [0.01, 0.05, 0.8, 0.9], and the percentile is set to 75%, the value corresponding to the 75% position is 0.8. Then the dynamic threshold of that text word is 0.8, and the binary word mask after binarization is [0, 0, 1, 1].

[0069] Adaptive threshold calculation and judgment are performed on each text word to obtain a binary word mask for each text word. The binary word masks of the text words are then merged column by column along the image patch dimension using a logical OR operation to obtain an image patch-level mask of length P. Here, if any text word considers an image patch significant, the value of that image patch in the final image patch-level mask is 1. For example, if the binary word mask for text word A is [1,0,0,1,0] and the binary word mask for text word B is [0,0,1,1,0], the final image patch-level mask is [1,0,1,1,0]. In other words, adaptive threshold pruning is applied to the attention distribution of each text word, effectively suppressing low-response noise in the target attention map for each text word while retaining the local visual regions it attends to most, thus achieving explicit suppression of irrelevant visual content.
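The word-by-word percentile thresholding and the logical OR fusion can be reproduced with the numbers used in the two examples above. The percentile convention is an assumption: the sketch takes the sorted value at the floored position with no interpolation, which yields the threshold 0.8 stated in the text. A linearly interpolating percentile would give 0.825 for the same row and therefore a different mask.

```python
import math
from typing import List


def percentile_threshold(row: List[float], ratio: float) -> float:
    """Sorted value at the floored percentile position (no interpolation)."""
    ordered = sorted(row)
    return ordered[math.floor(ratio * (len(ordered) - 1))]


def binary_word_mask(row: List[float], ratio: float) -> List[int]:
    """Mark patches whose score reaches the row's own dynamic threshold."""
    threshold = percentile_threshold(row, ratio)
    return [1 if score >= threshold else 0 for score in row]


def patch_level_mask(word_masks: List[List[int]]) -> List[int]:
    """Column-wise logical OR across the binary word masks."""
    return [1 if any(column) else 0 for column in zip(*word_masks)]


row = [0.01, 0.05, 0.8, 0.9]
print(percentile_threshold(row, 0.75))  # 0.8
print(binary_word_mask(row, 0.75))      # [0, 0, 1, 1]

mask_a = [1, 0, 0, 1, 0]
mask_b = [0, 0, 1, 1, 0]
print(patch_level_mask([mask_a, mask_b]))  # [1, 0, 1, 1, 0]
```

Because the threshold is computed per row, a word with uniformly low scores still contributes its relatively strongest patches, which is the intended effect of a dynamic threshold.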

[0070] The image block-level mask is reshaped into a 2D grid, and then upsampled to the initial image resolution using methods such as bilinear interpolation. During visualization, regions with a mask value of 0 (non-salient regions) are replaced with a white background, while regions with a mask value of 1 (salient regions) are retained, thus generating a denoised attention visualization map, i.e., the target image. The target image clearly shows which regions in the initial image are semantically significant and relevant to the current multimodal question answering task after fusing the question text and initial knowledge fragments. It effectively filters out irrelevant noise interference, retains fine-grained key objects, and provides a cleaner and more discriminative visual input for downstream tasks, enabling subsequent question answering generation models to automatically focus on key visual content.
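The reshaping, upsampling, and white-background replacement can be sketched as follows. Note the substitution: nearest-neighbour upsampling is used here in place of the bilinear interpolation mentioned above, to keep the example short and dependency-free. The grayscale image, the white value 255, and all function names are assumptions for illustration.

```python
from typing import List

Grid = List[List[int]]


def mask_to_grid(mask: List[int], cols: int) -> Grid:
    """Reshape a flat patch-level mask into a 2D grid with `cols` columns."""
    return [mask[i:i + cols] for i in range(0, len(mask), cols)]


def upsample_nearest(grid: Grid, scale: int) -> Grid:
    """Repeat every mask cell scale x scale times (nearest neighbour)."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def apply_mask(image: Grid, pixel_mask: Grid, white: int = 255) -> Grid:
    """Keep pixels where the mask is 1; paint the rest white."""
    return [[pixel if keep else white for pixel, keep in zip(img_row, m_row)]
            for img_row, m_row in zip(image, pixel_mask)]


patch_mask = [1, 0, 0, 1]                 # 2 x 2 patches
image = [[7] * 4 for _ in range(4)]       # 4 x 4 image, patch size 2
pixel_mask = upsample_nearest(mask_to_grid(patch_mask, cols=2), scale=2)
target_image = apply_mask(image, pixel_mask)
for row in target_image:
    print(row)
```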

[0071] On the image side, an attention weight distribution is generated through a cross-modal attention mechanism. Adaptive percentile thresholding based on text units is then applied to generate a binary unit mask for each text unit. These masks are then fused into a block-level image mask via a logical OR operation. The initial attention map is transformed into a semantically driven, noise-robust visual saliency mask, achieving image denoising, sparsification, and visualization based on a joint guidance of question and knowledge. The initial knowledge fragments provide external factual evidence upon which the question depends, enabling the multimodal knowledge and visual denoising model to identify visual elements in the initial image that correspond to the facts. This anchors the attention distribution to more accurate visual evidence, significantly improving semantic alignment and fine-grained localization accuracy.

[0072] On the text side, the multimodal knowledge and visual denoising model determines whether each knowledge fragment is relevant to the current question, identifies which words or phrases truly carry useful information, and performs weighted compression accordingly to remove redundant or ambiguous expressions, thereby selecting phrases relevant to the question as high-precision text prompts. For example, the third text vector sequence is input into the text attention layer, a multi-head attention mechanism is used to generate a question-knowledge relevance matrix, and target knowledge fragments are selected from the initial knowledge fragments based on this matrix. This process includes: inputting the third text vector sequence into the text attention layer for multi-head attention calculation to obtain, for each attention head of the text attention layer, the attention weight matrix of question words to knowledge words; determining the importance weight of each attention head based on its sensitivity signal for the question-answering task; weighting and aggregating the attention weight matrices of the attention heads based on their importance weights to generate the question-knowledge relevance matrix, which represents the comprehensive relevance between question words and knowledge words; calculating, based on the question-knowledge relevance matrix, the average attention that the question words pay to each knowledge word to obtain the relevance score of each knowledge word to the overall semantics of the question; and selecting target knowledge fragments related to the question semantics from the initial knowledge fragments based on these relevance scores.

[0073] The text attention layer employs a multi-layered stacked self-attention structure. Each attention sublayer contains multiple parallel attention heads used to model the contextual dependencies between words within the input text sequence. Each attention head independently calculates the relevance weights between words, and the outputs of multiple attention heads are concatenated before being input into the next layer. Through this multi-layered self-attention mechanism, the text attention layer enables full interaction between question semantics and knowledge semantics, allowing each knowledge word to perceive the question's intent, while the question can dynamically focus on key information within the knowledge, thus achieving bidirectional semantic alignment.

[0074] In each multi-head self-attention sublayer, the self-attention probability distribution among all text words in the input third text vector sequence is calculated. For example, for any attention head, the attention that each text word in the third text vector sequence, acting as the query, pays to every text word (including itself) is calculated. These attention values are normalized to form a square matrix called the attention weight matrix of that attention head. The rows of the matrix represent the query words, i.e., the words that actively attend to other words; the columns of the matrix represent the queried words, i.e., the words that are attended to.

[0075] To quantify the support strength of each knowledge term for the question, this embodiment constructs an interaction matrix between the question and the knowledge by modulating the self-attention probabilities with sensitivity signals and aggregating across attention heads. Different attention heads focus on different types of semantic relationships, and each attention head is assigned an importance weight. In one implementation, the importance weight of each attention head is assigned based on its sensitivity signal for the question-answering task; the sensitivity signal measures the effectiveness of the attention head in distinguishing between relevant and irrelevant knowledge. For example, the gradient magnitude corresponding to the attention head is used as the sensitivity signal of that attention head. In another implementation, a set of learnable scalar parameters is introduced, each corresponding to one attention head; during the knowledge selection process, these parameters are used to weight and fuse the attention heads, and they are automatically optimized during end-to-end training so that attention heads with strong relevance receive higher weights.

[0076] From each attention weight matrix, the attention weight matrix of question terms to knowledge terms is extracted. This sub-matrix contains the attention weights for which a question term is the query and a knowledge term is the queried object: its rows are the rows of the question terms, and its columns are the columns of the knowledge terms. The attention weight of a question term to a knowledge term directly reflects how important the question term considers that knowledge term to be in the current context.

[0077] The attention weight matrices of question terms to knowledge terms extracted from each attention head are weighted and aggregated according to the respective importance weights to generate a question-knowledge relevance matrix. Each row in the question-knowledge relevance matrix corresponds to a question term, each column corresponds to a knowledge term, and each entry represents the degree of attention a question term pays to a knowledge term, that is, the strength of their relevance. For any given knowledge term, all values in its corresponding column in the question-knowledge relevance matrix are averaged, i.e., the average strength of the attention paid to that knowledge term by the question terms is calculated, thus obtaining the relevance score of that knowledge term to the overall semantics of the current question. The relevance score reflects the degree of semantic association between the knowledge term and the question; the higher the score, the more likely the knowledge term contains the key information needed to answer the question.
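The head-weighted aggregation and the column averaging can be sketched with two attention heads. The sketch assumes the question-to-knowledge blocks have already been extracted from each head's attention weight matrix, and it normalizes the head importance weights by their sum, which is an assumption rather than something the text prescribes.

```python
from typing import List

Matrix = List[List[float]]


def relevance_matrix(head_blocks: List[Matrix],
                     head_weights: List[float]) -> Matrix:
    """Importance-weighted average of per-head question-to-knowledge blocks."""
    total = sum(head_weights)
    rows, cols = len(head_blocks[0]), len(head_blocks[0][0])
    return [[sum(w * block[r][c] for w, block in zip(head_weights, head_blocks))
             / total for c in range(cols)] for r in range(rows)]


def knowledge_scores(relevance: Matrix) -> List[float]:
    """Average each column: mean attention the question words pay to a term."""
    return [sum(column) / len(relevance) for column in zip(*relevance)]


# Two heads, 2 question words (rows) x 3 knowledge words (columns).
head_0 = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
head_1 = [[0.2, 0.2, 0.6], [0.2, 0.2, 0.6]]
relevance = relevance_matrix([head_0, head_1], head_weights=[3.0, 1.0])
scores = knowledge_scores(relevance)
print([round(s, 3) for s in scores])  # [0.575, 0.2, 0.225]
```

With the first head weighted three times as heavily as the second, the first knowledge word ends up with the highest relevance score.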

[0078] Based on the relevance scores of each knowledge term to the overall semantics of the question, a selection process is performed according to a target strategy. This target strategy includes, but is not limited to, retaining knowledge terms with scores greater than or equal to a set threshold and removing those with scores below the threshold; or retaining a number of top-ranking knowledge terms and removing the lower-ranking ones. The selected knowledge terms are extracted in their original order within the third text vector sequence to form a new, shorter knowledge vector subsequence, which is then converted back into natural-language target knowledge fragments via the inverse transformation of the word segmenter.
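Both selection strategies can be sketched in a few lines. The function names and the sample scores are illustrative assumptions. In both cases the selected indices are returned in ascending order so that the original word order is preserved.

```python
from typing import List


def select_by_threshold(scores: List[float], threshold: float) -> List[int]:
    """Keep indices whose score is greater than or equal to the threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Keep the k highest-scoring indices, restored to original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


scores = [0.575, 0.2, 0.225, 0.6]
print(select_by_threshold(scores, 0.5))  # [0, 3]
print(select_top_k(scores, 3))           # [0, 2, 3]
```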

[0079] Since a selected semantic unit may have been segmented into multiple lexical units, adjacent or overlapping intervals can be merged into a single larger interval to form readable phrases. For example, for the selected lexical units, the offset mapping information recorded during the encoding stage is used to obtain the start and end positions of each selected lexical unit in the initial knowledge fragment, thus obtaining a set of discrete character intervals. Since a single semantic unit may be segmented into multiple consecutive lexical units by the segmenter (e.g., the term for "disc brake" being split into four units glossed as disc, style, brake, car), and different high-scoring lexical units may be adjacent or partially overlapping in the original text, the character intervals are merged as follows. First, all character intervals are sorted in ascending order by their start positions. Then, the sorted interval list is traversed, and if the gap between the current interval and the previous interval does not exceed a preset number of characters, the two are merged into a larger contiguous interval. Finally, based on the merged character intervals, the corresponding substrings are extracted directly from the original knowledge text to form several coherent and readable natural language fragments. These natural language fragments constitute the final target knowledge fragment, used for subsequent answer generation, visualization highlighting, or user explanation.
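The interval merging described above can be sketched as follows. The end-exclusive offsets, the sample sentence, and the default gap of one character are assumptions chosen for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) with end exclusive


def merge_intervals(intervals: List[Interval], max_gap: int) -> List[Interval]:
    """Merge intervals that overlap or lie within max_gap characters."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):        # ascending by start position
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous span
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


def extract_fragments(text: str, intervals: List[Interval],
                      max_gap: int = 1) -> List[str]:
    """Cut the merged character spans out of the original knowledge text."""
    return [text[start:end] for start, end in merge_intervals(intervals, max_gap)]


text = "disc brakes use calipers and rotors"
selected = [(16, 24), (0, 4), (5, 11)]  # offsets of the selected tokens
print(merge_intervals(selected, max_gap=1))  # [(0, 11), (16, 24)]
print(extract_fragments(text, selected))     # ['disc brakes', 'calipers']
```

Extracting substrings from the original text, rather than joining decoded tokens, keeps the original spacing and punctuation inside each fragment.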

[0080] For ease of explanation, the foregoing embodiments describe text processing and visual processing as logically separate modules. However, those skilled in the art will understand that the aforementioned visual attention layer and text attention layer can also be integrated and implemented within a single multimodal Transformer encoder. For example, knowledge filtering can be performed at a shallow layer; and attention maps for image denoising can be generated at a deeper layer.

[0081] The aforementioned multimodal knowledge and visual denoising approach uses the question representation as the main query and combines image and knowledge representations to construct a cross-modal attention mechanism. Attention weights characterize the relevance of each knowledge fragment and each image region to the current question. The attention weights are normalized and thresholded, treating text units and visual regions with low weights and limited contribution to the final answer as noise, and explicitly reducing or removing them during aggregation. This achieves two main goals: firstly, intra-sentence denoising, condensing lengthy or ambiguous knowledge text into the most relevant semantic essence; and secondly, intra-image denoising, accurately extracting more relevant visual evidence from a large number of irrelevant or background image patches. All multimodal data is finely compressed and preserved, providing rich, accurate, and low-redundancy multimodal context for subsequent answer generation.

[0082] The question-and-answer generation process that incorporates implicit knowledge is explained below.

[0083] In an exemplary embodiment, a first prompt word is generated based on the question text and the target knowledge fragment. The first prompt word is used to guide the relational reasoning model to output analytical text representing the correlation of multimodal information. The first prompt word and the target image are input into the relational reasoning model, and a first multimodal sequence is generated based on the first prompt word and the target image. Using the first multimodal sequence as a third guiding condition, cross-modal joint reasoning is performed on the question text, the target image, and the target knowledge fragment to obtain implicit knowledge. The implicit knowledge describes the correlation between the question text, the target image, and the target knowledge fragment in the form of natural language. The aforementioned step 13 can be implemented as follows: the question text, the initial image, the target image, the initial knowledge fragment, the target knowledge fragment, and the implicit knowledge are input into the question-answering generation model. The target image and the target knowledge fragment are used as factual basis, the initial image and the initial knowledge fragment are used as contextual auxiliary information, and the implicit knowledge is used as supplementary basis to generate a question and answer to obtain the answer information.

[0084] The relational reasoning model has the following capabilities: it can parse the input first prompt word and target image, transform the reasoning instructions carried by the first prompt word into generation constraints, and on this basis, perform deep understanding and cross-modal joint reasoning on text content and image content to generate non-responsive intermediate reasoning content, i.e. implicit knowledge, expressed in natural language.

[0085] The first prompt word is an instructional text composed of the question text and the target knowledge fragment according to a preset template. The first prompt word and the target image are input into a relational reasoning model. The model converts the target image into a representation sequence within the same semantic space as the text, forming a unified multimodal sequence with the question text and knowledge fragment in the first prompt word. Based on this multimodal sequence, the relational reasoning model dynamically establishes cross-modal associations, gradually generating implicit knowledge. This implicit knowledge indicates the logical connection between the question intent, image evidence, and knowledge facts, including but not limited to: the identification result of the potential intent behind the user's question; the semantic description of the visual evidence related to the question in the target image; the judgment of the support or contradiction between the factual statements in the target knowledge fragment and the question's concerns; and the comprehensive analysis conclusions on the logical consistency, evidence completeness, or information missing status among the question text, target image, and target knowledge fragment.

[0086] Optionally, a prompt word template corresponding to the relational reasoning model can be pre-designed. For ease of description and differentiation, this template is referred to as the first prompt word template. The first prompt word template includes first task description information, which describes the role played by the relational reasoning model, the functions it needs to perform, and the constraints that must be met to perform those functions. The first prompt word template also includes a region to be filled. The question text and target knowledge fragment are filled into this region to obtain the first prompt word. The first prompt word includes the first task description information, the question text, and the target knowledge fragment.

[0087] The first task description information can suppress the direct response behavior of the relational reasoning model during the reasoning process. For example, it includes, but is not limited to, the following descriptions: "You are the reasoning engine of an e-commerce intelligent customer service. Please do not answer the user directly, but generate an intermediate reasoning content for subsequent answers", "Point out the true intention behind the question", "Compare the image content with the knowledge fragment to see if there is consistency, complementarity or conflict".
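Filling the first prompt word template can be sketched as below. The task description sentences are the examples quoted in [0087]; the field labels ("Question:", "Knowledge:"), the bullet layout, and the names `FIRST_PROMPT_TEMPLATE` and `build_first_prompt` are assumptions introduced for illustration, since the text does not fix the template layout.

```python
from typing import List

# Hypothetical template: task description first, then the region to be filled.
FIRST_PROMPT_TEMPLATE = (
    "You are the reasoning engine of an e-commerce intelligent customer service. "
    "Please do not answer the user directly, but generate an intermediate "
    "reasoning content for subsequent answers.\n"
    "Point out the true intention behind the question.\n"
    "Compare the image content with the knowledge fragment to see if there is "
    "consistency, complementarity or conflict.\n"
    "Question: {question}\n"
    "Knowledge:\n{knowledge}\n"
)


def build_first_prompt(question: str, target_fragments: List[str]) -> str:
    """Fill the question text and target knowledge fragments into the template."""
    knowledge = "\n".join(f"- {fragment}" for fragment in target_fragments)
    return FIRST_PROMPT_TEMPLATE.format(question=question, knowledge=knowledge)


first_prompt = build_first_prompt(
    "Can this shirt be machine washed?",
    ["Hand washing or machine washing on a gentle cycle is recommended."],
)
print(first_prompt)
```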

[0088] In the embodiments of this application, the specific implementation of the relational reasoning model is not limited; for example, it can be a multimodal large language model (MLLM). The architecture and processing flow of the relational reasoning model are illustrated below by example with reference to Figure 4.

[0089] As shown in Figure 4, the relational reasoning model includes a fifth text encoder, a fifth visual encoder, and a decoder. The target image is input into the fifth visual encoder for patch embedding and positional encoding, and after multi-layer self-attention transformation, a fifth visual vector is output. The first prompt word is input into the fifth text encoder, and after word segmentation and encoding, a sixth text vector is output. The fifth visual vector and the sixth text vector are concatenated along the sequence dimension to form the first multimodal sequence.

[0090] The first multimodal sequence is input into the decoder of the relational reasoning model for forward inference. The decoder consists of multiple cascaded Transformer decoding layers, each of which sequentially includes MMHA (Masked Multi-Head Self-Attention), residual connections and layer normalization, and a feedforward neural network.

[0091] The masked multi-head self-attention mechanism applies a causal mask to the hidden state sequence of the current layer input, ensuring that each position can only focus on itself and its preceding positions. This mechanism includes multiple parallel attention heads, each independently computing the query, key, and value vectors, and aggregating contextual information through a scaled dot product attention function. Since visual and text vectors share the same input sequence and embedding space, this self-attention mechanism supports cross-modal interaction, allowing any text position to dynamically associate with relevant visual regions when calculating attention weights. Residual connections and layer normalization add the output of the masked multi-head self-attention mechanism to the input hidden state, forming a residual connection, followed by layer normalization to stabilize gradient propagation and accelerate inference convergence. The feedforward neural network, composed of fully connected layers and intermediate activation functions, independently performs nonlinear feature transformations on the hidden states at each position in the decoding layer to enhance the model's local expressive power. The output of the feedforward neural network is again processed through residual connections and layer normalization to obtain the final output of the decoding layer.
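The effect of the causal mask in [0091] can be shown with a single attention head in plain Python. This sketch deliberately omits the multiple heads, the learned query/key/value projections, the residual connections, layer normalization, and the feedforward network; the function name `causal_self_attention` and the toy vectors are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def causal_self_attention(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention in which position i sees only positions 0..i."""
    dim = len(q[0])
    outputs: List[Vector] = []
    for i in range(len(q)):
        # Causal mask: later positions (j > i) are never scored at all.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return outputs


# Toy sequence: position 0 could stand for a visual token and position 1 for a text token.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = causal_self_attention(tokens, tokens, tokens)
print(attended)
```

Because visual and text vectors sit in the same sequence, a later text position attends to earlier visual positions through the same computation, which is the cross-modal interaction the paragraph describes.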

[0092] The above process is executed layer by layer in multiple decoding layers, refining high-level semantic representations step by step to obtain a context-aware hidden state sequence. Subsequently, an autoregressive decoding process is initiated based on the hidden state sequence: in each step, based on the currently generated prefix and the original multimodal context, the probability distribution of the next output lexical unit is calculated, and a new lexical unit is selected according to a preset sampling strategy, until a terminator is generated or the maximum length limit is reached. The complete generated sequence of lexical units is then restored to natural language text, thereby obtaining the implicit knowledge.

[0093] The role of implicit knowledge is twofold. On the one hand, it further compresses and filters the target knowledge fragments semantically, refining them into natural language reasoning text that focuses on the core of the question, thereby improving the accuracy of subsequent reasoning and suppressing the propagation of errors. On the other hand, it integrates the explicit guidance information provided by the target image with the textual semantics from the question text and the target knowledge fragments to generate a logically coherent and semantically clear intermediate expression, which reveals the logical connection, evidence consistency, or information gap between the question, image, and knowledge, making it easier for the downstream answer generation module to understand and utilize.
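The autoregressive loop in [0092] can be sketched with a toy next-unit distribution. The "preset sampling strategy" is not specified in the text, so greedy selection is assumed here; the lookup table `TOY_MODEL`, the terminator string `<eos>`, and the function names are likewise hypothetical stand-ins for the decoder.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the decoder: maps the last unit of the prefix
# to a probability distribution over the next lexical unit.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<context>": {"machine": 0.7, "hand": 0.3},
    "machine": {"washable": 0.9, "<eos>": 0.1},
    "washable": {"<eos>": 0.8, "machine": 0.2},
}


def toy_next_unit_probs(prefix: List[str]) -> Dict[str, float]:
    return TOY_MODEL[prefix[-1]]


def greedy_decode(next_unit_probs: Callable[[List[str]], Dict[str, float]],
                  context: List[str], eos: str, max_len: int) -> List[str]:
    """Generate lexical units one at a time until the terminator or the length limit."""
    generated: List[str] = []
    while len(generated) < max_len:
        probs = next_unit_probs(context + generated)
        unit = max(probs, key=probs.get)  # greedy choice; an assumed sampling strategy
        if unit == eos:
            break
        generated.append(unit)
    return generated


units = greedy_decode(toy_next_unit_probs, ["<context>"], eos="<eos>", max_len=10)
print(" ".join(units))
```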

[0094] Once the implicit knowledge is obtained, it can be input into the question-answering generation model along with the other information. In an exemplary embodiment, the question text, the initial image, the target image, the initial knowledge fragment, the target knowledge fragment, and the implicit knowledge are input into the question-answering generation model. The model uses the target image and target knowledge fragment as factual basis, the initial image and initial knowledge fragment as contextual auxiliary information, and the implicit knowledge as supplementary basis to generate the answer information. The process includes: Step R1, generating a second prompt word based on the question text, target knowledge fragment, initial knowledge fragment, and implicit knowledge, whereby the second prompt word defines the reasoning context of the question-answering generation model in text form; Step R2, inputting the second prompt word, initial image, and target image into the question-answering generation model; performing semantic extraction on the second prompt word, initial image, and target image respectively to obtain question semantics, visual semantics, and knowledge semantics; generating a second multimodal sequence based on the second prompt word, initial image, and target image; and, using the second multimodal sequence as a fourth guiding condition, the target image and target knowledge fragment as factual basis, the initial image and initial knowledge fragment as contextual auxiliary information, and the implicit knowledge as supplementary basis, performing cross-modal alignment on the question semantics, visual semantics, and knowledge semantics, and generating an answer based on the alignment result.

[0095] In this embodiment, the question-answering generation model has the following capabilities: it can perform fine-grained semantic analysis on the input initial image and target image, and introduce target knowledge fragments as authoritative priors and initial knowledge fragments as contextual auxiliary information; on this basis, it integrates the question text, initial image, target image, initial knowledge fragments, target knowledge fragments, and implicit knowledge into a second multimodal sequence as a guiding condition for joint reasoning; then, based on the second multimodal sequence, it performs deep semantic understanding and cross-modal alignment on the target object in the image under the constraints of explicit and implicit knowledge, and generates a natural language answer that is consistent with the image content and product knowledge and has context awareness, thereby achieving highly accurate and highly interpretable knowledge-enhanced visual question answering.

[0096] The second prompt word is composed of the question text, initial knowledge fragments, target knowledge fragments, and implicit knowledge, concatenated in a specific order. The knowledge fragments are wrapped using the format tags `<knowledge>` and `</knowledge>` and inserted into the second prompt word to transform them into conditional prompts understandable by the language model. To distinguish between different knowledge sources, the initial knowledge fragment can be tagged with `<|initial_knowledge|>`, and the target knowledge fragment with `<|target_knowledge|>`. This allows the question-answering generation model to explicitly identify the target knowledge fragment during reasoning and prioritize its use during generation, improving the accuracy and compliance of the answers.

[0097] Optionally, a prompt word template corresponding to the question-answering generation model can be pre-designed. For ease of description and differentiation, this is called the second prompt word template. This second prompt word template includes a second task description, which describes the role of the question-answering generation model, the functions it needs to perform, and the constraints that must be met to perform those functions. The second prompt word template also includes a region to be filled, into which the question text, initial knowledge fragments, target knowledge fragments, and implicit knowledge are filled to obtain the second prompt word. For example, the second task description could be, "You are a professional e-commerce customer service assistant. Please answer the questions strictly based on the official product information and the images provided by the user."
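Assembling the second prompt word from the tags in [0096] and the template in [0097] might look like the sketch below. The text states only that the parts are concatenated "in a specific order", so the order used here, the way the source tag is placed in front of the `<knowledge>` wrapper, the labels "Implicit knowledge:" and "Question:", and the function names are all assumptions.

```python
from typing import List

# Example task description quoted from the text.
SECOND_TASK_DESCRIPTION = (
    "You are a professional e-commerce customer service assistant. Please answer "
    "the questions strictly based on the official product information and the "
    "images provided by the user."
)


def wrap_knowledge(source_tag: str, fragments: List[str]) -> str:
    """Mark the knowledge source, then wrap the fragments in the <knowledge> tags."""
    body = "\n".join(fragments)
    return f"{source_tag}<knowledge>{body}</knowledge>"


def build_second_prompt(question: str, initial_fragments: List[str],
                        target_fragments: List[str], implicit_knowledge: str) -> str:
    """Concatenate the parts in one assumed order: target knowledge before initial knowledge."""
    parts = [
        SECOND_TASK_DESCRIPTION,
        wrap_knowledge("<|target_knowledge|>", target_fragments),
        wrap_knowledge("<|initial_knowledge|>", initial_fragments),
        f"Implicit knowledge: {implicit_knowledge}",
        f"Question: {question}",
    ]
    return "\n".join(parts)


second_prompt = build_second_prompt(
    question="Can this shirt be machine washed?",
    initial_fragments=["This store supports 7-day no-reason returns and exchanges."],
    target_fragments=["Hand washing or machine washing on a gentle cycle is recommended."],
    implicit_knowledge="The care label and the knowledge fragment both allow machine washing.",
)
print(second_prompt)
```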

[0098] After the second prompt word, initial image, and target image are input into the question-answering generation model, the model encodes the prompt word as textual semantic features and the initial and target images as visual semantic features. These visual semantic features are then concatenated with the textual features to form the second multimodal sequence. Guided by this second multimodal sequence, the model dynamically focuses on the question keywords, relevant regions in the image, and corresponding statements in explicit and implicit knowledge. Through its internal cross-modal attention mechanism, it aligns the semantic relationships between the user's question, image content, and knowledge content to generate an answer. This answer not only conforms to visual observation but also strictly adheres to official product information, thus achieving highly accurate and compliant intelligent customer service responses.

[0099] This application does not limit the specific implementation of the question-answering generation model. In an exemplary embodiment, step R2 in the aforementioned embodiment can be implemented as follows: encoding the initial image using a second visual encoder to obtain a second visual vector; encoding the target image using a third visual encoder to obtain a third visual vector; encoding the second prompt word using a third text encoder to obtain a fourth text vector; projecting the second visual vector and the third visual vector into the embedding space of the language model of the question-answering generation model; concatenating the projected second visual vector and third visual vector with the fourth text vector in that embedding space to generate a second multimodal sequence; and using the language model, with the second multimodal sequence as the fourth guiding condition, using the target image and target knowledge fragment as factual basis to determine the answer backbone, using the initial image and initial knowledge fragment as contextual auxiliary information to provide scene context support, and using the implicit knowledge as supplementary basis to enhance semantic coherence, to perform cross-modal alignment of question semantics, visual semantics, and knowledge semantics and generate answer information based on the alignment results.

[0100] The architecture and processing flow of the question-answering generation model are illustrated below by example with reference to Figure 5.

[0101] As shown in Figure 5, the question-answering generation model includes a second visual encoder, a third visual encoder, a multimodal alignment module, and a language model.

[0102] A visual encoder is used to convert an input image into a representation that can be understood and processed by a machine, namely a feature map or feature vector. This application does not limit the specific implementation of the second and third visual encoders.

[0103] For example, the second visual encoder divides the initial image into fixed-size, non-overlapping image patches, each of which is linearly projected into a multi-dimensional vector. Each visual vector contains semantic information of the local image. Subsequently, the positional encoding module of the second visual encoder adds learnable two-dimensional positional encodings to the multiple image patch vectors to preserve the original spatial structure information. The resulting sequence of positional visual vectors is input into a backbone network consisting of multiple stacked Transformer encoders, each layer containing a multi-head self-attention mechanism and a feedforward neural network. Through the multi-head self-attention mechanism and the feedforward neural network, the local details and global semantics of the image can be gradually extracted, ultimately outputting the second visual vector of the initial image.
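The first stage of the visual encoder in [0103], dividing an image into fixed-size, non-overlapping patches while keeping their grid positions, can be sketched as follows. The linear projection, the learnable positional encodings, and the Transformer backbone are omitted; the function name `split_into_patches`, the single-channel 4x4 toy image, and the patch size are assumptions for illustration.

```python
from typing import List, Tuple

Patch = Tuple[Tuple[int, int], List[float]]  # ((grid_row, grid_col), flattened pixels)


def split_into_patches(image: List[List[float]], patch: int) -> List[Patch]:
    """Split a single-channel image into non-overlapping patch x patch blocks."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches: List[Patch] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            pixels = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            # The grid position is kept so a 2D positional encoding can be added later.
            patches.append(((top // patch, left // patch), pixels))
    return patches


toy_image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(toy_image, patch=2)
print(len(patches), patches[0])
```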

[0104] The processing flow of the third visual encoder is similar to that of the second visual encoder, although there may be differences in depth or parameters, which will not be elaborated here. After processing by the third visual encoder, the third visual vector of the target image is output.

[0105] In implementations supporting multiple image inputs, if multiple initial images are input, each initial image is encoded independently, and the multiple visual vector sequences are concatenated in the input order to form a second visual vector. Similarly, after multiple target images are encoded independently, the multiple visual vector sequences are concatenated in the input order to form a third visual vector.

[0106] The multimodal alignment module is the core bridge connecting the visual and language modalities, responsible for aligning the visual features output by the visual encoder with the text semantic space of the language model. For example, the multimodal alignment module may include a learnable linear projection layer whose input dimension matches the output dimension of the visual encoder. The projection layer transforms the second and third visual vectors into vectors with the same embedding dimension as the language model through a learnable linear (or non-linear) mapping. This process is equivalent to translating visual information into language that the language model can understand, thereby enabling visual information to participate in the reasoning process of the language model.

[0107] The language model is responsible for outputting answer information that conforms to a predefined format. The language model may include a predefined tokenizer (the third text encoder), an embedding layer, a positional encoding module, and a self-attention layer. The second prompt word is segmented into a sequence of tokens by the language model's tokenizer and converted into corresponding token encodings. These token encodings are fed into the language model's embedding layer, which maps them into continuous high-dimensional vectors, resulting in the fourth text vector. This transforms the second prompt word into a numerical semantic representation that the language model can understand and process.

[0108] In the embedding space of the language model, the second and third visual vectors are concatenated with the fourth text vector along the sequence dimension to form a second multimodal sequence. The positional encoding module of the language model adds positional information to the concatenated second multimodal sequence, ensuring that the position of each token (whether from text or image) in the sequence can be perceived. In this second multimodal sequence, target knowledge fragments are explicitly marked in a recognizable manner, enabling the question-answering generation model to distinguish them from ordinary dialogue text during reasoning and assign them higher semantic weights.
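The projection of [0106] followed by the concatenation of [0108] can be sketched in plain Python. A purely linear projection shared by both visual vectors, the ordering (initial-image tokens, then target-image tokens, then text tokens), the toy dimensions, and the names `project` and `build_multimodal_sequence` are assumptions; the text leaves these choices open.

```python
from typing import List

Vector = List[float]


def project(vectors: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Linear projection: weight has one row per output dimension."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


def build_multimodal_sequence(second_visual: List[Vector], third_visual: List[Vector],
                              fourth_text: List[Vector], weight: List[Vector],
                              bias: Vector) -> List[Vector]:
    """Project both visual vectors into the text embedding space, then concatenate
    along the sequence dimension (assumed order: initial image, target image, text)."""
    return project(second_visual, weight, bias) + project(third_visual, weight, bias) + fourth_text


# Toy sizes: visual dimension 3, language-model embedding dimension 2.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
second_visual = [[1.0, 2.0, 3.0]]                    # one token from the initial image
third_visual = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]    # two tokens from the target image
fourth_text = [[9.0, 9.0]]                           # one text token
sequence = build_multimodal_sequence(second_visual, third_visual, fourth_text, weight, bias)
print(sequence)
```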

[0109] The self-attention layer of the language model consists of multiple stacked Transformer decoders, each containing a multi-head self-attention mechanism and a feedforward neural network. The second multimodal sequence is input into the language model's self-attention layer, and the self-attention mechanism dynamically calculates the dependencies between positions within the second multimodal sequence, achieving deep fusion of visual information and linguistic context in each layer. During attention calculation, tokens from the image and tokens from the text participate together in attention weight allocation, forming a unified cross-modal representation space and thus supporting fine-grained alignment of question semantics, visual semantics, and knowledge semantics. Specifically, keywords in the question semantics are used as queries that, within the attention mechanism, retrieve the corresponding parameter descriptions from the knowledge semantics and the features of the corresponding region of the target image (such as the shape of a charging port) from the visual semantics. These three elements share attention weights to achieve semantic consistency verification, ensuring that the answer conforms to linguistic logic, visual evidence, and structured knowledge.

[0110] During this process, because the model has been exposed to a large number of authoritative information samples wrapped with the tag <|target_knowledge|> during the instruction fine-tuning phase, its self-attention mechanism has learned to assign higher attention weights to tokens within this knowledge tag during training. When processing keywords in the question text, the model's multi-head self-attention layer significantly enhances the attention weights on relevant semantic units within the target knowledge fragment, treating the target knowledge fragment as a high-confidence prior condition and dynamically focusing on its internal relevant semantic units for knowledge-driven reasoning. This mechanism enables the model to perform deterministic reasoning guided by knowledge, rather than guessing answers from images or generalized language patterns, using the target knowledge fragment as a structured fact anchor.

[0111] When processing the target knowledge fragment, the regions in the visual tokens of the target image that are relevant to the question are activated in parallel. This enables knowledge-guided image-text joint verification within the same attention calculation, allowing the core of the answer to be determined based on the target knowledge fragment and the target image. If the target knowledge fragment and the target image corroborate each other, a high-confidence affirmative answer is output; if the image is blurry or missing, the answer is based on the target knowledge fragment, demonstrating the robust design of knowledge enhancement.

[0112] Furthermore, implicit knowledge, as a high-level semantic bridge, further strengthens the logical connection between questions, images, and knowledge. It is given a supplementary but non-dominant weight in the attention mechanism to bridge the semantic gap between explicit knowledge and multimodal perception. It provides a key semantic bridge in scenarios involving multi-hop reasoning, causal explanation, or functional derivation, further improving the coherence and logical integrity of answers in complex reasoning scenarios, and enhancing the interpretability and user comprehension of answers.

[0113] The initial image retains the actual product presentation as seen by the user, primarily providing the scene context from the user's perspective. Its visual tokens typically receive a low but non-zero weight in the attention mechanism, used to generate explanatory statements and improve the user experience relevance of the answer. Although the initial knowledge fragments contain a large amount of weakly relevant information, they are suppressed as a whole in the attention mechanism and used as knowledge context boundaries to support negative reasoning (such as confirming that a certain function is not mentioned by any knowledge entry) or to provide comparative references (such as the current generation supporting functions not supported in the previous generation), thereby enhancing the system's robustness and the comprehensiveness of the answer.

[0114] Throughout the autoregressive generation process, the language model predicts the most likely next token at each step based on the existing multimodal context (the tokens to the left and the current token itself), and iterates until the end token is generated, ultimately outputting a complete, accurate, and interpretable natural language answer. The entire reasoning process uses the target knowledge and the target image as the factual backbone, implicit knowledge as semantic enhancement, and the initial image and initial knowledge as contextual auxiliary information. Knowledge-guided image-text joint verification is achieved within the same attention computation, thus realizing, through knowledge enhancement, a visual question answering system with high accuracy, low hallucination, and strong interpretability.

[0115] In the question-answering generation model's reasoning process, explicit knowledge-guided attention mechanisms suppress irrelevant image regions and encourage the model to focus on key visual cues related to the question. Implicit knowledge, rather than serving as an independent source of fact, participates in multi-head self-attention computation as supplementary evidence, assigned appropriate attention weights to strengthen the logical consistency between the question, image, and knowledge. Especially in complex question-answering scenarios involving functional derivation, technical matching, or cross-modal causal explanations, or when answering questions requires additional common sense / domain prior knowledge or implicit relational reasoning beyond the retrieved fragment, implicit knowledge effectively bridges the gap between explicit knowledge items and visual perception, providing a coherent semantic bridge and thus improving the logical rigor, technical accuracy, and user comprehensibility of the answer. The combination of explicit and implicit knowledge, as complementary sources of answer prediction, can generate more robust answers even in noisy retrieval and complex input conditions.

[0116] For better understanding, a specific scenario example is described below. In an e-commerce scenario, a user asks the intelligent customer service a question through a front-end interface on a terminal device. The terminal device sends the question text and an initial image to the server. For example, the user asks, "Can this shirt be machine washed?" The initial image includes at least one product image (such as a main image, a detail image, or a washing label image). Based on the question text and the initial image, the server retrieves information related to the product from the target knowledge base, obtaining the following initial knowledge fragments: ① This T-shirt is made of 100% pure cotton fabric. Hand washing or machine washing on a gentle cycle is recommended. Avoid prolonged soaking. ② Pure cotton clothing is prone to shrinkage. Please use a neutral detergent and keep the water temperature below 30℃ when washing. ③ This store supports 7-day no-reason returns and exchanges. Please feel free to purchase.

[0117] In the multimodal knowledge and visual denoising stage, initial knowledge fragment ① was deemed highly relevant because it directly mentioned machine washing; initial knowledge fragment ②, while involving washing, did not explicitly answer whether it could be machine washed, thus having moderate relevance; initial knowledge fragment ③ was unrelated to washing and was identified as noise and removed. The initial image contained multiple regions: a front view of the T-shirt, a close-up of the care label (containing the "machine washable" icon), a model wearing the garment, and promotional banners. The close-up of the care label was highly relevant to the question because it contained washing symbols; the front view of the T-shirt, the model image, and the promotional banners, lacking washing information, were deemed low-relevance regions and filtered out. This yielded the target image and target knowledge fragments.

[0118] During the relational reasoning phase, the implicit knowledge obtained can be: the user is actually concerned about whether the product supports machine washing; a close-up area of the care label in the target image shows a machine wash symbol, indicating that the manufacturer allows machine washing; the knowledge fragment indicates that the fabric is 100% cotton and explicitly recommends machine washing on a gentle cycle, while also mentioning water temperature and detergent requirements; the image evidence and knowledge description corroborate each other, jointly supporting the conclusion that the product is machine washable but requires certain conditions, and no information conflict was found. During the question-and-answer generation phase, the model can output "Machine washable, but gentle cycle is required." Furthermore, the server can return the generated answer to the terminal device, allowing the terminal device to display the answer to the user on the front-end interface.

[0119] In practical applications, after the question-and-answer session ends, all information related to the question and answer session can be recorded, including but not limited to: the question text and initial image, the initial knowledge fragments returned by the retrieval, the target knowledge fragments and target images actually retained by the multimodal knowledge and visual denoising model, and the generated answer.
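The session record listed in [0119] could be captured in a simple structure such as the one below. The class name `QASessionRecord`, the field names, the use of image identifiers in place of image data, and the JSON-lines serialization are assumptions made for illustration; the text only enumerates what may be recorded.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class QASessionRecord:
    """One question-and-answer session, kept for offline analysis (hypothetical schema)."""
    question_text: str
    initial_image_ids: List[str]
    initial_fragments: List[str]
    target_fragments: List[str]
    target_image_ids: List[str]
    answer: str


def to_json_line(record: QASessionRecord) -> str:
    """Serialize a record as one JSON line, keeping non-ASCII characters readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = QASessionRecord(
    question_text="Can this shirt be machine washed?",
    initial_image_ids=["main.jpg", "care_label.jpg"],
    initial_fragments=["fragment 1", "fragment 2", "fragment 3"],
    target_fragments=["fragment 1"],
    target_image_ids=["care_label_crop.jpg"],
    answer="Machine washable, but gentle cycle is required.",
)
line = to_json_line(record)
print(line)
```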

[0120] This information can be used for subsequent offline analysis and model optimization, including: evaluating whether the multimodal knowledge and visual denoising model correctly identifies and retains key knowledge, and further optimizing the denoising threshold or training objective; discovering missing or inaccurate content in the knowledge base to assist in knowledge maintenance and updates; and analyzing failure cases to provide a basis for fine-tuning and prompt engineering of the question-answering generation model.

[0121] In summary, the question-answering processing method provided in this application introduces an external knowledge base, which can retrieve explicit knowledge to enhance the professional understanding and response capabilities of the question-answering generation model. It can also promptly align with the latest personalized knowledge and finely manage knowledge sources and evidence paths. Furthermore, it refines the suppression of redundancy and low-relevance information in the search results, transforming the originally lengthy and noisy search text results and the original image features containing significant background interference into a concise multimodal representation specific to the current question. This alleviates the problem of high noise and redundancy in search results without altering the search system, providing high-quality input for subsequent knowledge-enhanced visual question-answering reasoning. In addition, implicit knowledge is introduced to achieve more reliable reasoning. In question-answering scenarios relying on product knowledge and image content, it can significantly improve the accuracy and stability of knowledge-enhanced visual question answering, reduce human intervention, and improve overall service efficiency.

[0122] It should be noted that the execution subject of each step of the method provided in the above embodiments can be the same device, or the method can be executed by different devices. For example, the execution subject of steps 11 to 13 can be device E; or the execution subject of step 11 can be device E, and the execution subject of steps 12 to 13 can be device F, etc.

[0123] In some of the processes described in the above embodiments and accompanying drawings, multiple operations are included that appear in a specific order. However, it should be clearly understood that these operations may not be executed in the order they appear herein, or they may be executed in parallel. The sequence numbers of the operations are merely used to distinguish different operations and do not represent any execution order. Furthermore, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that the descriptions such as "first," "second," etc., in this document are used to distinguish different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit "first" and "second" to different types.

[0124] Figure 6 is a schematic diagram of a question-and-answer processing device provided in an embodiment of this application. As shown in Figure 6, the device includes an acquisition module 61, a denoising module 62, and a question-and-answer generation module 63.

[0125] The acquisition module 61 is used to acquire the question text and the initial image corresponding to the question text, and retrieve the initial knowledge fragment from the target knowledge base based on the question text and the initial image. The initial image includes the target object described by the question text. The denoising module 62 is used to input the question text, the initial image, and the initial knowledge fragment into a multimodal knowledge and visual denoising model. Using the question text as a first guiding condition, the initial knowledge fragment is filtered to obtain a target knowledge fragment. Using the initial knowledge fragment and the question text as a second guiding condition, the initial image is subjected to cross-modal semantic alignment and conditional denoising to obtain a target image. The question-answering generation module 63 is used to input the question text, the initial image, the target image, the initial knowledge fragment, and the target knowledge fragment into a question-answering generation model. Using the target image and the target knowledge fragment as factual basis and the initial image and the initial knowledge fragment as contextual auxiliary information, question-answering is generated to obtain the answer information corresponding to the question text.
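As a non-limiting illustration, the cooperation of the three modules described above could be sketched in Python as follows. The class name `QAPipeline`, the placeholder `Image` type, and the three stub functions are assumptions introduced only for this sketch; they stand in for the knowledge base retrieval, the multimodal knowledge and visual denoising model, and the question-answering generation model, none of which is implemented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]  # placeholder image type (assumption)


@dataclass
class QAPipeline:
    """Chains the three modules; every callable is a hypothetical stand-in."""
    retrieve: Callable[[str, Image], List[str]]  # acquisition module 61
    denoise: Callable[[str, Image, List[str]], Tuple[List[str], Image]]  # denoising module 62
    generate: Callable[[str, Image, Image, List[str], List[str]], str]  # generation module 63

    def answer(self, question: str, initial_image: Image) -> str:
        initial_knowledge = self.retrieve(question, initial_image)
        target_knowledge, target_image = self.denoise(
            question, initial_image, initial_knowledge)
        # Both the initial and the target inputs are passed to the generator.
        return self.generate(question, initial_image, target_image,
                             initial_knowledge, target_knowledge)


def stub_retrieve(question: str, image: Image) -> List[str]:
    # Stand-in for knowledge base retrieval: returns fixed fragments.
    return ["The filter should be replaced every 6 months.",
            "The warranty lasts 2 years."]


def stub_denoise(question: str, image: Image,
                 fragments: List[str]) -> Tuple[List[str], Image]:
    # Stand-in filter: keep fragments containing a longer question word;
    # the image is passed through unchanged.
    words = [w for w in question.lower().split() if len(w) > 3]
    kept = [f for f in fragments if any(w in f.lower() for w in words)]
    return kept, image


def stub_generate(question: str, initial_image: Image, target_image: Image,
                  initial_knowledge: List[str],
                  target_knowledge: List[str]) -> str:
    # Stand-in generator: answers from the target (factual) fragments only.
    return " ".join(target_knowledge)


pipeline = QAPipeline(stub_retrieve, stub_denoise, stub_generate)
answer = pipeline.answer("How often replace the filter", [[0.0]])
```

In this sketch the generator receives all five inputs, mirroring the description, even though the stub only uses the target knowledge fragments.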

[0126] The detailed implementation methods and beneficial effects of each step in this embodiment have been described in detail in the foregoing embodiments, and will not be elaborated here.

[0127] Figure 7 is a schematic diagram of the structure of a computing device provided in an embodiment of this application. As shown in Figure 7, the computing device includes a memory 71 and a processor 72.

[0128] Memory 71 is used to store computer programs and can be configured to store various other data to support operation on the computing platform. Examples of this data include instructions for any application or method operating on the computing platform, data structures, contact data, phone book data, messages, pictures, videos, etc.

[0129] Processor 72, coupled to memory 71, is configured to execute a computer program in memory 71 for: acquiring question text and an initial image corresponding to the question text; retrieving an initial knowledge fragment from a target knowledge base based on the question text and the initial image, the initial image including the target object described by the question text; inputting the question text, the initial image, and the initial knowledge fragment into a multimodal knowledge and visual denoising model; using the question text as a first guiding condition to filter the initial knowledge fragment to obtain a target knowledge fragment; and using the initial knowledge fragment and the question text as a second guiding condition to perform cross-modal semantic alignment and conditional denoising on the initial image to obtain a target image; inputting the question text, the initial image, the target image, the initial knowledge fragment, and the target knowledge fragment into a question-answering generation model; using the target image and the target knowledge fragment as factual basis and the initial image and the initial knowledge fragment as contextual auxiliary information to generate a question-answer to obtain the answer information corresponding to the question text.

[0130] Furthermore, as shown in Figure 7, the computing device also includes other components such as a communication component 73 and a power supply component 74. Figure 7 shows only some components schematically, which does not mean that the computing device includes only the components shown in Figure 7.

[0131] The aforementioned memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0132] The aforementioned communication component is configured to facilitate wired or wireless communication between the device containing the communication component and other devices. The device containing the communication component can access wireless networks based on communication standards, such as 2G, 3G, 4G / LTE, 5G, or combinations thereof. In one exemplary embodiment, the communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.

[0133] The aforementioned power supply components provide power to various components within the device in which they reside. These power supply components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power to the device in which they reside.

[0134] Accordingly, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, enables the processor to implement the steps in the above-described method embodiments. The computer-readable storage medium includes volatile or non-volatile components, or a combination thereof, and can be removable or non-removable. Examples of computer-readable storage media include, but are not limited to, phase-change random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), flash memory or other memory technologies, CD-ROM, digital video disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium.

[0135] Accordingly, this application also provides a computer program product, which includes a computer program or instructions that, when executed by a processor, cause the processor to implement the steps in the above method embodiments. It should be understood that each step or combination of steps in the above method flow can be implemented by the computer program or instructions. Furthermore, these computer programs or instructions can be applied to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device, enabling the processor of the general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to function as an apparatus for implementing the corresponding functions in the above method embodiments.

[0136] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary hardware. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product, or it can be embodied in the process of data migration. The computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.

[0137] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0138] The above are merely embodiments of this application and are not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A question-and-answer processing method, characterized in that it comprises: obtaining a question text and an initial image corresponding to the question text, and retrieving an initial knowledge fragment from a target knowledge base based on the question text and the initial image, wherein the initial image includes the target object described by the question text; inputting the question text, the initial image, and the initial knowledge fragment into a multimodal knowledge and visual denoising model, filtering the initial knowledge fragment using the question text as a first guiding condition to obtain a target knowledge fragment, and performing cross-modal semantic alignment and conditional denoising on the initial image using the initial knowledge fragment and the question text as a second guiding condition to obtain a target image; and inputting the question text, the initial image, the target image, the initial knowledge fragment, and the target knowledge fragment into a question-answering generation model, and performing question-answer generation using the target image and the target knowledge fragment as factual basis and the initial image and the initial knowledge fragment as contextual auxiliary information, to obtain answer information corresponding to the question text.

2. The method according to claim 1, characterized in that inputting the question text, the initial image, and the initial knowledge fragment into the multimodal knowledge and visual denoising model, filtering the initial knowledge fragment using the question text as the first guiding condition to obtain the target knowledge fragment, and performing cross-modal semantic alignment and conditional denoising on the initial image using the initial knowledge fragment and the question text as the second guiding condition to obtain the target image comprises: inputting the question text, the initial image, and the initial knowledge fragment into the multimodal knowledge and visual denoising model, and performing the following steps in the multimodal knowledge and visual denoising model: encoding the question text using a first text encoder to obtain a first text vector sequence, the first text vector sequence including the vectors of each question word in the question text; encoding the initial knowledge fragment using a second text encoder to obtain a second text vector sequence, the second text vector sequence including the vectors of each knowledge word in the initial knowledge fragment; encoding the initial image using a first visual encoder to obtain a first visual vector sequence, the first visual vector sequence including the vectors of each image block in the initial image; concatenating the first text vector sequence and the second text vector sequence to obtain a third text vector sequence; inputting the third text vector sequence into a text attention layer, generating a question knowledge relevance matrix using a multi-head attention mechanism, and selecting the target knowledge fragment from the initial knowledge fragment based on the question knowledge relevance matrix; and inputting the third text vector sequence and the first visual vector sequence into a visual attention layer, and, guided by the third text vector sequence, performing semantic alignment and denoising driven by a cross-modal attention mechanism on the first visual vector sequence to obtain the target image.

3. The method according to claim 2, characterized in that performing, guided by the third text vector sequence, semantic alignment and denoising driven by the cross-modal attention mechanism on the first visual vector sequence to obtain the target image comprises: performing, based on the cross-modal attention mechanism, a cross-modal correlation analysis on the third text vector sequence and the first visual vector sequence; generating an initial attention map based on the result of the cross-modal correlation analysis, the initial attention map representing the correlation between each image block and each text word, the text words including question words and knowledge words; performing a conditional denoising operation on the initial attention map in the text word dimension to obtain an image block-level mask; and performing image filtering on each image block in the initial image based on the image block-level mask to obtain the target image.
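As a non-limiting illustration of the attention map and the mask-based image filtering recited above, a minimal Python sketch follows. It assumes that the correlation is computed as scaled dot-product attention with a softmax over image blocks, and that filtered-out blocks are replaced by zero vectors; neither choice is specified in this application, and the function names `attention_map` and `filter_blocks` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def attention_map(text_vecs: List[Vector], block_vecs: List[Vector]) -> List[List[float]]:
    """Row i holds the correlation of text word i with every image block.

    Assumption: scaled dot products followed by a softmax over the blocks.
    """
    rows = []
    for t in text_vecs:
        logits = [sum(a * b for a, b in zip(t, v)) / math.sqrt(len(t))
                  for v in block_vecs]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def filter_blocks(blocks: List[Vector], mask: List[int]) -> List[Vector]:
    """Keep blocks whose mask bit is 1; zero out the others (assumption)."""
    return [list(b) if keep else [0.0] * len(b) for b, keep in zip(blocks, mask)]


example_map = attention_map([[1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
example_filtered = filter_blocks([[1.0, 2.0], [3.0, 4.0]], [1, 0])
```

The mask passed to `filter_blocks` is taken as given here; how it is derived from the attention map is the subject of the dependent claims.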

4. The method according to claim 3, characterized in that performing the conditional denoising operation on the initial attention map in the text word dimension to obtain the image block-level mask comprises: for any text word, generating a normalized weight based on the weight of the text word within its word group and the adaptive factor of that word group, wherein the word group to which a text word belongs is either the question word group or the knowledge word group, the word groups comprising the question word group and the knowledge word group; scaling, based on the normalized weight of each text word, the correlation scores between that text word and the image blocks in the initial attention map to obtain a target attention map; and performing a conditional denoising operation on the target attention map in the text word dimension to obtain the image block-level mask.
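The claim above does not give a formula for the normalized weight. As a non-limiting illustration, the following Python sketch assumes one possible reading: each word's weight is divided by the total weight of its word group and then multiplied by that group's adaptive factor, and the resulting value scales the word's row of the attention map. The function names, the group labels `"q"` and `"k"`, and the example numbers are all hypothetical.

```python
from typing import Dict, List


def normalized_weights(weights: List[float], groups: List[str],
                       factors: Dict[str, float]) -> List[float]:
    """Assumed form: factor(group) * weight / sum of weights in the group.

    Assumes every group has a positive total weight.
    """
    totals: Dict[str, float] = {}
    for w, g in zip(weights, groups):
        totals[g] = totals.get(g, 0.0) + w
    return [factors[g] * w / totals[g] for w, g in zip(weights, groups)]


def scale_attention(attention: List[List[float]],
                    norm_w: List[float]) -> List[List[float]]:
    """Scale row i (text word i) of the attention map by its normalized weight."""
    return [[nw * s for s in row] for row, nw in zip(attention, norm_w)]


# Two question words ("q") and two knowledge words ("k"); values are illustrative.
example_weights = normalized_weights(
    [1.0, 3.0, 2.0, 2.0], ["q", "q", "k", "k"], {"q": 1.0, "k": 0.5})
example_target_map = scale_attention(
    [[4.0, 8.0], [4.0, 0.0], [8.0, 8.0], [0.0, 4.0]], example_weights)
```

Under this reading, the adaptive factor controls how strongly the question word group and the knowledge word group each contribute to the target attention map.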

5. The method according to claim 4, characterized in that performing the conditional denoising operation on the target attention map in the text word dimension to obtain the image block-level mask comprises: for any row of the target attention map, using, based on the relevance score distribution of that row, the relevance score corresponding to a preset percentile ratio as the dynamic threshold of that row; binarizing the relevance scores of that row based on the dynamic threshold to obtain a binary word mask corresponding to that row, wherein the size of the binary word mask is equal to the number of image blocks; and performing, based on the target attention map, a column-by-column logical OR operation on the binary word masks in the image block dimension to obtain the image block-level mask.
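As a non-limiting illustration of the per-row dynamic threshold and the column-wise logical OR recited above, a minimal Python sketch follows. The nearest-rank percentile rule, the use of "greater than or equal to" when binarizing, and the default ratio of 0.75 are assumptions made for this sketch only.

```python
import math
from typing import List


def percentile_threshold(scores: List[float], ratio: float) -> float:
    """Score at the given percentile ratio (nearest-rank rule; an assumption)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(ratio * len(ordered)))
    return ordered[rank - 1]


def block_mask(attention: List[List[float]], ratio: float = 0.75) -> List[int]:
    """attention[i][j] is the relevance of text word i to image block j."""
    word_masks = []
    for row in attention:
        threshold = percentile_threshold(row, ratio)  # dynamic, per row
        word_masks.append([1 if s >= threshold else 0 for s in row])
    # Column-wise logical OR: a block survives if any word selects it.
    return [int(any(column)) for column in zip(*word_masks)]


example_attention = [
    [0.1, 0.7, 0.2, 0.0],
    [0.05, 0.05, 0.1, 0.8],
]
example_mask = block_mask(example_attention, ratio=0.75)
```

Because the threshold is derived from each row's own score distribution, a word with uniformly low scores still contributes its relatively strongest blocks to the final mask.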

6. The method according to any one of claims 2-5, characterized in that inputting the third text vector sequence into the text attention layer, generating the question knowledge relevance matrix using the multi-head attention mechanism, and selecting the target knowledge fragment from the initial knowledge fragment based on the question knowledge relevance matrix comprises: inputting the third text vector sequence into the text attention layer for multi-head attention calculation to obtain, for each attention head of the text attention layer, the attention weight matrix of question words to knowledge words output by that attention head; determining the importance weight of each attention head based on the sensitivity signal of each attention head to the question-answering task; weighting and aggregating the attention weight matrices of the attention heads based on the importance weights of the attention heads to generate the question knowledge relevance matrix, which represents the comprehensive relevance between question words and knowledge words; calculating, based on the question knowledge relevance matrix, the average degree to which each knowledge word is attended to by the question words, to obtain the relevance score of each knowledge word to the overall semantics of the question; and selecting, based on the relevance scores of the knowledge words to the overall semantics of the question, the target knowledge fragment that is semantically related to the question from the initial knowledge fragment.
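As a non-limiting illustration of the head-weighted aggregation and knowledge word scoring recited above, a minimal Python sketch follows. It assumes that the importance weight of a head is its sensitivity signal divided by the sum of all sensitivity signals, and that selection keeps the highest-scoring knowledge words in their original order; both are simplifications chosen for this sketch, and all names and numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows: question words, columns: knowledge words


def aggregate_heads(head_attn: List[Matrix], sensitivity: List[float]) -> Matrix:
    """Weighted sum of per-head attention matrices.

    Assumption: importance weight = sensitivity / sum of sensitivities.
    """
    total = sum(sensitivity)
    head_w = [s / total for s in sensitivity]
    n_q, n_k = len(head_attn[0]), len(head_attn[0][0])
    return [[sum(w * attn[q][k] for w, attn in zip(head_w, head_attn))
             for k in range(n_k)] for q in range(n_q)]


def knowledge_scores(relevance: Matrix) -> List[float]:
    """Average attention each knowledge word receives from the question words."""
    n_q = len(relevance)
    return [sum(column) / n_q for column in zip(*relevance)]


def select_knowledge(words: List[str], scores: List[float], keep: int) -> List[str]:
    """Keep the `keep` highest-scoring knowledge words, in original order."""
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:keep]
    return [words[i] for i in sorted(top)]


example_heads = [
    [[0.2, 0.6, 0.2], [0.4, 0.4, 0.2]],  # head 0
    [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]],  # head 1
]
example_relevance = aggregate_heads(example_heads, sensitivity=[1.0, 3.0])
example_scores = knowledge_scores(example_relevance)
example_selected = select_knowledge(
    ["warranty", "filter", "months"], example_scores, keep=2)
```

In practice the selection would operate on whole fragments rather than single words; word-level selection is used here only to keep the sketch short.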

7. The method according to any one of claims 1-5, characterized in that the method further comprises: generating a first prompt word based on the question text and the target knowledge fragment, the first prompt word being used to guide a relational reasoning model to output analytical text that represents the correlation of multimodal information; and inputting the first prompt word and the target image into the relational reasoning model, generating a first multimodal sequence based on the first prompt word and the target image, and performing cross-modal joint reasoning on the question text, the target image, and the target knowledge fragment using the first multimodal sequence as a third guiding condition to obtain implicit knowledge, the implicit knowledge describing, in natural language, the relationship among the question text, the target image, and the target knowledge fragment; wherein inputting the question text, the initial image, the target image, the initial knowledge fragment, and the target knowledge fragment into the question-answering generation model, and performing question-answer generation using the target image and the target knowledge fragment as factual basis and the initial image and the initial knowledge fragment as contextual auxiliary information, to obtain the answer information corresponding to the question text comprises: inputting the question text, the initial image, the target image, the initial knowledge fragment, the target knowledge fragment, and the implicit knowledge into the question-answering generation model, and performing question-answer generation using the target image and the target knowledge fragment as factual basis, the initial image and the initial knowledge fragment as contextual auxiliary information, and the implicit knowledge as supplementary basis, to obtain the answer information.

8. The method according to claim 7, characterized in that inputting the question text, the initial image, the target image, the initial knowledge fragment, the target knowledge fragment, and the implicit knowledge into the question-answering generation model, and performing question-answer generation using the target image and the target knowledge fragment as factual basis, the initial image and the initial knowledge fragment as contextual auxiliary information, and the implicit knowledge as supplementary basis, to obtain the answer information comprises: generating a second prompt word based on the question text, the target knowledge fragment, the initial knowledge fragment, and the implicit knowledge, the second prompt word limiting, in text form, the reasoning context of the question-answering generation model; inputting the second prompt word, the initial image, and the target image into the question-answering generation model, and performing semantic extraction on the second prompt word, the initial image, and the target image to obtain question semantics, visual semantics, and knowledge semantics; generating a second multimodal sequence based on the second prompt word, the initial image, and the target image; and, using the second multimodal sequence as a fourth guiding condition, the target image and the target knowledge fragment as factual basis, the initial image and the initial knowledge fragment as contextual auxiliary information, and the implicit knowledge as supplementary basis, performing cross-modal alignment on the question semantics, the visual semantics, and the knowledge semantics, and generating the answer information based on the alignment result.
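As a non-limiting illustration of how a second prompt word could assemble its four textual inputs while keeping their roles distinct, a minimal Python sketch follows. The section titles, the bullet layout, and the function name `build_second_prompt` are hypothetical; this application does not specify a prompt template.

```python
from typing import List


def build_second_prompt(question: str, target_knowledge: List[str],
                        initial_knowledge: List[str],
                        implicit_knowledge: str) -> str:
    """Assemble a text prompt that labels each input with its role."""
    sections = [
        ("Question", [question]),
        ("Factual basis (target knowledge fragments)", target_knowledge),
        ("Context (initial knowledge fragments)", initial_knowledge),
        ("Supplementary basis (implicit knowledge)", [implicit_knowledge]),
    ]
    lines: List[str] = []
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


example_prompt = build_second_prompt(
    "How often should the filter be replaced?",
    ["The filter should be replaced every 6 months."],
    ["The filter should be replaced every 6 months.",
     "The warranty lasts 2 years."],
    "The image shows the filter compartment referred to in the question.",
)
```

The images are not part of this text prompt; in the claim they are supplied to the question-answering generation model alongside it.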

9. The method according to claim 8, characterized in that inputting the second prompt word, the initial image, and the target image into the question-answering generation model, performing semantic extraction on the second prompt word, the initial image, and the target image to obtain the question semantics, the visual semantics, and the knowledge semantics, generating the second multimodal sequence based on the second prompt word, the initial image, and the target image, and, using the second multimodal sequence as the fourth guiding condition, the target image and the target knowledge fragment as factual basis, the initial image and the initial knowledge fragment as contextual auxiliary information, and the implicit knowledge as supplementary basis, performing cross-modal alignment on the question semantics, the visual semantics, and the knowledge semantics and generating the answer information based on the alignment result comprises: inputting the second prompt word, the initial image, and the target image into the question-answering generation model, and performing the following steps in the question-answering generation model: encoding the initial image using a second visual encoder to obtain a second visual vector; encoding the target image using a third visual encoder to obtain a third visual vector; encoding the second prompt word using a third text encoder to obtain a fourth text vector; projecting the second visual vector and the third visual vector into the embedding space of the language model of the question-answering generation model; concatenating, in the embedding space of the language model of the question-answering generation model, the projected second visual vector and the projected third visual vector with the fourth text vector to generate the second multimodal sequence; and, using the language model, with the second multimodal sequence as the fourth guiding condition, using the target image and the target knowledge fragment as factual basis to determine the answer backbone, using the initial image and the initial knowledge fragment as contextual auxiliary information to provide scene context support, and using the implicit knowledge as supplementary basis to enhance semantic coherence, performing cross-modal alignment on the question semantics, the visual semantics, and the knowledge semantics, and generating the answer information based on the alignment result.
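As a non-limiting illustration of the projection and concatenation steps recited above, a minimal Python sketch follows. It assumes a single linear projection matrix shared by both visual vectors and a sequence order of initial image, target image, then text; the weight values, dimensions, and function names are hypothetical and are not taken from this application.

```python
from typing import List

Vector = List[float]


def project(vec: Vector, weights: List[Vector]) -> Vector:
    """Linear projection: each row of `weights` yields one output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def build_multimodal_sequence(initial_vecs: List[Vector],
                              target_vecs: List[Vector],
                              text_vecs: List[Vector],
                              weights: List[Vector]) -> List[Vector]:
    """Project visual vectors into the text embedding space, then concatenate.

    Assumed order: initial image, target image, text.
    """
    projected = [project(v, weights) for v in initial_vecs + target_vecs]
    return projected + [list(t) for t in text_vecs]


# 3-dimensional visual vectors projected into a 2-dimensional embedding space.
example_weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
example_sequence = build_multimodal_sequence(
    initial_vecs=[[1.0, 0.0, 0.0]],
    target_vecs=[[0.0, 1.0, 1.0]],
    text_vecs=[[0.5, 0.5]],
    weights=example_weights,
)
```

After projection, every element of the sequence has the same dimension as the text embeddings, which is what allows a language model to consume them as one sequence.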

10. The method according to any one of claims 1-5, characterized in that retrieving the initial knowledge fragment from the target knowledge base based on the question text and the initial image comprises: encoding the initial image using a fourth visual encoder to obtain a fourth visual vector; encoding the question text using a fourth text encoder to obtain a fifth text vector; searching the target knowledge base based on the fourth visual vector and the fifth text vector to obtain candidate knowledge fragments; and performing retrieval on the candidate knowledge fragments using keywords from the question text to obtain the initial knowledge fragment.
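As a non-limiting illustration of the two-stage retrieval recited above, a minimal Python sketch follows. It assumes that the visual vector and the text vector are concatenated into one query vector, that candidates are ranked by cosine similarity, and that keywords are simply the question words longer than three characters; the knowledge base entries and their vectors are invented for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(visual_vec: Vector, text_vec: Vector, question: str,
             knowledge_base: List[Tuple[str, Vector]], top_k: int = 2) -> List[str]:
    """Stage 1: vector search for candidates. Stage 2: keyword filtering."""
    query = visual_vec + text_vec  # assumption: simple concatenation
    candidates = sorted(knowledge_base,
                        key=lambda entry: cosine(query, entry[1]),
                        reverse=True)[:top_k]
    # Assumption: keywords are the question words longer than three characters.
    keywords = [w for w in question.lower().split() if len(w) > 3]
    return [text for text, _ in candidates
            if any(k in text.lower() for k in keywords)]


example_kb = [
    ("Replace the filter every 6 months.", [1.0, 0.0, 1.0, 0.0]),
    ("The warranty lasts 2 years.", [1.0, 0.0, 0.5, 0.5]),
    ("Filter mesh is washable.", [0.0, 1.0, 0.0, 1.0]),
]
example_result = retrieve([1.0, 0.0], [1.0, 0.0],
                          "how often replace filter", example_kb)
```

In this example the third entry matches a keyword but is excluded at the first stage because its vector is dissimilar to the query, while the second entry passes the vector search but is removed by the keyword stage.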

11. A computing device, characterized in that it comprises: a memory and a processor; wherein the memory stores executable code which, when executed by the processor, causes the processor to perform the method as described in any one of claims 1 to 10.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores executable code that, when executed by a processor of a computing device, causes the processor to perform the method as described in any one of claims 1 to 10.

13. A computer program product, characterized in that it comprises: a computer program / instructions that, when executed by a processor, cause the processor to perform the steps of the method according to any one of claims 1 to 10.