Question and answer method, edge device, storage medium, and computer program product

By building a knowledge base on edge devices and extracting and associating multimodal information, the deployment problem of RAG technology on resource-constrained devices is solved, and efficient and accurate question-and-answer responses are achieved.

CN122240757A · Pending · Publication Date: 2026-06-19 · ZTE CORP

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZTE CORP
Filing Date
2025-12-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing question-answering systems based on retrieval-augmented generation (RAG) cannot be effectively deployed on resource-constrained edge devices: they require substantial computing resources, retrieve inefficiently, and cannot meet the fast-response and low-energy requirements of professional scenarios.

Method used

A knowledge base is built on edge devices. By extracting and associating multimodal information, related text information is generated. An adaptive degradation parsing strategy and target logical identifiers are adopted to achieve deep association of multimodal information and knowledge unification, thereby optimizing retrieval efficiency.

Benefits of technology

Efficient multimodal information search and answering are achieved on resource-constrained edge devices, improving the response speed and accuracy of answers to user questions.



Abstract

This application provides a question-answering method, an edge device, a storage medium, and a computer program product. The method includes: in response to receiving a user's question information, the edge device searches for answer information corresponding to the question information in a knowledge base. Thus, this application embodiment enables the search for corresponding answer information for user question information in a resource-constrained edge device.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence, and more specifically, to a question-answering method, an edge device, a storage medium, and a computer program product.

Background Technology

[0002] With the continuous development of artificial intelligence (AI) technology, question-answering systems based on large language models (LLM) are being used more and more widely.

[0003] In related technologies, existing LLM-based question-answering systems are mainly built on Retrieval-augmented Generation (RAG) technology. However, question-answering systems based on RAG technology have high requirements for computing resources and cannot be used in resource-constrained scenarios.

Summary of the Invention

[0004] This application provides a question-answering method, an edge device, a storage medium, and a computer program product to at least solve the problem that question-answering systems based on RAG technology in related technologies cannot be applied to resource-constrained scenarios.

[0005] According to one embodiment of this application, a question-and-answer method is provided, including:

[0006] In response to receiving a user's question, the system searches the knowledge base for the answer information corresponding to the question.

[0007] According to yet another embodiment of this application, a computer-readable storage medium is also provided, wherein a computer program is stored in the computer-readable storage medium, and the computer program is configured to execute the steps in any of the above-described question-and-answer method embodiments when it is run.

[0008] According to yet another embodiment of this application, an edge device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above-described question-answering method embodiments.

[0009] According to yet another embodiment of this application, a computer program product is also provided, including a computer program that, when executed by a processor, implements the steps in any of the above-described question-and-answer method embodiments.

[0010] In this embodiment, the edge device, upon receiving a user's question, searches for the corresponding answer in the knowledge base. Thus, this embodiment enables the search for the answer to a user's question on a resource-constrained edge device.

Attached Figure Description

[0011] Figure 1 is a hardware structure block diagram of a computer terminal used in an embodiment of the method of this application;

[0012] Figure 2 is a first flowchart of the steps of the question-and-answer method according to an embodiment of this application;

[0013] Figure 3 is a flowchart illustrating the steps of a knowledge base construction method according to an embodiment of this application;

[0014] Figure 4 is a logical diagram illustrating a semantic block division according to an embodiment of this application;

[0015] Figure 5 is a second flowchart of the steps of the question-and-answer method according to an embodiment of this application;

[0016] Figure 6 is a schematic diagram of the structure of a vector knowledge base according to an embodiment of this application;

[0017] Figure 7 is a third flowchart of the steps of the question-and-answer method according to an embodiment of this application;

[0018] Figure 8 is a schematic diagram of the logical architecture of a question-answering system according to an embodiment of this application;

[0019] Figure 9 is a structural block diagram of an edge device according to an embodiment of this application.

Detailed Implementation

[0020] The embodiments of this application will be described in detail below with reference to the accompanying drawings and examples. It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0021] In recent years, with the rapid development of Large Language Models (LLMs) and other technologies, Retrieval-augmented Generation (RAG) has become a mainstream solution for building accurate and reliable knowledge question-answering systems. RAG retrieves text fragments relevant to user questions from external knowledge bases and provides them as context to the Large Language Model, enabling it to generate fact-based and verifiable answers, effectively alleviating the problems of outdated knowledge and hallucinated content. However, in practical applications, especially when processing large amounts of specialized domain knowledge documents, RAG technology faces several technical bottlenecks.

[0022] Specifically, RAG technology faces significant challenges in computing power adaptation and retrieval efficiency when deployed at the edge. Edge devices (such as industrial production line terminals and in-vehicle systems) typically have strict resource constraints, including limited computing power, small memory capacity, and power sensitivity. However, in related technologies, achieving multimodal content understanding and question answering usually relies on deploying large-scale multimodal language models, which have extremely high computing resource requirements, making direct deployment on the aforementioned resource-constrained edge devices extremely difficult or even impractical.

[0023] In multi-document retrieval scenarios, traditional RAG-based question-answering methods typically rely on global vector search. This leads to an excessively large search scope, introducing a large amount of irrelevant or low-relevance content, resulting in low retrieval efficiency and increased computational burden on subsequent processing. Furthermore, the sorting or filtering mechanisms in related technologies are not optimized for the low-computing-power environments of edge devices, failing to meet the requirements of rapid response and low energy consumption in professional scenarios. Therefore, RAG-based question-answering systems cannot be implemented on edge devices, meaning they are unsuitable for resource-constrained scenarios.

[0024] To address the aforementioned issues, embodiments of this application provide a question-answering method, an edge device, a storage medium, and a computer program product. In response to receiving a user's question, the edge device searches for corresponding answer information in a knowledge base. Thus, embodiments of this application enable the search for corresponding answer information based on user question information within a resource-constrained edge device.

[0025] The methods and embodiments provided in this application can be executed on a mobile terminal, computer terminal, or similar computing device (or electronic device, etc.). Taking running on a computer terminal as an example, Figure 1 is a hardware structure block diagram of a computer terminal used in an embodiment of the method of this application. As shown in Figure 1, a computer terminal may include one or more processors 102 (only one is shown in Figure 1; a processor may include, but is not limited to, a microprocessor (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data. The computer terminal may further include a transmission device 106 for communication functions and an input / output device 108. Those skilled in the art will understand that the structure shown in Figure 1 is for illustrative purposes only and does not limit the structure of the computer terminal described above. For example, the computer terminal may include more or fewer components than those shown in Figure 1, or have a configuration different from that shown in Figure 1.

[0026] The memory 104 can be used to store computer programs, such as application software programs and modules, like the computer program corresponding to the question-and-answer method in this embodiment. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thus implementing the aforementioned method. The memory 104 may include high-speed random access memory and non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to a computer terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0027] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider for the computer terminal. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module used for wireless communication with the Internet.

[0028] This application provides a question-and-answer method that can run on the aforementioned computer terminal. Figure 2 is a first flowchart of the steps of the question-and-answer method according to an embodiment of this application. This question-and-answer method is applied to edge devices. As shown in Figure 2, this question-and-answer method includes the following steps:

[0029] Step S201: In response to receiving the user's question information, search the knowledge base for the answer information corresponding to the question information.

[0030] In this embodiment, the edge device can refer to a terminal device in a resource-constrained scenario, specifically including mobile phones, tablets, vehicle terminals, or edge servers, etc. This embodiment does not limit the specific type of edge device. The knowledge base can refer to a pre-built database for knowledge retrieval. Responding to user-inputted questions, the edge device can search the knowledge base for corresponding answers, enabling responses to users even in resource-constrained scenarios.

[0031] This application provides another question-and-answer method that runs on the aforementioned computer terminal. Figure 3 is a flowchart illustrating the steps of a knowledge base construction method according to an embodiment of this application. As shown in Figure 3, the method for constructing this knowledge base includes the following steps:

[0032] Step S301: Extract multimodal information from multiple knowledge documents.

[0033] Multimodal information includes at least one of text information, table information, and image information.

[0034] In this embodiment, a knowledge document can refer to the original document used to construct a knowledge base. The number of knowledge documents can be one or more; the following description primarily uses multiple knowledge documents as an example. Multimodal information can refer to information of different modalities included in multiple knowledge documents. This multimodal information can include one or more of text information, table information, and image information. Of course, the multimodal information can also include information of other modalities, which is not limited in this embodiment. The computer terminal can specifically employ natural language processing, text recognition, image recognition, and / or table recognition to extract multimodal information. The specific method can be flexibly selected based on actual needs, and this embodiment does not limit the specific extraction method for multimodal information.

[0035] Step S302: In response to the fact that the multimodal information includes non-textual information, determine the associated textual information corresponding to the non-textual information, and associate and bind the associated textual information with the multimodal information to obtain multiple target documents corresponding to multiple knowledge documents.

[0036] Non-text information includes image information and / or table information.

[0037] In this embodiment, non-text information includes at least one of image information or table information, that is, other modal information besides text information in multimodal information. Associated text information can refer to text information generated based on the association of non-text information. For example, the associated text information can refer to the image storage path of image information or the summary text of table information, etc. This embodiment does not limit the actual type of associated text information. The target document can refer to the document obtained after parsing, extracting, associating, and formatting a knowledge document. This target document can be used for subsequent knowledge base construction.

[0038] Specifically, after extracting the multimodal information from the knowledge document, the computer terminal can generate associated text information corresponding to the non-textual information, and then associate and bind this associated text information with the multimodal information in the knowledge document to obtain the target document corresponding to the knowledge document. This ensures the logical association of multimodal information in the target document, avoids the semantic fragmentation of multimodal information, and achieves deep association of multimodal information and knowledge unification.

[0039] Step S303: Construct a knowledge base based on multiple target documents.

[0040] The knowledge base includes: the document name of each target document, multiple text blocks corresponding to each document name, and text block metadata corresponding to each text block.

[0041] In this embodiment, after extracting multimodal information, generating associated text information, and binding the associated text information with the multimodal information to obtain the target document, the computer terminal can construct a knowledge base based on multiple target documents. In this way, by constructing a knowledge base, the computer terminal can improve the efficiency of subsequent knowledge retrieval, thereby increasing the response speed of user questions and answers.
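The knowledge base layout described in [0040] can be pictured with a small data structure. The following is only an illustrative sketch: the class names `TextBlock` and `KnowledgeBase`, the `add_block` and `blocks_for` helpers, and the `section` metadata key are hypothetical and do not come from the application, which only states that the knowledge base holds document names, text blocks per document name, and metadata per text block.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    text: str
    # Hypothetical metadata container; the source only says each block has metadata.
    metadata: dict = field(default_factory=dict)


@dataclass
class KnowledgeBase:
    # Maps the document name of each target document to its text blocks.
    documents: dict[str, list[TextBlock]] = field(default_factory=dict)

    def add_block(self, doc_name: str, text: str, **metadata) -> None:
        """Append a text block (with its metadata) under a document name."""
        self.documents.setdefault(doc_name, []).append(TextBlock(text, dict(metadata)))

    def blocks_for(self, doc_name: str) -> list[TextBlock]:
        """Return the text blocks of one document, or an empty list if unknown."""
        return self.documents.get(doc_name, [])


# Illustrative data only.
kb = KnowledgeBase()
kb.add_block("device_manual.md", "Power interface layout ...", section="2.1")
kb.add_block("device_manual.md", "Shutdown procedure ...", section="3.4")
```

Keying the blocks by document name is one way to let a later search narrow its scope to a single document before comparing individual text blocks.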

[0042] In one exemplary embodiment, step S301, which extracts multimodal information from multiple knowledge documents, can be achieved as follows:

[0043] Multimodal information is extracted from knowledge documents using a structured parser to obtain candidate multimodal information;

[0044] In response to the fact that the extraction accuracy of the candidate multimodal information is greater than or equal to a preset accuracy threshold, the candidate multimodal information is used as the multimodal information in the knowledge document; or, in response to the failure of the structured parser to parse or the extraction accuracy being less than a preset accuracy threshold, the multimodal information in the knowledge document is extracted by the target visual recognition model to obtain the multimodal information in the knowledge document.

[0045] In this embodiment, the candidate multimodal information can refer to the multimodal information obtained by initially extracting knowledge documents through a structured parser. Extraction accuracy can refer to the accuracy of comparing the content of the candidate multimodal information with the original knowledge document. The preset accuracy threshold can refer to a pre-set accuracy threshold, such as 80% or 90%.

[0046] In this step, the computer terminal employs an adaptive degradation parsing strategy to uniformly parse and extract multimodal information from the knowledge document. Specifically, the computer terminal can first use a structured parser (such as open-source tools like PyMuPDF) to parse the knowledge document and obtain candidate multimodal information. Then, the computer terminal can compare the content differences between the knowledge document and the candidate multimodal information to determine the extraction accuracy corresponding to the candidate multimodal information.

[0047] If the extraction accuracy of the candidate multimodal information is greater than or equal to the preset accuracy threshold, the computer terminal can determine that the candidate multimodal information is the multimodal information of the current knowledge document. If the extraction accuracy of the candidate multimodal information is less than the preset accuracy threshold, or the structured parser fails to parse the document (e.g., the knowledge document is in a non-standard format, is of low quality, such as blurry handwriting, or has a complex format, such as hierarchical nesting), the computer terminal can parse the knowledge document through a target visual recognition model. This target visual recognition model can perform visual layout analysis on the knowledge document. Specifically, it can parse the knowledge document based on image recognition technologies such as Optical Character Recognition (OCR) to obtain multimodal information.

[0048] In this embodiment, the computer terminal first attempts to extract knowledge documents based on a structured parser. When the structured parser has a low extraction accuracy or fails to extract the knowledge documents, a target visual recognition model is used to extract the knowledge documents. This ensures high robustness in extracting multimodal information from knowledge documents, improves the adaptability and flexibility of parsing and extracting different knowledge documents, and thus ensures the accuracy of knowledge base construction.
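The adaptive degradation parsing strategy of [0043] to [0048] can be sketched as a try-then-fall-back routine. This is a minimal sketch under stated assumptions, not the applicant's implementation: the two parsers are passed in as plain callables (stand-ins for a structured parser and a visual recognition model), `reference_text` is assumed to be some trusted text to compare against, and character-level similarity from `difflib` is a swapped-in stand-in for the unspecified "extraction accuracy" measure.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def extraction_accuracy(reference_text: str, extracted_text: str) -> float:
    """Stand-in accuracy measure (assumption): character similarity in [0, 1]."""
    return SequenceMatcher(None, reference_text, extracted_text).ratio()


def parse_with_fallback(
    document: bytes,
    reference_text: str,
    structured_parser: Callable[[bytes], str],
    visual_model: Callable[[bytes], str],
    threshold: float = 0.9,
) -> tuple[str, str]:
    """Return (extracted_text, parser_used) following the fallback rule."""
    try:
        candidate: Optional[str] = structured_parser(document)
    except Exception:
        candidate = None  # a parser failure triggers the fallback
    if candidate is not None and extraction_accuracy(reference_text, candidate) >= threshold:
        return candidate, "structured"
    # Low accuracy or failure: degrade to the visual recognition model.
    return visual_model(document), "visual"


# Hypothetical stand-ins for the two parsers, for illustration only.
def good_parser(doc: bytes) -> str:
    return doc.decode("utf-8")


def failing_parser(doc: bytes) -> str:
    raise ValueError("unsupported format")


def ocr_model(doc: bytes) -> str:
    return "text recovered by OCR"
```

Passing the parsers as callables keeps the fallback rule independent of any particular parsing library or OCR model.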

[0049] It should be noted that since knowledge documents may be in different formats, such as doc, docx, pdf, txt or md, computer terminals can perform format standardization processing on knowledge documents of different formats, and convert knowledge documents of different formats into knowledge documents of the target format. The target format may refer to the lightweight markup language Markdown (.md) format, but this application embodiment does not limit this.

[0050] The following uses image information and table information as examples to illustrate the determination of associated text information and the process of associating and binding associated text information with multimodal information in the embodiments of this application:

[0051] 1. Non-text information is image information.

[0052] In one exemplary embodiment, in response to the multimodal information including text information and image information, step S302 determines the associated text information corresponding to the non-text information and associates and binds the associated text information with the multimodal information, which may specifically include the following steps:

[0053] Determine the image storage path corresponding to the image information; determine the original location of the image corresponding to the image information in the text information, and modify the image information at the original location to the image storage path.

[0054] In this embodiment, the computer terminal can separate text, table, and image information within multimodal information. The computer terminal can save the image information to a preset storage location, then determine the original position of the image information within the text information (i.e., the original location where the image information appears in the text information). The computer terminal can then delete the image information from that original position and insert the corresponding image storage path at that original position. In this way, by embedding character-level absolute storage paths, the computer terminal can achieve structured binding of information sources, achieve multimodal knowledge unification, and facilitate subsequent vector encoding and knowledge base construction.
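The path substitution in [0053] and [0054] might look like the sketch below. It assumes, purely for illustration, that a parsed document arrives as an ordered list of segments in which `str` items are text runs and `bytes` items are raw image data; the function name `bind_image_paths`, the `image_N.png` naming scheme, and the `Image path:` line format are hypothetical choices, not details from the application.

```python
from pathlib import Path
from typing import Union

# Assumed input layout: str = text run, bytes = raw image data.
Segment = Union[str, bytes]


def bind_image_paths(segments: list[Segment], image_dir: Path) -> str:
    """Save each image and put its absolute storage path where the image was."""
    image_dir.mkdir(parents=True, exist_ok=True)
    parts: list[str] = []
    image_count = 0
    for segment in segments:
        if isinstance(segment, bytes):
            image_count += 1
            path = (image_dir / f"image_{image_count}.png").resolve()
            path.write_bytes(segment)  # store the image at the preset location
            parts.append(f"Image path: {path}")  # the path replaces the image in the text
        else:
            parts.append(segment)
    return "\n\n".join(parts)
```

Because the path is written at the image's original position, the resulting text is plain character data that can be chunked and vector-encoded like any other paragraph.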

[0055] In one exemplary embodiment, after associating and binding the associated text information with the multimodal information, the question-answering method may further include the following steps:

[0056] Determine the corresponding image caption information and the position of the reference paragraph for the image information in the text information; then, move the image storage path and image caption information to the reference paragraph position within the text information.

[0057] In this embodiment, the caption information can refer to the descriptive text corresponding to the image information. The reference paragraph position can refer to the actual position in the text information that references the image information. The computer terminal can parse the knowledge document converted to the target format, identify the image storage path embedded in the text information, determine the caption information corresponding to the image information to which the image storage path belongs, and trace the reference paragraph position of the image information in the text information. Then, the computer terminal can move the caption information and image storage path of the image information to the reference paragraph position of the image information, achieving physical coupling of context and avoiding logical separation between text information and image information.
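Moving the caption and the image storage path to the reference paragraph position, as described in [0056] and [0057], can be sketched as a list operation on paragraphs. The sketch assumes a single image per document, a caption paragraph that starts with `Caption: <label>`, a path paragraph that starts with `Image path:`, and a citing paragraph that mentions the figure label literally; all of these conventions and the name `move_to_reference` are illustrative assumptions.

```python
def move_to_reference(paragraphs: list[str], figure_label: str) -> list[str]:
    """Place an image's caption and path directly after the paragraph citing it."""
    # next() raises StopIteration if the caption or path paragraph is missing.
    caption = next(p for p in paragraphs if p.startswith(f"Caption: {figure_label}"))
    path = next(p for p in paragraphs if p.startswith("Image path:"))
    body = [p for p in paragraphs if p not in (caption, path)]
    for index, paragraph in enumerate(body):
        if figure_label in paragraph:
            # Insert caption and path right after the citing paragraph.
            return body[: index + 1] + [caption, path] + body[index + 1 :]
    return paragraphs  # no citing paragraph found: leave the document unchanged


# Illustrative input only.
doc = [
    "Refer to Figure 2.1 before operating.",
    "Other text.",
    "Image path: /img/a.png",
    "Caption: Figure 2.1: Layout.",
]
moved = move_to_reference(doc, "Figure 2.1")
```

After the move, a chunker that splits on paragraph boundaries is more likely to keep the citing sentence, the caption, and the path in the same or adjacent text blocks.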

[0058] In one exemplary embodiment, after associating and binding the associated text information with the multimodal information, the question-answering method further includes the following steps:

[0059] Determine the target logical identifier corresponding to the image information; the target logical identifier includes the reference paragraph position identifier, image storage path identifier, and caption information identifier corresponding to the image information; embed the target logical identifier at the associated position of the image information in the text information.

[0060] In this embodiment, the target logical identifier can refer to a logical tag corresponding to the image information, which may include a reference paragraph position identifier, an image storage path identifier, and a caption information identifier. In one example, the target logical identifier may be a logical triplet tag including an image storage path identifier (P-ID), a caption information identifier (C-ID), and a reference paragraph identifier (R-ID). The computer terminal can embed the target logical identifier in the associated position corresponding to the image information in the text information (e.g., the original position of the image information, the reference paragraph position, or the adjacent paragraph position of the reference paragraph position, etc.) to realize the explicit encoding of multimodal logical relationships.

[0061] The following is a specific example of text information embedding a target logical identifier in a computer terminal according to an embodiment of this application:

[0062] ...Before performing any operation, please be sure to refer to Figure 2.1 to understand the power interface layout of the device, which differs significantly from the interfaces of traditional devices...

[0063] [START_MULTIMODAL_UNIT | R-ID:PARA-4-5]

[0064] Caption: Figure 2.1: Power supply and control interface layout for the new model equipment. [C-ID:FIG-2-1]

[0065] Image path: /img/word_A/device_layout.png [P-ID:/img/work_A/device_layout.png]

[0066] [END_MULTIMODAL_UNIT]

[0067] ...(next paragraph).

[0068] In this embodiment, the computer terminal embeds the image storage path into the original image location within the text information; furthermore, it moves the image storage path and caption information to the corresponding reference paragraph location of the image information, and / or embeds the target logical identifier at the associated location of the image information. In this way, through tight contextual coupling and logical identifier injection, the computer terminal can ensure that the LLM (large language model) can obtain the multimodal information related to the text and images when outputting answer information, and can understand the explicit roles and hierarchical relationships of the text and images within the document structure, thus guaranteeing the contextual integrity and reasoning support of the multimodal knowledge units.
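The tagged unit shown in [0063] to [0066] can be produced and read back with a few lines of string handling. The sketch below follows the tag layout of that example; the function names `build_multimodal_unit` and `parse_multimodal_unit`, the regular expression, and the choice to reuse the storage path as the P-ID value are assumptions made for illustration.

```python
import re
from typing import Optional


def build_multimodal_unit(r_id: str, c_id: str, caption: str, image_path: str) -> str:
    """Render one image's logical triplet (R-ID, C-ID, P-ID) as a tagged text unit."""
    return "\n".join([
        f"[START_MULTIMODAL_UNIT | R-ID:{r_id}]",
        f"Caption: {caption} [C-ID:{c_id}]",
        f"Image path: {image_path} [P-ID:{image_path}]",
        "[END_MULTIMODAL_UNIT]",
    ])


UNIT_PATTERN = re.compile(
    r"\[START_MULTIMODAL_UNIT \| R-ID:(?P<r_id>[^\]]+)\]\n"
    r"Caption: (?P<caption>.*) \[C-ID:(?P<c_id>[^\]]+)\]\n"
    r"Image path: .* \[P-ID:(?P<p_id>[^\]]+)\]\n"
    r"\[END_MULTIMODAL_UNIT\]"
)


def parse_multimodal_unit(text: str) -> Optional[dict]:
    """Recover the identifiers from the first tagged unit in text, if any."""
    match = UNIT_PATTERN.search(text)
    return match.groupdict() if match else None


# Hypothetical path, used only to exercise the functions.
unit = build_multimodal_unit(
    "PARA-4-5", "FIG-2-1", "Figure 2.1: Interface layout.", "/img/device_layout.png"
)
parsed = parse_multimodal_unit("...preceding paragraph...\n" + unit + "\n...(next paragraph).")
```

Keeping the identifiers as plain bracketed text means they survive Markdown conversion and chunking, and a prompt can quote them directly.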

[0069] In related technologies, traditional RAG workflows have significant shortcomings in processing source documents, especially complex multimodal documents (such as manuals, operating instructions, and other documents containing text, images, and tables). Non-textual information such as tables and images in multimodal documents is often ignored or corrupted during parsing. This is particularly true for tables with complex structures like merged cells, where traditional RAG methods struggle to accurately extract their content and structure. Specifically, because images are usually separated from their citations and descriptions in the text, the system cannot effectively associate visual information with textual information, leading to a one-sided understanding of the document. Although related technologies attempt to establish image-text associations by identifying image blocks and caption text blocks and calculating their physical distance (Euclidean distance) and bipartite graph matching, these associations are essentially based on paragraph or block-level matching. This geometric distance-based association method cannot deeply embed image information into the text stream, nor does it possess a deep understanding of the precise reference positions of images within the text and the document's logical structure. Consequently, in the subsequent knowledge base construction phase, multimodal information remains fragmented and cannot effectively serve the RAG process. In other words, related technologies cannot achieve deep association and knowledge unification for multimodal information.

[0070] In this embodiment, the computer terminal avoids the influence of the physical distance between text information and image information by moving the image storage path and caption information to the reference paragraph position, and / or by embedding the target logical identifier of the image information at the associated position of the image information. With this identifier embedding method, even if content carrying these target logical identifiers is separated during subsequent segmentation or retrieval, the LLM can still determine the relationships among the multimodal information through these logical identifiers (e.g., when the prompt includes some of the logical identifiers), thus realizing deep association and knowledge unification of multimodal information.

[0071] Furthermore, in this embodiment, the edge device can also use this target logical identifier for contextual tracing during knowledge base searches. For example, when the edge device's question-answering system retrieves a text block containing an image storage path identifier, the system can use this identifier to trace back and extract all text and image information with the same identifier, ensuring that the context provided to the LLM includes not only text information but also the image storage path and caption information corresponding to the image information. Because the identifiers are structured, they also support the LLM's subsequent reasoning and make it easier to prompt the LLM to perform specific operations. For instance, based on the image storage path identifier, the LLM can output the original image information corresponding to that identifier in the answer information, facilitating front-end rendering.
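The contextual tracing described in [0071] can be illustrated as a lookup over text blocks that share an image storage path identifier. This is a rough sketch only: it assumes identifiers appear in blocks as literal `[P-ID:...]` tags, that the knowledge base is available as a plain list of block strings, and that a linear scan is acceptable; the name `trace_context` is hypothetical.

```python
import re

P_ID = re.compile(r"\[P-ID:([^\]]+)\]")


def trace_context(retrieved: str, all_blocks: list[str]) -> list[str]:
    """Return the retrieved block plus every other block sharing one of its P-IDs."""
    ids = set(P_ID.findall(retrieved))
    context = [retrieved]
    for block in all_blocks:
        # Keep blocks that carry at least one identifier found in the retrieved block.
        if block != retrieved and ids & set(P_ID.findall(block)):
            context.append(block)
    return context


# Illustrative blocks only.
blocks = [
    "Refer to the layout first. [P-ID:/img/a.png]",
    "Caption: Interface layout. [P-ID:/img/a.png]",
    "Unrelated figure. [P-ID:/img/b.png]",
    "Plain text without identifiers.",
]
context = trace_context(blocks[0], blocks)
```

On a real edge device an index from identifier to block positions would avoid the linear scan, but the selection rule would stay the same.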

[0072] 2. Non-text information is tabular information.

[0073] In one exemplary embodiment, in response to the multimodal information including text information and table information, step S302 determines the associated text information corresponding to the non-text information and associates and binds the associated text information with the multimodal information, including:

[0074] Extract column header information and key data relationships from the table information, and generate summary text corresponding to the table information based on the column header information and key data relationships; determine the original position of the table information in the text information, and insert the summary text in the original position of the table.

[0075] In this embodiment, for knowledge documents of different formats, the computer terminal can convert them into a common target format through format unification. This eliminates format heterogeneity and achieves format unification for content of different modalities. In one example, the computer terminal can adjust the line breaks in the knowledge document according to the Markdown target format, achieving format unification for paragraphs, headings, tables, etc. Format unification also improves the searchability of tabular data.

[0076] Furthermore, complex tables in knowledge documents may contain merged cells, so the computer terminal may parse a merged cell repeatedly when converting the table to the target format. As a result, the target-format knowledge document may contain redundant content. The computer terminal can perform intelligent deduplication on the target-format knowledge document. Specifically, for cells in certain columns or rows of the table that repeat the same text, it can retain only one copy and remove the others. In this way, the computer terminal can optimize the table structure, reduce data redundancy, improve storage efficiency, and make the table structure more concise, facilitating further analysis and retrieval.
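The deduplication of repeated merged-cell text in [0076] could be approximated as follows. The sketch treats a table as a list of rows of strings and blanks any non-empty cell that repeats the cell to its left or the cell above it. Applying the rule to every cell is a simplification (the application speaks of certain columns or rows), and it would also blank genuinely distinct cells that happen to hold the same text; the function name is hypothetical.

```python
def dedupe_merged_cells(rows: list[list[str]]) -> list[list[str]]:
    """Keep the first copy of text repeated by a merged cell; blank the repeats."""
    result = [row[:] for row in rows]
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            # Compare against the ORIGINAL table so runs of three or more collapse too.
            same_as_left = c > 0 and cell == row[c - 1]
            same_as_above = r > 0 and c < len(rows[r - 1]) and cell == rows[r - 1][c]
            if cell and (same_as_left or same_as_above):
                result[r][c] = ""
    return result


# Illustrative table: "Power" was a vertically merged cell parsed twice.
table = [
    ["Group", "Item", "Value"],
    ["Power", "Input", "220V"],
    ["Power", "Output", "12V"],
]
cleaned = dedupe_merged_cells(table)
```

The input table is left untouched, so the caller can still compare the original and cleaned versions.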

[0077] In this embodiment, column header information can refer to the title of the data column corresponding to the table information, also known as column header, etc. Key data relationships can refer to various data relationships contained in the table information, specifically including numerical comparison relationships, time series relationships, sorting and filtering relationships, classification and grouping relationships, or trend description information, etc. Summary text can refer to descriptive text about the table information and can serve as associated text information for the table information.

[0078] Specifically, for tabular information in multimodal information (e.g., tabular information after intelligent deduplication), the computer terminal can extract the column header information and key data relationships of the tabular information. Then, it can further generate structured summary text based on the column header information and key data relationships. Afterward, the computer terminal can determine the original position of the tabular information within the text information and insert the summary text at that original position, thus achieving the association and binding of the summary text with the multimodal information.

[0079] In this embodiment of the application, the computer terminal generates a summary text of the table information and embeds the summary text into the original position of the table in the text information. This can transform the inherent structural information of the table into semantic information that can be vector-encoded, such as the summary text, which can significantly optimize the distinguishability and recall efficiency of the table information in subsequent retrieval.
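
As a minimal sketch of the table summary step, the following assumes the target format is Markdown with pipe-delimited tables, and that the only "key data relationship" extracted is the numeric range of each column. The helper names (`parse_row`, `summarize_table`, `insert_table_summaries`), the summary wording, and the choice to place the summary directly before the table are all illustrative assumptions.

```python
def parse_row(line):
    """Split one Markdown table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def summarize_table(headers, rows):
    """Build summary text from column headers and a simple numeric comparison."""
    parts = ["Table with columns: " + ", ".join(headers) + "."]
    for idx, header in enumerate(headers):
        values = []
        for row in rows:
            try:
                values.append((float(row[idx]), row[0]))  # (value, row label)
            except (ValueError, IndexError):
                values = []  # non-numeric or ragged column: skip it entirely
                break
        if values:
            low, high = min(values), max(values)
            parts.append(
                f"{header} ranges from {low[0]:g} ({low[1]}) to {high[0]:g} ({high[1]})."
            )
    return " ".join(parts)


def insert_table_summaries(markdown):
    """Insert a summary paragraph at the original position of every table."""
    lines = markdown.split("\n")
    out, i = [], 0
    while i < len(lines):
        if lines[i].lstrip().startswith("|"):
            j = i
            while j < len(lines) and lines[j].lstrip().startswith("|"):
                j += 1  # j ends one past the last line of the table
            table = lines[i:j]
            headers = parse_row(table[0])
            rows = [parse_row(line) for line in table[2:]]  # skip separator row
            out.append(summarize_table(headers, rows))
            out.append("")
            out.extend(table)
            i = j
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)


doc = "Intro\n\n| Model | Power (W) |\n| --- | --- |\n| A | 5 |\n| B | 12 |\n\nEnd"
result = insert_table_summaries(doc)
```

Because the summary is ordinary prose, it can be vector-encoded together with the surrounding paragraphs, which is the effect the paragraph above attributes to this step.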

[0080] In the process of preprocessing knowledge documents, the computer terminal uses structured parsing, associated text information generation, complex table normalization, and structured binding of image text and table text to deeply associate non-text information (such as images, tables, etc.) with text information in multimodal information. This enables integrated processing and association of multimodal information, ensuring the integrity and indexability of the information.

[0081] In an exemplary embodiment, the construction of the knowledge base based on multiple target documents in step S303 can be specifically achieved through the following steps (1) to (3):

[0082] Step (1): Semantically divide the content in each target document to obtain multiple semantic blocks.

[0083] In this embodiment, the computer terminal can semantically divide the content of each target document into multiple semantic blocks through semantic aggregation, entity recognition, or deep learning, thereby achieving semantic-based adaptive segmentation.

[0084] Specifically, the computer terminal divides the target document, such as a Markdown document, into semantic blocks with complete semantic meaning, and actively corrects the structural defects of the original document during the aggregation process, realizing semantic-based structural defect detection and adaptive block division. In one example, the semantic block division method of the computer terminal can adopt an iterative semantic aggregation method, which can guide semantic aggregation by calculating the similarity (δ) between the first paragraph (or the current paragraph) and the first semantic block (or the current semantic block). In one example, if the similarity between the first paragraph and the first semantic block is greater than or equal to the first preset similarity threshold (e.g., 80% or 90%), the computer terminal can merge the first paragraph into the first semantic block; if the similarity between the first paragraph and the first semantic block is less than the first preset similarity threshold, the computer terminal can create a new semantic block based on the first paragraph or merge it into other semantic blocks with a similarity not less than the first preset similarity threshold. For example, the similarity δ between the first paragraph and the first semantic block can be calculated by the following formula (1):

[0085] $\delta = S\left(P, \frac{1}{N}\sum_{i=1}^{N} Q_i\right)$ (1)

[0086] In the above formula (1), δ is the cosine similarity between the first paragraph and the first semantic block; S denotes the similarity function; P denotes the vector of the first paragraph; N denotes the total number of vectors corresponding to the first semantic block; and Q_i denotes the i-th vector corresponding to the first semantic block, where i ranges from 1 to N. That is, the cosine similarity between the first paragraph and the first semantic block in formula (1) can be obtained by calculating the similarity between the vector of the first paragraph and the average vector of the first semantic block. Of course, the similarity between the first paragraph and the first semantic block can also be calculated in other ways, or other types of similarity can be used. This application does not limit this.
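
The calculation described for formula (1), the cosine similarity between the paragraph vector and the average vector of the semantic block, can be sketched in plain Python as below. The function names are hypothetical, and the way the vectors are produced (the embedding model) is outside the scope of this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of N vectors, i.e. (1/N) * sum of Q_i."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def paragraph_block_similarity(p, block_vectors):
    """delta = S(P, mean(Q_1..Q_N)) with S taken as cosine similarity."""
    return cosine(p, mean_vector(block_vectors))


delta = paragraph_block_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy input the block average is (0.5, 0.5), so the paragraph sits at 45 degrees to it and δ is about 0.707, below an 80% threshold.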

[0087] Furthermore, in this embodiment, the computer terminal may also employ other semantic block segmentation methods (or rules) to enhance the robustness and adaptability of semantic block segmentation for low-quality documents, enabling the detection and repair of structural defects in knowledge units within the target document. In an exemplary embodiment, Figure 4 is a logical diagram illustrating semantic block division according to an embodiment of this application. In Figure 4, P represents the first paragraph, and S represents the first semantic block. As shown in Figure 4, the specific methods for dividing semantic blocks also include at least one of the following:

[0088] In response to the total number of characters in the first semantic block being greater than a preset upper limit threshold, the first semantic block is used as the segmented semantic block, and a new second semantic block is constructed; and / or, in response to the total number of characters in the first semantic block being less than a preset lower limit threshold, the first paragraph is directly merged into the first semantic block; and / or, in response to the first paragraph being the target logical identifier corresponding to the image information, the first paragraph is directly merged into the first semantic block.

[0089] In this embodiment, the preset length upper limit threshold can refer to a pre-set upper limit on the length of the first semantic block, i.e., the maximum length of the first semantic block. When the total number of characters in the first semantic block exceeds the preset length upper limit threshold, the computer terminal can forcibly terminate the construction of the first semantic block and construct a new second semantic block to continue the subsequent semantic aggregation. By controlling the total number of characters in the first semantic block so that it does not exceed the preset length upper limit threshold, the computer terminal can maintain subsequent retrieval efficiency and keep reasoning focused. This avoids excessively long context that dilutes key information and increases the processing latency of edge devices, reduces the load of subsequent question-answering reasoning, improves the domain purity of the knowledge question-answering process, and improves the quality of answer information.

[0090] The preset minimum length threshold can refer to a pre-set lower limit on the length of the first semantic block, i.e., the minimum length of the first semantic block. When the total number of characters in the first semantic block is less than the preset minimum length threshold, the computer terminal does not need to calculate the similarity between the first paragraph and the first semantic block, and can directly merge the first paragraph into the first semantic block. Because this forced merge does not rely on similarity judgment, it prevents important short texts such as key definitions and warning messages from being isolated by typesetting errors. It thereby achieves semantic isolation correction, that is, it repairs the semantic fragmentation caused by poor document structure and ensures the coherence of knowledge retrieval.

[0091] In this embodiment, when the first paragraph is the target logical identifier corresponding to the image information, or when the first paragraph is the image storage path or caption information corresponding to the image information, the computer terminal does not need to calculate the similarity between the first paragraph and the first semantic block, and directly merges the first paragraph into the first semantic block. In this way, the computer terminal, through the forced image-text merging rule, can ensure the integrity of the integrated image-text knowledge unit, effectively solving the problem that image information and text information may be physically separated by unrelated content in low-quality documents, and improving the coherence and completeness of semantic block division.
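
The iterative aggregation and the three forced rules above can be sketched together as follows. Several things here are assumptions made for illustration only: the order in which the rules are checked, the numeric thresholds, the word-overlap `jaccard` function (a stand-in for the vector similarity of formula (1)), and the use of the Markdown image prefix `![` to recognize an image identifier. The sketch also omits the option of merging a paragraph into some other, earlier semantic block.

```python
def jaccard(a, b):
    """Toy word-overlap similarity; a stand-in for embedding cosine similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def aggregate(paragraphs, similarity=jaccard, threshold=0.8,
              min_chars=40, max_chars=500,
              is_image_marker=lambda p: p.startswith("![")):
    """Group paragraphs into semantic blocks (each block is a list of paragraphs)."""
    blocks = []
    for paragraph in paragraphs:
        if not blocks:
            blocks.append([paragraph])  # the first paragraph opens the first block
            continue
        current = blocks[-1]
        size = sum(len(text) for text in current)
        if is_image_marker(paragraph):
            current.append(paragraph)   # forced image-text merging
        elif size > max_chars:
            blocks.append([paragraph])  # block too long: close it, open a new one
        elif size < min_chars:
            current.append(paragraph)   # block too short: merge without similarity
        elif similarity(paragraph, " ".join(current)) >= threshold:
            current.append(paragraph)   # similar enough: merge
        else:
            blocks.append([paragraph])  # dissimilar: start a new block
    return blocks


Q = ("The operating temperature of the equipment should not exceed 60℃, "
     "otherwise it may cause the sensor to melt.")
R = "Warning code: E-T001"
S = "When the temperature exceeds the safe threshold, step 4.2 should be executed immediately..."
blocks = aggregate([Q, R, S])
```

With these toy settings, R shares no words with Q and therefore opens a new block; that block is shorter than `min_chars`, so S is merged into it without any similarity check.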

[0092] For example, the following is a practical example of text block segmentation:

[0093] A certain non-normalized industrial document, after parsing, produced the following text stream structure:

[0094] Paragraph Q: "The operating temperature of the equipment should not exceed 60℃, otherwise it may cause the sensor to melt."

[0095] Paragraph R: "Warning code: E-T001".

[0096] Paragraph S: "When the temperature exceeds the safe threshold, step 4.2 should be executed immediately..."

[0097] Regarding the above text stream structure:

[0098] In related technologies, because paragraph R is extremely short and has low semantic similarity to the preceding and following paragraphs (R is just a code with little semantic information), traditional semantic aggregation methods typically split paragraph R off as an isolated text block, for example, by using blank lines or delimiters. This division method in related technologies isolates the short but crucial paragraph R, resulting in poor semantic segmentation performance.

[0099] Beyond the examples mentioned above, in terms of knowledge chunking and semantic coherence, traditional RAG workflows often employ fixed-length or delimiter-based simple chunking methods, which easily disrupt a complete and coherent semantic unit. The chunking methods in these technologies are designed for plain text content and primarily rely on the document's original paragraph structure and semantics for aggregation. When the document is of low quality or poorly structured, this purely semantic aggregation lacks robustness; more importantly, this chunking method lacks the ability to constrain and process multimodal content.

[0100] In this embodiment, if the computer terminal uses paragraph R as the first semantic block, the total number of characters in the first semantic block is lower than the preset minimum length threshold, so under the aforementioned semantic isolation correction rule the computer terminal can directly merge the next paragraph, i.e., paragraph S, into the first semantic block. The final semantic block is "Warning code: E-T001. When the temperature exceeds the safe threshold, step 4.2 should be executed immediately...". Thus, the semantic isolation correction rule in this embodiment ensures that when a user queries "How to handle the E-T001 fault code," the text block recalled by the question-and-answer system of the edge device contains not only the code itself but also its corresponding processing steps (i.e., the complete context of paragraph S), thereby ensuring the completeness and accuracy of knowledge retrieval and avoiding knowledge loss due to document structure defects. The semantic block division method in this embodiment can improve the tolerance for structural errors in low-quality documents and has a certain repair capability for fragmented key information, ensuring the comprehensiveness and accuracy of subsequent knowledge-based question answering.

[0101] In this embodiment, during the knowledge base architecture construction process, the computer terminal utilizes a semantic-based intelligent segmentation method to construct a hierarchical knowledge granularity structure. This fundamentally solves the context fragmentation problem caused by traditional fixed-length segmentation, while significantly improving retrieval accuracy. Specifically, the computer terminal employs an iterative semantic aggregation method, utilizing inter-paragraph vector similarity for intelligent aggregation to divide semantic blocks with complete context. Based on this iterative semantic aggregation method, the computer terminal combines other semantic block division rules (such as forced image-text merging, handling excessively long / short paragraphs, etc.) to ensure the semantic integrity of each semantic block, thus overcoming the deficiency of pure text semantic aggregation methods in related technologies in handling multimodal associations.

[0102] Step (2): Segment each semantic block to obtain multiple text blocks.

[0103] Step (3): Construct a knowledge base based on the text block, the document name corresponding to the target document, and the text block metadata corresponding to the text block; the text block metadata includes at least one of the following: the document name of the target document to which the text block belongs, the semantic block identifier of the semantic block to which the text block belongs, the position identifier of the text block in the semantic block to which it belongs, and the text block type of the text block.

[0104] In this embodiment, after semantically aggregating the target document to obtain multiple semantic blocks, the computer terminal can further divide each semantic block to obtain multiple text blocks corresponding to the semantic blocks. In one example, the computer terminal can divide the semantic blocks according to a preset upper limit for text block length, using preset punctuation marks (such as periods) as boundaries. That is, semantic blocks whose total number of characters exceeds the preset upper limit for text block length are split at the preset punctuation marks, thereby obtaining multiple smaller and more refined text blocks corresponding to the semantic blocks. Of course, the computer terminal can also use other methods to divide the semantic blocks to obtain text blocks, such as dividing them based on information density, entity names (such as personal names, place names, product names, events, etc.), or topic paragraphs, etc. This embodiment does not limit this. In this embodiment, by further dividing each semantic block into text blocks, the computer terminal can construct a hierarchical knowledge granularity structure and can transform complete semantic blocks into efficient retrieval units, providing complete data support for subsequent efficient retrieval by edge devices.

[0105] After obtaining multiple text blocks corresponding to each semantic block, the computer terminal can determine the text block metadata corresponding to the text blocks and construct a knowledge base based on the document name of the target document, the multiple text blocks corresponding to the document name, and the text block metadata corresponding to each text block. Thus, this embodiment of the application, by employing a hierarchical knowledge granularity strategy of "semantic block-text block" to construct the knowledge base, can improve the accuracy and efficiency of subsequent retrieval. By using smaller text blocks as retrieval units, the accuracy of subsequent knowledge retrieval can be improved.

[0106] In addition, the text block metadata corresponding to each text block includes at least one of the following:

[0107] The document name of the target document to which the text block belongs, the semantic block identifier of the semantic block to which the text block belongs (such as the semantic block number, which is used to achieve fast tracing and reconstruction of the complete context in the subsequent retrieval stage), the position identifier of the text block in the semantic block (such as the position number), and the text block type (such as a table or a text paragraph).

[0108] In this embodiment, the computer terminal finely segments excessively long semantic blocks into smaller text blocks using preset punctuation marks as boundaries, generates rich text block metadata for each text block, and then adds the text blocks and their metadata to the knowledge base. This hierarchical structure of semantic blocks and text blocks preserves complete contextual coherence (semantic blocks) in the knowledge base, improving the traceability of context within the knowledge base; it also provides high-precision units (text blocks) for retrieval, achieving a balance between contextual integrity and retrieval efficiency. This improves the quality and efficiency of subsequent question-and-answer sessions and also increases the response speed of user questions and answers.
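
A minimal sketch of steps (2) and (3) follows, assuming periods followed by whitespace are the preset punctuation boundary. The `TextBlock` field names, the greedy packing strategy, and the rule that a block starting with `|` is a table are illustrative assumptions rather than details given in this application.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    doc_name: str           # document name of the target document
    semantic_block_id: int  # semantic block the text block belongs to
    position: int           # position of the text block inside that semantic block
    block_type: str         # "table" or "text"


def split_semantic_block(text, max_len):
    """Split an over-long semantic block at sentence-ending periods."""
    if len(text) <= max_len:
        return [text]
    # Require whitespace after the period so "step 4.2" is not split.
    sentences = [s for s in re.split(r"(?<=\.)\s+", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)  # current chunk is full: emit it
            current = sentence
        else:
            current = candidate     # a single over-long sentence stays whole
    if current:
        chunks.append(current)
    return chunks


def build_text_blocks(doc_name, semantic_blocks, max_len):
    """Turn the semantic blocks of one document into text blocks with metadata."""
    records = []
    for block_id, block_text in enumerate(semantic_blocks):
        for position, chunk in enumerate(split_semantic_block(block_text, max_len)):
            kind = "table" if chunk.lstrip().startswith("|") else "text"
            records.append(TextBlock(chunk, doc_name, block_id, position, kind))
    return records


records = build_text_blocks("manual.md", ["Aaa bbb. Ccc ddd. Eee fff."], max_len=18)
```

The `semantic_block_id` and `position` fields are what later allow a retrieved text block to be traced back to its full semantic block.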

[0109] Based on the above embodiments, Figure 5 is a second flowchart of the steps of the question-and-answer method in this application. As shown in Figure 5, searching for the answer information corresponding to the question information in the knowledge base can include the following steps:

[0110] Step S501: In the knowledge base, select the associated text block that is related to the question information from the text block corresponding to the document name.

[0111] Step S502: Based on the question information, the associated text block, and the metadata of the associated text block, construct the target prompt information.

[0112] In this embodiment, the associated text block can refer to a text block associated with the question information. The target prompt information (Prompt) can refer to the prompt information that is ultimately input into the local quantized language model to obtain the answer information. The associated text block metadata refers to the metadata corresponding to the associated text block.

[0113] Specifically, after the knowledge base is built, the edge device, in response to the user's question information, can perform a search and matching within the knowledge base to determine the associated text blocks related to the question information. The specific search and matching process can be based on vector matching or semantic matching, and can also employ methods such as word frequency calculation or topic clustering; this embodiment does not limit the specific methods used. Subsequently, the edge device can construct target prompt information based on the question information, associated text blocks, and associated text block metadata. Following this, it can interact with the local lightweight language model based on the target prompt information to obtain the answer information.

[0114] Step S503: Input the target prompt information into the local quantized language model to obtain the answer information.

[0115] In this embodiment, the local quantized language model can refer to a large language model (LLM) deployed on an edge device after quantization processing. After determining the target prompt information, the computer terminal can input the target prompt information into the local quantized language model. The local quantized language model performs reasoning based on the target prompt information and finally outputs the answer information corresponding to the question information.

[0116] In an exemplary embodiment, in step S502, target prompt information is constructed based on the question information, the associated text block, and the metadata of the associated text block. This can be achieved in the following way:

[0117] Based on the metadata of the associated text block, determine the associated semantic block corresponding to the associated text block; determine the context information corresponding to the associated semantic block; and construct the target prompt information based on the context information and the question information.

[0118] In this embodiment, the metadata of the associated text block stores a semantic block identifier for the associated text block. Based on this semantic block identifier, the edge device can determine the associated semantic block corresponding to the associated text block. The edge device can then use the content of the associated semantic block as context information. Following this context information and question information, the edge device can construct structured target prompt information. In this way, based on a knowledge base, the edge device can efficiently trace and reconstruct the complete associated semantic block to which the associated text block belongs as context information, enabling targeted reconstruction of the context and ensuring the accuracy and rationality of subsequent answer information output.
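
The metadata-driven context reconstruction can be sketched as below. The dictionary layout of the retrieved hits, the `semantic_blocks` lookup table keyed by document name and semantic block identifier, and the prompt wording are all hypothetical; this application only states that the semantic block identifier in the metadata is used to recover the full semantic block as context.

```python
def build_prompt(question, hits, semantic_blocks):
    """Assemble a prompt from the semantic blocks that the retrieved hits belong to.

    hits            -- metadata dicts with "doc_name" and "semantic_block_id"
    semantic_blocks -- maps (doc_name, semantic_block_id) to the full block text
    """
    seen, contexts = set(), []
    for hit in hits:
        key = (hit["doc_name"], hit["semantic_block_id"])
        if key in seen:
            continue  # several text blocks may point at the same semantic block
        seen.add(key)
        contexts.append(f"[Source: {key[0]}]\n{semantic_blocks[key]}")
    background = "\n\n".join(contexts)
    return (
        "You are a domain expert. Answer using only the background knowledge.\n\n"
        f"Background knowledge:\n{background}\n\n"
        f"Question: {question}\nAnswer:"
    )


store = {
    ("manual.md", 3): "Warning code: E-T001. When the temperature exceeds the "
                      "safe threshold, step 4.2 should be executed immediately...",
}
hits = [
    {"doc_name": "manual.md", "semantic_block_id": 3, "position": 0},
    {"doc_name": "manual.md", "semantic_block_id": 3, "position": 1},
]
prompt = build_prompt("How to handle the E-T001 fault code?", hits, store)
```

Deduplicating by semantic block keeps the prompt short when several hits fall in the same block, which matters for a memory-limited device.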

[0119] In an exemplary embodiment, in step S503, the target prompt information is input into the local quantized language model to obtain the answer information, which can be implemented in the following way:

[0120] By using a local quantized language model, the contextual information in the target prompt is used as background knowledge to generate answer information for the question. The answer information includes at least one of the following: the image storage path corresponding to the image information, the original text of the table path corresponding to the table information, and the document name corresponding to the context information.

[0121] In this embodiment, the target prompt information can be generated based on a preset template to guide the local quantized language model in reasoning. In one example, the target prompt information may require the local quantized language model to act as an expert and answer questions based on the provided background knowledge. After receiving the target prompt information, the local quantized language model can use the context information in the target prompt information as background knowledge and output answer information for the question. If the context information includes an image storage path or table information, the answer information output by the local quantized language model may include the image storage path or the original text of the table path corresponding to the table information. The original text of the table path may refer to the specific content of the table information included in the semantic block, so that the edge device can render and display it on the front end. In addition, the answer information may also include the name of the document that serves as the knowledge source, i.e., the source of the context information, which can enhance the credibility of the answer information. After receiving the answer information output by the local quantized language model, the question-answering system on the edge device can parse the answer information. If it contains image storage paths or table information, the question-answering system can automatically render and display the corresponding image or table on the front end based on the target format, thereby producing an answer that is both illustrated and well supported and improving the user's actual experience.

[0122] Based on the above embodiments, in an exemplary embodiment, the knowledge base constructed by the computer terminal is a vector knowledge base, which includes: multiple document name vectors corresponding to multiple target documents, multiple text block vectors corresponding to each document name vector, and text block metadata corresponding to each text block vector.

[0123] In this embodiment of the application, the knowledge base constructed by the computer terminal can specifically be a vector knowledge base. For example, Figure 6 is a schematic diagram of the structure of a vector knowledge base according to an embodiment of this application. As shown in Figure 6, this vector knowledge base includes multiple document name vectors corresponding to multiple target documents. Each document name vector corresponds to the image storage path of the image information of the target document. This embodiment does not limit the specific storage method for the image information: it can be stored in folders by document name (for example, image folders 1 to N in Figure 6 correspond to document names 1 to N respectively), or it can be stored in a uniform sorting order, etc. This vector knowledge base also includes the text block vectors of the text blocks corresponding to each document name vector, and each text block vector can correspond to one piece of text block metadata.

[0124] In this embodiment, after the computer terminal divides the text into blocks, it can perform vector encoding on the text blocks to obtain text block vectors. Then, it can store the document name vector corresponding to the target document, the text block vectors, and the text block metadata corresponding to the text block vectors in a vector knowledge base. The text block metadata may include the document name to which the text block belongs, the document name vector corresponding to that document name, the semantic block identifier of the semantic block to which the text block belongs (i.e., semantic block numbers 1 to N in Figure 6), the position identifier of the text block within the semantic block to which it belongs (i.e., semantic block sub-numbers 1 to N in Figure 6), and the text block type (not shown in Figure 6). In this way, by adopting a vector knowledge base, the computer terminal can achieve hierarchical and fine-grained construction based on document name-semantic block-text block, and adopt rich text block metadata design, which can improve the efficiency of subsequent retrieval, achieve accurate and complete extraction of contextual information, and adapt to the actual scenarios of edge devices with limited memory.

[0125] In one exemplary embodiment, Figure 7 is a third flowchart of the steps of the question-and-answer method in this application. As shown in Figure 7, searching for the answer information corresponding to the question information in the knowledge base can include the following steps:

[0126] Step S701: Based on the first association similarity between the question vector corresponding to the question information and the document name vector, and the second association similarity between the question vector and the text block vector, select the associated text block vector that is associated with the question vector from the text block vector corresponding to the document name vector.

[0127] In this embodiment of the application, when the constructed knowledge base is a vector knowledge base, the edge device can determine the associated text block vectors of the question vectors based on vector matching. These associated text block vectors can be determined based on a first association similarity and a second association similarity. Specifically, they can be determined based on the weighted similarity of the first and second association similarities and semantic-based ranking and filtering, or based on the confidence level determined by the first and second association similarities (for example, based on the statistical distribution of the first and second association similarities, the similarity is converted into a confidence level, and the confidence level is standardized using a Gaussian distribution, or a confidence interval estimation is added, and then associated text block vectors are filtered according to the confidence level). Of course, other determination methods can also be used, and this embodiment of the application does not limit this.

[0128] In this step, the question vector can refer to the encoded vector corresponding to the question information; the first association similarity can refer to the similarity between the question vector and the document name vector; and the second association similarity can refer to the similarity between the question vector and the text block vector. In this way, by introducing the document name vector and calculating the similarity between vectors, edge devices can filter associated text block vectors, achieving efficient and low-power retrieval of associated text blocks. This aligns with the resource-constrained scenarios of edge devices and enables rapid response to user question information.

[0129] Step S702: Construct target prompt information based on the question vector, the associated text block vector, and the associated text block metadata corresponding to the associated text block vector.

[0130] Step S703: Input the target prompt information into the local quantized language model to obtain the answer information.

[0131] In this embodiment, after determining the associated text block vector, the edge device can construct target prompt information based on the question vector, the associated text block vector, and the associated text block metadata. Subsequently, the edge device can input this target prompt information into a local quantized language model to obtain the answer information corresponding to the question information. The process of constructing the target prompt information and the process of outputting the answer information can refer to the foregoing embodiments, and will not be repeated here.

[0132] In this embodiment, the edge device can efficiently trace and reconstruct the complete semantic block to which the associated text block belongs as context information by associating the text block vector with the text block vector metadata. Then, the target prompt information can be constructed by combining the question information and the context information. In this way, the memory usage and context assembly latency of the edge device can be greatly optimized through the metadata-driven targeted reconstruction mechanism, which can adapt to the resource-constrained scenarios of the edge device.

[0133] In an exemplary embodiment, step S701, based on the first association similarity between the question vector corresponding to the question information and the document name vector, and the second association similarity between the question vector and the text block vector, selects the associated text block vectors associated with the question vector from the text block vectors corresponding to the document name vector. This can be implemented in the following way:

[0134] Calculate the first association similarity between the question vector and the document name vector, and calculate the second association similarity between the question vector and the text block vector corresponding to the document name vector; calculate the weighted similarity between the question vector and the text block vector based on the first and second association similarities; in response to the weighted similarity satisfying the preset similarity condition, determine the candidate text block vector corresponding to the question vector from the text block vectors; sort the candidate text block vectors according to the degree of association between the candidate text block vectors and the question vector to obtain the associated text block vectors.

[0135] In this embodiment, the process of edge devices determining associated text block vectors can be a two-stage multi-dimensional retrieval, which can adapt to edge computing constraints and avoid the computational bottlenecks of traditional RAG technology's global search and high-load models in related technologies. The first stage is a coarse-grained cross-dimensional fusion for rapid recall to determine candidate text block vectors; the second stage is a fine-grained lightweight ranking to determine the final associated text block vectors.

[0136] Specifically, in the first stage, the edge device can encode the user's question information to obtain a question vector. Then, the edge device can calculate a first association similarity between the question vector and the document name vector, and a second association similarity between the question vector and the text block vector. The edge device can then calculate a weighted similarity based on the first and second association similarities. For example, this weighted similarity can be calculated using the following formula (2):

[0137] $\theta = \alpha\, S(q, P_i) + \beta\, S(q, T) \qquad (2)$

[0138] In the above formula (2), $\theta$ is the weighted similarity; $\alpha$ and $\beta$ are weighting factors, the specific values of which can be set based on actual needs, and this application embodiment does not limit them; $q$ is the question vector; $P_i$ is a text block vector; $T$ is the document name vector. Accordingly, $S(q, T)$ is the first association similarity and $S(q, P_i)$ is the second association similarity.

[0139] The preset similarity condition can refer to a pre-set similarity threshold condition. For example, if the weighted similarity is greater than or equal to a second preset similarity threshold (e.g., 80% or 90%), the edge device can use the current text block vector as a candidate text block vector for the question vector. Thus, in this embodiment, by calculating weighted similarity across multiple dimensions and integrating document name vectors for similarity calculation, a cross-document, cross-domain fusion evaluation of the correlation between the question vector and the text block vector can be achieved. This enables the question-answering system of the edge device to efficiently and accurately prune text blocks at a coarse-grained level, significantly reducing the candidate set that needs to enter the second stage of calculation. Furthermore, in this embodiment, the preliminary screening of candidate text blocks by calculating weighted similarity avoids the time-consuming global matching retrieval method used in related technologies, adapting to the resource-constrained scenarios of edge devices and meeting the practical requirements under edge computing power limitations.
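
As a non-limiting illustration of the first-stage screening described above, the following Python sketch computes a weighted similarity of the form θ = α·S(q, P_i) + β·S(q, T) for each text block and keeps the blocks that reach the threshold. The use of cosine similarity for S, the weights 0.7 and 0.3, the threshold 0.8, and all function and variable names are assumptions made for illustration; the embodiment does not limit them.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumed here as the similarity function S."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_recall(q, docs, alpha=0.7, beta=0.3, threshold=0.8):
    """Return (block_id, theta) for blocks whose weighted similarity passes.

    docs: list of (doc_name_vector, [(block_id, block_vector), ...]).
    """
    candidates = []
    for name_vec, blocks in docs:
        s_name = cosine(q, name_vec)  # S(q, T), computed once per document
        for block_id, block_vec in blocks:
            theta = alpha * cosine(q, block_vec) + beta * s_name
            if theta >= threshold:
                candidates.append((block_id, theta))
    return candidates


q = [1.0, 0.0]
docs = [
    ([1.0, 0.0], [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]),
    ([0.0, 1.0], [("c", [1.0, 0.0])]),
]
candidate_ids = [block_id for block_id, _ in coarse_recall(q, docs)]
```

In this toy data only block "a" survives: block "b" matches the document name but not the question, and block "c" matches the question but sits in an unrelated document, so neither reaches the threshold.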

[0140] In the second stage, to ensure recall accuracy and control resource consumption, edge devices can use a lightweight re-ranking model to score and refine the recalled candidate text block vectors. This lightweight re-ranking model is a deeply optimized ranking model: rather than relying on simple pruning alone, it employs task-specific knowledge distillation and structured pruning techniques. It can turn a high-precision ranking task into a model workload with extremely low computational complexity without sacrificing ranking accuracy, meeting the edge device's requirements for low power consumption, small memory footprint, and real-time response.

[0141] In one example, during the training phase of a lightweight re-ranking model, edge devices can employ knowledge distillation to transfer the knowledge and decision-making capabilities of a large, high-precision teacher model to a simpler, fewer-parameter student model. This allows the student model to retain the complex decision boundaries of the teacher model while significantly reducing its computational resource requirements, maintaining high ranking accuracy while running efficiently on edge devices. Furthermore, edge devices can selectively remove task-irrelevant or low-contribution neurons and connections from the student model using structured pruning techniques to reduce the model's parameter count and computational complexity. This structured pruning technique does not sacrifice ranking performance but significantly reduces runtime memory usage and computational latency.
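
As a non-limiting illustration of the knowledge distillation mentioned above, the following Python sketch computes one commonly used distillation objective: the Kullback-Leibler divergence between the temperature-softened score distributions of a teacher model and a student model. The embodiment does not specify the loss function, so the choice of this objective, the temperature value, and the function names are assumptions; the temperature-squared scaling that some formulations apply is omitted here.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Relevance scores for three candidate text blocks (made-up numbers).
loss_same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is zero when the student reproduces the teacher's ranking scores and grows as the two distributions diverge, which is the signal a training loop would minimize.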

[0142] For the candidate text block vectors selected in the first stage, the edge device can use the candidate text block vectors as the input of the lightweight ranking model. The lightweight ranking model can sort the N candidate text block vectors based on the correlation between the candidate text block vectors and the question vector. Specifically, it can comprehensively evaluate the potential quality of each candidate text block vector based on the degree of deep semantic matching between the candidate text and the user question information, and the semantic matching length between the context information provided by the text metadata corresponding to the candidate text block vector and the question information. The edge device can select the candidate text block vectors ranked in the top M (M < N) as the associated text block vectors according to the sorting results of the candidate text block vectors. The specific value of M can be set according to the actual resource limitations and application requirements of the edge device, and the embodiments of this application do not limit this.
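
As a non-limiting illustration of the second-stage selection described above, the following Python sketch sorts N candidates by a relevance score and keeps the top M. The scoring function `token_overlap` is only a toy stand-in for the lightweight re-ranking model, and all names are assumptions made for illustration.

```python
from typing import Callable, Sequence


def token_overlap(question: str, text: str) -> float:
    """Toy relevance score: share of question tokens that appear in the text."""
    q_tokens = set(question.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank_top_m(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    m: int,
) -> list[str]:
    """Sort the N candidates by descending score and keep the first M."""
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:m]


candidates = ["pump pressure reset steps", "warranty terms", "pump cleaning"]
top = rerank_top_m("reset pump pressure", candidates, token_overlap, m=2)
```

Because the scorer is passed in as a parameter, a distilled and pruned model could be substituted for the toy function without changing the selection logic.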

[0143] In view of the characteristics of low computing power and high response requirements of edge devices, the edge device in the embodiments of this application adopts an efficient two-stage retrieval strategy. In response to the user's question and answer information (that is, the user issues a question and answer request), in the first stage, the edge device realizes the fast and accurate recall of high-similarity text blocks across documents through weighted similarity calculation based on the correlation between the three dimensions of the question, text block, and document name. In the second stage, a lightweight re-ranking model and text block metadata are used to finely screen the recalled candidate text blocks (vectors). In this way, through the coarse-grained retrieval in the first stage and the fine-grained sorting and fine screening in the second stage, the edge device not only ensures the accuracy of the question and answer, but also significantly reduces the dependence on the computing power resources of the edge device and the response delay by adopting a two-stage mechanism with low computing power, and can adapt to the key constraints of low computing power and small memory of the edge device.

[0144] Based on the above embodiments, Figure 8 is a schematic logical architecture diagram of a question and answer system according to an embodiment of this application. As shown in Figure 8, the question and answer system mainly includes a document preprocessing module, a knowledge base construction module, and a question and answer retrieval module.

[0145] Among them, the document preprocessing module can be used to extract multi-modal information, extract associated text information, and associate and bind the associated text information with the multi-modal information. It can realize the integrated processing and association of multi-modal information, avoid the problems of multi-modal information fragmentation and low parsing robustness in traditional RAG technology when processing knowledge documents, and realize the deep logical encoding and association of multi-modal information. After the original knowledge document (.docx) is processed by this document preprocessing module, the computer terminal can obtain the target document in the target format (.md).

[0146] The knowledge base construction module can semantically divide multiple target documents into multiple semantic blocks. These semantic blocks can then be further divided into multiple text blocks. An adaptive, hierarchical knowledge granularity structure is then constructed based on the semantic block-text block model to serve as the knowledge base. Specifically, the knowledge base can be built based on the document name of the target document, the multiple text blocks corresponding to the document name, and the text block metadata. Alternatively, it can be built based on the document name vector, text block vector, and text block metadata. Subsequent knowledge retrieval can be based on this knowledge base, ensuring question-answering efficiency and reducing question-answering latency.

[0147] In the actual question-and-answer process, users can input question information through interaction with the client. The question-and-answer retrieval module can then perform a two-stage search and matching in the knowledge base based on this question information to determine the associated text block corresponding to the question information. Subsequently, it can perform contextual tracing based on the text block metadata of this associated text block to construct target prompt information. The edge device can then input this target prompt information into its local quantized language model to obtain the final answer information.

[0148] Traditional RAG processes in related technologies cannot quickly and accurately locate the specific document or document set indicated by the user's intent in resource-constrained scenarios such as edge devices. They generally suffer from technical defects such as loss of document information (e.g., loss of non-textual information), low retrieval efficiency, fragmented contextual semantics, and inability to implement text-image question answering in a lightweight manner. Furthermore, deploying and implementing multi-document fast retrieval and user intent question answering on edge devices faces the problem of limited computing resources.

[0149] The question-answering method in this application primarily belongs to the interdisciplinary technical field of artificial intelligence, natural language processing, and edge computing. By efficiently parsing complex multimodal documents, the computer terminal achieves rapid cross-document location and retrieval, as well as deep association of multimodal information. Ultimately, it realizes lightweight knowledge-based question answering on resource-constrained edge devices. This significantly reduces reliance on computing resources while ensuring answer quality and providing rich, illustrated responses, enabling efficient deployment on resource-constrained devices.

[0150] The question-answering method of this application embodiment can significantly improve the quality of answers. Specifically, by constructing and returning complete semantic blocks as context, it effectively solves the context fragmentation problem of traditional RAG, ensuring the integrity and logical coherence of the information provided to the local quantized language model. Furthermore, by combining document names and text block metadata in the retrieval step, this application embodiment can improve the accuracy of the final answer. For example, Table 1 is a schematic table of accuracy comparison of this application embodiment, as follows:

[0151] Table 1

[0152]

[0153] As can be seen from Table 1, compared with the LLM direct answer or the traditional RAG process in related technologies, the question-answering method of this application has a higher retrieval accuracy and answer accuracy.

[0154] The question-answering method in this application overcomes the bottleneck of image-text question-answering on edge devices. Through an image-text binding mode, it successfully achieves efficient and low-latency image-text joint question-answering functionality on resource-constrained edge devices without deploying large-scale multimodal models, demonstrating high practicality and deployability. Furthermore, the question-answering method in this application is adaptable to edge devices in resource-constrained scenarios, significantly reducing computing power requirements and deployment costs. It enables the deployment of knowledge question-answering systems in information privacy and security-sensitive environments as well as low-cost application environments, expanding the application scope of question-answering systems.

[0155] In practical scenarios, the question-answering method in this application embodiment can achieve low-latency text-image question-answering. By adopting a lightweight LLM (the local quantized language model) and an optimized two-stage edge retrieval architecture, it can achieve rapid response (e.g., within 3 seconds) to complex questions on resource-constrained edge terminals (such as industrial tablets or in-vehicle systems) without significant network latency or power consumption spikes. Furthermore, the answer information in the question-answering method of this application embodiment can include the original path; that is, the answer information returned by the question-answering system explicitly outputs the image storage path (e.g., /data/img/X.png) or the original table path near the image or table information. Through character-level path embedding and path return mechanisms, a richly illustrated answer can be achieved, ensuring the quality of the answer information.

[0156] The question-answering method in this embodiment exhibits high efficiency in cross-document queries. In large knowledge bases, when a user's question involves document names or sources, the question-answering system in this embodiment can quickly focus the search scope on the documents mentioned by the user by combining the document name during the retrieval process, for example, by adopting a multi-dimensional recall strategy based on document name vectors. Furthermore, the question-answering method in this embodiment, through a metadata-driven semantic block tracing and reconstruction mechanism, can adapt to efficient context assembly at the edge, resulting in clearer retrieval context boundaries. The text length of the cited sources (e.g., context information determined based on semantic blocks) provided in the final output answer information is significantly longer than the basic vector block (i.e., the associated text block vector), and the content is semantically complete and has coherent boundaries (e.g., containing complete paragraphs or captions), ensuring the quality of the answer information.

[0157] The question-answering method of this application is applicable to scenarios where data privacy is sensitive or rapid response is required, such as in the medical, industrial, and financial fields. For example, this question-answering method can be used as a smart assistant in factories to help workers answer questions about process knowledge; it can also be used as a smart assistant in medical settings to help doctors analyze medical records; and it can be used in the financial industry as a smart assistant to help financial experts analyze and evaluate financial data. Of course, this question-answering method can also be applied to other scenarios, and the specific application can be flexibly selected based on actual needs; this application does not limit its application in this regard.

[0158] It should be understood that in the various embodiments of this application, the sequence number of each process and step does not imply the order of execution. The execution order of each process and step should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0159] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods of the various embodiments of this application.

[0160] This embodiment also provides a question-and-answer device for implementing the above embodiments and implementation methods; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. The question-and-answer device includes:

[0161] The search module is used to search for the answer information corresponding to the question information in the knowledge base in response to receiving the user's question information.

[0162] In one exemplary embodiment, the device is further used for:

[0163] Extract multimodal information from multiple knowledge documents; wherein, multimodal information includes at least one of text information, table information, and image information;

[0164] In response to the fact that multimodal information includes non-textual information, the associated textual information corresponding to the non-textual information is determined, and the associated textual information is linked and bound to the multimodal information to obtain multiple target documents corresponding to multiple knowledge documents; non-textual information includes image information and/or table information;

[0165] A knowledge base is constructed based on multiple target documents; the knowledge base includes: the document name of each target document, multiple text blocks corresponding to each document name, and the text block metadata corresponding to each text block.

[0166] In one exemplary embodiment, the device is further used for:

[0167] In the knowledge base, select the associated text block that is related to the question information from the text block corresponding to the document name;

[0168] Based on the question information, related text blocks, and the metadata of the related text blocks, construct the target prompt information;

[0169] The target prompt information is input into the local quantized language model to obtain the answer information.

[0170] In one exemplary embodiment, the knowledge base is a vector knowledge base, which includes: multiple document name vectors corresponding to multiple target documents, multiple text block vectors corresponding to each document name vector, and text block metadata corresponding to each text block vector; the device is further used for:

[0171] Based on the first association similarity between the question vector corresponding to the question information and the document name vector, and the second association similarity between the question vector and the text block vector, the associated text block vectors that are associated with the question vector are selected from the text block vectors corresponding to the document name vector;

[0172] Based on the question vector, the associated text block vector, and the associated text block metadata corresponding to the associated text block vector, construct the target prompt information;

[0173] The target prompt information is input into the local quantized language model to obtain the answer information.

[0174] In one exemplary embodiment, the device is further used for:

[0175] Multimodal information is extracted from knowledge documents using a structured parser to obtain candidate multimodal information;

[0176] In response to the fact that the extraction accuracy of the candidate multimodal information is greater than or equal to a preset accuracy threshold, the candidate multimodal information is used as the multimodal information in the knowledge document; or, in response to the failure of the structured parser to parse or the extraction accuracy being less than a preset accuracy threshold, the multimodal information in the knowledge document is extracted by the target visual recognition model to obtain the multimodal information in the knowledge document.
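
As a non-limiting illustration of the fallback logic described above, the following Python sketch tries a structured parser first and switches to a visual recognition model when parsing fails or the extraction accuracy is below the threshold. The three callables, the way a parse failure is signaled (an exception or a `None` result), and the threshold value are all assumptions made for illustration; the embodiment does not specify how extraction accuracy is measured.

```python
from typing import Any, Callable, Optional


def extract_multimodal(
    document: bytes,
    structured_parser: Callable[[bytes], Optional[Any]],
    visual_model: Callable[[bytes], Any],
    accuracy_of: Callable[[Any], float],
    threshold: float = 0.9,
) -> Any:
    """Prefer the cheap structured parser; degrade to the visual model if needed."""
    try:
        candidate = structured_parser(document)
    except Exception:
        candidate = None  # treat any parser error as a parse failure
    if candidate is not None and accuracy_of(candidate) >= threshold:
        return candidate
    return visual_model(document)


def failing_parser(_: bytes):
    raise ValueError("corrupt document structure")


# Hypothetical stand-ins for the real components.
good = extract_multimodal(b"doc", lambda d: {"text": "ok"}, lambda d: {"text": "ocr"},
                          accuracy_of=lambda c: 0.95)
fallback = extract_multimodal(b"doc", failing_parser, lambda d: {"text": "ocr"},
                              accuracy_of=lambda c: 0.95)
```

The visual model is invoked only on the failure path, so documents that parse cleanly never incur its cost.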

[0177] In one exemplary embodiment, in response to the multimodal information including text information and image information, the device is further configured to:

[0178] Determine the image storage path corresponding to the image information; determine the original location of the image corresponding to the image information in the text information, and modify the image information at the original location to the image storage path.
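
As a non-limiting illustration of replacing image information with its storage path, the following Python sketch substitutes inline image placeholders in the extracted text with the corresponding paths. The `{{image:ID}}` placeholder syntax, the mapping from image identifier to path, and the function name are hypothetical and serve only to make the example concrete.

```python
import re

# Hypothetical placeholder left in the text at the original image location.
IMAGE_PLACEHOLDER = re.compile(r"\{\{image:([A-Za-z0-9_]+)\}\}")


def bind_image_paths(text: str, image_paths: dict[str, str]) -> str:
    """Replace each image placeholder with its storage path, in place."""
    def swap(match: re.Match) -> str:
        image_id = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a path.
        return image_paths.get(image_id, match.group(0))

    return IMAGE_PLACEHOLDER.sub(swap, text)


bound = bind_image_paths(
    "Open the cover as shown in {{image:fig1}} before cleaning.",
    {"fig1": "/data/img/fig1.png"},
)
```

Because the path is written directly into the text at the original image location, the path travels with the surrounding sentence through segmentation and retrieval.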

[0179] In one exemplary embodiment, the device is further used for:

[0180] Determine the corresponding caption information for the image information and the position of the paragraph in which the image information is quoted in the text information;

[0181] In the text information, move the image storage path and caption information to the location of the quoted paragraph.

[0182] In one exemplary embodiment, the device is further used for:

[0183] Determine the target logical identifier corresponding to the image information; the target logical identifier includes the reference paragraph position identifier, the image storage path identifier, and the caption information identifier;

[0184] Embed the target logical identifier at the corresponding location of the image information in the text information.
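
As a non-limiting illustration of a target logical identifier, the following Python sketch builds a tag that carries the reference paragraph position, the image storage path, and the caption information, and appends it to the paragraph that references the image. The bracketed tag format and the function names are hypothetical; the embodiment does not define a concrete encoding.

```python
def make_logical_identifier(paragraph_pos: int, image_path: str, caption: str) -> str:
    """Encode the three identifiers in one tag (hypothetical format)."""
    return f'[IMG ref_para={paragraph_pos} path={image_path} caption="{caption}"]'


def embed_identifier(
    paragraphs: list[str], paragraph_pos: int, image_path: str, caption: str
) -> list[str]:
    """Return a copy of the paragraphs with the tag appended to the referencing one."""
    tag = make_logical_identifier(paragraph_pos, image_path, caption)
    result = list(paragraphs)  # do not mutate the caller's list
    result[paragraph_pos] = f"{result[paragraph_pos]}\n{tag}"
    return result


paragraphs = ["Introduction.", "As shown in Figure 1, the valve is closed."]
tagged = embed_identifier(paragraphs, 1, "/data/img/fig1.png", "Figure 1: Valve")
```

Keeping the tag adjacent to the referencing paragraph means that a text block cut from that paragraph still carries the image path and caption.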

[0185] In one exemplary embodiment, in response to multimodal information including text information and table information, the device is further configured to:

[0186] Extract column header information and key data relationships from the table information, and generate summary text corresponding to the table information based on the column header information and key data relationships;

[0187] Determine the original position of the table information in the text information, and insert summary text at the original position of the table.
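
As a non-limiting illustration of generating summary text from a table, the following Python sketch turns column headers and rows into short sentences. Treating one column as the key and the sentence template itself are assumptions made for illustration; the embodiment does not specify how key data relationships are identified or phrased.

```python
def summarize_table(
    headers: list[str], rows: list[list[str]], key_column: int = 0
) -> str:
    """Describe each row relative to its key column (hypothetical template)."""
    sentences = [f"The table has columns: {', '.join(headers)}."]
    for row in rows:
        facts = [
            f"{header} is {value}"
            for index, (header, value) in enumerate(zip(headers, row))
            if index != key_column
        ]
        sentences.append(
            f"For {headers[key_column]} {row[key_column]}, " + ", ".join(facts) + "."
        )
    return " ".join(sentences)


summary = summarize_table(
    ["Model", "Power", "Weight"],
    [["X1", "5 W", "2 kg"]],
)
```

The resulting string can be inserted at the original position of the table so that the table content becomes retrievable as ordinary text.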

[0188] In one exemplary embodiment, the device is further used for:

[0189] Semantically segment the content of each target document to obtain multiple semantic blocks;

[0190] Each semantic block is segmented to obtain multiple text blocks;

[0191] A knowledge base is constructed based on the text block, the document name corresponding to the target document, and the text block metadata corresponding to the text block. The text block metadata includes at least one of the following: the document name of the target document to which the text block belongs, the semantic block identifier of the semantic block to which the text block belongs, the position identifier of the text block in its semantic block, and the text block type of the text block.
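
As a non-limiting illustration of the two-level segmentation and the text block metadata described above, the following Python sketch splits a target document into semantic blocks and then into text blocks, attaching metadata to each. Using blank lines as semantic boundaries, fixed-length splitting inside a semantic block, and the dictionary field names are simplifying assumptions made for illustration, not the segmentation method of the embodiment.

```python
def build_blocks(doc_name: str, content: str, max_chars: int = 40) -> list[dict]:
    """Split content into semantic blocks, then text blocks with metadata."""
    blocks = []
    # Assumption: a blank line separates two semantic blocks.
    semantic_blocks = [s.strip() for s in content.split("\n\n") if s.strip()]
    for sem_index, sem_text in enumerate(semantic_blocks):
        pieces = [
            sem_text[i:i + max_chars] for i in range(0, len(sem_text), max_chars)
        ]
        for position, piece in enumerate(pieces):
            blocks.append({
                "doc_name": doc_name,                      # owning target document
                "semantic_id": f"{doc_name}#{sem_index}",  # owning semantic block
                "position": position,                      # order inside that block
                "type": "text",                            # text block type
                "text": piece,
            })
    return blocks


blocks = build_blocks("manual", "A" * 50 + "\n\n" + "B" * 10)
```

The semantic block identifier and position stored with each text block are what later allow a single retrieved block to be traced back to its complete semantic block.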

[0192] In one exemplary embodiment, the device is further used for:

[0193] Calculate the first association similarity between the question vector and the document name vector, and calculate the second association similarity between the question vector and the text block vector corresponding to the document name vector;

[0194] Calculate the weighted similarity between the question vector and the text block vector based on the first and second association similarities.

[0195] In response to the weighted similarity meeting the preset similarity condition, candidate text block vectors corresponding to the question vector are determined from the text block vectors;

[0196] Based on the degree of correlation between the candidate text block vector and the question vector, the candidate text block vectors are sorted to obtain the associated text block vectors.

[0197] In one exemplary embodiment, the device is further used for:

[0198] Based on the metadata of the associated text block, determine the associated semantic block corresponding to the associated text block;

[0199] Determine the context information corresponding to the associated semantic block;

[0200] Based on the context and question information, construct the target prompt information.

[0201] In one exemplary embodiment, the device is further used for:

[0202] By using a local quantized language model, the contextual information in the target prompt is used as background knowledge to generate answer information for the question. The answer information includes at least one of the following: the image storage path corresponding to the image information, the original text of the table path corresponding to the table information, and the document name corresponding to the context information.

[0203] It should be noted that the above modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, but are not limited to: all the above modules are located in the same processor; or, the above modules are located in different processors in any combination.

[0204] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above-described question-and-answer method embodiments when run.

[0205] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), computer hard disk, magnetic disk, or optical disk.

[0206] Figure 9 is a structural block diagram of an edge device according to an embodiment of this application. As shown in Figure 9, embodiments of this application also provide an edge device 90, including a memory 901 and a processor 902. The memory 901 stores a computer program, and the processor 902 is configured to run the computer program to perform the steps in any of the above-described question-and-answer method embodiments.

[0207] In one exemplary embodiment, the edge device may further include a transmission device and an input / output device, wherein the transmission device is connected to the processor and the input / output device is connected to the processor.

[0208] Embodiments of this application also provide a computer program product, including a computer program that, when executed by a processor, implements the steps in any of the above-described question-and-answer method embodiments.

[0209] Specific examples in this embodiment can be found in the examples described in the above embodiments and exemplary implementations, and will not be repeated here.

[0210] Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. They can be implemented using computer-executable program code, and thus can be stored in a storage device for execution by a computing device. In some cases, the steps shown or described can be performed in a different order than those presented here, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, this application is not limited to any particular combination of hardware and software.

[0211] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the principles of this application should be included within the protection scope of this application.

Claims

1. A question-and-answer method, characterized in that it is applied to an edge device and includes: in response to receiving a user's question information, searching a knowledge base for answer information corresponding to the question information.

2. The method according to claim 1, characterized in that, The method for constructing the knowledge base includes: Extract multimodal information from multiple knowledge documents; wherein the multimodal information includes at least one of text information, table information, and image information; In response to the fact that the multimodal information includes non-textual information, the associated textual information corresponding to the non-textual information is determined, and the associated textual information is associated and bound with the multimodal information to obtain multiple target documents corresponding to the multiple knowledge documents; the non-textual information includes image information and/or table information; Based on the multiple target documents, the knowledge base is constructed; wherein, the knowledge base includes: the document name of each target document, multiple text blocks corresponding to each document name, and text block metadata corresponding to each text block.

3. The method according to claim 2, characterized in that, The step of searching the knowledge base for the answer information corresponding to the question information includes: In the knowledge base, select the associated text block that is related to the question information from the text block corresponding to the document name; Based on the question information, the associated text block, and the metadata of the associated text block, construct the target prompt information; The target prompt information is input into the local quantized language model to obtain the answer information.

4. The method according to claim 2, characterized in that, The knowledge base is a vector knowledge base, which includes: multiple document name vectors corresponding to the multiple target documents, multiple text block vectors corresponding to each document name vector, and text block metadata corresponding to each text block vector; the step of searching for the answer information corresponding to the question information in the knowledge base includes: Based on the first association similarity between the question vector corresponding to the question information and the document name vector, and the second association similarity between the question vector and the text block vector, an associated text block vector that is associated with the question vector is selected from the text block vector corresponding to the document name vector; Based on the question vector, the associated text block vector, and the associated text block metadata corresponding to the associated text block vector, a target prompt information is constructed; The target prompt information is input into the local quantized language model to obtain the answer information.

5. The method according to claim 2, characterized in that, The extraction of multimodal information from multiple knowledge documents includes: The knowledge document is processed by a structured parser to extract multimodal information and obtain candidate multimodal information. In response to the extraction accuracy corresponding to the candidate multimodal information being greater than or equal to a preset accuracy threshold, the candidate multimodal information is used as the multimodal information in the knowledge document; or, in response to the structured parser failing to parse or the extraction accuracy being less than the preset accuracy threshold, the multimodal information in the knowledge document is extracted by a target visual recognition model to obtain the multimodal information in the knowledge document.

6. The method according to claim 2, characterized in that, In response to the multimodal information including text information and image information, determining the associated text information corresponding to the non-text information and associating and binding the associated text information with the multimodal information includes: Determine the image storage path corresponding to the image information; determine the original image location corresponding to the image information in the text information, and modify the image information at the original image location to the image storage path.

7. The method according to claim 6, characterized in that, after associating and binding the associated text information with the multimodal information, the method further includes: determining the caption information corresponding to the image information and the position of the paragraph in the text information that references the image information; and moving, in the text information, the image storage path and the caption information to the position of the referencing paragraph.

8. The method according to claim 6, characterized in that, after associating and binding the associated text information with the multimodal information, the method further includes: determining the target logical identifier corresponding to the image information, the target logical identifier including the reference paragraph position identifier, the image storage path identifier, and the caption information identifier corresponding to the image information; and embedding the target logical identifier at the associated position corresponding to the image information in the text information.
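A target logical identifier of the kind recited in claim 8 could, for illustration, be serialized as a short inline tag. The tag layout `[IMG|para=...|path=...|caption=...]`, the `ImageRef` class, and the choice to append the tag to the referencing paragraph are all assumptions for this sketch; the claims only require that the three identifiers be embedded at the associated position.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageRef:
    paragraph_index: int  # reference paragraph position identifier
    storage_path: str     # image storage path identifier
    caption: str          # caption information identifier

    def to_identifier(self) -> str:
        # Hypothetical serialization of the target logical identifier.
        return (
            f"[IMG|para={self.paragraph_index}"
            f"|path={self.storage_path}|caption={self.caption}]"
        )


def embed_identifier(paragraphs: List[str], ref: ImageRef) -> List[str]:
    """Return a copy of the paragraphs with the identifier embedded."""
    out = list(paragraphs)  # do not mutate the caller's list
    out[ref.paragraph_index] = f"{out[ref.paragraph_index]} {ref.to_identifier()}"
    return out
```

Keeping the paragraph position, path, and caption together in one tag means a retrieved text block carries enough information to locate and describe the image without a separate lookup.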

9. The method according to claim 2, characterized in that, in response to the multimodal information including text information and table information, determining the associated text information corresponding to the non-text information and associating and binding the associated text information with the multimodal information includes: extracting the column header information and key data relationships from the table information, and generating a summary text corresponding to the table information based on the column header information and the key data relationships; and determining the original position of the table information in the text information, and inserting the summary text at the original position of the table.
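The table handling of claim 9 can be illustrated with a simple template-based summary. This sketch assumes that the "key data relationships" are the values of each row relative to a key column, and that the table's original position is marked by a hypothetical placeholder string; neither detail is fixed by the claims.

```python
from typing import List, Sequence


def summarize_table(
    headers: Sequence[str],
    rows: Sequence[Sequence[str]],
    key_column: int = 0,  # assumed: the first column identifies each row
) -> str:
    """Turn column headers and per-row relationships into summary text."""
    sentences: List[str] = [f"Table with columns: {', '.join(headers)}."]
    for row in rows:
        facts = [
            f"{header} is {value}"
            for i, (header, value) in enumerate(zip(headers, row))
            if i != key_column
        ]
        sentences.append(
            f"For {headers[key_column]} {row[key_column]}: {'; '.join(facts)}."
        )
    return " ".join(sentences)


def insert_summary(text: str, table_marker: str, summary: str) -> str:
    """Insert the summary where the table originally stood (first match)."""
    return text.replace(table_marker, summary, 1)
```

A template like this needs no model inference, which suits a resource-constrained device; a generated summary could be substituted where resources allow.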

10. The method according to claim 2, characterized in that the construction of the knowledge base based on the multiple target documents includes: semantically segmenting the content of each target document to obtain multiple semantic blocks; segmenting each semantic block to obtain multiple text blocks; and constructing the knowledge base based on the text blocks, the document name corresponding to the target document, and the text block metadata corresponding to each text block; the text block metadata includes at least one of the following: the document name of the target document to which the text block belongs, the semantic block identifier of the semantic block to which the text block belongs, the position identifier of the text block in the semantic block to which it belongs, and the text block type of the text block.
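The two-level segmentation and the metadata fields of claim 10 can be sketched as below. As simplifying assumptions, blank lines stand in for semantic boundaries and fixed-size character windows stand in for the second-level split; the claims do not prescribe either rule, and `TextBlock`, `build_text_blocks`, and `max_chars` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    text: str
    document_name: str      # document the block belongs to
    semantic_block_id: int  # semantic block the block belongs to
    position: int           # position of the block inside its semantic block
    block_type: str = "text"


def build_text_blocks(
    document_name: str, content: str, max_chars: int = 200
) -> List[TextBlock]:
    """Split a document into semantic blocks, then into text blocks."""
    # Level 1 (assumed rule): blank lines delimit semantic blocks.
    semantic_blocks = [b.strip() for b in content.split("\n\n") if b.strip()]

    blocks: List[TextBlock] = []
    for sem_id, semantic_block in enumerate(semantic_blocks):
        # Level 2 (assumed rule): fixed-size character windows.
        for pos, start in enumerate(range(0, len(semantic_block), max_chars)):
            piece = semantic_block[start:start + max_chars]
            blocks.append(TextBlock(piece, document_name, sem_id, pos))
    return blocks
```

Each text block would then be embedded and stored with its metadata, so that a retrieved block can be traced back to its semantic block and document.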

11. The method according to claim 4, characterized in that the step of selecting an associated text block vector that is associated with the question vector from the text block vectors corresponding to the document name vector, based on the first association similarity between the question vector corresponding to the question information and the document name vector, and the second association similarity between the question vector and the text block vector, includes: calculating the first association similarity between the question vector and the document name vector, and calculating the second association similarity between the question vector and the text block vector corresponding to the document name vector; calculating, based on the first association similarity and the second association similarity, the weighted similarity between the question vector and the text block vector; in response to the weighted similarity satisfying a preset similarity condition, determining a candidate text block vector corresponding to the question vector from the text block vectors; and sorting the candidate text block vectors based on the degree of correlation between each candidate text block vector and the question vector to obtain the associated text block vector.
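The weighted retrieval of claim 11 might look like the following sketch. Several details are assumptions rather than claim language: cosine similarity is used for both association similarities, the weighted similarity is a linear combination with a hypothetical `name_weight`, the "preset similarity condition" is a minimum threshold, and the weighted score doubles as the degree of correlation used for sorting.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_text_blocks(
    question_vec: Sequence[float],
    doc_name_vecs: Dict[str, Sequence[float]],
    block_vecs: Dict[str, List[Sequence[float]]],
    name_weight: float = 0.3,  # assumed weight for the document-name score
    threshold: float = 0.5,    # assumed "preset similarity condition"
    top_k: int = 3,
) -> List[Tuple[str, int, float]]:
    """Return (document name, block index, weighted score), best first."""
    candidates: List[Tuple[str, int, float]] = []
    for doc, name_vec in doc_name_vecs.items():
        first = cosine(question_vec, name_vec)        # first association similarity
        for idx, block_vec in enumerate(block_vecs.get(doc, [])):
            second = cosine(question_vec, block_vec)  # second association similarity
            weighted = name_weight * first + (1.0 - name_weight) * second
            if weighted >= threshold:
                candidates.append((doc, idx, weighted))
    candidates.sort(key=lambda item: item[2], reverse=True)
    return candidates[:top_k]
```

Weighting in the document-name score lets a block from a clearly relevant document outrank an equally similar block from an unrelated one.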

12. The method according to claim 3, characterized in that the step of constructing target prompt information based on the question information, the associated text block, and the metadata of the associated text block includes: determining, based on the associated text block metadata, the associated semantic block corresponding to the associated text block; determining the context information corresponding to the associated semantic block; and constructing the target prompt information based on the context information and the question information.
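The prompt construction of claim 12 can be sketched as expanding each retrieved text block to its enclosing semantic block and placing that context ahead of the question. The metadata keys, the lookup keyed by `(document_name, semantic_block_id)`, and the prompt wording are illustrative assumptions only.

```python
from typing import Dict, List, Tuple


def build_prompt(
    question: str,
    hit_metadata: List[dict],
    semantic_blocks: Dict[Tuple[str, int], str],
) -> str:
    """Build a prompt from the semantic blocks that contain the hits."""
    seen = set()
    context_parts: List[str] = []
    for meta in hit_metadata:
        key = (meta["document_name"], meta["semantic_block_id"])
        # Several hits may share one semantic block; include it only once.
        if key in seen or key not in semantic_blocks:
            continue
        seen.add(key)
        context_parts.append(f"[{key[0]}] {semantic_blocks[key]}")

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the background knowledge below.\n\n"
        f"Background knowledge:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Deduplicating by semantic block keeps the prompt short, which matters when the local model has a small context window.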

13. The method according to claim 12, characterized in that the step of inputting the target prompt information into the local quantized language model to obtain the answer information includes: using the local quantized language model, with the context information in the target prompt information as background knowledge, to generate answer information for the question information; the answer information includes at least one of the following: the image storage path corresponding to the image information, the original text of the table path corresponding to the table information, and the document name corresponding to the context information.

14. An edge device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the question-and-answer method according to any one of claims 1 to 13.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, it implements the steps of the question-and-answer method according to any one of claims 1 to 13.

16. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the question-and-answer method according to any one of claims 1 to 13.