Long text question and answer method and device, electronic equipment and readable storage medium
By employing hybrid retrieval based on a search index, global information extraction, and chained verification processing, the problem of context fragmentation in long text question answering is solved, enabling the generation of more accurate and relevant answers.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: BEIJING JIZHI DIGITAL TECH CO LTD
- Filing Date: 2026-02-10
- Publication Date: 2026-06-19
Description
Technical Field
[0001] This disclosure relates to the field of data retrieval technology, and in particular to a long text question-answering method, apparatus, electronic device, and readable storage medium.
Background Technology
[0002] Existing long-text question answering systems typically employ Retrieval-Augmented Generation (RAG) technology, which assists in generating answers by segmenting long documents and retrieving relevant fragments. However, segmentation disrupts the global structure of the document, resulting in missing fragment contexts and making it difficult for the model to grasp the overall semantics, thus affecting the accuracy of the answers. Expanding the model's context window to process even longer texts significantly increases computational costs, and the improvement in retrieval accuracy is limited under incomplete fragment semantics.
[0003] It is evident that existing technologies suffer from low accuracy and high retrieval noise in long text question answering due to their inability to fully process long text data, fragmented contextual information, and lack of a systematic review process.
Summary of the Invention
[0004] In view of this, the present disclosure provides a long text question answering method, apparatus, electronic device, and readable storage medium to solve the problems in the prior art that the inability to fully process long text data, the fragmentation of contextual information, and the lack of a systematic review process lead to low accuracy and high retrieval noise in long text question answering.
[0005] A first aspect of this disclosure provides a long text question answering method, comprising: performing mixed retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data; performing global information extraction processing on the retrieval result fragment dataset to obtain global context data; performing chained verification processing on the long text question data, global context data, and retrieval result fragment dataset to obtain a semantic fragment dataset; and performing answer generation processing on the global context data and semantic fragment dataset to obtain target answer text data corresponding to the long text question data.
[0006] In some embodiments, performing global information extraction processing on the retrieval result fragment dataset to obtain global context data includes: performing fragment mapping processing on the retrieval result fragment dataset to obtain a retrieval paragraph dataset; and performing context extraction processing on the retrieval paragraph dataset to obtain global context data.
[0007] In some embodiments, performing chained verification processing on the long text question data, global context data, and retrieval result fragment dataset to obtain a semantic fragment dataset includes: performing reasoning generation processing on the long text question data and global context data to obtain chain-thinking data; performing fragment evaluation processing on the retrieval result fragment dataset based on the chain-thinking data to obtain evaluation state data; and performing filtering processing on the evaluation state data to obtain the semantic fragment dataset.
[0008] In some embodiments, before performing mixed retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data, the method further includes: segmenting the long text data to be processed to obtain an initial fragment dataset; performing context expansion processing on the initial fragment dataset to obtain a text fragment dataset; and performing index building processing on the text fragment dataset to obtain a preset retrieval index library.
[0009] In some embodiments, a hybrid retrieval process is performed on long text question data based on a preset retrieval index library to obtain a dataset of retrieval result fragments corresponding to the long text question data. This includes: vectorizing the long text question data to obtain long text question vectors; performing similarity retrieval processing on the long text question vectors to obtain a vector retrieval dataset; extracting keywords from the long text question data to obtain question keyword data; performing full-text retrieval processing on the question keyword data to obtain a full-text retrieval dataset; and merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain a dataset of retrieval result fragments.
[0010] In some embodiments, merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain a retrieval result fragment dataset includes: merging and sorting the vector retrieval dataset and the full-text retrieval dataset to obtain a merged dataset; and performing confidence filtering on the merged dataset to obtain a retrieval result fragment dataset.
[0011] In some embodiments, answer generation processing is performed on global context data and semantic fragment datasets to obtain target answer text data corresponding to long text question data, including: performing knowledge fusion processing on global context data and semantic fragment datasets to obtain fused question data; and performing answer reasoning processing on fused question data to obtain target answer text data.
[0012] A second aspect of this disclosure provides a long text question-answering device, comprising: a first processing module, configured to perform mixed retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data; a second processing module, configured to perform global information extraction processing on the retrieval result fragment dataset to obtain global context data; a third processing module, configured to perform chained verification processing on the long text question data, global context data, and retrieval result fragment dataset to obtain a semantic fragment dataset; and a fourth processing module, configured to perform answer generation processing on the global context data and semantic fragment dataset to obtain target answer text data corresponding to the long text question data.
[0013] A third aspect of this disclosure provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.
[0014] A fourth aspect of this disclosure provides a readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.
[0015] The beneficial effects of this disclosed embodiment compared with the prior art are as follows: By performing mixed retrieval processing on long text question data based on a preset retrieval index library, a dataset of retrieval result fragments corresponding to the long text question data is obtained; global information extraction processing is performed on the dataset of retrieval result fragments to obtain global context data; chain verification processing is performed on the long text question data, global context data, and dataset of retrieval result fragments to obtain a semantic fragment dataset; and answer generation processing is performed on the global context data and semantic fragment dataset to obtain the target answer text data corresponding to the long text question data. This improves the contextual accuracy of long text question answering, reduces retrieval noise, enhances the ability to understand global information, improves the relevance and accuracy of retrieval results, and strengthens the factual basis and logical coherence of the answer.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure; Figure 2 is a flowchart of a long text question answering method provided in an embodiment of this disclosure; Figure 3 is a flowchart of another long text question answering method provided in an embodiment of this disclosure; Figure 4 is a schematic structural diagram of a long text question answering device provided in an embodiment of this disclosure; Figure 5 is a schematic structural diagram of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0018] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, so as to provide a thorough understanding of the embodiments of this disclosure. However, those skilled in the art will understand that this disclosure may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this disclosure with unnecessary detail.
[0019] It should be noted that the user information (including but not limited to terminal device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties.
[0020] A long text question-and-answer method and apparatus according to embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings.
[0021] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure. The application scenario may include terminal devices 1, 2, and 3, server 4, and network 5.
[0022] Terminal devices 1, 2, and 3 can be hardware or software. When terminal devices 1, 2, and 3 are hardware, they can be various electronic devices with displays and supporting communication with server 4, including but not limited to smartphones, tablets, laptops, and desktop computers. When terminal devices 1, 2, and 3 are software, they can be installed in the aforementioned electronic devices. Terminal devices 1, 2, and 3 can be implemented as multiple software programs or software modules, or as a single software program or software module; this disclosure does not limit this. Furthermore, various applications can be installed on terminal devices 1, 2, and 3, such as data processing applications, instant messaging tools, social platform software, search applications, shopping applications, etc.
[0023] Server 4 can be a server that provides various services, such as a backend server that receives requests from terminal devices with which it has established communication connections. This backend server can receive and analyze those requests and generate processing results. Server 4 can be a single server, a server cluster consisting of several servers, or a cloud computing service center; the embodiments of this disclosure do not limit this.
[0024] It should be noted that server 4 can be either hardware or software. When server 4 is hardware, it can be any of various electronic devices that provide services to terminal devices 1, 2, and 3. When server 4 is software, it can be implemented as multiple software programs or software modules, or as a single software program or software module, that provide services to terminal devices 1, 2, and 3; the embodiments of this disclosure do not limit this.
[0025] Network 5 can be a wired network connected by coaxial cable, twisted pair, or optical fiber, or a wireless network that interconnects communication devices without wiring, for example via Bluetooth, Near Field Communication (NFC), or infrared; the embodiments of this disclosure do not limit this.
[0026] Users can establish a communication connection with server 4 via network 5 through terminal devices 1, 2, and 3 to receive or send information. Specifically, server 4 can obtain long text question data through terminal devices 1, 2, and 3, perform mixed retrieval processing on the long text question data based on a preset retrieval index library to obtain a dataset of retrieval result fragments corresponding to the long text question data; perform global information extraction processing on the dataset of retrieval result fragments to obtain global context data; perform chained verification processing on the long text question data, global context data, and dataset of retrieval result fragments to obtain a semantic fragment dataset; and perform answer generation processing on the global context data and semantic fragment dataset to obtain the target answer text data corresponding to the long text question data.
[0027] It should be noted that the specific types, quantities, and combinations of terminal devices 1, 2, and 3, server 4, and network 5 can be adjusted according to the actual needs of the application scenario, and this disclosure embodiment does not impose any restrictions on this.
[0028] Figure 2 is a flowchart of a long text question answering method provided in an embodiment of this disclosure. The method of Figure 2 can be executed by the server of Figure 1. As shown in Figure 2, the long text question answering method includes: S201: Based on a preset retrieval index library, perform hybrid retrieval processing on the long text question data to obtain the retrieval result fragment dataset corresponding to the long text question data.
[0029] Specifically, the preset retrieval index can be a structured data storage system pre-built to achieve efficient information retrieval. This preset retrieval index can take the form of a data index set that supports subsequent retrieval operations. It can be used to store processed long document fragments and their associated vectorized representations and full-text indexes. This preset retrieval index can be obtained through long document segmentation, specifically by dividing the long document into fixed-length fragments, with sentences as the smallest unit. Furthermore, to maintain semantic coherence, a sliding window strategy can be used to add overlapping content from the previous sentence to each fragment for contextual expansion. Finally, the processed fragments can be encoded into vectors, embedded, and a full-text index can be built, all of which are then stored in the preset retrieval index.
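The segmentation and context expansion described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the character budget `max_chars`, the regex-based sentence splitter, and the `overlap_sentences` parameter are all assumptions introduced here, and the vector encoding and full-text indexing steps are omitted.

```python
import re

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace (a deliberate simplification)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def build_fragments(text, max_chars=80, overlap_sentences=1):
    """Group whole sentences into fragments of roughly max_chars characters,
    then prepend the trailing sentence(s) of the preceding fragment."""
    groups, current, length = [], [], 0
    for sentence in split_sentences(text):
        # Close the current group when adding this sentence would exceed the budget.
        if current and length + len(sentence) > max_chars:
            groups.append(current)
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)
    if current:
        groups.append(current)

    fragments = []
    for i, group in enumerate(groups):
        # Sliding-window context expansion: reuse the end of the previous group.
        if i > 0 and overlap_sentences > 0:
            context = groups[i - 1][-overlap_sentences:]
        else:
            context = []
        fragments.append({"id": i, "text": " ".join(context + group)})
    return fragments
```

Because a sentence is never split, a single sentence longer than `max_chars` simply becomes its own fragment in this sketch.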
[0030] Long text question data can be natural language query information that requires answers from long text. The form of long text question data can be a question or request made by a user that includes long text content.
[0031] Hybrid retrieval processing can be a data processing procedure that combines retrieval modes such as semantic similarity and keyword matching. It can search for relevant information through vector retrieval and full-text retrieval, thereby finding the set of document fragments most relevant to the long text question data from a pre-defined retrieval index. Specifically, the input long text question data can be vectorized and keywords parsed, and vector similarity retrieval and full-text matching retrieval can be performed in parallel in the pre-defined retrieval index. The candidate results returned by the retrieval modes are then fused and sorted, and the top k most relevant fragments are selected, where k can be a pre-defined threshold for the number of selected results. This yields the retrieval result fragment dataset corresponding to the long text question data. The retrieval result fragment dataset can be a set of document fragments obtained through hybrid retrieval processing, that is, a series of text fragments most relevant to the question as determined by retrieval.
[0032] Furthermore, the retrieval result fragment dataset can be represented as {pc_1, pc_2, ..., pc_k}, where each fragment can be accompanied by its source identifier.
[0033] This application embodiment receives long text question data, inputs it into a preset retrieval index for hybrid retrieval processing, performs vectorization encoding and keyword parsing on the question data, executes vector similarity retrieval and full-text matching retrieval in parallel, and merges and sorts the candidate results returned by the two retrieval modes, selecting the top k most relevant document fragments to obtain a retrieval result fragment dataset. This enhances the comprehensiveness and accuracy of information retrieval; improves the response capability and efficiency to complex long text queries; and increases the relevance and coverage of retrieval results, thereby improving the recall and precision of the initial retrieval results.
[0034] For example, when answering questions about technical principles from a technical report on the preparation of a new material, the report can be treated as the long document. Segmenting it by sentence and applying a sliding window strategy for context expansion yields a series of semantically coherent document fragments. These fragments can be encoded into vectors, indexed for full-text search, and stored together in the preset retrieval index library. When a user submits a long text question about the mechanism of a specific synthesis step, semantic similarity can be computed using the vector database while keyword matching is performed using a full-text search engine, with both retrievals running in parallel against the preset retrieval index library. The preliminary results returned by the two methods can then be scored for relevance and merged to obtain a retrieval result fragment dataset containing the k most relevant report fragments. These fragments may include paragraphs describing reaction conditions, chemical equations, and material characterization results.
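The hybrid retrieval step can be illustrated with a small self-contained sketch. Several parts are assumptions rather than details from the disclosure: term-count vectors with cosine similarity stand in for a real embedding model, keyword overlap stands in for a full-text search engine, and the two rankings are fused with reciprocal rank fusion, which the source does not name (it only says the results are fused and sorted). All function and parameter names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_search(question, fragments):
    """Rank fragment ids by cosine similarity of term-count vectors (embedding stand-in)."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(f["text"]))), f["id"]) for f in fragments]
    return [fid for score, fid in sorted(scored, key=lambda x: (-x[0], x[1])) if score > 0]

def keyword_search(keywords, fragments):
    """Rank fragment ids by the number of distinct keywords present (full-text stand-in)."""
    kws = {k.lower() for k in keywords}
    scored = [(len(kws & set(tokenize(f["text"]))), f["id"]) for f in fragments]
    return [fid for hits, fid in sorted(scored, key=lambda x: (-x[0], x[1])) if hits > 0]

def hybrid_retrieve(question, keywords, fragments, k=2, rrf_k=60):
    """Fuse both rankings with reciprocal rank fusion and keep the top k fragments."""
    fused = Counter()
    for ranking in (vector_search(question, fragments), keyword_search(keywords, fragments)):
        for rank, fid in enumerate(ranking, start=1):
            fused[fid] += 1.0 / (rrf_k + rank)
    by_id = {f["id"]: f for f in fragments}
    top = sorted(fused.items(), key=lambda x: (-x[1], x[0]))[:k]
    return [by_id[fid] for fid, _ in top]
```

Rank-based fusion is used here only because it avoids having to put similarity scores and keyword hit counts on a common scale; a score-based merge followed by a confidence threshold would fit the described method equally well.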
[0035] S202: Perform global information extraction processing on the retrieval result fragment dataset to obtain global context data.
[0036] Specifically, global information extraction processing can be a data processing procedure that extracts and integrates information from the retrieval result fragment dataset to obtain information that goes beyond the scope of a single fragment and can be used to characterize the overall structure and background of the document. This allows discrete fragment information to be placed back into the complete document context, restoring the contextual relationships lost due to fragmentation. Specifically, each fragment in the retrieval result fragment dataset can be located and restored to its corresponding complete paragraph position in the original long document based on a preset mapping relationship. Then, the restored paragraphs with more complete context can be combined and input into the large language model. The large language model can be guided by a preset global information extraction prompt template to analyze and extract summary information about the document's theme, background, logical structure, and key contextual relationships from the paragraphs, thereby obtaining global contextual data.
[0037] This application embodiment maps and locates the retrieved result fragment dataset, restoring each fragment to its complete paragraph position in the original long document. The restored complete paragraphs are then combined and input into a large language model. Through a preset global information extraction prompt template, the large language model is guided to analyze the combined paragraphs, extract and summarize the global contextual data, thereby enhancing the relevance and coherence of discrete fragment information within the overall document context; improving the ability to extract core background and structural information from long documents; enhancing the contextual completeness and accuracy of subsequent understanding and responses; and improving the overall understanding of long documents.
[0038] S203: Perform chained verification processing on the long text question data, global context data, and retrieval result fragment dataset to obtain a semantic fragment dataset.
[0039] Specifically, chained verification processing can include chained thinking generation and filtering. In chained thinking generation, chained thinking data can be generated based on long text question data and global context data. This chained thinking data can be a series of intermediate thinking steps or logical clue texts generated by simulating the step-by-step reasoning process of humans. In this way, global context information can be transformed into a reasoning path for filtering specific micro-information. Specifically, long text question data and global context data can be input into a large language model. Through a preset chained thinking generation prompt template, the large language model can be guided to generate a description of the thought process for answering the question, thereby obtaining chained thinking data.
[0040] For example, consider answering a user's technical question from a long product manual. Given long text question data describing a device malfunction, together with global context data such as an overview of the manual's overall architecture and operating principles, chained thinking generation can produce a reasoning chain from this information, such as "First, according to the overall principles of the document, this phenomenon may be related to the power module or signal processing link; second, in the troubleshooting section, the power indicator status should be checked first, and then the voltage of key test points should be measured..."
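A minimal sketch of the chained thinking generation step is given below. The prompt wording in `COT_TEMPLATE` is invented for illustration, since the disclosure does not give the text of its preset template, and `llm` is a hypothetical callable that takes a prompt string and returns the model's text, so that no particular model API is assumed.

```python
from typing import Callable, List

# Hypothetical prompt template; the disclosure does not specify its wording.
COT_TEMPLATE = (
    "You are given a question and a summary of the source document.\n"
    "Document summary:\n{global_context}\n\n"
    "Question:\n{question}\n\n"
    "Describe, step by step, the reasoning needed to answer the question. "
    "Write one numbered step per line and do not state the final answer."
)

def generate_chain_of_thought(question: str, global_context: str,
                              llm: Callable[[str], str]) -> List[str]:
    """Fill the template, call the model, and return the non-empty reasoning steps."""
    prompt = COT_TEMPLATE.format(question=question, global_context=global_context)
    raw = llm(prompt)
    return [line.strip() for line in raw.splitlines() if line.strip()]
```

Returning the steps as a list, rather than one block of text, makes it straightforward to check each retrieved fragment against each step in the later filtering stage.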
[0041] In the filtering process, each retrieval result fragment in the retrieval result fragment dataset can be verified based on chained thinking data to filter out fragments that can support chained thinking data, thereby obtaining a semantic fragment dataset. This semantic fragment dataset can be a collection of text fragments that have been determined to contain key factual details and support the generation of the target answer after chained verification. This can serve as the core factual basis for generating the target answer, ensuring that the information input to the generation model has high relevance and high quality.
[0042] The specific verification process can use chain-thinking data as an evaluation standard to judge whether each search result fragment contains the key factual details required to answer the question. In other words, the reasoning path provided by the chain-thinking data can be used as a benchmark to measure whether each fragment is direct evidence or supporting material for a certain link or conclusion in the reasoning path, thereby achieving the screening from relevant to necessary.
[0043] This application embodiment inputs long text question data and global context data into a large language model, guides the generation of chain-thinking data through a preset chain-thinking generation prompt template, and verifies each fragment in the retrieval result fragment dataset based on this chain-thinking data, judging whether each fragment contains key factual details or direct evidence supporting the reasoning path, and filters out fragments that meet the conditions to obtain a semantic fragment dataset. This enhances the orientation and logical relevance of factual evidence filtering; improves the efficiency of locating core supporting information from batch retrieval results; and improves the information quality of the input generation model and the reliability of answer generation, thereby improving the quality and accuracy of information.
[0044] For example, in the aforementioned application scenario, the generated chain-thinking data can be compared and verified with each initially retrieved segment. For the reasoning step "the power indicator status should be checked first", the presence of content describing the criteria for judging the normal / abnormal state of the power indicator can be evaluated in the retrieved segment. For the reasoning step "measure the voltage of key test points", the presence of specific test point locations and voltage range values in the segment can be evaluated. Furthermore, the segments used to verify or refine a certain step in the chain-thinking can be filtered to form a semantic segment dataset.
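The fragment evaluation and filtering described above can be sketched as two small functions. The `supports` argument is a hypothetical judge of whether a fragment backs a reasoning step; in the described method this judgment would presumably come from a large language model, and the keyword-overlap function `keyword_supports` below is only a stand-in so that the example runs on its own.

```python
import re

def keyword_supports(step, text, min_shared=2):
    """Stand-in judge: a fragment supports a step if they share enough longer words."""
    words = lambda s: set(re.findall(r"[a-z]{4,}", s.lower()))
    return len(words(step) & words(text)) >= min_shared

def evaluate_fragments(steps, fragments, supports=keyword_supports):
    """Build evaluation state data: which reasoning steps each fragment supports."""
    states = []
    for frag in fragments:
        supported = [i for i, step in enumerate(steps) if supports(step, frag["text"])]
        states.append({"fragment": frag, "supported_steps": supported})
    return states

def filter_fragments(states):
    """Keep only fragments that support at least one reasoning step."""
    return [s["fragment"] for s in states if s["supported_steps"]]
```

Keeping the evaluation state separate from the filter mirrors the two-stage wording of the method (fragment evaluation, then filtering) and leaves room for stricter rules, such as requiring every step to be covered.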
[0045] S204: Perform answer generation processing on the global context data and semantic fragment dataset to obtain the target answer text data corresponding to the long text question data.
[0046] Specifically, global contextual data can be generalized data obtained by mapping and extracting retrieved text fragments. This global contextual data can be used to characterize the overall background, structure, and contextual relationships of the original long document. It can also provide macro-level guidance and constraints for answer generation, ensuring that the generated answers conform to the overall narrative logic and thematic scope of the document.
[0047] A semantic fragment dataset can be a collection of text fragments that have been filtered through chained validation and contain the specific facts, details, or evidence necessary to answer long text questions. This semantic fragment dataset can be used to provide micro-level, accurate factual support for answer generation.
[0048] Answer generation can be achieved by fusing global contextual data with semantic fragment datasets and generating coherent natural language answers. This process integrates macro-level guidance with micro-level facts to synthesize target answer text data. Specifically, the global contextual data and semantic fragment datasets can be formatted and organized based on a preset answer generation prompt template to form a complete prompt containing instructions, questions, global background information, and specific supporting evidence. This complete prompt can then be input into a large language model, guiding it to prioritize understanding the context of the question through the global contextual data and responding by combining details from the semantic fragment dataset, thereby obtaining the target answer text data corresponding to the long text question data.
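The prompt assembly for answer generation can be sketched as below. The instruction wording and section labels are assumptions, since the disclosure only states that the complete prompt contains instructions, the question, global background information, and supporting evidence; `llm` is again a hypothetical prompt-to-text callable, and the `source` key is an assumed name for the fragment's source identifier.

```python
def build_answer_prompt(question, global_context, fragments):
    """Assemble instruction, question, global background, and numbered evidence."""
    evidence = "\n".join(
        f"[{i}] (source: {f.get('source', 'unknown')}) {f['text']}"
        for i, f in enumerate(fragments, start=1)
    )
    return (
        "Instruction: Use the global background to interpret the question, "
        "then answer it using the supporting evidence.\n\n"
        f"Question:\n{question}\n\n"
        f"Global background:\n{global_context}\n\n"
        f"Supporting evidence:\n{evidence}\n"
    )

def generate_answer(question, global_context, fragments, llm):
    """Send the assembled prompt to the model and return its answer text."""
    return llm(build_answer_prompt(question, global_context, fragments))
```

Placing the background before the evidence follows the description's ordering, in which the model first understands the context of the question and then draws on fragment-level details.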
[0049] The large language model can be a model based on the Transformer architecture. Through its attention mechanism, such a model can attend simultaneously to information from different parts of the prompt, so as to achieve the interaction and trade-off between global context and local details.
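For reference, the disclosure does not give a formula for the attention mechanism. In the standard Transformer formulation, scaled dot-product attention is commonly written as follows, where Q, K, and V are the query, key, and value matrices computed from the prompt tokens and d_k is the key dimension; this notation is the conventional one and is not taken from the source.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

Each output row is a weighted combination of all value rows, which is what allows tokens in the evidence part of the prompt to draw on tokens in the global background part, and vice versa.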
[0050] This application embodiment formats and organizes global contextual data and semantic fragment datasets, constructs a complete prompt based on a preset answer generation prompt template, and includes instructions, questions, global background information, and specific supporting evidence. This complete prompt is input into a large language model, which integrates the macro-level guidance provided by the global contextual data with the micro-level factual details provided by the semantic fragment dataset to generate a coherent natural language answer. This yields the target answer text data corresponding to the long text question data, thereby enhancing the balance and accuracy of answer generation between macro-level context and micro-level facts; improving the logical coherence and factual support of the generated answer; and enhancing the reliability and overall quality of solutions to complex long text questions, as well as improving the accuracy and completeness of the generated answer.
[0051] According to the technical solution provided in this disclosure, long text question data is subjected to hybrid retrieval processing based on a preset retrieval index library: vector retrieval and full-text retrieval are performed in parallel, and their results are merged and sorted to select the most relevant document fragments, yielding a retrieval result fragment dataset. Global information extraction processing is performed on the retrieval result fragment dataset: each fragment is restored to its complete paragraph position in the original long document, the restored paragraphs are combined, and summary information is extracted using a large language model to obtain global context data. The long text question data, global context data, and retrieval result fragment dataset are then subjected to chained verification processing: chained thinking data is generated first and is then used to verify and filter the fragments, yielding a semantic fragment dataset. Answer generation processing is performed on the global context data and the semantic fragment dataset, in which a large language model is given a formatted prompt to generate the target answer text data. This improves the response capability and efficiency for complex long text queries; enhances the relevance and coverage of retrieval results; strengthens the correlation and coherence of discrete fragment information within the overall document context; improves the ability to extract core background and structural information from long documents; enhances the contextual completeness and accuracy of subsequent understanding and answers; strengthens the orientation and logical relevance of factual evidence screening; improves the contextual accuracy of long text question answering; and reduces retrieval noise.
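The overall flow of steps S201 to S204 can be summarized in a short orchestration sketch. The function name and the four callables passed in are hypothetical placeholders for the stages described above; the sketch only shows how the outputs of one stage feed the next and is not a statement of how the disclosed system is structured internally.

```python
def answer_long_text_question(question, retrieve, extract_context, verify, generate):
    """Chain the four stages; each stage is supplied by the caller as a callable."""
    fragments = retrieve(question)                              # S201: hybrid retrieval
    global_context = extract_context(question, fragments)       # S202: global information extraction
    semantic = verify(question, global_context, fragments)      # S203: chained verification
    return generate(question, global_context, semantic)         # S204: answer generation
```

Passing the stages in as arguments keeps the sketch independent of any particular index, model, or prompt template.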
[0052] In some embodiments, performing global information extraction processing on the retrieval result fragment dataset to obtain global context data includes: performing fragment mapping processing on the retrieval result fragment dataset to obtain a retrieval paragraph dataset; and performing context extraction processing on the retrieval paragraph dataset to obtain global context data.
[0053] Specifically, fragment mapping processing can be a data processing process that locates and restores each text fragment in the retrieval result fragment dataset to the complete logical paragraph to which it belongs in the original long document, based on a preset mapping relationship. This can result in a retrieval paragraph dataset, which can be a collection of complete paragraphs of the original document corresponding to the retrieved text fragments. This retrieval paragraph dataset can be used to restore the text context that has been fragmented by the segmentation operation, providing a semantically coherent text foundation for subsequent extraction of the document's macro structure and background information.
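Fragment mapping can be sketched as a lookup from fragment identifiers to paragraph identifiers. The data layout is an assumption: `fragment_to_paragraph` is a hypothetical dictionary recorded when the index is built, and `paragraphs` maps paragraph identifiers to the full paragraph text of the original document.

```python
def map_fragments_to_paragraphs(fragments, fragment_to_paragraph, paragraphs):
    """Return the distinct source paragraphs of the fragments, in document order."""
    paragraph_ids = {fragment_to_paragraph[frag["id"]] for frag in fragments}
    # Several fragments may come from one paragraph, so duplicates are collapsed.
    return [{"paragraph_id": pid, "text": paragraphs[pid]} for pid in sorted(paragraph_ids)]
```

Sorting by paragraph identifier assumes the identifiers follow document order, which keeps the restored paragraphs in the same sequence as the original text.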
[0054] Context extraction processing can be a data processing process that analyzes and extracts high-level semantic information from the retrieved paragraph dataset using a large language model. This can generate a high-level understanding framework about the document as a whole, i.e. global context data. Specifically, the retrieved paragraph dataset and long text question data can be input into the large language model. Through a preset global information extraction prompt template, the large language model is guided to output global background information related to the question answer, thereby obtaining global context data.
[0055] For example, in a lengthy technical report answering user questions about the derivation process of core conclusions, a hybrid retrieval process can be used to retrieve multiple text fragments mentioning key data and derivation steps from the segmented report text, forming a dataset of retrieval result fragments. Then, fragment mapping processing can be performed, mapping the retrieval result fragments back to their respective original chapters or complete paragraphs according to preset mapping relationships. For example, a fragment about "experimental data A" can be mapped to the entire paragraph of "Chapter 3 Experimental Results," and a fragment about "inference B" can be mapped to the entire paragraph of "Chapter 4 Discussion and Analysis," thus forming a dataset of retrieval paragraphs. The dataset of retrieval paragraphs and the long text question data can be input into a large language model, instructing the model to extract global background information related to understanding the derivation of the core conclusion. The large language model can then output global contextual data, which may include a general description such as, "This report aims to verify theory X; the experimental section designed scheme Y; and finally, conclusions were drawn by comparing data from group Z."
[0056] According to the technical solution provided in this disclosure, by performing fragment mapping processing on the retrieval result fragment dataset, each text fragment is located and restored to its complete logical paragraph in the original long document based on a preset mapping relationship, thus obtaining a retrieval paragraph dataset. The retrieval paragraph dataset and the long text question data are input into a large language model. Through a preset global information extraction prompt template, the model is guided to analyze and extract global background information related to the question answer, thus obtaining global context data. This enhances the ability to accurately restore the retrieved fragments to their original context; improves the targeting and completeness of high-level semantic information extraction from the document; enhances the accuracy of global background understanding and question association; and improves the depth of understanding of complex questions and the accuracy of answer generation in long text question answering.
[0057] In some embodiments, chain-validation processing is performed on long text question data, global context data, and retrieval result fragment dataset to obtain semantic fragment dataset, including: performing reasoning generation processing on long text question data and global context data to obtain chain-thinking data; performing fragment evaluation processing on retrieval result fragment dataset based on chain-thinking data to obtain evaluation state data; and performing filtering processing on evaluation state data to obtain semantic fragment dataset.
[0058] Specifically, reasoning generation processing can be a data processing process that involves inputting long text question data and global context data into a large language model, guiding the large language model to generate a reasoning chain to answer the question, thereby obtaining chain thinking data. This chain thinking data can be a series of intermediate thinking steps or logical clue texts generated by simulating the step-by-step reasoning process of humans.
[0059] Fragment evaluation processing can be a data processing process that analyzes each text fragment in the retrieval result fragment dataset, using the chain-thinking data as the judgment criterion, and assesses whether the information contained in the fragment supports or explains the reasoning steps described by the chain-thinking data, thus obtaining evaluation status data. This evaluation status data can be a set of Boolean values, where each Boolean value corresponds to one retrieval result fragment and characterizes whether that fragment has passed the chain-thinking-based evaluation. Specifically, each fragment, along with the chain-thinking data and the long text question data, can be input into a large language model, which is instructed to determine whether the fragment information can explain the thought process for the question. If the fragment content provides factual support or detailed supplementation for the key reasoning nodes in the chain-thinking data, the evaluation status is true; if the fragment content is irrelevant to the reasoning logic or lacks sufficient information, the evaluation status is false.
[0060] The filtering process can be a data processing procedure that filters the retrieved fragment dataset based on the evaluation status data. This allows fragments with a true evaluation status to be retained, while fragments with a false evaluation status are filtered out, resulting in a semantic fragment dataset.
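For example, fragment evaluation followed by filtering can be sketched as below. This is a minimal sketch: the function names are hypothetical, and `keyword_judge` is only a toy stand-in for the large language model judgment described above (it marks a fragment true when it shares a word of five or more letters with the chain thinking data).

```python
import re
from typing import Callable, List


def evaluate_fragments(
    fragments: List[str],
    chain_of_thought: str,
    question: str,
    judge: Callable[[str, str, str], bool],
) -> List[bool]:
    """Return one Boolean evaluation status per retrieved fragment."""
    return [judge(fragment, chain_of_thought, question) for fragment in fragments]


def filter_fragments(fragments: List[str], statuses: List[bool]) -> List[str]:
    """Keep only the fragments whose evaluation status is true."""
    if len(fragments) != len(statuses):
        raise ValueError("each fragment needs exactly one evaluation status")
    return [fragment for fragment, keep in zip(fragments, statuses) if keep]


def keyword_judge(fragment: str, chain_of_thought: str, question: str) -> bool:
    """Toy judge (assumption): true if a word of 5+ letters is shared."""
    cot_words = set(re.findall(r"[a-z]{5,}", chain_of_thought.lower()))
    return any(word in cot_words for word in re.findall(r"[a-z]{5,}", fragment.lower()))


# Hypothetical inputs for illustration only.
fragments = [
    "Group Z data showed an increase over the baseline.",
    "The authors thank their sponsors.",
]
cot = "Compare the group Z data against the baseline, then state the increase."
statuses = evaluate_fragments(fragments, cot, "What changed in group Z?", keyword_judge)
semantic_fragments = filter_fragments(fragments, statuses)
```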
[0061] According to the technical solution provided in this disclosure, by inputting long text question data and global context data into a large language model, and through reasoning generation processing, guiding the model to generate chain-like thinking data, and performing segment evaluation processing on the retrieval result segment dataset based on the chain-like thinking data, each segment, together with the chain-like thinking data and the long text question data, is input into the large language model for evaluation to determine whether the segment information supports or explains the reasoning steps, thereby obtaining evaluation state data. The evaluation state data is then filtered, retaining segments with a true evaluation state and filtering out segments with a false evaluation state, thus obtaining a semantic segment dataset. This enhances the accuracy of logical matching between the reasoning process and factual evidence; improves the automation and objectivity of segment selection; enhances the efficiency and reliability of core supporting information extraction; improves the accuracy of selecting high-quality segments from the retrieval results; and removes irrelevant and noisy information.
[0062] In some embodiments, before performing mixed retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data, the method further includes: segmenting the long text data to be processed to obtain an initial fragment dataset; performing context expansion processing on the initial fragment dataset to obtain a text fragment dataset; and performing index building processing on the text fragment dataset to obtain a preset retrieval index library.
[0063] Specifically, the long text data to be processed can be the original data of a long document or text collection that needs to be processed for question answering. This long text data can be used as the original material for the knowledge source of the question answering system and is the data foundation for all subsequent processing steps. The segmentation process can be based on sentence boundaries, fixed character length, or semantic units to cut the continuous long text into multiple smaller text units according to preset rules, thereby obtaining an initial fragment dataset. This initial fragment dataset can be a collection of multiple text fragments obtained after the initial segmentation. This initial fragment dataset can be used as the input for subsequent context expansion processing.
[0064] In addition, context expansion processing can be a data processing procedure that adds adjacent contextual content to text fragments to enhance their semantic integrity. Specifically, a sliding window strategy can be used to add the preceding sentence or a certain length of preceding text to each initial fragment data to ensure semantic coherence at the beginning of the fragment and avoid truncation of key information, thereby obtaining a text fragment dataset. This text fragment dataset can be a collection of text fragment data with each fragment containing expanded context and more complete semantics. This text fragment dataset can be used as material for building retrieval indexes.
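For example, sentence-based segmentation followed by sliding-window context expansion can be sketched as follows. This is a sketch under stated assumptions: the sentence-splitting rule, the `max_chars` limit, and the one-sentence overlap are hypothetical parameter choices, not values given in this disclosure.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segment(sentences: List[str], max_chars: int) -> List[List[str]]:
    """Greedily pack whole sentences into fragments.

    A fragment's sentences total at most max_chars characters (joining
    spaces are not counted); an over-long sentence forms its own fragment.
    """
    fragments, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            fragments.append(current)
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)
    if current:
        fragments.append(current)
    return fragments


def expand_context(fragments: List[List[str]], overlap: int = 1) -> List[str]:
    """Prepend the last `overlap` sentences of the preceding fragment."""
    expanded = []
    for i, fragment in enumerate(fragments):
        prefix = fragments[i - 1][-overlap:] if i > 0 and overlap > 0 else []
        expanded.append(" ".join(prefix + fragment))
    return expanded


text = "Alpha one. Beta two. Gamma three. Delta four."
initial_fragments = segment(split_sentences(text), max_chars=20)
text_fragments = expand_context(initial_fragments, overlap=1)
```

In this sketch the first fragment is left unchanged because it has no preceding text, and every later fragment begins with the final sentence of the fragment before it.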
[0065] In addition, index building can be a data processing process for establishing an efficient query data structure for text datasets. Specifically, it can include generating vectorized representations for each text fragment and storing them in a vector database to support semantic similarity retrieval, and building a full-text inverted index to support keyword matching retrieval, thereby forming an index structure that supports hybrid retrieval modes and obtaining a preset retrieval index library.
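For example, building the two index structures described above can be sketched with in-memory dictionaries. This is only an illustration: a real system would use a vector database and a full-text engine, and `toy_embed`, `VOCAB`, and `build_index` are hypothetical names, with `toy_embed` standing in for an embedding model.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(
    fragments: Dict[str, str], embed: Callable[[str], List[float]]
) -> Tuple[Dict[str, List[float]], Dict[str, Set[str]]]:
    """Build a vector store and an inverted index over the text fragments."""
    vector_store: Dict[str, List[float]] = {}
    inverted_index: Dict[str, Set[str]] = defaultdict(set)
    for fragment_id, text in fragments.items():
        vector_store[fragment_id] = embed(text)
        for token in tokenize(text):
            inverted_index[token].add(fragment_id)
    return vector_store, dict(inverted_index)


# Toy embedding (assumption): counts of three fixed vocabulary words.
VOCAB = ["retrieval", "index", "vector"]


def toy_embed(text: str) -> List[float]:
    tokens = tokenize(text)
    return [float(tokens.count(word)) for word in VOCAB]


fragments = {
    "f1": "Vector retrieval uses a vector index.",
    "f2": "Full-text retrieval uses an inverted index.",
}
vector_store, inverted_index = build_index(fragments, toy_embed)
```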
[0066] For example, a detailed research report on a certain technology can be used as long text data to be processed. The report can be segmented into multiple initial segments based on the natural chapters and sentence boundaries. Each segment can include several coherent paragraphs. Context expansion processing can be performed on each initial segment, using a sliding window method to add the content of the preceding paragraph to each segment, generating a more semantically coherent text segment dataset. Furthermore, this text segment dataset can be indexed. On the one hand, all segments can be converted into high-dimensional vectors using an embedding model and stored in a vector index library. On the other hand, a full-text index can be built on the text content of all segments to complete the construction of a pre-defined retrieval index library.
[0067] According to the technical solution provided in this disclosure, long text data to be processed is segmented and cut according to rules such as sentence boundaries or fixed lengths to obtain an initial fragment dataset. The initial fragment dataset is then subjected to context expansion processing, and a sliding window strategy is used to add preceding text content to each fragment, generating a semantically more complete text fragment dataset. The text fragment dataset is then indexed and constructed, and through vectorization representation and the establishment of a full-text inverted index, an index structure that simultaneously supports semantic similarity and keyword matching retrieval is formed, resulting in a preset retrieval index library. This enhances the coherence and completeness of the structured processing of the original long text data; improves the rationality and contextual dependence of fragment semantic boundaries; improves the precision and efficiency of subsequent mixed retrieval; and enhances the information quality of individual fragments.
[0068] In some embodiments, a hybrid retrieval process is performed on long text question data based on a preset retrieval index library to obtain a dataset of retrieval result fragments corresponding to the long text question data. This includes: vectorizing the long text question data to obtain long text question vectors; performing similarity retrieval processing on the long text question vectors to obtain a vector retrieval dataset; extracting keywords from the long text question data to obtain question keyword data; performing full-text retrieval processing on the question keyword data to obtain a full-text retrieval dataset; and merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain a dataset of retrieval result fragments.
[0069] Specifically, vectorization can be a data processing procedure that converts text data into numerical vectors through an embedding model. This allows the semantic information of the text to be encoded into points in a high-dimensional space, facilitating mathematical similarity calculations and obtaining a long text question vector. This long text question vector can be used to represent the overall semantic information of the question.
[0070] Similarity retrieval processing can be a data processing procedure that calculates the distance or similarity between vectors in a vector database to find the vectors closest to the query vector. In this way, candidate fragments that are semantically related to the question can be located among the document fragment vectors, resulting in a vector retrieval dataset. This vector retrieval dataset can be a set of document fragments that are semantically similar to the long text question vector, obtained from the vector index through similarity retrieval.
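For example, similarity retrieval by cosine similarity can be sketched as a brute-force scan over stored vectors. This is a minimal sketch: the function names and the in-memory `store` are hypothetical, and a production vector database would typically use an approximate nearest-neighbor index instead of a full scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(
    query_vector: List[float], vector_store: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k fragments whose vectors are most similar to the query."""
    scored = [
        (fragment_id, cosine_similarity(query_vector, vector))
        for fragment_id, vector in vector_store.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical two-dimensional vectors for illustration.
store = {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "f3": [1.0, 1.0]}
results = vector_search([1.0, 0.0], store, k=2)
```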
[0071] Keyword extraction processing can be a data processing procedure that identifies and extracts words or phrases from text that can represent its core content. This can capture the key entities and keywords of the question, provide input for literal matching-based retrieval, and obtain question keyword data. This question keyword data can be a set of words or phrases that can be used to represent the core content of the question, obtained from long text question data through keyword extraction processing. This keyword extraction processing can be implemented through statistical methods or methods based on pre-trained language models.
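For example, a statistical keyword extraction method can be sketched as frequency counting after stop-word removal. This is only one possible statistical method: the stop-word list, the `top_n` parameter, and the function name are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (assumption); real systems use larger ones.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "what", "how", "and", "to", "does", "on"}


def extract_keywords(question: str, top_n: int = 3) -> List[str]:
    """Return the most frequent non-stop-word tokens of the question.

    Ties are broken by first occurrence, which is how Counter.most_common
    orders elements with equal counts.
    """
    tokens = [
        token
        for token in re.findall(r"[a-z0-9]+", question.lower())
        if token not in STOPWORDS
    ]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


keywords = extract_keywords(
    "How does the sliding window affect retrieval, and what is the window size?"
)
```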
[0072] Full-text search processing can be a data processing procedure that searches for document fragments containing specific keywords or phrases in a text database based on an inverted index. This allows for the search of document fragments containing key information about the question from a literal matching perspective, resulting in a full-text search dataset. This full-text search dataset can be a collection of document fragments containing the question keywords obtained from a full-text index through full-text search.
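For example, full-text retrieval over an inverted index can be sketched as follows. This is a minimal sketch: the index is a plain dictionary from keyword to fragment identifiers, and ranking by the number of matched keywords is an assumption rather than a scoring rule stated in this disclosure.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def full_text_search(
    keywords: List[str], inverted_index: Dict[str, Set[str]]
) -> List[Tuple[str, int]]:
    """Return fragments containing any keyword, ranked by keyword hit count.

    Ties are broken by fragment identifier so the order is deterministic.
    """
    hits: Counter = Counter()
    for keyword in keywords:
        for fragment_id in inverted_index.get(keyword, ()):
            hits[fragment_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical inverted index for illustration.
inverted_index = {"window": {"f1", "f2"}, "sliding": {"f1"}, "index": {"f3"}}
full_text_results = full_text_search(["window", "sliding", "size"], inverted_index)
```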
[0073] Merging and filtering can be a data processing procedure that deduplicates, sorts, and filters candidate results from different retrieval channels. This can combine the advantages of semantic retrieval and literal retrieval to ensure the comprehensiveness and accuracy of the retrieval result fragment dataset. Specifically, the union of the vector retrieval dataset and the full-text retrieval dataset can be taken, and the fragments can be sorted and truncated according to their relevance scores to the question and their sources. The most relevant fragments are then retained to obtain the retrieval result fragment dataset.
[0074] For example, in academic literature question answering, when a user inputs a complex, long sentence about a theory in a specific field, the long text question data can be vectorized using an embedding model to generate a high-dimensional long text question vector. Keyword extraction algorithms can then be used to extract core terms from this long text question data as question keywords. Furthermore, two operations can be performed in parallel within a pre-defined search index: firstly, the long text question vector can be fed into a vector database, and by calculating cosine similarity, the top few semantically closest document fragments can be retrieved to form a vector retrieval dataset; secondly, the question keyword data can be fed into a full-text search engine to retrieve all document fragments containing these keywords, forming a full-text search dataset. The fragments from the vector retrieval dataset and the full-text search dataset can then be merged, and a weighted comprehensive ranking can be performed based on the fragments' ranking under both search methods. Completely duplicated fragments are removed, and the top k highest-ranking fragments are selected to form a search result fragment dataset.
[0075] According to the technical solution provided in this disclosure, long text question data is vectorized and converted into numerical vectors using an embedding model to obtain long text question vectors. Keyword extraction is then performed on this long text question data to identify and extract core words or phrases, resulting in question keyword data. Similarity retrieval and full-text retrieval are then performed. The long text question vectors are used to calculate similarity in a vector database to obtain a vector retrieval dataset. The question keyword data is then matched and retrieved in a full-text index to obtain a full-text retrieval dataset. Finally, the vector retrieval dataset and the full-text retrieval dataset are merged and filtered to form a retrieval result fragment dataset. This enhances the synergistic complementarity of semantic matching and literal matching during the retrieval process, improves the overall performance of the retrieval result set in terms of relevance and coverage, and enhances the accuracy and recall of information retrieval for complex questions.
[0076] In some embodiments, merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain a retrieval result fragment dataset includes: merging and sorting the vector retrieval dataset and the full-text retrieval dataset to obtain a merged dataset; and performing confidence filtering on the merged dataset to obtain a retrieval result fragment dataset.
[0077] Specifically, the merge sorting process can be a data processing procedure that merges the vector retrieval dataset and the full-text retrieval dataset and re-sorts them based on preset rules. This can create a unified and ordered pool of candidate fragments. Specifically, the fragments in the two datasets can be merged and duplicate fragment identifiers can be removed. The merged dataset is obtained by comprehensively calculating based on the fragments' ranking position in their respective original datasets, retrieval scores, or the weight allocation of the two retrieval strategies. This merged dataset can be an intermediate set of candidate fragments formed after preliminary merging and sorting.
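For example, merge sorting based on ranking positions and weight allocation can be sketched as follows. This sketch uses a weighted reciprocal of each fragment's rank as the combined score, which is an assumed scoring rule chosen for illustration; this disclosure does not fix a particular formula, and the weights shown are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_and_sort(
    vector_ranked: List[str],
    fulltext_ranked: List[str],
    vector_weight: float = 0.5,
    fulltext_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Merge two ranked lists of fragment identifiers into one ordered list.

    Each list contributes weight / rank to a fragment's score, so a fragment
    found by both retrieval channels is scored once with the sum, which also
    removes duplicates.
    """
    scores: Dict[str, float] = {}
    channels = ((vector_weight, vector_ranked), (fulltext_weight, fulltext_ranked))
    for weight, ranked in channels:
        for rank, fragment_id in enumerate(ranked, start=1):
            scores[fragment_id] = scores.get(fragment_id, 0.0) + weight / rank
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


merged = merge_and_sort(["f1", "f3", "f2"], ["f3", "f4"])
```

Here fragment f3 is ranked first because it appears in both input lists, even though it is not at the top of the vector retrieval list.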
[0078] Confidence filtering can be a data processing procedure that filters the fragments in the merged dataset against a set confidence evaluation standard and retains the high-confidence fragments. This removes noisy or low-relevance fragments from the merged candidate pool and ensures the quality of the context that is input to the generator. Specifically, a comprehensive score can be calculated for each merged fragment from its semantic similarity score in vector retrieval, its keyword matching degree in full-text retrieval, and its overall ranking position after merge sorting. Fragments whose scores are higher than a preset confidence threshold are retained, or fragments are selected according to a preset upper limit on the number of returned fragments, resulting in a retrieval result fragment dataset. This retrieval result fragment dataset can be the set of context fragments, determined by merging and filtering, that is used as input to the generator.
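For example, confidence filtering by threshold or by an upper limit on the number of fragments can be sketched as follows. This is a minimal sketch: it assumes a comprehensive score has already been computed for each fragment, and the function name and parameter names are hypothetical.

```python
from typing import List, Optional, Tuple


def confidence_filter(
    scored: List[Tuple[str, float]],
    threshold: Optional[float] = None,
    max_fragments: Optional[int] = None,
) -> List[str]:
    """Keep high-confidence fragments from (fragment_id, score) pairs.

    Fragments are ordered by descending score; those with a score higher
    than `threshold` are kept, and at most `max_fragments` are returned.
    Either limit may be omitted.
    """
    ranked = sorted(scored, key=lambda item: (-item[1], item[0]))
    if threshold is not None:
        ranked = [item for item in ranked if item[1] > threshold]
    if max_fragments is not None:
        ranked = ranked[:max_fragments]
    return [fragment_id for fragment_id, _ in ranked]


# Hypothetical comprehensive scores for illustration.
scored = [("f1", 0.5), ("f2", 0.1), ("f3", 0.75), ("f4", 0.25)]
by_threshold = confidence_filter(scored, threshold=0.2)
by_limit = confidence_filter(scored, max_fragments=2)
```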
[0079] For example, in a long-text question-and-answer database of scientific papers, when a user enters a detailed question about a certain technical principle, a hybrid retrieval process can be used to perform vector retrieval and full-text retrieval separately. Vector retrieval, based on the semantics of the question, finds multiple chapter fragments discussing the relevant principles in a high-dimensional space to form a vector retrieval dataset, while full-text retrieval can locate the paragraphs containing specific technical terms mentioned in the question to form a full-text retrieval dataset. Then, a merge sorting process can be performed to combine the fragments from the two datasets, and the fragments are weighted and sorted according to their similarity score in vector retrieval and keyword hit frequency in full-text retrieval, generating a merged dataset sorted in descending order based on comprehensive relevance. The merged dataset can then undergo a confidence screening process, calculating a comprehensive confidence score for each fragment. This score can be combined with its ranking, the original scores from both retrieval sources, and the fragment's own coherence assessment. By setting a threshold, fragments that contain keywords but are semantically off-topic or semantically related but lack key terminology support can be filtered out, outputting a dataset of retrieval result fragments.
[0080] According to the technical solution provided in this disclosure, the vector retrieval dataset and the full-text retrieval dataset are merged and sorted: their union is taken, duplicates are removed, and a unified and ordered merged dataset is generated by calculations based on retrieval scores, ranking positions, or weight allocation. The merged dataset is then subjected to confidence filtering: a comprehensive score is calculated and a threshold or an upper limit on the number of fragments is set, so that low-confidence fragments are filtered out and high-confidence fragments are retained, resulting in a retrieval result fragment dataset. This enhances the rationality and synergy of the fusion and sorting of results from different retrieval sources; improves the overall quality and noise suppression capability of the candidate fragment pool; enhances the relevance and confidence of the retrieval result set; eliminates noisy fragments; and improves the purity and relevance of the contextual information input to the generator.
[0081] In some embodiments, answer generation processing is performed on global context data and semantic fragment datasets to obtain target answer text data corresponding to long text question data, including: performing knowledge fusion processing on global context data and semantic fragment datasets to obtain fused question data; and performing answer reasoning processing on fused question data to obtain target answer text data.
[0082] Specifically, knowledge fusion processing can be a data processing procedure that combines global contextual data with semantic fragment datasets. This integrates macro-level contextual information with micro-level factual details into a unified and comprehensive input context. This can be achieved by constructing specific prompt templates. These prompt templates can introduce global contextual data as background information and list semantic fragment datasets as core reference evidence, guiding subsequent generative models to reason based on the fused information to obtain fused question data. This fused question data can be a dataset that integrates global context and specific semantic fragments after knowledge fusion processing.
[0083] Answer reasoning processing can be a data processing process based on the analysis and synthesis of fused question data by a generative model to produce answer text. In this way, the logical clues provided by the global context and the specific facts provided by semantic fragments can be used to generate target answers that are highly relevant to the question. Specifically, the fused question data can be input into a large language model, and the large language model can be instructed to output only the answer to the question through a preset answer generation prompt template, thus obtaining the target answer text data.
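For example, knowledge fusion and answer reasoning can be sketched as building one prompt from the global context and the semantic fragments and passing it to a model. This is a sketch under stated assumptions: the template wording, the numbered evidence layout, and the names `fuse_knowledge` and `generate_answer` are hypothetical, and the lambda stands in for a real large language model call.

```python
from typing import Callable, List

# Hypothetical answer generation template: background first, evidence second.
ANSWER_TEMPLATE = (
    "Background:\n{global_context}\n\n"
    "Reference evidence:\n{evidence}\n\n"
    "Based on the information above, provide only the answer.\n"
    "Question: {question}\n"
)


def fuse_knowledge(global_context: str, fragments: List[str], question: str) -> str:
    """Combine global context and semantic fragments into fused question data."""
    evidence = "\n".join(
        f"[{number}] {fragment}" for number, fragment in enumerate(fragments, start=1)
    )
    return ANSWER_TEMPLATE.format(
        global_context=global_context, evidence=evidence, question=question
    )


def generate_answer(fused_prompt: str, llm: Callable[[str], str]) -> str:
    """Ask the model for the answer and strip surrounding whitespace."""
    return llm(fused_prompt).strip()


fused = fuse_knowledge(
    "The report verifies theory X.",
    ["Scheme Y was used.", "Group Z data were compared."],
    "How was the conclusion reached?",
)
answer = generate_answer(fused, lambda prompt: "  By comparing group Z data.\n")
```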
[0084] For example, in scientific paper Q&A, when a user asks a specific question about the experimental conclusions of a long scientific paper, multiple relevant text fragments can be retrieved from the paper through hybrid retrieval processing. Then, through global information extraction processing, these text fragments are mapped back to complete paragraphs in the original text, extracting global contextual data about the paper's research background, overall experimental design, and chapter structure. Chained verification processing can then be used to filter the retrieved fragments to obtain a semantic fragment dataset containing key experimental parameters, results data, and comparative conclusions. In the answer generation stage, knowledge fusion processing can be performed to integrate the global contextual data describing the overall structure of the paper with the semantic fragment dataset recording specific experimental findings, forming fused question data containing a logical chain of "research background-method-specific results." Answer reasoning processing can then be performed, inputting the fused question data into a generative model. This model can accurately infer the specific content of the experimental conclusions and their significance in the entire paper based on the complete context, generating the target answer text data.
[0085] According to the technical solution provided in this disclosure, knowledge fusion processing is performed on the global contextual data and the semantic fragment dataset, integrating macro-level background information with micro-level factual details through a preset prompt template to form complete fused question data. Answer reasoning processing is then performed on the fused question data: it is input into a large language model, which analyzes and synthesizes the fused context to generate the target answer text data. This enhances the consistency between macro-level guidance and micro-level evidence in the generation process; improves the logical coherence and factual accuracy of answer generation; and enhances the comprehensive reasoning ability and output quality when answering complex long text questions.
[0086] All of the above-mentioned optional technical solutions can be combined in any way to form optional embodiments of this disclosure, and will not be described in detail here.
[0087] Figure 3 is a flowchart illustrating another long-text question-answering method provided in an embodiment of this disclosure. As shown in Figure 3, this long text question-answering method includes:
Hybrid Retriever: Retrieves the most relevant fragments (chunks) from long documents (the long text data to be processed).
[0088] Input: Given a problem q.
[0089] Output: The top k fragments {pc1, pc2, ..., pck} that are most relevant to the problem.
[0090] Long document segmentation: Divide a long document into fixed-length segments, with sentences as the smallest unit.
[0091] Sliding window strategy: Expand the context by adding overlapping content from the previous sentence to avoid semantic breaks at the truncation point.
[0092] Efficient retrieval: Vector retrieval and full-text retrieval over the vector library are combined to achieve efficient retrieval operations.
[0093] Large Language Model (LLM) Enhanced Information Extractor: Extracts global information from retrieved fragments (retrieval result fragment dataset) to restore the contextual information of the fragments.
[0094] Input: The retrieved fragments {pc1, pc2, ..., pck}.
[0095] Output: Global Information I(g): Overall understanding of long documents, including background, structure, and contextual information (global contextual data).
[0096] Fragment mapping: Maps retrieved fragments back to their original long document paragraphs.
[0097] Global information extraction: The mapped paragraphs are input into the large language model (LLM) to extract global background and structural information. For example:
{supporting paragraphs}
Based on the above background, please provide the original information needed to answer the following question. Please ensure that the information cited is detailed and comprehensive.
[0098] Question: {question}
Output only the original information of the required references: {global information}
Chain-of-thought-guided filter (CoT-guided Filter): Selects the fragments containing factual details and removes irrelevant information.
[0099] Input: The retrieved fragments {pc1, pc2, ..., pck} and global information I(g).
[0100] Output: A set of fragments I(d) containing factual details.
[0101] Chain of Thought Generation: Generates chain of thought (CoT) from a global perspective, providing global clues for answering questions.
[0102] For example:
{supporting paragraphs}
Given question: {question}
The answer is: {answer}
Your task is to provide your thought process for the given question based on the information above. Only provide your thought process; do not provide any other information.
[0103] Thought process: {CoT}
Filtering phase: Using the CoT as a global clue, each fragment is evaluated to determine whether it contains the key information needed to answer the question.
[0104]
[0105] For example:
Question: {question}
Problem-solving approach: {CoT}
Answer: {answer}
Please evaluate whether your approach to this problem can explain the answer to the question. If it can explain the answer, set the value of status to True. If it cannot explain the answer, set the value of status to False.
[0106] Your output should be in the following JSON format: status: {value of status}
LLM-augmented Generator: Combines global information and factual details to generate the final answer.
[0107] Input: Global information I(g) and a set of fragments I(d) containing factual details.
[0108] Output: Final answer α.
[0109] Knowledge interaction: Combines global information and factual details, inputs them into the LLM to generate answers.
[0110] Answer generation: Guides the LLM to generate accurate answers using prompt templates.
[0111] For example, the instruction:
{content}
Based on the information above, provide only the answer, without any other text.
[0112] Question: {question}
Output: {answer}
According to the technical solution provided in this disclosure, the retrieved fragments are mapped back to paragraphs of the original document through a mapping function, restoring the context information of the fragments and solving the problem of incomplete context information in the prior art; the chain-thinking guided filter selects fragments containing factual details, improving the retrieval quality and solving the problem of low retrieval quality in the prior art; and the combination of global information and factual details generates accurate answers, solving the problem of inaccurate answer generation in the prior art.
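For example, the four components shown in Figure 3 can be wired together as sketched below. This is only an illustrative sketch: the class name `LongTextQAPipeline`, its field names, and the callable signatures are hypothetical, and the lambdas are placeholders for the hybrid retriever, the LLM-enhanced information extractor, the CoT-guided filter, and the LLM-augmented generator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LongTextQAPipeline:
    """Chains the four stages; each stage is injected as a callable."""

    retrieve: Callable[[str], List[str]]  # question -> fragments
    extract_global: Callable[[List[str], str], str]  # fragments, question -> I(g)
    filter_fragments: Callable[[str, str, List[str]], List[str]]  # -> I(d)
    generate: Callable[[str, List[str], str], str]  # I(g), I(d), question -> answer

    def answer(self, question: str) -> str:
        fragments = self.retrieve(question)
        global_info = self.extract_global(fragments, question)
        details = self.filter_fragments(question, global_info, fragments)
        return self.generate(global_info, details, question)


# Placeholder stages for illustration; real stages would call an index and an LLM.
pipeline = LongTextQAPipeline(
    retrieve=lambda question: ["fragment A", "fragment B"],
    extract_global=lambda fragments, question: "global info",
    filter_fragments=lambda question, global_info, fragments: fragments[:1],
    generate=lambda global_info, details, question: (
        f"{global_info} | {len(details)} fragment(s) | {question}"
    ),
)
final_answer = pipeline.answer("q?")
```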
[0113] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.
[0114] Figure 4 is a schematic diagram of a long text question-answering device provided in an embodiment of this disclosure. As shown in Figure 4, the long text question-answering device includes:
The first processing module 401, used to perform hybrid retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data;
The second processing module 402, used to perform global information extraction processing on the retrieval result fragment dataset to obtain global context data;
The third processing module 403, used to perform chain-validation processing on the long text question data, the global context data, and the retrieval result fragment dataset to obtain a semantic fragment dataset;
The fourth processing module 404, used to perform answer generation processing on the global context data and the semantic fragment dataset to obtain the target answer text data corresponding to the long text question data.
[0115] According to the technical solution provided in this disclosure, hybrid retrieval processing is performed on the long text question data based on a preset retrieval index library: vector retrieval and full-text retrieval are run in parallel, and their results are merged and sorted to select the most relevant document fragments, resulting in a retrieval result fragment dataset. Global information extraction processing is then performed on the retrieval result fragment dataset: each fragment is restored to its complete paragraph position in the original long document, the paragraphs are combined, and summary information is extracted using a large language model to obtain global contextual data. The long text question data, global contextual data, and retrieval result fragment dataset are then subjected to chain-like verification processing, in which chain-like thinking data is generated first and is then used to verify and filter the fragments, resulting in a semantic fragment dataset. Finally, answer generation processing is performed on the global contextual data and the semantic fragment dataset, using a large language model with formatted prompts to generate the target answer text data. This improves the response capability and efficiency for complex long text queries; enhances the relevance and coverage of retrieval results; strengthens the correlation and coherence of discrete fragment information within the overall document context; improves the ability to extract core background and structural information from long documents; enhances the contextual completeness and accuracy of subsequent understanding and answers; strengthens the orientation and logical relevance of factual evidence screening; improves the contextual accuracy of long text question answering; and reduces retrieval noise.
[0116] In some embodiments, the second processing module 402 is specifically used to perform fragment mapping processing on the retrieval result fragment dataset to obtain a retrieved paragraph dataset; and to perform context extraction processing on the retrieved paragraph dataset to obtain the global context data.
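As a rough illustration of fragment mapping followed by context extraction, the sketch below assumes that every retrieved fragment records the index of the paragraph it came from. The `paragraph_index` field, both function names, and the stand-in summarizer are assumptions made for this example; the disclosure states only that summary information is extracted with a large language model.

```python
from typing import Callable, Dict, List


def map_fragments_to_paragraphs(fragments: List[Dict], paragraphs: List[str]) -> List[str]:
    """Replace each fragment by the full paragraph it came from,
    de-duplicated and kept in original document order."""
    indices = sorted({frag["paragraph_index"] for frag in fragments})
    return [paragraphs[i] for i in indices]


def extract_global_context(retrieved: List[str], summarize: Callable[[str], str]) -> str:
    """Combine the paragraphs and pass them to a summarizer (an LLM call in practice)."""
    return summarize("\n\n".join(retrieved))


paragraphs = ["Intro paragraph.", "Warranty terms paragraph.", "Shipping paragraph."]
fragments = [
    {"text": "Shipping", "paragraph_index": 2},
    {"text": "Warranty terms", "paragraph_index": 1},
    {"text": "terms paragraph", "paragraph_index": 1},
]
retrieved = map_fragments_to_paragraphs(fragments, paragraphs)
# Stand-in summarizer: simply joins the paragraphs onto one line.
context = extract_global_context(retrieved, lambda text: text.replace("\n\n", " "))
```

Two fragments from the same paragraph collapse into a single paragraph here, which keeps the input to the summarizer free of repeated text.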
[0117] In some embodiments, the third processing module 403 is specifically used to perform reasoning generation processing on the long text question data and the global context data to obtain chain-of-thought data; to perform fragment evaluation processing on the retrieval result fragment dataset based on the chain-of-thought data to obtain evaluation state data; and to perform filtering processing on the evaluation state data to obtain the semantic fragment dataset.
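The three sub-steps (reasoning generation, fragment evaluation, filtering) might be sketched as follows. The `reason` and `evaluate` callables stand in for model calls, and `toy_reason` and `toy_evaluate` are hypothetical placeholders; the disclosure does not specify how the evaluation state is computed.

```python
from typing import Callable, List, Tuple


def chain_verify(
    question: str,
    global_context: str,
    fragments: List[str],
    reason: Callable[[str, str], str],
    evaluate: Callable[[str, str], bool],
) -> List[str]:
    # Step 1: reasoning generation produces the chain-of-thought data.
    chain = reason(question, global_context)
    # Step 2: fragment evaluation produces one state per fragment.
    states: List[Tuple[str, bool]] = [(frag, evaluate(chain, frag)) for frag in fragments]
    # Step 3: filtering keeps only the fragments judged to support the chain.
    return [frag for frag, keep in states if keep]


def toy_reason(question: str, global_context: str) -> str:
    # Placeholder for an LLM call; the topic is hard-coded for the demo.
    return f"To answer '{question}', look for statements about: warranty"


def toy_evaluate(chain: str, fragment: str) -> bool:
    # Placeholder check: does the fragment mention the topic named in the chain?
    topic = chain.rsplit(": ", 1)[1]
    return topic in fragment.lower()


semantic_fragments = chain_verify(
    "How long is the warranty?",
    "Summary of the product manual.",
    ["The warranty lasts two years.", "Shipping takes five days."],
    toy_reason,
    toy_evaluate,
)
```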
[0118] In some embodiments, the long text question-answering device described above is further configured to: segment the long text data to be processed to obtain an initial fragment dataset; perform context expansion processing on the initial fragment dataset to obtain a text fragment dataset; and perform index building processing on the text fragment dataset to obtain a preset retrieval index library.
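A simplified view of segmentation, context expansion, and index building is sketched below. Fixed-size word windows, expansion by attaching the neighbouring chunks, and a keyword-only inverted index are all assumptions chosen for brevity; the disclosure does not fix these choices, and a full retrieval index library would also store vectors for similarity search.

```python
from collections import defaultdict
from typing import Dict, List, Set


def segment(text: str, size: int) -> List[str]:
    """Split the text into chunks of `size` words (initial fragment dataset)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def expand_context(chunks: List[str]) -> List[str]:
    """Attach the previous and next chunk to each chunk (text fragment dataset)."""
    expanded = []
    for i, chunk in enumerate(chunks):
        before = chunks[i - 1] if i > 0 else ""
        after = chunks[i + 1] if i + 1 < len(chunks) else ""
        expanded.append(" ".join(part for part in (before, chunk, after) if part))
    return expanded


def build_index(fragments: List[str]) -> Dict[str, Set[int]]:
    """Build an inverted index from token to fragment ids."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for frag_id, frag in enumerate(fragments):
        for token in frag.lower().split():
            index[token].add(frag_id)
    return dict(index)


text = "alpha beta gamma delta epsilon zeta"
initial = segment(text, size=2)
fragments = expand_context(initial)
index = build_index(fragments)
```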
[0119] In some embodiments, the first processing module 401 is specifically used to: vectorize the long text question data to obtain long text question vectors; perform similarity retrieval processing on the long text question vectors to obtain a vector retrieval dataset; extract keywords from the long text question data to obtain question keyword data; perform full-text retrieval processing on the question keyword data to obtain a full-text retrieval dataset; and merge and filter the vector retrieval dataset and the full-text retrieval dataset to obtain a retrieval result fragment dataset.
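The two retrieval branches can be illustrated with a toy sketch in which bag-of-words counts stand in for an embedding model and stopword removal stands in for keyword extraction. Both substitutions, the `STOPWORDS` set, and all function names are assumptions for this example. The branches run one after the other here, although nothing prevents running them in parallel.

```python
import math
from collections import Counter
from typing import List, Tuple

STOPWORDS = {"the", "is", "a", "how", "long", "what", "of"}


def vectorize(text: str) -> Counter:
    """Toy vectorization: token counts instead of a learned embedding."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_search(question: str, fragments: List[str], top_k: int) -> List[Tuple[int, float]]:
    q = vectorize(question)
    scored = [(i, cosine(q, vectorize(f))) for i, f in enumerate(fragments)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def extract_keywords(question: str) -> List[str]:
    return [t for t in vectorize(question) if t not in STOPWORDS]


def full_text_search(keywords: List[str], fragments: List[str]) -> List[Tuple[int, float]]:
    """Score each fragment by the share of keywords it contains."""
    hits = []
    for i, fragment in enumerate(fragments):
        tokens = vectorize(fragment)
        matched = sum(1 for k in keywords if k in tokens)
        if matched:
            hits.append((i, matched / len(keywords)))
    return hits


fragments = [
    "The warranty lasts two years.",
    "Shipping takes five days.",
    "Warranty claims need a receipt.",
]
question = "How long is the warranty?"
vector_hits = vector_search(question, fragments, top_k=2)
keywords = extract_keywords(question)
text_hits = full_text_search(keywords, fragments)
```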
[0120] In some embodiments, merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain a retrieval result fragment dataset specifically involves merging and sorting the vector retrieval dataset and the full-text retrieval dataset to obtain a merged dataset; and then performing confidence filtering on the merged dataset to obtain a retrieval result fragment dataset.
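One possible reading of merge-and-sort followed by confidence filtering is a weighted sum of the two scores with a fixed cut-off. The weighting scheme, the `vector_weight` parameter, and the threshold value are assumptions for illustration only; the disclosure does not state how the scores are combined or what confidence measure is used.

```python
from typing import Dict, List, Tuple


def merge_and_sort(
    vector_hits: List[Tuple[int, float]],
    text_hits: List[Tuple[int, float]],
    vector_weight: float = 0.5,
) -> List[Tuple[int, float]]:
    """Combine the two result lists by fragment id and sort by combined score."""
    combined: Dict[int, float] = {}
    for frag_id, score in vector_hits:
        combined[frag_id] = combined.get(frag_id, 0.0) + vector_weight * score
    for frag_id, score in text_hits:
        combined[frag_id] = combined.get(frag_id, 0.0) + (1.0 - vector_weight) * score
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


def confidence_filter(merged: List[Tuple[int, float]], threshold: float) -> List[int]:
    """Keep the ids of fragments whose combined score reaches the threshold."""
    return [frag_id for frag_id, score in merged if score >= threshold]


vector_hits = [(0, 0.8), (1, 0.6), (3, 0.2)]
text_hits = [(0, 1.0), (2, 0.5)]
merged = merge_and_sort(vector_hits, text_hits)
kept = confidence_filter(merged, threshold=0.28)
```

A fragment found by both branches, such as fragment 0 here, accumulates score from each and therefore ranks ahead of fragments found by only one branch.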
[0121] In some embodiments, the fourth processing module 404 is specifically used to perform knowledge fusion processing on the global context data and the semantic fragment dataset to obtain fused question data; and to perform answer reasoning processing on the fused question data to obtain target answer text data.
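Knowledge fusion and answer reasoning could look like the sketch below, which treats fusion as assembling a formatted prompt. The prompt layout, the function names, and the inclusion of the question text in the fused prompt are assumptions; the `llm` callable is a placeholder for whatever large language model is used.

```python
from typing import Callable, List


def fuse_knowledge(question: str, global_context: str, semantic_fragments: List[str]) -> str:
    """Assemble background, numbered evidence, and the question into one prompt."""
    evidence = "\n".join(
        f"[{i}] {frag}" for i, frag in enumerate(semantic_fragments, start=1)
    )
    return (
        f"Background:\n{global_context}\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Answer using only the background and evidence above."
    )


def generate_answer(fused_prompt: str, llm: Callable[[str], str]) -> str:
    """Run answer reasoning on the fused prompt and tidy the output."""
    return llm(fused_prompt).strip()


prompt = fuse_knowledge(
    "How long is the warranty?",
    "The manual covers warranty and shipping terms.",
    ["The warranty lasts two years."],
)
# Placeholder model that returns a fixed string.
answer = generate_answer(prompt, lambda p: "  Two years.  ")
```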
[0122] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this disclosure.
[0123] Figure 5 is a schematic diagram of the electronic device 5 provided in an embodiment of this disclosure. As shown in Figure 5, the electronic device 5 of this embodiment includes: a processor 501, a memory 502, and a computer program 503 stored in the memory 502 and executable on the processor 501. When the processor 501 executes the computer program 503, it implements the steps in the various method embodiments described above. Alternatively, when the processor 501 executes the computer program 503, it implements the functions of each module/unit in the various device embodiments described above.
[0124] The electronic device 5 can be a desktop computer, laptop, handheld computer, cloud server, or other electronic device. The electronic device 5 may include, but is not limited to, the processor 501 and the memory 502. Those skilled in the art will understand that Figure 5 is merely an example of the electronic device 5 and does not constitute a limitation on it. The electronic device 5 may include more or fewer components than shown, or different components.
[0125] The processor 501 can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0126] The memory 502 can be an internal storage unit of the electronic device 5, such as a hard disk or RAM of the electronic device 5. The memory 502 can also be an external storage device of the electronic device 5, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, Flash Card, etc., equipped on the electronic device 5. The memory 502 can also include both internal and external storage units of the electronic device 5. The memory 502 is used to store computer programs and other programs and data required by the electronic device.
[0127] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0128] If integrated modules / units are implemented as software functional units and sold or used as independent products, they can be stored in a readable storage medium (e.g., a computer-readable storage medium). Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program may include computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable storage medium may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc.
[0129] The above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit it. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be included within the protection scope of this disclosure.
Claims
1. A long text question-answering method, characterized in that it includes: performing hybrid retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data; performing global information extraction processing on the retrieval result fragment dataset to obtain global context data; performing chain verification processing on the long text question data, the global context data, and the retrieval result fragment dataset to obtain a semantic fragment dataset; and performing answer generation processing on the global context data and the semantic fragment dataset to obtain target answer text data corresponding to the long text question data.
2. The long text question-answering method according to claim 1, characterized in that performing global information extraction processing on the retrieval result fragment dataset to obtain global context data includes: performing fragment mapping processing on the retrieval result fragment dataset to obtain a retrieved paragraph dataset; and performing context extraction processing on the retrieved paragraph dataset to obtain the global context data.
3. The long text question-answering method according to claim 1, characterized in that performing chain verification processing on the long text question data, the global context data, and the retrieval result fragment dataset to obtain a semantic fragment dataset includes: performing reasoning generation processing on the long text question data and the global context data to obtain chain-of-thought data; performing fragment evaluation processing on the retrieval result fragment dataset based on the chain-of-thought data to obtain evaluation state data; and performing filtering processing on the evaluation state data to obtain the semantic fragment dataset.
4. The long text question-answering method according to claim 1, characterized in that, before performing hybrid retrieval processing on the long text question data based on the preset retrieval index library to obtain the retrieval result fragment dataset corresponding to the long text question data, the method further includes: segmenting the long text data to be processed to obtain an initial fragment dataset; performing context expansion processing on the initial fragment dataset to obtain a text fragment dataset; and performing index building processing on the text fragment dataset to obtain the preset retrieval index library.
5. The long text question-answering method according to claim 1, characterized in that performing hybrid retrieval processing on the long text question data based on the preset retrieval index library to obtain the retrieval result fragment dataset corresponding to the long text question data includes: vectorizing the long text question data to obtain a long text question vector; performing similarity retrieval processing on the long text question vector to obtain a vector retrieval dataset; performing keyword extraction processing on the long text question data to obtain question keyword data; performing full-text retrieval processing on the question keyword data to obtain a full-text retrieval dataset; and merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain the retrieval result fragment dataset.
6. The long text question-answering method according to claim 5, characterized in that merging and filtering the vector retrieval dataset and the full-text retrieval dataset to obtain the retrieval result fragment dataset includes: merging and sorting the vector retrieval dataset and the full-text retrieval dataset to obtain a merged dataset; and performing confidence filtering on the merged dataset to obtain the retrieval result fragment dataset.
7. The long text question-answering method according to claim 1, characterized in that performing answer generation processing on the global context data and the semantic fragment dataset to obtain the target answer text data corresponding to the long text question data includes: performing knowledge fusion processing on the global context data and the semantic fragment dataset to obtain fused question data; and performing answer reasoning processing on the fused question data to obtain the target answer text data.
8. A long text question-answering device, characterized in that it includes: a first processing module, used to perform hybrid retrieval processing on long text question data based on a preset retrieval index library to obtain a retrieval result fragment dataset corresponding to the long text question data; a second processing module, used to perform global information extraction processing on the retrieval result fragment dataset to obtain global context data; a third processing module, used to perform chain verification processing on the long text question data, the global context data, and the retrieval result fragment dataset to obtain a semantic fragment dataset; and a fourth processing module, used to perform answer generation processing on the global context data and the semantic fragment dataset to obtain the target answer text data corresponding to the long text question data.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 7.
10. A readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 7.