An efficient and universal large-scale long text enhanced retrieval method, system and product

A two-stage retrieval strategy uses pre-trained language models to convert long texts and queries into document vectors and query vectors. By combining coarse-grained screening with fine-grained matching, it addresses the low retrieval efficiency and low accuracy encountered in large-scale long-document knowledge bases, achieving efficient and universal long text enhanced retrieval.

CN120873104B · Active · Publication Date: 2026-06-19 · NANKAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NANKAI UNIV
Filing Date: 2025-06-10
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing retrieval methods suffer from low retrieval efficiency, low accuracy, and poor versatility when dealing with large-scale long document knowledge bases. In particular, they have high computational overhead and insufficient applicability in ultra-large-scale, long document knowledge bases.

Method used

A two-stage retrieval strategy is adopted. First, the long text is converted into document vectors and query vectors through a first pre-trained language model for coarse-grained screening. Then, the candidate documents are segmented into paragraphs, and keyword vectors are extracted using a second pre-trained language model for fine-grained matching to generate interpretable search results.

Benefits of technology

It significantly reduces computational overhead, improves retrieval efficiency and accuracy, enhances the versatility and flexibility of knowledge bases across different fields, and is suitable for efficient and accurate retrieval of large-scale knowledge bases.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of natural language processing technology, providing an efficient and universal method, system, and product for large-scale long text augmented retrieval. The method includes: using a first pre-trained language model to convert large-scale long text into document vectors and encoding queries into query vectors; calculating the similarity between document vectors and query vectors, selecting the top K1 documents as candidate documents; extracting paragraph keywords and query keywords, using a second pre-trained language model to convert paragraph keywords into paragraph keyword vectors and query keywords into query keyword vectors; calculating the similarity between paragraph keyword vectors and query keyword vectors, selecting the top K2 paragraphs as the final retrieval results; and generating an interpretable search result report. This invention significantly reduces computational overhead, improves retrieval efficiency, enhances the universality of knowledge bases across different domains, and greatly improves flexibility and scalability in practical applications.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to an efficient and universal method, system and product for large-scale long text enhanced retrieval.

Background Technology

[0002] Efficient and accurate information retrieval in large-scale knowledge bases, such as collections of patent documents, laws and regulations, and medical literature, is an important research direction in the field of natural language processing. Mainstream retrieval methods include sparse retrieval such as BM25, dense retrieval such as Dual-Encoder, and multi-vector retrieval such as ColBERT and XTR. Although these methods have achieved some success in different application scenarios, they still have significant shortcomings when processing ultra-large-scale, long-document knowledge bases, specifically in terms of retrieval efficiency, accuracy, and applicability.

[0003] Sparse retrieval methods rank documents by term matching and statistical weights, such as how frequently the query terms occur in each document. While suitable for short texts, they can lead to excessive noise in very long documents, reducing retrieval accuracy. Dense retrieval methods use deep learning models to map queries and documents into high-dimensional vector spaces and perform retrieval based on vector similarity. While these methods improve semantic matching, they are computationally expensive, cannot handle localized information in long documents, and require large amounts of data for training. Multi-vector retrieval methods preserve word-level semantic information, but suffer from high computational complexity, non-linear scoring mechanisms that increase computational overhead, and poor adaptability to different tasks: some methods perform well on specific tasks, but their performance degrades in long document retrieval and cross-domain retrieval scenarios.

[0004] Sparse retrieval, dense retrieval, and multi-vector retrieval all face the same problems when dealing with ultra-large-scale, long-document knowledge bases: low retrieval efficiency, inconsistent accuracy, the need for additional fine-tuning of pre-trained models, increased computational overhead, and limited generality and scalability.

Summary of the Invention

[0005] This invention aims to address at least one of the technical problems existing in related technologies. To this end, this invention provides a highly efficient and universal large-scale long-text enhanced retrieval method, system, and product, solving the problems of low retrieval efficiency, low accuracy, and poor versatility of existing retrieval methods when processing large-scale long-document knowledge bases. Through a two-stage retrieval strategy, it significantly reduces computational overhead and improves retrieval efficiency while ensuring retrieval accuracy, and enhances the versatility across different domain knowledge bases, greatly improving flexibility and scalability in practical applications.

[0006] This invention provides an efficient and universal method for enhancing retrieval of large-scale long texts, comprising:

[0007] S1: Use the first pre-trained language model to convert large-scale long texts into document vectors; use the first pre-trained language model to encode queries into query vectors;

[0008] S2: Calculate the similarity between the document vector and the query vector, and select the top K1 documents as candidate documents based on the similarity.

[0009] S3: Segment the candidate document into paragraphs, extract paragraph keywords, and use the second pre-trained language model to convert the paragraph keywords into paragraph keyword vectors;

[0010] S4: Extract query keywords and use the second pre-trained language model to convert the query keywords into query keyword vectors;

[0011] S5: Calculate the similarity between the paragraph keyword vector and the query keyword vector, and select the top K2 paragraphs as the final search results based on the similarity.

[0012] S6: Associate the final search results with large-scale long texts to generate an interpretable search results report.

[0013] Furthermore, step S1 includes:

[0014] S11: Use the sliding window method to split large-scale long texts into multiple document paragraphs;

[0015] S12: Use the first pre-trained language model to convert document paragraphs into embedding vectors;

[0016] S13: Hierarchical clustering algorithm is used to cluster the embedding vectors to obtain document vectors;

[0017] S14: Encode the query into a query vector using the first pre-trained language model.

[0018] Furthermore, the first pre-trained language model includes BERT and / or RoBERTa.

[0019] Furthermore, in step S2, the similarity between the document vector and the query vector is calculated using Euclidean distance or cosine similarity.

[0020] Furthermore, step S2 includes:

[0021] S21: Calculate the similarity between the query vector and the document vector, and select the top K1 documents as the initial candidate document set based on the similarity.

[0022] S22: Remove duplicates from the initial candidate document set;

[0023] S23: Use regular expressions to segment the initial candidate document set after deduplication into sentences; and merge short sentences by setting the sliding window size and step size to obtain the candidate document set;

[0024] S24: Set an ID for each candidate document in the candidate document set.

[0025] Furthermore, step S3 includes:

[0026] S31: Match candidate documents in the candidate document set based on their IDs;

[0027] S32: Using multi-process vectors, candidate documents in the candidate document set are segmented into paragraphs based on punctuation marks;

[0028] S33: Extract keywords for each paragraph, remove stop words and high-frequency words, and use the second pre-trained language model to generate paragraph keyword vectors.

[0029] Furthermore, in step S4, keyword extraction techniques are used to extract query keywords, including one or more of TF-IDF, TextRank, and BERT-NER.

[0030] Furthermore, the second pre-trained language model includes one or more of the following: dual-tower encoder, DPR, and ColBERT.

[0031] This invention also provides a highly efficient and universal large-scale long text augmented retrieval system for performing any of the above-described highly efficient and universal large-scale long text augmented retrieval methods, comprising:

[0032] The first vector conversion module uses a first pre-trained language model to convert large-scale long text into document vectors; and uses the first pre-trained language model to encode queries into query vectors.

[0033] The first similarity calculation module calculates the similarity between the document vector and the query vector, and selects the top K1 documents as candidate documents based on the similarity.

[0034] The second vector conversion module segments the candidate document into paragraphs, extracts paragraph keywords, and uses a second pre-trained language model to convert the paragraph keywords into paragraph keyword vectors.

[0035] The third vector conversion module extracts query keywords and uses the second pre-trained language model to convert the query keywords into query keyword vectors.

[0036] The second similarity calculation module calculates the similarity between the paragraph keyword vector and the query keyword vector, and selects the top K2 paragraphs as the final search results based on the similarity.

[0037] The results generation module associates the final search results with large-scale long texts to generate an interpretable search results report.

[0038] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the efficient and general-purpose large-scale long text enhanced retrieval method described in any of the preceding claims.

[0039] The above-described one or more technical solutions in the embodiments of the present invention have at least one of the following technical effects:

[0040] This invention addresses the problems of low retrieval efficiency, low accuracy, and poor versatility in existing retrieval methods when processing large-scale long document knowledge bases. Through a two-stage retrieval strategy, it significantly reduces computational overhead and improves retrieval efficiency while ensuring retrieval accuracy, and enhances the versatility of knowledge bases in different fields, greatly improving the flexibility and scalability in practical applications.

[0041] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0042] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0043] Figure 1 is a flowchart of the efficient and universal large-scale long text enhanced retrieval method provided by the present invention.

[0044] Figure 2 is a schematic diagram of the structure of the efficient and universal large-scale long text enhanced retrieval system provided by the present invention.

[0045] Figure 3 is a schematic diagram of the coarse-grained retrieval process of the efficient and universal large-scale long text enhanced retrieval method provided by the present invention.

[0046] Figure 4 is a schematic diagram of the fine-grained retrieval process of the efficient and universal large-scale long text enhanced retrieval method provided by the present invention.

[0047] Figure labels:

[0048] 101. First vector conversion module; 102. First similarity calculation module; 103. Second vector conversion module; 104. Third vector conversion module; 105. Second similarity calculation module; 106. Result generation module.

Detailed Implementation

[0049] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention. The following embodiments are used to illustrate this invention but cannot be used to limit the scope of this invention.

[0050] In the description of the embodiments of the present invention, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance. In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0051] The efficient and universal large-scale long text enhanced retrieval method, system, and product of the present invention are described below with reference to Figures 1 to 4.

[0052] As shown in Figure 1, an efficient and universal large-scale long text enhanced retrieval method includes:

[0053] S1: Use the first pre-trained language model to convert large-scale long texts into document vectors; use the first pre-trained language model to encode queries into query vectors;

[0054] S11: Use the sliding window method to split large-scale long texts into multiple document paragraphs;

[0055] For each long document, a sliding window method is used to split it into multiple paragraphs, with the length of each paragraph not exceeding a set threshold t.

[0056] Specifically, the document is split as follows:

[0057] Set the sliding window size to w and the window step size to s;

[0058] Each time the window slides 's' words, a new paragraph is formed, until the entire content of the document is broken down into multiple smaller paragraphs.

[0059] The maximum length of each paragraph is t, and semantic integrity between paragraphs is ensured.

[0060] Each paragraph will be processed as a separate text unit to ensure that each paragraph represents the semantic information of its original document.

[0061] By using a sliding window method, we can ensure that the content of the split document effectively reflects the overall information of the original document.

[0062] In some specific embodiments of the present invention, a document title is added at the beginning of each paragraph as contextual information to preserve the core semantics of the document, and [CLS] (start) and [SEP] (end) markers are inserted at the beginning and end of each paragraph.
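
By way of non-limiting illustration, the sliding-window split of S11 and the title and marker prefixing of paragraph [0062] can be sketched as follows. The function names `sliding_window_split` and `format_fragment`, the choice of words as the splitting unit, and the placement of the title directly after [CLS] are assumptions made for this sketch; the window size w is taken to be no larger than the threshold t, and s is assumed not to exceed w so that no words are skipped.

```python
from typing import List


def sliding_window_split(words: List[str], w: int, s: int) -> List[List[str]]:
    """Split a word list into windows of at most w words, advancing s words per step."""
    if w <= 0 or s <= 0:
        raise ValueError("w and s must be positive")
    windows: List[List[str]] = []
    start = 0
    while start < len(words):
        windows.append(words[start:start + w])
        if start + w >= len(words):
            break  # this window already reaches the end of the document
        start += s
    return windows


def format_fragment(title: str, paragraph_words: List[str]) -> str:
    """Prefix the document title and wrap the paragraph in [CLS] ... [SEP] markers.

    The exact ordering of title and markers is an assumption of this sketch.
    """
    return f"[CLS] {title} {' '.join(paragraph_words)} [SEP]"


document_words = "a b c d e f g".split()
fragments = [
    format_fragment("Title", window)
    for window in sliding_window_split(document_words, w=4, s=2)
]
```

With w = 4 and s = 2, consecutive windows overlap by two words, which is one way of keeping context shared between neighboring paragraphs. Many tokenizers insert [CLS] and [SEP] themselves, in which case the explicit markers here would be omitted.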

[0063] S12: Use the first pre-trained language model to convert document paragraphs into embedding vectors;

[0064] To convert each document paragraph into a computer-processable form, a first pre-trained language model is used to convert each paragraph in the document into a vector representation, calculated as follows:

[0065] $e_i = M_1(p_i)$

[0066] where $e_i$ is the embedding vector of the $i$-th paragraph, $p_i$ is the original text of the $i$-th paragraph, and $M_1$ is the first pre-trained language model.

[0067] S13: Hierarchical clustering algorithm is used to cluster the embedding vectors to obtain document vectors;

[0068] To reduce computational redundancy, a hierarchical clustering algorithm is used to cluster all generated paragraph embedding vectors. This process merges multiple similar paragraphs into a single cluster center, thereby reducing the computational load in subsequent retrieval stages.

[0069] After clustering, each document is represented as a set of vectors $D = \{c_1, c_2, \ldots, c_k\}$, where $D$ is the document vector and $c_j$ is the $j$-th cluster center, each cluster center being a semantic representation of multiple similar paragraphs.
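
By way of non-limiting illustration, the clustering of S13 can be sketched as a simple agglomerative procedure that merges the two closest clusters until no pair is close enough. The source specifies only "hierarchical clustering"; the centroid-distance linkage, the `max_distance` stopping threshold, and the function names below are assumptions of this sketch.

```python
import math
from typing import List

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_centers(embeddings: List[Vector], max_distance: float) -> List[Vector]:
    """Merge the two closest clusters (by centroid distance) until the closest
    remaining pair is farther apart than max_distance; return the centroids."""
    clusters = [[e] for e in embeddings]
    while len(clusters) > 1:
        centers = [mean_vector(c) for c in clusters]
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                d = euclidean(centers[i], centers[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break  # remaining clusters are all sufficiently distinct
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [mean_vector(c) for c in clusters]
```

Four paragraph embeddings that form two tight groups collapse into two cluster centers, so the later similarity step compares the query against two vectors instead of four.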

[0070] S14: Encode the query into a query vector using the first pre-trained language model.

[0071] The first pre-trained language model includes BERT and / or RoBERTa.

[0072] S2: Calculate the similarity between the document vector and the query vector, and select the top K1 documents as candidate documents based on the similarity.

[0073] The similarity between document vectors and query vectors can be calculated using Euclidean distance or cosine similarity.
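
By way of non-limiting illustration, the similarity calculation and top-K1 selection of S2 can be sketched as follows. Scoring a document by its best-matching cluster center is an assumption of this sketch, as are the function names; Euclidean distance is negated so that, for both measures, a larger value means a closer match.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_similarity(a: Vector, b: Vector) -> float:
    """Negated Euclidean distance, so that larger means more similar."""
    return -math.dist(a, b)


def top_k1_documents(
    query: Vector,
    doc_centers: Dict[str, List[Vector]],
    k1: int,
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[str]:
    """Score each document by its best-matching cluster center and
    return the IDs of the k1 highest-scoring documents."""
    scores = {
        doc_id: max(similarity(query, center) for center in centers)
        for doc_id, centers in doc_centers.items()
        if centers
    }
    return sorted(scores, key=scores.get, reverse=True)[:k1]
```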

[0074] S21: Calculate the similarity between the query vector and the document vector, and select the top K1 documents as the initial candidate document set based on the similarity.

[0075] S22: Remove duplicates from the initial candidate document set;

[0076] S23: Use regular expressions to segment the initial candidate document set after deduplication into sentences; and merge short sentences by setting the sliding window size and step size to obtain the candidate document set;

[0077] In some specific embodiments of the present invention, the size of the candidate document set is controlled by changing the sliding window size and the window step size.

[0078] S24: Set an ID for each candidate document in the candidate document set.

[0079] IDs are used to establish a traceable indexing system.
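
By way of non-limiting illustration, steps S22 to S24 can be sketched as follows. The regular expression, the exact-match deduplication, the `cand-0000` ID format, and the function names are assumptions of this sketch; the source does not specify them.

```python
import re
from typing import Dict, List


def deduplicate(docs: List[str]) -> List[str]:
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(docs))


def split_sentences(text: str) -> List[str]:
    """Split after Latin sentence punctuation followed by whitespace,
    or directly after CJK sentence punctuation."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


def merge_sentences(sentences: List[str], window: int, step: int) -> List[str]:
    """Join `window` consecutive sentences, advancing `step` sentences at a time."""
    merged: List[str] = []
    start = 0
    while start < len(sentences):
        merged.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
        start += step
    return merged


def build_candidate_set(docs: List[str], window: int = 2, step: int = 2) -> Dict[str, str]:
    """Deduplicate, segment, merge short sentences, and assign a traceable ID."""
    candidates: Dict[str, str] = {}
    for doc in deduplicate(docs):
        for chunk in merge_sentences(split_sentences(doc), window, step):
            candidates[f"cand-{len(candidates):04d}"] = chunk
    return candidates
```

A larger window or step yields fewer, longer candidates, which is one way the candidate set size can be controlled as described in paragraph [0077].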

[0080] The above process completes the coarse-grained retrieval stage.

[0081] S3: Segment the candidate document into paragraphs, extract paragraph keywords, and use the second pre-trained language model to convert the paragraph keywords into paragraph keyword vectors;

[0082] S31: Match candidate documents in the candidate document set based on their IDs;

[0083] Extract candidate document IDs, and quickly match and locate the corresponding original documents based on the candidate document IDs by traversing the document directory.

[0084] S32: Using multi-process vectors, candidate documents in the candidate document set are segmented into paragraphs based on punctuation marks;

[0085] Each paragraph serves as a basic semantic unit, participating in subsequent vectorization processing and similarity matching.

[0086] S33: Extract keywords for each paragraph, remove stop words and high-frequency words, and use the second pre-trained language model to generate paragraph keyword vectors.

[0087] The second pre-trained language model includes one or more of the following: dual-tower encoder, DPR, and ColBERT.

[0088] S4: Extract query keywords and use the second pre-trained language model to convert the query keywords into query keyword vectors;

[0089] The query is first segmented into words; punctuation, stop words, and high-frequency words are removed; and keyword extraction techniques are then applied to obtain the query keywords.

[0090] Keyword extraction techniques include one or more of TF-IDF, TextRank, and BERT-NER.
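
By way of non-limiting illustration, keyword extraction with TF-IDF (one of the techniques listed above) can be sketched as follows. The stop-word list, the smoothed IDF formula, and the function names are assumptions of this sketch; high-frequency words are not removed explicitly here but are down-weighted by the IDF term.

```python
import math
import re
from collections import Counter
from typing import List

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "what", "how"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text, drop punctuation, and remove stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def tfidf_keywords(text: str, corpus: List[str], top_n: int = 3) -> List[str]:
    """Return the top_n terms of `text` ranked by TF-IDF against `corpus`."""
    tokens = tokenize(text)
    if not tokens:
        return []
    term_counts = Counter(tokens)
    corpus_tokens = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus_tokens)

    def idf(term: str) -> float:
        doc_freq = sum(term in doc for doc in corpus_tokens)
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF

    scores = {t: (c / len(tokens)) * idf(t) for t, c in term_counts.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

The same function can be applied to a paragraph (S33) or to a query (S4); only the input text changes.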

[0091] S5: Based on the query keyword vector, calculate the similarity between the paragraph keyword vector and the query keyword vector, and select the top K2 paragraphs as the final search results based on the similarity.

[0092] The similarity between the paragraph keyword vector and the query keyword vector is calculated using a scoring function. The calculation expression is as follows:

[0093]

[0094] In the expression above, the terms denote, in order: the scoring function; the query; the candidate document set; the initial number of paragraphs; the number of vectors after hierarchical clustering; and the total number of paragraphs.

[0095] The remaining terms denote: the correlation weight between the i-th element of the query and the j-th element of the candidate document set; the transpose of the vector representation of the i-th element of the query; the vector representation of the j-th element of the candidate document set; and the relevance of the i-th element of the query to the candidate paragraph.

[0096] Record the best scores for matching different terms in the same paragraph, and select the top K2 paragraphs from the candidate documents as the final search results based on similarity.
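
By way of non-limiting illustration, the "best score per term" idea of paragraph [0096] can be sketched as follows. This is not the scoring function of the invention: it keeps, for each query keyword, the best cosine match among a paragraph's keyword vectors and averages those values, and it omits the correlation weights. The averaging, the use of cosine similarity, and the function names are assumptions of this sketch.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def paragraph_score(query_kw_vecs: List[Vector], para_kw_vecs: List[Vector]) -> float:
    """For each query keyword keep its best match in the paragraph, then average."""
    if not query_kw_vecs or not para_kw_vecs:
        return 0.0
    best = [max(cosine(q, p) for p in para_kw_vecs) for q in query_kw_vecs]
    return sum(best) / len(best)


def top_k2_paragraphs(
    query_kw_vecs: List[Vector],
    paragraphs: Dict[str, List[Vector]],
    k2: int,
) -> List[str]:
    """Return the IDs of the k2 paragraphs with the highest scores."""
    scores = {pid: paragraph_score(query_kw_vecs, vecs) for pid, vecs in paragraphs.items()}
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:k2]
```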

[0097] S6: Associate the final search results with large-scale long texts to generate an interpretable search results report.
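
By way of non-limiting illustration, one possible shape for the interpretable report of S6 is sketched below. The source does not define a report format; the JSON layout, every field name, and the function name `build_report` are hypothetical. The sketch only shows each returned paragraph being linked back to its source document together with its score and matched keywords.

```python
import json
from typing import Dict, List


def build_report(query: str, ranked: List[dict], sources: Dict[str, dict]) -> str:
    """Build a JSON report that traces each ranked paragraph to its source document.

    ranked:  [{"paragraph_id", "score", "matched_keywords"}], best first
    sources: paragraph_id -> {"doc_id", "title", "text"}
    """
    results = []
    for rank, hit in enumerate(ranked, start=1):
        source = sources[hit["paragraph_id"]]
        results.append({
            "rank": rank,
            "paragraph_id": hit["paragraph_id"],
            "source_document": {"id": source["doc_id"], "title": source["title"]},
            "score": round(hit["score"], 4),
            "matched_keywords": hit["matched_keywords"],
            "text": source["text"],
        })
    return json.dumps({"query": query, "results": results}, ensure_ascii=False, indent=2)
```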

[0098] This invention provides an efficient and accurate retrieval method, known as the EAMA-R retrieval method. This method is applicable to large-scale knowledge bases and employs a two-stage retrieval strategy—coarse-grained retrieval and fine-grained retrieval—to significantly reduce computational overhead and improve retrieval efficiency while maintaining retrieval accuracy.

[0099] Coarse-grained retrieval stage: Through segmented encoding enhanced with document titles and hierarchical clustering, a document collection on the order of millions is narrowed to a candidate set on the order of hundreds;

[0100] Fine-grained retrieval stage: Keyword vector construction and a dynamic weighted scoring mechanism are used to achieve paragraph-level semantic matching;

[0101] Through an interface compatible with different embedding models, the proposed method adapts to knowledge bases in different domains without any model fine-tuning.

[0102] The EAMA-R retrieval method first quickly filters out candidate documents that may contain the target information, then further refines the matching of relevant paragraphs within the candidate documents, and finally returns the retrieval results. The entire process requires no additional pre-training or fine-tuning and can be directly applied to knowledge bases in different domains.
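
By way of non-limiting illustration, the two-stage flow and its model-agnostic character can be sketched as follows. The encoders are passed in as plain callables so that any embedding model can be substituted without fine-tuning. The bag-of-words encoder is only a stand-in for a pre-trained language model, and the sketch simplifies both stages (one vector per document, whole paragraphs instead of extracted keywords); all names are assumptions of this sketch.

```python
import math
import re
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]


def make_bow_embedder(vocabulary: List[str]) -> Embedder:
    """Stand-in for a pre-trained encoder: word counts over a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}

    def embed(text: str) -> Vector:
        vec = [0.0] * len(index)
        for token in re.findall(r"\w+", text.lower()):
            if token in index:
                vec[index[token]] += 1.0
        return vec

    return embed


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query: str,
    documents: Dict[str, List[str]],  # document ID -> list of paragraphs
    k1: int,
    k2: int,
    first_model: Embedder,
    second_model: Embedder,
) -> List[Tuple[str, int, float]]:
    """Return up to k2 hits as (document ID, paragraph index, score)."""
    # Stage 1 (coarse-grained): keep the k1 documents closest to the query.
    q_doc = first_model(query)
    doc_scores = {d: cosine(q_doc, first_model(" ".join(p))) for d, p in documents.items()}
    candidates = sorted(doc_scores, key=lambda d: (-doc_scores[d], d))[:k1]
    # Stage 2 (fine-grained): rank paragraphs of the candidate documents only.
    q_kw = second_model(query)
    hits = [
        (d, i, cosine(q_kw, second_model(paragraph)))
        for d in candidates
        for i, paragraph in enumerate(documents[d])
    ]
    hits.sort(key=lambda h: (-h[2], h[0], h[1]))
    return hits[:k2]
```

Because the second stage only touches paragraphs of the K1 candidate documents, its cost depends on K1 rather than on the size of the whole knowledge base.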

[0103] In some specific embodiments of the present invention, raw data is obtained, and the text and queries are structured and organized to obtain a JSON file. The raw data can come from any field and is not limited in size, length, or language.
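
By way of non-limiting illustration, structuring raw texts and queries into a JSON file might look as follows. The source states only that a JSON file is obtained; the schema, the ID formats, and the function name `build_dataset` are hypothetical.

```python
import json
from typing import List


def build_dataset(documents: List[dict], queries: List[str]) -> str:
    """Serialize raw documents and queries into a single JSON string.

    Each input document is a dict with a "text" key and an optional "title" key.
    """
    payload = {
        "documents": [
            {"id": f"doc-{i:06d}", "title": d.get("title", ""), "text": d["text"]}
            for i, d in enumerate(documents)
        ],
        "queries": [{"id": f"q-{i:04d}", "text": q} for i, q in enumerate(queries)],
    }
    # ensure_ascii=False keeps non-English text readable in the output file.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```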

[0104] As shown in Figure 3, the query and document are each divided into several short text slices. The sliding window method is used to split the document into multiple paragraphs to ensure that no information is missed.

[0105] Add a title to each paragraph and insert [CLS] (start) and [SEP] (end) tags at the beginning and end of each paragraph for context modeling during subsequent text embedding. This information will be uniformly formatted into the input fragment.

[0106] The above text paragraphs are encoded using a pre-trained language model (such as BERT or RoBERTa) to obtain corresponding embedding vectors. Each paragraph is mapped to a vector representation.

[0107] Hierarchical clustering is used to cluster and compress all paragraph embedding vectors within a document, merging multiple semantically similar paragraph vectors into cluster centers, thereby reducing the computational complexity of subsequent retrieval and reducing redundancy.

[0108] When a query is input, it is also converted into a vector representation through an encoding model, and then matched with the cluster center vectors of all candidate documents. The similarity between the document vector and the query vector is calculated, and the top K1 documents are selected as candidate documents based on the similarity. Several candidate documents are then filtered out by deduplication and used as input for subsequent fine ranking or generation.

[0109] In Figure 3, the blue highlighted sub-blocks represent the cluster centers most relevant to the query. The system outputs the documents to which these sub-blocks belong as candidate documents.

[0110] As shown in Figure 4, the fine-grained stage identifies the semantic fragments most relevant to the query from the candidate documents that remain after coarse-grained filtering, in order to achieve more accurate information retrieval.

[0111] Each document in the candidate document set is divided into multiple smaller passages, typically several sentences or paragraphs, to improve the accuracy and flexibility of subsequent searches.

[0112] Based on the query text, keyword extraction techniques such as TF-IDF, TextRank, and BERT-NER are used to generate query keywords. These query keywords are used to represent the core semantics of the query.

[0113] Keywords and all document paragraphs are input into a second pre-trained language model for encoding, generating paragraph keyword vectors and query keyword vectors. The second pre-trained model includes one or more architectures such as dual-tower encoder, DPR, or ColBERT.

[0114] Based on the query keyword vector, the similarity between the query keyword vector and the keyword vectors of all paragraphs is calculated, and all paragraphs are ranked using a vector space retrieval method. Paragraphs from each candidate document p1, p2 to pn are all included in the matching process.

[0115] The paragraphs with the highest similarity ranking are selected as the final search results (ranked passages). As shown in Figure 4, green indicates paragraphs with a high degree of matching. These paragraphs serve as input for subsequent reading comprehension or generative models to improve the accuracy and relevance of answer generation.

[0116] Fine-grained retrieval mechanisms not only preserve semantic information but also have the ability to locate key information fragments in long documents, making them particularly suitable for scenarios requiring precise evidence location, such as open-domain question answering and patent examination.

[0117] As shown in Figure 2, an efficient and universal large-scale long text enhanced retrieval system, used to execute the efficient and universal large-scale long text enhanced retrieval method, includes:

[0118] The first vector conversion module 101 uses a first pre-trained language model to convert large-scale long text into document vectors; it also uses the first pre-trained language model to encode queries into query vectors.

[0119] The first similarity calculation module 102 calculates the similarity between the document vector and the query vector, and selects the top K1 documents as candidate documents based on the similarity.

[0120] The second vector conversion module 103 segments the candidate document into paragraphs, extracts paragraph keywords, and uses the second pre-trained language model to convert the paragraph keywords into paragraph keyword vectors.

[0121] The third vector conversion module 104 extracts query keywords and uses the second pre-trained language model to convert the query keywords into query keyword vectors.

[0122] The second similarity calculation module 105 calculates the similarity between the paragraph keyword vector and the query keyword vector, and selects the top K2 paragraphs as the final search results based on the similarity.

[0123] The results generation module 106 associates the final search results with large-scale long texts to generate an interpretable search results report.

[0124] Through the collaboration of the above modules, retrieval is refined step by step from coarse-grained to fine-grained, significantly improving retrieval efficiency. In the coarse-grained stage, EAMA-R uses a single vector to represent the query and the document, quickly filtering the most relevant candidate documents from a massive amount of data. The fine-grained stage then uses keyword vectors for detailed matching within those documents, ensuring that the retrieved information is highly relevant.

Experimental results show that, compared with existing retrieval methods such as sparse retrieval (BM25), dense retrieval (Dual-Encoder), and multi-vector retrieval (ColBERT and XTR), EAMA-R reduces retrieval overhead by 73.778%, significantly lowering computational resource consumption while maintaining retrieval accuracy, which makes it particularly suitable for retrieval tasks on ultra-large-scale knowledge bases. In the fine-grained retrieval stage, EAMA-R identifies relevant paragraphs by analyzing and matching key terms in the query and the document, effectively avoiding redundant information during retrieval. Compared with other existing retrieval methods, EAMA-R achieves a 40.095% improvement in retrieval accuracy. Especially when handling long documents and complex queries, its word-level matching better captures the deep semantic relationships between queries and documents, yielding more accurate retrieval.

By improving the quality of retrieved content, EAMA-R also helps large language models generate more accurate answers. Feeding the retrieved information into the generative model reduces the model's hallucination problem and improves the reliability of the generated answers. Experiments show that EAMA-R improves question-answering performance by 11.198% compared with other retrieval strategies in large language model question-answering tasks. This improvement mainly comes from efficient and accurate document retrieval and a refined paragraph selection process.

EAMA-R is a model-agnostic retrieval method that is compatible with any pre-trained embedding model without additional fine-tuning or further training, and it has strong versatility across knowledge bases in different domains. For example, whether the knowledge base consists of patent documents, legal clauses, or medical papers, EAMA-R can adapt to it without requiring additional domain-specific training data or computational resources, greatly enhancing its flexibility and scalability in practical applications.

[0125] In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute the efficient and universal large-scale long text enhanced retrieval method provided above.

[0126] In a further aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the efficient and universal large-scale long text enhanced retrieval method provided above.

[0127] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the method of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0128] From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software together with a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solutions, in essence, or the part thereof that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments or in some parts of the embodiments.

[0129] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An efficient and universal large-scale long text augmented retrieval method, characterized in that it comprises:
S1: Use the first pre-trained language model to convert large-scale long texts into document vectors, and use the first pre-trained language model to encode queries into query vectors;
S2: Calculate the similarity between the document vector and the query vector, and select the top K1 documents as candidate documents based on the similarity;
S3: Segment the candidate documents into paragraphs, extract paragraph keywords, and use the second pre-trained language model to convert the paragraph keywords into paragraph keyword vectors;
S4: Extract query keywords, and use the second pre-trained language model to convert the query keywords into query keyword vectors;
S5: Calculate the similarity between the paragraph keyword vector and the query keyword vector, and select the top K2 paragraphs as the final search results based on the similarity;
S6: Associate the final search results with the large-scale long texts to generate an interpretable search results report.
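
Steps S1 to S6 can be sketched end to end as follows. This is a minimal illustration under loud assumptions, not the claimed implementation: `embed` is a bag-of-words stand-in for both pre-trained language models, `keywords` is a stand-in keyword extractor, paragraphs are assumed to be separated by blank lines, and the way keyword-vector similarities are aggregated in `keyword_score` (best match per query keyword, averaged) is a choice made here because the claim does not fix one. All function names and the sample corpus are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the source).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "of", "the", "to"}


def embed(text: str) -> Counter:
    """Stand-in for a pre-trained language model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def keywords(text: str) -> list[str]:
    """Stand-in keyword extractor: unique tokens minus stop words."""
    return sorted(set(embed(text)) - STOP_WORDS)


def keyword_score(query_kw: list[str], para_kw: list[str]) -> float:
    """Assumed aggregation: best paragraph-keyword match per query keyword, averaged."""
    if not query_kw or not para_kw:
        return 0.0
    best = [max(cosine(embed(q), embed(p)) for p in para_kw) for q in query_kw]
    return sum(best) / len(best)


def retrieve(docs: dict[str, str], query: str, k1: int, k2: int) -> list[dict]:
    # S1-S2: one vector per document and for the query; keep the top-K1 documents.
    q_vec = embed(query)
    candidates = sorted(docs, key=lambda d: cosine(embed(docs[d]), q_vec), reverse=True)[:k1]
    # S3-S5: split candidates into paragraphs, score keyword sets, keep the top-K2 paragraphs.
    q_kw = keywords(query)
    scored = []
    for doc_id in candidates:
        paragraphs = [p.strip() for p in docs[doc_id].split("\n\n") if p.strip()]
        for idx, para in enumerate(paragraphs):
            scored.append((keyword_score(q_kw, keywords(para)), doc_id, idx, para))
    scored.sort(key=lambda item: item[0], reverse=True)
    # S6: tie each result back to its source document and the keywords that matched.
    return [
        {"doc_id": d, "paragraph": i, "score": s, "text": p,
         "matched_keywords": sorted(set(q_kw) & set(keywords(p)))}
        for s, d, i, p in scored[:k2]
    ]


docs = {
    "patent": "A retrieval method encodes documents as vectors.\n\n"
              "Candidate paragraphs are matched by keyword vectors.",
    "law": "The contract clause defines liability.\n\nThe tenant pays rent monthly.",
    "medicine": "The trial measured blood pressure.\n\nPatients received a placebo.",
}
report = retrieve(docs, "keyword vectors for paragraphs", k1=2, k2=1)
print(report)
```

With the bag-of-words stand-in, two keywords match only when they are identical; a real embedding model would give graded similarities, but the control flow of the two stages would stay the same.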

2. The efficient and universal large-scale long text augmented retrieval method according to claim 1, characterized in that step S1 includes:
S11: Use the sliding window method to split large-scale long texts into multiple document paragraphs;
S12: Use the first pre-trained language model to convert the document paragraphs into embedding vectors;
S13: Use a hierarchical clustering algorithm to cluster the embedding vectors to obtain document vectors;
S14: Use the first pre-trained language model to encode the query into a query vector.
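
A minimal sketch of S11 and S13 follows, assuming the window is measured in tokens and that the document vectors are taken to be cluster centroids; the claim does not state either detail. The clustering shown is a simple bottom-up (agglomerative) variant that merges the two clusters with the closest centroids, chosen here for brevity. The hand-made 2-D vectors stand in for the S12 embeddings, and all names are hypothetical.

```python
import math


def sliding_window(tokens: list[str], size: int, step: int) -> list[list[str]]:
    """S11: overlapping windows of `size` tokens, advancing `step` tokens at a time."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[s:s + size] for s in range(0, len(tokens) - size + step, step)]


def centroid(vectors: list[list[float]]) -> list[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_centroids(vectors: list[list[float]], num_clusters: int) -> list[list[float]]:
    """S13 (assumed variant): bottom-up clustering that repeatedly merges the two
    clusters with the closest centroids, then returns one centroid per cluster."""
    clusters = [[v] for v in vectors]
    while len(clusters) > max(1, num_clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


windows = sliding_window("a b c d e f g".split(), size=3, step=2)
# Hand-made 2-D vectors standing in for the S12 paragraph embeddings.
paragraph_vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
document_vectors = cluster_centroids(paragraph_vectors, num_clusters=2)
print(windows, document_vectors)
```

Setting `num_clusters=1` collapses a document to the mean of its paragraph embeddings, which corresponds to a single vector per document.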

3. The efficient and universal large-scale long text augmented retrieval method according to claim 1, characterized in that the first pre-trained language model includes BERT and/or RoBERTa.

4. The efficient and universal large-scale long text augmented retrieval method according to claim 1, characterized in that, in step S2, the similarity between the document vector and the query vector is calculated using Euclidean distance or cosine similarity.
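
The two measures named in this claim can be sketched as follows. One practical point: cosine similarity is ranked in descending order, whereas Euclidean distance is ranked in ascending order, and the two can disagree when vectors differ in length. The `top_k` helper and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u: list[float], v: list[float]) -> float:
    return math.dist(u, v)


def top_k(query: list[float], doc_vectors: dict[str, list[float]], k: int,
          metric: str = "cosine") -> list[str]:
    """Return the ids of the k documents closest to the query under `metric`."""
    if metric == "cosine":
        # Higher similarity is better, so sort in descending order.
        return sorted(doc_vectors,
                      key=lambda d: cosine_similarity(query, doc_vectors[d]),
                      reverse=True)[:k]
    if metric == "euclidean":
        # Smaller distance is better, so sort in ascending order.
        return sorted(doc_vectors,
                      key=lambda d: euclidean_distance(query, doc_vectors[d]))[:k]
    raise ValueError(f"unknown metric: {metric}")


query = [1.0, 0.0]
doc_vectors = {
    "near": [1.2, 0.1],
    "same_direction": [5.0, 0.0],
    "orthogonal": [0.0, 1.0],
}
by_cosine = top_k(query, doc_vectors, k=2, metric="cosine")
by_euclidean = top_k(query, doc_vectors, k=2, metric="euclidean")
print(by_cosine, by_euclidean)
```

If the embeddings are normalized to unit length, the two measures produce the same ranking, so the choice mainly matters for unnormalized vectors.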

5. The efficient and universal large-scale long text augmented retrieval method according to claim 1, characterized in that step S2 includes:
S21: Calculate the similarity between the query vector and the document vector, and select the top K1 documents as the initial candidate document set based on the similarity;
S22: Remove duplicates from the initial candidate document set;
S23: Use regular expressions to segment the deduplicated initial candidate document set into sentences, and merge short sentences by setting a sliding window size and step size to obtain the candidate document set;
S24: Set an ID for each candidate document in the candidate document set.
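
Steps S22 to S24 can be sketched as follows, starting from the ranked documents produced by S21. The sketch assumes exact-match deduplication, a window measured in sentences, and sequential string IDs; the claim does not specify any of these, and the regular expression, function names, and sample texts are hypothetical.

```python
import re


def dedupe(docs: list[str]) -> list[str]:
    """S22: drop exact duplicates, keeping the first occurrence in rank order."""
    return list(dict.fromkeys(docs))


def split_sentences(text: str) -> list[str]:
    """S23, first half: split after sentence-ending punctuation (Latin or CJK)."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


def merge_windows(sentences: list[str], size: int, step: int) -> list[str]:
    """S23, second half: join `size` consecutive sentences, advancing by `step`."""
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    starts = range(0, len(sentences) - size + step, step)
    return [" ".join(sentences[s:s + size]) for s in starts]


def build_candidates(top_docs: list[str], size: int = 3, step: int = 2) -> dict[str, str]:
    candidates: dict[str, str] = {}
    for doc in dedupe(top_docs):
        for chunk in merge_windows(split_sentences(doc), size, step):
            candidates[f"c{len(candidates)}"] = chunk  # S24: assign a sequential ID
    return candidates


top_docs = ["One. Two. Three. Four.", "One. Two. Three. Four.", "Alpha! Beta?"]
candidates = build_candidates(top_docs, size=2, step=1)
print(candidates)
```

Choosing a step smaller than the window size makes adjacent candidates overlap, so a sentence near a boundary still appears together with its neighbors in at least one candidate.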

6. The efficient and universal large-scale long text augmented retrieval method according to claim 5, characterized in that step S3 includes:
S31: Match candidate documents in the candidate document set based on their IDs;
S32: Using multi-process vectors, segment the candidate documents in the candidate document set into paragraphs based on punctuation marks;
S33: Extract keywords for each paragraph, remove stop words and high-frequency words, and use the second pre-trained language model to generate paragraph keyword vectors.
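
The filtering part of S33 can be sketched as follows. The claim does not define "high-frequency words"; this sketch assumes they are tokens that occur in more than a given share of the paragraphs, controlled by the hypothetical `max_share` parameter. The stop-word list is a tiny illustrative one, and the embedding of the surviving keywords by the second pre-trained language model is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the source).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def paragraph_keywords(paragraphs: list[str], max_share: float = 0.8) -> list[list[str]]:
    """Keep each paragraph's distinct tokens that are neither stop words nor
    high-frequency words (tokens found in more than `max_share` of the paragraphs)."""
    tokenized = [tokenize(p) for p in paragraphs]
    paragraph_freq = Counter(t for tokens in tokenized for t in set(tokens))
    limit = max_share * len(paragraphs)
    return [
        [t for t in dict.fromkeys(tokens)
         if t not in STOP_WORDS and paragraph_freq[t] <= limit]
        for tokens in tokenized
    ]


paragraphs = [
    "The system retrieves patent claims.",
    "The system ranks legal clauses.",
    "The system indexes medical papers.",
]
keywords = paragraph_keywords(paragraphs)
print(keywords)
```

Here "system" is dropped because it occurs in every paragraph and therefore does not help distinguish one paragraph from another.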

7. The efficient and universal large-scale long text augmented retrieval method according to claim 1, characterized in that, in step S4, keyword extraction techniques are used to extract the query keywords, the techniques including one or more of TF-IDF, TextRank, and BERT-NER.
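
Of the techniques listed, TF-IDF is the simplest to illustrate. The sketch below ranks the terms of a query by term frequency multiplied by a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1, computed against a background corpus. The smoothing variant, the tokenizer, the function names, and the sample corpus are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(query: str, corpus: list[str], top_n: int = 3) -> list[str]:
    """Rank the query's terms by TF-IDF against `corpus` and return the top `top_n`."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(tokenize(doc)))
    term_freq = Counter(tokenize(query))
    total = sum(term_freq.values())
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in term_freq.items()
    }
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]


corpus = [
    "the court ruled on the contract",
    "the patient received the drug",
    "the patent claims a retrieval method",
]
query_keywords = tfidf_keywords("the patent retrieval method", corpus)
print(query_keywords)
```

Terms that appear in every background document, such as "the", receive the lowest weight and fall out of the top results, while terms absent from the corpus receive the highest weight, which is worth keeping in mind for out-of-vocabulary query words.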

8. The efficient and universal large-scale long text augmented retrieval method according to claim 1, characterized in that the second pre-trained language model includes one or more of the following: a dual-tower encoder, DPR, and ColBERT.

9. An efficient and universal large-scale long text augmented retrieval system, characterized in that it is configured to perform the efficient and universal large-scale long text augmented retrieval method according to any one of claims 1 to 8, and comprises:
a first vector conversion module, which uses a first pre-trained language model to convert large-scale long texts into document vectors, and uses the first pre-trained language model to encode queries into query vectors;
a first similarity calculation module, which calculates the similarity between the document vector and the query vector, and selects the top K1 documents as candidate documents based on the similarity;
a second vector conversion module, which segments the candidate documents into paragraphs, extracts paragraph keywords, and uses a second pre-trained language model to convert the paragraph keywords into paragraph keyword vectors;
a third vector conversion module, which extracts query keywords and uses the second pre-trained language model to convert the query keywords into query keyword vectors;
a second similarity calculation module, which calculates the similarity between the paragraph keyword vector and the query keyword vector, and selects the top K2 paragraphs as the final search results based on the similarity;
a results generation module, which associates the final search results with the large-scale long texts to generate an interpretable search results report.

10. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the efficient and universal large-scale long text augmented retrieval method according to any one of claims 1 to 8.