A multi-level text retrieval method, device and medium
By combining named entity recognition and statistical word segmentation in parallel processing with semantic evaluation of a large language model, a multi-level partitioned index is constructed, which solves the problem of insufficient semantic association capture in patent text retrieval and achieves efficient, accurate and interpretable patent retrieval results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING AUGUST MELON TECHNOLOGY CO LTD
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-30
AI Technical Summary
Existing patent search methods struggle to effectively capture semantic relationships when dealing with complex patent texts, resulting in insufficient recall, biased relevance assessments, and weak interpretability of the vector search process, making it difficult to support the traceability of evidence required in patent examination and legal proceedings.
We employ parallel processing of named entity recognition and statistical word segmentation, combined with semantic mapping of a large language model, to construct a multi-level partitioned index and a restricted fine ranking method. Through the fusion strategy of NER and TF-IDF, we generate a multi-level vocabulary and perform staged retrieval to ensure the accuracy, robustness and interpretability of the retrieval.
It enables more accurate identification of core technical content in patent texts, improves search accuracy, shortens the search time for large-scale patent databases, increases system throughput and scalability, and provides a transparent chain of evidence to support manual review.
[Figure: abstract drawing, CN122309697A]
Description
Technical Field
[0001] This document relates to the field of information retrieval technology, and in particular to a multi-level text retrieval method, device and medium.
Background Technology
[0002] With technological advancements and intensifying industry competition, the number of patent documents has grown exponentially. Patent retrieval has become a fundamental task in technology research and development, patent strategy, infringement analysis, and examination and approval. Patent documents exhibit several significant textual and business characteristics: patent texts are typically long and structurally complex, containing multiple sections such as specifications, claims, and descriptions of figures; the terminology is highly specialized, containing numerous synonyms / near-synonyms, field-specific abbreviations and symbols such as chemical formulas, parameter expressions, and numerical units; key technical points are often sparsely distributed throughout the text. Therefore, directly treating patents as general natural language text for retrieval often yields poor results.
[0003] Currently, mainstream retrieval methods fall into two categories. Statistical model-based methods, such as Boolean retrieval, TF-IDF, and BM25, are robust and interpretable, but they struggle to capture semantic relationships and are prone to insufficient recall and biased relevance judgments when faced with terminological diversity and cross-sentence technical relationships. Vector-based semantic retrieval methods, such as word embeddings and sentence vectors, improve semantic matching, but still suffer from the following problems in actual patent retrieval: first, they do not differentiate between high-frequency template sentences and industry-standard terms in the text, which introduces noise into the semantic vectors; second, synonymous terms form dense clusters in the vector space, so simply accumulating similarity can easily give excessive weight to a single technical dimension; third, the vector retrieval process has weak interpretability, making it difficult to support the evidence traceability required in patent examination and legal proceedings.
Summary of the Invention
[0004] This invention provides a multi-level text retrieval method, device, and medium. It aims to solve the above problems through parallel NER and statistical word segmentation, a fusion strategy of dual-source TF-IDF and semantic mapping of large language models, keyword clustering control, and the engineering implementation of multi-level partitioned indexing and restricted fine ranking. This achieves a balance between accuracy, robustness, interpretability, and scalability, and meets the actual business needs of patent retrieval.
[0005] According to an embodiment of the present invention, a multi-level text retrieval method is provided, comprising:
S1. Preprocess and structure the original patent data;
S2. Process the preprocessed patent data in parallel using a named entity recognition model and a word segmentation tool to obtain entity keywords and statistical word segmentation results, and fuse statistical weights based on the entity keywords and word segmentation results to obtain preliminary statistical weights;
S3. Use a large language model to evaluate the semantic importance of the entity keywords, generate semantic weights, and merge the semantic weights with the preliminary statistical weights according to a preset ratio to generate the final weight of each keyword;
S4. Based on the final weight, construct a three-level thesaurus for each document, the three-level thesaurus including a topic thesaurus for initial screening, a phrase thesaurus for partitioned retrieval, and a keyword thesaurus for fine ranking;
S5. Construct a corresponding hierarchical index based on the three-level thesaurus;
S6. Receive a query request, perform a search based on the hierarchical index, and aggregate the matching contributions of semantically similar keywords during the search process;
S7. Output the sorted search results.
[0006] Preferably, S2 specifically includes: The preprocessed patent data is processed using a named entity recognition model to extract technical entities and record their location information. A dictionary- and rule-based word segmentation tool is used to process the preprocessed patent data in parallel with the extraction process of the named entity recognition model to obtain the word segmentation results and record their location information. Based on the technical entities and the word segmentation results, global inverse document frequencies are constructed and the TF-IDF values of each term are calculated. For a technical entity that can be semantically divided into multiple word segments, the TF-IDF value of the technical entity and the average TF-IDF value of its corresponding multiple word segments are weighted and summed according to a preset first weight ratio to generate the preliminary statistical weight.
[0007] Preferably, S3 specifically includes: The original semantic scores of the large language model are mapped to the numerical range of the preliminary statistical weights of the current document to generate semantic weights consistent with the scale of the preliminary statistical weights. The semantic weights and preliminary statistical weights are linearly fused according to a second preset weight ratio, wherein the proportion of the semantic weights is higher than the proportion of the preliminary statistical weights.
[0008] Preferably, S4 specifically includes: The topic thesaurus is composed of the top N keywords with the highest final weights selected from the document. N-gram phrases are extracted from the document text using a sliding window, and the phrase vocabulary is formed by filtering according to the final weight. The keyword table is composed of all the deduplicated keywords in the document and their metadata.
[0009] Preferably, in step S5, the corresponding hierarchical index is constructed based on the three-level vocabulary, specifically including: a first vector index for global initial screening is constructed based on the topic thesaurus; a second vector index is constructed based on the phrase vocabulary for efficient retrieval within each technical topic partition; and a third vector index is constructed based on the keyword table for refined matching and sorting.
[0010] Preferably, the retrieval based on a hierarchical index in step S6 specifically includes: a preliminary search is performed based on the first vector index, and one or more technical topic partitions are determined based on the preliminary search result set; within the determined technical topic partitions, a partition retrieval is performed based on the corresponding second vector index to obtain a candidate patent set; and within the candidate patent set, retrieval and ranking are performed based on the third vector index.
[0011] Preferably, the retrieval and ranking based on the third vector index specifically includes: clustering keywords with similar semantics; taking the maximum of the matching contributions of multiple keywords within the same cluster group; and summing the matching contributions of the different cluster groups to obtain the final similarity score.
[0012] Preferably, after step S7, the method further includes: based on the location information of the matching keywords in the document, tracing back to the corresponding sentence or paragraph in the original text and highlighting it.
[0013] According to an embodiment of the present invention, an electronic device is provided, comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to perform the steps of the multi-level text retrieval method described above.
[0014] According to an embodiment of the present invention, a storage medium is provided for storing computer-executable instructions, which, when executed, implement the steps of the above-described multi-level text retrieval method.
[0015] This invention employs parallel named entity recognition and statistical word segmentation, combined with semantic evaluation from a large language model, to perform multi-source weight fusion on keywords. Weight calculation therefore considers both the objectivity of word frequency statistics and semantic importance, so the core technical content of patent documents is identified and represented more accurately and the retrieval accuracy for complex patent texts improves. By constructing a three-level thesaurus and corresponding hierarchical indexes, and adopting a phased retrieval process of initial screening, partitioned retrieval, and fine ranking, the global retrieval task is decomposed into multiple efficient local retrievals. This significantly reduces the computational scope of each retrieval, lowers the time spent on full-text fine matching in a large-scale patent database, and significantly improves system throughput and scalability. The system retains metadata such as keyword statistical weights, semantic weights, location information, and clustering relationships. While outputting the ranking results, it can trace back to and highlight the relevant sentences and segments in the original text based on the positions of the matched keywords, providing a clear chain of evidence for manual review and making the retrieval process and results more transparent and credible. During the fine ranking stage, semantically similar keywords are clustered and multiple matches within the same cluster are aggregated, which prevents the similarity score from being unduly amplified by repeated counting of synonyms and near-synonyms and makes the final ranking more reasonable and stable.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in one or more embodiments of this specification or in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a flowchart of a multi-level text retrieval method according to an embodiment of the present invention; Figure 2 is a flowchart illustrating the process of obtaining preliminary statistical weights according to an embodiment of the present invention; Figure 3 is a schematic diagram of dual-source statistics and fusion.
Detailed Implementation
[0018] To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, the technical solutions in one or more embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of the embodiments. Based on one or more embodiments of this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of this document.
[0019] The core of this invention follows the approach of first extracting entities, then supplementing statistics, fusing statistics with semantics, constructing a multi-layered vocabulary, and using partitioned indexing and fine ranking.
[0020] Precise entity extraction refers to the initial stage of the process where, instead of relying on general vocabulary segmentation, a named entity recognition model specifically tailored for the patent field is used to precisely extract high-quality technical entities, such as devices, materials, and method steps, from lengthy and complex patent texts. This step aims to capture the deep semantic units that best represent the essence of patent innovation, providing a high-quality starting point for subsequent analysis.
[0021] Statistical and semantic fusion is key to achieving accurate weight calculation in this method. It involves two levels of fusion: First, at the statistical feature level, dual-source statistics from entities and word segments are intelligently weighted and fused to generate preliminary statistical weights. Second, a large language model is introduced to assess the semantic importance of keywords, generating semantic weights, which are then linearly fused with the preliminary statistical weights after scale alignment to obtain the final weights. This fusion strategy ensures that the importance of each keyword reflects both its objective document statistical characteristics and its deep semantic connection to the document's topic.
[0022] Constructing a multi-layered thesaurus is a structured representation of document content based on the final weights. According to the needs of different retrieval stages, a three-level thesaurus is constructed: a topic thesaurus for rapid topic identification and initial screening, a phrase thesaurus for efficient partitioned retrieval, and a keyword thesaurus for final refined matching and ranking. This hierarchical representation method lays the foundation for subsequent staged retrieval.
[0023] Partitioned indexing combined with fine-grained ranking is an efficient retrieval strategy for large-scale patent databases. Based on a multi-layered thesaurus, a corresponding hierarchical vector index is constructed, decomposing a complex global retrieval task into a multi-stage pipeline: global initial screening, topic partitioning, intra-partition retrieval, and candidate set fine-grained ranking. In the fine-grained ranking stage, the matching contributions of semantically similar keywords are aggregated to effectively avoid synonymous duplicate scoring. Combined with evidence backtracking and highlighting, the final output is efficient, accurate, and interpretable retrieval results.
Method Implementation Examples
[0024] According to an embodiment of the present invention, a multi-level text retrieval method is provided. Figure 1 is a flowchart of a multi-level text retrieval method according to an embodiment of the present invention. As shown in Figure 1, the multi-level text retrieval method of this invention specifically includes: S1. Preprocess and structure the original patent data. Specifically, the original patent data is first cleaned and standardized: tags, HTML, and control characters are removed, the character encoding is unified, and complete sentence and paragraph boundaries are preserved. After processing, a JSON record is constructed and stored for each document, with the publication number as the key and the merged specification text as the value. At the same time, a sentence index table is generated for each document, recording the start and end character offsets and the sentence number of each sentence for later highlighting and backtracking. This stage uses parallel batch processing to handle the import of very large numbers of files, and batch and version information is recorded during storage to support index reconstruction and rollback.
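The preprocessing in S1 can be illustrated with a minimal sketch. The regular expressions, the naive punctuation-based sentence splitter, the function names, and the placeholder publication number are all assumptions made for illustration; the specification does not prescribe them, and a production splitter would need to handle decimals and abbreviations.

```python
import json
import re

TAG_RE = re.compile(r"<[^>]+>")                      # HTML/XML tags
CTRL_RE = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")    # control chars, keeping \t and \n
SENT_RE = re.compile(r"[^。！？.!?\n]+[。！？.!?]?")   # naive splitter (assumption)


def clean_text(raw: str) -> str:
    """Remove tags and control characters while keeping line breaks."""
    return CTRL_RE.sub("", TAG_RE.sub("", raw)).strip()


def build_sentence_index(text: str) -> list:
    """Record the sentence number and start/end character offsets."""
    index = []
    for m in SENT_RE.finditer(text):
        body = m.group()
        if not body.strip():
            continue
        lead = len(body) - len(body.lstrip())  # skip leading whitespace
        index.append({"sentence_id": len(index),
                      "start": m.start() + lead,
                      "end": m.end()})
    return index


def preprocess(pub_no: str, raw: str):
    """Return the JSON record (publication number -> text) and its sentence index."""
    text = clean_text(raw)
    record = json.dumps({pub_no: text}, ensure_ascii=False)
    return record, build_sentence_index(text)


record, sentences = preprocess(
    "CN0000000A",  # placeholder publication number, not a real document
    "<p>A film is grown on a substrate. The film is annealed!</p>\x07",
)
```

Because the offsets refer to the cleaned text, any later highlighting has to slice the same cleaned text rather than the raw input.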
[0025] S2. The preprocessed patent data is processed in parallel using a named entity recognition (NER) model and a word segmentation tool to obtain entity keywords and statistical word segmentation results. Based on these entity keywords and word segmentation results, statistical weights are fused to obtain preliminary statistical weights. The core of S2 is dual-source parallel processing and statistical weight fusion. Its design philosophy is that named entity recognition (NER) models and general word segmentation tools (jieba) each have their advantages and disadvantages. NER excels at recognizing complex technical terms with strong domain relevance and complete structure, but may have insufficient recall for new words or unlabeled entities; while dictionary- and rule-based word segmentation tools can provide fine-grained, high-coverage word segmentation, but may fragment technical terms that should be treated as a whole. Through parallel processing, dual-source statistics, and intelligent fusion, the integrity of terms and the representativeness of statistics can be balanced, laying a more reliable foundation for subsequent semantic analysis and weight calculation. Specifically, S2 includes: S21. The preprocessed patent data is processed using a named entity recognition model to extract technical entities and record their location information. In this embodiment of the invention, the Named Entity Recognition (NER) model performs annotation, model training, and batch extraction on preprocessed patent data. Specifically, the corpus samples are divided according to IPC sub-nodes, and each subdomain is manually annotated using the doccano platform. The labels cover technical entity categories (examples include devices, components, materials, parameters, method steps, etc.). The annotations are exported as standard sequence labeling format and used to train a Transformer-based sequence labeling model. 
The model training uses multi-domain patent corpora for fine-tuning to improve the recognition rate of long patent sentences and technical terms. During training, the confidence level of the model output is retained, and the trained model is applied in batches to the entire database to generate metadata for each recognized entity: including publication number, word text, start and end offsets, sentence number, entity category, and confidence level. This step ensures high recall of domain terms, providing a reliable foundation for subsequent phrase synthesis and location backtracking.
[0026] To illustrate this more specifically, we will use patent texts under IPC classification "H01L 21 / 02" as an example. First, several representative documents are sampled from patents in this technical field as the annotation corpus. On the doccano platform, annotators annotate the text sentence by sentence according to a predefined set of entity tags. For example, for the sentence "A silicon dioxide thin film is grown on a silicon substrate by chemical vapor deposition," the annotator will label "silicon substrate" as a "material" entity, "chemical vapor deposition" as a "method step" entity, and "silicon dioxide thin film" as a "component" entity. The annotation process follows the BIO or BIOES annotation system to distinguish the start, interior, and end positions of entities. For example, at the word level, "chemical vapor deposition" in the BIOES system might be labeled as "B-method step, I-method step, E-method step." After all annotations are completed, the data is exported as a standard format file containing sentence sequences and corresponding tag sequences. During the model training phase, a pre-trained Transformer model such as BERT or RoBERTa is used as the foundation. The final layer of the model is replaced with a sequence labeling head designed for this task, typically a linear classification layer that predicts the entity label for each input token. The training process uses the labeled patent corpus for supervised fine-tuning, often with a token-level cross-entropy loss. To improve the model's generalization to long patent sentences and technical terms, data augmentation strategies such as random masking and synonym replacement are applied during training, and domain-adaptive pre-training on patent text may be introduced. 
After training, the model not only assigns a category label to each identified entity but also outputs a confidence score between 0 and 1, reflecting the model's confidence in the prediction. In the batch extraction phase, the trained model is deployed to an inference server to process the entire patent database in a pipeline manner. For each document, the model infers sentence by sentence, outputting the labels of all identified entities, their precise start and end character offsets in the text, the index of the sentence, and the confidence score. This information, along with the document's publication number, is stored in a structured manner, forming a complete entity metadata record. For example, for the patent with publication number "CN202311234567A", one entity record might be: {"text":"graphene nanoribbons","start":1254,"end":1260,"sentence_id":15,"label":"materials","confidence":0.96}. This systematic and fine-grained entity extraction and recording lays a solid data foundation for subsequently building a high-quality keyword weighting system and achieving accurate sentence and paragraph positioning and highlighting.
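How tag sequences become entity metadata records can be sketched as follows. This is a minimal illustration that assumes word-level BIOES tags on whitespace-separated English tokens and takes the weakest token confidence as the entity confidence; the function name, the label spellings, and the confidence rule are hypothetical and not taken from the specification.

```python
import re


def decode_bioes(sentence, tags, confidences, sentence_id=0):
    """Turn word-level BIOES tags into entity records with character offsets."""
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]
    if not (len(spans) == len(tags) == len(confidences)):
        raise ValueError("tokens, tags and confidences must align")

    def record(first, last, label):
        start, end = spans[first][0], spans[last][1]
        return {"text": sentence[start:end], "start": start, "end": end,
                "sentence_id": sentence_id, "label": label,
                # Assumption: entity confidence = weakest token confidence.
                "confidence": min(confidences[first:last + 1])}

    entities, open_at, open_label = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                       # single-token entity
            entities.append(record(i, i, label))
            open_at = None
        elif prefix == "B":                     # entity begins
            open_at, open_label = i, label
        elif prefix == "I" and open_at is not None and label == open_label:
            continue                            # entity continues
        elif prefix == "E" and open_at is not None and label == open_label:
            entities.append(record(open_at, i, label))
            open_at = None
        else:                                   # "O" or an inconsistent tag
            open_at = None
    return entities


sentence = ("A silicon dioxide thin film is grown on a silicon substrate "
            "by chemical vapor deposition")
tags = ["O", "B-component", "I-component", "I-component", "E-component",
        "O", "O", "O", "O", "B-material", "E-material",
        "O", "B-method_step", "I-method_step", "E-method_step"]
confidences = [0.99] * 12 + [0.97, 0.95, 0.98]
entities = decode_bioes(sentence, tags, confidences, sentence_id=15)
```

Each record carries the same fields as the entity metadata described above, so it can be stored alongside the publication number without further conversion.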
[0027] S22. A dictionary- and rule-based word segmentation tool is used to process the preprocessed patent data in parallel with the extraction process of the named entity recognition model to obtain the word segmentation results and record their position information. Specifically, in parallel with NER extraction, Jieba segmentation is performed on each document, retaining the start and end positions of each segmented word and its sentence number. For special items such as chemical formulas, parametric expressions, numerical values with units, and English abbreviations, regular-expression rules or custom dictionaries are used to protect them during the segmentation stage and avoid erroneous segmentation. The segmentation results and NER results are both saved for subsequent dual-source statistics and fusion. This step, together with step S21, constitutes a parallel dual-channel processing architecture. The role of the general-purpose segmentation tool is to provide complete, fine-grained lexical coverage, ensuring that every word appearing in the text is included in the statistical analysis. It is an important supplement for emerging terms, uncommon expressions, or unlabeled entities that the NER model may not cover. The protection mechanism is crucial; for example, it treats expressions like "H2O" or "50MPa" as complete lexical units and prevents them from being segmented into "H", "2", "O" or "50", "MPa", which would lose their scientific or engineering meaning. The recorded word segmentation location information uses the same format as the NER entity location information, and together they form a "lexical map" of the document. This allows both semantic entities and statistical words to be located precisely at specific sentences and character ranges in the original text. This parallel dual-source data acquisition strategy ensures the comprehensiveness and robustness of subsequent feature extraction.
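The protective processing can be sketched with regular expressions alone. In this illustration a plain `\w+` splitter stands in for the dictionary-based segmenter so that the example stays self-contained; the unit list, the formula pattern, and the function names are assumptions, not the rules used in the embodiment.

```python
import re

# Hypothetical protection rules: values with units, then simple formulas.
PROTECTED_RE = re.compile(
    r"\d+(?:\.\d+)?\s?(?:MPa|kPa|GPa|nm|mm|%)"   # number followed by a unit
    r"|\b(?:[A-Z][a-z]?\d*){2,}\b"                # formula such as H2O or SiO2
)
WORD_RE = re.compile(r"\w+")  # stand-in for the real word segmenter


def _fallback(text, lo, hi):
    """Segment an unprotected stretch and keep offsets relative to the full text."""
    return [{"text": m.group(), "start": lo + m.start(), "end": lo + m.end(),
             "protected": False}
            for m in WORD_RE.finditer(text[lo:hi])]


def segment(text):
    """Emit protected spans as single tokens and segment everything in between."""
    tokens, pos = [], 0
    for m in PROTECTED_RE.finditer(text):
        tokens.extend(_fallback(text, pos, m.start()))
        tokens.append({"text": m.group(), "start": m.start(), "end": m.end(),
                       "protected": True})
        pos = m.end()
    tokens.extend(_fallback(text, pos, len(text)))
    return tokens


sample = "H2O is heated under 50MPa pressure"
tokens = segment(sample)
```

Note that the formula pattern also captures all-capital abbreviations, which matches the stated goal of protecting English abbreviations but would need tightening for mixed text.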
[0028] S23. Based on the technical entity and the word segmentation result, construct the global inverse document frequency and calculate the TF-IDF value of each term; Based on the word segmentation results and NER results obtained in S21 and S22, the IDF of the NER dictionary and the IDF of Jieba word segmentation are constructed respectively. The term frequency is calculated within each document and logarithmically processed to obtain the TF value. Then, the TF-IDF is calculated as follows: First, the document frequency of each unique NER entity and each unique segmentation term is statistically analyzed across the entire database, and their respective inverse document frequencies are calculated. The IDF value reflects the prevalence or uniqueness of a word within the entire patent database; the more documents a word appears in, the lower its IDF value and the weaker its discriminative ability. Next, for each individual patent document, the term frequency of each term within that document is calculated, and the original term frequency is logarithmically transformed to smooth out the influence of extreme values, yielding the term frequency value. Finally, the term frequency value is multiplied by the global IDF value to obtain the TF-IDF value of that word in that document. This calculation assigns each word a dual statistical weight based on its frequency of occurrence in the current document and its distribution breadth across the entire database. By establishing independent IDF statistics and TF-IDF calculation processes for NER entities and segmentation results respectively, the system can fairly assess the statistical importance of words from different sources and at different granularities, providing objective and quantifiable data input for subsequent fusion steps.
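The calculation in S23 can be sketched as below. The specification states only that term frequency is logarithmically transformed and multiplied by a global IDF; the concrete forms used here, `1 + ln(tf)` and a smoothed `ln((1 + N) / (1 + df)) + 1`, are assumed variants, and the function names are hypothetical. The same two functions would be run once over the NER entities and once over the segmentation terms to keep the two IDF tables independent.

```python
import math
from collections import Counter


def build_idf(docs):
    """docs maps a document id to its list of terms; returns term -> IDF."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))          # count each term once per document
    # Assumed smoothing: more documents containing a term -> lower IDF.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf(terms, idf):
    """Log-scaled term frequency times the global IDF for one document."""
    counts = Counter(terms)
    return {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}


# Toy NER-side corpus; a second table would be built for segmentation terms.
ner_docs = {
    "d1": ["silicon substrate", "silicon substrate", "graphene"],
    "d2": ["silicon substrate"],
}
ner_idf = build_idf(ner_docs)
d1_weights = tfidf(ner_docs["d1"], ner_idf)
```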
[0029] S24. For a technical entity that can be semantically divided into multiple word segments, the TF-IDF value of the technical entity and the average TF-IDF value of its corresponding multiple word segments are weighted according to a preset first weight ratio. If a NER term can be semantically decomposed into several Jieba segmentation terms, the statistical weight of the NER term is synthesized according to a specified combination strategy: the TF-IDF of the NER term itself is linearly weighted with the average TF-IDF of the corresponding Jieba segmentation terms at a ratio of 0.6 / 0.4 to obtain the statistical weight of the NER term in the current document. Intra-document normalization is performed on the statistical weight of each document, for example, using min-max normalization, to ensure consistent scaling with the semantic weights later.
[0030] This step is the core mechanism for fusing the statistical features. Its key lies in identifying and processing the compositional relationship between NER entities and their constituent segmentation terms. For example, the NER entity "high-strength aluminum alloy" might be composed of the segmentation terms "high strength," "aluminum," and "alloy." The fusion strategy is not a simple selection but a weighted synthesis. Giving the NER entity's own TF-IDF the higher weight gives priority to the complete semantic concepts identified by the model, which helps maintain the integrity of core terms. At the same time, incorporating the average TF-IDF value of its constituent segments as a supplement absorbs more granular vocabulary statistics, enhancing the stability of the weights and their adaptability to contextual changes. The preset weight ratio of 0.6 to 0.4 has been optimized in practice to balance the contribution of terminology integrity against the contribution of component statistics. For NER entities that cannot be split, their own TF-IDF value is used directly. The final document-level normalization maps all preliminary statistical weights to a uniform numerical range, eliminating absolute value biases caused by differences in length and vocabulary density among documents. This prepares for a fair and effective fusion with the semantic weights from the large language model in the next step. Figure 3 is a schematic diagram illustrating dual-source statistics and fusion according to an embodiment of the present invention.
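The 0.6 / 0.4 synthesis and the intra-document min-max normalization can be written out as a short sketch. The function names, the `parts` mapping from an entity to its segments, and the toy TF-IDF values are illustrative assumptions.

```python
def fuse_statistical(entity_tfidf, segment_tfidf, parts, w_entity=0.6):
    """Blend each entity's TF-IDF with the mean TF-IDF of its segments."""
    fused = {}
    for entity, score in entity_tfidf.items():
        seg_scores = [segment_tfidf[p] for p in parts.get(entity, [])
                      if p in segment_tfidf]
        if seg_scores:
            mean_seg = sum(seg_scores) / len(seg_scores)
            fused[entity] = w_entity * score + (1.0 - w_entity) * mean_seg
        else:
            fused[entity] = score  # entity that cannot be split: use its own value
    return fused


def min_max(weights):
    """Normalize the weights of one document to the range [0, 1]."""
    if not weights:
        return {}
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        return {k: 1.0 for k in weights}
    return {k: (v - lo) / (hi - lo) for k, v in weights.items()}


# Toy values for one document (not taken from the specification).
entity_tfidf = {"high-strength aluminum alloy": 2.0, "graphene": 1.0}
segment_tfidf = {"high strength": 1.0, "aluminum": 0.5, "alloy": 1.5}
parts = {"high-strength aluminum alloy": ["high strength", "aluminum", "alloy"]}

fused = fuse_statistical(entity_tfidf, segment_tfidf, parts)
preliminary = min_max(fused)
```

With these toy values the fused weight of the compound entity is 0.6 × 2.0 + 0.4 × 1.0 = 1.6, while the unsplittable entity keeps its own value of 1.0.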
[0031] S3. Use a large language model to evaluate the semantic importance of the entity keywords, generate semantic weights, and merge the semantic weights with the preliminary statistical weights according to a preset ratio to generate the final weight of each keyword. S3 specifically includes: S31. Map the original semantic scores of the large language model to the numerical range of the preliminary statistical weights of the current document to generate semantic weights consistent with the scale of the preliminary statistical weights. Specifically, the document text, its NER vocabulary, and the corresponding statistical weights are used as contextual input to the LLM, which then scores the topic relevance and importance of each keyword. Since the raw LLM scores may not share the numerical scale of the statistical weights, the relative LLM scores are normalized and mapped to the numerical range of the document's statistical weights to obtain the semantic weights. This mapping keeps the semantic weights and statistical weights within the same numerical range, avoiding bias caused by scale differences during fusion. The implementation uses batch processing and caching for LLM calls to balance cost and latency, and provides a fallback to a small local reranker to improve robustness.
[0032] S32. The semantic weight and the preliminary statistical weight are linearly fused according to the second preset weight ratio, wherein the proportion of the semantic weight is higher than the proportion of the preliminary statistical weight.
[0033] Semantic and statistical weights are linearly fused, with semantic weights accounting for 0.6 and statistical weights for 0.4, to obtain the final weight for each keyword in each patent. After fusion, necessary normalization is performed again within the document for use in topword ranking and subsequent weight calculation. This weight reflects both the statistical importance of the word in the current document and its semantic relevance to the document's topic and solution.
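The scale alignment of S31 and the linear fusion of S32 can be sketched as follows. The raw scores are assumed to have already been returned by the large language model; the linear rescaling, the midpoint fallback for constant scores, the fallback to the statistical weight for keywords the model did not score, and all names are assumptions for illustration.

```python
def map_to_range(llm_scores, stat_weights):
    """Rescale raw LLM scores linearly into the range of the statistical weights."""
    lo_w, hi_w = min(stat_weights.values()), max(stat_weights.values())
    lo_s, hi_s = min(llm_scores.values()), max(llm_scores.values())
    if hi_s == lo_s:  # constant scores: place them at the midpoint (assumption)
        return {k: (lo_w + hi_w) / 2.0 for k in llm_scores}
    scale = (hi_w - lo_w) / (hi_s - lo_s)
    return {k: lo_w + (v - lo_s) * scale for k, v in llm_scores.items()}


def fuse_final(semantic, statistical, w_semantic=0.6):
    """Linear fusion in which the semantic weight has the larger share."""
    final = {}
    for kw, stat in statistical.items():
        sem = semantic.get(kw, stat)  # unscored keyword: reuse its statistical weight
        final[kw] = w_semantic * sem + (1.0 - w_semantic) * stat
    return final


# Toy values: LLM scores on a 1-10 scale, statistical weights already in [0, 1].
llm_scores = {"graphene nanoribbon": 9, "substrate": 5, "device": 1}
stat_weights = {"graphene nanoribbon": 0.2, "substrate": 1.0, "device": 0.0}

semantic_weights = map_to_range(llm_scores, stat_weights)
final_weights = fuse_final(semantic_weights, stat_weights)
```

In this toy case "graphene nanoribbon" rises from a statistical weight of 0.2 to a final weight of 0.68 because the model scores it highest, which is the intended effect of giving the semantic weight the 0.6 share.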
[0034] S4. Based on the final weight, construct a three-level thesaurus for each document. The three-level thesaurus includes a topic thesaurus for initial screening, a phrase thesaurus for partitioned retrieval, and a keyword thesaurus for fine-grained ranking. Specifically, S4 includes: S41. Select the top N keywords with the highest final weights from the document to form the topic thesaurus. The first vocabulary constructed is the topic vocabulary (topwords). For example, the 20 words with the highest final weights are taken and concatenated into a string in a fixed order, which serves as the main semantic representation of the patent and is used for subsequent clustering and IPC partition determination. S42. Extract N-gram phrases from the document text based on a sliding window, and filter them according to the final weight to form the phrase vocabulary. The second vocabulary is the ngramwords vocabulary, which is generated by a sliding window over the NER word sequence and word segmentation results in the text. The window length k and overlap length j can be configured during implementation. Within each window, words are sorted according to their final weights and the highest-priority items are retained in order, building a set of phrases for the partitioned index.
[0035] S43. Compile all the deduplicated keywords in the document and their metadata into the keyword table.
[0036] Specifically, the third-level thesaurus is the keyword list, which is a complete set of deduplicated keywords, including word position, tags, and final weights. It is the main input for the fine-grained ranking stage. The above three levels of thesaurus simultaneously record the original word text, normalized text, position offset, and calculated weight values.
[0037] Furthermore, density clustering (such as DBSCAN) is performed on the keywords of each patent in the vector space, using the same vectorization model, to identify synonym or near-synonym clusters and noise terms. The purpose of clustering is to apply an aggregation strategy to multiple hits within the same cluster during the fine-ranking stage, so that similarity statistics are not inflated merely because a patent contains many variants of the same term. The clustering parameters can be optimized on the validation set according to the density of the embedding space, and the clustering results are recorded in the keyword metadata for use during fine ranking.
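To make the clustering step concrete, the following is a small pure-Python DBSCAN over cosine distance. It is a teaching sketch only: a real system would more likely call a library implementation such as scikit-learn's `DBSCAN`. The toy two-dimensional vectors, the `eps` and `min_samples` values, and the assumption that no vector is all zeros are illustrative choices, not values from the description.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(vectors, eps=0.1, min_samples=2):
    """Return one cluster label per vector; -1 marks noise terms."""
    n = len(vectors)
    # Each point counts as its own neighbor, as in common DBSCAN implementations.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Toy embeddings for: battery, cell, anode, negative electrode, figure
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [-1.0, -1.0]]
labels = dbscan(vectors)
```

The first two and the next two vectors form near-synonym clusters, and the last vector is left as a noise term; these labels are what would be stored in the keyword metadata.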
[0038] S5. Construct a corresponding hierarchical index based on the three-level thesaurus. S5 specifically includes:
S51. Construct a first vector index for global initial screening based on the topic thesaurus. After the topwords of the full database are vectorized, a Faiss index is created to support rapid topic identification and initial screening.
S52. Based on the phrase thesaurus, construct a second vector index for efficient retrieval within each technical topic partition. Within each IPC subclass partition, ngramwords are vectorized and a Faiss IVFPQ index is created for that partition to support efficient partitioned retrieval. The IVFPQ nlist parameter can be adaptively configured to match the partition size.
S53. Construct a third vector index based on the keyword table for refined matching and ranking. Keywords are vectorized and normalized before being inserted into Milvus (inner product retrieval on normalized vectors corresponds to cosine similarity), and a mapping table from publication number to Milvus primary key is constructed at the same time.
[0039] Since a patent typically contains multiple keywords, the mapping table assigns a contiguous ID space to each patent, with each keyword corresponding to a unique ID within that range. This helps to limit searches by publication number or filter results using metadata. Both the index and the mapping table require versioning and regular backups to support traceability.
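A minimal sketch of such a mapping table is shown below. It assumes half-open ID ranges `[start, end)` starting at zero and uses placeholder publication numbers; the function names are hypothetical and no Milvus call is made here.

```python
def build_id_ranges(keywords_by_publication):
    """Assign each patent a contiguous [start, end) block of primary keys."""
    ranges = {}
    next_id = 0
    for publication_number, keywords in keywords_by_publication.items():
        ranges[publication_number] = (next_id, next_id + len(keywords))
        next_id += len(keywords)
    return ranges


def candidate_ids(ranges, publication_numbers):
    """Expand a list of publication numbers into the primary keys to search."""
    ids = []
    for number in publication_numbers:
        start, end = ranges[number]
        ids.extend(range(start, end))
    return ids


# Placeholder publication numbers and keywords, for illustration only.
keywords_by_publication = {
    "PUB-A": ["electrolyte", "anode", "separator"],
    "PUB-B": ["encoder", "decoder"],
    "PUB-C": ["gearbox"],
}
ranges = build_id_ranges(keywords_by_publication)
ids = candidate_ids(ranges, ["PUB-A", "PUB-C"])
```

Because each patent owns one contiguous block, restricting a search to a set of candidate patents reduces to a handful of range conditions on the primary key.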
[0040] S6. Receive a query request, perform retrieval based on the hierarchical index, and aggregate the matching contributions of semantically similar keywords during retrieval. S6 specifically includes:
S61. Perform a preliminary search based on the first vector index, and determine one or more technical topic partitions based on the preliminary search result set. In the first stage, the topwords vector of the query is searched against the full-database topwords index, and the 500 most similar patents are selected. The IPC subclass distribution of these 500 patents is then counted and the 30 most frequent subclasses are selected, which completes content-based IPC partitioning. In theory, this step is more robust than a simple NLP classifier.
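The partition selection in S61 amounts to a frequency count over the initial hits. The sketch below assumes one IPC subclass per hit and breaks ties alphabetically, neither of which is specified in the description; `select_partitions` and the sample subclasses are hypothetical.

```python
from collections import Counter


def select_partitions(hit_subclasses, top_partitions=30):
    """Pick the most frequent IPC subclasses among the initial-screening hits."""
    counts = Counter(hit_subclasses)
    # Assumption: ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [subclass for subclass, _ in ranked[:top_partitions]]


# IPC subclasses of six hypothetical hits (the method uses the top 500 hits).
hits = ["H01M", "H01M", "G06F", "H01M", "C08L", "G06F"]
partitions = select_partitions(hits, top_partitions=2)
```

In the described pipeline the call would use the 500 hits and `top_partitions=30`.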
[0041] S62. Within the determined technical topic partitions, perform partition retrieval based on the corresponding second vector indexes to obtain a candidate patent set. In the second stage, the ngramwords vector of the query is sent in parallel to the Faiss IVFPQ index of each of the 30 selected partitions. Each partition returns candidates, which are merged according to similarity and weight logic. The merging strategy can either accumulate the scores of the same publication number or take their maximum value, with deduplication control added. After merging, the top 1000 candidates are selected.
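The two merging strategies named above can be sketched as follows. The sketch works on plain `(publication number, similarity)` pairs and leaves out the unspecified "weight logic"; the function name, the alphabetical tie-breaking, and the sample scores are assumptions.

```python
def merge_candidates(partition_results, strategy="sum", limit=1000):
    """Merge per-partition hits into one deduplicated, ranked candidate list."""
    if strategy not in ("sum", "max"):
        raise ValueError("strategy must be 'sum' or 'max'")
    merged = {}
    for hits in partition_results:
        for publication_number, score in hits:
            if publication_number not in merged:
                merged[publication_number] = score
            elif strategy == "sum":
                merged[publication_number] += score
            else:
                merged[publication_number] = max(merged[publication_number], score)
    # Keying by publication number deduplicates; ties are broken alphabetically.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hits returned by two hypothetical partitions.
partition_results = [
    [("P1", 0.5), ("P2", 0.25)],
    [("P1", 0.25), ("P3", 0.5)],
]
by_sum = merge_candidates(partition_results, strategy="sum")
by_max = merge_candidates(partition_results, strategy="max", limit=2)
```

Accumulation rewards patents that appear in several partitions, while the maximum treats a single strong hit as sufficient; which one suits a given corpus is a tuning decision.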
[0042] S63. Within the candidate patent set, perform retrieval and ranking based on the third vector index.
[0043] The third stage maps the publication numbers of the top 1000 patents to their ID ranges in Milvus and restricts the Milvus search to those ranges. A batch search is performed for each keyword in the query to obtain the similarity between the keyword and the candidate IDs. During fine-grained ranking, for the multiple keywords of the same patent, a "maximum value" strategy is first applied to the contributions within the same cluster, based on the keyword clustering results; the contributions of different clusters are then summed to obtain the keyword score for that patent. During the search, a similarity threshold (empirical value approximately 0.8) is set to mark high-confidence strong matches, while the continuous similarity values are retained in the score statistics for ranking. Finally, when the scores from the three stages are merged, an interpretable weighting strategy can be used, either a specified one or one commonly used in engineering, such as giving fine-grained ranking a higher weight or keeping linearly adjustable fusion parameters to be optimized on the validation set.
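The cluster-aware aggregation can be written in a few lines. The sketch below assumes that each match arrives as a `(cluster id, similarity)` pair for one patent and that noise keywords (cluster id -1) are each treated as their own cluster; the description does not say how noise terms are aggregated, so that part is an assumption, as are the function name and sample values.

```python
def keyword_score(matches, strong_threshold=0.8):
    """Max within each cluster, sum across clusters, and count strong matches."""
    best = {}
    for index, (cluster_id, similarity) in enumerate(matches):
        # Assumption: every noise keyword (-1) counts as a separate cluster.
        key = cluster_id if cluster_id != -1 else ("noise", index)
        if key not in best or similarity > best[key]:
            best[key] = similarity
    score = sum(best.values())
    strong_matches = sum(1 for value in best.values() if value >= strong_threshold)
    return score, strong_matches


# Matches of one candidate patent: two hits in cluster 0, one in cluster 1,
# and two unclustered (noise) keywords.
matches = [(0, 0.5), (0, 0.75), (1, 0.25), (-1, 0.875), (-1, 0.125)]
score, strong_matches = keyword_score(matches)
```

The two hits in cluster 0 contribute only their maximum (0.75), so a patent that repeats many variants of one term does not outrank a patent that matches several distinct concepts.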
[0044] S7. Output the sorted search results. For the top-ranking candidate patents, trace back to the original sentences based on the position information of the keywords within the patent and calculate the keyword density. The text window with the highest density is selected as the highlighted segment. Window scoring is based on keyword weight and similarity, and several non-overlapping segments can be selected for evidence presentation. The front-end display should retain the confidence level, similarity, original sentence number, and character offset of each matched word to facilitate manual verification and judgment by examiners.
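One possible way to pick the highlighted segments is sketched below. It assumes windows are measured in whole sentences, that a match contributes the product of its keyword weight and similarity, and that windows are chosen greedily; the description only states that window scoring depends on weight and similarity, so these are illustrative choices with hypothetical names and values.

```python
def best_windows(matches, sentence_count, size=2, top=2):
    """Return up to `top` non-overlapping (start_sentence, score) windows."""
    per_sentence = [0.0] * sentence_count
    for sentence_index, weight, similarity in matches:
        # Assumption: a match contributes weight * similarity to its sentence.
        per_sentence[sentence_index] += weight * similarity
    scored = [
        (sum(per_sentence[start:start + size]), start)
        for start in range(max(sentence_count - size + 1, 1))
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    chosen = []
    for score, start in scored:
        if score <= 0.0:
            break  # remaining windows contain no matched keyword
        if all(abs(start - other) >= size for other, _ in chosen):
            chosen.append((start, score))
        if len(chosen) == top:
            break
    return chosen


# (sentence index, keyword weight, similarity) for four matched keywords.
matches = [(0, 1.0, 0.5), (1, 0.5, 0.5), (4, 1.0, 1.0), (5, 0.5, 0.5)]
windows = best_windows(matches, sentence_count=6)
```

Each returned start index can then be mapped back to sentence numbers and character offsets for display to the examiner.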
[0045] By employing the embodiments of the present invention, the following beneficial effects are achieved: The invention runs named entity recognition and statistical word segmentation in parallel and combines them with semantic evaluation by a large language model to perform multi-source weight fusion on keywords. Weight calculation therefore considers both the objectivity of word frequency statistics and semantic importance, so that the core technical content of patent documents is identified and represented more accurately and the retrieval accuracy for complex patent texts is improved. By constructing a three-level thesaurus with corresponding hierarchical indexes and adopting a phased retrieval process of initial screening, partitioned retrieval, and fine ranking, the global retrieval task is decomposed into multiple efficient local retrievals. This significantly reduces the computational scope of each retrieval, lowers the time spent on full-text fine matching in a large-scale patent database, and improves system throughput and scalability. The system retains complete metadata, including keyword statistical weights, semantic weights, position information, and clustering relationships. When outputting the ranking results, it can trace back to and highlight the relevant sentences and segments in the original text based on the positions of the matched keywords, providing a clear chain of evidence for manual review and making the retrieval process and results more transparent and credible. In the fine-ranking stage, semantically similar keywords are clustered and multiple matches within the same cluster are aggregated, which prevents repeated counting of synonyms and near-synonyms from unduly inflating the similarity score and makes the final ranking more reasonable and stable.
[0046] Device Example 1 According to an embodiment of the present invention, an electronic device is provided, comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to perform the steps of the multi-level text retrieval method described above.
[0047] Device Example 2 According to an embodiment of the present invention, a storage medium is provided for storing computer-executable instructions, which, when executed, implement the steps of the above-described multi-level text retrieval method.
[0048] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A multi-level text retrieval method, characterized in that it includes: S1. Preprocess and structure the original patent data; S2. Process the preprocessed patent data in parallel using a named entity recognition model and a word segmentation tool to obtain entity keywords and statistical word segmentation results, and fuse statistical weights based on the entity keywords and word segmentation results to obtain preliminary statistical weights; S3. Use a large language model to evaluate the semantic importance of the entity keywords, generate semantic weights, and merge the semantic weights with the preliminary statistical weights according to a preset ratio to generate the final weight of each keyword; S4. Based on the final weight, construct a three-level thesaurus for each document, the three-level thesaurus including a topic thesaurus for initial screening, a phrase thesaurus for partitioned retrieval, and a keyword thesaurus for fine ranking; S5. Construct a corresponding hierarchical index based on the three-level thesaurus; S6. Receive a query request, perform a search based on the hierarchical index, and aggregate the matching contributions of semantically similar keywords during the search process; S7. Output the sorted search results.
2. The method according to claim 1, characterized in that, S2 specifically includes: The preprocessed patent data is processed using a named entity recognition model to extract technical entities and record their location information. A dictionary- and rule-based word segmentation tool is used to process the preprocessed patent data in parallel with the extraction process of the named entity recognition model to obtain the word segmentation results and record their location information. Based on the technical entities and the word segmentation results, global inverse document frequencies are constructed and the TF-IDF values of each term are calculated. For a technical entity that is semantically divided into multiple word segments, the TF-IDF value of the technical entity and the average TF-IDF value of its corresponding multiple word segments are weighted and summed according to a preset first weight ratio to generate the preliminary statistical weight.
3. The method according to claim 1, characterized in that, S3 specifically includes: The original semantic scores of the large language model are mapped to the numerical range of the preliminary statistical weights of the current document to generate semantic weights consistent with the scale of the preliminary statistical weights. The semantic weights and preliminary statistical weights are linearly fused according to a second preset weight ratio, wherein the proportion of the semantic weights is higher than the proportion of the preliminary statistical weights.
4. The method according to claim 1, characterized in that, S4 specifically includes: The topic thesaurus is composed of the top N keywords with the highest final weights selected from the document. N-gram phrases are extracted from the document text using a sliding window, and the phrase thesaurus is formed by filtering them according to the final weight. The keyword table is composed of all the deduplicated keywords in the document and their metadata.
5. The method according to claim 1, characterized in that, The hierarchical index constructed in S5 based on the three-level thesaurus specifically includes: A first vector index for global initial screening is constructed based on the topic thesaurus; A second vector index is constructed based on the phrase thesaurus for efficient retrieval within each different technical topic partition; A third vector index is constructed based on the keyword table for refined matching and sorting.
6. The method according to claim 5, characterized in that, The retrieval based on hierarchical indexing in S6 specifically includes: A preliminary search is performed based on the first vector index, and one or more technical topic partitions are determined based on the preliminary search result set; Within the determined technology topic partition, a partition retrieval is performed based on the corresponding second vector index to obtain a candidate patent set; Within the candidate patent set, retrieval and ranking are performed based on the third vector index.
7. The method according to claim 6, characterized in that, The retrieval and ranking based on the third vector index specifically includes: Cluster keywords with similar semantics; The maximum value is taken for the contribution of multiple keyword matches within the same cluster group; The matching contributions of different cluster groups are summed to obtain the final similarity score.
8. The method according to claim 1, characterized in that, The process after S7 also includes: based on the location information of the matching keywords in the document, tracing back to the corresponding sentence or paragraph in the original text and highlighting it.
9. An electronic device, comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to perform the steps of the multi-level text retrieval method according to any one of claims 1-8.
10. A storage medium for storing computer-executable instructions, which, when executed, implement the steps of the multi-level text retrieval method according to any one of claims 1-8.