A two-stage hybrid retrieval method and system based on keyword guidance and semantic ranking
By combining keyword guidance and semantic ranking in a two-stage hybrid retrieval method, together with intelligent context expansion and dynamic weight fusion, the invention addresses the challenges of accuracy and semantic understanding in professional document retrieval, improving retrieval precision and efficiency and enhancing user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA MCC20 GRP CORP LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies struggle to simultaneously achieve accurate matching and semantic understanding of technical terms in document retrieval within specialized fields, resulting in low recall rates, high resource consumption, and poor user experience.
A two-stage hybrid retrieval method based on keyword guidance and semantic fine ranking is adopted. By combining keyword coarse screening and semantic fine ranking, and introducing intelligent context expansion and dynamic weight fusion mechanism, the retrieval accuracy and recall rate are improved while reducing resource consumption.
It achieves a balance between precise matching of technical terms and semantic understanding, improving retrieval accuracy and recall, enhancing retrieval efficiency and user experience, and reducing resource consumption.
Figure CN122309702A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer information retrieval technology, and in particular to a two-stage hybrid retrieval method and system based on keyword guidance and semantic ranking.
Background Technology
[0002] In document retrieval for professional fields such as engineering, law, and academia, user queries typically mix precise technical terms (such as "C30 concrete compressive strength") with vague semantic descriptions (such as "construction precautions"). Existing mainstream solutions fall into two categories. The first is keyword retrieval based on inverted indexes. It matches technical terms accurately and quickly, but it cannot understand semantics and has low recall for synonyms, long-tail keywords, and contextual descriptions. Long-tail keywords are non-core target keywords that still generate search traffic; they are typically composed of three or more words and are closer to natural language. The second is semantic vector retrieval based on deep learning. It understands query intent and recalls related concepts, but it matches exact technical terms poorly, and computing similarity across the entire database is slow and resource-intensive.
[0003] The simple "keyword + semantic" weighted fusion method currently in use attempts to combine the advantages of both, but it has inherent drawbacks: 1. Static weight fusion cannot adapt to different query types (terminology vs. descriptive); 2. It lacks context awareness and handles structured information such as tables and formulas within documents poorly; 3. The results are homogenized, potentially returning a large number of repetitive fragments. These shortcomings mean that existing hybrid retrieval systems remain unsatisfactory in terms of professionalism, accuracy, and user experience.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to provide a two-stage hybrid retrieval method and system based on keyword guidance and semantic fine ranking. The method and system overcome the shortcomings of traditional retrieval that relies on keywords alone or on semantics alone. By coordinating keyword-guided coarse screening with semantic-aware fine ranking, and by introducing intelligent context expansion and a dynamic weight fusion mechanism, they improve retrieval accuracy and recall while keeping retrieval efficiency high and resource consumption low.
[0005] To address the aforementioned technical problems, the present invention provides a two-stage hybrid retrieval method based on keyword guidance and semantic ranking, comprising the following steps:
Step 1: Query analysis and keyword guidance extraction. Receive the user query text and extract core keywords using a multi-level strategy, including: extraction based on statistical features, term matching based on a pre-built domain dictionary, and expansion based on a thesaurus. Assess the importance of the extracted keywords and select the Top-N keywords to form a query guidance set.
Step 2: Keyword coarse screening based on an inverted index. Using the constructed inverted index, perform rapid matching and retrieval in the document library with the query guidance set, supporting AND and OR logical operations, to initially screen out a batch of candidate document fragments containing the keywords.
Step 3: Intelligent context expansion of candidate document fragments. For each candidate document fragment in the coarse screening result set, analyze the structure of the document to which it belongs and perform adaptive context expansion accordingly.
Step 4: Semantic vectorization and similarity calculation. Using a pre-trained semantic model, encode the user query text and each expanded text fragment into high-dimensional vectors. Calculate the cosine similarity between the user query vector and each expanded fragment vector to obtain the semantic similarity score.
Step 5: Dynamic weight fusion and fine ranking. Obtain the keyword matching score for each candidate document fragment in the coarse screening result set, dynamically analyze the query features, and calculate the fusion weight coefficient α (0 < α < 1) in real time based on the analysis results. Then, weighted fusion score = α × keyword matching score + (1 - α) × semantic similarity score. For terminology queries, α is increased to emphasize keyword matching; for descriptive queries, α is decreased to emphasize semantic understanding. The coarse screening result set is then re-ranked by the weighted fusion score.
Step 6: Result optimization and output. The top-K candidate document fragments after fine ranking are optimized using the maximal marginal relevance (MMR) algorithm. While preserving relevance, the algorithm improves the diversity of the query results and avoids content redundancy, and the method outputs a final list of search results that is deduplicated, diverse, and sorted by descending relevance.
[0006] Furthermore, the statistical feature extraction in step one is performed using TF-IDF and TextRank algorithms.
[0007] Furthermore, in step three, the document structure includes paragraphs, tables, formulas, and list items. The adaptive context expansion includes expanding table fragments by extending their titles, headers, and the first few rows of data; and expanding formula fragments by extending their numbers, descriptions, and variable declarations.
[0008] Furthermore, the pre-trained semantic model in step four is the BGE-large-zh Chinese text embedding model.
[0009] Furthermore, the query features analyzed dynamically in step five include the proportion of professional terms, whether the query contains numbers, and whether the query is an open-ended question.
[0010] A two-stage hybrid retrieval system based on keyword guidance and semantic ranking, implementing the above method, includes a keyword extraction module, an inverted index and coarse screening module, a context intelligent expansion module, a semantic ranking module, and a result optimization module. The keyword extraction module is configured to execute step one. The inverted index and coarse screening module includes an inverted index data structure and is configured to execute step two. The context intelligent expansion module is configured to identify the type of each candidate document fragment and to execute the adaptive expansion strategy of step three. The semantic ranking module includes a semantic encoding model and a dynamic weight fusion unit, and is configured to execute steps four and five. The result optimization module is configured to perform the diversity optimization and deduplication of step six and to output the final search result list. These modules are connected in sequence and work together to complete the hybrid retrieval process from query input to result output.
With the above technical solution, the method performs query analysis and keyword extraction on the received user query text to form a query guidance set; uses inverted-index-based keyword coarse screening to obtain a coarse screening result set of candidate document fragments; performs adaptive context expansion on the candidate document fragments; performs semantic vectorization and similarity calculation to obtain semantic similarity scores; re-ranks the coarse screening result set using dynamic weight fusion; and optimizes the ranked candidate document fragments to output a final retrieval result list that is deduplicated, diverse, and sorted by descending relevance. The system includes a keyword extraction module, an inverted index and coarse screening module, a context intelligent expansion module, a semantic ranking module, and a result optimization module, which are connected in sequence and work together to complete the hybrid retrieval process from query input to result output. The method and system overcome the shortcomings of traditional retrieval that relies on keywords alone or on semantics alone. By combining keyword-guided coarse screening with semantic-aware fine ranking, and by introducing intelligent context expansion and a dynamic weight fusion mechanism, they improve retrieval accuracy and recall while keeping retrieval efficiency high and resource consumption low.
Attached Figure Description
[0011] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments: Figure 1 is a flowchart of the method; Figure 2 is a schematic diagram of the system structure.
Detailed Implementation
[0012] As shown in Figure 1, the two-stage hybrid retrieval method based on keyword guidance and semantic ranking of the present invention includes the following steps:
Step 1: Query analysis and keyword guidance extraction. Receive the user query text and extract core keywords using a multi-level strategy, including: extraction based on statistical features, terminology matching based on a pre-built domain dictionary, and expansion based on a thesaurus. Assess the importance of the extracted keywords and select the Top-N keywords to form a query guidance set. The purpose of this step is to transform user intent into terminology clues that can be matched precisely.
Step 2: Keyword coarse screening based on an inverted index. Using the constructed inverted index, perform rapid matching and retrieval in the document database with the query guidance set, supporting AND and OR logical operations, to initially screen out a batch of candidate document fragments containing the keywords. The core role of this stage is to use the speed and accuracy of keyword matching to narrow the search scope quickly and keep overall retrieval efficient.
Step 3: Intelligent context expansion of candidate document fragments. For each candidate document fragment in the coarse screening result set, analyze the structure of the document to which it belongs and perform adaptive context expansion accordingly. The purpose of this step is to provide richer and more accurate textual context for the subsequent semantic understanding and to compensate for incomplete information in the original fragment.
Step 4: Semantic vectorization and similarity calculation. Using a pre-trained semantic model, encode the user query text and each expanded text fragment into high-dimensional vectors. Calculate the cosine similarity between the user query vector and each expanded fragment vector to obtain the semantic similarity score.
Step 5: Dynamic weight fusion and fine ranking. Obtain the keyword matching score for each candidate document fragment in the coarse screening result set, dynamically analyze the query features, and calculate the fusion weight coefficient α (0 < α < 1) in real time based on the analysis results. Then, weighted fusion score = α × keyword matching score + (1 - α) × semantic similarity score. For terminology queries, α is increased to emphasize keyword matching; for descriptive queries, α is decreased to emphasize semantic understanding. The coarse screening result set is then re-ranked by the weighted fusion score.
Step 6: Result optimization and output. The top-K candidate document fragments after fine ranking are optimized using the maximal marginal relevance (MMR) algorithm. While preserving relevance, the algorithm improves the diversity of the query results and avoids content redundancy, and the method outputs a final list of search results that is deduplicated, diverse, and sorted by descending relevance.
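The coarse screening of Step 2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names, the sample fragments, and the whitespace tokenizer are assumptions made for illustration, and Chinese text would need a word segmenter instead of `split()`.

```python
from collections import defaultdict


def build_inverted_index(fragments):
    """Map each token to the set of fragment ids that contain it."""
    index = defaultdict(set)
    for frag_id, text in fragments.items():
        # Assumption: whitespace tokenization, lower-cased.
        for token in text.lower().split():
            index[token].add(frag_id)
    return index


def coarse_screen(index, keywords, mode="OR"):
    """Return ids of fragments matching the keywords under AND or OR logic."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical sample fragments.
fragments = {
    "f1": "C30 concrete compressive strength test",
    "f2": "concrete pouring construction precautions",
    "f3": "steel rebar spacing table",
}
index = build_inverted_index(fragments)
and_hits = coarse_screen(index, ["concrete", "strength"], mode="AND")
or_hits = coarse_screen(index, ["concrete", "strength"], mode="OR")
```

Here the AND query keeps only the fragment that contains both keywords, while the OR query also admits the fragment that contains just one of them. Only the fragments returned by this stage go on to the more expensive semantic stage.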
[0013] Preferably, the statistical feature extraction in step one is performed using the TF-IDF and TextRank algorithms. TF-IDF and TextRank are commonly used unsupervised algorithms for keyword extraction from text. The former is based on word frequency and document frequency statistics, while the latter draws on the idea of webpage ranking to model word co-occurrence relationships.
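As a rough illustration of the TF-IDF half of this step, the following Python sketch ranks query tokens against a small reference corpus. The smoothed IDF formula, the function name `tfidf_keywords`, and the sample data are assumptions; the patent does not specify them, and TextRank is not shown.

```python
import math
from collections import Counter


def tfidf_keywords(query_tokens, corpus_tokens, top_n=3):
    """Rank query tokens by TF-IDF against a tokenized reference corpus."""
    n_docs = len(corpus_tokens)
    tf = Counter(query_tokens)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus_tokens if term in doc)
        # Assumption: smoothed IDF so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(query_tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Hypothetical tokenized corpus and query.
corpus = [
    ["concrete", "strength", "test"],
    ["concrete", "pouring", "precautions"],
    ["the", "rebar", "spacing"],
    ["the", "concrete", "mix"],
]
query = ["the", "c30", "concrete", "strength"]
top_keywords = tfidf_keywords(query, corpus, top_n=2)
```

In this toy corpus the rare tokens "c30" and "strength" outrank the common tokens "the" and "concrete", which is the behavior the Top-N selection relies on.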
[0014] Preferably, the document structure in step three includes paragraphs, tables, formulas, and list items. The adaptive context expansion includes expanding the table fragment by extending its title, header, and the first few rows of data; and expanding the formula fragment by extending its numbering, description, and variable description.
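The structure-dependent expansion can be sketched as follows. This is a hypothetical illustration: the `Fragment` class, its `meta` keys, and the `max_rows` parameter are assumptions, and paragraphs and list items are left unchanged here because the text does not say how they are expanded.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    kind: str   # "paragraph", "table", "formula", or "list_item"
    text: str
    meta: dict = field(default_factory=dict)  # hypothetical structural metadata


def expand_context(frag, max_rows=2):
    """Return the fragment text with structure-specific context prepended."""
    m = frag.meta
    if frag.kind == "table":
        # Table: title, header, and the first few data rows.
        parts = [m.get("title", ""), m.get("header", "")]
        parts += m.get("rows", [])[:max_rows]
    elif frag.kind == "formula":
        # Formula: number, description, and variable description.
        parts = [m.get("number", ""), m.get("description", ""), m.get("variables", "")]
    else:
        # Assumption: other fragment types are passed through unchanged.
        parts = []
    parts.append(frag.text)
    return "\n".join(p for p in parts if p)


table = Fragment(
    "table",
    "C30 | 30 MPa",
    {
        "title": "Table 3: Concrete strength grades",
        "header": "Grade | Strength",
        "rows": ["C20 | 20 MPa", "C25 | 25 MPa", "C30 | 30 MPa"],
    },
)
expanded = expand_context(table)
```

The matched row alone says little to a semantic model; with the title and header attached, the expanded text states what the numbers mean.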
[0015] Preferably, the pre-trained semantic model in step four is the BGE-large-zh Chinese text embedding model. The BGE-large-zh model improves text representation capabilities in Chinese environments by converting text into high-dimensional vectors, and can be applied to scenarios such as semantic search, question answering systems, cluster analysis, and recommendation engines.
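The similarity calculation applied to the encoded vectors can be sketched in plain Python. The three-dimensional vectors below are made-up stand-ins for real model embeddings, which have many more dimensions; loading the embedding model itself is outside the scope of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings for one query and two expanded fragments.
query_vec = [0.2, 0.8, 0.1]
fragment_vecs = {"f1": [0.1, 0.9, 0.0], "f2": [0.9, 0.1, 0.3]}
semantic_scores = {
    fid: cosine_similarity(query_vec, vec) for fid, vec in fragment_vecs.items()
}
```

Because the vectors are compared only for the coarsely screened fragments, the number of similarity computations scales with the candidate set rather than with the whole document library.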
[0016] Preferably, the query features analyzed dynamically in step five include the proportion of professional terms, whether the query contains numbers, and whether the query is an open-ended question.
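One way these three features could drive the fusion weight is sketched below. The patent does not disclose a formula for α, so the coefficients, the cue-word list, and the clamping bounds are all hypothetical; the sketch also assumes both scores are already normalized to a comparable range such as [0, 1].

```python
import re

OPEN_ENDED_CUES = ("how", "why", "what", "precautions")  # hypothetical cue list


def fusion_alpha(query_tokens, domain_terms, base=0.5):
    """Heuristic alpha in (0, 1): higher for terminology queries, lower for descriptive ones."""
    term_ratio = sum(t in domain_terms for t in query_tokens) / max(len(query_tokens), 1)
    has_number = any(re.search(r"\d", t) for t in query_tokens)
    open_ended = any(t in OPEN_ENDED_CUES for t in query_tokens)
    # Assumption: illustrative coefficients, not values from the patent.
    alpha = base + 0.3 * term_ratio
    alpha += 0.1 if has_number else 0.0
    alpha -= 0.3 if open_ended else 0.0
    return min(max(alpha, 0.05), 0.95)  # keep alpha strictly inside (0, 1)


def fused_score(alpha, keyword_score, semantic_score):
    """Weighted fusion: alpha * keyword score + (1 - alpha) * semantic score."""
    return alpha * keyword_score + (1.0 - alpha) * semantic_score


domain_terms = {"c30", "concrete", "compressive", "strength"}
alpha_term = fusion_alpha(["c30", "concrete", "compressive", "strength"], domain_terms)
alpha_desc = fusion_alpha(["what", "are", "construction", "precautions"], domain_terms)
```

With these illustrative settings the terminology query receives a larger α than the descriptive one, so the same pair of scores is fused with more weight on keyword matching in the first case and more weight on semantic similarity in the second.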
[0017] As shown in Figure 2, a two-stage hybrid retrieval system based on keyword guidance and semantic ranking, implementing the above method, includes a keyword extraction module, an inverted index and coarse screening module, a context intelligent expansion module, a semantic ranking module, and a result optimization module. The keyword extraction module is configured to execute step one. The inverted index and coarse screening module includes an inverted index data structure and is configured to execute step two. The context intelligent expansion module is configured to identify the type of each candidate document fragment and to execute the adaptive expansion strategy of step three. The semantic ranking module includes a semantic encoding model and a dynamic weight fusion unit, and is configured to execute steps four and five. The result optimization module is configured to perform the diversity optimization and deduplication of step six and to output the final search result list. These modules are connected in sequence and work together to complete the hybrid retrieval process from query input to result output.
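The sequential connection of the five modules can be pictured with a small Python sketch. The module bodies are placeholders that only record a field each, and the shared state dictionary and its keys are assumptions made for illustration, not part of the described system.

```python
def run_pipeline(query, modules):
    """Run the modules in order; each receives and returns a shared state dict."""
    state = {"query": query, "trace": []}
    for name, module in modules:
        state = module(state)
        state["trace"].append(name)  # record the order in which modules ran
    return state


# Placeholder modules (hypothetical): real ones would implement steps one to six.
modules = [
    ("keyword_extraction", lambda s: {**s, "keywords": s["query"].lower().split()}),
    ("coarse_screening", lambda s: {**s, "candidates": ["f1", "f2"]}),
    ("context_expansion", lambda s: {**s, "expanded": {c: c + " +context" for c in s["candidates"]}}),
    ("semantic_ranking", lambda s: {**s, "ranked": sorted(s["candidates"])}),
    ("result_optimization", lambda s: {**s, "results": s["ranked"][:1]}),
]
result = run_pipeline("C30 concrete", modules)
```

Each module reads what earlier modules produced and adds its own output, which mirrors the flow from query input to result output.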
[0018] Compared with existing technologies, this method and system have the following advantages and technical effects: 1. Significantly improved retrieval accuracy and recall: Keyword coarse screening ensures accurate hits of professional terms, semantic fine ranking recalls relevant concepts, and contextual expansion improves semantic understanding accuracy, fundamentally solving the limitations of a single retrieval mode.
[0019] 2. High retrieval efficiency and low resource consumption: The two-stage architecture limits the computationally intensive semantic similarity calculation to a small set of coarsely screened results after keyword filtering, avoiding full database traversal, effectively reducing the average response time, and is suitable for large-scale document databases.
[0020] 3. Strong intelligent adaptive capability: The innovative dynamic weight fusion mechanism can automatically identify the query intent and achieve the best balance between "literal exact matching" and "semantic fuzzy matching", resulting in a better user experience.
[0021] 4. Excellent adaptability to professional fields: Special context expansion strategies are designed for special structures such as tables and formulas in engineering documents, and domain dictionary import is supported, making the system perform better than general search models in professional scenarios.
[0022] 5. Higher quality results: Post-processing with the MMR (Maximal Marginal Relevance) algorithm ensures the diversity of returned results, reduces information redundancy, and provides users with more comprehensive and valuable reference information.
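A greedy form of MMR selection can be sketched in Python as follows. The trade-off parameter `lam`, the function name, and the sample scores are assumptions for illustration; the patent does not give parameter values.

```python
def mmr_select(relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection of up to k fragment ids.

    relevance: dict mapping fragment id to its fused relevance score.
    similarity: function (id, id) -> similarity in [0, 1] between two fragments.
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(fid):
            # Penalize closeness to the most similar already-selected fragment.
            redundancy = max((similarity(fid, s) for s in selected), default=0.0)
            return lam * relevance[fid] - (1.0 - lam) * redundancy

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: f1 and f2 are near-duplicates of each other.
relevance = {"f1": 0.90, "f2": 0.88, "f3": 0.60}
pair_sim = {frozenset(("f1", "f2")): 0.95}


def sim(a, b):
    return pair_sim.get(frozenset((a, b)), 0.1)


diverse = mmr_select(relevance, sim, k=2)
pure_relevance = mmr_select(relevance, sim, k=2, lam=1.0)
```

With `lam=1.0` the selection reduces to plain relevance ranking and returns both near-duplicates; with the default setting the second slot goes to the less similar fragment instead. A final sort of the selected ids by relevance score could follow if the output must be in strictly descending relevance order.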
Claims
1. A two-stage hybrid retrieval method based on keyword guidance and semantic ranking, characterized in that it comprises the following steps:
Step 1: Query analysis and keyword guidance extraction. Receive the user query text and extract core keywords using a multi-level strategy, including: extraction based on statistical features, term matching based on a pre-built domain dictionary, and expansion based on a thesaurus. Assess the importance of the extracted keywords and select the Top-N keywords to form a query guidance set.
Step 2: Keyword coarse screening based on an inverted index. Using the constructed inverted index, perform rapid matching and retrieval in the document library with the query guidance set, supporting AND and OR logical operations, to initially screen out a batch of candidate document fragments containing the keywords.
Step 3: Intelligent context expansion of candidate document fragments. For each candidate document fragment in the coarse screening result set, analyze the structure of the document to which it belongs and perform adaptive context expansion accordingly.
Step 4: Semantic vectorization and similarity calculation. Using a pre-trained semantic model, encode the user query text and each expanded text fragment into high-dimensional vectors. Calculate the cosine similarity between the user query vector and each expanded fragment vector to obtain the semantic similarity score.
Step 5: Dynamic weight fusion and fine ranking. Obtain the keyword matching score for each candidate document fragment in the coarse screening result set, dynamically analyze the query features, and calculate the fusion weight coefficient α (0 < α < 1) in real time based on the analysis results. Then, weighted fusion score = α × keyword matching score + (1 - α) × semantic similarity score. For terminology queries, α is increased to emphasize keyword matching; for descriptive queries, α is decreased to emphasize semantic understanding. The coarse screening result set is then re-ranked by the weighted fusion score.
Step 6: Result optimization and output. The top-K candidate document fragments after fine ranking are optimized using the maximal marginal relevance (MMR) algorithm. While preserving relevance, the algorithm improves the diversity of the query results and avoids content redundancy, and the method outputs a final list of search results that is deduplicated, diverse, and sorted by descending relevance.
2. The two-stage hybrid retrieval method based on keyword guidance and semantic ranking according to claim 1, characterized in that: in step one, the extraction based on statistical features is performed using the TF-IDF and TextRank algorithms.
3. The two-stage hybrid retrieval method based on keyword guidance and semantic ranking according to claim 1, characterized in that: in step three, the document structure includes paragraphs, tables, formulas, and list items; the adaptive context expansion includes expanding table fragments by adding their titles, headers, and the first few rows of data, and expanding formula fragments by adding their numbering, descriptions, and variable descriptions.
4. The two-stage hybrid retrieval method based on keyword guidance and semantic ranking according to claim 1, characterized in that: the pre-trained semantic model in step four is the BGE-large-zh Chinese text embedding model.
5. The two-stage hybrid retrieval method based on keyword guidance and semantic ranking according to claim 1, characterized in that: the query features analyzed dynamically in step five include the proportion of professional terms, whether the query contains numbers, and whether the query is an open-ended question.
6. A two-stage hybrid retrieval system based on keyword guidance and semantic ranking, implementing the method of any one of claims 1 to 5, characterized in that: it includes a keyword extraction module, an inverted index and coarse screening module, a context intelligent expansion module, a semantic ranking module, and a result optimization module; the keyword extraction module is configured to execute step one; the inverted index and coarse screening module includes an inverted index data structure and is configured to execute step two; the context intelligent expansion module is configured to identify the type of each candidate document fragment and to execute the adaptive expansion strategy of step three; the semantic ranking module includes a semantic encoding model and a dynamic weight fusion unit, and is configured to execute steps four and five; the result optimization module is configured to perform the diversity optimization and deduplication of step six and to output the final search result list; the modules are connected in sequence and work together to complete the hybrid retrieval process from query input to result output.