Multi-model paper retrieval method for academic question answering

By combining a multi-model architecture with contrastive learning training samples, the problem of insufficient accuracy and robustness of academic question-answering systems in high-difficulty tasks is solved, achieving more efficient academic literature retrieval and interdisciplinary understanding.

CN122019735B · Active · Publication Date: 2026-06-23 · SOUTHWEST PETROLEUM UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SOUTHWEST PETROLEUM UNIV
Filing Date: 2026-04-13
Publication Date: 2026-06-23


Abstract

The application discloses a multi-model paper retrieval method for academic question answering, relating to the technical fields of natural language processing and information retrieval. The method includes: first, constructing a unified corpus and a training dataset, and encoding the corpus with each model in a first model set and with a second target model to generate the corresponding document vector sets; then, based on the initial retrieval results of the second target model, selecting hard negative samples and constructing contrastive learning sample pairs to fine-tune the second target model and re-encode the corpus; finally, forming a model group from the fine-tuned second target model and the models in the first model set, having each model in the model group perform similarity retrieval in parallel to obtain the corresponding original similarity matrix, and filtering documents based on the original similarity matrices to generate the first target document list for the target query. The application can effectively improve the accuracy and robustness of academic literature retrieval.

Description

Technical Field

[0001] This invention relates to the fields of natural language processing and information retrieval technology, specifically a multi-model paper retrieval method for academic question answering.

Background Technology

[0002] With the deepening of academic research and the exponential growth of the number of documents, academic question-answering systems have become a key tool for researchers to efficiently acquire knowledge. To overcome the limited depth of semantic understanding in traditional keyword matching, most existing mainstream technical solutions adopt a deep-learning-based dense retrieval paradigm: a pre-trained language model is used as an encoder to map the user's natural language query and the academic documents into the same high-dimensional vector space, and relevant documents are recalled by calculating the geometric distance between vectors.

[0003] However, queries in academic fields are often highly abstract and technically specialized. The correct answer and the distracting documents that are merely literally similar often have only very subtle semantic boundaries. When the retrieval system uses only a general pre-trained model, due to the lack of a deep understanding of the logic of a specific domain, the model is easily misled by these highly similar difficult samples and cannot accurately distinguish the subtle differences in the core semantics. On the other hand, if a single model is aggressively fine-tuned simply to improve domain adaptability, it is very easy for the model to overfit on a specific data distribution, thereby causing it to lose its ability to generalize and understand diverse question formats or cross-domain knowledge.

[0004] The contradiction between domain specificity and semantic generalization inherent in this single model makes it difficult for existing technologies to maintain broad semantic coverage while effectively suppressing noise interference when faced with challenging academic question-answering tasks. This results in low accuracy of search results and a lack of sufficient robustness in the system.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a multi-model paper retrieval method oriented towards academic question answering.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows:

[0007] A multi-model paper retrieval method for academic question answering includes the following steps:

[0008] Step S1: Construct a unified corpus containing candidate documents and a training dataset containing query text and its positive document associations;

[0009] Step S2: Pre-set at least two models, one of which is the second target model, and the remaining models construct the first model set. The first model set contains at least one language model that is different from the second target model, thereby constructing a multi-dimensional semantic feature space and minimizing the systematic bias caused by a single architecture; use each model to encode the unified corpus to generate the corresponding document embedding vector set.

[0010] Step S3: Encode each query text using the second target model, and perform an initial retrieval based on each encoding result in the document embedding vector set corresponding to the second target model.

[0011] Step S4: For each query text, based on the results of the initial retrieval and the positive document association, select documents that meet the preset relevance conditions but are not positive documents as the hard samples corresponding to the query text, and construct a contrastive learning training sample containing the query text, positive documents and hard samples.

[0012] Step S5: Use contrastive learning training samples to fine-tune the second target model, and use the fine-tuned second target model to re-encode the unified corpus to obtain the document embedding vector set corresponding to the fine-tuned second target model;

[0013] Step S6: The fine-tuned second target model and the models in the first model set are used as the model group; each model in the model group is used to encode the target query, generate the corresponding query vector, and perform similarity retrieval in parallel in the document embedding vector set corresponding to each model to obtain the corresponding original similarity matrix, which is used to characterize the similarity between the target query and each candidate document in each model;

[0014] Step S7: Perform adaptive denoising and normalization on each original similarity matrix to obtain the similarity score of each model in the model group for each candidate document; determine the weight of each model in the model group according to the weight strategy; for each candidate document, compute the weighted sum of its similarity scores across the models in the model group, using the weight of each model, to obtain the corresponding fusion score; filter documents according to the fusion score of each candidate document to generate the first target document list for the target query.
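
The hard-sample selection of steps S3 and S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the "preset relevance condition" means ranking within the top `top_k` results of the initial retrieval, and the function names, document IDs and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, top_k=3):
    """Initial retrieval, then keep highly ranked documents that are not positives.

    Assumption: a document meets the relevance condition if it is ranked
    within the top_k results of the initial retrieval.
    """
    ranked = sorted(
        ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(d, s) for d, s in ranked[:top_k] if d not in positive_ids]

# Toy document embeddings (hypothetical values).
doc_vecs = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.8, 0.6],
    "p4": [0.0, 1.0],
}
hard = mine_hard_negatives([1.0, 0.0], doc_vecs, positive_ids={"p1"}, top_k=3)

# One contrastive training sample: query, positives, hard samples.
sample = {
    "query": "q",
    "positives": ["p1"],
    "hard_negatives": [doc_id for doc_id, _ in hard],
}
```

Here `p2` and `p3` rank highly for the query but are not labeled positive, so they become the hard samples that the fine-tuning in step S5 must learn to push away.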

[0015] In one specific embodiment of the present invention, in step S1, when constructing the unified corpus, the following operations are performed on each candidate document:

[0016] S11. Extract the title text field and the abstract text field. If the abstract is missing, extract information from the main text introduction as the abstract.

[0017] S12. Clean each extracted text field separately to increase the density of effective information;

[0018] S13. Linearly connect the cleaned title text field and the abstract text field to generate a document text sequence that represents the core content of the document.

[0019] In one specific embodiment of the present invention, step S1, when constructing the unified corpus, further includes the following operations:

[0020] S14. A pre-built instruction prompt dictionary containing multiple types of academic intent is provided. The instruction prompt dictionary includes instruction template strings corresponding to each type of academic intent. The types of academic intent include methodology-oriented, experimental result-oriented, and review-oriented.

[0021] S15. For each candidate document, identify its academic intent based on the abstract, and concatenate the academic intent to the front of the document text sequence that represents the core content of the document, generating a model input sequence containing intent semantics to represent the candidate document.

[0022] In one specific embodiment of the present invention, step S2, when encoding the unified corpus, includes the following operations:

[0023] S21. The model input sequence of each candidate document is tokenized to obtain the corresponding token identifier sequence; each token identifier sequence is input into the model for forward propagation to obtain the corresponding last-layer hidden state tensor sequence containing rich semantic information.

[0024] S22. Using the mean pooling strategy, reduce the last-layer hidden state tensor sequence corresponding to each candidate document to a vector of fixed dimension.

[0025] S23. Perform L2 norm normalization on the fixed-dimensional vectors obtained in step S22, forcing all vectors onto the unit hypersphere, thereby unifying the scale of the vectors and obtaining the document embedding vector of each candidate document.

[0026] In one specific embodiment of the present invention, in step S5, the contrastive learning fine-tuning includes:

[0027] S51. Keep the backbone network parameters of the second target model frozen during fine-tuning;

[0028] S52. Connect a trainable low-rank adaptation matrix in parallel to the query, key, value and output projection layers in the attention mechanism module of the second target model, and to the up-projection, down-projection and gated projection layers in its feedforward network module.

[0029] S53. Calculate the gradient of the loss function based on the contrastive learning training samples, and use this gradient to update only the weight parameters of the low-rank adaptation matrices.

[0030] The loss function is calculated as follows:

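[0031] $$\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(q, d^{+})/\tau\right)}{\exp\left(\mathrm{sim}(q, d^{+})/\tau\right) + \sum_{i=1}^{N} \exp\left(\mathrm{sim}(q, d_{i}^{-})/\tau\right)}$$

[0032] In the formula, $\mathcal{L}$ represents the in-batch contrastive loss value; $q$ is the vector representation generated for the query text; $d^{+}$ is the vector representation generated for the positive document; $d_{i}^{-}$ is the vector representation generated for the $i$-th hard sample; $N$ is the total number of such samples; $\mathrm{sim}(\cdot,\cdot)$ represents the function for calculating the dot product or cosine similarity; and $\tau$ is the preset temperature coefficient.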
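
The contrastive objective of step S53 can be sketched numerically as below. This is a minimal sketch assuming the standard InfoNCE form with one positive document and a list of hard samples; the function names and the default temperature are illustrative assumptions, and the dot product is used as the similarity function.

```python
import math

def dot(u, v):
    """Dot-product similarity (equal to cosine similarity for unit vectors)."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """InfoNCE-style loss for one query (assumed form, not taken from the source).

    q      : query vector
    d_pos  : positive document vector
    d_negs : list of hard-sample vectors
    tau    : temperature coefficient (hypothetical default)
    """
    logits = [dot(q, d_pos) / tau] + [dot(q, d) / tau for d in d_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# An easy negative (orthogonal to the query) yields a small loss;
# a hard negative (identical to the positive) yields log(2) at tau = 1.
easy = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=1.0)
hard = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]], tau=1.0)
```

A lower temperature sharpens the softmax, so hard samples close to the query dominate the gradient; this is why the choice of the temperature coefficient interacts with how the hard samples are selected.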

[0033] In one specific embodiment of the present invention, in step S5, re-encoding the unified corpus using the fine-tuned second target model includes:

[0034] S55. The weight parameters of the updated low-rank adaptation matrix are numerically merged with the backbone network parameters to generate the merged second target model.

[0035] S56. The unified corpus is encoded again using the merged second target model to generate an optimized second document embedding vector set.
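
The numerical merge in S55 can be illustrated on a single projection layer. The sketch below assumes the usual low-rank adaptation form, in which a frozen weight W is augmented by a scaled product B·A; the scaling factor, the matrix shapes and all values are hypothetical. It checks that the merged weight produces the same output as running the backbone and the parallel low-rank branch separately.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

# Frozen backbone weight (2 outputs x 3 inputs); values are hypothetical.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
# Trainable low-rank factors with rank r = 1: B is 2 x 1, A is 1 x 3.
B = [[0.5], [-0.25]]
A = [[1.0, 2.0, 0.0]]
scaling = 2.0  # assumed alpha / r scaling factor, not specified in the source

x = [[1.0], [1.0], [1.0]]  # one input column vector

# During fine-tuning: backbone output plus the parallel low-rank branch.
parallel = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# After fine-tuning: fold the low-rank update into the backbone weight once.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)
```

After merging, the model has the same architecture and inference cost as the original backbone, which is what allows the corpus re-encoding in S56 to run without the extra adapter branch.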

[0036] In a specific embodiment of the present invention, in step S6, after each model in the model group performs similarity retrieval with its query vector, it extracts the top K candidate documents with the highest similarity and constructs the corresponding second candidate document set. The original similarity matrix is used to characterize the similarity between the target query and each candidate document in the second candidate document set corresponding to the model.

[0037] As a specific embodiment of the present invention, step S7 includes:

[0038] S71. Determine the similarity score of each model in the model group for each candidate document: For each original similarity matrix, arrange each element in the original similarity matrix in descending order to generate a score sequence, calculate the second difference sequence of the score sequence, and take the position corresponding to the largest element in the second difference sequence as the cutoff point. Construct a third candidate document set with the candidate documents before the cutoff point in the score sequence, and normalize the similarity score of each candidate document in the third candidate document set as the score of the model corresponding to the original similarity matrix for each candidate document.

[0039] S72. Determine the dynamic weights of each model in the model group: For each model, convert the similarity scores of each candidate document in the second candidate document set into probability values, and obtain the retrieval entropy value of the model based on the probability values; then determine the dynamic weights based on the retrieval entropy values of each model, calculated as follows:

[0040] $$w_{m} = \frac{\exp(-\gamma H_{m})}{\sum_{j=1}^{M} \exp(-\gamma H_{j})}$$

[0041] $$H_{m} = -\sum_{d \in C_{m}} p_{m}(d) \log p_{m}(d)$$

[0042] In the formula, $w_{m}$ is the dynamic weight of the $m$-th model; $\gamma$ is the preset sensitivity adjustment factor; $M$ represents the total number of models; $H_{m}$ indicates the retrieval entropy of the $m$-th model for the target query; $C_{m}$ is the second candidate document set of the $m$-th model; and $p_{m}(d)$ is the probability distribution value of candidate document $d$ in model $m$;

[0043] S73. The similarity scores of each candidate document in the models of the model group are weighted by the dynamic weights of the models and summed to obtain the corresponding fusion score:

[0044] $$S(d) = \sum_{m=1}^{M} w_{m}\, s_{m}(d)$$

[0045] In the formula, $S(d)$ indicates the fusion score of candidate document $d$ for the target query; $s_{m}(d)$ is the similarity score of the $m$-th model for candidate document $d$; $w_{m}$ is the dynamic weight of the $m$-th model; and $M$ is the total number of models;

[0046] S74. Filter documents based on the fusion scores of each candidate document and generate a first target document list for the target query.
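
Steps S71 to S73 can be sketched end to end as follows. Several details here are assumptions rather than statements of the source: the cutoff is placed at the centre of the largest second difference, kept scores are normalized by dividing by the top score, similarity scores are converted to probabilities with a softmax, and the weights are proportional to exp(-gamma * entropy). All names and toy scores are hypothetical.

```python
import math

def elbow_truncate(scored):
    """S71: keep the documents before the largest second difference, then normalize."""
    if not scored:
        return {}
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    s = [v for _, v in ranked]
    if len(s) < 3:
        kept = ranked
    else:
        d2 = [s[i] - 2 * s[i + 1] + s[i + 2] for i in range(len(s) - 2)]
        cutoff = d2.index(max(d2)) + 1  # assumed mapping from difference index to position
        kept = ranked[:cutoff]
    top = kept[0][1]
    # Assumed normalization: divide by the best score among the kept documents.
    return {d: (v / top if top > 0 else 0.0) for d, v in kept}

def retrieval_entropy(scored):
    """S72: Shannon entropy of the softmax distribution over the top-K scores."""
    vals = list(scored.values())
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

def dynamic_weights(entropies, gamma=1.0):
    """S72: lower entropy (higher confidence) gives a larger weight (assumed form)."""
    raw = [math.exp(-gamma * h) for h in entropies]
    z = sum(raw)
    return [r / z for r in raw]

def fuse(per_model_scores, weights):
    """S73: weighted sum per document; a document missing from a model contributes 0."""
    fused = {}
    for w, scores in zip(weights, per_model_scores):
        for d, v in scores.items():
            fused[d] = fused.get(d, 0.0) + w * v
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

# Top-K results of two models for the same target query (toy values).
model_a = {"p1": 0.90, "p2": 0.88, "p3": 0.86, "p4": 0.40, "p5": 0.38}
model_b = {"p2": 0.95, "p1": 0.30, "p6": 0.29, "p4": 0.28}

truncated = [elbow_truncate(model_a), elbow_truncate(model_b)]
weights = dynamic_weights([retrieval_entropy(model_a), retrieval_entropy(model_b)])
fused = fuse(truncated, weights)
```

In this toy case model A keeps its three head documents and model B keeps only `p2`; because model B's score distribution is more peaked, it receives the larger weight, and `p2` ends up with the highest fusion score.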

[0047] As a specific embodiment of the present invention, it further includes performing secondary filtering on candidate documents in the first target document list to generate a second target document list, specifically including:

[0048] S81. The weighted average of the retrieval entropy values of the target query in each model of the model group is taken as the retrieval entropy value corresponding to the target query;

[0049] S82. Convert the retrieval entropy value corresponding to the target query into a diversity adjustment coefficient, calculated as follows:

[0050]

[0051] In the formula, the quantities are the diversity adjustment coefficient, the retrieval entropy value corresponding to the target query, and the preset hyperparameters;

[0052] S83. Based on the diversity adjustment coefficient and the fusion score of each candidate document in the first target document list, the maximum marginal relevance score of each candidate document is calculated using the maximum marginal relevance algorithm. Candidate documents are iteratively selected from high to low according to the maximum marginal relevance score and added to the second target document list until the preset number of documents is reached, and the final second target document list is generated.

[0053] The formula for calculating the maximum marginal correlation score is as follows:

[0054]

[0055] The formula yields the maximum marginal relevance score of each candidate document;

[0056] Its redundancy penalty is the semantic similarity between the current candidate document and the most similar document already selected into the second target document list.
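
The iterative selection of S83 can be sketched with the classic maximum marginal relevance rule. How the diversity adjustment coefficient enters the score is an assumption here: `lam` weights the fusion score and `1 - lam` weights the redundancy penalty. The function name, document IDs and similarity values are hypothetical.

```python
def mmr_select(fusion, sim, lam, k):
    """Greedy maximum-marginal-relevance selection.

    fusion : {doc_id: fusion score}
    sim    : function (doc_id, doc_id) -> semantic similarity
    lam    : trade-off in [0, 1]; assumed to be derived from the
             diversity adjustment coefficient
    k      : preset number of documents to select
    """
    selected = []
    remaining = set(fusion)
    while remaining and len(selected) < k:
        def mmr(d):
            # Redundancy penalty: similarity to the closest already-selected document.
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * fusion[d] - (1.0 - lam) * redundancy
        best = max(sorted(remaining), key=mmr)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

fusion = {"a": 0.90, "b": 0.85, "c": 0.60}
pair_sim = {
    frozenset(("a", "b")): 0.95,  # "a" and "b" are near-duplicates
    frozenset(("a", "c")): 0.10,
    frozenset(("b", "c")): 0.10,
}

def sim(x, y):
    return pair_sim[frozenset((x, y))]

diverse = mmr_select(fusion, sim, lam=0.5, k=2)
relevance_only = mmr_select(fusion, sim, lam=1.0, k=2)
```

With `lam=0.5` the near-duplicate `b` is skipped in favour of `c`, while `lam=1.0` reduces to plain ranking by fusion score. Tying the trade-off to the retrieval entropy lets uncertain queries receive a more diverse list.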

[0057] As a specific embodiment of the present invention, it also includes introducing graph computation concepts to perform secondary verification on the first target document list, specifically including:

[0058] S91. Obtain the reference list and citation list of each candidate document in the first target document list, and establish a local citation topology graph with candidate documents as nodes and citation relationships as directed edges.

[0059] S92. In the local citation topology graph, the fusion score of each candidate document in the first target document list is used as the activation energy of the corresponding node. The weighted in-degree centrality of each node is calculated as follows:

[0060]

[0061] In the formula, the quantities are: the weighted in-degree centrality of a node; the fusion scores of the candidate documents corresponding to that node's out-neighbor nodes; the set of out-neighbor nodes of that node; and the attenuation coefficient;

[0062] S93. Normalize the weighted in-degree centrality of all nodes to generate the topological weight of the candidate document corresponding to the node. Combine the fusion score of the candidate document corresponding to the node to calculate the re-ranking score, as shown in the following formula:

[0063]

[0064] In the formula, the quantities are: the re-ranking score of a candidate document; the topological weight of that candidate document; and the adjustment factor;

[0065] S94. Reorder the first target document list according to the re-ranking scores, and output the final target document list.
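
The citation-based verification of S91 to S94 can be sketched as below. The exact formulas are assumptions: a node accumulates the attenuated fusion scores of the candidates that cite it, topological weights are obtained by dividing by the largest centrality, and the re-ranking score is a linear blend controlled by `eta`. The edge direction convention, parameter names and toy data are all hypothetical.

```python
def rerank_by_citations(fusion, cites, beta=0.5, eta=0.3):
    """Re-rank candidates using a local citation graph.

    fusion : {doc_id: fusion score}, used as node activation energy
    cites  : {doc_id: set of doc_ids it cites}; only edges between
             candidates are used (the graph is local)
    beta   : attenuation coefficient (hypothetical default)
    eta    : adjustment factor blending the topological weight (hypothetical default)
    """
    centrality = {d: 0.0 for d in fusion}
    for citing, cited_docs in cites.items():
        if citing not in fusion:
            continue
        for cited in cited_docs:
            if cited in fusion:
                # Assumed form: each citing candidate passes on its attenuated score.
                centrality[cited] += beta * fusion[citing]
    top = max(centrality.values(), default=0.0)
    topo = {d: (c / top if top > 0 else 0.0) for d, c in centrality.items()}
    score = {d: (1.0 - eta) * fusion[d] + eta * topo[d] for d in fusion}
    return sorted(fusion, key=lambda d: score[d], reverse=True), score

fusion = {"survey": 0.9, "application": 0.8, "root": 0.7}
cites = {"survey": {"root"}, "application": {"root", "external"}}
order, score = rerank_by_citations(fusion, cites)
```

The paper `root` has the lowest fusion score but is cited by both other candidates, so it moves to the top of the list. This reflects the intended behaviour of surfacing the source literature behind superficially relevant papers.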

[0066] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0067] 1. By comparing the initial search results with the ground truth, the method identifies the hard samples that the model misjudged as highly relevant. Constructing contrastive learning triples containing these hard samples and fine-tuning on them forces the model to reshape its decision boundary in the vector space, pushing the query vector away from the interfering documents. This error-correction mechanism significantly improves the model's ability to distinguish subtle semantic differences and ensures the logical accuracy of the search results.

[0068] 2. To address the catastrophic forgetting and cognitive narrowing caused by full fine-tuning, this invention employs a heterogeneous model integration strategy: on the one hand, it uses a first model set (such as generative and encoding models) kept in its pre-trained state to provide broad general semantic support; on the other hand, it fine-tunes the second target model into a deep domain expert model. This parallel architecture exploits the orthogonality of the error planes of different models, significantly improving retrieval accuracy in specific academic fields while preserving the ability to generalize to and understand interdisciplinary terms.

[0069] 3. The determinism of each model is quantified by calculating the retrieval entropy of the retrieval score distribution. Dynamic weight coefficients are generated in real time based on the entropy value, which automatically suppresses the voting power of high-entropy (low-confidence) models and amplifies the contribution of low-entropy (high-confidence) models. This data-driven adaptive fusion mechanism significantly enhances the robustness of the system when facing rare or complex queries.

[0070] 4. This invention uses the second-order difference algorithm to calculate the elbow point (gradient mutation point) of the similarity score curve, and realizes adaptive truncation of the candidate document list. Combined with the alignment and normalization processing of the global sparse tensor, this method ensures that only the head documents located in the high relevance interval participate in the final decision, eliminates invalid tail noise from the source, and further improves the purity and relevance of the final output document list.

[0071] 5. In the re-ranking stage, query semantics are used as the activation energy of the citation network to calculate weighted in-degree centrality. This mimics how researchers trace literature from the surface to the core, following clues from superficially related papers to the core theory behind them. It enables the retrieval system not only to find relevant papers but also to surface the root literature with explanatory power.

Detailed Implementation

[0072] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0073] A multi-model paper retrieval method for academic question answering includes the following steps:

[0074] Step S1: Construct a unified corpus containing candidate documents and a training dataset containing query text and its positive document associations;

[0075] Based on existing large-scale academic data processing practices, and considering the characteristics of academic literature being unstructured, semantically fragmented, and question-answering tasks being sensitive to instructions, the purpose of this step is to build a standardized, unified data foundation containing rich semantic information, thereby eliminating the interference caused by differences in the original data format to subsequent multi-model coding.

[0076] First, to address the issue of format heterogeneity in the massive number of candidate documents, a standardized cleaning and structured reorganization process is implemented. Since the core semantics of academic papers are highly condensed in the title and abstract, and redundant information in the full text may introduce noise and increase computational overhead, each candidate document in the original database is traversed to extract its title and abstract text fields. During extraction, to preserve core semantics as far as possible while removing format noise specific to academic texts, a rule-based deep cleaning strategy is executed. First, LaTeX mathematical formula markers (such as...), ... and special symbols are converted into plain-text placeholders or natural language descriptions using regular expressions, to prevent them from interfering with the semantic segmentation of the subsequent tokenizer. Second, a template sentence filtering library is established to automatically remove common templated statements in the abstract that lack substantive semantics (such as 'This paper presents a novel method...'), improving the density of effective information. Finally, for documents whose abstract is empty due to missing metadata, a fallback strategy automatically extracts the first 200-300 characters of the introduction as a substitute abstract, avoiding data gaps caused by missing fields.

[0077] To ensure efficiency and memory safety during the extraction process, key-value mapping-based streaming parsing or distributed data frame processing frameworks (such as Apache Spark / Pandas) can be used. Through pre-defined field filters, the `title` and `abstract` keys in each paper's data object are precisely located and extracted. To clearly distinguish the document's hierarchical structure while preserving the natural semantics of the text, the system employs specific concatenation rules, using line breaks as semantic delimiters to linearly connect the title and abstract text fields, generating a document text sequence that represents the core content of the document. Its logical construction formula can be expressed as:

[0078] $$T_{doc} = T_{title} \oplus \mathtt{\backslash n} \oplus T_{abs}$$

[0079] In the formula, $T_{doc}$ represents the generated document text sequence; $T_{title}$ indicates the extracted title text field; $T_{abs}$ indicates the extracted abstract text field; $\oplus$ represents the string concatenation operation; and $\mathtt{\backslash n}$ is the newline character.
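
The cleaning, fallback and concatenation rules described above can be sketched as follows. The regular expressions are illustrative assumptions only: the source does not list the actual formula markers or template sentences, so the inline `$...$` pattern, the `[MATH]` placeholder, the single boilerplate pattern and the `introduction` field name are all hypothetical.

```python
import re

# Hypothetical patterns; the real marker list and sentence library are not given.
MATH_SPAN = re.compile(r"\$[^$]*\$")
BOILERPLATE = [
    re.compile(r"^\s*this paper presents a novel method[^.]*\.\s*", re.IGNORECASE),
]

def clean(text):
    """Replace inline math with a placeholder, drop template sentences, squeeze spaces."""
    text = MATH_SPAN.sub("[MATH]", text)
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def build_doc_text(record, fallback_chars=300):
    """Build the document text sequence: cleaned title, newline, cleaned abstract.

    If the abstract is missing, the first `fallback_chars` characters of the
    introduction stand in for it (the source gives a range of 200-300).
    """
    title = clean(record.get("title") or "")
    abstract = clean(record.get("abstract") or "")
    if not abstract:
        abstract = clean((record.get("introduction") or "")[:fallback_chars])
    return title + "\n" + abstract

doc = build_doc_text({
    "title": "Dense  Retrieval",
    "abstract": "This paper presents a novel method for search. We use $x^2$ loss.",
})
fallback = build_doc_text({"title": "T", "abstract": "", "introduction": "x" * 500})
```

Keeping the title and abstract on separate lines preserves a visible boundary between the two fields without introducing special tokens that a given encoder might not recognize.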

[0080] Furthermore, considering that the heterogeneous language model set used in subsequent steps is highly sensitive to task instructions, and that different types of academic literature (such as review articles and experimental reports) have significantly different implicit retrieval needs, simply using uniform static instructions cannot fully activate the model's semantic representation capabilities in specific dimensions. Therefore, this embodiment introduces a dynamic adaptation mechanism for instruction templates before encoding.

[0081] First, a pre-built instruction prompt dictionary containing various academic intent types is constructed. This dictionary defines dedicated instruction template strings for different academic document types; that is, each academic intent type corresponds to one instruction template string. The specific academic intents include, but are not limited to:

[0082] Methodology-oriented: For documents that focus on algorithmic innovation, mathematical derivation, or architectural design, the instruction template is set as: "Retrieve papers describing novel algorithms or theoretical frameworks for:" to guide the model to focus on methodological details;

[0083] Result-oriented: For documents that focus on performance comparisons, ablation experiments, or state-of-the-art (SOTA) results, the instruction template is set as: "Find experimental comparisons and benchmark results, regarding:" to guide the model to focus on data and metrics.

[0084] Survey-oriented: For summary, retrospective, or taxonomic documents, the instruction template is set as: "Summarize the development history and research trends of:", to guide the model to focus on the macro knowledge context.

[0085] Next, paragraph-level intent classification is performed on each candidate document in the unified corpus. A lightweight classifier (such as a DistilBERT-based classification model) is used to analyze the document's abstract text field and identify the academic intent of the candidate document. For example, if words such as "survey," "review," and "taxonomy" appear frequently in the abstract, the document is classified as survey-oriented.

[0086] Finally, based on the identified academic intent, the corresponding instruction template string is looked up in the instruction prompt dictionary and concatenated to the front of the document text sequence representing the core content of the document, generating a model input sequence containing intent semantics. This sequence is stored in the unified corpus as input for subsequent encoding operations. The logical formula for this process is updated as follows:

[0087] $$T_{input} = P_{intent} \oplus T_{doc}$$

[0088] In the formula, $T_{input}$ represents the generated model input sequence; $P_{intent}$ represents the dynamically matched intent instruction template string; $T_{doc}$ is the document text sequence; and $\oplus$ is the string concatenation operation. In this way, the generated model input sequence explicitly incorporates semantic features of intent, thereby implicitly clustering documents with different functions in the vector space and significantly improving the targeting of the retrieval.
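
The dynamic instruction adaptation can be sketched as below. The three template strings are taken from the embodiment; everything else is an assumption made for illustration: a keyword count stands in for the lightweight classifier, the keyword lists are hypothetical, documents without keyword hits default to the methodology intent, and a single space separates the template from the document text.

```python
INSTRUCTION_PROMPTS = {
    "methodology": "Retrieve papers describing novel algorithms or theoretical frameworks for:",
    "results": "Find experimental comparisons and benchmark results, regarding:",
    "survey": "Summarize the development history and research trends of:",
}

# Hypothetical keyword lists; a trained classifier would replace this lookup.
INTENT_KEYWORDS = {
    "survey": ("survey", "review", "taxonomy"),
    "results": ("benchmark", "ablation", "state-of-the-art"),
}

def classify_intent(abstract):
    """Keyword-count stand-in for the intent classifier (assumption)."""
    text = abstract.lower()
    counts = {
        intent: sum(text.count(word) for word in words)
        for intent, words in INTENT_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    # Assumed default when no keyword matches.
    return best if counts[best] > 0 else "methodology"

def build_model_input(doc_text, abstract):
    """Prepend the matched instruction template to the document text sequence."""
    prompt = INSTRUCTION_PROMPTS[classify_intent(abstract)]
    return prompt + " " + doc_text

survey_input = build_model_input(
    "Dense Retrieval\nA survey and taxonomy of dense retrievers.",
    "A survey and taxonomy of dense retrievers.",
)
method_input = build_model_input(
    "Sparse Attention\nWe propose a new attention layer.",
    "We propose a new attention layer.",
)
```

Because the template is part of the encoded text, documents sharing an intent also share a common prefix, which is one plausible reading of how the embodiment expects them to cluster in the vector space.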

[0089] Meanwhile, to construct supervisory signals that can effectively guide model fine-tuning, the system also processes raw question-and-answer data (such as the OAG-QA dataset) containing query text and its positive document associations. The system parses the query text and the corresponding list of relevant paper IDs in the raw data to establish a ground-truth mapping between each query text and its positively related documents. A "positive document" is strictly defined as an academic paper in the training dataset that has been manually labeled or verified by citation relationships and contains direct answers or core evidence for a specific academic question (Query). In the vector space of metric learning, it acts as the attraction target in the "anchor-positive example" pair and is fundamentally different from samples that are only literally similar.

[0090] This process not only extracts the semantic features of the query text, but also establishes the positive example boundary in subsequent contrastive learning. The mapping relationship of the constructed training dataset can be defined as follows:

[0091] $$D_{train} = \{(q_{i}, P_{i})\}_{i=1}^{Z}$$

[0092] In the formula, $D_{train}$ represents the training dataset; $q_{i}$ indicates the $i$-th query text; $P_{i}$ indicates the set of document identifiers positively associated with $q_{i}$, where the document identifier is the document ID; and $Z$ is the total number of training samples. Through the above processing, the heterogeneous raw data is transformed into a structured unified corpus and a training dataset with clear supervision relationships. This solves the problem of model input alignment difficulties caused by chaotic data formats in existing technologies and lays a solid data foundation for subsequent deep optimization based on hard sample mining.

[0093] Step S2: Pre-set at least two models, one of which is the second target model, and the remaining models construct the first model set. The first model set contains at least one language model that is different from the second target model. Encode the unified corpus using each model to generate the corresponding document embedding vector set.

[0094] After constructing the unified corpus, step S2 aims to overcome the limitations of a single model in semantic understanding by transforming the model input sequence of each candidate document into dense vectors capable of representing deep semantics through a parallelized high-dimensional feature extraction mechanism. Among existing large-scale pre-trained language models, single-architecture models often have specific capability boundaries. For example, some models excel at instruction following but lack the ability to capture long-range contextual dependencies, while others have strong retrieval and matching patterns but generalize poorly to unfamiliar academic terms. To address this issue, this method employs a "heterogeneous model ensemble" encoding strategy, invoking the models in the pre-defined first model set and the second target model to process the data in the unified corpus in parallel.

[0095] Specifically, in some embodiments, the first model set is configured to include a collection of at least one (usually multiple, complementary) pre-trained language models that differ from the second target model. Examples include integrating a Transformer-based, fine-tuned general semantic encoding model (e.g., NV-Embed-v1), a discriminative model that explicitly enhances retrieval relevance (e.g., Linq-Embed-Mistral), and a global context-aware model with bidirectional attention (e.g., GritLM-7B). Simultaneously, a general model with good autoregressive generation properties and suitable for efficient subsequent parameter fine-tuning (e.g., SFR-Embedding-Mistral) is selected as the second target model. This differentiated model selection logic ensures that the generated vector space contains both broad general semantic features and retains specific retrieval optimization features, laying a diverse foundation for subsequent complementary fusion.

[0096] It is important to note that this embodiment employs the aforementioned heterogeneous combination of generative, discriminative, and encoding models on the theoretical basis of the orthogonality of error planes. Generative models excel at capturing long-range contextual dependencies but may produce hallucinations when handling keyword matching in non-natural language, while discriminative models strengthen query-document matching patterns through contrastive training and are sensitive to hard matching but generalize weakly. By integrating these models, a multi-dimensional semantic feature space is constructed in which the blind spots (high-entropy regions) of one type of model often fall into the confidence regions (low-entropy regions) of another type, thereby minimizing the systematic bias caused by a single architecture.

[0097] In the specific encoding implementation, the model input sequence containing intent semantics corresponding to each candidate document is tokenized to obtain its token identifier sequence. Each token identifier sequence in the unified corpus is first input into the aforementioned models for forward propagation. The multi-layer Transformer blocks inside the model capture the dependencies between tokens through a self-attention mechanism and output the last-layer hidden state tensor sequence containing rich semantic information. To reduce the variable-length tensor sequence to a fixed-dimensional document-level representation, a mean pooling strategy is adopted to calculate the arithmetic mean of the hidden state vectors corresponding to all valid tokens in the sequence. This process aims to eliminate the influence of text length differences while preserving the global semantic center of gravity of the entire document. Its calculation logic can be expressed as follows:

[0098] $$\mathbf{v} = \frac{1}{L} \sum_{i=1}^{L} \mathbf{h}_i$$

[0099] In the formula, $\mathbf{v}$ represents the aggregated candidate document representation vector; $L$ indicates the effective length of the input sequence; and $\mathbf{h}_i$ indicates the high-dimensional hidden state vector output for the token at the $i$-th position of the sequence in the last layer of the model.

[0100] Next, considering the significant differences in the magnitude distribution of vectors output by different models, directly using the original vectors for similarity calculation would lead to inconsistent measurement standards. Therefore, to adapt to the subsequent efficient retrieval algorithm based on inner product, this method performs L2 norm normalization on the vectors after mean aggregation, forcing all vectors to be mapped onto the unit hypersphere to obtain document embedding vectors. This mathematical transformation not only unifies the dimensions but also makes the inner product operation between vectors geometrically equivalent to cosine similarity, thereby significantly improving the computational efficiency and numerical stability of retrieval. The normalization calculation formula is defined as:

[0101] $$\mathbf{e} = \frac{\mathbf{v}}{\lVert \mathbf{v} \rVert_2}$$

[0102] In the formula, $\mathbf{e}$ represents the final generated document embedding vector, and $\lVert \mathbf{v} \rVert_2$ represents the L2 norm of the original vector $\mathbf{v}$ obtained by mean pooling. Finally, the vectors generated by each model in the first model set are aggregated to form the corresponding first document embedding vector set (N models in the first model set correspond to N first document embedding vector sets); simultaneously, the vectors generated by the second target model are aggregated to form the initial second document embedding vector set. Through this process, a mapping from the text space to a multi-dimensional heterogeneous vector space is achieved, ensuring that subsequent retrieval steps can be carried out in parallel from multiple complementary semantic perspectives.
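The pooling and normalization described above can be illustrated with a minimal sketch. It assumes the last-layer hidden states are already available as plain Python lists and that an attention mask marks the valid tokens; the names `mean_pool`, `l2_normalize`, and the toy values are hypothetical and are not part of the original method.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the last-layer hidden states over valid (non-padding) tokens."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    valid = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            valid += 1
            for k in range(dim):
                total[k] += vec[k]
    if valid == 0:
        raise ValueError("sequence has no valid tokens")
    return [x / valid for x in total]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Toy input (assumption): three token positions, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
doc_embedding = l2_normalize(mean_pool(hidden, mask))
```

Masking the padding position keeps the pooled vector independent of how the batch was padded, which is one way to realize the "effective length" in the pooling formula.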

[0103] Step S3: Encode each query text using the second target model, and perform an initial retrieval based on each encoding result in the document embedding vector set (second document embedding vector set) corresponding to the second target model;

[0104] Based on the constructed unified corpus vector space, this step further establishes the initial retrieval performance baseline of the second target model in its untuned state, thereby exposing the model's current semantic confusion blind spots and providing accurate targeted data for subsequent hard sample mining. Given that the core idea of dense retrieval is to map queries and documents to the same shared semantic space for measurement, this step first obtains the query text and strictly follows the same encoding logic and preprocessing specifications as when generating the second document embedding vector set in step S2. This is because only when the query text vector and document embedding vector are generated by the same model (i.e., the second target model) based on the same instruction format and parameter distribution can the geometric distance between them truly reflect semantic relevance, avoiding systematic biases introduced by encoder heterogeneity or input format mismatch.

[0105] In practice, the query text is first augmented by prepending a predefined query task instruction (e.g., "Represent this query for retrieving relevant documents:") to it, thus constructing a complete model input sequence. Then, the second target model is invoked to perform forward inference on this sequence, extracting the last hidden state and performing mean pooling to generate the original query feature vector. To ensure the geometric validity of the similarity measure, the original query feature vector must be L2-norm normalized to obtain a final query text vector with a modulus of 1. The generation process of this query vector can be defined by the following formula:

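[0106] $$\mathbf{q} = \mathrm{Norm}\big(\mathrm{MeanPool}\big(M_2(I_q \oplus T_q)\big)\big)$$

[0107] In the formula, $\mathbf{q}$ represents the normalized query text vector; $M_2$ represents the second target model; $I_q$ is the query task instruction; $T_q$ is the original query text; $\oplus$ indicates the splicing (concatenation) operation; and $\mathrm{MeanPool}$ and $\mathrm{Norm}$ denote the mean pooling and L2 normalization described above.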

[0108] After vectorizing the query text, leveraging the efficiency of matrix operations, a full or approximate nearest neighbor search is performed on the second set of document embedding vectors generated in step S2. Specifically, the dot product (equivalent to cosine similarity under normalization conditions) between the query text vector and each document embedding vector in the set is calculated to quantify the semantic association strength between the query text and each document embedding vector. The calculation formula is as follows:

[0109] $$s_j = \mathbf{q}^{\top} \mathbf{e}_j$$

[0110] In the formula, $s_j$ represents the similarity score between the query text vector $\mathbf{q}$ and the $j$-th document embedding vector, and $\mathbf{e}_j$ is the $j$-th element of the second document embedding vector set.

[0111] Based on the calculated similarity score sequence, a descending sort operation is performed, and a predetermined number (e.g., Top-K) of documents with the highest ranking are extracted to generate an initial candidate document set. This process not only completes the conventional retrieval and recall task, but more importantly, the retrieval results at this stage accurately reflect the cognitive level of the second target model in its "factory state." Documents with high scores but that are actually irrelevant (i.e., the difficult samples to be identified in subsequent steps) represent the "false positive" errors in the model's current semantic space. Therefore, by reproducing the model's misjudgment of irrelevant documents, the most direct data support is provided for subsequent correction of semantic boundaries through contrastive learning.
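As a rough illustration of the scoring and descending sort described above, the following sketch performs an exhaustive inner-product search over a small in-memory dictionary. The function names and the toy embeddings are assumptions for illustration; a production system would more likely use an approximate nearest neighbor index.

```python
def dot(u, v):
    """Inner product; equals cosine similarity when both vectors are unit-normalized."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_embeddings, k):
    """Score every document against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending by similarity
    return scored[:k]

# Toy unit-norm embeddings (assumption).
doc_embeddings = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
initial_candidates = top_k([1.0, 0.0], doc_embeddings, k=2)
```

The returned list keeps the scores alongside the identifiers, since the later hard sample mining step needs both.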

[0112] Step S4: For each query text, based on the initial retrieval results and the positive document associations, select documents that meet the preset relevance condition but are not positive documents as hard samples, and construct contrastive learning training samples containing the query text, positive documents, and hard samples.

[0113] Following the initial candidate document set generated based on the second target model in step S3, this step aims to uncover the model's current cognitive blind spots through differential analysis, specifically identifying "hard negatives" that are very close to the query text in the vector space but do not semantically match. In existing practices, randomly sampled samples are often semantically significantly different from the query text, making them easily distinguishable by the model. This results in minimal gradient contribution during training, hindering the model's ability to learn nuanced semantic boundaries. To address this challenge, this method employs an "error-correction" mining strategy. First, it obtains the positive document associations established in step S1 and extracts the set of positive document identifiers (Ground Truth) corresponding to the current query text.

[0114] Then, the method executes a set-based filtering logic, comparing the preliminary candidate document set output in step S3 with the set of positive document identifiers for the query text. Since the preliminary candidate document set is sorted in descending order of similarity score, the documents ranked higher represent the most relevant results currently considered by the model. At this point, the core of the logical judgment lies in identifying "false positive" errors: traversing the candidate set, if a document has a high similarity score (i.e., meets the preset relevance condition, such as being in the Top-K list), but its identifier does not exist in the set of positive document identifiers, it indicates that the document belongs to the interference items misjudged by the model. Such documents are removed from the candidate set and defined as hard samples; the filtering process can be represented by the set difference formula:

[0115] $$D_{\text{hard}} = C_K \setminus P = \{\, d \mid d \in C_K,\ d \notin P \,\}$$

[0116] In the formula, $D_{\text{hard}}$ represents the set of hard samples selected; $d$ represents a document identifier; $C_K$ is the generated preliminary candidate document set containing the top $K$ most similar documents; $\setminus$ indicates the set difference operation; and $P$ is the set of positive document identifiers corresponding to the query text.

[0117] In the actual execution of set difference filtering, to prevent false-negative noise (i.e., documents that are substantially relevant but not labeled in the current Ground Truth) from disrupting the model's fine-tuning, the system introduces a similarity safety boundary constraint mechanism. Specifically, an adaptive safety threshold range is set, and only documents with similarity scores within this range are selected as hard samples. If a document's score is above the upper bound of the range (extremely similar), the document is at high risk of being a missed positive sample and is discarded to avoid misleading the model; if its score is below the lower bound, it contributes very little to the gradient, is considered a simple sample, and is not included. This interval sampling strategy ensures that the mined samples are both sufficiently difficult and sufficiently safe.
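The set difference with the safety range can be sketched as follows. The source describes the threshold range as adaptive but does not state how it is derived, so the fixed bounds `low` and `high` here are purely illustrative assumptions, as are the function name and the toy ranking.

```python
def mine_hard_negatives(ranked, positive_ids, low, high):
    """Keep Top-K documents that are not labeled positive and whose score
    lies inside the safety range [low, high]."""
    hard = []
    for doc_id, score in ranked:
        if doc_id in positive_ids:
            continue            # labeled positive, never a negative
        if score > high:
            continue            # suspiciously similar: possible unlabeled positive
        if score < low:
            continue            # too easy: negligible gradient contribution
        hard.append(doc_id)
    return hard

# Toy Top-K list sorted by descending similarity (assumption).
ranked = [("d7", 0.97), ("d1", 0.93), ("d4", 0.88), ("d9", 0.81), ("d5", 0.42)]
hard_samples = mine_hard_negatives(ranked, positive_ids={"d1"}, low=0.5, high=0.95)
```

In this toy run `d7` is dropped as a likely false negative, `d1` is the labeled positive, and `d5` is too easy, leaving `d4` and `d9`.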

[0118] Subsequently, to transform these mined difficult features into supervisory signals that the model can learn, a data recombination operation is performed to construct training samples that conform to the contrastive learning paradigm. Specifically, using the current query text as the anchor, documents are sampled from the set of positive document identifiers as positive examples, and documents are sampled from the selected set of difficult samples as negative examples, combining them to generate data with a triple structure. To enhance the robustness of training, multiple samples are typically included in a single training sample, thus forming a contrastive context that can cover multi-dimensional semantic errors. The final form of the constructed contrastive learning training samples is as follows:

[0119] $$x = \big(q,\ d^{+},\ \{d^{-}_{1}, d^{-}_{2}, \dots, d^{-}_{n}\}\big)$$

[0120] In the formula, $x$ represents a single training sample; $q$ is the query text; $d^{+}$ is the positive document; and $d^{-}_{1}, \dots, d^{-}_{n}$ are the $n$ hard samples selected from the hard sample set $D_{\text{hard}}$. Through this deductive construction process, this step transforms the model's "errors" in the initial retrieval into high-value "teaching materials" for subsequent fine-tuning. This forces the model to pay attention to the subtle semantic differences that lead to misjudgments during subsequent training, thus providing crucial data support for breaking through the performance limitations of general-purpose models.

[0121] Step S5: Use contrastive learning training samples to fine-tune the second target model, and use the fine-tuned second target model to re-encode the unified corpus to obtain the document embedding vector set corresponding to the fine-tuned second target model, that is, the optimized second document embedding vector set.

[0122] Based on the high-quality contrastive learning training samples constructed in step S4, this step utilizes this data rich in "error correction information" to perform targeted optimization of the second target model through Parameter Efficient Fine-Tuning (PEFT) technology. This significantly improves its semantic discriminative ability in a specific academic domain while preserving its general language capabilities. Given that full-scale fine-tuning of large-scale language models not only faces high memory and computational costs but also easily leads to catastrophic forgetting, causing the model to lose its basic language comprehension capabilities, this method employs a low-rank adaptation (LoRA) strategy to construct the fine-tuning architecture.

[0123] Specifically, all pre-trained backbone network parameters in the second target model are first configured to be non-trainable (i.e., the parameters remain frozen throughout training). Trainable low-rank adaptation matrices are connected in parallel only at the model's key semantic transformation nodes: the query projection layer (q_proj), key projection layer (k_proj), value projection layer (v_proj), and output projection layer (o_proj) in the attention mechanism module, and the up projection layer (up_proj), down projection layer (down_proj), and gate projection layer (gate_proj) in the feedforward network module. Mathematically, the update $\Delta W$ of a weight matrix is decomposed into the product of two low-rank matrices $B$ and $A$, i.e., $W = W_0 + \Delta W = W_0 + BA$, where $W_0$ is the frozen original weight. This ensures that the model can flexibly fit complex semantic mapping relationships in academic retrieval tasks by adjusting only a very small proportion of the parameter space.
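A small numeric sketch may help make the low-rank decomposition concrete. It uses plain nested lists instead of a deep learning framework; the helper names, the optional `scale` factor, and the 4096 x 4096, rank-16 sizing are illustrative assumptions rather than values given in the source.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w0, b, a, scale=1.0):
    """Effective weight W = W0 + scale * (B @ A); W0 itself is never modified."""
    delta = matmul(b, a)
    return [[w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]

def adapter_param_ratio(d_out, d_in, rank):
    """Trainable adapter parameters relative to the frozen weight matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)

w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (toy 2x2)
b = [[1.0], [2.0]]              # trainable, shape 2x1 (rank 1)
a = [[0.5, 0.0]]                # trainable, shape 1x2
w = lora_weight(w0, b, a)
ratio = adapter_param_ratio(4096, 4096, 16)
```

Under these assumed sizes the adapter holds under one percent of the parameters of the matrix it adapts, which is the sense in which only a very small proportion of the parameter space is adjusted.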

[0124] During the fine-tuning phase, the contrastive learning training samples generated in step S4 are input into the modified model architecture. To focus on core semantics, strict length truncation is performed on the input query text and document text (e.g., query text truncated to 32 words, candidate documents truncated to 156 words), and the hidden layer vectors corresponding to the sequence end tokens (EOS tokens) are extracted as the overall semantic representation, followed by L2 normalization. When constructing the input batch, to prevent the model from falling into mode collapse caused by purely hard samples during training, this embodiment adopts a hybrid sample sampling strategy. That is, when calculating the loss, the sample term in the denominator not only includes the hard samples selected from the above steps, but also includes positive documents corresponding to other queries in the current batch (i.e., in-batch negatives). This hybrid strategy uses in-batch samples to maintain the global anisotropy of the vector space, while using hard samples to fine-tune the local decision boundary, ensuring the stability of the fine-tuning process.

[0125] Model optimization is driven by the computation and backpropagation of the contrastive learning loss function (InfoNCE loss), which aims to maximize a lower bound on the mutual information between the query vector and the positive document vector while strongly suppressing the similarity between the query and hard samples. Its calculation formula is defined as:

[0126] $$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(\mathbf{q}, \mathbf{d}^{+}) / \tau\big)}{\exp\big(\mathrm{sim}(\mathbf{q}, \mathbf{d}^{+}) / \tau\big) + \sum_{j=1}^{N} \exp\big(\mathrm{sim}(\mathbf{q}, \mathbf{d}^{-}_{j}) / \tau\big)}$$

[0127] In the formula, $\mathcal{L}$ represents the contrastive loss value within the batch; $\mathbf{q}$ is the vector representation generated for the query text; $\mathbf{d}^{+}$ is the vector representation generated for the positive document; $\mathbf{d}^{-}_{j}$ is the vector representation generated for the $j$-th hard sample; $N$ is the total number of negative samples; $\mathrm{sim}(\cdot, \cdot)$ represents the function for calculating the dot product or cosine similarity; and $\tau$ is a preset temperature coefficient (e.g., 0.01) used to adjust the smoothness of the Softmax distribution so as to focus on hard samples. Based on the calculated gradient, the optimizer (e.g., AdamW) updates only the weight parameters of the low-rank adaptation matrices, forcing the model to reconstruct the decision boundary in the vector space and push the hard samples that confused the model in the initial retrieval away from the query vector.
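The loss for a single query can be sketched in a few lines. This is an illustrative forward computation only (no gradients), assuming unit-normalized vectors and a negatives list that already mixes mined hard samples with in-batch negatives; the function name and toy vectors are hypothetical.

```python
import math

def info_nce(query, positive, negatives, temperature=0.01):
    """InfoNCE loss for one query: negative log-softmax of the positive's logit."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability at small temperatures.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Positive clearly closest to the query: loss is near zero.
easy_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.6, 0.8]])
# A hard negative scores higher than the positive: loss is large.
hard_loss = info_nce([1.0, 0.0], [0.6, 0.8], [[1.0, 0.0]])
```

With the temperature at 0.01, a similarity gap of 0.4 in favor of the negative already yields a loss of about 40, which illustrates how a small temperature concentrates the training signal on hard samples.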

[0128] Once the fine-tuning training converges, to ensure computational efficiency in the subsequent inference phase and avoid introducing additional inference latency, a parameter merging operation is performed: the product of the trained low-rank adaptation matrices is added directly to the original backbone parameters and fused back into them. This process yields a structurally complete but enhanced merged second target model. Finally, it must be noted that because the fine-tuning operation fundamentally alters the model's encoding characteristics, causing a global shift in the geometric distribution of the original vector space, the initial vectors generated in step S2 cannot be directly reused. Instead, the merged second target model must be used to reread all document text sequences from the unified corpus and perform a full re-encoding computation. This process generates a completely new, optimized second document embedding vector set with refined semantic discriminative capabilities, which fully replaces the initial set and thus provides a high-precision vector index foundation for subsequent multi-model parallel retrieval.

[0129] Step S6: The fine-tuned second target model and the models in the first model set are used as a model group; each model in the model group is used to encode the target query, generate the corresponding query vector, and perform similarity retrieval in parallel in the document embedding vector set (each first document embedding vector set and the optimized second document embedding vector set) corresponding to each model to obtain the corresponding original similarity matrix, which is used to characterize the similarity between the target query and each candidate document in each model;

[0130] After completing the targeted fine-tuning and full-database recoding of the second target model, a parallel retrieval architecture is further constructed to combine the deep semantic understanding capabilities of a single model with the broad generalization advantages of multiple models. This addresses the issue of missed or false detections in academic question answering caused by a single semantic focus. When responding to a user's target query, a parallel multi-path vector generation process is first initiated. Given the different sensitivities of different language models to input formats, the target query must be standardized and encapsulated strictly according to the pre-defined instruction specifications of each model.

[0131] Specifically, for the models in the first model set that maintain their pre-trained state (including models such as NV-Embed and Linq-Embed that focus on general instruction following or retrieval matching), their original weight parameters are invoked to convert the target query into a query vector adapted to their respective semantic spaces. For the second target model (i.e., the model with domain-specific discrimination capabilities) after fine-tuning in step S5, its updated weights, which incorporate low-rank adaptation parameters, are invoked to generate a dedicated query vector capable of accurately identifying difficult semantic boundaries. To ensure the geometric consistency of the vector metrics, all generated query vectors must undergo L2 norm normalization, the calculation logic of which is as follows:

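[0132] $$\mathbf{q}_m = \frac{E_m(I_m \oplus T)}{\lVert E_m(I_m \oplus T) \rVert_2}$$

[0133] In the formula, $\mathbf{q}_m$ indicates the normalized query vector generated by the $m$-th model; $E_m$ indicates the corresponding encoder network; $I_m$ is that model's query instruction template; $T$ is the target query text; and $\oplus$ denotes concatenation.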

[0134] After the query vectors are generated, the parallel retrieval phase begins. At this point, the system memory maintains multiple sets of vector indexes with distinct characteristics: one is the optimized second document embedding vector set index generated by the fine-tuned second target model, representing refined domain features; the others are the first document embedding vector set indexes generated by the models in the first model set, representing general semantic features (each model in the first model set corresponds to one first document embedding vector set index). The query vector generated by each model is then mapped to its corresponding index space, and the similarity scores between the query and a massive number of documents are efficiently calculated using vector inner product operations. Since the vectors are normalized, the inner product value is directly equivalent to the cosine similarity, which accurately measures the size of the semantic angle. For each model channel, the documents with the highest similarity scores are independently extracted to form a second candidate document set, and for the documents in this set an original similarity matrix is generated, containing each document index and its corresponding score (the similarity score between the target query and the document). This parallel retrieval process not only uses the fine-tuned model to correct misjudgments for specific terms, but also uses the general models to prevent semantic narrowing caused by overfitting. The multiple sets of output results can be formally represented as follows:

[0135] $$R_m = \{(d_{m,j},\ s_{m,j})\}_{j=1}^{K} = \operatorname{TopK}_{\mathbf{e} \in D_m}\big(\mathbf{q}_m^{\top} \mathbf{e}\big)$$

[0136] In the formula, $R_m$ indicates the original similarity matrix output by the $m$-th model; $\mathbf{q}_m$ is the query vector generated by that model; $D_m$ is the set of document embedding vectors corresponding to this model; $K$ is the number of documents retrieved; and $d_{m,j}$ and $s_{m,j}$ are the retrieved document identifiers and their similarity scores. Through this parallel processing logic, the complementary advantages of multi-source heterogeneous models are transformed into specific retrieval result data, providing a decision-making basis that combines accuracy and robustness for the final weighted fusion.

[0137] Step S7: Perform adaptive denoising and normalization on each of the original similarity matrices to obtain the similarity scores of each model for each candidate document; determine the weights of each model according to the weight strategy; sum the similarity scores of each candidate document in each model with the weights of each model to obtain the corresponding fusion score; filter documents according to the fusion scores of each candidate document to generate the first target document list for the target query.

[0138] Based on multiple sets of original similarity matrices output by parallel retrieval, this step further addresses the issues of inconsistent scoring dimensions among heterogeneous models and interference from long-tail retrieval noise. Through an adaptive truncation and normalization fusion mechanism, a final high-confidence target document list is output. Given the significant differences in training objectives and metric spaces between different language models (such as the general model in the first model set and the fine-tuned second target model), their output similarity scores often lack direct comparability in numerical distribution (e.g., some models' scores are concentrated between 0.9 and 1.0, while others may be distributed between 0.5 and 0.8). Furthermore, a fixed Top-K truncation strategy easily introduces a large number of low-relevance tail documents as noise. Therefore, this step first introduces adaptive truncation logic based on similarity gradients to "de-noise" each set of retrieval results.

[0139] Specifically, the original similarity matrix output by any model in the model group is obtained, and its elements (similarity scores) are arranged in descending order to form a score sequence. To identify the abrupt semantic change from "high relevance" to "low relevance," discrete differencing is performed on this sequence. First, the first difference of the score sequence is calculated to generate the first difference sequence, which represents the gradient of score descent. Then, the first difference sequence is differenced again to generate the second difference sequence, which represents the rate of change of the gradient (i.e., the curvature of the curve). Its mathematical expression is:

[0140] $$\Delta^{2}_{i} = s_{i+1} - 2 s_{i} + s_{i-1}$$

[0141] In the formula, $\Delta^{2}_{i}$ indicates the second-order difference value at the $i$-th position, and $s_i$ is the similarity score at the $i$-th position after sorting. The second difference sequence is traversed, and the index position corresponding to the element with the largest value is identified. This position is the semantic relevance "elbow point" and is marked as the valid cutoff point. Based on this, the candidate document list in the second candidate document set is dynamically truncated, retaining only documents ranked before the valid cutoff point to construct a third candidate document set for subsequent calculations, thus eliminating long-tail noise at the source. For certain special distributions of similarity sequences (e.g., extremely flat scores resulting in no obvious peak in the second difference), a double-safety truncation mechanism is further implemented: first, a minimum retention constraint forcibly retains a preset number of top-ranked documents (e.g., Top-5) regardless of the elbow point calculation result, to prevent all relevant documents from being filtered out due to excessive truncation; second, a maximum candidate number constraint ensures that, even if the score decreases slowly, the truncation position does not exceed a preset upper limit (e.g., Top-100), which prevents too many noisy documents from entering subsequent fusion calculations. This strategy, which prioritizes dynamic calculation and supplements it with static constraints, ensures robustness under various extreme data distributions.
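The elbow search with the two safety constraints can be sketched as below. It assumes a central second difference and 0-based positions (so an elbow at index `i` keeps `i` documents); the function name is hypothetical, and the defaults of 5 and 100 simply mirror the examples given in the text.

```python
def adaptive_cutoff(sorted_scores, min_keep=5, max_keep=100):
    """Return how many top documents to keep from a descending score list."""
    n = len(sorted_scores)
    elbow = n                      # fallback: no second difference can be computed
    best = float("-inf")
    for i in range(1, n - 1):
        second_diff = sorted_scores[i + 1] - 2 * sorted_scores[i] + sorted_scores[i - 1]
        if second_diff > best:     # largest second difference marks the elbow
            best, elbow = second_diff, i
    keep = max(elbow, min_keep)    # minimum retention constraint
    return min(keep, max_keep, n)  # maximum candidate number constraint

# Seven closely scored documents followed by a sharp drop (toy data, assumption).
cliff = [0.95, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89, 0.55, 0.54, 0.53]
kept_cliff = adaptive_cutoff(cliff)
# Perfectly flat scores: no peak exists, so the minimum retention rule decides.
kept_flat = adaptive_cutoff([0.8] * 10)
```

On the first list the largest second difference falls on the first low-scoring document, so exactly the seven documents before it are retained.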

[0142] After adaptive truncation, row-level normalization and sparse tensor construction are performed to achieve mathematical alignment of multi-model scores. First, a global sparse tensor of dimension $N_q \times N_d \times M$ is initialized, in which $N_q$ is the number of queries, $N_d$ is the total number of documents in the unified corpus, and $M$ represents the total number of models participating in the fusion. For each truncated result set, row-level normalization (usually Min-Max normalization) is performed, linearly mapping the similarity scores corresponding to each target query to a preset numerical range (e.g., [0, 1]) to eliminate dimensional differences. The calculation formula is as follows:

[0143] $$\hat{s}_{m,j} = \frac{s_{m,j} - \min(S_m)}{\max(S_m) - \min(S_m)}$$

[0144] In the formula, $\hat{s}_{m,j}$ is the normalized similarity score, i.e., the similarity score of the $m$-th model for the $j$-th candidate document; $s_{m,j}$ is the corresponding score before normalization; and $S_m$ is the set of all adaptively truncated similarity scores of the $m$-th model for the target query.

[0145] Subsequently, the normalized scores are filled into the corresponding coordinates of the global sparse tensor. For document locations missing due to truncation or failure to be retrieved, zero values are explicitly filled in. This operation achieves union alignment of multi-way retrieval results. Finally, based on the preset model weight vector (usually giving higher weight to the fine-tuned second target model), the global sparse tensor is weighted and summed along the model dimension to generate the final similarity score matrix.
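For a single query, the normalization, zero filling, and weighted sum can be sketched with dictionaries standing in for one slice of the sparse tensor. The function names, the toy scores, the preset weights, and the choice to map a constant score list to 1.0 are all assumptions made for illustration.

```python
def min_max(scores):
    """Linearly map one model's truncated scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}  # degenerate case (assumption)
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse(per_model_scores, weights):
    """Weighted sum over models; a document absent from a model contributes zero."""
    fused = {}
    for scores, weight in zip(per_model_scores, weights):
        for doc_id, s in min_max(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * s
    return fused

model_a = {"d1": 0.9, "d2": 0.7, "d3": 0.5}   # truncated results of one model
model_b = {"d2": 0.8, "d4": 0.5}              # truncated results of another model
fused = fuse([model_a, model_b], weights=[0.4, 0.6])
ranking = sorted(fused, key=fused.get, reverse=True)
```

Here `d2` ranks first because both models retrieved it, while documents seen by only one model receive an implicit zero from the other, matching the union alignment described above.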

[0146] Regarding the determination of the "model weight vector": traditional static preset weights cannot adaptively reflect the confidence differences of heterogeneous models under different queries (for example, although the fine-tuned second target model performs well within its domain, it may hallucinate when facing cross-disciplinary or extremely rare queries, resulting in a flattened retrieval score distribution).

[0147] To overcome this shortcoming, this application further introduces a dynamic confidence weighting mechanism based on retrieval entropy to construct the weight vector in real time. This mechanism is based on an information-theoretic principle: a retrieval system with high confidence often shows significant discrimination for correct answers (steep score distribution, low information entropy), while a retrieval system with low confidence tends to show convergence in the scores of candidate documents (flat score distribution, high information entropy).

[0148] Specifically, before performing the weighted summation, for each model, the similarity scores of the candidate documents in its corresponding second candidate document set are converted into probability values (i.e., the similarity scores in the original similarity matrix corresponding to the model are converted into a probability distribution) to measure the certainty of the model's prediction for the current target query. To avoid the influence of negative values and ensure probability normalization, the Softmax function or linear normalization is typically used to map the original scores to probability values. Subsequently, the retrieval entropy value of the model under the target query is calculated using the Shannon entropy formula. The calculation logic is as follows:

[0149] $$H_m = -\sum_{d_j \in C_m} p_{m,j} \log p_{m,j}$$

[0150] In the formula, $H_m$ indicates the retrieval entropy of the $m$-th model for the target query; $C_m$ is the second candidate document set of that model; and $p_{m,j}$ is the probability value of candidate document $d_j$ in model $m$.

[0151] Based on the calculated retrieval entropy, a dynamic weight coefficient is constructed using a negative correlation mapping function. This aims to assign higher decision weights to low-entropy (high-confidence) models while suppressing noise interference from high-entropy (low-confidence) models. To ensure the relative fairness of the weights across multiple models, the negative exponential form of the entropy value is typically normalized using Softmax to generate the final dynamic model weight vector.

[0152] $$w_m = \frac{\exp(-\beta H_m)}{\sum_{k=1}^{M} \exp(-\beta H_k)}$$

[0153] In the formula, $w_m$ is the dynamically generated weight of the $m$-th model; $H_m$ is the retrieval entropy of that model for the target query; $\beta$ is the preset sensitivity adjustment factor (usually …); and $M$ represents the total number of models. Finally, when generating the final score matrix, this dynamically generated weight vector replaces the statically preset weights, and an adaptive weighted fusion calculation is performed:

[0154] $$F_j = \sum_{m=1}^{M} w_m \, \hat{s}_{m,j}$$

[0155] In the formula, $F_j$ indicates the fusion score of candidate document $d_j$ for the target query, $w_m$ is the dynamic weight of the $m$-th model, and $\hat{s}_{m,j}$ is that model's normalized similarity score for $d_j$. Through this dynamic weighting, the method moves from static fusion based on empirical presets to adaptive, data-driven fusion. This ensures that the weight of each model is dynamically allocated based on its actual performance on the specific query, thereby improving the robustness and accuracy of the final retrieval list.
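The entropy-based weighting can be illustrated with a short sketch that uses Softmax to turn scores into probabilities and again to turn negated entropies into weights. The function names, the default sensitivity factor of 1.0, and the toy score lists are assumptions, not values taken from the source.

```python
import math

def softmax(values):
    """Numerically stable Softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_entropy(scores):
    """Shannon entropy of the Softmax distribution over one model's candidate scores."""
    return -sum(p * math.log(p) for p in softmax(scores) if p > 0.0)

def entropy_weights(entropies, beta=1.0):
    """Lower entropy (higher confidence) receives a larger normalized weight."""
    return softmax([-beta * h for h in entropies])

peaked = [0.95, 0.40, 0.35, 0.30]   # one clear winner: confident model
flat = [0.71, 0.70, 0.70, 0.69]     # near-uniform scores: uncertain model
entropies = [retrieval_entropy(peaked), retrieval_entropy(flat)]
weights = entropy_weights(entropies)
# Fusion score of one document given its normalized per-model scores.
fusion_score = sum(w * s for w, s in zip(weights, [1.0, 0.2]))
```

Because raw cosine scores span a narrow range, the entropy gap between the two toy models is small; a larger sensitivity factor would widen the resulting weight difference.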

[0156] Based on the final fusion score, all candidate documents are globally sorted in descending order, and a list of first target documents matching the target query is determined. There are many ways to determine the match, such as selecting the top K documents with high fusion scores as matching documents. In the end, a high-precision retrieval output combining the advantages of multiple models is achieved.

[0157] The final fusion score is calculated using the above formula. Subsequently, in order to address the common problem of high relevance but low information increment in academic retrieval (i.e., the content of the top-scoring documents is highly repetitive), this embodiment further introduces a diversity calibration step based on entropy feedback to perform secondary screening of candidate documents in the first target document list to generate a second target document list.

[0158] First, the weighted average of the retrieval entropy values of the target query across all models in the model group is taken as the retrieval entropy value corresponding to the target query.

[0159] Secondly, the retrieval entropy value is converted into a diversity adjustment coefficient using a preset mapping function. The mapping function uses an inverse sigmoid function:

[0160] $$\lambda = \frac{1}{1 + \exp\big(k(\bar{H} - H_0)\big)}$$

[0161] In the formula, $\lambda$ is the diversity adjustment coefficient, $\bar{H}$ is the retrieval entropy value corresponding to the target query, $H_0$ is the center position of the entropy value, and $k$ is a hyperparameter ($k > 0$). The function establishes a negative correlation: when the retrieval entropy value $\bar{H}$ is higher (low model confidence), $\lambda$ decreases and the system tends to display more diverse results; when the retrieval entropy value $\bar{H}$ is lower (high model confidence), $\lambda$ increases and the system tends toward exact matching.

[0162] Based on the calculated diversity adjustment coefficient, combined with the existing fusion scores, the maximal marginal relevance (MMR) algorithm is used to calculate the maximal marginal relevance score of each candidate document. Candidate documents are then iteratively selected from high to low maximal marginal relevance score and added to the second target document list until a preset number of documents is reached, thereby generating the final second target document list.

[0163] The screening formula is defined as follows:

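[0164] $$\mathrm{MMR}(d_j) = \lambda \cdot F_j - (1 - \lambda) \cdot \max_{d_k \in S} \mathrm{sim}(d_j, d_k)$$

[0165] In the formula:

[0166] $\mathrm{MMR}(d_j)$ indicates the maximal marginal relevance score of candidate document $d_j$;

[0167] $F_j$ is the fusion score calculated in step S7, representing the relevance gain;

[0168] $\max_{d_k \in S} \mathrm{sim}(d_j, d_k)$ indicates the semantic similarity between the current candidate document $d_j$ and the most similar document among those already selected into the second target document list $S$, representing a redundancy penalty; $\lambda$ is the diversity adjustment coefficient.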
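The entropy-driven diversity screening can be sketched as follows. The function names, the midpoint and steepness parameters, and the toy pairwise similarities are illustrative assumptions; the greedy loop is the standard MMR selection procedure.

```python
import math

def diversity_coefficient(entropy, midpoint, steepness):
    """Inverse sigmoid: higher retrieval entropy gives a smaller relevance weight."""
    return 1.0 / (1.0 + math.exp(steepness * (entropy - midpoint)))

def mmr_select(fusion_scores, similarity, lam, k):
    """Greedily pick k documents, trading relevance against redundancy."""
    selected = []
    remaining = sorted(fusion_scores)  # sorted ids give deterministic tie-breaking

    def mmr_score(doc_id):
        # Redundancy is the similarity to the closest already-selected document.
        redundancy = max((similarity(doc_id, s) for s in selected), default=0.0)
        return lam * fusion_scores[doc_id] - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

fusion_scores = {"a": 0.90, "b": 0.88, "c": 0.60}
pair_sim = {frozenset(("a", "b")): 0.95,   # "a" and "b" are near-duplicates
            frozenset(("a", "c")): 0.10,
            frozenset(("b", "c")): 0.10}

def similarity(x, y):
    return pair_sim[frozenset((x, y))]

diverse = mmr_select(fusion_scores, similarity, lam=0.5, k=2)
exact = mmr_select(fusion_scores, similarity, lam=1.0, k=2)
```

With the coefficient at 0.5 the near-duplicate `b` is skipped in favor of `c`, whereas a coefficient of 1.0 reduces the selection to plain ranking by fusion score.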

[0169] In summary, through the cascaded processing of S1 to S7, the semantic decision boundary of a specific academic domain is reconstructed at the micro level by utilizing hard sample mining and LoRA fine-tuning; at the macro level, the adaptive balance between general semantics and domain-specific semantics is achieved by using heterogeneous model parallelism and dynamic entropy weighting. This combination of micro-level error correction and macro-level complementarity effectively solves the semantic illusion and cognitive narrowing problems commonly found in existing technologies in academic question answering.

[0170] Considering the rigorous citation networks among academic documents, simple semantic similarity retrieval may recall documents that are similar in content but are academically isolated. In order to further enhance the academic authority and relevance of the retrieval results, some embodiments introduce graph computing concepts for secondary verification.

[0171] First, the metadata of each candidate document in the first target document list generated in step S7 is obtained, and its reference list and citation list are parsed. Using the candidate documents in the first target document list as nodes and the citation relationships as directed edges, a local citation topology graph for the target query is constructed.

[0172] Next, in this local citation topology graph, the weighted in-degree centrality of each node (document) is calculated. A higher weighted in-degree centrality indicates that the candidate document occupies a more central academic position within the currently retrieved document cluster, and that it is more likely to be the correct answer or key evidence. Specifically, in some embodiments, the fusion score of each candidate document in the first target document list is used as the activation energy of the corresponding node. If candidate document A cites candidate document B, then document B is considered to be the theoretical basis or logical antecedent of document A, and the activation energy of candidate document A is attenuated by a preset decay coefficient (…) and passed along the citation edge to candidate document B. The weighted in-degree centrality is calculated as follows:

[0173]

[0174] In the formula, the terms denote: the weighted in-degree centrality of a node; the fusion score of the candidate document corresponding to each out-neighbor node of that node; the set of out-neighbor nodes of that node; and the attenuation coefficient.

[0175] This metric characterizes the extent to which document B constitutes the logical root node of the relevant documents for the current query. A higher score indicates that the document is more likely to be the core theoretical source for solving the current problem, rather than merely a superficially relevant application case.
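By way of non-limiting illustration, the energy propagation described above may be sketched as follows. The formula is not reproduced in this text; this sketch follows the narrative of the embodiment, in which each citing document passes its attenuated fusion score to the document it cites. The names `weighted_in_degree`, `citations`, and `decay` are assumptions, and edges touching documents outside the candidate list are simply ignored.

```python
def weighted_in_degree(citations, fusion, decay):
    """Weighted in-degree centrality on a local citation graph (assumed form).

    citations: iterable of (citing_id, cited_id) pairs
    fusion:    dict mapping candidate document id -> fusion score
    decay:     preset attenuation coefficient applied to the passed energy
    """
    centrality = {doc: 0.0 for doc in fusion}
    for citing, cited in citations:
        # Keep the graph local: both endpoints must be candidate documents.
        if citing in fusion and cited in fusion:
            centrality[cited] += decay * fusion[citing]
    return centrality
```

A document that is cited by several high-scoring candidates accumulates more energy than one cited by a single low-scoring candidate, which is the property the metric is meant to capture.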

[0176] The weighted in-degree centrality of all nodes is normalized to generate the topological weight of the candidate document corresponding to each node. Using a preset adjustment factor (for example, …), the topological weights are linearly mixed with the fusion score obtained in step S7 to obtain the re-ranking score:

[0177]

[0178] The first target document list is reordered according to the final re-ranking score, yielding the final target document list. This mechanism effectively improves the ranking of classic documents or key reviews that both meet the semantic requirements and are central to the academic citation network.
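By way of non-limiting illustration, the normalization and linear mixing may be sketched as follows. Neither the normalization scheme nor the exact mixing formula is reproduced in this text, so division by the maximum centrality and the convex combination used here are assumptions, as are the names `rerank` and `beta`.

```python
def rerank(fusion, centrality, beta):
    """Mix fusion scores with normalized topological weights (assumed form).

    fusion:     dict mapping document id -> fusion score
    centrality: dict mapping document id -> weighted in-degree centrality
    beta:       preset adjustment factor in [0, 1]
    Returns (ordered document ids, dict of re-ranking scores).
    """
    peak = max(centrality.values(), default=0.0)
    # Max-normalization to [0, 1]; an all-zero graph yields zero topological weight.
    topo = {d: (c / peak if peak > 0 else 0.0) for d, c in centrality.items()}
    scores = {d: (1.0 - beta) * fusion[d] + beta * topo.get(d, 0.0) for d in fusion}
    order = sorted(scores, key=lambda d: (-scores[d], d))
    return order, scores
```

Setting the adjustment factor to zero leaves the original fusion ranking unchanged, which provides a simple way to check the effect of the citation signal.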

[0179] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to any specific implementation. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A multi-model paper retrieval method for academic question answering, characterized in that it comprises the following steps: Step S1: Construct a unified corpus containing candidate documents and a training dataset containing query texts and their positive document associations; Step S2: Preset at least two models, one of which is the second target model, while the remaining models form a first model set, the first model set containing at least one language model that is different from the second target model; encode the unified corpus using each model to generate a corresponding set of document embedding vectors; Step S3: Encode each query text using the second target model, and perform an initial retrieval in the document embedding vector set corresponding to the second target model based on each encoding result; Step S4: For each query text, based on the results of the initial retrieval and the positive document association, select documents that meet the preset relevance conditions but are not positive documents as difficult samples corresponding to the query text, and construct contrastive learning training samples containing the query text, positive documents, and difficult samples; Step S5: Use the contrastive learning training samples to perform contrastive learning fine-tuning on the second target model, and use the fine-tuned second target model to re-encode the unified corpus to obtain the document embedding vector set corresponding to the fine-tuned second target model; Step S6: Use the fine-tuned second target model and the models in the first model set as a model group; use each model in the model group to encode the target query and generate the corresponding query vector, and perform similarity retrieval in parallel in the document embedding vector set corresponding to each model to obtain the corresponding original similarity matrix, which characterizes the similarity between the target query and each candidate document under each model; Step S7: Perform adaptive denoising and normalization on each of the original similarity matrices to obtain the similarity score of each model in the model group for each candidate document; determine the weight of each model in the model group according to a weighting strategy; compute the weighted sum of the similarity scores of each candidate document under each model in the model group, using the weights of the respective models, to obtain the corresponding fusion score; and filter documents according to the fusion score of each candidate document to generate the first target document list for the target query.

2. The multi-model paper retrieval method for academic question answering according to claim 1, characterized in that: In step S1, when constructing the unified corpus, the following operations are performed on each candidate document: S11. Extract the title text field and the abstract text field; if the abstract is missing, extract information from the introduction of the main text to serve as the abstract; S12. Clean each extracted text field separately to increase the density of effective information; S13. Concatenate the cleaned title text field and abstract text field to generate a document text sequence that represents the core content of the document.

3. The multi-model paper retrieval method for academic question answering according to claim 2, characterized in that: In step S1, the construction of the unified corpus also includes the following operations: S14. Pre-construct an instruction prompt dictionary containing multiple types of academic intent, the instruction prompt dictionary including an instruction template string corresponding to each type of academic intent, and the types of academic intent including methodology-oriented, experimental-result-oriented, and review-oriented; S15. For each candidate document, identify its academic intent based on the abstract, and prepend the academic intent to the document text sequence that represents the core content of the document to generate a model input sequence containing intent semantics, which is used to represent the candidate document.

4. The multi-model paper retrieval method for academic question answering according to claim 3, characterized in that: In step S2, encoding the unified corpus includes the following operations: S21. Tokenize the model input sequence of each candidate document to obtain the corresponding token identifier sequence; input each token identifier sequence into the model and perform a forward pass to obtain the corresponding sequence of last-layer hidden state tensors containing rich semantic information; S22. Using a mean pooling strategy, aggregate the sequence of last-layer hidden state tensors corresponding to each candidate document into a vector of fixed dimension; S23. Perform L2 norm normalization on the fixed-dimensional vector obtained in step S22 to obtain the document embedding vector of each candidate document.
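By way of non-limiting illustration, operations S22 and S23 may be sketched in plain Python as follows. The use of an attention mask to exclude padding positions is an assumption not stated in the claim, and the names `mean_pool` and `l2_normalize` are hypothetical.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average last-layer hidden states over non-padding tokens.

    hidden_states:  list of per-token vectors (lists of floats)
    attention_mask: list of 0/1 flags, 1 for real tokens (assumption)
    """
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    return [sum(h[j] for h in kept) / len(kept) for j in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)
```

After L2 normalization, the dot product of two embedding vectors equals their cosine similarity, so either measure yields the same ranking.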

5. The multi-model paper retrieval method for academic question answering according to claim 1, characterized in that: In step S5, the contrastive learning fine-tuning includes: S51. Keep the backbone network parameters of the second target model frozen during the fine-tuning process; S52. Connect trainable low-rank adaptation matrices in parallel to the query projection layer, key projection layer, value projection layer, and output projection layer in the attention mechanism module of the second target model, and to the up-projection layer, down-projection layer, and gate projection layer in the feedforward network module; S53. Calculate the gradient of the loss function based on the contrastive learning training samples, and update only the weight parameters of the low-rank adaptation matrices using the gradient of the loss function. The loss function is calculated as follows: In the formula, the terms denote: the in-batch contrastive loss value; the vector representation generated for the query text; the vector representation generated for the positive document; the vector representation generated for each difficult sample; the total number of samples; the function for calculating the dot product or cosine similarity; and the preset temperature coefficient.
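By way of non-limiting illustration, a contrastive loss of the kind described in S53 may be sketched for a single query as follows. The loss formula is not reproduced in this text, so the InfoNCE form shown here (one positive document against the difficult samples, dot-product similarity, temperature scaling) is an assumption, and the names `dot` and `contrastive_loss` are hypothetical.

```python
import math


def dot(a, b):
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query, positive, negatives, temperature):
    """InfoNCE-style loss for one query (assumed form).

    query, positive: embedding vectors
    negatives:       list of embedding vectors of difficult samples
    temperature:     preset temperature coefficient (> 0)
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]
```

The loss shrinks as the positive document scores higher than the difficult samples, and a difficult sample that is close to the query raises it, which is what drives the fine-tuning toward sharper decision boundaries.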

6. The multi-model paper retrieval method for academic question answering according to claim 5, characterized in that: In step S5, re-encoding the unified corpus using the fine-tuned second target model includes: S55. Numerically merge the weight parameters of the updated low-rank adaptation matrices with the backbone network parameters to generate the merged second target model; S56. Encode the unified corpus again using the merged second target model to generate an optimized second document embedding vector set.

7. The multi-model paper retrieval method for academic question answering according to claim 1, characterized in that: In step S6, after each model in the model group performs similarity retrieval on the query vector, it extracts the top K candidate documents with the highest similarity to construct a corresponding second candidate document set. The original similarity matrix is used to characterize the similarity between the target query and each candidate document in the second candidate document set corresponding to the model.

8. The multi-model paper retrieval method for academic question answering according to claim 7, characterized in that: Step S7 includes: S71. Determine the similarity score of each model in the model group for each candidate document: for each original similarity matrix, arrange the elements of the original similarity matrix in descending order to generate a score sequence, calculate the second difference sequence of the score sequence, and take the position corresponding to the largest element in the second difference sequence as the cutoff point; construct a third candidate document set from the candidate documents before the cutoff point in the score sequence, and normalize the similarity score of each candidate document in the third candidate document set to serve as the score assigned to each candidate document by the model corresponding to the original similarity matrix; S72. Determine the dynamic weight of each model in the model group: for each model, convert the similarity scores of the candidate documents in the second candidate document set into probability values, and obtain the retrieval entropy value of the model based on the probability values; then determine the dynamic weights based on the retrieval entropy values of the models, calculated as follows: In the formula, the terms denote: the dynamic weight of each model; the preset sensitivity adjustment factor; the total number of models; the retrieval entropy of each model for the target query; the second candidate document set; and the probability value of each candidate document under the model; S73. Compute the weighted sum of the similarity scores of each candidate document under each model in the model group, using the dynamic weights of the respective models, to obtain the corresponding fusion score. In the formula, the terms denote: the fusion score of each candidate document for the target query; and the similarity score assigned by each model to each candidate document; S74. Filter documents based on the fusion score of each candidate document to generate the first target document list for the target query.
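By way of non-limiting illustration, the cutoff in S71 and the retrieval entropy in S72 may be sketched as follows. The alignment between positions in the second difference sequence and positions in the score sequence is an assumption, as is the use of a softmax to convert scores into probability values; the names `knee_cutoff` and `retrieval_entropy` are hypothetical, and the subsequent normalization and weighting steps are not sketched.

```python
import math


def knee_cutoff(scores):
    """Keep the descending scores that precede the largest second difference."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 3:
        return ranked
    second = [ranked[i] - 2 * ranked[i + 1] + ranked[i + 2]
              for i in range(len(ranked) - 2)]
    # second[i] is centred on ranked[i + 1]; that position is taken as the
    # cutoff point and only the documents before it are kept (assumption).
    cut = second.index(max(second)) + 1
    return ranked[:cut]


def retrieval_entropy(scores):
    """Shannon entropy of the softmax distribution over similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps)
```

A model whose scores are nearly uniform yields a high entropy (low confidence), while a model with one dominant score yields a low entropy, which is the signal the dynamic weighting relies on.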

9. The multi-model paper retrieval method for academic question answering according to claim 8, characterized in that: It also includes performing a secondary filtering of the candidate documents in the first target document list to generate a second target document list, specifically including: S81. Take the weighted average of the retrieval entropy values of the target query under each model of the model group as the retrieval entropy value corresponding to the target query; S82. Convert the retrieval entropy value corresponding to the target query into a diversity adjustment coefficient, calculated as follows: In the formula, the terms denote: the diversity adjustment coefficient; the midpoint of the retrieval entropy value; and a hyperparameter; S83. Based on the diversity adjustment coefficient and the fusion score of each candidate document in the first target document list, calculate the maximum marginal relevance score of each candidate document using the maximum marginal relevance algorithm; iteratively select candidate documents from high to low according to the maximum marginal relevance score and add them to the second target document list until the preset number of documents is reached, thereby generating the final second target document list. The formula for calculating the maximum marginal relevance score is as follows: In the formula, the terms denote: the maximum marginal relevance score of the candidate document; and the semantic similarity between the current candidate document and the most similar document already selected into the second target document list, representing a redundancy penalty.

10. The multi-model paper retrieval method for academic question answering according to claim 8, characterized in that: It also includes introducing graph computation concepts to perform a secondary verification of the first target document list, specifically including: S91. Obtain the reference list and citation list of each candidate document in the first target document list, and establish a local citation topology graph with candidate documents as nodes and citation relationships as directed edges; S92. In the local citation topology graph, use the fusion score of each candidate document in the first target document list as the activation energy of the corresponding node, and calculate the weighted in-degree centrality of each node as follows: In the formula, the terms denote: the weighted in-degree centrality of a node; the fusion score of the candidate document corresponding to each out-neighbor node of that node; the set of out-neighbor nodes of that node; and the attenuation coefficient; S93. Normalize the weighted in-degree centrality of all nodes to generate the topological weight of the candidate document corresponding to each node, and combine it with the fusion score of that candidate document to calculate the re-ranking score, as shown in the following formula: In the formula, the terms denote: the re-ranking score of the candidate document; the topological weight of the candidate document; and the adjustment factor; S94. Reorder the first target document list according to the final re-ranking score, and output the final target document list.