A medical knowledge enhancement model training method, device and computer equipment

By constructing high-quality medical corpus data and extending the Transformer layer of a general pre-trained model, and combining it with KL-restricted loss function training, a medical knowledge enhancement model is generated. This solves the accuracy and reliability problems of large-scale language models in the medical field and enables efficient application of medical knowledge.

CN122196190A · Pending · Publication Date: 2026-06-12 · BAICHUAN INTELLIGENT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BAICHUAN INTELLIGENT TECHNOLOGY CO LTD
Filing Date
2026-01-26
Publication Date
2026-06-12

Abstract

Embodiments of the present application provide a medical knowledge enhanced model training method and device and a computer device. The method comprises: constructing medical corpus data; based on the Upcycling technology, expanding the original Transformer layer of the obtained general pre-training model to construct a medical continued pre-training model with larger parameter quantity, the general pre-training model comprising N original Transformer layers, N>3; based on a KL restriction loss function, training the medical corpus data on the medical continued pre-training model to generate a medical knowledge enhanced model. In the technical solution provided by the embodiments of the present application, the model capacity can be expanded to accommodate medical knowledge, and the training stability can be ensured based on high-quality medical corpus data and the KL restriction loss function, which can improve the accuracy, reliability and robustness of the large model in the medical scene, and improve the credibility of the model output result.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a training method, apparatus, and computer equipment for a medical knowledge enhancement model.

Background Technology

[0002] In recent years, large language models (LLMs) have made groundbreaking progress in the field of natural language processing. However, the application of general-purpose large models in professional fields such as medicine still faces many challenges. Tasks such as medical question answering and clinical decision support require models to have in-depth medical expertise, but general-purpose large models are usually trained on general internet corpora that do not cover enough medical knowledge. The resulting accuracy is low, which makes it difficult to meet the accuracy and reliability requirements placed on large models in medical scenarios.

Summary of the Invention

[0003] In view of this, embodiments of the present invention provide a training method, apparatus, and computer device for a medical knowledge enhancement model, which can improve the accuracy, reliability, and robustness of large models in medical scenarios and enhance the credibility of model output results.

[0004] In one aspect, embodiments of the present invention provide a training method for a medical knowledge enhancement model, including: constructing medical corpus data; extending, based on the Upcycling technique, the original Transformer layers of an obtained general pre-trained model to construct a medical continued pre-training model with a larger number of parameters, the general pre-trained model including N original Transformer layers, where N>3; and training the medical continued pre-training model on the medical corpus data based on a KL-restricted loss function to generate a medical knowledge enhancement model.

[0005] Optionally, the original Transformer layer includes a multi-head self-attention module and a feedforward network module. Extending the original Transformer layers of the obtained general pre-trained model based on the Upcycling technique to construct a medical continued pre-training model with a larger number of parameters includes: while keeping the parameters of the multi-head self-attention module and the feedforward network module in the original Transformer layers unchanged, adding a new Transformer layer among the original Transformer layers by copying one or more intermediate layers of the original Transformer layers; performing a warm start on the new Transformer layer to initialize it; and constructing the medical continued pre-training model based on the initialized new Transformer layer and the original Transformer layers.

[0006] Optionally, the warm-start process for the new Transformer layer includes: The parameters of the new Transformer layer are initialized by copying the parameters of one or more original Transformer layers corresponding to the position in the general pre-trained model and adding random perturbations.

[0007] Optionally, before training the medical continued pre-training model on the medical corpus data based on the KL-restricted loss function to generate the medical knowledge enhancement model, the method further includes: determining a conventional language model loss, a weight coefficient, and a KL divergence penalty loss; and constructing the KL-restricted loss function based on the conventional language model loss, the weight coefficient, and the KL divergence penalty loss. The conventional language model loss drives the medical continued pre-training model to learn and fit the medical corpus data, the weight coefficient adjusts the weight of the KL divergence penalty loss within the KL-restricted loss function, and the KL divergence penalty loss is used to perform self-constrained training on the medical continued pre-training model.

[0008] Optionally, determining the conventional language model loss includes: inputting training samples of the medical corpus data into the medical continued pre-training model to obtain a second conditional probability distribution output by that model; calculating a cross-entropy loss based on the second conditional probability distribution and the true labels corresponding to the training samples to generate the conventional language model loss; and, with the goal of minimizing the conventional language model loss, adjusting the parameters of the medical continued pre-training model through a backpropagation algorithm so that the second conditional probability distribution concentrates on the true labels, thereby determining the conventional language model loss.

[0009] Optionally, determining the weight coefficient includes: constructing a candidate set containing multiple candidate values of the weight coefficient; validating each candidate value using a reserved validation set to generate a validation result, the validation set including task samples from the medical corpus data and task samples from the general corpus data; during validation, monitoring the changing trends of the conventional language model loss and the KL divergence penalty loss; and selecting the weight coefficient from the candidate set based on the validation results and the changing trends.

[0010] Optionally, selecting the weight coefficient from the candidate set includes: selecting the weight coefficient such that the medical continued pre-training model achieves high accuracy on the tasks corresponding to the medical corpus data in the validation set, while its performance degradation on the tasks corresponding to the general corpus data in the validation set is less than a preset threshold.

[0011] Optionally, monitoring the changing trends of the conventional language model loss and the KL divergence penalty loss includes: if the decrease in the conventional language model loss is greater than or equal to a set decrease threshold, and the rate of increase of the KL divergence penalty loss within a set time period is greater than a set rate, determining that the candidate weight coefficient is too small; or, if the decrease in the conventional language model loss is less than the set decrease threshold, determining that the candidate weight coefficient is too large.

[0012] Optionally, determining the KL divergence penalty loss includes: Obtain the first conditional probability distribution generated by the training samples of the medical corpus data input into the general pre-trained model; Obtain the second conditional probability distribution generated by the training samples of the medical corpus data input into the medical pre-training model; The KL divergence penalty loss is generated based on the first conditional probability distribution and the second conditional probability distribution.

[0013] Optionally, the step of training the medical corpus data on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model includes: During the continued pre-training based on the medical corpus data, the medical continued pre-training model is self-constrained by the KL divergence penalty loss in the KL restricted loss function, so that the second conditional probability distribution is constrained by the first conditional probability distribution, and the structural parameters of the medical continued pre-training model are adjusted with the goal of minimizing the KL restricted loss function, thereby generating a medical knowledge enhancement model.

[0014] Optionally, the structural parameters include: the word embedding matrix of the word embedding module; the query matrix, key matrix, value matrix and output projection matrix of the multi-head self-attention module; the weight matrix and bias vector of the feedforward network module; and the scaling parameters and offset parameters in the layer normalization operation.

[0015] Optionally, the construction of the medical corpus data includes: Obtain raw general corpus data and raw medical corpus data in medicine, pharmacy and public health; The original general corpus data is cleaned using a data cleaning algorithm to generate general corpus data. The general corpus data and the original medical corpus data are mixed according to a set ratio to generate the medical corpus data.

[0016] Optionally, the step of training the medical corpus data on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model includes: The medical corpus data for each batch is trained on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model. Each batch of medical corpus data includes randomly sampled general corpus data mixed according to a set ratio and the original medical corpus data.

[0017] On the other hand, embodiments of the present invention provide a training device for a medical knowledge enhancement model, comprising: The first construction module is used to build medical corpus data; The second construction module is used to extend the original Transformer layers of the obtained general pre-trained model based on Upcycling technology to construct a medical continuing pre-trained model with a larger number of parameters. The general pre-trained model includes N original Transformer layers, where N>3. The generation module is used to train the medical corpus data on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model.

[0018] On the other hand, embodiments of the present invention provide a storage medium including a stored program, wherein, when the program is running, the device where the storage medium is located executes the above-described training method for the medical knowledge enhancement model.

[0019] On the other hand, embodiments of the present invention provide a computer device including a memory and a processor. The memory is used to store information including program instructions, and the processor is used to control the execution of the program instructions. The program instructions are loaded and executed by the processor to implement the steps of the above-described training method for the medical knowledge enhancement model.

[0020] The technical solution provided by the embodiments of this invention involves: constructing medical corpus data; extending, based on the Upcycling technique, the original Transformer layers of the obtained general pre-trained model to construct a medical continued pre-training model with a larger number of parameters, the general pre-trained model including N original Transformer layers, where N>3; and training the medical continued pre-training model on the medical corpus data based on the KL-restricted loss function to generate a medical knowledge enhancement model. This technical solution expands the model capacity to accommodate medical knowledge and also ensures training stability through high-quality medical corpus data and the KL-restricted loss function. This improves the accuracy, reliability, and robustness of large models in medical scenarios and enhances the credibility of the model's output.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a flowchart of a training method for a medical knowledge enhancement model according to an embodiment of the present invention;
Figure 2 is a flowchart for constructing medical corpus data according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the construction of a medical continued pre-training model according to an embodiment of the present invention;
Figure 4 is a flowchart for constructing a KL-restricted loss function according to an embodiment of the present invention;
Figure 5 is a schematic diagram of a training device for a medical knowledge enhancement model according to an embodiment of the present invention;
Figure 6 is a schematic diagram of a computer device according to an embodiment of the present invention.

Detailed Implementation

[0023] To better understand the technical solution of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0024] It should be understood that the described embodiments are merely some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0025] The terminology used in the embodiments of this invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms “a,” “said,” and “the” as used in the embodiments of this invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.

[0026] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.

[0027] In related technologies, many mainstream LLMs achieve an accuracy of only around 60% in medical question-answering tasks. This is insufficient to meet the accuracy and reliability requirements of large models in medical scenarios. Furthermore, directly fine-tuning general-purpose large models for medical applications presents the following problems:

Forgetting effect and capacity limitation: Simple fine-tuning may cause the general model to forget its original general knowledge, reducing its general language ability. At the same time, if the parameter count of the general model is fixed, the newly added medical knowledge competes with the existing knowledge for the limited representation space, which limits the model's ability to fully learn the new knowledge.

[0028] Data quality and overfitting: Medical data is highly specialized, and its distribution differs from general corpora. Raw medical texts scraped from the web often contain noise and inaccurate information. If general-purpose models are trained directly on low-quality data, they may learn incorrect knowledge or overfit.

[0029] Training stability: Continued pre-training of a general-purpose model on domain corpora may bias the model excessively towards the domain data distribution, resulting in overfitting or even overconfident outputs, and thus reducing its robustness on unseen data.

[0030] In summary, the low model capacity, low quality of the corpus used for model training, and low training stability in related technologies result in low accuracy and reliability of large models in medical scenarios.

[0031] To address the technical problems in related technologies, one embodiment of the present invention provides a training method for a medical knowledge enhancement model. This method improves the medical knowledge understanding and application capabilities of a large model by refining the model structure, training corpus, and loss function. Figure 1 is a flowchart of a training method for a medical knowledge enhancement model according to an embodiment of the present invention. As shown in Figure 1, the method includes:

Step 102: Construct medical corpus data.

[0032] In this embodiment of the invention, each step is performed by a computer device. For example, the computer device includes a computer, a tablet computer, a server, etc.

[0033] Figure 2 is a flowchart for constructing medical corpus data according to an embodiment of the present invention. As shown in Figure 2, step 102 includes:

Step 1022: Obtain raw general corpus data and raw medical corpus data in medicine, pharmacy and public health.

[0034] In this embodiment of the invention, a high-quality corpus (the raw medical corpus data) is built around themes such as clinical medicine, pharmacy, and public health. One part, approximately 20B tokens in total, comes largely from selected content in authoritative medical textbooks and guidelines, such as chapters of internal medicine textbooks, treatment guidelines for common diseases, and authoritative medical encyclopedias and popular science articles. This text has been proofread and cleaned by experts to ensure professionalism and accuracy. In addition, a large model can be used to generate a textbook-style synthetic corpus. Specifically, prompts covering a wide range of medical topics are designed so that a preliminary model with medical knowledge generates "textbook-like" content: chapter explanations, Q&A examples, and exercise solutions. The generated results are manually reviewed and screened before being incorporated into the corpus, for a total of approximately 30B tokens. This combination of manual review and model generation yields raw medical corpus data with broad coverage and extremely low noise, so that the model can learn accurate and systematic medical knowledge.

[0035] Step 1024: Clean the original general corpus data using a data cleaning algorithm to generate general corpus data.

[0036] In this embodiment of the invention, to maintain the general capabilities of the model, a raw general-purpose corpus of approximately 200B tokens can be collected. This corpus may include data from multiple domains, such as Chinese Wikipedia entries, encyclopedia Q&A, news articles, and social media texts. The same data cleaning algorithms used for pre-training the base model can be employed to filter out low-quality content (such as duplicates, advertisements, and inappropriate information), and the text can be preprocessed, for example by sentence segmentation and normalization, to ensure the quality and diversity of the general corpus data. This general corpus data provides the model with background knowledge in fields outside of medicine, preventing the model from losing its ability to converse about everyday topics because of its focus on medicine.
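The cleaning step is described only in general terms, so the following is a minimal sketch of one possible filter, not the algorithm actually used. The function names, the `AD_MARKERS` list, and the `min_chars` cutoff are all hypothetical; the sketch only covers whitespace normalization, a keyword-based advertisement filter, and exact-duplicate removal.

```python
import hashlib
import re

# Hypothetical advertisement markers; a real pipeline would use its own filters.
AD_MARKERS = ("click here", "buy now", "limited offer")


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(docs, min_chars=20):
    """Drop very short documents, ad-like documents, and exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = normalize(doc)
        if len(text) < min_chars:  # assumed length cutoff
            continue
        lowered = text.lower()
        if any(marker in lowered for marker in AD_MARKERS):
            continue
        digest = hashlib.sha256(lowered.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing the normalized text keeps memory use proportional to the number of distinct documents rather than their total length; near-duplicate detection would need a different technique.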

[0037] Step 1026: Mix the general corpus data and the original medical corpus data according to a set ratio to generate medical corpus data.

[0038] In this embodiment of the invention, the set ratio can be chosen according to the actual situation, for example as the trade-off determined by Scaling Law theory and experimental tuning. If the proportion of the original medical corpus data is too high, the model may overfit the medical field and weaken its general performance; conversely, if the proportion is too low, the medical knowledge gain will be insufficient.

[0039] In this embodiment of the invention, the proportion of original medical corpus data in the mixed data can be selected within a certain range, such as about 15%-25%. This way, while the model learns medical knowledge, a considerable amount of general language data is still interspersed within it, maintaining the model's familiarity with general language. Alternatively, the ratio of general language data to original medical corpus data can be selected within the range of 3:1 to 5:1.

[0040] Optionally, the proportion of raw medical corpus data in the mixed data can be 20%. Optionally, the ratio of general corpus data to raw medical corpus data is set to 4:1.
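The two ways of stating the mix (a general-to-medical ratio and a medical share of the total) describe the same quantity. The sketch below converts between them and checks the figures given in this description (about 50B medical tokens and about 200B general tokens). The function names are hypothetical.

```python
from fractions import Fraction


def medical_share(general_parts: int, medical_parts: int) -> Fraction:
    """Share of medical data in the mix for a general:medical ratio."""
    return Fraction(medical_parts, general_parts + medical_parts)


def general_tokens_needed(medical_tokens: int, share: Fraction) -> int:
    """General-corpus tokens required so that medical data makes up `share` of the mix."""
    return int(medical_tokens * (1 - share) / share)


share = medical_share(4, 1)  # a 4:1 ratio corresponds to a 1/5 medical share
tokens = general_tokens_needed(50_000_000_000, share)
```

Under these definitions, the 3:1 and 5:1 endpoints correspond to medical shares of 1/4 and 1/6, both inside the stated range of about 15%-25%.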

[0041] In subsequent training, a mixed random batch strategy can be employed, in which each mini-batch randomly samples raw medical corpus data and general corpus data according to the set ratio. Model parameter updates are then driven by medical knowledge and simultaneously calibrated by general language, which achieves balanced training. In addition, a gradual focusing strategy can be adopted for the medical domain data in the later stages of training: as pre-training nears completion, the sampling weight of the raw medical corpus data in the mixed dataset can be appropriately increased. This allows finer tuning of the model to medical knowledge as training converges, while the preceding long period of mixed training helps avoid overfitting. This step further consolidates the model's medical expertise without significant side effects.
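The mixed batch strategy and the gradual focusing strategy can be sketched as follows. The description does not say how much the medical sampling weight is raised or when the increase begins, so `final=0.30`, `focus_start=0.9`, and the linear ramp are purely illustrative assumptions, as are the function names.

```python
import random


def medical_fraction(step, total_steps, base=0.20, final=0.30, focus_start=0.9):
    """Medical sampling share: constant at `base`, then a linear ramp to `final`.

    `final` and `focus_start` are assumed values, not taken from the description.
    """
    progress = step / total_steps
    if progress <= focus_start:
        return base
    ramp = (progress - focus_start) / (1.0 - focus_start)
    return base + (final - base) * ramp


def sample_batch(medical_docs, general_docs, batch_size, med_frac, rng):
    """Draw one mini-batch containing roughly `med_frac` medical samples."""
    n_med = round(batch_size * med_frac)
    batch = rng.choices(medical_docs, k=n_med)
    batch += rng.choices(general_docs, k=batch_size - n_med)
    rng.shuffle(batch)  # interleave medical and general samples
    return batch
```

A call such as `sample_batch(med, gen, 10, medical_fraction(step, total), random.Random(0))` would then yield a batch whose medical share follows the schedule for that step.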

[0042] In this embodiment of the invention, a high-precision knowledge dataset containing 50B (50 billion) tokens in the medical field was synthesized from the original medical corpus data. The content covers multiple subfields including clinical diagnosis and treatment, medical science, and health popularization. Data sources include authoritative medical textbooks and literature, as well as medical question-and-answer pairs and case analyses generated using large-scale model assistance, ensuring the accuracy and reliability of the content. Simultaneously, the aforementioned original medical corpus data can be mixed proportionally with a cleaned 200B general-purpose corpus data for pre-training. The original medical corpus data accounts for approximately 20%, and the general-purpose corpus data accounts for approximately 80%. This ratio allows for a significant infusion of medical expertise while maintaining the model's general language capabilities, achieving a balance between specialization and generalization.

[0043] Step 104: Based on the Upcycling technique, the original Transformer layers of the obtained general pre-trained model are extended to construct a medical continuing pre-trained model with a larger number of parameters. The general pre-trained model includes N original Transformer layers, where N>3.

[0044] In this invention, Upcycling technology is a technique for upgrading and modifying existing general-purpose pre-trained models. Its core idea is to fully utilize the knowledge and feature extraction capabilities accumulated by the pre-trained general-purpose model, and through structural expansion and optimization, adapt it to the needs of specific domains (such as the medical field), constructing domain-specific models with larger parameter sets and superior performance. In the process of extending the general-purpose pre-trained model into a medical pre-trained model, Upcycling technology can extend the original Transformer layers of the general-purpose pre-trained model. For example, by increasing the number of attention heads, neurons, or adding additional network layers in each Transformer layer, the model's ability to capture and learn complex features of medical data can be improved. This extension is not a simple parameter stacking, but rather a targeted optimization based on the original model structure and knowledge to improve model performance and enhance domain adaptability.

[0045] In this embodiment of the invention, continued pre-training refers to further training a model on data from a specific domain (such as the medical field) after it has already been pre-trained on large-scale general data. General pre-trained models have learned common language expressions and knowledge from a wide range of text data, but this general knowledge may not be deep or accurate enough for a specific domain such as medicine. Continued pre-training gives the medical continued pre-training model additional training on domain data (the medical corpus data), enabling it to better understand and process the language patterns, terminology, and specific tasks of that domain, and thereby improving its performance on medical text processing tasks. In medical natural language processing tasks (such as medical text classification, medical named entity recognition, and medical question answering), the medical continued pre-training model can understand the semantics of medical text more accurately, which improves performance metrics such as accuracy and recall.

[0046] In this embodiment of the invention, the general pre-trained model can be a general large model such as the Baichuan model, the DeepSeek model, or the Qwen model.

[0047] In this embodiment of the invention, the original Transformer layer includes a multi-head attention module and a feedforward network module.

[0048] Figure 3 is a schematic diagram of the construction of a medical continued pre-training model according to an embodiment of the present invention. As shown in Figure 3, step 104 includes:

Step 1042: While keeping the parameters of the multi-head self-attention module and feedforward network module in the original Transformer layers unchanged, add a new Transformer layer among the original Transformer layers by copying one or more intermediate layers of the original Transformer layers.

[0049] In this embodiment of the invention, adding a new Transformer layer to the original Transformer layer can increase the overall number of model parameters by approximately 15%-25%. For example, adding a new Transformer layer to the original Transformer layer can increase the overall number of model parameters by approximately 20%.
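As a rough illustration of how copying intermediate layers translates into parameter growth, the sketch below duplicates a block of middle layers in a hypothetical 40-layer stack. The layer count, the per-layer parameter number, the choice of which layers to copy, and the placement of each copy directly after its source are all assumptions; embedding parameters are ignored.

```python
def expand_layers(layers, copy_indices):
    """Return a new layer list with a copy inserted right after each chosen layer.

    Original layers are kept as they are; only the copies are marked as new.
    """
    chosen = set(copy_indices)
    expanded = []
    for i, layer in enumerate(layers):
        expanded.append({"source": i, "is_new": False, "params": layer["params"]})
        if i in chosen:
            expanded.append({"source": i, "is_new": True, "params": layer["params"]})
    return expanded


def relative_growth(layers, expanded):
    """Fractional increase in Transformer-layer parameters after expansion."""
    before = sum(layer["params"] for layer in layers)
    after = sum(layer["params"] for layer in expanded)
    return (after - before) / before


# Hypothetical model: 40 identical layers, with the 8 middle layers duplicated.
original = [{"params": 100} for _ in range(40)]
expanded = expand_layers(original, range(16, 24))
growth = relative_growth(original, expanded)
```

With equally sized layers, duplicating 8 of 40 layers grows the layer parameters by one fifth, which is in line with the approximately 20% figure given above.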

[0050] Step 1044: Perform a warm start on the new Transformer layer to initialize it.

[0051] Specifically, the parameters of the new Transformer layer are initialized by copying the parameters of one or more original Transformer layers corresponding to the position in the general pre-trained model and adding random perturbations.

[0052] In this embodiment of the invention, the initialization part of the newly added Transformer layer borrows the distribution of model parameters corresponding to the general pre-trained model, copies the original weights and adds small perturbations, and integrates into the general pre-trained model in a warm start manner.
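A warm start of this kind can be sketched in a few lines. The description specifies only "copy and add random perturbations", so the zero-mean Gaussian noise, the `noise_std` value, and the function name are assumptions; weights are represented as plain nested lists to keep the example dependency-free.

```python
import random


def warm_start(weights, noise_std=1e-3, seed=0):
    """Copy a weight matrix and add small zero-mean Gaussian noise to every entry.

    The source matrix is left untouched; `noise_std` is an assumed scale.
    """
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, noise_std) for w in row] for row in weights]


source_layer = [[0.5, -0.25], [1.0, 0.0]]
new_layer = warm_start(source_layer)
```

Because the new layer starts close to an existing one, it begins training from a meaningful set of weights rather than from random values, and the perturbation keeps the copy from being exactly identical to its source.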

[0053] Step 1046: Based on the newly initialized Transformer layer and the original Transformer layer, construct the medical pre-training model.

[0054] In this embodiment of the invention, the extended general pre-trained model (medical continuing pre-trained model) is equivalent to adding extra capacity on the basis of the original knowledge, so that it can carry more medical knowledge without excessively interfering with the original capacity.

[0055] In this embodiment of the invention, without altering the parameter structure of the general pre-trained model, the Upcycling method is introduced to extend the Transformer layers, increasing the number of model parameters by approximately 20% for continued pre-training. The original weights of the general pre-trained model are retained and, together with the newly added parameters, constitute the expanded model architecture. This "model dimensionality upgrade" design gives the model additional capacity to learn new medical knowledge while maximizing reuse of the general knowledge already mastered by the original model, achieving rapid convergence and enhanced expressive power.

[0056] In this embodiment of the invention, compared to training a larger model from scratch, the Upcycling method can efficiently utilize the advantages of existing models. The vast amount of general knowledge in the general pre-trained model is seamlessly inherited, while new parameters are focused on learning the medical domain. Experiments show that, using this embodiment, the medical-focused pre-trained model demonstrates a strong understanding of medical content from the initial stages of pre-training, and its convergence speed is significantly faster than that of a randomly initialized model. This confirms the effectiveness of the Upcycling method in rapidly improving model capacity and expressive power while maintaining the performance of the original model.

[0057] Step 106: Train the medical corpus data on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model.

[0058] In this embodiment of the invention, a KL-restricted loss function can be constructed before step 106. Figure 4 is a flowchart for constructing a KL-restricted loss function according to an embodiment of the present invention. As shown in Figure 4:

Step S1: Determine the conventional language model loss, the weight coefficient, and the KL divergence penalty loss.

[0059] In this embodiment of the invention, L_LM is the conventional language model loss, λ is the weight coefficient, and D_KL(P_new||P_base) is the KL divergence penalty loss.
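The formula that combines these three quantities is not reproduced in this text. Since the weight coefficient is described as adjusting the weight of the KL divergence penalty loss within the KL-restricted loss function, a plausible reading is an additive form; the additive form itself is an assumption, and the symbols L_total, V (the vocabulary), and x (the preceding context) are introduced here for illustration only. The second expression is the standard definition of KL divergence in the direction indicated by the notation above.

```latex
L_{total} = L_{LM} + \lambda \, D_{KL}\left(P_{new} \,\|\, P_{base}\right),
\qquad
D_{KL}\left(P_{new} \,\|\, P_{base}\right)
  = \sum_{v \in V} P_{new}(v \mid x) \, \log \frac{P_{new}(v \mid x)}{P_{base}(v \mid x)}
```

Under this reading, setting λ to 0 recovers ordinary continued pre-training, while larger values of λ hold the output distribution of the model being trained closer to that of the general pre-trained model.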

[0060] In this embodiment of the invention, the conventional language model loss L_LM measures the difference between the output distribution of the medical knowledge enhancement model when predicting the next word (token) and the true target (i.e., the next word in the corpus). Simply put, its sole purpose is to drive the medical knowledge enhancement model to learn and fit the knowledge in the medical corpus data. It is responsible for teaching the medical knowledge enhancement model the terminology, syntax, facts, and logic of the medical field.

[0061] Determining the conventional language model loss may include: inputting training samples of medical corpus data into the medical pre-training model to obtain the second conditional probability distribution P_new output by the medical pre-training model; calculating the cross-entropy loss based on the second conditional probability distribution P_new and the real labels (one-hot vectors) corresponding to the training samples to generate the conventional language model loss; and adjusting the parameters of the medical pre-training model through backpropagation algorithm with the goal of minimizing the conventional language model loss, so that the output of the second conditional probability distribution P_new is concentrated towards the real labels, thereby determining the conventional language model loss.
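The cross-entropy computation described above can be sketched without any framework. The sketch assumes the model exposes one vector of raw scores (logits) per position and that each true label is a token index, which is equivalent to a one-hot vector; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lm_loss(logits_per_position, target_ids):
    """Mean cross-entropy between predicted distributions and one-hot targets."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        p_new = softmax(logits)  # second conditional probability distribution
        losses.append(-math.log(p_new[target]))  # one-hot target picks one term
    return sum(losses) / len(losses)
```

With a one-hot label, cross-entropy reduces to the negative log-probability assigned to the true next token, so minimizing it concentrates P_new on the true labels.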

[0062] Table 1 is a comparison table showing how the value of the weight coefficient λ affects the model behavior of the medical pre-training model.

[0063] Determining the weight coefficient may include: constructing a candidate set containing multiple candidate values of the weight coefficient (e.g., 0.1, 0.5, and 1.0); validating each candidate value on a reserved validation set to generate validation results, the validation set containing task samples from both the medical corpus data and the general corpus data; during validation, monitoring the trends of the conventional language model loss and the KL divergence penalty loss; and selecting the weight coefficient from the candidate set based on the validation results and the trends. When monitoring the trends, if the decrease in the conventional language model loss is greater than or equal to a set decrease threshold while the rate of increase of the KL divergence penalty loss within a set time period is greater than a set rate, the candidate weight coefficient is determined to be too small and the constraint insufficient; if the decrease in the conventional language model loss is less than the set decrease threshold, the candidate weight coefficient is determined to be too large for the model to learn new knowledge. When selecting, the chosen weight coefficient should ensure that the medical continued pre-training model achieves high accuracy on the tasks corresponding to the medical corpus data in the validation set, and that its performance degradation on the tasks corresponding to the general corpus data in the validation set is less than a preset threshold.
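The monitoring rules and the selection step can be sketched as below. The class `CandidateResult`, the functions `diagnose` and `select_weight`, the threshold values, and all numbers are hypothetical and chosen only for illustration. The tie-breaking rule (prefer the highest medical accuracy among eligible candidates) is an assumption; the text only requires high medical accuracy and bounded general degradation.

```python
from dataclasses import dataclass

@dataclass
class CandidateResult:
    lam: float                  # candidate weight coefficient
    lm_loss_decrease: float     # drop in L_LM over the monitoring window
    kl_increase_rate: float     # growth rate of the KL penalty over the window
    medical_accuracy: float     # accuracy on medical validation tasks
    general_degradation: float  # performance drop on general validation tasks

def diagnose(result, decrease_threshold, rate_threshold):
    """Classify a candidate using the two monitoring rules from the text."""
    if result.lm_loss_decrease < decrease_threshold:
        return "too_large"   # L_LM barely falls: new knowledge is not learned
    if result.kl_increase_rate > rate_threshold:
        return "too_small"   # L_LM falls, but the KL penalty grows too quickly
    return "ok"

def select_weight(results, decrease_threshold, rate_threshold, max_degradation):
    """Pick a weight coefficient from the candidate set, or None if none fits."""
    eligible = [
        r for r in results
        if diagnose(r, decrease_threshold, rate_threshold) == "ok"
        and r.general_degradation < max_degradation
    ]
    if not eligible:
        return None
    # Assumption: among eligible candidates, prefer the best medical accuracy.
    return max(eligible, key=lambda r: r.medical_accuracy).lam

# Made-up validation results for the candidate set {0.1, 0.5, 1.0}.
results = [
    CandidateResult(0.1, 0.50, 0.90, 0.85, 0.08),
    CandidateResult(0.5, 0.40, 0.10, 0.82, 0.01),
    CandidateResult(1.0, 0.02, 0.01, 0.65, 0.00),
]
chosen = select_weight(results, decrease_threshold=0.1,
                       rate_threshold=0.5, max_degradation=0.03)
```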

[0064] Determining the KL divergence penalty loss may include: obtaining a first conditional probability distribution P_base generated from training samples of medical corpus data input to a general pre-trained model; obtaining a second conditional probability distribution P_new generated from training samples of medical corpus data input to a further pre-trained medical model; and generating the KL divergence penalty loss based on the first and second conditional probability distributions P_base and P_new. The first conditional probability distribution P_base can serve as a reference anchor point for the second conditional probability distribution P_new. The first conditional probability distribution P_base can be used to constrain the direction and magnitude of change in the second conditional probability distribution P_new.
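A minimal sketch of the penalty term follows, assuming both distributions are given as lists of probabilities over the same vocabulary. The names `kl_divergence` and `kl_penalty`, the `eps` guard against division by zero, and the averaging over token positions are assumptions for illustration.

```python
import math

def kl_divergence(p_new, p_base, eps=1e-12):
    """D_KL(P_new || P_base) = sum_i P_new(i) * log(P_new(i) / P_base(i)).

    Terms with P_new(i) == 0 contribute nothing; eps guards against a zero
    in P_base.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_new, p_base)
        if p > 0.0
    )

def kl_penalty(p_new_per_position, p_base_per_position):
    """Average the per-position divergence over a training sample."""
    values = [
        kl_divergence(p_new, p_base)
        for p_new, p_base in zip(p_new_per_position, p_base_per_position)
    ]
    return sum(values) / len(values)
```

Because P_base comes from the general pre-trained model and acts only as a reference anchor, an implementation would typically compute it without updating that model's parameters; this is an assumption, as the text does not state it explicitly.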

[0065] In this embodiment of the invention, by minimizing the KL divergence, the medical knowledge enhancement model can be constrained to avoid generating, for a given input, a distribution that differs too much from that of the general pre-trained model.

[0066] Step S2: Construct the KL-constrained loss function based on the conventional language model loss, weight coefficients, and KL divergence penalty loss.

[0067] In this embodiment of the invention, the KL-constrained loss function can be L = L_LM + λ·D_KL(P_new||P_base), where L_LM is the conventional language model loss, λ is the weight coefficient, and D_KL(P_new||P_base) is the KL divergence penalty loss.
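Written out per token, the three terms of the KL-constrained loss can be expanded as follows. The sequence length T, the vocabulary V, the context notation x_{<t}, and the averaging over positions are assumptions introduced for this expansion; the source defines only the three terms and their combination.

```latex
\begin{aligned}
\mathcal{L} &= \mathcal{L}_{LM} + \lambda \, D_{KL}\!\left(P_{new} \,\|\, P_{base}\right), \\
\mathcal{L}_{LM} &= -\frac{1}{T} \sum_{t=1}^{T} \log P_{new}\!\left(x_t \mid x_{<t}\right), \\
D_{KL}\!\left(P_{new} \,\|\, P_{base}\right) &= \frac{1}{T} \sum_{t=1}^{T} \sum_{v \in V}
  P_{new}\!\left(v \mid x_{<t}\right)
  \log \frac{P_{new}\!\left(v \mid x_{<t}\right)}{P_{base}\!\left(v \mid x_{<t}\right)}.
\end{aligned}
```

With λ = 0 the objective reduces to ordinary continued pre-training; increasing λ pulls P_new closer to P_base at the cost of fitting the medical corpus data less tightly.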

[0068] In this embodiment of the invention, the conventional language model loss is used to drive the medical continuing pre-training model to learn and fit medical corpus data, the weight coefficient is used to adjust the weight of the KL divergence penalty loss in the KL constraint loss function, and the KL divergence penalty loss is used to perform self-constrained training on the medical continuing pre-training model.

[0069] In this embodiment of the invention, the medical corpus data of each batch can be trained on the medical continued pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model. Each batch of medical corpus data includes randomly sampled general corpus data and original medical corpus data, mixed according to a set ratio.
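One way to assemble such a batch is sketched below. The function name `build_mixed_batch`, the `general_ratio` parameter, and the choice to also sample the medical portion at random are assumptions; the text fixes only that general data is randomly sampled and mixed with medical data at a set ratio.

```python
import random

def build_mixed_batch(medical_samples, general_samples, batch_size,
                      general_ratio, rng):
    """Return one batch mixing general and medical samples at a set ratio.

    general_ratio is the fraction of the batch drawn from the general corpus.
    Each pool must hold at least as many samples as are drawn from it.
    """
    n_general = round(batch_size * general_ratio)
    n_medical = batch_size - n_general
    batch = rng.sample(general_samples, n_general) + \
        rng.sample(medical_samples, n_medical)
    rng.shuffle(batch)  # avoid a fixed general-then-medical ordering
    return batch

# Hypothetical toy corpora; each item is tagged with its source.
medical = [("medical", i) for i in range(100)]
general = [("general", i) for i in range(100)]
batch = build_mixed_batch(medical, general, batch_size=10,
                          general_ratio=0.3, rng=random.Random(0))
```

Keeping general samples in every batch complements the KL penalty: both mechanisms aim to preserve general capabilities while medical knowledge is learned.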

[0070] In this embodiment of the invention, during the continued pre-training based on medical corpus data, the medical continued pre-training model is self-constrained by the KL divergence penalty loss in the KL restricted loss function, so that the second conditional probability distribution is constrained by the first conditional probability distribution, and the structural parameters of the medical continued pre-training model are adjusted with the goal of minimizing the KL restricted loss function, thereby generating a medical knowledge enhancement model.

[0071] The structural parameters include: the word embedding matrix of the word embedding module; the query matrix, key matrix, value matrix and output projection matrix of the multi-head self-attention module; the weight matrix and bias vector of the feedforward network module; and the scaling and offset parameters in the layer normalization operation.

[0072] In this embodiment of the invention, during continued pre-training, the predicted distribution of the medical continued pre-training model is compared with the output distribution of a general pre-trained model (such as the Baichuan4-Turbo model), and the difference between the two is limited by incorporating a KL divergence penalty loss. This KL-restricted loss mechanism effectively suppresses the tendency of the model's predicted distribution to overfit: when the model attempts to form a narrow, overconfident distribution on medical data, the KL divergence penalty loss keeps that distribution from deviating from the distribution range of general knowledge in the general pre-trained model. This not only prevents the model from forgetting general knowledge but also improves the stability of the model's answers to unseen medical questions.
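The effect on overconfident predictions can be illustrated with a toy four-token vocabulary. All distributions and the values of λ below are made up for illustration, and the function names are hypothetical. The sketch shows that a sharply peaked P_new has a lower language model loss but a larger KL penalty, so the combined loss can favor the more moderate prediction.

```python
import math

def kl_divergence(p_new, p_base):
    """D_KL(P_new || P_base) over one vocabulary distribution."""
    return sum(p * math.log(p / q) for p, q in zip(p_new, p_base) if p > 0.0)

def kl_constrained_loss(p_new, p_base, target, lam):
    """L = L_LM + lam * D_KL(P_new || P_base) for a single token position."""
    lm_loss = -math.log(p_new[target])
    return lm_loss + lam * kl_divergence(p_new, p_base)

# Made-up distributions; token 0 is the true next token.
p_base = [0.40, 0.30, 0.20, 0.10]         # general pre-trained model
moderate = [0.60, 0.20, 0.15, 0.05]       # shifted toward the label
overconfident = [0.97, 0.01, 0.01, 0.01]  # narrow, sharply peaked

# Without the penalty (lam = 0), the overconfident prediction scores better.
unconstrained_prefers_peak = (
    kl_constrained_loss(overconfident, p_base, 0, lam=0.0)
    < kl_constrained_loss(moderate, p_base, 0, lam=0.0)
)
# With lam = 1.0, the KL penalty outweighs the gain from the sharper peak.
constrained_prefers_moderate = (
    kl_constrained_loss(moderate, p_base, 0, lam=1.0)
    < kl_constrained_loss(overconfident, p_base, 0, lam=1.0)
)
```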

[0073] In this embodiment of the invention, experiments showed that the medical knowledge enhancement model with KL-constrained loss function exhibited more stable performance in the later stages of training compared to the unconstrained model, and the perplexity on various validation sets no longer oscillated. Simultaneously, the medical knowledge enhancement model demonstrated moderate conservatism in medical question-answering generation: it did not provide overconfident answers to uncertain medical questions, but rather tended to make robust inferences based on existing knowledge. This stability is particularly important in medical scenarios, helping to reduce model hallucinations and improve the reliability of answers.

[0074] The technical solution provided in this invention involves constructing medical corpus data; extending the original Transformer layers of the acquired general pre-trained model using Upcycling technology to construct a larger medical pre-trained model with more parameters. The general pre-trained model includes N original Transformer layers, where N>3; and training the medical corpus data on the medical pre-trained model using the KL-restricted loss function to generate a medical knowledge augmentation model. This technical solution not only expands the model capacity to accommodate medical knowledge but also ensures training stability based on high-quality medical corpus data and the KL-restricted loss function. This improves the accuracy, reliability, and robustness of large models in medical scenarios and enhances the credibility of the model's output.

[0075] The technical solution provided in this invention has achieved breakthrough results in medical question-answering and other evaluations. The question-answering accuracy of the medical knowledge enhancement model has risen from approximately 60% for the original baseline to 83%, significantly exceeding the level of existing models of similar parameter scale.

[0076] The technical solution provided in this invention improves professional performance while maintaining the model's general capabilities. This means that users can use the medical knowledge-enhanced model to answer medical questions, and also enjoy a smooth interactive experience comparable to a general pre-trained model in everyday conversations and general knowledge Q&A. The medical knowledge-enhanced model does not neglect everyday knowledge in its focus on medicine, demonstrating good balance and generalization ability.

[0077] The medical knowledge enhancement model provided in this invention possesses powerful medical knowledge modeling capabilities and broad applicability. This model can serve as the core of a medical question-and-answer system, supporting patient self-service consultation and medical advice; as a foundational model for intelligent medical literature retrieval, assisting researchers in quickly acquiring knowledge; for clinical decision support, providing references for treatment plans; and for medical education, providing intelligent teaching and training support for students and doctors. The medical knowledge enhancement model trained through this invention will become an important foundation for the field of medical artificial intelligence, laying a solid foundation for building safe and reliable medical artificial intelligence (AI) applications. It significantly surpasses existing similar technologies, possessing significant innovative value and application prospects.

[0078] One embodiment of the present invention provides a training device for a medical knowledge enhancement model. Figure 5 is a schematic diagram of a training device for a medical knowledge enhancement model according to an embodiment of the present invention. As shown in Figure 5, the device includes: a first building module 11, a second building module 12, and a generation module 13.

[0079] The first building module 11 is used to build medical corpus data.

[0080] The second building module 12 is used to extend the original Transformer layers of the obtained general pre-trained model based on Upcycling technology to build a medical pre-trained model with a larger number of parameters. The general pre-trained model includes N original Transformer layers, where N>3.

[0081] The generation module 13 is used to train the medical corpus data on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model.

[0082] In this embodiment of the invention, the original Transformer layer includes a multi-head attention module and a feedforward network module. The second construction module 12 is specifically used to add a new Transformer layer to the original Transformer layer by copying one or more intermediate layers in the original Transformer layer while keeping the parameters of the multi-head self-attention module and the feedforward network module unchanged; perform a warm start process on the new Transformer layer to initialize the new Transformer layer; and construct a medical pre-training model based on the initialized new Transformer layer and the original Transformer layer.

[0083] In this embodiment of the invention, the second construction module 12 is specifically used to initialize the parameters of the new Transformer layer. The parameters of each new Transformer layer are initialized by copying the parameters of one or more original Transformer layers corresponding to the position in the general pre-trained model and adding random perturbations.
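The layer expansion and warm start can be sketched as follows, with each layer modeled as a dictionary mapping a parameter name to a flat list of floats. The function name `expand_layers`, the placement of each new layer directly after the layer it copies, and the use of Gaussian noise for the random perturbation are assumptions; the text specifies only copying intermediate layers, keeping the original parameters unchanged, and adding random perturbations.

```python
import random

def expand_layers(original_layers, copy_indices, noise_std, rng):
    """Insert a perturbed copy after each selected intermediate layer.

    original_layers: list of dicts {parameter name: list of floats}.
    copy_indices: indices of the intermediate layers to duplicate.
    """
    expanded = []
    for idx, layer in enumerate(original_layers):
        expanded.append(layer)  # original parameters are kept unchanged
        if idx in copy_indices:
            # Warm start: copy the parameters, then add a small perturbation.
            new_layer = {
                name: [v + rng.gauss(0.0, noise_std) for v in values]
                for name, values in layer.items()
            }
            expanded.append(new_layer)
    return expanded

# Hypothetical 4-layer model (N > 3) with tiny parameter vectors.
layers = [{"attn.q": [0.1 * i, 0.2], "ffn.w": [1.0, -1.0 * i]} for i in range(4)]
expanded = expand_layers(layers, copy_indices={1, 2}, noise_std=0.01,
                         rng=random.Random(0))
```

Starting each new layer close to an existing one means the expanded model initially behaves much like the general pre-trained model, which is the purpose of the warm start.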

[0084] In this embodiment of the invention, the device further includes a determining module 14 and a third construction module 15.

[0085] The determining module 14 is used to determine the conventional language model loss, the weight coefficient, and the KL divergence penalty loss.

[0086] The third building module 15 is used to construct the KL-constrained loss function based on the conventional language model loss, weight coefficients, and KL divergence penalty loss. The conventional language model loss is used to drive the medical continuing pre-training model to learn and fit the medical corpus data, the weight coefficients are used to adjust the weight of the KL divergence penalty loss in the KL-constrained loss function, and the KL divergence penalty loss is used to perform self-constrained training on the medical continuing pre-training model.

[0087] In this embodiment of the invention, the determining module 14 is specifically used to input training samples of medical corpus data into the medical continuing pre-training model to obtain the second conditional probability distribution output by the medical continuing pre-training model; calculate the cross-entropy loss based on the second conditional probability distribution and the real labels corresponding to the training samples to generate the conventional language model loss; and adjust the parameters of the medical continuing pre-training model through the backpropagation algorithm with the goal of minimizing the conventional language model loss, so that the output of the second conditional probability distribution is concentrated on the real labels, thereby determining the conventional language model loss.

[0088] In this embodiment of the invention, the determining module 14 is specifically used to construct a candidate set containing weight coefficients of multiple candidate values; to verify each candidate value using a reserved validation set and generate a validation result, wherein the validation set contains task samples from medical corpus data and task samples from general corpus data; during the validation process, the changing trends of conventional language model loss and KL divergence penalty loss are monitored; and weight coefficients are selected from the candidate set based on the validation results and changing trends.

[0089] In this embodiment of the invention, the determining module 14 is specifically used to ensure that the medical pre-trained model achieves high accuracy on the task corresponding to the medical corpus data in the validation set, and that the performance degradation on the task corresponding to the general corpus data in the validation set is lower than a preset threshold.

[0090] In this embodiment of the invention, the determining module 14 is specifically used to determine that the candidate weight coefficient is too small if the monitored decrease value of the conventional language model loss is greater than or equal to a set decrease threshold and the increase rate of the KL divergence penalty loss within a set time period is greater than a set rate; or, if the monitored decrease value of the conventional language model loss is less than a set decrease threshold, determine that the candidate weight coefficient is too large.

[0091] In this embodiment of the invention, the determining module 14 is specifically used to obtain the first conditional probability distribution generated by the training samples of the medical corpus data input to the general pre-trained model; obtain the second conditional probability distribution generated by the training samples of the medical corpus data input to the medical continuation pre-trained model; and generate the KL divergence penalty loss based on the first conditional probability distribution and the second conditional probability distribution.

[0092] In this embodiment of the invention, the generation module 13 is specifically used to perform self-constrained training on the medical pre-training model by using the KL divergence penalty loss in the KL constraint loss function during continued pre-training based on medical corpus data, so that the second conditional probability distribution is constrained by the first conditional probability distribution, and to adjust the structural parameters of the medical pre-training model with the goal of minimizing the KL constraint loss function, thereby generating a medical knowledge enhancement model.

[0093] In this embodiment of the invention, the structural parameters include: the word embedding matrix of the word embedding module; the query matrix, key matrix, value matrix and output projection matrix of the multi-head self-attention module; the weight matrix and bias vector of the feedforward network module; and the scaling parameters and offset parameters in the layer normalization operation.

[0094] In this embodiment of the invention, the first construction module 11 is specifically used to acquire original general corpus data and original medical corpus data of medicine, pharmacy and public health; to clean the original general corpus data using a data cleaning algorithm to generate general corpus data; and to mix the general corpus data and the original medical corpus data according to a set ratio to generate medical corpus data.

[0095] In this embodiment of the invention, the generation module 13 is specifically used to train the medical corpus data of each batch on the medical pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model. The medical corpus data of each batch includes randomly sampled general corpus data mixed according to a set ratio and original medical corpus data.

[0096] The technical solution provided in this invention involves constructing medical corpus data; extending the original Transformer layers of the acquired general pre-trained model using Upcycling technology to construct a larger medical pre-trained model with more parameters. The general pre-trained model includes N original Transformer layers, where N>3; and training the medical corpus data on the medical pre-trained model using the KL-restricted loss function to generate a medical knowledge augmentation model. This technical solution not only expands the model capacity to accommodate medical knowledge but also ensures training stability based on high-quality medical corpus data and the KL-restricted loss function. This improves the accuracy, reliability, and robustness of large models in medical scenarios and enhances the credibility of the model's output.

[0097] The training device for the medical knowledge enhancement model provided in this embodiment of the invention can be used to implement the training method for the medical knowledge enhancement model shown in Figure 1. For a detailed description, refer to the above-described embodiment of the training method for the medical knowledge enhancement model; it will not be repeated here.

[0098] This invention provides a storage medium that includes a stored program. When the program runs, it controls the device where the storage medium is located to execute the steps of the above-described training method for the medical knowledge enhancement model. For a detailed description, please refer to the embodiments of the above-described training method for the medical knowledge enhancement model.

[0099] This invention provides a computer device, including a memory and a processor. The memory is used to store information including program instructions, and the processor is used to control the execution of the program instructions. When the program instructions are loaded and executed by the processor, they implement the steps of the above-described embodiment of the training method for the medical knowledge enhancement model. For a detailed description, please refer to the embodiment of the above-described training method for the medical knowledge enhancement model.

[0100] Figure 6 is a schematic diagram of a computer device provided in an embodiment of the present invention. As shown in Figure 6, the computer device 20 in this embodiment includes a processor 21, a memory 22, and a computer program 23 stored in the memory 22 and executable on the processor 21. When the processor 21 executes the computer program 23, it implements the training method for the medical knowledge enhancement model in this embodiment. To avoid repetition, it will not be described in detail here. Alternatively, when the processor 21 executes the computer program 23, it implements the functions of each module/unit in the training device for the medical knowledge enhancement model in this embodiment. To avoid repetition, it will not be described in detail here.

[0101] The computer device 20 includes, but is not limited to, the processor 21 and the memory 22. Those skilled in the art will understand that Figure 6 is merely an example of the computer device 20 and does not constitute a limitation on the computer device 20. The computer device 20 may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 20 may also include input/output devices, network access devices, buses, etc.

[0102] The processor 21 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0103] The memory 22 can be an internal storage unit of the computer device 20, such as a hard disk or RAM of the computer device 20. The memory 22 can also be an external storage device of the computer device 20, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the computer device 20. Furthermore, the memory 22 can include both internal and external storage units of the computer device 20. The memory 22 is used to store computer programs and other programs and data required by the computer device. The memory 22 can also be used to temporarily store data that has been output or will be output.

[0104] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0105] In the several embodiments provided by this invention, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or other forms.

[0106] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0107] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0108] The integrated units implemented as software functional units described above can be stored in a computer-readable storage medium. These software functional units, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0109] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A training method for a medical knowledge enhancement model, characterized by comprising: constructing medical corpus data; extending, based on the Upcycling technique, the original Transformer layers of an obtained general pre-trained model to construct a medical continued pre-training model with a larger number of parameters, the general pre-trained model including N original Transformer layers, where N>3; and training the medical corpus data on the medical continued pre-training model based on a KL-restricted loss function to generate a medical knowledge enhancement model.

2. The method according to claim 1, characterized in that the original Transformer layer includes a multi-head self-attention module and a feedforward network module, and extending, based on the Upcycling technique, the original Transformer layers of the obtained general pre-trained model to construct a medical continued pre-training model with a larger number of parameters comprises: while keeping the parameters of the multi-head self-attention module and the feedforward network module in the original Transformer layers unchanged, adding new Transformer layers to the original Transformer layers by copying one or more intermediate layers of the original Transformer layers; performing a warm start on the new Transformer layers to initialize them; and constructing the medical continued pre-training model based on the initialized new Transformer layers and the original Transformer layers.

3. The method according to claim 2, characterized in that performing the warm start on the new Transformer layers comprises: initializing the parameters of each new Transformer layer by copying the parameters of the one or more original Transformer layers at the corresponding position in the general pre-trained model and adding random perturbations.

4. The method according to claim 1, characterized in that, before training the medical corpus data on the medical continued pre-training model based on the KL-restricted loss function to generate the medical knowledge enhancement model, the method further comprises: determining a conventional language model loss, a weight coefficient, and a KL divergence penalty loss; and constructing the KL-restricted loss function based on the conventional language model loss, the weight coefficient, and the KL divergence penalty loss; wherein the conventional language model loss is used to drive the medical continued pre-training model to learn and fit the medical corpus data, the weight coefficient is used to adjust the weight of the KL divergence penalty loss in the KL-restricted loss function, and the KL divergence penalty loss is used to perform self-constrained training on the medical continued pre-training model.

5. The method according to claim 4, characterized in that determining the conventional language model loss comprises: inputting training samples of the medical corpus data into the medical continued pre-training model to obtain a second conditional probability distribution output by the medical continued pre-training model; calculating a cross-entropy loss based on the second conditional probability distribution and the real labels corresponding to the training samples to generate the conventional language model loss; and, with the goal of minimizing the conventional language model loss, adjusting the parameters of the medical continued pre-training model through a backpropagation algorithm so that the second conditional probability distribution is concentrated toward the real labels, thereby determining the conventional language model loss.

6. The method according to claim 4, characterized in that determining the weight coefficient comprises: constructing a candidate set containing multiple candidate values of the weight coefficient; validating each candidate value using a reserved validation set to generate validation results, the validation set including task samples from the medical corpus data and task samples from general corpus data; during validation, monitoring the trends of the conventional language model loss and the KL divergence penalty loss; and selecting the weight coefficient from the candidate set based on the validation results and the trends.

7. The method according to claim 6, characterized in that selecting the weight coefficient from the candidate set comprises: ensuring that the medical continued pre-training model achieves high accuracy on the tasks corresponding to the medical corpus data in the validation set, and that its performance degradation on the tasks corresponding to the general corpus data in the validation set is less than a preset threshold.

8. The method according to claim 6, characterized in that monitoring the trends of the conventional language model loss and the KL divergence penalty loss comprises: if the decrease in the conventional language model loss is greater than or equal to a set decrease threshold, and the rate of increase of the KL divergence penalty loss within a set time period is greater than a set rate, determining that the candidate weight coefficient is too small; or, if the decrease in the conventional language model loss is less than the set decrease threshold, determining that the candidate weight coefficient is too large.

9. The method according to claim 4, characterized in that determining the KL divergence penalty loss comprises: obtaining a first conditional probability distribution generated by inputting training samples of the medical corpus data into the general pre-trained model; obtaining a second conditional probability distribution generated by inputting the training samples of the medical corpus data into the medical continued pre-training model; and generating the KL divergence penalty loss based on the first conditional probability distribution and the second conditional probability distribution.

10. The method according to claim 9, characterized in that training the medical corpus data on the medical continued pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model comprises: during continued pre-training based on the medical corpus data, performing self-constrained training on the medical continued pre-training model through the KL divergence penalty loss in the KL-restricted loss function, so that the second conditional probability distribution is constrained by the first conditional probability distribution, and adjusting the structural parameters of the medical continued pre-training model with the goal of minimizing the KL-restricted loss function, thereby generating the medical knowledge enhancement model.

11. The method according to claim 10, characterized in that the structural parameters include: the word embedding matrix of the word embedding module; the query matrix, key matrix, value matrix and output projection matrix of the multi-head self-attention module; the weight matrix and bias vector of the feedforward network module; and the scaling parameters and offset parameters in the layer normalization operation.

12. The method according to claim 1, characterized in that constructing the medical corpus data comprises: obtaining original general corpus data and original medical corpus data in medicine, pharmacy and public health; cleaning the original general corpus data using a data cleaning algorithm to generate general corpus data; and mixing the general corpus data and the original medical corpus data according to a set ratio to generate the medical corpus data.

13. The method according to claim 12, characterized in that training the medical corpus data on the medical continued pre-training model based on the KL-restricted loss function to generate a medical knowledge enhancement model comprises: training the medical corpus data of each batch on the medical continued pre-training model based on the KL-restricted loss function to generate the medical knowledge enhancement model, wherein each batch of medical corpus data includes randomly sampled general corpus data and the original medical corpus data, mixed according to the set ratio.

14. A training device for a medical knowledge enhancement model, characterized by comprising: a first construction module, used to construct medical corpus data; a second construction module, used to extend, based on the Upcycling technique, the original Transformer layers of an obtained general pre-trained model to construct a medical continued pre-training model with a larger number of parameters, the general pre-trained model including N original Transformer layers, where N>3; and a generation module, used to train the medical corpus data on the medical continued pre-training model based on a KL-restricted loss function to generate a medical knowledge enhancement model.

15. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program is executed, the device containing the storage medium is controlled to perform the training method for the medical knowledge enhancement model according to any one of claims 1 to 13.

16. A computer device comprising a memory and a processor, the memory for storing information including program instructions, and the processor for controlling the execution of the program instructions, characterized in that, when the program instructions are loaded and executed by the processor, they implement the steps of the training method for the medical knowledge enhancement model according to any one of claims 1 to 13.