A method for constructing a Mongolian medicine large language model combined with a knowledge graph
By constructing an expanded Mongolian vocabulary, optimizing multilingual weights, and combining a knowledge graph with RAG technology, the method addresses the low digitization level and insufficient research depth of Mongolian medicine large language models. It provides a computational representation of Mongolian medicine's distinctive theories and a deep collaborative mechanism among these technologies, and it constrains the model's output to the ethics and values of the Mongolian medicine field.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INNER MONGOLIA UNIVERSITY
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing large medical models fail to fully reflect the theoretical characteristics and diagnostic features of Mongolian medicine. The Mongolian medicine knowledge system is poorly digitized, and technology is applied to it only shallowly: knowledge graphs, large models, and RAG technology have not been integrated in a Mongolian medicine large language model. Mongolian script recognition accuracy is low, labeled data is scarce, and research on Mongolian medicine large language models lacks depth.
An expanded Mongolian vocabulary is constructed, with the training data first filtered by heuristic rules combined with a binary classification model. Through multilingual weight optimization and LoRA incremental pre-training, combined with knowledge graph and RAG technology, a large language model for Mongolian medicine is built. The method includes constructing a Mongolian medicine knowledge graph and reinforcement learning with a Mongolian medicine ethics penalty term.
The method improves the digitization level and construction efficiency of the Mongolian medicine knowledge system, realizes a computational representation of the unique theoretical system of Mongolian medicine, resolves the incompatibility between existing models and Mongolian medicine theories, and constructs a deep collaborative mechanism of knowledge graph, large language model, and RAG technology, so that the model output conforms to the ethics and values of the Mongolian medicine field.
Figure CN122242732A_ABST
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence technology, and in particular relates to a method for constructing a large language model of Mongolian medicine that combines knowledge graphs.
Background Technology
[0002] Mongolian medicine, as an important branch of the traditional Chinese medicine system, embodies thousands of years of medical practice and wisdom accumulated by the Mongolian people. However, existing large-scale medical models are mostly based on modern medical theories or mainstream traditional medical paradigms. Their technical architecture and algorithm design fail to fully reflect the theoretical characteristics and diagnostic and treatment features of Mongolian medicine. This makes it difficult to integrate the unique diagnostic and treatment knowledge of Mongolian medicine into modern intelligent medical systems, forming a technical bottleneck that restricts its modernization and knowledge dissemination.
[0003] In recent years, the digitization of Mongolian medicine has gradually attracted academic attention, with existing research mainly focusing on the digitization of literature resources and the construction of knowledge graphs. Regarding literature digitization, some scholars have dedicated themselves to the electronic conversion of traditional Mongolian ancient books, using Optical Character Recognition (OCR) technology to transform Mongolian medical classics into digital text. However, due to the complex character structure and scarce annotated corpus of traditional Mongolian script, existing OCR technology still has significant room for improvement in the accuracy of ancient book recognition, making it difficult to meet the needs of large-scale literature digitization. In the field of knowledge graph construction, researchers have attempted to construct knowledge graphs based on the Mongolian medicine theoretical system, incorporating elements such as the "Three Roots Theory," disease diagnosis, and prescription compatibility. However, these studies mostly remain at the conceptual model level, lacking sufficient semantic modeling of the dynamic diagnostic and treatment logic of Mongolian medicine, and thus failing to effectively support intelligent diagnosis and treatment applications. The training paradigms of existing medical large-scale language models are mostly based on modern medical or traditional Chinese medicine theories, whose knowledge systems are fundamentally different from the unique "Three Roots Theory" and characteristic diagnostic and treatment methods of Mongolian medicine, making direct transfer and application to the field of Mongolian medicine difficult. Furthermore, research on large language models for minority languages is still in its early stages, especially the construction of large medical models based on traditional Mongolian script, which is still a blank.
[0004] The current technology of Mongolian medicine large language models mainly faces the following challenges:
(1) Low level of digitization of the knowledge system
In the field of Mongolian medicine knowledge inheritance and application, the digitization of the knowledge system is a crucial foundation for its efficient dissemination, in-depth exploration, and intelligent application. However, Mongolian medicine knowledge is mostly scattered in ancient books, the experience passed down by veteran physicians, and fragmented clinical records. The digitization process faces numerous challenges, including inconsistent terminology, fragmented content, and the difficulty in standardizing highly specialized information. This results in a limited scale and inconsistent quality of digitized Mongolian medicine knowledge, failing to form a complete and systematic digital knowledge system, and severely restricting the training effect and application value of the Mongolian medicine large language model. Therefore, improving the digitization level of the Mongolian medicine knowledge system and constructing a comprehensive, standardized, and high-quality digital knowledge resource repository is an important and urgent task in this field.
[0005] (2) Insufficient depth of technology application
In the research and application of the Mongolian medicine large language model, although some advanced technologies have been introduced, the depth of technology integration and application remains insufficient: knowledge graphs, large models, and RAG (Retrieval-Augmented Generation) technology have not been effectively combined. Knowledge graphs provide structured representations and associations for Mongolian medicine knowledge, large models possess powerful semantic understanding and generation capabilities, and RAG technology lets the model retrieve domain-specific knowledge accurately. Applying these three technologies in isolation prevents the Mongolian medicine large language model from realizing its full potential. Relying solely on large models may lead to factual errors in the generated content; relying solely on knowledge graphs lacks flexible natural language interaction capabilities; and RAG technology, without the support of large models and knowledge graphs, cannot achieve deep knowledge reasoning and generation. Therefore, how to break down technological barriers, achieve deep integration of knowledge graphs, large models, and RAG technology, and construct a collaborative and efficient Mongolian medicine large language model is the core challenge currently facing the field.
Summary of the Invention
[0006] The purpose of this invention is to address the problems of low digitization of the Mongolian medicine knowledge system and low research depth of existing Mongolian medicine language models.
[0007] To achieve the above objectives, the present invention adopts the following technical solution: a method for constructing a large language model for Mongolian medicine that combines knowledge graphs, including the following steps:
S1. Collect Mongolian data and preprocess it using heuristic rules and a binary classification model.
S2. Process the Mongolian data with the BPE algorithm to generate a Mongolian vocabulary, and process Mongolian medicine professional data with natural language processing methods to obtain a professional corpus, thereby constructing an expanded Mongolian vocabulary.
S3. Optimize the proportion of training data for the embedding layer of the large language model using the multilingual weight optimization objective function. Using Mongolian data combined with Chinese and English data, optimize and incrementally pre-train the large language model with LoRA. Dynamically adjust the proportion of each language through curriculum learning, and adjust the weight coefficient of each language's loss term in the loss function of the large language model through a dynamic weight adjustment strategy.
S4. Construct Chinese-Mongolian and English-Mongolian parallel corpora, and introduce an alignment learning objective function based on KL divergence to train the model to achieve semantic alignment between the high-resource languages and Mongolian.
S5. Obtain Mongolian dialogue data, and perform supervised fine-tuning of the large language model on it to improve the model's instruction comprehension and dialogue capabilities in Mongolian, Chinese, and English.
S6. Adopt a two-stage supervised fine-tuning strategy, using Mongolian medicine encyclopedia knowledge and Mongolian medicine clinical data, with LoRA performing domain-specific optimization of the large language model.
S7. Obtain Mongolian preference data, and use the improved DPO algorithm, which incorporates a Mongolian medicine ethics penalty term, to fine-tune the large language model through reinforcement learning so that its output conforms to values and domain ethics.
S8. Use the BERT model to extract entities and attributes from the Mongolian medicine professional data. Construct semantic relationships between entities through rule matching and the RoBERTa relationship classification model. Fuse the entities and construct triples comprising the fused entities and their semantic relationships, generating a Mongolian medicine knowledge graph. Store the graph in a graph database and build a vector index to support retrieval.
S9. Using the knowledge graph constructed in step S8 as an external knowledge source, use RAG technology to provide factual evidence for the large language model. Combined with a re-ranking mechanism that improves retrieval accuracy, this forms a domain-specific large language model for Mongolian medicine question answering and generation.
[0008] Furthermore, step S1 specifically includes:
(1) Use heuristic rules to perform coarse screening of the Mongolian data;
(2) Screen the coarsely screened Mongolian data a second time with the binary classification model. A portion of the samples is randomly selected from the coarsely screened Mongolian data and scored by professionals according to quality standards. Based on the scoring results, the samples are divided into positive and negative samples, thus constructing labeled data for training the binary classification model. The training objective function of the binary classification model is as follows:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
where $\theta$ represents the parameters of the binary classification model, $N$ represents the total number of samples, $y_i$ is the true label of sample $i$, and $p_i$ is the probability that the binary classification model predicts sample $i$ as high-quality Mongolian data.
During training, cross-validation is used to evaluate the performance of the binary classification model, and the model is continuously optimized based on the evaluation results. After training, the binary classification model further filters the coarsely screened Mongolian data.
[0009] Furthermore, step S2 specifically involves: the preprocessed Mongolian data is pre-segmented on syllable boundaries using BPE, and a Mongolian vocabulary is generated with the iterative BPE algorithm. A professional corpus is obtained from Mongolian medicine professional data by word frequency statistics or the TF-IDF method and added to the Mongolian vocabulary to obtain the Mongolian vocabulary expansion table. Feature selection on the Mongolian vocabulary expansion table is performed using TF-IDF and mutual information methods.
[0010] Furthermore, the objective function for multilingual weight optimization in step S3 is:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, y_i^{mn}\log p_i^{mn} + \beta\, y_i^{zh}\log p_i^{zh} + \gamma\, y_i^{en}\log p_i^{en}\right]$$
where $\theta$ represents the model parameters and $N$ is the total number of samples; $y_i^{mn}$, $y_i^{zh}$, $y_i^{en}$ are the true labels of the sample in Mongolian, Chinese, and English, respectively; $p_i^{mn}$, $p_i^{zh}$, $p_i^{en}$ are the probabilities the model predicts for the sample in Mongolian, Chinese, and English, respectively; and the weighting coefficients $\alpha$, $\beta$, and $\gamma$ can be flexibly adjusted according to actual needs.
[0011] Furthermore, in step S3, LoRA is used to optimize the large language model, specifically as follows: LoRA introduces a low-rank decomposition increment matrix into the Q / K / V matrices of the attention layers and into the FFN layers of the large language model, updating the original weight matrix:
$$h = Wx + \Delta W x = Wx + BAx$$
where $W \in \mathbb{R}^{d \times k}$ is the original weight matrix, $h$ is the output computed with the updated weights, $d$ is the hidden layer dimension, $r$ is the rank, $k$ is the input dimension, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices, and $x$ is the input.
[0012] Furthermore, the dynamic weight adjustment strategy in step S3 is as follows: the weight coefficients of Mongolian, Chinese, and English in the $t$-th training phase, $\alpha_t$, $\beta_t$, $\gamma_t$, are obtained from the initial weight coefficients $\alpha_0$, $\beta_0$, $\gamma_0$ and the decay factors $\lambda_\alpha$, $\lambda_\beta$, $\lambda_\gamma$, which control the rate at which the weights decay during training. The strategy adjusts the weight coefficients $\alpha$, $\beta$, $\gamma$ of each language's loss term at the level of the loss function of the large language model, so that training responds in real time to the model's mastery of each language at different training stages.
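The exact decay formula is not recoverable from the text here, so the following is only a minimal sketch under one plausible reading: each weight decays exponentially from its initial value. The exponential form, the function names, and all numeric values are assumptions for illustration, not the patent's formula.

```python
import math

def decayed_weight(initial, decay, phase):
    """Weight at a training phase, ASSUMING exponential decay (not from the source)."""
    return initial * math.exp(-decay * phase)

def language_weights(initial, decay, phase):
    """Per-language loss weights for one training phase."""
    return {lang: decayed_weight(initial[lang], decay[lang], phase) for lang in initial}

# Hypothetical values: Mongolian (mn) starts high and decays slowly.
initial = {"mn": 0.6, "zh": 0.2, "en": 0.2}
decay = {"mn": 0.05, "zh": 0.3, "en": 0.3}

weights_phase0 = language_weights(initial, decay, 0)
weights_phase3 = language_weights(initial, decay, 3)
```

With these placeholder settings the Chinese and English weights shrink faster than the Mongolian weight, so later phases emphasize Mongolian; a real schedule would be tuned against validation loss per language.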
[0013] Furthermore, step S4 specifically involves:
(1) Construct Chinese-Mongolian and English-Mongolian parallel data. The parallel data is obtained from translated texts, bilingual news reports, and multilingual documents.
(2) Introduce the alignment learning objective function $L$ into the large language model:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, L_{mn}^{(i)} + \beta\, L_{zh}^{(i)} + \gamma\, L_{KL}^{(i)}\right]$$
where $\theta$ indicates the model parameters and $N$ is the total number of samples; $L_{KL}$ is a loss function that measures the difference between the probability distributions of Chinese and Mongolian; $L_{mn}$ and $L_{zh}$ are the loss functions for Mongolian and Chinese, respectively; KL denotes the Kullback-Leibler divergence, used to measure the semantic alignment between the high-resource language and Mongolian; and the weight coefficients $\alpha$, $\beta$, and $\gamma$ can be flexibly adjusted according to actual needs.
(3) Input the parallel data into the large language model for pre-training, so that the large language model learns the semantic alignment relationship between different languages in depth.
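As an illustration of how a KL-based alignment term can be combined with per-language losses, here is a minimal sketch. The direction of the divergence (Chinese distribution relative to Mongolian), the function names, and the default weights are assumptions, since the text does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def alignment_loss(loss_mn, loss_zh, p_zh, p_mn, alpha=1.0, beta=1.0, gamma=0.5):
    """Weighted sum of the two language losses and the KL alignment term.

    p_zh and p_mn are the model's output distributions for the two sides of
    one parallel sentence pair (hypothetical inputs for this sketch).
    """
    return alpha * loss_mn + beta * loss_zh + gamma * kl_divergence(p_zh, p_mn)

same = alignment_loss(1.0, 2.0, [0.5, 0.5], [0.5, 0.5])
different = alignment_loss(1.0, 2.0, [0.9, 0.1], [0.5, 0.5])
```

When the two distributions agree, the alignment term vanishes and only the language losses remain; any disagreement adds a positive penalty scaled by the third weight.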
[0014] Furthermore, step S5 specifically involves:
(1) Collect Chinese and English dialogue data and translate them into Mongolian to obtain Mongolian dialogue data;
(2) Preprocess the Mongolian dialogue data;
(3) Using the selected Mongolian dialogue data, fine-tune the large language model in a supervised manner by optimizing the objective function, represented as:
$$L(\theta) = \alpha\, L_{mn}(\theta; D) + \beta\, L_{zh}(\theta; D) + \gamma\, L_{en}(\theta; D)$$
where $\theta$ indicates the model parameters; $D$ is the training dataset, which contains dialogue data in Mongolian, Chinese, and English; $L_{mn}$, $L_{zh}$, $L_{en}$ are the loss functions for Mongolian, Chinese, and English, respectively; and $\alpha$, $\beta$, and $\gamma$ are weighting coefficients, which can be adjusted according to actual needs.
[0015] Furthermore, step S7 specifically includes:
(1) Collect Chinese user preference data and preference data involving national policies, and translate both into Mongolian to obtain Mongolian preference data;
(2) Preprocess the Mongolian preference data using heuristic rules;
(3) Use the improved DPO algorithm to train the large language model directly on the Mongolian preference data. The improved DPO algorithm adds a Mongolian medicine ethics penalty term to the original DPO loss function, giving the optimized loss function:
$$L = -\log \sigma(r_w - r_l) + \lambda P$$
where $\sigma$ is the sigmoid function; $r_w$ and $r_l$ are the capability parameters of the two objects being compared; $P$ is the function for verifying the ethics of Mongolian medicine: if the output of the large language model violates the rules of Mongolian medicine, then $P = 1$, triggering a penalty, and if it conforms to the rules, $P = 0$, with no penalty; and $\lambda$ is the penalty coefficient.
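A minimal sketch of a DPO-style loss with an added rule-violation penalty follows. It uses the standard DPO margin (policy versus reference log-probabilities of the preferred and rejected responses); the `violates` flag stands in for a hypothetical Mongolian medicine rule checker, and `beta` and `lam` are illustrative values, none of which come from the source.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss_with_penalty(logp_w, logp_l, ref_logp_w, ref_logp_l,
                          violates, beta=0.1, lam=1.0):
    """Standard DPO loss plus lam * P, where P is 1 on a rule violation, else 0."""
    # Implicit reward margin between the preferred (w) and rejected (l) response.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    penalty = 1.0 if violates else 0.0
    return -log_sigmoid(margin) + lam * penalty

base = dpo_loss_with_penalty(0.0, 0.0, 0.0, 0.0, violates=False)
penalized = dpo_loss_with_penalty(0.0, 0.0, 0.0, 0.0, violates=True)
```

In this sketch the penalty is a constant with respect to the model parameters for a fixed sample, so in practice it would need to be tied to the probability of the violating output (or used to reweight samples) to affect gradients; the source does not say which design it uses.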
[0016] Furthermore, step S9 specifically includes: RAG breaks down the question-answering task into two phases: retrieval and generation; During the retrieval phase, the user's input query is converted into a semantic vector using vectorization technology, and then matched with document fragments in the vector database of the knowledge graph in step S8 to quickly locate the most relevant knowledge fragments. During the generation phase, the large language model combines the retrieved contextual information with the original query to generate structured or natural language output. Using vector indexing, the Top-K candidate knowledge fragments that are semantically related to the query are quickly retrieved through the approximate nearest neighbor algorithm. Then, a cross encoder is used to perform deep interactive calculations on the query and each candidate fragment to obtain a more accurate relevance score. Based on this score, the Top-K results are sorted in descending order, and finally, the relevant fragments are fed into the generative model.
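The retrieve-then-rerank flow can be sketched as follows. Exhaustive cosine search stands in for the approximate nearest neighbor index, and a token-overlap score stands in for the cross encoder; both substitutions, the toy vectors, and all function names are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, index, k):
    """index: list of (fragment_text, vector). Exhaustive search stands in for ANN."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

def overlap_score(query, fragment):
    """Toy stand-in for a cross encoder: share of query tokens found in the fragment."""
    q = set(query.lower().split())
    f = set(fragment.lower().split())
    return len(q & f) / len(q) if q else 0.0

def rerank(query, candidates, score_fn):
    """Sort candidate fragments by relevance score, highest first."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)

index = [
    ("Qingxila Decoction is a treatment term", [0.9, 0.1, 0.0]),
    ("Badagan is a basic theoretical term", [0.2, 0.9, 0.1]),
    ("Shira is a basic theoretical term", [0.8, 0.3, 0.1]),
]
candidates = retrieve_top_k([1.0, 0.2, 0.0], index, k=2)
ranked = rerank("what is Shira", candidates, overlap_score)
```

The first stage is cheap and recall-oriented; the second stage scores each query-fragment pair jointly, which is why it can reorder candidates that vector similarity alone ranked differently.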
[0017] The beneficial effects of this invention are as follows: 1. This invention solves the technical problems of Mongolian digitization, improving the efficiency and quality of constructing a Mongolian medicine knowledge system: Addressing the low accuracy of Mongolian word recognition and the scarcity of labeled corpora in the background technology, which leads to low digitization levels, this invention effectively solves the fragmentation problem of Mongolian word segmentation by constructing a domain-adaptive Mongolian word expansion table and a syllable boundary-based pre-segmentation method. Employing a data preprocessing scheme combining heuristic rules and a binary classification model, it efficiently extracts 20GB of high-quality Mongolian corpus from 30GB of raw data, significantly improving the digitization efficiency of Mongolian medicine literature and laying the foundation for constructing a systematic digital knowledge system, even in the absence of a large amount of labeled data.
[0018] 2. This invention achieves a computational representation of the unique theoretical system of Mongolian medicine, resolving the incompatibility between existing models and Mongolian medicine theory: By constructing an expanded Mongolian glossary containing basic theoretical terms, diagnostic terms, and treatment terms, and employing a feature selection method combining TF-IDF and mutual information, the accurate extraction of professional terms is ensured. Combined with a two-stage Mongolian medicine-specific fine-tuning strategy, the model can deeply understand the unique theoretical system and diagnostic and treatment logic of Mongolian medicine, achieving effective simulation of the dynamic reasoning process.
[0019] 3. A deep collaborative mechanism integrating knowledge graphs, large language models, and RAG technology was constructed, overcoming the deficiency of insufficient depth in technology application: This invention provides structured knowledge support by constructing a Mongolian medicine knowledge graph, uses RAG technology to achieve dynamic knowledge retrieval, and combines the semantic understanding capabilities of large language models to form a complete technical closed loop of "knowledge storage-retrieval-understanding-generation". This deep collaborative mechanism not only solves the factual error problem of large language models, but also breaks through the rigid limitations of traditional knowledge graphs, achieving the organic unity of knowledge reasoning and natural language generation.
[0020] 4. A multilingual collaborative training paradigm for low-resource languages was established, addressing the gap in research on large language models for Mongolian: Addressing the current lack of research on large language models for minority languages in the background, this invention innovatively proposes a multilingual weight optimization method based on curriculum learning. By dynamically adjusting the training ratios of Mongolian, Chinese, and English, it enhances Mongolian capabilities while maintaining multilingual performance. Combined with semantic alignment learning between high-resource languages and Mongolian, effective knowledge transfer is achieved, providing a feasible technical path for constructing large language models for low-resource languages.
[0021] 5. A value alignment scheme that conforms to the characteristics of Mongolian medicine has been formed, ensuring the reliable application of the technology in professional medical scenarios: In response to the lack of ethical constraints in the background technology of Mongolian medicine, this invention designs a DPO algorithm that integrates ethical penalty items of Mongolian medicine, transforming professional rules such as "contraindications of prescriptions" and "matching of constitution and treatment" into a computable form, so that the model strictly adheres to domain ethics while following human preferences, providing an important guarantee for the safe application of Mongolian medicine intelligent systems in real medical environments.
Description of the Drawings
[0022] Figure 1 is a flowchart of the method provided by the present invention; Figure 2 is a diagram of the key decision matching rate calculation provided by the present invention.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0024] The application principle of the present invention will be further described below with reference to the accompanying drawings and specific embodiments.
[0025] Referring to Figure 1, a method for constructing a large language model for Mongolian medicine that combines knowledge graphs includes the following steps:
S1. Data Acquisition and Preprocessing
S101. Data Acquisition
Mongolian data was collected. This data includes news reports, online texts, text data obtained through cooperation with Mongolian-language educational institutions and cultural departments, Mongolian medicine encyclopedic knowledge, and clinical data related to Mongolian medicine. Approximately 30GB of Mongolian data was collected through multiple channels. For domain-specific tuning on Mongolian medicine, 1.5GB of Mongolian medicine encyclopedic knowledge from various encyclopedia websites and clinical data from the Inner Mongolia International Mongolian Medicine Hospital were acquired. The main sources and related data volumes of the Mongolian data are shown in Table 1.
[0026] Table 1. Statistical information on the main sources of the dataset
S102. Data Preprocessing
After data collection was completed, a method combining heuristic rules and a binary classification model was used to preprocess the Mongolian data. The specific method is as follows:
[0027] (1) Heuristic rules are used to perform coarse screening of Mongolian data, and invalid text fragments that are too short (e.g., less than 5 words), interference texts in non-Mongolian languages, and repeated redundant text fragments are accurately removed, so as to initially ensure the purity and relevance of Mongolian data.
[0028] (2) The data is screened a second time with the binary classification model. Specifically, a portion of the samples is randomly selected from the coarsely screened Mongolian data and scored by professionals according to strict quality standards. The maximum score is 10 points, and higher-quality Mongolian data receives a higher score. Based on the scoring results, the samples are divided into positive samples (high-quality Mongolian data) and negative samples (low-quality Mongolian data) to construct labeled data for training the binary classification model. The training objective function of the binary classification model is as follows:
[0029] $$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
where $\theta$ represents the parameters of the binary classification model, $N$ represents the total number of samples, $y_i$ is the true label of sample $i$ (1 for high-quality Mongolian data, 0 for low-quality Mongolian data), and $p_i$ is the probability that the binary classification model predicts sample $i$ as high-quality Mongolian data. During training, methods such as cross-validation are used to evaluate the performance of the binary classification model, and the model is continuously optimized based on the evaluation results so that it can filter data reliably. After training, the binary classification model is applied to the coarsely screened Mongolian data: using the features learned during training, it judges each piece of Mongolian data and eliminates the pieces predicted to be low-quality. Through coarse screening with heuristic rules and fine screening with the binary classification model, the quality of the Mongolian data is significantly improved. On this basis, professionals manually proofread the filtered corpus to ensure the accuracy and consistency of the Mongolian data. After data preprocessing, 20GB of high-quality Mongolian data was obtained from the 30GB of collected Mongolian data.
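The two-stage screening described in S102 can be sketched as below. This is an illustrative outline only: the coarse filter covers just the length and duplicate rules (the non-Mongolian language check is omitted), `predict_quality` is a hypothetical callable standing in for the trained classifier, and the 0.5 threshold is an assumption.

```python
import math

def heuristic_filter(texts, min_words=5):
    """Coarse screening: drop fragments shorter than min_words and exact duplicates."""
    seen, kept = set(), []
    for text in texts:
        normalized = " ".join(text.split())
        if len(normalized.split()) < min_words or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

def classifier_filter(texts, predict_quality, threshold=0.5):
    """Fine screening: keep texts whose predicted quality reaches the threshold."""
    return [t for t in texts if predict_quality(t) >= threshold]

coarse = heuristic_filter(["a b c d e", "a b c d e", "too short", "a  b c d e f"])
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

Running the cheap rules first keeps the classifier's workload down, and the cross-entropy function is what a training loop would minimize on the professionally scored samples.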
[0030] S2. Construction of the Mongolian Vocabulary In the current field of open-source large language models, BBPE segmentation is widely used. However, due to the Unicode encoding characteristics of Mongolian, BBPE segmentation often results in overly fragmented segmentation results when processing Mongolian text, severely impacting the model's encoding / decoding performance and its ability to understand and generate Mongolian text. To address this issue, this invention employs the Gemma2-9B-it model (Gemma 2 for short). Although the basic BPE segmentation framework of this model is disclosed in an existing patent (CN110674646A), it offers superior adaptability and stability in Mongolian text processing scenarios, thus alleviating the fragmentation problem of Mongolian text segmentation at the application level.
[0031] As shown in Table 2, this invention compares the vocabulary compression ratio (i.e., the average number of tokens required per word) of several open-source large language models. The results show that Gemma2-9B-it's BPE segmentation method performs outstandingly in Mongolian language processing tasks, with significantly better stability of segmentation results than other models, and can effectively avoid excessive fragmentation of Mongolian language segmentation.
[0032] Table 2. Lexical compression ratio of large language models
To further optimize word segmentation for Mongolian (Mongolian medicine) text, improve the efficiency of training and inference on professional text, and enhance the model's ability to represent features of the Mongolian medicine field, this invention performs domain-adaptive vocabulary expansion on the Gemma2-9B-it BPE framework. The specific scheme is as follows:
(1) Construction of a general vocabulary and expansion with specialized terminology: When constructing the Mongolian BPE vocabulary, BPE is first used to pre-segment the 20GB of preprocessed Mongolian data on syllable boundaries. An iterative BPE algorithm is then applied with a target of 4000 lexical units: it repeatedly counts the frequency of adjacent byte pairs and merges the most frequent pair each time, finally generating a Mongolian vocabulary of 4000 entries. Then, from 1.5GB of Mongolian medicine professional data, the 1000 most frequent professional terms are obtained through natural language processing methods such as word frequency statistics and TF-IDF (term frequency-inverse document frequency), and these terms are added to the Mongolian vocabulary to obtain the Mongolian vocabulary expansion table.
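The iterative merge loop at the heart of BPE can be sketched as follows. The toy corpus uses Latin placeholder symbols rather than Mongolian syllables, and the loop is bounded by a number of merges instead of a target vocabulary size; both are simplifications for illustration. Each corpus key is one pre-segmented unit, so merges never cross unit boundaries.

```python
from collections import Counter

def pair_counts(corpus):
    """corpus maps a tuple of symbols (one pre-segmented unit) to its frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of pair with the concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent pair; return merges and corpus."""
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

toy_corpus = {("a", "b", "c"): 5, ("a", "b", "d"): 3, ("b", "c"): 2}
merges, final_corpus = learn_bpe(toy_corpus, num_merges=2)
```

A production tokenizer would also record the resulting vocabulary and stop once it reaches the target size (4000 units in the text above).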
[0033] (2) Feature selection: TF-IDF and mutual information are used to select features from the Mongolian vocabulary expansion table, taking into account both the frequency weight of a term and its domain relevance, to improve the accuracy and representativeness of the newly added vocabulary. The main purpose of feature selection is to select terms highly relevant to the field of Mongolian medicine and avoid interference from general vocabulary. TF-IDF computes the weight of a term in the professional corpus, and mutual information measures the strength of association between a term and the specific domain of "Mongolian medicine". The calculation requires two statistics: 1) the probability of the term appearing in the Mongolian medicine professional data; 2) the probability of the term appearing in the general Mongolian data. The formula simplifies to:
[0034] $$\mathrm{PMI}(w) = \log\frac{P_{med}(w)}{P_{gen}(w)}$$
where $P_{med}(w)$ is the probability of term $w$ in the Mongolian medicine professional data and $P_{gen}(w)$ is its probability in the general Mongolian data. Words with higher PMI values are more closely related to the Mongolian medicine field and are more likely to be specialized terms within that field (e.g., "Hei") rather than general vocabulary (e.g., "de" or "shi"). Combining TF-IDF with mutual information filters out words that are frequent but not domain-specific.
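A minimal sketch of the domain-relevance score as a log ratio of term probabilities in the two corpora. The token lists are made-up toy data built from the example words in the text, and the smoothing constant `eps` is an assumption added so that unseen terms do not cause a division by zero.

```python
import math
from collections import Counter

def term_probabilities(tokens):
    """Relative frequency of each term in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def domain_pmi(term, p_domain, p_general, eps=1e-9):
    """log(P_domain(term) / P_general(term)), smoothed by eps for unseen terms."""
    return math.log((p_domain.get(term, 0.0) + eps) / (p_general.get(term, 0.0) + eps))

# Toy data for illustration only.
domain_tokens = ["Hei", "Shira", "Hei", "de", "Badagan", "Hei"]
general_tokens = ["de", "shi", "de", "Hei", "shi", "de", "shi", "de", "shi", "de"]

p_domain = term_probabilities(domain_tokens)
p_general = term_probabilities(general_tokens)
pmi_hei = domain_pmi("Hei", p_domain, p_general)
pmi_de = domain_pmi("de", p_domain, p_general)
```

A positive score means the term is relatively more frequent in the domain corpus, a negative one that it is mostly general vocabulary; a selection step would keep terms above some chosen cutoff.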
[0035] (3) The vocabulary in the Mongolian glossary is classified into basic theoretical terms, diagnostic terms, and treatment terms, as follows:
Basic theoretical terms: Hei, Shira, Badagan;
Diagnostic terms: deep and rapid pulse, white and greasy tongue coating;
Treatment terms: Qingxila Decoction, Qubada Gan Powder.
[0036] During vocabulary expansion, the 5,000 entries in the Mongolian vocabulary expansion table (the 4000 general entries plus the 1000 professional terms) were initialized with mean values based on the Gemma2-9B-it model. A vocabulary of this size avoids both over-expansion, which would degrade model performance because too little professional pre-training data is available for fine-tuning, and under-expansion, which would fail to cover the core terms of Mongolian medicine.
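One common reading of mean-value initialization is that each newly added token's embedding row starts as the mean of the existing embedding rows. The sketch below shows that reading with plain lists; whether the patent averages the whole embedding matrix or only the sub-token embeddings of each new term is not stated, so this choice is an assumption.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

def extend_embeddings(embeddings, num_new_tokens):
    """Append num_new_tokens rows, each initialized to the mean of the existing rows."""
    mean = mean_vector(embeddings)
    # Copy the mean for every new row so later updates stay independent.
    return embeddings + [list(mean) for _ in range(num_new_tokens)]

existing = [[1.0, 2.0], [3.0, 4.0]]
extended = extend_embeddings(existing, num_new_tokens=2)
```

Starting new rows at the mean keeps them inside the distribution the pretrained model already expects, which tends to make early fine-tuning steps more stable than random initialization.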
[0037] S3, Mongolian Incremental Pre-training (1) After the Mongolian vocabulary is expanded, the word vectors must adapt to Mongolian without degrading performance in other languages. To this end, this invention fine-tunes the word vectors on a proportional mixture of Mongolian, Chinese, and English. Specifically, the proportion of the three languages in the training data is adjusted through a multilingual weight optimization objective function, so that the large language model learns Mongolian feature representations in depth without catastrophic forgetting of Chinese and English. The multilingual weight optimization objective function $L(\theta)$ is as follows:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, y_i^{mn}\log p_i^{mn} + \beta\, y_i^{zh}\log p_i^{zh} + \gamma\, y_i^{en}\log p_i^{en}\right]$$
[0038] where $\theta$ represents the model parameters and $N$ is the total number of samples; $y_i^{mn}$, $y_i^{zh}$, $y_i^{en}$ are the true labels of sample $i$ in Mongolian, Chinese, and English, respectively; and $p_i^{mn}$, $p_i^{zh}$, $p_i^{en}$ are the probabilities predicted by the model for the Mongolian, Chinese, and English samples, respectively. The weighting coefficients $\alpha$, $\beta$ and $\gamma$ can be flexibly adjusted according to actual needs to achieve the best balance between the languages.
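A minimal sketch of a language-weighted objective follows, assuming it takes the form of a weighted negative log-likelihood averaged over all samples. The language codes `mn`, `zh`, `en`, the function name, and the toy probabilities are hypothetical.

```python
import math

def multilingual_loss(samples, alpha, beta, gamma):
    """Weighted negative log-likelihood averaged over all samples.

    Each sample is (language, probability the model assigns to the true label).
    """
    weights = {"mn": alpha, "zh": beta, "en": gamma}
    total = 0.0
    for lang, p_true in samples:
        total += -weights[lang] * math.log(p_true)
    return total / len(samples)

batch = [("mn", 0.5), ("zh", 0.8), ("en", 0.9), ("mn", 0.25)]
loss_balanced = multilingual_loss(batch, 1.0, 1.0, 1.0)
loss_mn_heavy = multilingual_loss(batch, 2.0, 1.0, 1.0)
```

Raising the Mongolian coefficient increases the contribution of Mongolian errors to the total loss, so gradient updates prioritize that language while the Chinese and English terms remain in the objective.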
[0039] (2) In order for the entire large language model to understand and generate Mongolian text more accurately, the overall parameters of the large language model must be fine-tuned in Mongolian. Considering the complexity and computational cost of full parameter fine-tuning, this invention chooses to use LoRA for efficient parameter fine-tuning, thereby quickly and efficiently optimizing the performance of the Mongolian large language model.
[0040] The word vector fine-tuning performed during construction and optimization of the Mongolian vocabulary adapts only the static embedding layer, that is, it optimizes only the vector representations of words. The incremental pre-training in this part, by contrast, is dynamic learning over the entire model: parameters are adjusted through LoRA so that the model masters Mongolian grammar, semantics, and generation rules. The Gemma2-9B-it large language model consists of an embedding layer, multiple Transformer blocks (each including self-attention layers and FFN layers), and an output layer.
[0041] This invention employs LoRA (Low-Rank Adaptation), which injects a low-rank decomposition increment matrix into the Q / K / V matrices of the Attention layer and into the FFN layer of the Gemma2-9B-it large language model, rather than updating all the original parameters. Specifically, for the original weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update process is represented as follows:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
[0042] where $h$ is the output computed with the updated weights, $d$ is the hidden layer dimension, $r$ is the rank, and $k$ is the input dimension. $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices (rank $r \ll \min(d, k)$), and $x$ is the input. During training, only $B$ and $A$ are optimized, while $W_0$ remains frozen. This method reduces the number of trainable parameters, lowers computational cost and memory consumption, helps avoid catastrophic forgetting, and efficiently injects Mongolian language capability.
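The low-rank update can be sketched in a few lines, assuming the usual LoRA form h = W0 x + B(A x) with W0 of shape d x k, B of shape d x r, and A of shape r x k. The toy dimensions and the zero initialization of B (common LoRA practice, not stated in the text) are assumptions.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w0, b, a, x):
    """h = W0 x + B (A x); W0 stays frozen, only B and A would be trained."""
    base = matvec(w0, x)
    delta = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, delta)]

d, k, r = 3, 4, 1
w0 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0]]          # frozen weights, d x k
a = [[0.5, 0.5, 0.5, 0.5]]           # trainable, r x k
b_zero = [[0.0], [0.0], [0.0]]       # d x r, zero so training starts from W0
b_trained = [[1.0], [0.0], [-1.0]]   # d x r after some hypothetical updates
x = [1.0, 2.0, 3.0, 4.0]

h_start = lora_forward(w0, b_zero, a, x)
h_adapted = lora_forward(w0, b_trained, a, x)
trainable = d * r + r * k            # parameters in B and A
full = d * k                         # parameters in W0
```

With B at zero the adapted layer reproduces the frozen layer exactly, and the trainable parameter count grows with r(d + k) rather than d x k, which is where the savings come from at realistic dimensions.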
[0043] (3) The Mongolian language model developed in this invention supports not only Mongolian, but also English and Chinese. Therefore, while enhancing the model's understanding of Mongolian, it is essential to avoid catastrophic forgetting of English and Chinese. To this end, this invention incorporates a large amount of Chinese and English into the incremental pre-training Mongolian data and employs a curriculum learning method to gradually adjust the proportion of each language. By gradually reducing the proportion of Chinese and English, the model's Mongolian language capability is gradually strengthened while maintaining its understanding of Chinese and English. The specific steps of the curriculum learning method are as follows:
[0044] Initial stage: The ratio of Chinese, English, and Mongolian is 2:2:1. In this stage, the Gemma2-9B-it large language model strengthens its understanding of all three languages on top of its existing pre-training, ensuring good processing performance for each language.
[0045] Intermediate stage: The ratio is adjusted to 1:1:1. While maintaining its ability to process Chinese and English, the model begins to strengthen its ability to understand and generate Mongolian, so that its capabilities in the three languages develop in balance.
[0046] Final stage: The proportion of Chinese and English is gradually reduced until the ratio reaches 1:1:2. The model now focuses mainly on Mongolian, further improving its understanding and generation of Mongolian without losing its understanding of Chinese and English.
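The three curriculum stages can be expressed as a small data-mixing schedule. The sketch below uses the Chinese:English:Mongolian ratios given above; the function name `batch_counts`, the batch size, and the choice to assign rounding remainders to Mongolian are assumptions.

```python
# Chinese : English : Mongolian ratios for each curriculum stage.
STAGES = {
    "initial": {"zh": 2, "en": 2, "mn": 1},
    "intermediate": {"zh": 1, "en": 1, "mn": 1},
    "final": {"zh": 1, "en": 1, "mn": 2},
}

def batch_counts(stage, batch_size):
    """Split a batch across languages according to the stage ratio."""
    ratio = STAGES[stage]
    total = sum(ratio.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {lang: batch_size * part // total for lang, part in ratio.items()}
    # Any remainder goes to Mongolian, the target language (an assumption).
    counts["mn"] += batch_size - sum(counts.values())
    return counts

initial_mix = batch_counts("initial", 100)
middle_mix = batch_counts("intermediate", 100)
final_mix = batch_counts("final", 100)
```

Over the three stages the Mongolian share of a 100-sample batch rises from 20 to 50 while Chinese and English never drop to zero, which is the mechanism intended to prevent forgetting.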
[0047] (4) In the incremental pre-training process, in order to better balance the performance of each language, this invention adopts a dynamic weight adjustment strategy. Specifically, by dynamically adjusting the weight coefficients of each language, it is ensured that the model gives appropriate attention to each language at different stages. The weight adjustment formula is as follows:
[0048] where $\alpha_t$, $\beta_t$, $\gamma_t$ are the weight coefficients of Mongolian, Chinese, and English in the $t$-th training phase; $\alpha_0$, $\beta_0$, $\gamma_0$ are the initial weight coefficients; and $\lambda_\alpha$, $\lambda_\beta$, $\lambda_\gamma$ are decay factors that control the rate at which the weights decay during training. The dynamic weight adjustment strategy adjusts the weight coefficients $\alpha_t$, $\beta_t$, $\gamma_t$ of each language loss term at the level of the loss function of the large language model, so that the weighting responds in real time to the model's mastery of each language at different training stages.
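The exact weight adjustment formula is not reproduced in this text. As one possible reading, the sketch below assumes exponential decay, in which each weight equals its initial value multiplied by exp(-decay x phase). The exponential form, the function names, and all numeric values are assumptions.

```python
import math

def phase_weight(initial, decay, phase):
    """Assumed form: weight at a phase = initial * exp(-decay * phase)."""
    return initial * math.exp(-decay * phase)

def phase_weights(phase, initial, decay):
    """Weights of all languages at a given training phase."""
    return {
        lang: phase_weight(initial[lang], decay[lang], phase)
        for lang in initial
    }

# Hypothetical settings: Mongolian weight held constant, others decayed.
initial = {"mn": 1.0, "zh": 1.0, "en": 1.0}
decay = {"mn": 0.0, "zh": 0.5, "en": 0.5}

w_phase0 = phase_weights(0, initial, decay)
w_phase2 = phase_weights(2, initial, decay)
```

Under these settings the Chinese and English loss terms shrink relative to the Mongolian term as training proceeds, which complements the data-ratio curriculum at the level of the loss function.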
[0049] S4. Alignment Learning of High-Resource Languages and Mongolian Script To further improve the model's ability to understand Mongolian, this invention employs an alignment learning strategy between high-resource languages and Mongolian. Specifically:
[0050] (1) First, 1.5 GB of high-quality Chinese-Mongolian and English-Mongolian parallel data is constructed. Parallel data can be obtained from sources such as translated texts, bilingual news reports, and multilingual documents. Parallel data here means a collection of bilingual texts that have been professionally translated and aligned sentence by sentence, for example a Chinese sentence paired with the corresponding Mongolian sentence.
[0051] (2) An alignment learning objective function is introduced into the large language model to achieve semantic alignment between the high-resource language and Mongolian. The alignment learning objective function L is:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, L_{mn}^{(i)} + \beta\, L_{zh}^{(i)} + \gamma\, \mathrm{KL}\!\left(P_{zh}^{(i)} \,\|\, P_{mn}^{(i)}\right)\right]$$
[0052] where $\theta$ represents the model parameters and $N$ is the total number of samples; $L_{mn}$ and $L_{zh}$ are the loss functions for Mongolian and Chinese, respectively; and $\mathrm{KL}(P_{zh} \,\|\, P_{mn})$ is the Kullback-Leibler divergence, a loss term that measures the difference between the probability distributions of Chinese and Mongolian and therefore the semantic alignment between the high-resource language and Mongolian. The weight coefficients $\alpha$, $\beta$ and $\gamma$ can be flexibly adjusted according to actual needs to achieve the best alignment effect.
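A minimal sketch of a KL-based alignment term follows, assuming the objective adds a weighted Kullback-Leibler divergence between the Chinese and Mongolian output distributions to the two per-language losses. The direction of the divergence, the function names, and the toy distributions are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def alignment_loss(loss_mn, loss_zh, p_zh, p_mn, alpha, beta, gamma):
    """Weighted sum of the two language losses and the KL alignment term."""
    return alpha * loss_mn + beta * loss_zh + gamma * kl_divergence(p_zh, p_mn)

# Toy distributions for one parallel sentence pair.
p_zh = [0.7, 0.2, 0.1]
p_mn_aligned = [0.7, 0.2, 0.1]
p_mn_drifted = [0.2, 0.3, 0.5]

loss_aligned = alignment_loss(1.0, 0.5, p_zh, p_mn_aligned, 1.0, 1.0, 0.5)
loss_drifted = alignment_loss(1.0, 0.5, p_zh, p_mn_drifted, 1.0, 1.0, 0.5)
```

When the two distributions agree the KL term vanishes and only the language losses remain; any divergence between them raises the loss, pushing the Mongolian representation toward the high-resource one.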
[0053] (3) The parallel data is input into the large language model for pre-training, so that the model learns the semantic alignment between the languages in depth and achieves effective knowledge transfer.
[0054] S5, Multi-dimensional fine-tuning of instruction compliance capabilities To enable the Mongolian language model to not only perform text continuation but also engage in efficient human dialogue, this invention has fine-tuned the model's instruction-following capabilities across multiple dimensions. Specifically:
[0055] (1) Collect Chinese and English dialogue data, with a particular focus on data in the fields of Chinese-English common tasks, entity recognition, and translation, and translate them into Mongolian to obtain Mongolian dialogue data.
[0056] (2) Preprocessing the Mongolian dialogue data. First, heuristic rules are used to remove excessively short segments to ensure the usability of the data. Next, the Mongolian dialogue data is translated back into English or Chinese, and a similarity and semantic comparison analysis is performed with the original text. If the translated data has significant semantic differences from the original text, it is removed from the Mongolian data. Finally, manual screening is used to further remove dirty data to ensure the high quality of the dataset.
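The length rule and the back-translation check can be sketched as a single filter. The sketch assumes the back-translations are already available as strings and uses a character-level similarity ratio from the Python standard library as a crude stand-in for the semantic comparison; the thresholds `min_words` and `min_similarity` are hypothetical.

```python
from difflib import SequenceMatcher

def keep_sample(original, back_translation, min_words=3, min_similarity=0.6):
    """Return True if the sample passes the length rule and the round-trip check."""
    if len(original.split()) < min_words:
        return False  # heuristic rule: drop excessively short segments
    similarity = SequenceMatcher(
        None, original.lower(), back_translation.lower()
    ).ratio()
    return similarity >= min_similarity

# (original text, text obtained by translating the Mongolian version back)
samples = [
    ("What are the symptoms of a cold?", "What are the symptoms of the cold?"),
    ("What are the symptoms of a cold?", "The weather is nice today."),
    ("Hello", "Hello"),
]
kept = [keep_sample(original, back) for original, back in samples]
```

A production pipeline would replace the string ratio with an embedding-based semantic similarity, and samples that survive this filter would still go to the manual screening stage.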
[0057] (3) The large language model is fine-tuned in a supervised manner on the selected Mongolian dialogue data by optimizing an objective function. Because the model supports Chinese and English as well as Mongolian, Chinese and English dialogue data are also included in the fine-tuning. The optimization objective function $L_{SFT}(\theta)$ can be represented as:
$$L_{SFT}(\theta) = \alpha\, L_{mn}(\theta; D) + \beta\, L_{zh}(\theta; D) + \gamma\, L_{en}(\theta; D)$$
[0058] where $\theta$ represents the model parameters; $D$ is the training dataset, containing dialogue data in Mongolian, Chinese, and English; $L_{mn}$, $L_{zh}$, $L_{en}$ are the loss functions for Mongolian, Chinese, and English, respectively; and $\alpha$, $\beta$ and $\gamma$ are weighting coefficients that can be adjusted according to actual needs.
[0059] S6, Special Adjustments to Mongolian Medicine Mongolian medicine, as an important component of traditional Mongolian medicine, requires models with specialized terminology comprehension, logical reasoning, and generative capabilities due to its unique theoretical system, diagnostic methods, and prescription knowledge. However, high-quality data in the field of Mongolian medicine is relatively scarce, and the specialized terminology differs significantly from standard Mongolian. Therefore, this invention employs a hierarchical fine-tuning approach to enhance the model's expertise in Mongolian medicine while ensuring its multilingual compatibility remains unaffected. The specific method is as follows:
[0060] This invention employs a two-stage supervised fine-tuning strategy, optimizing a large language model for the Mongolian medicine domain based on a dataset of 127,026 Mongolian medicine encyclopedia entries and clinical data. In the first stage, the basic large language model obtained in step S5, possessing multilingual and instruction-following capabilities, is efficiently trained using the Mongolian medicine encyclopedia data. To accommodate the short text length of the Mongolian medicine encyclopedia data, a larger batch size and a higher learning rate are set, with the input text length limited to 2048 characters to match the characteristics of the Mongolian medicine encyclopedia data. The second stage shifts to training with clinical data. To adapt to the longer text length of clinical data, the input text length limit is adjusted to 8192 characters, while the batch size and learning rate are reduced to improve training stability. Both stages utilize LoRA fine-tuning. By setting the rank (lora_rank=4) to limit the number of trainable parameters and updating only the low-rank matrix, the number of parameters is controlled to within 0.1% of the original model. Gradient accumulation and mixed precision further optimize memory usage. Combined with fp16 mixed precision calculation, memory consumption is significantly reduced while maintaining the performance of the large language model. The core principle of FP16 mixed-precision training is to convert the weights, activation values, and gradients of a large language model into 16-bit floating-point numbers (FP16) for calculation during forward and backward propagation, significantly improving computation speed and reducing memory usage. Simultaneously, to avoid precision loss due to gradient underflow, an additional 32-bit floating-point (FP32) copy of the weights is maintained for gradient updates, and loss scaling techniques are used to protect small gradient values. 
Specifically, NVIDIA's Apex library automatically manages precision conversion and gradient scaling. Validation is run every 500 training steps, with 5% of the data reserved as a validation set. The parameter selections are shown in Table 3.
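The role of loss scaling in the mixed-precision description above can be seen in a short numeric sketch. It rounds values to IEEE half precision with the standard library's `struct` module; the gradient value and the scale factor of 1024 are hypothetical, and Python's 64-bit float stands in for the FP32 master copy.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8            # a small gradient value
loss_scale = 1024.0    # hypothetical scale factor

underflowed = to_fp16(grad)           # stored directly in FP16
scaled = to_fp16(grad * loss_scale)   # scaled before the FP16 cast
recovered = scaled / loss_scale       # unscaled in higher precision
```

The unscaled gradient is flushed to zero in FP16, while the scaled one survives and is recovered to within about one percent after unscaling, which is why the update step runs on a higher-precision copy of the weights.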
[0061] Table 3 Selection of some parameters during fine-tuning
S7, Value-Oriented Reinforcement Learning Fine-Tuning To ensure that the responses generated by the Mongolian language model align with national values and avoid inappropriate content, this invention employs a value-oriented reinforcement learning fine-tuning technique. However, preference data in the Mongolian language domain is extremely scarce, almost nonexistent. Therefore, this invention devises and implements an efficient solution.
[0062] (1) A large amount of Chinese user preference data and preference data involving national policies is collected and translated into Mongolian to obtain Mongolian preference data, which provides the basic corpus for training the large language model.
[0063] (2) Heuristic rules remove fragments with fewer than 10 words from the Mongolian preference data, since excessively short fragments cause semantic ambiguity or information loss. A professional review team then examines the filtered user preference data and national policy preference data in detail to identify and exclude low-quality data that automatic screening missed, such as fragments that are semantically ambiguous, logically inconsistent, or contrary to Mongolian language habits.
[0064] (3) An improved DPO (Direct Preference Optimization) algorithm trains the large language model directly on the user preference data and national policy preference data, without an explicitly defined reward function. The improvement adds a Mongolian medicine ethics penalty term to the original DPO loss function. The optimized loss function of the improved DPO algorithm is as follows:
[0065] where $\sigma$ is the sigmoid function, applied to the capability parameters of the two objects being compared, and $P$ is the Mongolian medicine ethics verification function: if the output of the large language model violates the rules of Mongolian medicine (such as "recommending cold prescriptions to people with cold constitutions"), then P=1 (a penalty is triggered); if the output conforms to the rules, then P=0 (no penalty). The penalty coefficient takes a value of 0.2 to 0.4. By combining hard domain rules with soft human preferences, the loss guides the optimization direction of the model so that its output reflects preferences while strictly adhering to Mongolian medicine ethics.
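The loss formula itself is not reproduced in this text. The sketch below assumes the standard DPO loss on policy and reference log-probabilities, with the ethics penalty added as a separate weighted term; the additive form, the parameter names, and `beta=0.1` are assumptions, while the penalty value 0.3 lies in the stated range of 0.2 to 0.4.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss_with_penalty(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                          violates_rule, beta=0.1, penalty=0.3):
    """Standard DPO loss plus penalty * P, where P is 1 on a rule violation."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    p = 1.0 if violates_rule else 0.0
    return -math.log(sigmoid(margin)) + penalty * p

# Hypothetical log-probabilities for one preference pair.
loss_ok = dpo_loss_with_penalty(-1.0, -3.0, -2.0, -2.0, violates_rule=False)
loss_violation = dpo_loss_with_penalty(-1.0, -3.0, -2.0, -2.0, violates_rule=True)
```

The violation flag would come from a rule checker over the generated text, for example one that detects a cold-natured prescription recommended for a cold constitution; that checker is outside the scope of this sketch.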
[0066] S8. Constructing a knowledge graph for Mongolian medicine Knowledge graphs are constructed through entity recognition, relation extraction, and fusion. The specific steps include: (1) In the entity recognition and attribute extraction stage, a BERT-based named entity recognition model was adopted. BERT is a pre-trained language model based on the Transformer architecture, which captures the deep semantics of words through bidirectional context encoding. The BERT model was used to extract entities such as diseases, symptoms, and medicinal materials from Mongolian medicine professional data, and the key attributes of the entities were extracted by combining the preset feature templates. Then, the semantic relationships between entities were constructed by combining rule matching with the RoBERTa relationship classification model.
[0067] (2) To address the common entity alignment and disambiguation issues in multi-source data, a three-level matching strategy was designed. The three-level matching strategy is as follows: First, preliminary screening is performed based on the similarity of entity name strings. Second, the semantic similarity of entities is calculated: entity vectors are generated using the Sentence-BERT model, and then the cosine value between entity vectors is calculated. The closer the value is to 1, the higher the similarity. Finally, a comprehensive judgment is made based on contextual consistency.
[0068] A weighted scoring mechanism is adopted, and entities are fused based on their semantic similarity. Attribute conflicts are resolved with an authority-based weighted voting algorithm, and an expert arbitration mechanism is initiated when the vote is too close to decide (e.g., the maximum difference < 0.15), ensuring the accuracy of knowledge integration.
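The three-level matching and the weighted vote can be sketched as follows. The sketch assumes that entity vectors are already available (toy lists stand in for Sentence-BERT output), that context consistency is reduced to a matching entity type, and that "maximum difference" means the gap between the two highest normalized vote shares; all thresholds other than 0.15 are hypothetical.

```python
import math
from difflib import SequenceMatcher

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def same_entity(a, b, name_min=0.5, semantic_min=0.85):
    # Level 1: cheap pre-screen on the entity name strings.
    if SequenceMatcher(None, a["name"], b["name"]).ratio() < name_min:
        return False
    # Level 2: cosine similarity of entity vectors (closer to 1 is more similar).
    if cosine(a["vector"], b["vector"]) < semantic_min:
        return False
    # Level 3: context consistency, reduced here to "same entity type".
    return a["type"] == b["type"]

def resolve_attribute(candidates, margin=0.15):
    """candidates: (value, source authority weight). Returns (value, needs_expert)."""
    totals = {}
    for value, weight in candidates:
        totals[value] = totals.get(value, 0.0) + weight
    norm = sum(totals.values())
    ranked = sorted(((w / norm, v) for v, w in totals.items()), reverse=True)
    needs_expert = len(ranked) > 1 and ranked[0][0] - ranked[1][0] < margin
    return ranked[0][1], needs_expert

e1 = {"name": "Qingxila Decoction", "vector": [0.9, 0.1, 0.2], "type": "prescription"}
e2 = {"name": "Qingxila decoction", "vector": [0.88, 0.12, 0.21], "type": "prescription"}
e3 = {"name": "Badagan", "vector": [0.1, 0.9, 0.3], "type": "theory"}

clear_vote = resolve_attribute([("cold", 0.9), ("warm", 0.3), ("cold", 0.6)])
close_vote = resolve_attribute([("cold", 0.5), ("warm", 0.45)])
```

Ordering the checks from cheapest to most expensive means most non-matching pairs are rejected before any vector comparison is made.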
[0069] (3) Triples containing the fused entities and their semantic relationships are constructed to generate the Mongolian medicine knowledge graph. Each triple carries a confidence value, in the form (head entity, relation, tail entity, confidence).
[0070] (4) At the knowledge storage and indexing level, all fused entities and relations are persistently stored in the Neo4j graph database as triples, fully utilizing its powerful graph traversal capabilities to support complex queries. Simultaneously, to support subsequent retrieval enhancement applications, a Sentence-BERT model fine-tuned from the Mongolian medicine corpus is used to generate vector representations of the text, and an efficient vector indexing system is built based on FAISS. This system supports a hybrid retrieval mode combining keywords (BM25) and semantic vectors, significantly improving the recall and accuracy of knowledge retrieval.
[0071] S9, Incorporating RAG technology RAG technology enhances the generation capabilities of large language models by retrieving from the external knowledge graph. It addresses the model hallucination problem and improves factual accuracy. The goal is to achieve dynamic knowledge updates and precise generation in support of applications in the field of Mongolian medicine.
[0072] In this invention, RAG (Retrieval-Augmented Generation) technology addresses the limitations of traditional large language models in terms of knowledge solidification, domain adaptability, and factual accuracy by combining real-time retrieval from an external knowledge base with the generative capabilities of a language model. Specifically, RAG breaks down the question-answering task into two stages: Retrieval and Generation. In the retrieval stage, vectorization techniques are used to convert the user's input query into a semantic vector, which is then matched with document fragments in the vector database of the knowledge graph in step S8 to quickly locate the most relevant knowledge fragments. In the generation stage, the large language model combines the retrieved contextual information with the original query to generate structured or natural language output.
[0073] To balance the diversity of search results with the coherence of generated content, this invention introduces a re-ranking mechanism. It utilizes BM25 or a cross-encoder to perform a secondary ranking of the search results, ensuring that highly relevant segments are prioritized for input into the generative model. Specifically, firstly, a vector index is used to quickly recall the Top-K candidate knowledge segments semantically relevant to the query using an approximate nearest neighbor algorithm. Then, a cross-encoder performs deep interactive calculations between the query and each candidate segment to obtain a more accurate relevance score. Based on this score, the Top-K results are re-ranked in descending order, and finally, the most relevant segments are fed into the generative model. This method balances search efficiency and result accuracy.
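The recall-then-rerank flow can be sketched with toy data. Here an exhaustive cosine search stands in for the approximate nearest neighbor index, and a simple term-overlap score stands in for the cross-encoder; both substitutions, the function names, and the segment contents are assumptions made so the example runs without external libraries.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def term_overlap(query_terms, text):
    """Toy stand-in for a cross-encoder: share of query terms found in the text."""
    words = set(text.lower().split())
    return sum(term in words for term in query_terms) / len(query_terms)

def retrieve_then_rerank(query_vec, query_terms, segments, top_k=2, top_n=1):
    # Stage 1: recall the Top-K candidates by vector similarity.
    recalled = sorted(
        segments, key=lambda s: cosine(query_vec, s["vector"]), reverse=True
    )[:top_k]
    # Stage 2: rescore only the recalled candidates with the finer scorer.
    reranked = sorted(
        recalled, key=lambda s: term_overlap(query_terms, s["text"]), reverse=True
    )
    return reranked[:top_n]

segments = [
    {"text": "hei disorder treated with warming decoction", "vector": [0.9, 0.1]},
    {"text": "shira disorder presents with fever", "vector": [0.8, 0.2]},
    {"text": "badagan disorder diet advice", "vector": [0.1, 0.9]},
]
result = retrieve_then_rerank([1.0, 0.0], ["shira", "fever"], segments)
```

The first segment ranks highest on vector similarity alone, but the second stage promotes the segment that actually answers the query, which illustrates why the more expensive scorer is applied only to the short recalled list.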
[0074] Example 2: Since there is currently a lack of comparable baseline models in the field of Mongolian language model research, this invention innovatively designs a multi-dimensional evaluation scheme to construct a systematic evaluation system from the two levels of knowledge memorization and practical application.
[0075] First, we verified the model's grasp of fundamental knowledge. A closed test set containing 100 multiple-choice questions on Mongolian medicine encyclopedias was constructed to verify the model's understanding and memorization of basic domain concepts. All questions underwent manual cross-validation to ensure they were not included in the training data, avoiding data overlap that could distort evaluation results. The test content covered core knowledge points such as basic Mongolian medicine theory, medicinal material properties, and prescription compatibility, using a standardized four-choice question format. For example, the sample question in Table 5 focuses on Mongolian medicine prescription theory: "Which of the following options is NOT a component of Bai Tan-3 Tang?" Four single-herb names were provided as options, requiring the model to select the only correct answer. This part of the evaluation quantifies the model's knowledge coverage and memorization accuracy using an unbiased dataset, providing a measurable benchmark for basic question-answering capabilities in the Mongolian medicine field.
[0076] Table 5. Example of a multiple-choice test set
Secondly, the clinical application capabilities of the model were validated. Given the complexity of Mongolian medicine clinical diagnosis and treatment, 100 simulated cases of real patients were designed to examine the model's practicality in the dynamic reasoning of "symptom-cause-treatment." Each case included the patient's chief complaint, physical signs, and auxiliary examination information. The model was required to output a diagnostic conclusion, etiological analysis, and treatment plan (including Mongolian medicine prescriptions, external treatments, and dietary recommendations). In the evaluation phase, five Mongolian medicine experts with extensive clinical experience were invited to score the model on a 5-point scale across three dimensions: ① Accuracy of diagnosis, i.e., whether the model can accurately diagnose the symptoms based on Mongolian medicine theory and avoid misleading patients; ② Rationality of treatment plan, including whether the combination of Mongolian medicines conforms to traditional Mongolian medicine knowledge and whether the dosage is safe and compliant; ③ Cultural adaptability, whether it integrates traditional Mongolian medicine diagnostic and treatment elements and avoids being dominated by Western medicine thinking.
[0077] In addition to the subjective scoring described above, this invention proposes an objective metric called Key Decision Matching Rate (KDP-MR), calculated using a weighted decision tree analysis method. For each of the 100 simulated case responses, the completeness of each of five sections is scored independently, with weights of 25%, 25%, 20%, 25%, and 5%, respectively. A weighted average then gives the overall matching rate. The calculation formula is shown below:
$$\mathrm{KDP\text{-}MR} = \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{5} w_i\, s_{ij}$$
[0078] where $N$ is the total number of simulated cases (N=100 here); $j$ is the case index (j=1,2,⋯,N), used to distinguish different cases; and $i$ is the evaluation section index (i=1,2,3,4,5), corresponding to the five core dimensions of Mongolian medicine diagnosis and treatment: diagnosis results, treatment methods, treatment principles, drugs and usage methods, and nursing and prevention. $w_i$ is the weight coefficient of the i-th section: diagnosis results, treatment methods, and drugs and usage methods each carry 25%, treatment principles 20%, and nursing and prevention 5%. These weights reflect the differing importance of each dimension in Mongolian medicine clinical practice. $s_{ij}$ is the completeness score of the i-th section in the j-th case, taking a value in {0,1}, where 1 indicates that the model response includes the content of that section and 0 indicates that it does not. The inner sum first weights the five sections of a single case j to give the key decision completeness score for that case; the outer sum then takes the arithmetic mean of the case scores over all N cases. A higher KDP-MR value indicates higher overall coverage and completeness of key decision-making information in the Mongolian medicine clinical diagnosis and treatment simulation task, and a better match with clinical practice needs. The calculation of the key decision matching rate is shown in Figure 2.
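The metric can be computed directly from per-case completeness flags. The sketch below uses the five section weights stated above; the dictionary keys and the two toy cases are hypothetical.

```python
# Section weights as stated in the text (they sum to 1.0).
WEIGHTS = {
    "diagnosis": 0.25,
    "treatment_method": 0.25,
    "treatment_principle": 0.20,
    "drugs_and_usage": 0.25,
    "nursing_and_prevention": 0.05,
}

def kdp_mr(cases):
    """Mean over cases of the weighted sum of 0/1 completeness flags."""
    per_case = [
        sum(WEIGHTS[section] * case[section] for section in WEIGHTS)
        for case in cases
    ]
    return sum(per_case) / len(per_case)

complete = {section: 1 for section in WEIGHTS}
missing_nursing = dict(complete, nursing_and_prevention=0)

score_all = kdp_mr([complete, complete])
score_mixed = kdp_mr([complete, missing_nursing])
```

Because the flags are binary, the metric measures whether each section is present in a response rather than whether its content is clinically correct, which is why it is paired with expert scoring.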
[0079] Due to the lack of comparable large-scale language models for Mongolian medicine, this paper conducts comparative experiments on currently available large-scale models with some support for traditional Mongolian script, including leading models such as GPT-4 (OpenAI, 2023), Llama3 (Meta AI, 2024), and Claude3.5 (Anthropic, 2024). The test results are shown in Table 6. The model of this invention ranks first with a 75% accuracy rate on Mongolian medicine encyclopedia test questions and an 82.7% key decision matching rate, with an average expert score of 4.2.
[0080] Table 6 Comparative Experiments
These data fully demonstrate that the Mongolian medicine knowledge fine-tuning method used in this study can effectively improve the model's capabilities in areas such as understanding professional terminology, reasoning about diagnostic and treatment logic, and generating clinical protocols. In particular, its excellent performance in key decision matching rate indicates that the model can accurately capture the core elements of Mongolian medicine diagnosis and treatment.
[0081] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for constructing a large language model for Mongolian medicine combined with a knowledge graph, characterized in that it includes the following steps:
S1. Collect Mongolian data and preprocess it using heuristic rules and a binary classification model;
S2. Use the BPE algorithm to process the Mongolian data and generate a Mongolian vocabulary; use natural language processing methods to process Mongolian medicine professional data to obtain a professional corpus, thereby constructing an expanded Mongolian vocabulary table;
S3. Optimize the proportion of training data in the embedding layer of the large language model using the multilingual weight optimization objective function; use Mongolian data, combined with Chinese and English data, and LoRA to optimize and incrementally pre-train the large language model; dynamically adjust the proportion of each language through the curriculum learning method; adjust the weight coefficients of each language loss term in the loss function of the large language model through a dynamic weight adjustment strategy;
S4. Construct Chinese-Mongolian and English-Mongolian parallel corpora, and introduce an alignment learning objective function based on KL divergence to train the model to achieve semantic alignment between high-resource languages and Mongolian;
S5. Obtain Mongolian dialogue data; perform supervised fine-tuning of the large language model using the Mongolian dialogue data to improve the model's instruction comprehension and dialogue capabilities in Mongolian, Chinese and English;
S6. Adopt a two-stage supervised fine-tuning strategy, using Mongolian medicine encyclopedia knowledge and Mongolian medicine clinical data, with LoRA performing domain-specific optimization of the large language model;
S7. Obtain Mongolian preference data, and use the improved DPO algorithm, which incorporates a Mongolian medicine ethics penalty term, to fine-tune the large language model through reinforcement learning so that its output conforms to values and domain ethics;
S8. Use the BERT model to extract entities and attributes from the Mongolian medicine professional data; construct semantic relationships between entities through rule matching and the RoBERTa relationship classification model; fuse the entities and construct triples including the fused entities and their semantic relationships, generating a Mongolian medicine knowledge graph; store the graph in a graph database and construct a vector index to support retrieval;
S9. Using the knowledge graph constructed in step S8 as an external knowledge source, use RAG technology to provide factual evidence for the large language model; combined with the re-ranking mechanism, the retrieval accuracy is improved, forming a domain-specific large language model for Mongolian medicine question answering and generation.
2. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that step S1 is as follows: (1) heuristic rules are used to perform coarse screening of the Mongolian data; (2) the Mongolian data after the initial screening is screened a second time by the binary classification model; a portion of the samples is randomly selected from the initially screened Mongolian data and scored by professionals according to quality standards; based on the scoring results, the samples are divided into positive and negative samples, thus constructing labeled data for training the binary classification model; the training objective function of the binary classification model is as follows:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
where $\theta$ represents the parameters of the binary classification model, $N$ is the total number of samples, $y_i$ is the true label of sample $i$, and $p_i$ is the probability that the binary classification model predicts sample $i$ to be high-quality Mongolian data; during the training of the binary classification model, cross-validation is used to evaluate its performance, and the model is continuously optimized based on the evaluation results; after the binary classification model is trained, it further filters the Mongolian data that passed the initial screening.
3. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that, Step S2 is as follows: The preprocessed Mongolian data was pre-segmented by word segmentation based on syllable boundaries using BPE, and a Mongolian vocabulary was generated using the iterative BPE algorithm. Professional corpora were obtained from Mongolian medicine professional data by word frequency statistics or TF-IDF method, and the professional corpora were added to the Mongolian word list to obtain the Mongolian word expansion table; Feature selection for the Mongolian word expansion table is performed using TF-IDF and mutual information methods.
4. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that the objective function $L(\theta)$ for multilingual weight optimization in step S3 is:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, y_i^{mn}\log p_i^{mn} + \beta\, y_i^{zh}\log p_i^{zh} + \gamma\, y_i^{en}\log p_i^{en}\right]$$
where $\theta$ represents the model parameters and $N$ is the total number of samples; $y_i^{mn}$, $y_i^{zh}$, $y_i^{en}$ are the true labels of sample $i$ in Mongolian, Chinese, and English, respectively; $p_i^{mn}$, $p_i^{zh}$, $p_i^{en}$ are the probabilities predicted by the model for the Mongolian, Chinese, and English samples, respectively; and the weighting coefficients $\alpha$, $\beta$ and $\gamma$ can be flexibly adjusted according to actual needs.
5. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 4, characterized in that step S3 optimizes the large language model using LoRA as follows: LoRA introduces a low-rank decomposition increment matrix into the Q / K / V matrices of the Attention layer and into the FFN layer of the large language model, updating the original weight matrix $W_0 \in \mathbb{R}^{d \times k}$:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
where $h$ is the output computed with the updated weights, $W_0$ is the original weight matrix, $d$ is the hidden layer dimension, $r$ is the rank, and $k$ is the input dimension; $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices (rank $r \ll \min(d, k)$), and $x$ is the input.
6. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 4, characterized in that the dynamic weight adjustment strategy in step S3 is as follows: where $\alpha_t$, $\beta_t$, $\gamma_t$ are the weight coefficients of Mongolian, Chinese, and English in the $t$-th training phase; $\alpha_0$, $\beta_0$, $\gamma_0$ are the initial weight coefficients; and $\lambda_\alpha$, $\lambda_\beta$, $\lambda_\gamma$ are decay factors that control the rate at which the weights decay during training; the dynamic weight adjustment strategy adjusts the weight coefficients $\alpha_t$, $\beta_t$, $\gamma_t$ of each language loss term at the level of the loss function of the large language model, so that the weighting responds in real time to the model's mastery of each language at different training stages.
7. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that step S4 is as follows:
(1) Construct Chinese-Mongolian and English-Mongolian parallel data, obtained from translated texts, bilingual news reports, and multilingual documents;
(2) Introduce the alignment learning objective function $L$ into the large language model:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, L_{mn}^{(i)} + \beta\, L_{zh}^{(i)} + \gamma\, L_{KL}^{(i)}\right]$$
where $\theta$ denotes the model parameters; $N$ is the total number of samples; $L_{KL}$ is a loss function measuring the difference between the probability distributions of Chinese and Mongolian; $L_{mn}$ and $L_{zh}$ are the loss functions for Mongolian and Chinese, respectively; KL denotes the Kullback-Leibler divergence, used to measure the semantic alignment between the high-resource language and Mongolian; and the weight coefficients $\alpha$, $\beta$, and $\gamma$ can be flexibly adjusted according to actual needs;
(3) Input the parallel data into the large language model for pre-training, so that the large language model learns the semantic alignment relationships between the languages in depth.
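A minimal per-sample sketch of an alignment objective of this kind, assuming the KL term is taken from the Chinese distribution to the Mongolian one. The direction of the divergence, the function names, and the toy distributions are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def alignment_loss(loss_mn, loss_zh, p_zh, p_mn, alpha, beta, gamma):
    """Weighted sum of the two monolingual losses and the KL alignment term."""
    return alpha * loss_mn + beta * loss_zh + gamma * kl_divergence(p_zh, p_mn)

# Identical distributions give zero alignment penalty.
aligned = kl_divergence([0.5, 0.5], [0.5, 0.5])
# A peaked Chinese distribution against a uniform Mongolian one costs log(2).
total = alignment_loss(1.0, 2.0, [1.0, 0.0], [0.5, 0.5],
                       alpha=1.0, beta=0.5, gamma=2.0)
```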
8. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that step S5 is as follows:
(1) Collect Chinese and English dialogue data and translate them into Mongolian to obtain Mongolian dialogue data;
(2) Preprocess the Mongolian dialogue data;
(3) Using the selected Mongolian dialogue data, perform supervised fine-tuning of the large language model by optimizing the objective function:
$$L(\theta) = \alpha\, L_{mn}(\theta; D) + \beta\, L_{zh}(\theta; D) + \gamma\, L_{en}(\theta; D)$$
where $\theta$ denotes the model parameters; $D$ is the training dataset, containing dialogue data in Mongolian, Chinese, and English; $L_{mn}$, $L_{zh}$, $L_{en}$ are the loss functions for Mongolian, Chinese, and English, respectively; and $\alpha$, $\beta$, and $\gamma$ are weighting coefficients, which can be adjusted according to actual needs.
9. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that step S7 is as follows:
(1) Collect Chinese user preference data and preference data involving national policies, and translate both into Mongolian to obtain Mongolian preference data;
(2) Preprocess the Mongolian preference data using heuristic rules;
(3) Train the large language model directly on the Mongolian preference data using the improved DPO algorithm, which adds a Mongolian medicine ethics penalty term to the original DPO loss function, giving the optimized loss function:
$$L = -\log \sigma\left(r_w - r_l\right) + \lambda P$$
where $\sigma$ is the sigmoid function; $r_w$ and $r_l$ are the capability parameters of the two compared objects; $P$ is the Mongolian medicine ethics verification function: if the output of the large language model violates the Mongolian medicine rules, then $P = 1$, triggering a penalty, and if the rules are met, then $P = 0$ and there is no penalty; and $\lambda$ is the penalty coefficient.
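A minimal sketch of a DPO-style loss with an additive ethics penalty, assuming the standard DPO margin built from policy and reference log-probabilities of the preferred and rejected responses. The argument names, the `dpo_beta` temperature, and the boolean `violates` flag (standing in for the ethics verification function) are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_with_ethics_penalty(logp_w, logp_l, ref_logp_w, ref_logp_l,
                            violates, dpo_beta=0.1, lam=1.0):
    """DPO loss for one preference pair plus lam * P, with P in {0, 1}."""
    # Implicit reward margin between preferred (w) and rejected (l) responses.
    margin = dpo_beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    penalty = 1.0 if violates else 0.0
    return -log_sigmoid(margin) + lam * penalty

neutral = dpo_with_ethics_penalty(-1.0, -1.0, -1.0, -1.0, violates=False)
penalised = dpo_with_ethics_penalty(-1.0, -1.0, -1.0, -1.0, violates=True, lam=2.0)
better = dpo_with_ethics_penalty(-0.5, -3.0, -1.0, -1.0, violates=False)
```

Because the penalty is additive, a rule-violating output raises the loss by `lam` regardless of how large the preference margin is.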
10. The method for constructing a large language model of Mongolian medicine combined with knowledge graphs according to claim 1, characterized in that step S9 is as follows: RAG divides the question-answering task into a retrieval phase and a generation phase. In the retrieval phase, the user's input query is converted into a semantic vector using vectorization technology and matched against the document fragments in the vector database of the knowledge graph from step S8 to quickly locate the most relevant knowledge fragments: using a vector index, the Top-K candidate knowledge fragments semantically related to the query are retrieved by an approximate nearest neighbor algorithm; a cross encoder then performs deep interaction calculations between the query and each candidate fragment to obtain a more accurate relevance score, and the Top-K results are sorted in descending order of this score. In the generation phase, the sorted relevant fragments are fed into the generative model, and the large language model combines the retrieved contextual information with the original query to generate structured or natural-language output.
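The retrieve, rerank, and generate flow of this claim can be sketched as follows. This is a toy illustration under stated assumptions: exhaustive cosine similarity stands in for the approximate nearest neighbor index, a token-overlap score stands in for the cross encoder, and the fragments, vectors, and function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, index, k):
    """Return the k fragments whose vectors are closest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def overlap_score(query, fragment):
    """Placeholder relevance score; a real system would call a cross encoder."""
    q, f = set(query.lower().split()), set(fragment.lower().split())
    return len(q & f) / len(q) if q else 0.0

def rerank(query, fragments, score_fn):
    """Sort candidate fragments by relevance score, highest first."""
    return sorted(fragments, key=lambda frag: score_fn(query, frag), reverse=True)

def build_prompt(query, fragments):
    """Combine the reranked context with the original query for generation."""
    context = "\n".join(f"- {frag}" for frag in fragments)
    return f"Context:\n{context}\n\nQuestion: {query}"

index = [
    ("herb dosage guidance", [1.0, 0.0]),
    ("pulse diagnosis notes", [0.9, 0.1]),
    ("unrelated text", [0.0, 1.0]),
]
query = "pulse diagnosis"
candidates = retrieve([1.0, 0.0], index, k=2)
ordered = rerank(query, candidates, overlap_score)
prompt = build_prompt(query, ordered)
```

The two-stage design keeps the expensive pairwise scoring limited to the K candidates returned by the cheap vector search.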